In this article, we propose a data leakage prevention (DLP) system that applies classic text classification with machine learning to classify documents and prevent data leakage. The system uses the term frequency-inverse document frequency (TF-IDF) weighting method for feature extraction and an improved gradient boosting classification algorithm (IGBCA) for model training and tuning.
To handle documents of varying length, we normalize term frequencies by dividing each term count by the length of its document. This step helps avoid bias towards frequently occurring terms and ensures that documents are treated comparably regardless of size. In addition, we up-weight the least frequent terms and down-weight the most frequent ones to keep the model balanced.
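The weighting described above can be illustrated with a minimal sketch. It assumes the textbook formulation, with term frequency equal to the term count divided by document length and inverse document frequency equal to log(N / df), which may differ from the exact variant used in the system. The function name `tf_idf` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute TF-IDF weights for a list of tokenized documents.

    Assumed (textbook) formulation, not necessarily the paper's exact one:
      tf  = term count / document length   (length normalization)
      idf = log(N / df)                    (rare terms up, common terms down)
    Returns one {term: weight} dict per document.
    """
    n_docs = len(documents)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        length = len(tokens)
        doc_weights = {}
        for term, count in counts.items():
            tf = count / length
            idf = math.log(n_docs / df[term])
            doc_weights[term] = tf * idf
        weights.append(doc_weights)
    return weights


# Hypothetical toy corpus, already tokenized.
docs = [
    ["salary", "report", "confidential", "salary"],
    ["lunch", "menu", "report"],
    ["confidential", "merger", "report", "draft", "report", "report"],
]
weights = tf_idf(docs)
```

In this toy corpus, "report" appears in every document, so its inverse document frequency is log(1) = 0 and it carries no weight, while "salary" appears in only one document and receives the largest weight in the first document.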
In testing, the proposed DLP system achieves high scores on metrics such as sensitivity, specificity, F1-score, and precision, indicating that it classifies documents accurately. However, we acknowledge that classification speed and other system overheads are important factors that remain to be addressed in future enhancements.
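For reference, the four metrics named above follow from the counts of a binary confusion matrix. The sketch below uses the standard definitions and assumes that the label 1 marks a sensitive document; the function name `classification_metrics` and the sample labels are hypothetical and are not results from the system.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute sensitivity, specificity, precision, and F1 for binary labels.

    `positive` is the label treated as the sensitive class (an assumption).
    """
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1

    def ratio(numerator, denominator):
        # Return 0.0 instead of raising when a class is absent.
        return numerator / denominator if denominator else 0.0

    sensitivity = ratio(tp, tp + fn)  # recall on sensitive documents
    specificity = ratio(tn, tn + fp)  # recall on non-sensitive documents
    precision = ratio(tp, tp + fp)
    f1 = ratio(2 * precision * sensitivity, precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```

In a DLP setting, sensitivity reflects how many sensitive documents are caught, while specificity reflects how rarely harmless documents are blocked, so the two are usually reported together.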
With this DLP system, organizations can prevent data leakage by classifying documents according to their content. The system’s ability to handle varying document lengths and to balance term frequencies supports accurate classification, making it a valuable tool for keeping sensitive information secure.
Computer Science, Machine Learning