Data Prevention via Classification Models: A Comprehensive Review

In this article, we propose a data leakage prevention (DLP) system that applies classic text classification with machine learning algorithms to classify documents and prevent data leakage. The system uses the term frequency-inverse document frequency (TF-IDF) weighting method for feature extraction and an improved gradient boosting classification algorithm (IGBCA) for model training and tuning.
To handle varied document lengths, we normalize the term frequencies by dividing them by the length of each document. This step avoids bias towards longer documents, whose raw term counts would otherwise be inflated, and ensures that all documents are treated equally. Additionally, the inverse document frequency weighting scales up the least frequent terms and downweights the most frequent ones to keep the model balanced.
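The weighting described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `tf_idf`, the whitespace tokenizer, and the unsmoothed `log(N / df)` form of the inverse document frequency are assumptions chosen for clarity.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    # Assumption: lowercase whitespace tokenization stands in for real preprocessing.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        length = len(tokens)
        weights.append({
            # Length-normalized term frequency times inverse document frequency.
            term: (count / length) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    "confidential salary report",
    "public report",
    "public report public report",
]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

In this toy corpus, "report" occurs in every document and so receives a weight of zero, while the second and third documents end up with the same weight for "public" even though one is twice as long as the other. That is the effect of dividing by document length.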
The proposed DLP system achieves high scores on metrics such as sensitivity, specificity, F1-score, and precision during testing, indicating that it classifies documents accurately. However, we acknowledge that classification speed and other system overheads are important factors that still need to be addressed in future enhancements.
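For reference, the four metrics named above can all be derived from the counts of a binary confusion matrix. The sketch below uses the standard definitions. The function name `classification_metrics` and the toy labels are made up for illustration and are not results from the paper.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute sensitivity, specificity, precision and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    def ratio(num, den):
        # Return 0.0 instead of raising when a class is absent.
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)   # recall on sensitive documents
    specificity = ratio(tn, tn + fp)   # recall on non-sensitive documents
    precision = ratio(tp, tp + fp)
    f1 = ratio(2 * precision * sensitivity, precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Illustrative labels only: 1 = sensitive document, 0 = non-sensitive.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
print(classification_metrics(y_true, y_pred))
```

In a DLP setting, sensitivity measures how many sensitive documents are caught, while specificity measures how many harmless documents pass without being flagged, so the two are worth reporting side by side.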
By using this DLP system, organizations can prevent data leakage by classifying documents based on their content. Because the system handles varying document lengths and balances term frequencies, it classifies documents accurately, making it a valuable tool for keeping sensitive information secure.

arXiv:2312.13711, authored by Kishu Gupta and Ashwani Kush.

LLama 2 7B Chat
