In this article, we explore pre-training data processing for domain adaptation in natural language processing (NLP). We discuss the importance of enhancing data quality and reducing the time needed to process terabyte-scale data. Our data processing pipeline consists of four modules: normalization, heuristic cleaning, multi-level deduplication, and toxicity filtering. Together, these modules significantly reduce processing time while maintaining high data quality.
Normalization
In this module, we transform all raw data into JSON format with keys such as data type, source, identifier, and content. We also check for missing line breaks to ensure consistency in formatting.
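Below is a minimal sketch of this normalization step, assuming a simple per-document conversion; the exact key spellings ("type", "source", "id", "content"), the hash-based identifier, and the line-break heuristic are illustrative assumptions rather than the pipeline's actual implementation.

```python
import hashlib
import json
import re


def normalize_record(raw_text: str, data_type: str, source: str) -> str:
    """Convert one raw document into the unified JSON format described above."""
    # Repair missing line breaks: re-insert a newline after sentence-ending
    # punctuation that is immediately followed by an uppercase letter
    # (a simple heuristic, assumed here for illustration).
    text = re.sub(r"([.!?])(?=[A-Z])", r"\1\n", raw_text)

    record = {
        "type": data_type,                                    # e.g. "web", "book"
        "source": source,                                     # original corpus name
        "id": hashlib.md5(text.encode("utf-8")).hexdigest(),  # stable identifier
        "content": text,
    }
    return json.dumps(record, ensure_ascii=False)


# Example usage
print(normalize_record("First sentence.Second sentence.", "web", "common_crawl"))
```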
Heuristic Cleaning
Our heuristic multi-level cleaning strategy focuses on semantic issues such as garbled characters, logical confusion, and low-quality lines. At the chapter and line levels we address these issues directly; at the word level we eliminate advertising trigger words; and at the character level we scrutinize redundant and missing characters.
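The sketch below illustrates a few line-, word-, and character-level rules of this kind; the specific trigger words, patterns, and thresholds are assumed examples, and the real pipeline contains far more rules.

```python
import re

# Illustrative rule sets only; the production pipeline uses many more rules.
AD_TRIGGER_WORDS = {"click here", "buy now", "limited offer"}          # word level (assumed examples)
GARBLED_PATTERN = re.compile(r"[\ufffd\x00-\x08\x0b\x0c\x0e-\x1f]")    # mojibake / control characters


def clean_line(line: str) -> str | None:
    """Return a cleaned line, or None if the line should be dropped."""
    # Line level: drop garbled or obviously low-quality lines.
    if GARBLED_PATTERN.search(line):
        return None
    if len(line.strip()) < 3:  # too short to carry meaning (assumed threshold)
        return None
    # Word level: drop lines containing advertising trigger words.
    lowered = line.lower()
    if any(word in lowered for word in AD_TRIGGER_WORDS):
        return None
    # Character level: collapse redundant repeated punctuation (e.g. "!!!!" -> "!").
    return re.sub(r"([!?.,])\1{2,}", r"\1", line)


def clean_document(text: str) -> str:
    kept = [cleaned for ln in text.splitlines() if (cleaned := clean_line(ln)) is not None]
    return "\n".join(kept)
```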
Multi-level Deduplication
We employ a multi-level deduplication approach to remove duplicate data. This module reduces data volume by 90% while preserving valuable information.
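The article describes the approach only as "multi-level", so the following sketch assumes a simplified two-level scheme based on exact hashing at the document and line levels; near-duplicate methods such as MinHash are a common extension but are not shown here.

```python
import hashlib


def _digest(text: str) -> str:
    return hashlib.sha256(text.strip().encode("utf-8")).hexdigest()


def deduplicate(documents: list[str]) -> list[str]:
    """Two-level exact deduplication: whole documents first, then individual lines."""
    seen_docs: set[str] = set()
    seen_lines: set[str] = set()
    result = []
    for doc in documents:
        doc_key = _digest(doc)
        if doc_key in seen_docs:        # document level: skip exact copies
            continue
        seen_docs.add(doc_key)
        kept_lines = []
        for line in doc.splitlines():
            line_key = _digest(line)
            if line_key in seen_lines:  # line level: skip lines already kept elsewhere
                continue
            seen_lines.add(line_key)
            kept_lines.append(line)
        if kept_lines:
            result.append("\n".join(kept_lines))
    return result
```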
Toxicity Filtering
In this module, we use a detector model to identify and filter out offensive or inappropriate content. We also devise over a thousand heuristic cleaning rules to tackle formatting, content, and encoding issues across dozens of data categories.
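A minimal filtering sketch follows. The `toxicity_score` function is a placeholder standing in for whatever detector model the pipeline actually uses, and the blocklist and 0.5 threshold are illustrative assumptions, not values from the article.

```python
# Placeholder rule-based list; real deployments would use a trained classifier.
BLOCKED_TERMS = {"offensive_term_1", "offensive_term_2"}


def toxicity_score(text: str) -> float:
    """Stand-in for a model-based toxicity classifier returning a score in [0, 1]."""
    hits = sum(term in text.lower() for term in BLOCKED_TERMS)
    return min(1.0, hits / 3)


def keep_document(text: str, threshold: float = 0.5) -> bool:
    """Keep a document only if its toxicity score falls below the threshold."""
    return toxicity_score(text) < threshold


documents = ["a clean paragraph about the weather", "text with offensive_term_1 inside"]
filtered = [doc for doc in documents if keep_document(doc)]
```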
Conclusion
By enhancing data quality through pre-training data processing, we can improve the performance of NLP models in domain adaptation. Our comprehensive pipeline reduces processing time while maintaining high-quality data.