Computation and Language, Computer Science

Enhancing Data Quality and Efficiency in Natural Language Processing with Comprehensive Pipeline

In this article, we explore pre-training data processing for domain adaptation in natural language processing (NLP). We discuss the importance of enhancing data quality and reducing the response time for processing terabyte-scale data. Our comprehensive data processing pipeline consists of four modules: normalization, heuristic cleaning, multi-level deduplication, and toxicity filtering. Together, these modules significantly reduce processing time while maintaining high data quality.

Normalization

In this module, we transform all raw data into a JSON format with keys for data type, source, identifier, and content. We also check for missing line breaks to ensure consistent formatting.
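The normalization step can be sketched as follows. The key names and the line-break repair shown here are illustrative assumptions; the article specifies only that records carry a data type, source, identifier, and content.

```python
import json

def normalize_record(raw_text, data_type, source, identifier):
    """Wrap a raw document in a uniform JSON record (hypothetical schema)."""
    # Normalize Windows line endings and ensure the document ends with a
    # line break, a common extraction artifact the article mentions checking.
    content = raw_text.replace("\r\n", "\n")
    if not content.endswith("\n"):
        content += "\n"
    return json.dumps(
        {"type": data_type, "source": source, "id": identifier, "content": content},
        ensure_ascii=False,
    )

record = normalize_record("Hello world.", "web", "example.com", "doc-001")
```

Storing every record under one schema lets downstream modules process heterogeneous sources uniformly.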

Heuristic Cleaning

Our heuristic multi-level cleaning strategy targets semantic issues such as garbled characters, logical confusion, and low-quality lines. At the chapter and line levels, we address these structural and quality issues; at the word level, we eliminate advertising trigger words; and at the character level, we scrutinize redundant and missing characters.
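A minimal sketch of the line-, word-, and character-level rules might look like this. The specific trigger words, length threshold, and punctuation rule are hypothetical examples, not the article's actual rule set.

```python
import re

# Hypothetical advertising trigger words (word-level rule).
AD_TRIGGER_WORDS = {"click here", "buy now", "subscribe"}

def clean_lines(text):
    """Apply illustrative line-, word-, and character-level cleaning rules."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        # Line-level: discard empty or very short fragments (likely noise).
        if len(stripped) < 3:
            continue
        # Word-level: drop lines containing advertising trigger words.
        lowered = stripped.lower()
        if any(w in lowered for w in AD_TRIGGER_WORDS):
            continue
        # Character-level: collapse runs of repeated punctuation.
        stripped = re.sub(r"([!?.])\1{2,}", r"\1", stripped)
        kept.append(stripped)
    return "\n".join(kept)
```

In practice such rules are tuned per data category, since web pages, books, and code each fail in different ways.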

Multi-level Deduplication

We employ a multi-level deduplication approach to remove duplicate data. This module reduces data volume by 90% while preserving valuable information.
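One way to implement two of these levels is exact hashing at the document and line levels, sketched below. This is an assumption about the mechanism; a full multi-level scheme would typically add fuzzy matching (e.g., MinHash) for near-duplicates, which the article does not detail.

```python
import hashlib

def dedup(documents):
    """Exact deduplication at two levels: whole documents, then lines."""
    seen_docs, seen_lines = set(), set()
    unique = []
    for doc in documents:
        doc_hash = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if doc_hash in seen_docs:
            continue  # Document-level duplicate: skip entirely.
        seen_docs.add(doc_hash)
        kept_lines = []
        for line in doc.splitlines():
            line_hash = hashlib.sha256(line.encode("utf-8")).hexdigest()
            if line_hash in seen_lines:
                continue  # Line-level duplicate (e.g., repeated boilerplate).
            seen_lines.add(line_hash)
            kept_lines.append(line)
        unique.append("\n".join(kept_lines))
    return unique
```

Hashing keeps memory proportional to the number of unique items rather than the raw text, which matters at terabyte scale.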

Toxicity Filtering

In this module, we use a detection model to identify and filter out offensive or inappropriate content. In addition, we devise over a thousand heuristic cleaning rules to tackle issues in formats, contents, and encoding across dozens of data categories.
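A keyword-ratio filter gives the flavor of the heuristic side of this module. The blocklist, threshold, and scoring rule below are hypothetical stand-ins; the article's actual approach pairs a trained detection model with its rule set.

```python
# Hypothetical blocklist; a real pipeline would use a trained classifier.
BLOCKLIST = {"offensive_term_a", "offensive_term_b"}

def is_toxic(text, threshold=0.01):
    """Flag a document when blocklisted tokens exceed a small fraction."""
    tokens = text.lower().split()
    if not tokens:
        return False
    hits = sum(1 for t in tokens if t in BLOCKLIST)
    return hits / len(tokens) > threshold

corpus = ["a clean sentence", "offensive_term_a again offensive_term_a here"]
filtered = [doc for doc in corpus if not is_toxic(doc)]
```

Using a ratio rather than a single hit avoids discarding long documents that merely quote a flagged term once.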

Conclusion

By enhancing data quality through pre-training data processing, we can improve the performance of NLP models in domain adaptation. Our comprehensive pipeline reduces response times while maintaining high-quality data.