In this article, we explore pre-training data processing for domain adaptation in natural language processing (NLP). We discuss the importance of enhancing data quality and reducing the time needed to process terabyte-scale data. Our data processing pipeline consists of four modules: normalization, heuristic cleaning, multi-level deduplication, and toxicity filtering. Together, these modules significantly reduce processing time while maintaining high data quality.
Normalization
In this module, we transform all raw data into JSON format with keys such as data type, source, identifier, and content. We also check for missing line breaks to ensure consistency in formatting.
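Below is a minimal sketch of this normalization step, assuming a simple per-document conversion; the exact key spellings ("type", "source", "id", "content"), the hash-based identifier, and the line-break heuristic are illustrative assumptions rather than the pipeline's actual implementation.

```python
import hashlib
import json
import re


def normalize_record(raw_text: str, data_type: str, source: str) -> str:
    """Convert one raw document into the unified JSON format described above."""
    # Repair missing line breaks: re-insert a newline after sentence-ending
    # punctuation that is immediately followed by an uppercase letter
    # (a simple heuristic, assumed here for illustration).
    text = re.sub(r"([.!?])(?=[A-Z])", r"\1\n", raw_text)

    record = {
        "type": data_type,                                    # e.g. "web", "book"
        "source": source,                                     # original corpus name
        "id": hashlib.md5(text.encode("utf-8")).hexdigest(),  # stable identifier
        "content": text,
    }
    return json.dumps(record, ensure_ascii=False)


# Example usage
print(normalize_record("First sentence.Second sentence.", "web", "common_crawl"))
```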
Heuristic Cleaning
Our heuristic multi-level cleaning strategy focuses on semantic issues such as garbled characters, logical confusion, and low-quality lines. At the chapter and line levels we address these issues directly; at the word level we eliminate advertising trigger words; and at the character level we scrutinize redundant and missing characters.
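The sketch below illustrates a few line-, word-, and character-level rules of this kind; the specific trigger words, patterns, and thresholds are assumed examples, and the real pipeline contains far more rules.

```python
import re

# Illustrative rule sets only; the production pipeline uses many more rules.
AD_TRIGGER_WORDS = {"click here", "buy now", "limited offer"}          # word level (assumed examples)
GARBLED_PATTERN = re.compile(r"[\ufffd\x00-\x08\x0b\x0c\x0e-\x1f]")    # mojibake / control characters


def clean_line(line: str) -> str | None:
    """Return a cleaned line, or None if the line should be dropped."""
    # Line level: drop garbled or obviously low-quality lines.
    if GARBLED_PATTERN.search(line):
        return None
    if len(line.strip()) < 3:  # too short to carry meaning (assumed threshold)
        return None
    # Word level: drop lines containing advertising trigger words.
    lowered = line.lower()
    if any(word in lowered for word in AD_TRIGGER_WORDS):
        return None
    # Character level: collapse redundant repeated punctuation (e.g. "!!!!" -> "!").
    return re.sub(r"([!?.,])\1{2,}", r"\1", line)


def clean_document(text: str) -> str:
    kept = [cleaned for ln in text.splitlines() if (cleaned := clean_line(ln)) is not None]
    return "\n".join(kept)
```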
Multi-level Deduplication
We employ a multi-level deduplication approach to remove duplicate data. This module reduces data volume by 90% while preserving valuable information.
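The article describes the approach only as "multi-level", so the following sketch assumes a simplified two-level scheme based on exact hashing at the document and line levels; near-duplicate methods such as MinHash are a common extension but are not shown here.

```python
import hashlib


def _digest(text: str) -> str:
    return hashlib.sha256(text.strip().encode("utf-8")).hexdigest()


def deduplicate(documents: list[str]) -> list[str]:
    """Two-level exact deduplication: whole documents first, then individual lines."""
    seen_docs: set[str] = set()
    seen_lines: set[str] = set()
    result = []
    for doc in documents:
        doc_key = _digest(doc)
        if doc_key in seen_docs:        # document level: skip exact copies
            continue
        seen_docs.add(doc_key)
        kept_lines = []
        for line in doc.splitlines():
            line_key = _digest(line)
            if line_key in seen_lines:  # line level: skip lines already kept elsewhere
                continue
            seen_lines.add(line_key)
            kept_lines.append(line)
        if kept_lines:
            result.append("\n".join(kept_lines))
    return result
```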
Toxicity Filtering
In this module, we use a detector model to identify and filter out offensive or inappropriate content. We also devise over a thousand heuristic cleaning rules to tackle formatting, content, and encoding issues across dozens of data categories.
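A minimal filtering sketch follows. The `toxicity_score` function is a placeholder standing in for whatever detector model the pipeline actually uses, and the blocklist and 0.5 threshold are illustrative assumptions, not values from the article.

```python
# Placeholder rule-based list; real deployments would use a trained classifier.
BLOCKED_TERMS = {"offensive_term_1", "offensive_term_2"}


def toxicity_score(text: str) -> float:
    """Stand-in for a model-based toxicity classifier returning a score in [0, 1]."""
    hits = sum(term in text.lower() for term in BLOCKED_TERMS)
    return min(1.0, hits / 3)


def keep_document(text: str, threshold: float = 0.5) -> bool:
    """Keep a document only if its toxicity score falls below the threshold."""
    return toxicity_score(text) < threshold


documents = ["a clean paragraph about the weather", "text with offensive_term_1 inside"]
filtered = [doc for doc in documents if keep_document(doc)]
```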
Conclusion
By enhancing data quality through pre-training data processing, we can improve the performance of NLP models in domain adaptation. Our comprehensive pipeline reduces processing time while maintaining high-quality data.