Computation and Language, Computer Science

Automating Document Classification with Summarization-based Data Augmentation

Document classification is challenging for machine learning models because training data is scarce. Existing methods often rely on costly summarization techniques, which can be time-consuming and may not give accurate results. In this study, we propose SUMMaug, a simple yet effective summarization-based data augmentation method that generates pseudo-abstractive training examples for document classification.
Our approach is inspired by how humans learn to comprehend lengthy text: readers start with shorter texts and gradually move on to longer, more difficult ones. Similarly, our method generates easy-to-learn examples by summarizing the original training data. These abstractive examples are then used for curriculum learning, in which the model is fine-tuned on the generated examples to improve its ability to comprehend lengthy text.
We use an off-the-shelf summarization model that can handle diverse topics and adapt it to our task. Our experiments on two datasets show that SUMMaug outperforms existing baseline methods in terms of robustness and accuracy. We release our code and data at https://github.com/etsurin/summaug.
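The pipeline described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the released implementation: `lead_sentences` is a hypothetical stand-in that simply keeps the first sentences of a document (an extractive shortcut, whereas the method relies on an off-the-shelf abstractive model), and the two-stage ordering (summaries first, original documents second) is one assumed reading of the curriculum. All function names are invented for this sketch.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (text, label)


def lead_sentences(text: str, k: int = 2) -> str:
    """Hypothetical placeholder summarizer: keep the first k sentences.

    A real setup would call an abstractive summarization model here.
    """
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    if not sentences:
        return text
    return ". ".join(sentences[:k]) + "."


def summarize_examples(
    examples: List[Example], summarize: Callable[[str], str]
) -> List[Example]:
    """Build pseudo examples: summarized input paired with the original label."""
    return [(summarize(text), label) for text, label in examples]


def curriculum_stages(
    examples: List[Example], summarize: Callable[[str], str]
) -> List[Tuple[str, List[Example]]]:
    """Return training stages ordered from easy (short) to hard (full length).

    The summaries-then-originals order is an assumption of this sketch.
    """
    easy = summarize_examples(examples, summarize)
    return [("summaries", easy), ("originals", list(examples))]


if __name__ == "__main__":
    train = [("Great plot. Strong cast. Slow middle. Good ending.", "positive")]
    for name, stage in curriculum_stages(train, lead_sentences):
        # A fine-tuning loop would consume each stage in this order.
        print(name, stage)
```

In practice, `lead_sentences` would be swapped for a call to a real summarization model, and each stage would be handed to a fine-tuning loop in order.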

Key points

  • SUMMaug generates pseudo-abstractive training examples by summarizing the original training data.
  • The resulting examples are easy to learn and help the model develop its ability to comprehend lengthy text.
  • Curriculum learning is used to fine-tune the model on the generated examples, which improves its accuracy.
  • Existing methods often use costly summarization techniques that may not provide accurate results.
  • SUMMaug is a simple yet effective method for data augmentation in document classification.