Large language models (LLMs) have recently opened up new possibilities for abstractive summarization. Researchers have investigated how well these models summarize and compared their output to human-written summaries. The results show that LLM-generated summaries are highly grammatical, fluent, and relevant, despite scoring lower on automatic metrics such as ROUGE and BERTScore. On the XSum dataset, human evaluators rated GPT-3.5 summaries almost as highly as re-annotated human-written summaries, and much more highly than the original ground-truth references.
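For context on the automatic metrics mentioned above, here is a minimal sketch of scoring a generated summary with ROUGE using Google's `rouge-score` package. The example strings are invented for illustration; this is not the evaluation setup from the study.

```python
# Minimal ROUGE scoring sketch (pip install rouge-score).
# Example strings are invented for demonstration purposes.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "The council approved the transit plan on Monday."
generated = "On Monday the council approved a new transit plan."

# score(target, prediction) returns a dict of Score tuples
# with precision, recall, and F-measure per ROUGE variant.
scores = scorer.score(reference, generated)
for name, s in scores.items():
    print(f"{name}: P={s.precision:.3f} R={s.recall:.3f} F1={s.fmeasure:.3f}")
```

Scores like these correlate only loosely with the human judgments reported above, which is precisely the gap the study highlights.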
However, abstractive summarization differs from tasks such as question answering or extractive summarization: identifying which information a summary should include is not straightforward, since it requires understanding the full context of the input. To probe this, researchers used a proxy measure: for each bigram in a summary, they computed the relative position at which it appears in the source document. They found that every dataset except XSum and Reddit-TIFU exhibits some lead bias, meaning salient bigrams from reference summaries are more likely to appear earlier in the source.
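A minimal sketch of this positional proxy is shown below, assuming simple word-level tokenization; the function names and the normalization are illustrative choices, not the paper's actual implementation.

```python
import re

def tokens(text: str) -> list[str]:
    """Lowercase word tokens; a crude stand-in for real tokenization."""
    return re.findall(r"\w+", text.lower())

def bigrams(toks: list[str]) -> set[tuple[str, str]]:
    """Set of adjacent token pairs in a token sequence."""
    return {(toks[i], toks[i + 1]) for i in range(len(toks) - 1)}

def relative_bigram_positions(source: str, summary: str) -> list[float]:
    """For each summary bigram found in the source, return the position
    of its first occurrence, normalized to [0, 1] (0 = document start)."""
    src = tokens(source)
    n = len(src) - 1  # number of bigram start positions in the source
    positions = []
    for bg in bigrams(tokens(summary)):
        for i in range(n):
            if (src[i], src[i + 1]) == bg:
                positions.append(i / max(n - 1, 1))
                break  # record only the first occurrence
    return positions

# Usage: a mean well below 0.5 suggests lead bias, i.e. the summary's
# salient bigrams concentrate near the start of the source document.
source = ("The city council approved the new transit plan on Monday. "
          "Funding details and construction timelines were discussed later.")
summary = "The city council approved the transit plan."
pos = relative_bigram_positions(source, summary)
print(sum(pos) / len(pos) if pos else None)
```

Averaging these positions over a dataset gives a single lead-bias score per corpus, which is how a dataset-level comparison like the one above becomes possible.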
The study demonstrates that LLMs can generate high-quality abstractive summaries, but it also highlights the importance of accounting for context and lead bias when evaluating them. By using the positional proxy and correcting for lead bias, researchers can better assess how well LLMs capture salient information in abstractive summaries.
Computation and Language, Computer Science