

Benchmarking Large Language Models for News Summarization: A Comparison of Human-Generated and Model-Generated Summaries

Evaluating Large Language Models (LLMs) for Conversational Summarization: A Review
In recent years, there has been growing interest in evaluating the abilities of Large Language Models (LLMs) to summarize conversations. The task remains challenging, however, because summary quality is subjective and reliable benchmarks are difficult to build. Existing work has focused on news summarization, while conversational summarization has been comparatively overlooked. In this review, we argue for including conversational summarization in LLM benchmarks and propose a web-based interface for collecting human preferences between model-generated and reference summaries.
The authors of recent work on benchmarking LLM abilities for news article summarization highlight that existing datasets suffer from fundamental limitations and that their target summaries may not always serve as reliable ground truth. To address this, they hired freelance writers to produce high-quality target summaries and compared the models' performance against that of these expert writers. The results showed that the model-generated summaries were on par with those written by the experts, suggesting similar potential for conversational summarization.
To address the challenge of evaluating LLMs for conversational summarization, we propose a web-based interface that collects human preferences between model-generated and reference summaries. The interface displays a conversation alongside two summaries: one generated by a model and the other a reference summary. Volunteers and employees of anonymized organizations were recruited to rate the summaries, and their preferences were analyzed.
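
To make the data-collection setup concrete, below is a minimal sketch of how such pairwise comparisons might be represented and randomized before being shown to raters. The field names (`conversation_id`, `model_summary`, and so on) are illustrative assumptions, not the authors' actual schema.

```python
import random
from dataclasses import dataclass

@dataclass
class PreferenceItem:
    """One pairwise comparison shown to a rater (illustrative schema)."""
    conversation_id: str
    conversation_text: str
    model_summary: str
    reference_summary: str

def present_pair(item: PreferenceItem, rng: random.Random) -> dict:
    """Randomize left/right placement so raters cannot tell which
    summary came from the model and which is the reference."""
    summaries = [("model", item.model_summary), ("reference", item.reference_summary)]
    rng.shuffle(summaries)
    return {
        "conversation_id": item.conversation_id,
        "conversation": item.conversation_text,
        "summary_a": summaries[0][1],
        "summary_b": summaries[1][1],
        # Kept so the rater's choice can be mapped back to its source later.
        "hidden_sources": {"a": summaries[0][0], "b": summaries[1][0]},
    }

def record_preference(presented: dict, choice: str) -> str:
    """Map the rater's 'a'/'b' choice back to 'model' or 'reference'."""
    return presented["hidden_sources"][choice]
```

In a real deployment the source mapping would be kept server-side rather than in the payload shown to raters; it is stored in one dictionary here only to keep the sketch self-contained.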
The results show that the human raters preferred the model-generated summaries in most cases, indicating the models' potential for conversational summarization. In some instances, however, the preference was reversed, underscoring how subjective summary quality can be. The study demonstrates the importance of including conversational summarization in benchmarks and the need for a more comprehensive evaluation framework.
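
As a rough illustration of how collected preferences could be analyzed, the sketch below computes the fraction of comparisons won by the model and an exact two-sided binomial test against a 50/50 null. The counts and function names are hypothetical and do not reflect the study's actual numbers or analysis.

```python
from math import comb

def preference_rate(choices: list[str]) -> float:
    """Fraction of pairwise comparisons in which the model summary won."""
    wins = sum(1 for c in choices if c == "model")
    return wins / len(choices)

def binomial_two_sided_p(wins: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial test: probability of an outcome at least
    as extreme as `wins` under a null preference rate of `p`."""
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = pmf[wins]
    return min(1.0, sum(prob for prob in pmf if prob <= observed + 1e-12))

# Hypothetical example: the model wins 70 of 100 comparisons.
choices = ["model"] * 70 + ["reference"] * 30
rate = preference_rate(choices)
p_value = binomial_two_sided_p(70, 100)
print(f"model preferred in {rate:.0%} of comparisons (p = {p_value:.4f})")
```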
In conclusion, evaluating LLMs for conversational summarization remains challenging because of the subjective nature of summaries. The proposed web-based interface, however, provides a more comprehensive way of evaluating models. By collecting human preferences between model-generated and reference summaries, we can better understand the quality of generated summaries and develop more accurate benchmarks for conversational summarization.