Computation and Language, Computer Science

Enhancing News Article Datasets with Comprehensive Annotation

The authors present a dataset of news articles collected over six months, labeled and annotated to support in-depth analysis. It contains approximately 40,000 articles from various sources. Each record holds the article text (initial paragraph), publication date, news outlet, source URL, keyword category, and label. An article was included if it contained specific keywords, was published within the study period, and was freely accessible. Articles were excluded if retrieval was incomplete, if they were not in English, or if they were not accessible. The dataset is intended to support a range of studies, including bias detection, linguistic analysis, and socio-political research.
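The record layout and the inclusion and exclusion criteria described above can be sketched in Python. This is a minimal illustration, not the authors' code: the class name `Article`, the function `is_included`, and the `language` and `freely_accessible` fields are assumptions introduced here, and the keyword list and study window are left as parameters because the summary does not state them.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Sequence


@dataclass
class Article:
    """One dataset record, using the fields listed in the summary."""
    text: Optional[str]            # initial paragraph; None if retrieval was incomplete
    published: date                # publication date
    outlet: str                    # news outlet
    url: str                       # source URL
    keyword_category: str          # category of the keyword that matched
    label: Optional[str] = None    # annotation label, filled in later
    language: str = "en"           # assumed field, not listed in the summary
    freely_accessible: bool = True # assumed field, not listed in the summary


def is_included(article: Article, keywords: Sequence[str],
                start: date, end: date) -> bool:
    """Apply the exclusion criteria first, then the inclusion criteria."""
    # Exclusion: incomplete retrieval, non-English, or not accessible.
    if not article.text:
        return False
    if article.language != "en":
        return False
    if not article.freely_accessible:
        return False
    # Inclusion: published within the study period.
    if not (start <= article.published <= end):
        return False
    # Inclusion: contains at least one keyword (case-insensitive substring match).
    lowered = article.text.lower()
    return any(keyword.lower() in lowered for keyword in keywords)
```

Substring matching is the simplest possible reading of "containing specific keywords"; the actual collection may have matched keywords differently.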
To create this dataset, the authors used OpenAI’s GPT-3.5 Turbo API for initial analysis and labeling, which flagged potential biases, dimensions of discourse, and targeted groups or ideologies. The articles were then manually labeled based on several criteria, including type of bias, dimensions of discourse, and targeted groups or ideologies. The resulting labels can be used to assess the prevalence of biases in news coverage and to track changes in public opinion over time.
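The two-stage process above, a model proposal followed by manual labeling, can be sketched as a small pipeline. This is an illustration under stated assumptions: `Annotation`, `label_with_review`, and `agreement_rate` are hypothetical names, and the model call and the human reviewer are passed in as plain callables, so no real API is invoked here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Annotation:
    """The three labeling criteria named in the summary."""
    bias_type: str   # type of bias
    dimension: str   # dimension of discourse
    target: str      # targeted group or ideology


# Hypothetical interfaces: a model-backed proposer and a human reviewer.
Proposer = Callable[[str], Annotation]
Reviewer = Callable[[str, Annotation], Annotation]

LabeledRow = Tuple[str, Annotation, Annotation]


def label_with_review(texts: Iterable[str], propose: Proposer,
                      review: Reviewer) -> List[LabeledRow]:
    """Return (text, proposed, final) for each text.

    The reviewer sees the proposed annotation and may keep or replace it.
    """
    rows: List[LabeledRow] = []
    for text in texts:
        proposed = propose(text)
        final = review(text, proposed)
        rows.append((text, proposed, final))
    return rows


def agreement_rate(rows: List[LabeledRow]) -> float:
    """Fraction of rows where the reviewer kept the proposal unchanged."""
    if not rows:
        return 0.0
    agreed = sum(1 for _, proposed, final in rows if proposed == final)
    return agreed / len(rows)
```

Keeping both the proposed and the final annotation for each article makes it possible to measure how often manual labeling overrode the model, which is one way to check the initial labels.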
The authors also suggest several benchmarking strategies: combining OpenAI’s API with human verification, using labeling schemes that capture the multifaceted nature of content, and focusing on dimensions such as race, politics, and religion. With this dataset, researchers can study the complexities of contemporary political discourse and develop strategies for improving media representation and diversity.
In summary, the authors present a resource for researchers who want to analyze the dynamics of news coverage in the digital age. The combination of model-assisted and manual labeling is intended to make the dataset both nuanced and accurate, and so a useful tool for studying biases, discourse, and public opinion.