Computer Science, Social and Information Networks

Unlocking Scientific Insights: Topic Modeling for ArXiv Documents

In this paper, we propose a novel approach to analyzing large bodies of scientific literature by clustering research areas into topics using a topic model. The approach is grounded in theory, specifically the concept of capital (Bourdieu, 1980, 1986), and utilizes quantitative constructs to translate theoretical insights into practical application. We begin by introducing the context of the article, including the works of Aleta et al. (2019), Battiston et al. (2019), Tripodi et al. (2020), and Liu et al. (2023). These works provide a foundation for our approach by demonstrating the effectiveness of topic models in analyzing scientific literature.
We then present the conceptual framework underlying our approach, which involves coarse-graining the literature into just 20 topics using a topic model. We explain how we select these topics and acknowledge that the choice of the number of topics is somewhat arbitrary and affects the results. We also discuss how we discard ambiguous keywords that carry no scientific content or context, which reduces the average effective number of topics per document from 7.5 to 3.2.
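The summary does not say how the effective number of topics per document is computed. One common way to define such a quantity, assumed here purely for illustration, is the exponential of the Shannon entropy of a document's topic proportions: a document spread evenly over four topics scores 4, while a document concentrated on one topic scores 1. The function names below are hypothetical and are not taken from the paper.

```python
import math


def effective_topic_count(proportions):
    """Exponential of the Shannon entropy of a topic mixture.

    ASSUMPTION: the source does not define its "effective" measure;
    this entropy-based definition is one standard choice, not
    necessarily the one the authors used.
    """
    total = sum(proportions)
    if total <= 0:
        raise ValueError("topic proportions must have a positive sum")
    # Normalize and skip zero entries, since 0 * log(0) is taken as 0.
    probs = [p / total for p in proportions if p > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


def drop_ambiguous_keywords(keywords, ambiguous):
    """Remove keywords flagged as ambiguous (hypothetical filter step)."""
    return [k for k in keywords if k not in ambiguous]


if __name__ == "__main__":
    spread_out = [0.25, 0.25, 0.25, 0.25]
    focused = [0.97, 0.01, 0.01, 0.01]
    print(effective_topic_count(spread_out))
    print(effective_topic_count(focused))
```

Under this definition, removing keywords that pull a document toward many unrelated topics sharpens its topic mixture, which lowers the effective count. That is the direction of the change reported above.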
The main contribution of our approach is the use of the Embedded Topic Model (ETM), developed by Dieng et al. (2020), which leverages pretrained embedding representations of keywords to classify heavy-tailed vocabulary distributions more reliably. We train the model on the abstracts of 186,162 documents published between 2000 and 2019. The resulting topics are listed in Appendix A.3, Table 1, together with their most frequent keywords and the context in which they are used.
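To make the role of the embeddings concrete, here is a minimal sketch of the core ETM idea from Dieng et al. (2020): each topic is a vector in the same space as the keyword embeddings, and the topic's distribution over the vocabulary is a softmax of the inner products between that vector and every keyword embedding. The toy vocabulary, the two-dimensional embeddings, and the function names are assumptions made for illustration. A real model learns the topic vectors by variational inference, which is not shown here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def topic_word_distribution(word_embeddings, topic_embedding):
    """Word probabilities for one topic: softmax of embedding inner products."""
    scores = [
        sum(w * t for w, t in zip(vector, topic_embedding))
        for vector in word_embeddings
    ]
    return softmax(scores)


def top_keywords(vocab, word_embeddings, topic_embedding, n=3):
    """Return the n most probable keywords under a topic."""
    beta = topic_word_distribution(word_embeddings, topic_embedding)
    ranked = sorted(zip(vocab, beta), key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in ranked[:n]]


# Toy data (hypothetical): four keywords embedded in two dimensions.
VOCAB = ["network", "graph", "quark", "boson"]
EMBEDDINGS = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
NETWORK_TOPIC = [2.0, 0.0]  # a topic vector pointing along the first axis

if __name__ == "__main__":
    print(top_keywords(VOCAB, EMBEDDINGS, NETWORK_TOPIC, n=2))
```

Because a keyword's probability depends on where its embedding sits rather than only on how often it occurs, rare keywords that lie close to a topic vector can still receive meaningful probability. This is the property that makes the model suited to heavy-tailed vocabularies.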
Our approach provides several benefits over traditional methods of analyzing scientific literature. Firstly, it allows for the clustering of research areas into topics based on their content, rather than solely relying on manual annotation or keyword extraction. Secondly, it leverages theoretical insights to provide a more nuanced understanding of the relationships between different topics and research areas. Finally, it provides a practical application of topic modeling that can be adapted to any body of scientific literature, making it a versatile tool for analyzing complex bodies of text.
In conclusion, our approach represents a significant advancement in the field of natural language processing and information analysis. By leveraging the power of topic models and theoretical insights, we provide a practical solution for clustering research areas into topics that can be applied to any body of scientific literature. The use of ETM allows for more reliable classifications of heavy-tailed vocabulary distributions, providing a more comprehensive understanding of the content and context of scientific texts.