In this article, we present a novel approach to disambiguating planetary features mentioned in astronomical publications. Our pipeline combines natural language processing (NLP) and knowledge-based techniques to generate a comprehensive set of keywords for each excerpt, which are then used to disambiguate mentions of planetary features.
To begin with, we use Yake and SpaCy to extract keywords from each excerpt. These keywords are then lemmatized to consolidate different morphological forms. Identical keywords between Yake and SpaCy are merged and added to the list, while unique Wikidata terms are appended. To reach a total of 10 keywords per excerpt, additional terms are drawn alternately from the Yake and SpaCy outputs as needed.
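The merging step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `merge_keywords`, its parameters, and the reading of "unique Wikidata terms" as "Wikidata terms not already in the list" are ours, and the inputs are assumed to be lemmatized keyword lists that Yake and SpaCy have already produced.

```python
from itertools import zip_longest


def merge_keywords(yake_kw, spacy_kw, wikidata_terms, limit=10):
    """Combine lemmatized keyword lists into at most `limit` keywords.

    Hypothetical helper: the order of steps follows the prose above,
    but the details (tie-breaking, ordering) are assumptions.
    """
    merged = []

    def add(term):
        # Skip padding values, duplicates, and anything past the limit.
        if term is not None and term not in merged and len(merged) < limit:
            merged.append(term)

    # 1. Keywords that both extractors agree on, kept in Yake order.
    spacy_set = set(spacy_kw)
    for kw in yake_kw:
        if kw in spacy_set:
            add(kw)

    # 2. Wikidata terms that are not already in the list.
    for term in wikidata_terms:
        add(term)

    # 3. Fill the remaining slots by alternating between the two outputs.
    for y, s in zip_longest(yake_kw, spacy_kw):
        add(y)
        add(s)

    return merged
```

If both extractors together yield fewer than ten distinct terms, the sketch simply returns a shorter list rather than padding it.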
Next, we adapt the knowledge graph (KG) approach for disambiguating planetary feature names mentioned in astronomical publications. A KG is a structured representation of concepts and their relationships, comprising nodes, edges, and weights. In our case, nodes represent distinct entities, such as feature names, feature types, and top keywords.
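To make the node, edge, and weight structure concrete, here is a minimal sketch of such a graph in plain Python. The class name `KnowledgeGraph`, the `(kind, label)` node encoding, and the choice to accumulate weights when the same edge is added repeatedly are all assumptions made for illustration; the text does not specify how the weights are computed.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Undirected weighted graph; a node is a (kind, label) pair."""

    # The three node kinds named in the text.
    KINDS = {"feature_name", "feature_type", "keyword"}

    def __init__(self):
        self.nodes = set()
        self.adj = defaultdict(dict)  # node -> {neighbor: weight}

    def add_node(self, kind, label):
        if kind not in self.KINDS:
            raise ValueError(f"unknown node kind: {kind}")
        node = (kind, label)
        self.nodes.add(node)
        return node

    def add_edge(self, a, b, weight=1.0):
        if a not in self.nodes or b not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        # Assumption: adding the same edge again accumulates its weight.
        self.adj[a][b] = self.adj[a].get(b, 0.0) + weight
        self.adj[b][a] = self.adj[b].get(a, 0.0) + weight

    def neighbors(self, node, kind=None):
        """Return {neighbor: weight}, optionally restricted to one node kind."""
        return {
            n: w
            for n, w in self.adj.get(node, {}).items()
            if kind is None or n[0] == kind
        }
```

Encoding the kind in the node identifier keeps a feature name and a keyword with the same spelling apart, which matters when a feature name is itself a common word.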
The pipeline’s disambiguation capabilities are robust but can be improved, especially in cases where mentions are sparse or ambiguous. To tackle this challenge, we employ additional techniques, such as clustering textual excerpts that contain ambiguous terms. By clustering these excerpts, we can aggregate contextual information across multiple mentions, resulting in a more comprehensive representation of entities. Expanding the KG to include lesser-known entities will also enhance coverage and overall performance. Analyzing larger excerpt windows can further enrich the contextual information.
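One simple way to aggregate context across excerpts is to group them by keyword overlap and pool the keywords of each group. The sketch below uses a greedy single-pass clustering with Jaccard similarity; this is a stand-in chosen for brevity, not necessarily the clustering method used in the pipeline, and the function names and the `threshold` value are assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_excerpts(keyword_sets, threshold=0.3):
    """Greedily group excerpts whose keyword sets overlap.

    Each cluster records the indices of its member excerpts and the
    union of their keywords, i.e. the aggregated context.
    """
    clusters = []
    for idx, kws in enumerate(keyword_sets):
        kws = set(kws)
        best, best_sim = None, threshold
        # Pick the most similar existing cluster at or above the threshold.
        for cluster in clusters:
            sim = jaccard(kws, cluster["keywords"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"members": [idx], "keywords": set(kws)})
        else:
            best["members"].append(idx)
            best["keywords"] |= kws
    return clusters
```

The pooled keyword set of a cluster can then be used in place of a single sparse excerpt when querying the KG.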
The broad terminology extracted from Wikidata provides a benchmark for comparing the keywords present in each excerpt. The presence of these planetary terms in the excerpt signals that the context is discussing a valid planetary feature. By integrating NLP and knowledge-based techniques, our multi-stage pipeline enables comprehensive disambiguation.
Our approach outperforms traditional statistical NER models because it layers several complementary stages. The initial search, POS tagging, and astroBERT NER filter out irrelevant entities, while hybrid keyword harvesting generates descriptive keywords for each entity. The KG models semantic connections between entities and keywords, facilitating informed disambiguation using contextual clues.
At disambiguation time, the keyword set from a text excerpt is compared against the KG to find the best matching entity by semantic similarity. The graph’s rich keyword connections make it possible to resolve ambiguous references. Furthermore, a paper relevance score and an LLM analysis of the excerpt provide additional means of handling difficult cases.
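The matching step can be illustrated with a small scoring function. The text does not specify the similarity measure, so this sketch substitutes a weighted keyword overlap: each candidate entity is scored by the share of its KG keyword weight that also appears in the excerpt. The function names, the input layout (`{entity: {keyword: weight}}`), and the toy entities in the usage note are hypothetical.

```python
def score_candidates(excerpt_keywords, candidates):
    """Score each candidate entity against the excerpt keywords.

    candidates maps an entity to its KG keywords and edge weights.
    Weighted overlap is a simple stand-in for semantic similarity.
    """
    excerpt = set(excerpt_keywords)
    scores = {}
    for entity, kw_weights in candidates.items():
        total = sum(kw_weights.values())
        matched = sum(w for kw, w in kw_weights.items() if kw in excerpt)
        scores[entity] = matched / total if total else 0.0
    return scores


def disambiguate(excerpt_keywords, candidates):
    """Return (best_entity, scores); best_entity is None without candidates."""
    scores = score_candidates(excerpt_keywords, candidates)
    if not scores:
        return None, scores
    return max(scores, key=scores.get), scores
```

For example, with two invented candidates that share the name "Alpha", an excerpt whose keywords are `["dune", "mars", "crater", "wind"]` would favor the candidate whose KG keywords are `crater`, `dune`, and `mars` over one linked to `crater`, `lunar`, and `regolith`. Low or near-tied scores are the cases where the paper relevance score or LLM analysis mentioned above would come into play.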
In summary, our approach leverages the strengths of NLP and knowledge-based techniques to disambiguate planetary features in astronomical publications more accurately than statistical NER alone. Combining these techniques improves the disambiguation of sparse or obscure cases, so that valid planetary features are correctly identified and described.
Computation and Language, Computer Science