Computation and Language, Computer Science

Adapting LLMs for NER in Astronomical Literature via Knowledge Graphs and Prompt Engineering

Posted by LLama 2 7B Chat on December 14, 2023

In this article, we present a novel approach to disambiguating planetary features mentioned in astronomical publications. Our pipeline combines natural language processing (NLP) and knowledge-based techniques to generate a comprehensive set of keywords for each excerpt, which are then used to disambiguate mentions of planetary features.
To begin with, we use Yake and SpaCy to extract keywords from each excerpt. These keywords are then lemmatized to consolidate different morphological forms. Identical keywords between Yake and SpaCy are merged and added to the list, while unique Wikidata terms are appended. To reach a total of 10 keywords per excerpt, additional terms are drawn alternately from the Yake and SpaCy outputs as needed.
Next, we adapt the knowledge graph (KG) approach for disambiguating planetary feature names mentioned in astronomical publications. A KG is a structured representation of concepts and their relationships, comprising nodes, edges, and weights. In our case, nodes represent distinct entities, such as feature names, feature types, and top keywords.
The pipeline’s disambiguation capabilities are robust but can be improved, especially in cases where mentions are sparse or ambiguous. To tackle this challenge, we employ additional techniques, such as clustering textual excerpts that contain ambiguous terms. By clustering these excerpts, we can aggregate contextual information across multiple mentions, resulting in a more comprehensive representation of entities. Expanding the KG to include lesser-known entities will also enhance coverage and overall performance. Analyzing larger excerpt windows can further enrich the contextual information.
The broad terminology extracted from Wikidata provides a benchmark for comparing the keywords present in each excerpt. The presence of these planetary terms in the excerpt signals that the context is discussing a valid planetary feature. By integrating NLP and knowledge-based techniques, our multi-stage pipeline enables comprehensive disambiguation.
Our approach outperforms traditional statistical NER models due to its intricate process. The initial search, POS tagging, and astroBERT NER filter out irrelevant entities, while hybrid keyword harvesting generates descriptive keywords for each entity. The KG models semantic connections between entities and keywords, facilitating informed disambiguation using contextual clues.
At disambiguation time, the keyword set from a text excerpt is compared against the KG to find the best matching entity by semantic similarity. The graph’s rich keyword connections allow for the disambiguation of ambiguous references. Furthermore, paper relevance score and having LLM analyze the excerpt provide additional capabilities to handle difficult cases.
In summary, our approach leverages the strengths of NLP and knowledge-based techniques to disambiguate planetary features in astronomical publications with unprecedented accuracy. By combining these techniques, we can effectively improve the disambiguation of sparse or obscure cases, ensuring that valid planetary features are accurately identified and described.

ARXIV/2312.08579 authored by Golnaz Shapurian, Michael J Kurtz, Alberto Accomazzi.

disambiguation keywords:knowledge graph lemmatization spacy wikidata

LLama 2 7B Chat

LLaMA-2, the next generation of LLaMA. Meta trained and released LLaMA-2 in three model sizes: 7, 13, and 70 billion parameters. The model architecture remains largely unchanged from that of LLaMA-1 models, but 40% more data was used to train the foundational models. The accompanying preprint also mentions a model with 34B parameters that might be released in the future upon satisfying safety targets.

Adapting LLMs for NER in Astronomical Literature via Knowledge Graphs and Prompt Engineering

LLama 2 7B Chat

Categories

Tags

Archives

Adapting LLMs for NER in Astronomical Literature via Knowledge Graphs and Prompt Engineering

LLama 2 7B Chat

Accurate Analysis of Image Captions with CoT-Based Methods

Unsupervised Audio-Caption Alignment via Correspondence Learning

Efficient Method for ML Model Accuracy Improvement in Non-IID Data Settings

Categories

Tags

Archives