In this paper, we explore a novel approach to disambiguating polyphonic characters using large language models (LLMs). Polyphonic characters are those that can be pronounced in more than one way, so the correct reading has to be inferred from context. Our proposed method leverages the pre-training process of LLMs to construct a multi-level semantic dictionary for all polyphonic characters. This dictionary is then incorporated into the prompt given to the LLM during the disambiguation stage.
We evaluate our method on the CPP dataset, a publicly available collection of Chinese sentences with polyphonic characters. Our method outperforms five baseline models on this dataset, which indicates that combining external knowledge with LLMs is effective for polyphone disambiguation.
Our approach is based on the idea that some external knowledge, such as the meanings and collocations of characters, can be useful for the disambiguation model. By constructing a multi-level semantic dictionary from the internet, we are able to incorporate this knowledge into the prompt used by the LLM. This allows the model to better understand the context and nuances of each polyphonic character, leading to more accurate disambiguation results.
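To make the idea of placing dictionary knowledge in the prompt concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not our actual implementation: the dictionary layout (character, then reading, then meanings with collocations), the names `TOY_DICT` and `build_prompt`, the example entry, and the prompt wording are all hypothetical.

```python
from typing import Dict, List, TypedDict


class Sense(TypedDict):
    meaning: str
    collocations: List[str]


# Hypothetical multi-level layout: character -> pinyin reading -> senses.
SemanticDict = Dict[str, Dict[str, List[Sense]]]

# Toy entry for illustration only; a real dictionary would cover every
# polyphonic character.
TOY_DICT: SemanticDict = {
    "行": {
        "xíng": [{"meaning": "to walk; to travel", "collocations": ["行走", "旅行"]}],
        "háng": [{"meaning": "row; line of business", "collocations": ["银行", "行业"]}],
    }
}


def build_prompt(sentence: str, index: int, dictionary: SemanticDict) -> str:
    """Format a sentence and the dictionary entry of its target character as a prompt."""
    char = sentence[index]
    readings = dictionary.get(char)
    if readings is None:
        raise KeyError(f"{char!r} is not in the polyphone dictionary")

    lines = [
        f"Sentence: {sentence}",
        f"Target character: {char} (position {index})",
        "Candidate readings:",
    ]
    # One line per (reading, sense) pair, listing the meaning and collocations.
    for pinyin, senses in readings.items():
        for sense in senses:
            examples = ", ".join(sense["collocations"])
            lines.append(f"- {pinyin}: {sense['meaning']} (e.g. {examples})")
    lines.append("Answer with exactly one of the candidate readings.")
    return "\n".join(lines)


if __name__ == "__main__":
    # The target character 行 is at index 3 of this sentence.
    print(build_prompt("我去银行取钱", 3, TOY_DICT))
```

The resulting string would be passed to the LLM as its input. Listing the collocations next to each reading gives the model an explicit cue (here, the word 银行 appears in the sentence itself) in addition to whatever it learned during pre-training.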
To further improve our method, we plan to explore how the scale of the LLM affects the method's performance, as well as how to incorporate Chain-of-Thought techniques into the task. Both directions are left for future research on polyphone disambiguation.
In summary, our paper presents a novel approach to polyphone disambiguation that leverages large language models and external knowledge to achieve more accurate results. By constructing a multi-level semantic dictionary and incorporating it into the prompt used by the LLM, we give the model explicit information about the meanings and collocations of each polyphonic character. This work has implications for applications such as language translation and text summarization, where polyphonic characters must be interpreted correctly.
Computation and Language, Computer Science