In this paper, the authors explore the use of large language models (LLMs) to generate context information for spoken language understanding tasks. They investigate how LLM size affects the usefulness of the generated context and propose generative context-aware fine-tuning, an approach that distills this generated information into self-supervised speech models during fine-tuning.
The authors start by explaining that when performing tasks like automatic speech recognition or spoken language understanding, access to previous text or audio provides valuable contextual information. They hypothesize that LLMs could generate useful context information using the preceding text and propose an approach to distill this information during fine-tuning of self-supervised speech models.
The authors test LLMs of different sizes for generating context information and find that larger LLMs produce better context, at the cost of greater computational resources. They also observe that even the context generated by the 7B LLM remains less useful than the ground-truth preceding text.
To address this issue, the authors propose an approach called generative context-aware fine-tuning, which allows the fine-tuned model to make improved predictions without access to the true surrounding segments or the LLM at inference time, while requiring only a small additional context module. They evaluate the effectiveness of their proposed approach using a series of experiments and show that it improves the performance of the speech models.
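The training setup described above can be sketched as a distillation objective: a small context module learns, from the speech representation alone, to predict the embedding of the LLM-generated context, so that neither the surrounding segments nor the LLM is needed at inference time. This is a minimal illustration, not the authors' implementation; the mean-pooled context module, the embedding dimensions, and the loss weight `alpha` are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def context_module(speech_features, W):
    # Hypothetical small module: mean-pool the speech frames and
    # project them into the context-embedding space.
    return speech_features.mean(axis=0) @ W

# Assumed shapes: T frames of D-dim self-supervised speech features,
# distilled into a C-dim context embedding.
T, D, C = 50, 16, 8
speech_features = rng.normal(size=(T, D))
W = rng.normal(size=(D, C)) * 0.1  # parameters of the context module

# "Teacher" target: an embedding of the LLM-generated context for this
# segment (in the paper, produced by an LLM from the preceding text).
llm_context_embedding = rng.normal(size=(C,))

# Distillation term: train the context module to mimic the LLM's
# context embedding, removing the LLM from the inference path.
pred = context_module(speech_features, W)
distill_loss = np.mean((pred - llm_context_embedding) ** 2)

# Combined with the usual fine-tuning objective (e.g. ASR/SLU loss):
task_loss = 1.0  # placeholder for the model's task loss
alpha = 0.5      # assumed weighting between the two terms
total_loss = task_loss + alpha * distill_loss
```

At inference, only `context_module` runs alongside the speech model, which is why the added cost is limited to this small module.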
The authors also compare their approach with other state-of-the-art methods and demonstrate its superiority in terms of both accuracy and computational efficiency. They conclude by highlighting the potential applications of their proposed approach in real-world scenarios, such as voice assistants or automotive systems.
In conclusion, this paper presents a novel approach to improving spoken language understanding using LLM-generated context information. By distilling the generated context during fine-tuning, the authors improve the performance of speech models without sacrificing computational efficiency. This work has important implications for real-world applications where spoken language understanding is critical, such as voice assistants and automotive systems.