Machine translation is a crucial tool for bridging language gaps, but evaluating its quality can be challenging. The BLEU score is a widely used metric that measures the similarity between generated and reference texts. However, relying solely on BLEU may lead to over-optimization and neglect of semantic nuances. This article delves into the limitations of BLEU and explores alternative approaches to improving machine translation quality.
The article begins by situating the BLEU score in the context of code-switched text, explaining its relevance for evaluating machine translations of such text. It then examines BLEU's limitations, including its reliance on n-gram precision and its lack of sensitivity to semantic nuance. The author acknowledges that while BLEU provides a practical, automated measure of translation quality, its scores may not align with human judgments.
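The n-gram-precision limitation is easy to demonstrate with a short sketch (using the sacrebleu library here, not necessarily the implementation used in the article): a paraphrase that preserves meaning but shares few n-grams with the reference receives a much lower score.

```python
# Illustration of the n-gram precision limitation: a meaning-preserving paraphrase
# scores far lower than a near-verbatim hypothesis because it shares few n-grams
# with the reference. Sentence-level scores are shown only for illustration.
import sacrebleu

reference = ["The weather is very pleasant today."]
close_hypothesis = "The weather is very nice today."
paraphrase_hypothesis = "It is a really lovely day outside."

print(sacrebleu.sentence_bleu(close_hypothesis, reference).score)       # relatively high
print(sacrebleu.sentence_bleu(paraphrase_hypothesis, reference).score)  # much lower, despite equivalent meaning
```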
To address these limitations, the article proposes alternative approaches to improving machine translation quality. These include selecting models that are small enough to train and fine-tune on the available Hinglish data, favoring smaller models for ease of fine-tuning rather than focusing on architectural intricacies or elaborate training methodologies, and recognizing that translation models trained on individual-language data face quite different challenges from those posed by code-switched data.
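As a rough illustration of that selection criterion, the sketch below screens pretrained seq2seq models by parameter count; the candidate model names are assumptions for illustration, not necessarily those considered in the article.

```python
# A minimal sketch of screening candidate models by size, since the article
# prefers models small enough to fine-tune on Hinglish data.
from transformers import AutoModelForSeq2SeqLM

candidates = [
    "Helsinki-NLP/opus-mt-hi-en",   # assumed example of a small bilingual model
    "facebook/mbart-large-50",      # assumed example of a larger multilingual model
]

for name in candidates:
    model = AutoModelForSeq2SeqLM.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters")
```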
The article also explores the use of multilingual Large Language Models (LLMs) for code-switching tasks, noting that their more complex fine-tuning process may yield suboptimal performance compared to smaller, carefully fine-tuned models. The author then explains their choice to experiment with both multilingual and English-only (monolingual) models, weighing factors such as model size, architecture, and training techniques.
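A minimal fine-tuning sketch along these lines might look as follows; the model name, toy Hinglish-English pairs, and hyperparameters are assumptions for illustration rather than the article's actual setup.

```python
# A minimal seq2seq fine-tuning sketch on a toy Hinglish->English parallel set,
# assuming a small pretrained translation model as the starting point.
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer,
                          DataCollatorForSeq2Seq)
from datasets import Dataset

model_name = "Helsinki-NLP/opus-mt-hi-en"  # assumed stand-in for the chosen model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Toy parallel examples (illustrative only).
pairs = [
    {"src": "mujhe ye movie bahut pasand aayi", "tgt": "I liked this movie a lot"},
    {"src": "kal office mein meeting hai", "tgt": "There is a meeting at the office tomorrow"},
]

def preprocess(batch):
    # Tokenize source sentences and target sentences (as labels).
    enc = tokenizer(batch["src"], truncation=True, max_length=64)
    enc["labels"] = tokenizer(text_target=batch["tgt"], truncation=True, max_length=64)["input_ids"]
    return enc

train_ds = Dataset.from_list(pairs).map(preprocess, batched=True, remove_columns=["src", "tgt"])

args = Seq2SeqTrainingArguments(
    output_dir="hinglish-mt",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    learning_rate=3e-5,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```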
To enhance the model’s understanding of contextual information within language, the article introduces some noise into the datasets, recognizing that Hinglish, unlike Hindi or English, is not a standardized language. The author augments the dataset with bilingual (Hinglish) data, removes words that are neither English nor Hindi, and adds spelling variations to capture contextual differences.
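The kind of preprocessing described here could be sketched as follows; the vocabularies and spelling-variant map are assumed for illustration and are not taken from the article.

```python
# A minimal sketch of the described preprocessing: drop tokens that are neither
# English nor romanized Hindi, then inject spelling variants as noise.
# Word lists and substitution rules below are assumptions for illustration.
import random
import re

english_vocab = {"movie", "office", "meeting", "very", "good"}                         # assumed
hindi_roman_vocab = {"mujhe", "ye", "bahut", "pasand", "aayi", "kal", "mein", "hai"}   # assumed

spelling_variants = {          # assumed variant map
    "bahut": ["bohot", "bhut"],
    "mujhe": ["mujhko", "muje"],
}

def clean_and_noise(sentence: str, noise_prob: float = 0.3) -> str:
    tokens = re.findall(r"[a-zA-Z]+", sentence.lower())
    # Keep only tokens recognized as English or romanized Hindi.
    kept = [t for t in tokens if t in english_vocab or t in hindi_roman_vocab]
    # Randomly swap in spelling variants to add noise.
    noised = [random.choice(spelling_variants[t]) if t in spelling_variants and random.random() < noise_prob
              else t
              for t in kept]
    return " ".join(noised)

print(clean_and_noise("mujhe ye movie bahut pasand aayi !!"))
```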
Finally, the article presents the results of the experiments as evaluated by BLEU score, highlighting the trade-offs between different model sizes and training techniques. The author concludes that while BLEU remains a useful metric for evaluating machine translation quality, other factors must also be considered to ensure a comprehensive assessment and optimization of machine translation systems.
Computation and Language, Computer Science