Enhanced Gloss2Text Model for Adaptive Translation of Visual-Grounded Text

Optimization techniques are crucial in many fields, but their complexity often hinders widespread adoption. In this article, we propose a vision-based sentence property learning model that enhances the adaptive translation of gloss words for efficient optimization. By leveraging the visual modality and incorporating contextual knowledge, our approach surpasses existing state-of-the-art methods in evaluations on public datasets.

Vision-Based Sentence Property Learning

Our proposed model, Enhanced Gloss2Text (EGT), improves the translation of gloss words by learning sentence properties from the visual modality. Visual signals supply essential contextual information that helps adapt gloss words to diverse scenarios. In contrast to traditional methods that rely solely on language models, EGT's integration of visual and contextual knowledge enables more accurate and efficient optimization.
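
The article does not include the model's implementation, so the following PyTorch sketch shows only one plausible way to realize this fusion: gloss token embeddings are combined additively with projected per-frame visual features, and a small Transformer encoder then contextualizes the fused sequence. All module names and dimensions are illustrative assumptions, apart from the 2-layer, 8-head encoder reported in the experiments below.

```python
# Hypothetical sketch of visual-grounded gloss encoding; not the authors' code.
import torch
import torch.nn as nn


class VisualGlossEncoder(nn.Module):
    """Fuses gloss token embeddings with per-frame visual features,
    then contextualizes the sequence with a small Transformer encoder."""

    def __init__(self, vocab_size, d_model=512, visual_dim=1024,
                 num_layers=2, num_heads=8):
        super().__init__()
        self.gloss_embed = nn.Embedding(vocab_size, d_model)
        # Project visual features into the gloss embedding space.
        self.visual_proj = nn.Linear(visual_dim, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, gloss_ids, visual_feats):
        # gloss_ids: (batch, seq_len); visual_feats: (batch, seq_len, visual_dim)
        # Additive fusion: one simple way to inject visual context per gloss.
        fused = self.gloss_embed(gloss_ids) + self.visual_proj(visual_feats)
        return self.encoder(fused)  # (batch, seq_len, d_model)


# Example: encode a batch of 8 gloss sequences of length 60.
enc = VisualGlossEncoder(vocab_size=3000)
out = enc(torch.randint(0, 3000, (8, 60)), torch.randn(8, 60, 1024))
print(out.shape)  # torch.Size([8, 60, 512])
```

Additive fusion is only one design choice; concatenation followed by a projection, or cross-attention from gloss tokens to visual frames, would serve the same purpose.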

Experimental Setup

To evaluate the proposed model, we used the public CSL-Daily dataset [16] and adopted a Transformer encoder with 2 layers and 8 attention heads. The learning rate was set to 10⁻⁵, and the batch size was fixed at 8. We initialized the generator with a pretrained Chinese BART model and unified the target sentence length to 60 tokens. After training for 40 epochs, we evaluated the model using BLEU-1 through BLEU-4 [25] and ROUGE-L [26].
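
For concreteness, this setup maps naturally onto the Hugging Face Transformers API. The sketch below is an assumption-laden illustration, not the authors' code: fnlp/bart-base-chinese stands in for the unnamed pretrained Chinese BART checkpoint, and scoring uses the sacrebleu and rouge_score packages rather than the paper's own evaluation scripts.

```python
# Hedged sketch of the reported fine-tuning and evaluation configuration.
import sacrebleu
from rouge_score import rouge_scorer
from transformers import (BartForConditionalGeneration, BertTokenizer,
                          Seq2SeqTrainingArguments)

# Generator initialized from a pretrained Chinese BART model (assumed checkpoint).
tokenizer = BertTokenizer.from_pretrained("fnlp/bart-base-chinese")
model = BartForConditionalGeneration.from_pretrained("fnlp/bart-base-chinese")

# Hyperparameters as reported in the paper.
args = Seq2SeqTrainingArguments(
    output_dir="egt-gloss2text",
    learning_rate=1e-5,             # 10^-5
    per_device_train_batch_size=8,  # batch size 8
    num_train_epochs=40,            # 40 epochs
    predict_with_generate=True,
    generation_max_length=60,       # target sentence length unified to 60
)

# Corpus-level BLEU (sacrebleu reports BLEU-4 by default; precisions[0]
# is the 1-gram precision) and mean ROUGE-L F1. rouge_score splits on
# whitespace, so Chinese text would need pre-segmentation (e.g., per character).
def evaluate(hypotheses, references):
    bleu = sacrebleu.corpus_bleu(hypotheses, [references], tokenize="zh")
    rouge = rouge_scorer.RougeScorer(["rougeL"])
    rl = sum(rouge.score(r, h)["rougeL"].fmeasure
             for r, h in zip(references, hypotheses)) / len(references)
    return {"BLEU-4": bleu.score, "ROUGE-L": rl}
```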

Model Comparison

Our proposed EGT model outperformed existing state-of-the-art methods across these evaluations. By incorporating visual and contextual knowledge, our approach not only improves the accuracy of gloss word translation but also enables more efficient optimization.

Conclusion and Future Work

In conclusion, our proposed Enhanced Gloss2Text model offers a promising solution for efficient optimization by leveraging vision-based sentence property learning. By incorporating contextual knowledge and visual modalities, our approach surpasses existing methods in evaluations on public datasets. In the future, we plan to expand the evaluation of our model to more languages and explore its applicability to other fields.