Bridging the gap between complex scientific research and the curious minds eager to explore it.

Computer Science, Computer Vision and Pattern Recognition

Improving Caption Quality with Multimodal Fusion


In this article, we explore how drawing on multiple sources of information can improve image captioning systems. Our proposed approach, the Retrieval-Augmented Transformer (RAT), pairs a retrieval component with a transformer-based caption generator to produce more pertinent and accurate captions. By incorporating the retrieved information into the transformer architecture, RAT is better able to produce long-tail words and named entities, which improves performance on benchmark datasets.
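To make the retrieval idea concrete, here is a minimal sketch, not the authors' implementation, of what a retrieval step could look like: the image and a datastore of candidate captions are assumed to live in a shared embedding space (for example, from a CLIP-style encoder), and the most similar captions are looked up by cosine similarity. The function name and the toy random vectors below are illustrative assumptions.

```python
import numpy as np

def retrieve_captions(image_embedding, caption_embeddings, captions, k=3):
    """Return the k datastore captions most similar to the query image."""
    # Cosine similarity between the image embedding and every stored caption embedding.
    image_embedding = image_embedding / np.linalg.norm(image_embedding)
    caption_embeddings = caption_embeddings / np.linalg.norm(
        caption_embeddings, axis=1, keepdims=True)
    scores = caption_embeddings @ image_embedding
    top_k = np.argsort(scores)[::-1][:k]
    return [captions[i] for i in top_k]

# Toy datastore: in practice the embeddings would come from a real image/text encoder
# applied to a large captioned corpus, not random vectors.
captions = ["a golden retriever catches a frisbee",
            "the Eiffel Tower at sunset",
            "a chef plating a risotto"]
caption_embeddings = np.random.randn(3, 512)
image_embedding = np.random.randn(512)
print(retrieve_captions(image_embedding, caption_embeddings, captions, k=2))
```

Retrieved captions like these are a natural place for rare words and named entities to come from, since they already appear verbatim in the datastore.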
We evaluate RAT against state-of-the-art image captioning models and find that it generates more informative and relevant captions. In our experiments, RAT outperforms the baselines by a clear margin, particularly on captions involving long-tail words and named entities.
The key idea is that the caption generator does not have to rely on the input image alone. Retrieval information is fed into the transformer alongside the visual features, giving the decoder additional context and semantics to draw on when describing the image. This is what allows the model to produce long-tail words and named entities that are difficult to generate from the image features by themselves, as illustrated in the sketch below.
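The article does not spell out the exact fusion mechanism, but one common way to inject retrieved text into a transformer decoder is an extra cross-attention layer over the retrieved embeddings. The sketch below shows that idea under those assumptions; the class name RetrievalFusionDecoder and all hyperparameters are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class RetrievalFusionDecoder(nn.Module):
    """Toy decoder that attends over image features and retrieved-caption embeddings."""
    def __init__(self, vocab_size=10000, d_model=512, nhead=8, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        # Extra cross-attention over the retrieval memory (embeddings of retrieved captions).
        self.retrieval_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, caption_tokens, image_features, retrieved_embeddings):
        # caption_tokens: (B, T) partial caption; image_features: (B, R, d_model);
        # retrieved_embeddings: (B, K, d_model) from the retrieval step.
        x = self.embed(caption_tokens)
        # Standard decoding attends over the visual features...
        x = self.decoder(tgt=x, memory=image_features)
        # ...and an additional cross-attention injects the retrieved text, which is
        # where rare words and named entities can be drawn from.
        fused, _ = self.retrieval_attn(query=x, key=retrieved_embeddings,
                                       value=retrieved_embeddings)
        return self.out(x + fused)

# Dummy forward pass with toy shapes.
model = RetrievalFusionDecoder()
logits = model(torch.randint(0, 10000, (2, 12)),
               torch.randn(2, 36, 512),
               torch.randn(2, 5, 512))
print(logits.shape)  # (2, 12, 10000)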
In summary, our work presents a retrieval-augmented approach to image captioning that combines the strengths of multiple models and information sources. Incorporating retrieval information into the transformer architecture yields more accurate and informative captions, even for long-tail words and named entities, with promising implications for applications such as visual question answering, image retrieval, and accessibility for visually impaired users.