Bridging the gap between complex scientific research and the curious minds eager to explore it.

Computer Science, Computer Vision and Pattern Recognition

Exploring Multitask Learning for Image Captioning and Question Answering


Large multimodal models (LMMs) have shown impressive performance on tasks such as visual question answering (VQA), image captioning, and visual document understanding. However, deploying a separate specialist model for each task is impractical at this scale. To address this challenge, researchers propose building generalist LMMs that handle multiple tasks with a single set of model parameters.
The article discusses the challenges of building a single generalist model for multiple tasks, and the drawbacks of naively fine-tuning shared parameters on supervised data spanning all of them. The authors argue that these tasks are unlikely to share the same configuration of modalities, since they demand diverse capabilities: recognizing the fine-grained identity of visual content, drawing on world knowledge beyond the visual scene, and reading and understanding text in images.
To overcome these challenges, the authors propose a flexible architecture that can adapt to changing requirements without necessitating a complete overhaul. The proposed architecture allows for seamless integration of additional low-rank specialist modules as needed, enhancing the model’s capability without limiting its adaptability to changing scenarios.
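The "low-rank specialist modules" described above resemble LoRA-style adapters, where a small trainable low-rank update is added alongside a frozen shared weight matrix. The sketch below is an illustrative NumPy toy (the dimensions, initialization, and `forward` helper are assumptions, not the paper's implementation) showing why such modules are cheap to add per task:

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, rank = 16, 16, 4  # illustrative sizes, not from the paper

# Frozen base weight shared across all tasks.
W = rng.standard_normal((d_out, d_in))

# Low-rank specialist module for one task: only A and B are trained.
# B starts at zero, so the adapter initially leaves the base model unchanged.
A = rng.standard_normal((rank, d_in)) * 0.01
B = np.zeros((d_out, rank))

def forward(x, use_adapter=True):
    """Apply the shared weight, optionally adding the low-rank specialist update."""
    y = W @ x
    if use_adapter:
        y = y + B @ (A @ x)
    return y

x = rng.standard_normal(d_in)

# With B initialized to zero, adapter and base outputs coincide.
assert np.allclose(forward(x, use_adapter=True), forward(x, use_adapter=False))

# Each task adds only rank * (d_in + d_out) parameters,
# versus d_in * d_out for fully fine-tuning W.
print(rank * (d_in + d_out), "adapter params vs", d_in * d_out, "full params")
```

Because the base weights stay frozen, new specialist modules can be attached or swapped per task without retraining or duplicating the full model, which is the adaptability the proposed architecture aims for.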
The article also presents experimental results demonstrating the effectiveness of the proposed approach in various combinations of tasks, including image captioning and VQA. The findings suggest that the proposed architecture can improve performance on each task while maintaining a single set of model parameters, making it an attractive option for real-world applications.
In summary, the article describes a flexible and efficient approach to building generalist LMMs that handle multiple tasks with a single set of model parameters, offering a promising way to improve performance across multimodal tasks while reducing complexity and computational cost.