Bridging the gap between complex scientific research and the curious minds eager to explore it.

Computer Science, Computer Vision and Pattern Recognition

Exploring Multitask Learning for Image Captioning and Question Answering


Large multimodal models (LMMs) have shown impressive performance on tasks such as visual question answering (VQA), image captioning, and visual document understanding. However, deploying a separate specialist model for each task is impractical at this scale. To address this challenge, researchers propose building generalist LMMs that handle multiple tasks with a single set of model parameters.
The article discusses the challenges of building a single generalist model for multiple tasks, and the drawbacks of naively fine-tuning shared parameters on supervised data spanning all of them. The authors argue that these tasks are unlikely to share the same configuration of modalities, since they demand diverse capabilities: recognizing the fine-grained identity of visual content, drawing on world knowledge beyond the visual scene, and reading and understanding text in images.
To overcome these challenges, the authors propose a flexible architecture that can adapt to changing requirements without necessitating a complete overhaul. The proposed architecture allows for seamless integration of additional low-rank specialist modules as needed, enhancing the model’s capability without limiting its adaptability to changing scenarios.
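The "low-rank specialist modules" described above resemble LoRA-style adapters, where a small trainable low-rank update is added alongside a frozen shared weight matrix. The sketch below is an illustrative NumPy toy (the dimensions, initialization, and `forward` helper are assumptions, not the paper's implementation) showing why such modules are cheap to add per task:

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, rank = 16, 16, 4  # illustrative sizes, not from the paper

# Frozen base weight shared across all tasks.
W = rng.standard_normal((d_out, d_in))

# Low-rank specialist module for one task: only A and B are trained.
# B starts at zero, so the adapter initially leaves the base model unchanged.
A = rng.standard_normal((rank, d_in)) * 0.01
B = np.zeros((d_out, rank))

def forward(x, use_adapter=True):
    """Apply the shared weight, optionally adding the low-rank specialist update."""
    y = W @ x
    if use_adapter:
        y = y + B @ (A @ x)
    return y

x = rng.standard_normal(d_in)

# With B initialized to zero, adapter and base outputs coincide.
assert np.allclose(forward(x, use_adapter=True), forward(x, use_adapter=False))

# Each task adds only rank * (d_in + d_out) parameters,
# versus d_in * d_out for fully fine-tuning W.
print(rank * (d_in + d_out), "adapter params vs", d_in * d_out, "full params")
```

Because the base weights stay frozen, new specialist modules can be attached or swapped per task without retraining or duplicating the full model, which is the adaptability the proposed architecture aims for.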
The article also presents experimental results demonstrating the effectiveness of the proposed approach in various combinations of tasks, including image captioning and VQA. The findings suggest that the proposed architecture can improve performance on each task while maintaining a single set of model parameters, making it an attractive option for real-world applications.
In summary, the article describes a flexible and efficient approach to building generalist LMMs that handle multiple tasks with a single set of model parameters, offering a promising way to improve performance across multimodal tasks while reducing complexity and computational cost.