

Ablation Study on Model Architecture and Pre-train Data for Visual Question Answering


Visual Question Answering (VQA) is the task of answering natural-language questions about images. The article focuses on CogVLM, a framework for VQA that leverages both generalist models and task-specific fine-tuning to improve performance. The authors aim to enhance the model’s ability to comprehend high-resolution images and to adapt it to GUI application scenarios. To do so, they build pre-training data from text recognition samples, OCR of natural images, and academic documents.
Generalist Models vs Task-Specific Fine-Tuning
CogVLM combines generalist pre-training with task-specific fine-tuning to improve VQA performance. The generalist model is pre-trained on a large collection of images and learns to recognize text of various sizes, orientations, and fonts. Task-specific fine-tuning then adapts the model on a dataset tied to a particular domain, such as GUI applications and web pages. By combining the two stages, CogVLM can better understand the content of high-resolution images and adapt to different scenarios.
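To make the generalist-versus-task-specific distinction concrete, here is a minimal, hypothetical PyTorch sketch (not the authors’ code): a stand-in pre-trained backbone is kept frozen while a small task-specific head is fine-tuned on new data. The model, dataset, and the choice to freeze the backbone are illustrative assumptions, not details from the paper.

```python
# Minimal sketch: fine-tune a task-specific head on top of a frozen "generalist" backbone.
# All components are stand-ins; real vision-language models are far larger.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class TinyVLM(nn.Module):
    """Stand-in for a pre-trained generalist vision-language model."""
    def __init__(self, img_dim=512, text_dim=512, num_answers=100):
        super().__init__()
        self.backbone = nn.Linear(img_dim + text_dim, 256)  # "generalist" part, pre-trained
        self.head = nn.Linear(256, num_answers)             # task-specific answer head

    def forward(self, img_feat, text_feat):
        fused = torch.relu(self.backbone(torch.cat([img_feat, text_feat], dim=-1)))
        return self.head(fused)

model = TinyVLM()
# Freeze the generalist backbone; only the task-specific head is updated.
for p in model.backbone.parameters():
    p.requires_grad = False

# Dummy task-specific data (e.g. GUI/web-page screenshots paired with questions).
imgs, texts = torch.randn(32, 512), torch.randn(32, 512)
labels = torch.randint(0, 100, (32,))
loader = DataLoader(TensorDataset(imgs, texts, labels), batch_size=8)

opt = torch.optim.AdamW(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for img_feat, text_feat, y in loader:
    loss = loss_fn(model(img_feat, text_feat), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Whether to freeze the backbone or fine-tune it end to end is a design choice; the sketch freezes it only to highlight which part is "generalist" and which is "task-specific".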
Data Collection and Pre-training
To enhance the model’s ability to comprehend high-resolution images, the authors collect text recognition samples, OCR of natural images, and academic documents. For text recognition, they use synthetic renderings of text drawn from a language pre-training dataset (80 million samples). For OCR, they collect natural images from COYO and LAION-2B and employ Paddle-OCR to extract text together with its bounding boxes. For academic documents, they construct image-text pairs covering text, formulas, and tables from LaTeX source code released on arXiv.
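To give a sense of what the OCR collection step might look like in practice, below is a small sketch using the open-source PaddleOCR package (the paper names Paddle-OCR as the tool). The pipeline, file names, confidence threshold, and the build_pairs helper are illustrative assumptions, and the call pattern follows PaddleOCR’s commonly documented 2.x interface, not the authors’ actual code.

```python
# Illustrative sketch: extract text and bounding boxes from natural images
# with PaddleOCR to form text-grounding training pairs.
from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=True, lang="en")  # detection + recognition + angle classifier

def build_pairs(image_path):
    """Return (text, bounding box) pairs detected in a single image."""
    result = ocr.ocr(image_path, cls=True)
    pairs = []
    for line in result[0] or []:        # result[0] holds detections for this image
        box, (text, score) = line       # box: four corner points; text with confidence
        if score > 0.9:                 # keep only confident detections
            pairs.append({"text": text, "bbox": box})
    return pairs

# Example: collect grounding samples for a batch of web-crawled images (placeholder paths).
samples = {p: build_pairs(p) for p in ["img_0001.jpg", "img_0002.jpg"]}
```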
Improving Performance with CogVLM
CogVLM significantly improves VQA performance compared to other state-of-the-art models. The authors report an average score of 79.5 across the Visual Question Answering benchmarks, higher than any other model they compare against. Their approach outperforms both the best generalist models and the best task-specific fine-tuned models.

Conclusion: Enhancing VQA Performance with CogVLM

In conclusion, CogVLM is a powerful framework for VQA that leverages both generalist models and task-specific fine-tuning to improve performance. By collecting pre-training data covering text recognition, OCR of natural images, and academic documents, the authors enhance the model’s ability to comprehend high-resolution images and adapt it to GUI application scenarios. The result is a significant improvement in VQA performance compared to other state-of-the-art models.