Unifying Language and Vision Models for Efficient Policy Learning

In this article, we examine the challenge of crafting useful instructions for robots that perform tasks with the help of language models. Existing methods mainly augment the text with more detailed instructions, such as task plans, but they say little about where the relevant objects are located. Our proposed method, OCI (Object-Centric Instruction Augmentation), addresses this gap with a fine-tuned MLLM (Multi-modal Large Language Model) that comprehends both language and the environment, correlating objects’ locations with their identities.

The OCI framework consists of two key components:

Fine-Tuned MLLM: We start from weights pretrained on a combined dataset of Conceptual Captions, SBU, and LAION, with 20,000 training steps and a batch size of 256. The resulting model can correlate objects’ locations with their identities.
Feature Reuse Mechanism: We reuse the feature embeddings from the MLLM to improve policy learning. The contextual information they carry helps the policy network understand an object’s position and generate a valid action trajectory.
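The feature reuse idea can be illustrated with a minimal sketch. The article does not say how the MLLM features are fused into the policy network, so everything here is an assumption: the fusion is plain concatenation, the policy is a single dense layer, the feature sizes (8 and 4) and the 7-dimensional action are arbitrary, and the names `linear`, `init_layer`, and `policy_forward` are hypothetical.

```python
import math
import random


def linear(x, weights, bias):
    """Apply a dense layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def init_layer(n_in, n_out, rng):
    """Create small random weights; a real policy would learn these."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def policy_forward(visual_feat, mllm_feat, layer):
    """Fuse camera features with reused MLLM features, then predict an action."""
    # Assumed fusion: simple concatenation of the two feature vectors.
    fused = list(visual_feat) + list(mllm_feat)
    weights, bias = layer
    # tanh keeps every action component in [-1, 1].
    return [math.tanh(a) for a in linear(fused, weights, bias)]


rng = random.Random(0)
visual_feat = [rng.random() for _ in range(8)]  # stand-in for image encoder output
mllm_feat = [rng.random() for _ in range(4)]    # stand-in for the MLLM embedding
layer = init_layer(len(visual_feat) + len(mllm_feat), 7, rng)
action = policy_forward(visual_feat, mllm_feat, layer)
print(len(action))
```

The point of the sketch is only that the policy consumes the MLLM's embedding directly instead of re-encoding the instruction from scratch; the actual architecture may differ.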
By chaining "what" and "where" into a single instruction for manipulation policy learning, OCI lets robots understand objects’ positions and perform tasks more effectively. This addresses the main limitation of existing methods, whose instructions describe the task but not the location of the objects involved.
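As a rough illustration of chaining "what" and "where", the sketch below appends a bounding box to each object mentioned in an instruction. The output format, the `Detection` class, and the `augment_instruction` helper are hypothetical; in OCI the location cues come from the fine-tuned MLLM, not from a hand-written list of detections.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    name: str    # object identity ("what")
    box: tuple   # normalized (x_min, y_min, x_max, y_max) ("where")


def augment_instruction(instruction, detections):
    """Attach each detected object's box to its first mention in the instruction."""
    out = instruction
    for det in detections:
        if det.name in out:
            coords = ", ".join(f"{v:.2f}" for v in det.box)
            # Replace only the first mention so repeated names are not tagged twice.
            out = out.replace(det.name, f"{det.name} [{coords}]", 1)
    return out


detections = [
    Detection("apple", (0.10, 0.20, 0.30, 0.40)),
    Detection("bowl", (0.55, 0.50, 0.80, 0.75)),
]
augmented = augment_instruction("pick up the apple and place it in the bowl", detections)
print(augmented)
# pick up the apple [0.10, 0.20, 0.30, 0.40] and place it in the bowl [0.55, 0.50, 0.80, 0.75]
```

Plain substring matching is enough for this example, but it would mis-tag overlapping names such as "block" and "red block".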
In summary, this article presents an approach to crafting useful instructions for robots using language models. By combining a fine-tuned MLLM with a feature reuse mechanism, OCI gives the policy network explicit knowledge of where objects are, helping robots interpret human instructions and complete tasks more accurately.

arXiv:2401.02814, authored by Junjie Wen, Yichen Zhu, Minjie Zhu, Jinming Li, Zhiyuan Xu, Zhengping Che, Chaomin Shen, Yaxin Peng, Dong Liu, Feifei Feng, and Jian Tang.
