In this paper, the authors propose a new model called LEGO (Leveraging Egocentric Generative Options) to improve multimodal generation across computer vision and language. The key innovation of LEGO is its ability to generate egocentric action frames, images rendered from the user's point of view that show how to execute a queried action, which makes the resulting instructions easier to understand and interpret than text alone.
To illustrate this concept, imagine you want to learn how to play a specific game. A traditional large language model (LLM) would provide general textual instructions that may not be tailored to your current situation, such as "Move the red piece to the blue square." Visual large language models (VLLMs) can ground their guidance in a visual prompt, but their output is still text, which can remain hard to interpret. LEGO fills this gap by generating egocentric action frames that show you exactly how to move your pieces, making the skill easier to learn and master.
LEGO takes two inputs: a user's query (e.g., "Show me how to play chess") and an egocentric image captured from the user's perspective. The model then generates an egocentric action frame that visualizes the execution of the queried action, for example the next move on the board as seen from the player's seat. This approach allows LEGO to provide more accurate and relevant guidance than traditional LLMs or VLLMs, as sketched in the example below.
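To make the interface concrete, here is a minimal sketch of the "egocentric image + action query, action frame out" workflow described above. It uses the publicly available InstructPix2Pix pipeline from Hugging Face diffusers as a generic stand-in for LEGO's image generator; the checkpoint name, prompt wording, and guidance values are illustrative assumptions, not the authors' released system.

```python
# Sketch of the query-plus-egocentric-image interface, using InstructPix2Pix
# as a stand-in generator (assumption; this is not LEGO's released model).
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline


def generate_action_frame(egocentric_image_path: str, action_query: str) -> Image.Image:
    """Return one frame depicting the queried action, conditioned on the
    user's current egocentric view."""
    pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
        "timbrooks/instruct-pix2pix",   # illustrative public checkpoint
        torch_dtype=torch.float16,
    ).to("cuda")

    current_view = Image.open(egocentric_image_path).convert("RGB")
    result = pipe(
        prompt=action_query,            # e.g. "move the white pawn to e4"
        image=current_view,             # egocentric image from the user's point of view
        num_inference_steps=30,
        image_guidance_scale=1.5,       # keep the output close to the input scene
        guidance_scale=7.0,             # how strongly to follow the action text
    )
    return result.images[0]


if __name__ == "__main__":
    frame = generate_action_frame("kitchen_view.jpg", "cut the tomato with the knife")
    frame.save("predicted_action_frame.png")
```

The key design point this sketch illustrates is that the generator is conditioned on both the current egocentric view and the action text, so the output frame stays grounded in the user's actual scene rather than depicting a generic demonstration.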
In summary, LEGO is a novel model that leverages egocentric generative options to improve multimodal generation across computer vision and language. By generating egocentric action frames that demonstrate the execution of a queried action, LEGO makes generated instructions easier to understand and interpret, which makes it a valuable tool for learning new skills.