In this paper, we propose a novel approach to active open-vocabulary recognition, a capability critical for embodied AI applications such as robotics and autonomous vehicles. The proposed agent is designed for challenging scenarios in which it encounters new objects or classes during deployment. To address the limitations of recent CLIP models in unconstrained environments, we identify three essential requirements for an active open-vocabulary recognition agent:
- Intelligent Perception: The agent must be able to perceive and acquire informative visual observations to enhance recognition performance, especially under suboptimal viewing conditions. We employ a self-attention module that selectively weights different frames so that the global feature retains the essential information.
- Effective Integration: The agent must integrate the evidence accumulated across observations, including evidence for novel categories not encountered during training. A successful integration mechanism enables accurate class prediction for both base and novel classes.
- Robust Generalization: The recognition policy should demonstrate robust generalization capabilities to handle unseen objects or classes in real-world scenarios.
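The frame weighting described under Intelligent Perception can be illustrated with a minimal sketch. It assumes single-head scaled dot-product self-attention over per-frame features followed by mean pooling. The function name `fuse_frames`, the projection matrices `w_q`, `w_k`, `w_v`, and the pooling step are hypothetical choices made for illustration, not the module used in this work.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def fuse_frames(frames, w_q, w_k, w_v):
    """Fuse per-frame features into one global feature (illustrative only).

    frames: array of shape (T, D), one feature vector per observed frame.
    w_q, w_k, w_v: hypothetical projection matrices of shape (D, D).
    Returns the global feature of shape (D,) and per-frame weights of shape (T,).
    """
    q = frames @ w_q
    k = frames @ w_k
    v = frames @ w_v
    # Pairwise frame-to-frame scores, scaled by the key dimension.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    attn = softmax(scores, axis=-1)  # each row sums to 1
    attended = attn @ v              # (T, D) attended frame features
    # Assumption: mean pooling turns the attended frames into a global feature.
    global_feature = attended.mean(axis=0)
    # Average attention each frame receives; these weights sum to 1.
    frame_weights = attn.mean(axis=0)
    return global_feature, frame_weights


rng = np.random.default_rng(0)
frames = rng.normal(size=(4, 8))  # 4 frames with 8-dimensional features
w_q, w_k, w_v = (0.1 * rng.normal(size=(8, 8)) for _ in range(3))
global_feature, frame_weights = fuse_frames(frames, w_q, w_k, w_v)
```

Under these assumptions the global feature equals the weighted sum `frame_weights @ (frames @ w_v)`, so frames that receive more attention contribute more to the fused representation.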
We evaluate our approach with CLIP models and assess how these models perform across varying viewpoints and occlusion levels. The results reveal that CLIP is markedly sensitive to suboptimal viewing conditions, underscoring the importance of the agent’s ability to intelligently acquire informative visual observations. By integrating a self-attention module and employing similarity measures to avoid class-related biases, we enhance the reliability of open-vocabulary recognition agents in embodied perception scenarios.
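One way to read the similarity-based prediction mentioned above is as a cosine-similarity comparison between the fused visual feature and one text embedding per class name, in the style of CLIP zero-shot classification. The sketch below is written under that assumption. The function names and the toy embeddings are hypothetical, and the text embeddings would in practice come from a text encoder rather than being written by hand.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def classify_by_similarity(global_feature, class_embeddings, class_names):
    """Pick the class whose embedding is most similar to the visual feature.

    global_feature: array of shape (D,), the fused visual feature.
    class_embeddings: array of shape (C, D), one embedding per class name.
    class_names: list of C names; novel classes are added by appending rows,
    with no classifier head to retrain (an assumption of this sketch).
    """
    sims = l2_normalize(class_embeddings) @ l2_normalize(global_feature)
    best = int(np.argmax(sims))
    return class_names[best], sims


# Toy stand-ins for text embeddings of three class names.
class_names = ["mug", "chair", "lamp"]
class_embeddings = np.eye(3)
feature = np.array([0.1, 0.9, 0.2])

label, sims = classify_by_similarity(feature, class_embeddings, class_names)
print(label)  # chair
```

Because both sides are normalized, the prediction depends only on direction in the embedding space and not on feature magnitude, which is one plausible way such a measure avoids favoring particular classes.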
In summary, our work addresses the limitations of recent CLIP models in unconstrained environments by introducing an active open-vocabulary recognition agent that intelligently perceives and integrates visual observations while generalizing robustly to unseen objects and classes. By meeting these three requirements, the proposed agent improves the reliability and versatility of embodied AI applications such as robotics and autonomous vehicles.