To further refine the selected views, the authors employ hierarchical prompts powered by large language models (LLMs). These prompts guide the LLM to assess the semantic content of each view and to generate category-level prompts that pair each class name with its textual attributes. The prediction for each view is then computed separately using the generated prompts, and the entropy of each view's logits (logits_i) is calculated as before.
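The per-view scoring step can be illustrated with a minimal sketch. The snippet below assumes OpenAI's `clip` package, a single LLM-generated prompt per category, and a hypothetical helper `view_entropy`; it is not the authors' exact implementation, only an illustration of scoring one view by the entropy of its CLIP prediction over the generated prompts.

```python
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP package; assumed here, adjust to the actual codebase

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def view_entropy(view_image, class_prompts):
    """Score one augmented view by the entropy of its CLIP prediction.

    view_image: a PIL image for a single cropped/augmented view.
    class_prompts: LLM-generated textual prompts, one per category (assumed format).
    Lower entropy means the view yields a more confident, informative prediction.
    """
    image = preprocess(view_image).unsqueeze(0).to(device)
    text = clip.tokenize(class_prompts).to(device)

    with torch.no_grad():
        image_feat = F.normalize(model.encode_image(image), dim=-1)
        text_feat = F.normalize(model.encode_text(text), dim=-1)

        # Per-view logits over categories (temperature-scaled cosine similarity).
        logits = 100.0 * image_feat @ text_feat.T  # shape: (1, num_classes)
        probs = logits.softmax(dim=-1)

        # Shannon entropy of the predictive distribution for this view.
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)

    return probs, entropy.item()
```

In this sketch, views whose entropy falls below some threshold (or the lowest-entropy views among the candidates) would be the ones retained; the exact selection rule follows the procedure described earlier in the article.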
The proposed method is evaluated on several benchmark datasets, and the results demonstrate its effectiveness in improving the accuracy of visual grounding compared to existing methods. The authors also provide a detailed analysis of the selected views, which reveals that they are more diverse and informative than those obtained using traditional methods.
In summary, the article presents a novel approach to visual grounding that leverages both CLIP and hierarchical prompts to improve the accuracy and diversity of the selected views. The proposed method has important implications for various applications, including robotics, autonomous driving, and human-computer interaction.