
Computer Science, Computer Vision and Pattern Recognition

Unifying Object Detection and Phrase Localization with Text-Region Similarity


Object detection is a core computer vision task: identifying and localizing objects within images or videos. Pre-trained models such as GLIP and Grounding DINO have shown promising results by framing detection as phrase localization, matching image regions against text descriptions so that a single model can generalize across detection tasks. However, these models still struggle with new categories or objects that are not present in their training data. To address this, the paper proposes a visual prompt pipeline that leverages both text and visual information to improve detection accuracy.
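The core mechanism named in the title, text-region similarity, is easy to see in miniature. Below is a minimal sketch of how a GLIP-style detector scores each detected region against each text phrase; this is not the paper's code, and the shapes, names, and cosine scoring are illustrative assumptions.

```python
# Minimal sketch of text-region similarity scoring, the mechanism that lets a
# detector unify object detection and phrase localization. All shapes and
# names here are illustrative assumptions, not the paper's implementation.
import numpy as np

def text_region_scores(region_feats: np.ndarray, text_embeds: np.ndarray) -> np.ndarray:
    """Score each detected region against each text phrase embedding.

    region_feats: (num_regions, dim) visual features from the detector.
    text_embeds:  (num_phrases, dim) embeddings of category names / phrases.
    Returns (num_regions, num_phrases) cosine-similarity logits.
    """
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    return r @ t.T  # each region is labeled by its most similar phrase

# Toy usage: 3 regions, 2 phrases (e.g. "dog", "cat"), 4-dim features.
rng = np.random.default_rng(0)
scores = text_region_scores(rng.normal(size=(3, 4)), rng.normal(size=(2, 4)))
print(scores.argmax(axis=1))  # predicted phrase index per region
```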
The proposed method consists of three main components: visual prompt construction, a similarity dictionary, and a training loop. Visual prompt construction generates embedding vectors that represent categories or objects within an image. The vectors are first sampled independently from a distribution; within-class vectors are then correlated with one another while between-class vectors are left uncorrelated, which lets the detector tell which category a vector belongs to early in optimization (a sketch follows below).
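To make that construction concrete, here is one hedged way to sample such vectors: prompts within a class share a common anchor component, inducing within-class correlation, while anchors for different classes are drawn independently. The Gaussian sampling, the mixing weight `rho`, and the function name are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch of visual prompt construction: within-class prompt vectors
# correlate through a shared per-class anchor; between-class vectors stay
# (approximately) uncorrelated because anchors are drawn independently.
import numpy as np

def build_visual_prompts(num_classes: int, prompts_per_class: int,
                         dim: int, rho: float = 0.8,
                         seed: int = 0) -> np.ndarray:
    """Return (num_classes, prompts_per_class, dim) unit-norm prompt vectors."""
    rng = np.random.default_rng(seed)
    # One shared anchor per class induces within-class correlation of ~rho^2.
    anchors = rng.normal(size=(num_classes, 1, dim))
    noise = rng.normal(size=(num_classes, prompts_per_class, dim))
    prompts = rho * anchors + (1.0 - rho**2) ** 0.5 * noise
    return prompts / np.linalg.norm(prompts, axis=-1, keepdims=True)

prompts = build_visual_prompts(num_classes=3, prompts_per_class=4, dim=256)
# Within-class similarities are high; between-class similarities near zero:
flat = prompts.reshape(-1, 256)
print(np.round(flat @ flat.T, 2))
```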
The similarity dictionary is derived from the noun phrases in the pre-training data via text-region similarity. During training, it supplies confusing text prompts as negative examples, which sharpens the discriminative representation of the visual prompts; a sketch of this mechanism appears below. The final component, the training loop, combines text and visual prompts to optimize detection accuracy.
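The sketch below illustrates how such a dictionary might be assembled and queried. The phrase list, the embedding source, and all helper names are hypothetical; the paper's actual dictionary is mined from pre-training data via text-region similarity rather than raw phrase-to-phrase similarity as done here.

```python
# Sketch of a similarity dictionary and its use for hard negative mining:
# phrases that embed close to one another are treated as mutually "confusing"
# and served as negative text prompts alongside the positive class.
import numpy as np

def build_similarity_dictionary(phrases: list[str], embeds: np.ndarray,
                                k: int = 2) -> dict[str, list[str]]:
    """Map each noun phrase to its k most similar (most confusable) phrases."""
    e = embeds / np.linalg.norm(embeds, axis=1, keepdims=True)
    sim = e @ e.T
    np.fill_diagonal(sim, -np.inf)  # a phrase is never its own negative
    return {p: [phrases[j] for j in np.argsort(-sim[i])[:k]]
            for i, p in enumerate(phrases)}

phrases = ["sedan", "pickup truck", "taxi", "zebra"]
rng = np.random.default_rng(1)
dictionary = build_similarity_dictionary(phrases, rng.normal(size=(4, 64)))

# In each training step, the positive phrase is paired with its confusing
# neighbors as negatives, pushing the visual prompts apart:
positive = "sedan"
negatives = dictionary[positive]
print(positive, "vs", negatives)
```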
The authors evaluate their approach on several benchmark datasets and show that the visual prompt pipeline outperforms traditional text-based prompts by a significant margin, achieving an average of 67.7 mAP (mean Average Precision), higher than the results obtained with text prompts, context prompts, or offset prompts.
In summary, the paper presents a novel approach to object detection that leverages both text and visual information to improve accuracy. By generating visual prompts that represent categories or objects within images and using a similarity dictionary to generate confusing negative examples, the proposed method demonstrates significant improvements over traditional text-based prompts. The authors’ approach has important implications for a wide range of applications, from autonomous driving to medical imaging, where accurate object detection is critical.