The article presents a novel approach to establishing an object state change (OSC) taxonomy, which is critical for robust OSC recognition and understanding in various applications. The proposed method uses large-scale how-to instructional videos as the training source, employs LLMs and VLMs for OSC mining and pseudo-labeling, and retains the 20 most frequent state transitions, each appearing more than 200 times. Each mined video is paired with an OSC category, supporting accurate identification of the object state changes it contains.
The methodology involves two stages: text mining and pseudo-label generation. In the text-mining stage, Llama 2 analyzes ASR transcripts to identify candidate videos and their associated OSC categories. Because how-to instructors describe state changes in free-form natural-language narration, this step captures a long tail of OSCs, including rare state-change terms.
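The text-mining stage can be sketched roughly as follows. This is a minimal, hypothetical illustration: the exact prompt given to Llama 2 and its output format are not specified in the article, so the prompt wording, the JSON schema, and the stubbed model reply below are all assumptions.

```python
import json

def build_osc_prompt(transcript):
    # Hypothetical prompt; the paper's actual instruction to Llama 2 is not given here.
    return (
        "From the following how-to narration, list every object state change as a "
        "JSON array of objects with keys 'object' and 'transition', e.g. "
        '[{"object": "onion", "transition": "whole -> chopped"}].\n\n'
        "Narration: " + transcript
    )

def parse_osc_response(response):
    """Parse the LLM's JSON reply and keep only well-formed OSC candidates."""
    candidates = json.loads(response)
    return [c for c in candidates if {"object", "transition"} <= set(c)]

# Stubbed reply standing in for a real Llama 2 call on one ASR transcript:
fake_reply = '[{"object": "butter", "transition": "solid -> melted"}]'
mined = parse_osc_response(fake_reply)
```

Aggregating such per-video candidates over the whole corpus, then keeping transitions above the frequency threshold, would yield the taxonomy described above.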
In the pseudo-label generation stage, a novel VLM-based pipeline forms three textual state descriptions for each OSC category, covering its initial, transitioning, and end states. The cross-modal similarity between every frame of a training video and the three descriptions is then computed, yielding a score matrix from which a pseudo label is assigned to each time point in the video.
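The frame-labeling step above can be illustrated with a small sketch. Assumptions are flagged: a real pipeline would embed frames and state descriptions with a VLM's image and text encoders, and the article's scoring may be more involved; here plain cosine similarity and a per-frame argmax stand in, with toy 2-D vectors in place of real embeddings.

```python
import math

def assign_state_pseudo_labels(frame_emb, state_emb):
    """frame_emb: per-frame embedding vectors; state_emb: three vectors for the
    initial / transitioning / end state descriptions. Returns one pseudo label
    per frame: 0 (initial), 1 (transitioning), or 2 (end)."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    labels = []
    for f in frame_emb:
        scores = [cos(f, s) for s in state_emb]  # one row of the score matrix
        labels.append(scores.index(max(scores)))  # highest-scoring state wins
    return labels

# Toy example: four frames drifting from the initial state toward the end state.
states = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
frames = [[0.9, 0.1], [0.6, 0.5], [0.5, 0.6], [0.1, 0.9]]
labels = assign_state_pseudo_labels(frames, states)  # -> [0, 1, 1, 2]
```

An argmax over the score matrix is the simplest possible decision rule; a fuller pipeline would likely also exploit the temporal order of the three states.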
The proposed approach offers several advantages: by leveraging LLMs and VLMs for OSC mining and pseudo-labeling rather than manual annotation, it scales up training and promotes broad generalization. Overall, the article provides an efficient and effective method for establishing a robust OSC taxonomy, which is valuable for applications such as robotics and computer vision.
Computer Science, Computer Vision and Pattern Recognition