This section describes the implementation of our model for the visual grounding task. We implement the model in PyTorch and train it on NVIDIA A800 GPUs, using a Vision Transformer as the image encoder. The text encoder is initialized from CLIP, while the remaining model weights are randomly initialized. During training, we use the AdamW optimizer with a weight decay of 5e-4, and we pre-train and fine-tune the model for 50 epochs. We apply a warm-up strategy both when pre-training on our MRES-32M dataset and when fine-tuning on each downstream grounding dataset. The initial learning rate is set to 1e-5 and follows a cosine decay schedule. Finally, whereas previous work has focused primarily on classic object-level referring expression segmentation (RES) methods and datasets, our work targets fine-grained part-level grounding.
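The learning-rate behaviour described above can be illustrated with a minimal sketch. Only the 1e-5 learning rate, the 50 epochs, the use of a warm-up, and the cosine decay come from the text. The function name `lr_at_epoch`, the 5-epoch linear warm-up, the per-epoch granularity, the decay towards zero, and the reading of 1e-5 as the rate reached at the end of warm-up are assumptions made purely for illustration.

```python
import math

def lr_at_epoch(epoch, total_epochs=50, base_lr=1e-5, warmup_epochs=5):
    """Learning rate for a 0-indexed epoch: linear warm-up, then cosine decay.

    warmup_epochs=5 and the linear ramp are illustrative assumptions;
    the source does not specify how the warm-up is configured.
    """
    if epoch < warmup_epochs:
        # Linear ramp from base_lr / warmup_epochs up to base_lr.
        return base_lr * (epoch + 1) / warmup_epochs
    # Fraction of the post-warm-up phase completed, in [0, 1).
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    # Cosine decay from base_lr towards zero.
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# One learning rate per epoch for a 50-epoch run.
schedule = [lr_at_epoch(e) for e in range(50)]
```

Under these assumptions the rate climbs for the first five epochs, reaches 1e-5, and then falls smoothly for the remaining epochs. In an actual PyTorch run, a schedule of this shape would typically be attached to the optimizer through a learning-rate scheduler rather than computed by hand.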
To make this concrete, imagine you are building a house. Just as architects draw blueprints for a house, researchers in this field design models for visual grounding tasks. PyTorch is the toolkit used to build the house, and the NVIDIA A800 GPUs supply the power to build it quickly and efficiently. The Vision Transformer, which serves as the image encoder, processes the visual information much as a contractor handles the physical structure of the house. The text encoder, whose weights are initialized from CLIP, handles the language side, much as an interior designer works from the written details of the house.
During training, think of the AdamW optimizer as the builder who makes the small adjustments that let the model learn properly. The weight decay of 5e-4 acts like a building code that keeps any single part from being overbuilt: it gently shrinks the weights at every step, which discourages overfitting. Pre-training on MRES-32M is like laying the foundation for the house, while fine-tuning on a specific downstream grounding dataset is like customizing the design to fit the needs of the occupants. The learning rate is like the pace at which the builder works: the warm-up has the builder start carefully before reaching full speed, and the cosine decay schedule has the builder gradually slow down as construction nears completion.
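To show what the weight decay does inside the optimizer, here is a minimal sketch of a single AdamW update on one scalar weight, written in plain Python. Only the weight decay of 5e-4 and the learning rate of 1e-5 come from the text. The function name `adamw_step` and the values of `beta1`, `beta2`, and `eps` are assumptions (they are common defaults, not settings reported by the authors), and a real training run would use the optimizer provided by PyTorch rather than this hand-written step.

```python
import math

def adamw_step(w, grad, m, v, t, lr=1e-5, weight_decay=5e-4,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdamW update for a single scalar weight; returns (w, m, v).

    beta1, beta2 and eps are assumed defaults, not values from the source.
    """
    # Decoupled weight decay: shrink the weight directly, independent of the gradient.
    w = w * (1.0 - lr * weight_decay)
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction for the zero-initialised averages (t starts at 1).
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # Adam step: move against the gradient, scaled by its running magnitude.
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# With a zero gradient, only the decay term acts on the weight.
w_decayed, _, _ = adamw_step(w=1.0, grad=0.0, m=0.0, v=0.0, t=1)
# With a positive gradient, the weight is decayed and also moved against the gradient.
w_updated, _, _ = adamw_step(w=1.0, grad=0.5, m=0.0, v=0.0, t=1)
```

The first call isolates the decay: the weight shrinks by a factor of `1 - lr * weight_decay` even though the gradient is zero. The second call shows that the gradient-driven step is applied on top of that shrinkage.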
Lastly, previous work has focused primarily on classic object-level RES methods and datasets, whereas our work targets fine-grained part-level grounding. In the analogy, this is like adding personalized touches to the interior design so that the house is more comfortable and functional for its occupants.
Computer Science, Computer Vision and Pattern Recognition