In this article, we provide a detailed account of the implementation of several techniques for adapting zero-shot image classification models: zero-shot inference (ZS), fine-tuning (FT), weight-space ensembling of the zero-shot and fine-tuned models (WiSE-FT), and Finetune Like You Pretrain (FLYP). We follow the standard protocols of previous works, using the default configurations, hyperparameters, and prompt templates proposed in those studies.
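As a concrete illustration of the zero-shot baseline, the sketch below builds a prompt-template classifier from class names. It assumes a CLIP-like model exposing `encode_image`/`encode_text` and a matching tokenizer; the templates shown are generic placeholders rather than the exact templates used in the experiments.

```python
import torch
import torch.nn.functional as F

# Placeholder prompt templates; the actual templates follow prior work.
TEMPLATES = ["a photo of a {}.", "a blurry photo of a {}.", "a sketch of a {}."]

def build_zero_shot_classifier(model, tokenizer, class_names, device="cuda"):
    """Average the text embeddings of all templates for each class."""
    weights = []
    with torch.no_grad():
        for name in class_names:
            prompts = [t.format(name) for t in TEMPLATES]
            tokens = tokenizer(prompts).to(device)            # assumed tokenizer interface
            emb = F.normalize(model.encode_text(tokens), dim=-1)
            weights.append(F.normalize(emb.mean(dim=0), dim=-1))
    return torch.stack(weights, dim=1)                        # (embed_dim, num_classes)

def zero_shot_predict(model, classifier, images):
    """Classify by cosine similarity between image features and class text embeddings."""
    with torch.no_grad():
        feats = F.normalize(model.encode_image(images), dim=-1)
        logits = 100.0 * feats @ classifier                   # temperature-scaled similarities
    return logits.argmax(dim=-1)
```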
ZS requires no training: predictions are obtained directly from the pretrained model using the standard prompt templates. For FT, we use AdamW with a learning rate of 1e-5 (3e-5 for CaRot) and train for about 10 epochs, roughly 25,000 iterations, with a batch size of 512. The ensemble coefficient of WiSE-FT is set to 0.5, which has been shown to work well in previous studies.
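The weight-space ensembling step of WiSE-FT and the optimizer setup can be summarized as follows. This is a minimal sketch: the 0.5 ensemble coefficient and the learning rates come from the text above, while the weight-decay value and the helper names are assumptions for illustration.

```python
import copy
import torch

def wise_ft_interpolate(zero_shot_model, finetuned_model, alpha=0.5):
    """WiSE-FT: linearly interpolate zero-shot and fine-tuned weights.

    alpha=0.5 is the ensemble coefficient used here, following prior work.
    """
    merged = copy.deepcopy(finetuned_model)
    zs_state = zero_shot_model.state_dict()
    ft_state = finetuned_model.state_dict()
    merged.load_state_dict(
        {k: (1.0 - alpha) * zs_state[k] + alpha * ft_state[k] for k in ft_state}
    )
    return merged

def make_optimizer(model, method="FT"):
    """AdamW with the learning rates reported in the text (3e-5 for CaRot, 1e-5 otherwise)."""
    lr = 3e-5 if method == "CaRot" else 1e-5
    # The weight-decay value is an assumption; it is not specified in the text.
    return torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.1)
```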
For FLYP, we likewise apply weight-space ensembling with a coefficient of 0.5 and train for about 10 epochs with a batch size of 64. For the label-smoothing regularization, the regularization term is the mean squared error between the predicted probabilities and the smoothed target labels.
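The regularization term can be written as below, assuming the smoothed targets mix the one-hot label with a uniform distribution; the smoothing factor `eps` and the weight `lam` are illustrative values not specified in the text.

```python
import torch
import torch.nn.functional as F

def smoothed_targets(labels, num_classes, eps=0.1):
    """Label-smoothed targets: (1 - eps) on the true class, eps spread uniformly."""
    one_hot = F.one_hot(labels, num_classes).float()
    return (1.0 - eps) * one_hot + eps / num_classes

def mse_regularizer(logits, labels, eps=0.1):
    """MSE between predicted probabilities and the smoothed labels."""
    probs = F.softmax(logits, dim=-1)
    targets = smoothed_targets(labels, logits.size(-1), eps)
    return F.mse_loss(probs, targets)

# Usage sketch: total_loss = F.cross_entropy(logits, labels) + lam * mse_regularizer(logits, labels)
# where lam is a hypothetical regularization weight not given in the text.
```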
These implementation details give researchers what they need to replicate our experiments and to build on ZS, FT, WiSE-FT, and FLYP. By following the standard protocols and reusing the default configurations, hyperparameters, and prompt templates from prior work, we keep the setup consistent across studies and make results directly comparable.