Bridging the gap between complex scientific research and the curious minds eager to explore it.

Computer Science, Computer Vision and Pattern Recognition

Conditional Caption Generation: Improving Image Descriptions with Text Instructions

Human activity recognition (HAR) aims to identify and classify the actions that people perform in videos or images. Existing approaches to HAR rely heavily on large-scale video datasets, which are challenging to collect and maintain. To address this issue, researchers have begun exploring large language models (LLMs) as a more versatile and efficient way to understand human activities.
In this article, the authors conduct ablation studies on the number of in-context examples included in the prompt for HAR. They vary the number of exemplars and measure the edit distance between the generated and target action sequences on the Ego4D v2 validation set. The results show that including more exemplars in the prompt improves performance, but there is a trade-off between quantity and quality.
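
To make the evaluation metric concrete, here is a minimal sketch of how an edit distance over action sequences might be computed, treating each action as a single token. This is a standard Levenshtein implementation shown for illustration, not the paper's evaluation code, and the example sequences are hypothetical.

```python
def edit_distance(pred, target):
    """Levenshtein distance between two action sequences,
    treating each action string as a single token."""
    m, n = len(pred), len(target)
    # dp[i][j] = minimum edits to turn pred[:i] into target[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == target[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

# Hypothetical example: the prediction misses one action.
pred = ["open drawer", "take knife", "cut onion"]
target = ["open drawer", "take knife", "wash knife", "cut onion"]
print(edit_distance(pred, target))  # -> 1
```

The example returns 1 because the predicted sequence is missing a single action; scoring at the level of whole actions rewards getting the sequence mostly right even when individual steps are missed.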
The authors also compare HAR performance when the prompt contains different content types, such as action sequences, captions, and noun lists. They find that combining these content types yields the best results.
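
The paper's exact prompt template is not reproduced in this summary, but a hypothetical sketch can illustrate how exemplars combining several content types (captions, noun lists, and action sequences) might be assembled into a few-shot prompt. All field names and the layout below are assumptions for illustration only.

```python
def build_prompt(exemplars, query, k=4):
    """Assemble a few-shot HAR prompt from k exemplars.
    Which keys are included controls the 'content type'
    being ablated. Field names here are illustrative,
    not taken from the paper."""
    parts = []
    for ex in exemplars[:k]:
        parts.append(
            f"Captions: {' '.join(ex['captions'])}\n"
            f"Nouns: {', '.join(ex['nouns'])}\n"
            f"Actions: {' -> '.join(ex['actions'])}\n"
        )
    # The query supplies context only; the model completes the actions.
    parts.append(
        f"Captions: {' '.join(query['captions'])}\n"
        f"Nouns: {', '.join(query['nouns'])}\n"
        f"Actions:"
    )
    return "\n".join(parts)

exemplar = {
    "captions": ["A person chops vegetables on a cutting board."],
    "nouns": ["knife", "onion", "cutting board"],
    "actions": ["take knife", "cut onion"],
}
query = {
    "captions": ["A person stands at the sink holding a plate."],
    "nouns": ["plate", "sponge", "sink"],
}
print(build_prompt([exemplar], query, k=1))
```

Dropping one of the fields from this template (for example, omitting the noun list) corresponds to the kind of content-type ablation the authors describe.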
Overall, this article highlights how much prompt design matters when building LLM-based HAR models. By optimizing both the number of exemplars and the content included in the prompt, researchers can improve the accuracy and efficiency of HAR systems, paving the way for wider adoption across applications.
Analogy: Imagine trying to write a recipe book without enough tested example dishes. You might end up with a collection of untried and potentially disastrous recipes! Similarly, HAR models need a sufficient number of exemplars in the prompt to recognize human actions accurately.