Bridging the gap between complex scientific research and the curious minds eager to explore it.

Computer Science, Machine Learning

Text-Driven Human Video Generation: A Survey of Recent Approaches

Reinforcement learning (RL) is a popular approach to training agents to perform tasks, such as making a good cup of coffee or playing a game. RL algorithms rely on a reward function to guide the agent’s actions and maximize the reward over time. However, designing a mathematically expressible reward function can be challenging for most real-world applications.
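To make the reward function's role concrete, here is a minimal sketch (illustrative only, not taken from the work discussed) of tabular Q-learning on a toy chain environment. The reward_fn below is trivial to write for this toy problem, which is exactly what becomes hard for real-world tasks like coffee-making.

```python
# Minimal sketch (illustrative, not from the work discussed): tabular
# Q-learning on a toy chain of states 0..4. The hand-written reward_fn is
# the piece that is easy here but hard to express for real-world tasks.
import random

N_STATES, N_ACTIONS = 5, 2      # actions: 0 = move left, 1 = move right
GOAL = N_STATES - 1

def reward_fn(next_state):
    # The designer must express "success" numerically.
    return 1.0 if next_state == GOAL else 0.0

def step(state, action):
    next_state = min(state + 1, GOAL) if action == 1 else max(state - 1, 0)
    return next_state, reward_fn(next_state)

Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
alpha, gamma, epsilon = 0.1, 0.9, 0.1

for episode in range(500):
    s = 0
    while s != GOAL:
        # Epsilon-greedy action selection.
        if random.random() < epsilon:
            a = random.randrange(N_ACTIONS)
        else:
            a = max(range(N_ACTIONS), key=lambda i: Q[s][i])
        s_next, r = step(s, a)
        # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next

print("Greedy action per state:",
      [max(range(N_ACTIONS), key=lambda i: Q[s][i]) for s in range(N_STATES)])
```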
One solution is to use pre-trained vision-language models like CLIP, which have been trained on large datasets of paired text and images: the similarity between a textual description of the task and the agent's visual observation can then stand in as a reward signal. These models can also be adapted to a specific task by fine-tuning the vision pipeline on egocentric datasets or manually collected text-video pairs. However, such similarity scores have limitations. For example, they may fail to capture concepts defined relationally between two objects, such as whether a hat is actually being worn by a person in an image or merely appears next to them.
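As an illustration of how a pre-trained model can stand in for a hand-written reward, the sketch below scores an observation against a text description of the goal using CLIP's image-text similarity. It assumes the Hugging Face transformers implementation of CLIP; the model checkpoint, the task description, and the decision to use raw cosine similarity as the reward are assumptions for the example, not the exact recipe of the work discussed.

```python
# Sketch (illustrative assumptions, not the exact recipe from the work
# discussed): using CLIP image-text similarity as a reward signal.
# Assumes the Hugging Face `transformers` CLIP implementation.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def clip_reward(frame: Image.Image, task_description: str) -> float:
    """Score how well the current observation matches the text goal."""
    inputs = processor(text=[task_description], images=frame,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = model.get_text_features(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"])
        image_emb = model.get_image_features(
            pixel_values=inputs["pixel_values"])
    # Cosine similarity between the observation and the task description
    # serves as a dense reward in place of a hand-written reward function.
    sim = torch.nn.functional.cosine_similarity(text_emb, image_emb)
    return sim.item()

# Example usage (hypothetical file and goal text):
# frame = Image.open("observation.png")
# r = clip_reward(frame, "a freshly brewed cup of coffee on the counter")
```

As noted above, a score like this can behave like a bag-of-words match and miss relational details, such as whether the hat is actually being worn rather than simply present in the frame.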
Recent advances have focused on applying RL to such interactive tasks, whether making a good cup of coffee or playing a game: the aim is to learn a policy that outputs actions for engaging with the environment in a manner that maximizes the reward. In practice, these policies are often trained against heuristically designed reward functions built from domain expertise and then optimized with standard RL algorithms.
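For the policy-learning side, the following sketch shows a REINFORCE-style policy gradient update driven by a heuristically shaped reward. It is a one-step, bandit-style simplification; the environment, reward terms, and network sizes are invented for illustration.

```python
# Sketch (illustrative assumptions throughout): a REINFORCE-style policy
# gradient update against a heuristically shaped reward.
import torch
import torch.nn as nn

class Policy(nn.Module):
    def __init__(self, obs_dim=4, n_actions=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(),
                                 nn.Linear(32, n_actions))

    def forward(self, obs):
        # Return a categorical distribution over actions.
        return torch.distributions.Categorical(logits=self.net(obs))

def heuristic_reward(obs, action):
    # Domain-expert shaping: penalize distance to a target and "wasteful"
    # actions. Terms like these are exactly what an agent can learn to game.
    distance_penalty = -obs[..., 0].abs()
    action_cost = -0.01 * action.float()
    return distance_penalty + action_cost

policy = Policy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

for update in range(100):
    # Fake a batch of observations instead of rolling out a real simulator
    # (one-step, bandit-style simplification for brevity).
    obs = torch.randn(64, 4)
    dist = policy(obs)
    actions = dist.sample()
    rewards = heuristic_reward(obs, actions)
    # REINFORCE: raise the log-probability of actions in proportion to reward.
    loss = -(dist.log_prob(actions) * rewards).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```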
One possible pitfall of these similarity-based frameworks is that they may behave like a "bag-of-words" model that cannot capture concepts defined relationally between two objects. To address this, researchers have proposed retraining the vision pipeline from scratch on egocentric datasets or on manually collected text-video pairs for the task of interest.
In summary, recent advances in reinforcement learning have focused on learning policies that output actions for engaging with interactive environments so as to maximize reward. Pre-trained models like CLIP can supply a reward signal where a hand-written one is hard to express, but they have limitations, and heuristically designed reward functions based on domain expertise can often be hacked by the agent even in fairly simple scenarios. Retraining the vision pipeline from scratch on task-relevant data is one proposed way to address these pitfalls.