In this article, we explore personalized motion synthesis for talking heads using a lightweight architecture based on diffusion models. The proposed method leverages person-specific fine-tuning to achieve high-quality, style-faithful results, and an ablation over audio noise shows that it outperforms existing methods in robustness and quality, particularly at medium and low audio noise levels.
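The summary above does not spell out the network itself, so the following is only a minimal, hypothetical sketch of what an audio-conditioned diffusion denoiser for motion coefficients could look like. All names (`AudioConditionedDenoiser`, `diffusion_training_step`), dimensions, and the epsilon-prediction objective are illustrative assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioConditionedDenoiser(nn.Module):
    """Hypothetical lightweight denoiser: predicts the noise added to a window
    of facial-motion coefficients, conditioned on audio features and the
    (normalized) diffusion timestep."""
    def __init__(self, motion_dim=64, audio_dim=128, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(motion_dim + audio_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, motion_dim),
        )

    def forward(self, noisy_motion, audio_feat, t_norm):
        # Concatenate motion, audio conditioning, and timestep into one vector.
        return self.net(torch.cat([noisy_motion, audio_feat,
                                   t_norm.unsqueeze(-1)], dim=-1))

def diffusion_training_step(model, motion, audio_feat, alphas_cumprod):
    """One DDPM-style training step: corrupt clean motion with noise at a
    random timestep and train the model to predict that noise."""
    num_steps = alphas_cumprod.shape[0]
    t = torch.randint(0, num_steps, (motion.shape[0],), device=motion.device)
    a_bar = alphas_cumprod[t].unsqueeze(-1)
    eps = torch.randn_like(motion)
    noisy = a_bar.sqrt() * motion + (1.0 - a_bar).sqrt() * eps
    pred = model(noisy, audio_feat, t.float() / num_steps)
    return F.mse_loss(pred, eps)
```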
To understand personalized motion synthesis, imagine a movie scene in which an actor changes speaking style mid-sentence. Our method can produce such a transition seamlessly by adapting the generated motion to the speaking style of the person in the video. The architecture is deliberately lightweight: it trains on a single high-performance GPU within 30 hours, which makes person-specific fine-tuning, and the accompanying audio noise ablation, practical.
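As a rough illustration of what person-specific fine-tuning could look like in practice, the sketch below continues training the generic denoiser on one subject's short clip at a small learning rate. It reuses the hypothetical `diffusion_training_step` from the sketch above; the data loader, noise schedule, step count, and learning rate are placeholders rather than the paper's settings.

```python
import torch
from itertools import cycle, islice

def finetune_on_person(model, person_loader, steps=2000, lr=1e-5, device="cpu"):
    """Hypothetical person-specific fine-tuning loop: keep training the
    pretrained denoiser on a short clip (roughly 30-100 s) of one subject's
    paired motion and audio features, at a small learning rate."""
    betas = torch.linspace(1e-4, 0.02, 1000)          # standard DDPM schedule
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0).to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.to(device).train()
    for motion, audio_feat in islice(cycle(person_loader), steps):
        loss = diffusion_training_step(model, motion.to(device),
                                       audio_feat.to(device), alphas_cumprod)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```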
We evaluate our method in several experiments covering a range of scenarios. For instance, we compare our lightweight architecture against attention-based conditioning mechanisms and the Faceformer transformer backbone, and find that it outperforms these more complex alternatives. We also show that 30 seconds of video are sufficient for fine-tuning, while 100 seconds further improve all scores. A sketch contrasting the two styles of conditioning follows below.
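To make the comparison concrete, the sketch below contrasts a simple concatenation-style conditioning with a multi-head cross-attention alternative. Both modules are illustrative assumptions for exposition only; they are not the paper's actual lightweight backbone or its attention-based baselines.

```python
import torch
import torch.nn as nn

class ConcatConditioning(nn.Module):
    """Sketch of lightweight conditioning: per-frame audio features are simply
    concatenated to the (noisy) motion features before a small MLP."""
    def __init__(self, motion_dim=64, audio_dim=128, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(motion_dim + audio_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, motion_dim))

    def forward(self, motion_tokens, audio_tokens):
        return self.mlp(torch.cat([motion_tokens, audio_tokens], dim=-1))

class CrossAttentionConditioning(nn.Module):
    """Sketch of a heavier, attention-based alternative: motion tokens attend
    to audio tokens via multi-head cross-attention."""
    def __init__(self, motion_dim=64, audio_dim=128, heads=4):
        super().__init__()
        self.q = nn.Linear(motion_dim, motion_dim)
        self.kv = nn.Linear(audio_dim, 2 * motion_dim)
        self.attn = nn.MultiheadAttention(motion_dim, heads, batch_first=True)

    def forward(self, motion_tokens, audio_tokens):
        k, v = self.kv(audio_tokens).chunk(2, dim=-1)
        out, _ = self.attn(self.q(motion_tokens), k, v)
        return out
```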
Our findings highlight the importance of personalization in motion synthesis: it greatly enhances the realism and naturalness of the generated videos. The audio noise ablation further shows that the method remains robust across a range of audio conditions.
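One simple way such a robustness study could construct its medium- and low-noise audio conditions is to mix white Gaussian noise into the input at a target signal-to-noise ratio, as sketched below. The function and the specific SNR values are assumptions for illustration, not the paper's exact evaluation protocol.

```python
import torch

def add_noise_at_snr(audio, snr_db):
    """Mix white Gaussian noise into a waveform at a target SNR (in dB)."""
    signal_power = audio.pow(2).mean()
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = torch.randn_like(audio) * noise_power.sqrt()
    return audio + noise

# Example: evaluate the same clip under decreasing audio quality.
clean = torch.randn(16000)            # placeholder 1-second waveform at 16 kHz
for snr in (20.0, 10.0, 0.0):         # clean-ish, medium, and low quality
    noisy = add_noise_at_snr(clean, snr)
```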
In summary, this article presents a lightweight diffusion-based architecture for personalized motion synthesis that combines person-specific fine-tuning with an audio noise ablation to deliver high-quality results. The approach outperforms existing methods in robustness and quality, making it an attractive choice for applications where realism and naturalness are paramount.