In this article, the authors propose OpenVoice, a novel approach to voice cloning that decouples speech synthesis into two stages: base voice generation with controllable styles, followed by tone color (timbre) conversion. The goal is to generate speech in a reference speaker's voice with controllable emotions, accents, and styles, without sacrificing naturalness or quality.
The authors explain that most prior voice-cloning methods rely on a single neural network to reproduce both the tone color (the speaker's timbre) and the styles (emotion, accent, rhythm, and so on) of the reference speaker. Coupling these requirements in one model can lead to suboptimal results, because the model must learn to satisfy both at once from the same reference audio.
To address this challenge, OpenVoice introduces a decoupled architecture that separates the two concerns. In the first stage, a base speaker TTS model generates a base voice whose style parameters (emotion, accent, rhythm, pauses, intonation) can be controlled directly. In the second stage, a flow-based tone color converter swaps the base voice's timbre for that of the reference speaker, extracted from a short audio clip, while leaving the styles intact.
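To make the two stages concrete, here is a minimal, self-contained sketch of the pipeline. Every function in it (base_speaker_tts, extract_tone_color, convert_tone_color) is a hypothetical stand-in written for illustration, not OpenVoice's actual API; the placeholder bodies only mirror the shape of the data flow.

```python
import numpy as np

SAMPLE_RATE = 22050

def base_speaker_tts(text: str, style: dict) -> np.ndarray:
    """Stage 1 stand-in: synthesize a base voice with the requested
    style (emotion, accent, speed). Returns placeholder audio here."""
    return np.zeros(SAMPLE_RATE, dtype=np.float32)  # 1 s of silence

def extract_tone_color(audio: np.ndarray) -> np.ndarray:
    """Stand-in for the tone color (timbre) embedding extractor."""
    return np.random.default_rng(0).standard_normal(256).astype(np.float32)

def convert_tone_color(base_audio: np.ndarray,
                       source_emb: np.ndarray,
                       target_emb: np.ndarray) -> np.ndarray:
    """Stage 2 stand-in: a flow-based converter would strip the source
    timbre and apply the target timbre, preserving style and content."""
    return base_audio  # placeholder: the real model changes timbre only

# Usage: clone the timbre of a short reference clip onto styled speech.
reference_clip = np.zeros(3 * SAMPLE_RATE, dtype=np.float32)  # ~3 s clip
base = base_speaker_tts("Hello there!", style={"emotion": "cheerful", "speed": 1.0})
cloned = convert_tone_color(base,
                            extract_tone_color(base),
                            extract_tone_color(reference_clip))
```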
The intuition behind OpenVoice is to think of speech synthesis as a game of "Telephone," where each participant adds their own twist to the message. Here, the base speaker (stage 1) passes along the message with a chosen style and emotional delivery, and the flow layers (stage 2) re-voice it in the reference speaker's tone color while preserving that delivery.
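Because the flow layers are invertible, the converter can run the flow backwards to peel the source timbre off, then forwards to put the target timbre on. The toy affine coupling step below illustrates this inverse-then-forward trick on random vectors; it is a conceptual illustration of the flow-based idea, not the authors' actual converter.

```python
import numpy as np

rng = np.random.default_rng(0)
# Fixed random weights standing in for a trained conditioning network.
W_scale = 0.1 * rng.standard_normal((8, 4))
W_shift = rng.standard_normal((8, 4))

def coupling_forward(x, cond):
    """One affine coupling step: half of x is scaled/shifted by functions
    of the other half and a conditioning (speaker) embedding."""
    a, b = x[:4], x[4:]
    h = np.concatenate([a, cond])
    log_s, t = h @ W_scale, h @ W_shift
    return np.concatenate([a, b * np.exp(log_s) + t])

def coupling_inverse(y, cond):
    """Exact inverse of coupling_forward under the same conditioning."""
    a, b = y[:4], y[4:]
    h = np.concatenate([a, cond])
    log_s, t = h @ W_scale, h @ W_shift
    return np.concatenate([a, (b - t) * np.exp(-log_s)])

x = rng.standard_normal(8)            # base-voice features (toy)
src_emb = rng.standard_normal(4)      # source speaker embedding
tgt_emb = rng.standard_normal(4)      # target speaker embedding

z = coupling_inverse(x, src_emb)      # remove the source tone color
y = coupling_forward(z, tgt_emb)      # apply the target tone color

# Invertibility check: the same embedding round-trips to the input.
assert np.allclose(coupling_forward(coupling_inverse(x, src_emb), src_emb), x)
```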
The authors present several examples of speech generated with different emotions, accents, and styles, and show that the decoupled design controls tone color and style effectively without degrading naturalness or quality.
To train the model, the authors combine supervised objectives with adversarial training, with the base speaker TTS model and the tone color converter trained separately. They also describe the internal version of OpenVoice used in production and make the source code of the public version available.
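For readers unfamiliar with adversarial training, the sketch below shows one generic generator/discriminator update step in PyTorch. It is a textbook GAN step on toy tensors, not the authors' training recipe; real TTS systems pair this with reconstruction and other supervised losses.

```python
import torch
from torch import nn

# Toy generator (stands in for the synthesizer) and discriminator.
gen = nn.Linear(16, 64)
disc = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))
g_opt = torch.optim.Adam(gen.parameters(), lr=1e-4)
d_opt = torch.optim.Adam(disc.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(8, 64)    # placeholder "real" audio features
noise = torch.randn(8, 16)   # placeholder generator input

# Discriminator step: score real features high, generated ones low.
fake = gen(noise).detach()
d_loss = bce(disc(real), torch.ones(8, 1)) + bce(disc(fake), torch.zeros(8, 1))
d_opt.zero_grad()
d_loss.backward()
d_opt.step()

# Generator step: update the generator to fool the discriminator.
g_loss = bce(disc(gen(noise)), torch.ones(8, 1))
g_opt.zero_grad()
g_loss.backward()
g_opt.step()
```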
In summary, OpenVoice is a novel approach to speech synthesis that decouples base voice generation from tone color conversion, enabling speech with controllable emotions, accents, and styles without sacrificing naturalness or quality. By pairing a base speaker TTS model in stage 1 with a flow-based tone color converter in stage 2, the authors achieve independent control over both the style and the timbre of the synthesized speech.