Bridging the gap between complex scientific research and the curious minds eager to explore it.

Computer Science, Sound

Voice Conversion Techniques: An Overview

In this paper, we propose a new approach to non-autoregressive text-to-speech (TTS) that leverages explicit duration modeling for low-resource, highly expressive speech. Traditional TTS systems rely on autoregressive models that predict the next audio sample given the previous ones, but these can be computationally expensive and difficult to train. Our proposed approach uses a parallel WaveNet to generate the audio spectrogram directly, which allows for faster and more efficient processing. We demonstrate the effectiveness of our method on Amazon’s internal high-quality dataset, achieving state-of-the-art results in terms of perceived naturalness and intelligibility.
Background and Related Work

TTS systems have been around for decades, with early work focusing on rule-based synthesis (RBS) [1]. However, the rise of deep learning has led to a shift towards non-autoregressive TTS, which generates audio samples in parallel rather than sequentially, yielding much faster inference. However, these models often suffer from over-smoothing, where the generated speech lacks prosody and naturalness [2].
To address this issue, some researchers have proposed explicit duration modeling techniques, such as those based on probabilistic context-free grammars (PCFGs) [3]. However, these approaches can be computationally expensive and difficult to train. Alternatively, some works have used parallel WaveNets [4], which can generate audio spectrograms in parallel, but these models suffer from overfitting and require careful hyperparameter tuning.
Our proposed approach combines the advantages of non-autoregressive TTS with the benefits of explicit duration modeling, resulting in a faster and more efficient method that produces high-quality speech.
Methodology

Our proposed method consists of two main components: a parallel WaveNet and an explicit duration model. The parallel WaveNet generates the audio spectrogram, while the explicit duration model incorporates information about the duration of each speech segment.
The parallel WaveNet is based on a multi-layer perceptron (MLP) with a parallel architecture, which enables faster inference [5]. The network takes a sequence of phonemes as input and outputs a probability distribution over possible audio spectrograms. To ensure that the generated speech is natural and intelligible, we combine acoustic features with linguistic information, such as pitch and duration.
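To make the parallel idea concrete, here is a minimal toy sketch of a non-autoregressive phoneme-to-spectrogram mapping. The phoneme inventory, layer sizes, and random weights are all illustrative assumptions, not the paper's actual architecture; the point is only that every output frame is computed independently, in one pass, rather than frame by frame.

```python
import numpy as np

# Toy inventory and dimensions -- illustrative assumptions only.
PHONEMES = ["sil", "AH", "B", "K", "T"]
EMBED_DIM, HIDDEN_DIM, N_MELS = 8, 16, 10

rng = np.random.default_rng(0)
embed = rng.normal(size=(len(PHONEMES), EMBED_DIM))
w1 = rng.normal(size=(EMBED_DIM, HIDDEN_DIM))
w2 = rng.normal(size=(HIDDEN_DIM, N_MELS))

def phonemes_to_spectrogram(phoneme_ids):
    """Map a phoneme sequence to mel frames in one parallel pass.

    Each position is processed independently, so the whole utterance
    is produced at once instead of sample by sample.
    """
    x = embed[np.asarray(phoneme_ids)]   # (T, EMBED_DIM) embeddings
    h = np.maximum(0.0, x @ w1)          # ReLU hidden layer
    return h @ w2                        # (T, N_MELS) mel frames

frames = phonemes_to_spectrogram([1, 2, 1, 0])
print(frames.shape)  # (4, 10): one mel frame per input phoneme
```

A real system would predict several frames per phoneme (see the duration model below) and train the weights on paired text/audio data; the parallelism over positions is the only property this sketch is meant to show.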
The explicit duration model is based on a probabilistic context-free grammar (PCFG) [6], which lets us model the duration of each speech segment as a set of probabilities. This model is trained separately from the WaveNet and is used to compute a probability distribution over the possible durations of each phoneme.
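The role of explicit duration modeling can be sketched as follows: each phoneme carries a distribution over frame counts, and the predicted duration is used to expand the phoneme sequence before spectrogram generation. The probability tables here are made-up numbers for illustration, not trained values, and this sketch uses simple per-phoneme distributions rather than a full PCFG.

```python
# P(duration in frames | phoneme), estimated separately from the
# acoustic model. Values below are invented for illustration.
DURATION_PROBS = {
    "sil": {10: 1.0},
    "AH":  {3: 0.2, 5: 0.5, 8: 0.3},
    "B":   {2: 0.6, 3: 0.4},
}

def expected_duration(phoneme):
    """Expected number of frames for a phoneme under its distribution."""
    dist = DURATION_PROBS[phoneme]
    return sum(d * p for d, p in dist.items())

def expand(phonemes):
    """Length-regulate: repeat each phoneme for its most likely duration."""
    out = []
    for ph in phonemes:
        best = max(DURATION_PROBS[ph], key=DURATION_PROBS[ph].get)
        out.extend([ph] * best)
    return out

print(round(expected_duration("AH"), 6))  # 5.5 frames on average
print(expand(["B", "AH"]))                # ['B', 'B', 'AH', 'AH', 'AH', 'AH', 'AH']
```

The expanded sequence is what the parallel generator would consume, so the duration model controls rhythm and timing while the acoustic model controls spectral content.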
Results and Discussion

We evaluate our proposed method on Amazon’s internal high-quality dataset, which consists of 118 gender-balanced speakers and approximately 91k utterances with an average recording length of 3.9s. We compare against traditional non-autoregressive TTS systems and explicit duration modeling techniques. Our proposed method achieves state-of-the-art results in terms of perceived naturalness and intelligibility, while also being faster and more efficient than previous approaches.
Conclusion

In this paper, we proposed a new approach to non-autoregressive TTS that leverages explicit duration modeling for low-resource, highly expressive speech. Our method combines the advantages of parallel WaveNets with the benefits of explicit duration modeling, yielding a faster and more efficient system that produces high-quality speech. We demonstrated the effectiveness of our approach on Amazon’s internal high-quality dataset, achieving state-of-the-art results in terms of perceived naturalness and intelligibility.