Computer Science, Sound

High-Individuality Voice Conversion Based on Concatenative Speech Synthesis

Voice conversion has been gaining traction in recent years as a way to create personalized and customizable voice experiences. However, most existing methods suffer from limited individuality and naturalness. To address this, the paper proposes NeuCoSVC, a neural approach that builds the converted voice by matching features from the source audio against features drawn from reference utterances.

SSL Feature Extraction and Matching

The proposed method begins by extracting fixed-dimensional features from the audio input using a pre-trained self-supervised learning (SSL) model. These features capture both linguistic and timbre information, so the matching step can select semantically related SSL features from the reference utterances. Matching is performed with the k-nearest neighbors (kNN) method: each source feature is replaced by similar reference features, so the output keeps the source content while changing only the speaker characteristics.
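The matching step can be illustrated with a minimal sketch. The summary does not specify the distance measure, the value of k, or how the selected neighbors are combined, so the cosine similarity, `k=4`, and neighbor averaging used here are assumptions, as are the function name `knn_match` and the random arrays standing in for real SSL features.

```python
import numpy as np

def knn_match(source_feats, reference_feats, k=4):
    """Replace each source frame with the mean of its k nearest reference frames.

    Assumptions (not taken from the paper): cosine similarity, k=4, mean pooling.
    """
    # Normalize rows so that a dot product equals cosine similarity.
    src = source_feats / np.linalg.norm(source_feats, axis=1, keepdims=True)
    ref = reference_feats / np.linalg.norm(reference_feats, axis=1, keepdims=True)
    sim = src @ ref.T  # shape: (num_source_frames, num_reference_frames)
    # Indices of the k most similar reference frames for each source frame.
    top_k = np.argsort(-sim, axis=1)[:, :k]
    # Average the selected (un-normalized) reference frames per source frame.
    return reference_feats[top_k].mean(axis=1)

# Random stand-ins for SSL features: one row per frame.
rng = np.random.default_rng(0)
source = rng.normal(size=(50, 1024))      # 50 source frames
reference = rng.normal(size=(200, 1024))  # 200 reference frames
converted = knn_match(source, reference, k=4)
print(converted.shape)
```

Every output frame is built only from reference frames, which is why the result is expected to carry the reference speaker's timbre while following the source content frame by frame.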

Neural Harmonic Signal Generator

To improve matching accuracy, the authors use the mean of the last 5 layers of WavLM-Large for matching and the 6th layer for synthesis. The motivation is that the last 5 layers carry more discriminative content information, which makes the nearest-neighbor search more reliable.
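The layer selection described above can be sketched as follows. This is only an illustration on synthetic data: the function name `select_features` is hypothetical, and the array layout (index 0 holding the pre-transformer embedding output, so that index n is the n-th layer) is an assumption about how the per-layer outputs are stored.

```python
import numpy as np

def select_features(hidden_states):
    """Split per-layer SSL outputs into matching and synthesis features.

    hidden_states: array of shape (num_layers + 1, num_frames, dim), where
    index 0 is assumed to be the pre-transformer embedding output.
    """
    matching = hidden_states[-5:].mean(axis=0)  # mean of the last 5 layers
    synthesis = hidden_states[6]                # 6th transformer layer
    return matching, synthesis

# Stand-in for real model outputs: embeddings + 24 layers, 50 frames, 1024 dims.
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(25, 50, 1024))
matching, synthesis = select_features(hidden_states)
print(matching.shape, synthesis.shape)
```

In a setup like this, the nearest-neighbor search would presumably run on the matching features, with the resulting indices used to gather the corresponding synthesis features from the reference utterances.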

Audio Synthesizer

The final step generates the converted voice from the matched SSL features and the neural harmonic signal generator. The result is a high-individuality voice conversion with improved naturalness and quality.

Subjective Evaluation

To evaluate NeuCoSVC, the authors conducted a subjective evaluation with 20 participants. The results showed that the proposed method outperformed other state-of-the-art methods in terms of naturalness and individuality.

Conclusion

In conclusion, the paper presents NeuCoSVC, a neural approach to high-individuality voice conversion that combines kNN matching of SSL features with a neural harmonic signal generator. The method shows improved naturalness and quality compared to existing methods, making it a promising option for personalized and customizable voice experiences.