Bridging the gap between complex scientific research and the curious minds eager to explore it.

Computation and Language, Computer Science

Enhancing Whisper with Prompt Tuning for Target-Speaker ASR


In this article, we aim to improve the performance of Whisper, a popular Automatic Speech Recognition (ASR) system, on target-speaker ASR: transcribing the speech of one designated speaker in audio where multiple people talk over each other. Our approach optimizes soft prompts tailored to each target speaker's voice while leaving Whisper's own pretrained parameters untouched. We evaluate the method on Libri2Mix, a dataset of overlapped speech mixtures built from different speakers' utterances.
To understand how this works, imagine a machine learning model that can recognize spoken words like a human interpreter. Just as a human listener can tune in to a single voice at a noisy party, the model needs a way to focus on the target speaker's voice. Unlike humans, however, it cannot find that focus on its own; it needs explicit guidance for each speaker. That's where prompt tuning comes in.
Prompt tuning is like fitting a personalized pair of glasses for each speaker. We learn a small set of prompt vectors that encode each target speaker's voice and feed them to Whisper alongside the audio; Whisper's own weights stay frozen, and only the prompts are trained. This lightweight optimization steers the system toward the target speaker and enables it to recognize their speech more accurately.
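To make this concrete, here is a minimal PyTorch sketch of soft-prompt tuning. The class name SoftPromptWrapper, the default sizes, and the idea of treating the backbone as a generic module that consumes embedding sequences are all our own illustrative assumptions, not the paper's actual implementation; wiring prompts into Whisper's real encoder and decoder takes more plumbing than shown here.

```python
import torch
import torch.nn as nn

class SoftPromptWrapper(nn.Module):
    """Soft-prompt tuning sketch: the backbone stays frozen and only a
    small matrix of prompt embeddings receives gradient updates."""

    def __init__(self, backbone: nn.Module, prompt_len: int = 16, d_model: int = 512):
        super().__init__()
        self.backbone = backbone
        # Freeze every backbone parameter; only the prompts are trained.
        for p in self.backbone.parameters():
            p.requires_grad = False
        # One learnable embedding per prompt position.
        self.prompts = nn.Parameter(torch.randn(prompt_len, d_model) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, d_model). Prepend the prompt
        # embeddings to every sequence in the batch, then run the backbone.
        batch = input_embeds.size(0)
        prompts = self.prompts.unsqueeze(0).expand(batch, -1, -1)
        return self.backbone(torch.cat([prompts, input_embeds], dim=1))
```

Because only self.prompts requires gradients, an optimizer built from just the trainable parameters, e.g. torch.optim.Adam(p for p in model.parameters() if p.requires_grad), updates a few thousand values per speaker instead of Whisper's full weight set.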
We evaluate three different prompt lengths, and the results show that performance improves as the length increases. Beyond a certain point (16 in this case), however, further gains are marginal, so we fix the prompt length at 16 for all subsequent experiments.
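One reason to cap the length is cost: the number of trainable values grows linearly with the prompt length. The snippet below illustrates that scaling; the embedding width d_model = 512 is an assumed value chosen for illustration, not necessarily the paper's configuration.

```python
import torch
import torch.nn as nn

# How the trainable-parameter count scales with prompt length.
# d_model = 512 is an assumed embedding width, used for illustration.
d_model = 512
for prompt_len in (4, 16, 64):
    prompts = nn.Parameter(torch.randn(prompt_len, d_model))
    print(prompt_len, prompts.numel())  # 2048, 8192, 32768 trainable values
```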
Interestingly, we found that using a single MLP (Multi-Layer Perceptron) shared across all layers for reparameterization leads to worse performance than using no reparameterization at all. This is likely because the shared MLP forces every layer's prompts through the same transformation, limiting the model's ability to learn layer-specific variations. To overcome this, we use a separate MLP for each layer, which yields better results.
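Here is a minimal sketch of the per-layer variant under our own naming assumptions (PerLayerPromptReparam, hidden, and the sizes are illustrative): each layer's base prompts pass through that layer's own MLP, so the transformations are free to differ between layers.

```python
import torch
import torch.nn as nn

class PerLayerPromptReparam(nn.Module):
    """Deep-prompt reparameterization sketch: each transformer layer gets
    its own MLP, so layer-specific prompt variations can be learned."""

    def __init__(self, n_layers: int, prompt_len: int = 16,
                 d_model: int = 512, hidden: int = 256):
        super().__init__()
        # Base prompt embeddings, one set per layer.
        self.base = nn.Parameter(torch.randn(n_layers, prompt_len, d_model) * 0.02)
        # A separate MLP per layer; a single shared MLP would force every
        # layer's prompts through the same transformation.
        self.mlps = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, hidden),
                          nn.Tanh(),
                          nn.Linear(hidden, d_model))
            for _ in range(n_layers)
        )

    def forward(self) -> torch.Tensor:
        # Returns prompts of shape (n_layers, prompt_len, d_model), one
        # reparameterized set per layer.
        return torch.stack([mlp(p) for mlp, p in zip(self.mlps, self.base)])
```

Swapping the ModuleList for a single shared nn.Sequential would reproduce the shared-MLP variant that, as described above, underperforms.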
In summary, our article proposes adapting Whisper to target speakers by optimizing speaker-specific prompts, reparameterized through a separate MLP for each layer rather than a shared one. Doing so improves the system's accuracy on overlapped speech without touching Whisper's pretrained weights. Our findings also show that increasing the prompt length improves performance up to a point, beyond which further gains are marginal.