Bridging the gap between complex scientific research and the curious minds eager to explore it.

Audio and Speech Processing, Electrical Engineering and Systems Science

Attention-Based Speech Recognition with Confidence Estimation


In this article, we’ll dive into the intricacies of training an attention-based end-to-end speech recognition model. We’ll explore how the training data was divided to prepare it for the confidence estimation models, and how automatic speech recognition (ASR) was performed with the ASR model together with a language model.
First, let’s set the stage: recent state-of-the-art end-to-end speech recognition systems employ Transformers, powerful neural networks that can process input sequences of varying lengths. During decoding, the attention-based ASR scores are combined with the language model scores in a "shallow fusion" fashion. A separate challenge is that the data used to train the confidence estimators contains far more correctly recognized token samples than incorrectly recognized ones, an imbalance we come back to below.
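To make the shallow fusion idea concrete, here is a minimal sketch of how the two sets of scores might be combined at each step of beam-search decoding. The function name and the interpolation weight `lm_weight` are illustrative assumptions, not details taken from the paper.

```python
import torch

def shallow_fusion_score(asr_log_probs: torch.Tensor,
                         lm_log_probs: torch.Tensor,
                         lm_weight: float = 0.3) -> torch.Tensor:
    """Combine attention-based ASR scores with language model scores.

    Both tensors hold log-probabilities over the vocabulary for the next
    token; lm_weight is an assumed hyperparameter that would normally be
    tuned on a development set.
    """
    return asr_log_probs + lm_weight * lm_log_probs

# Illustrative usage for one decoding step with a toy vocabulary:
vocab_size = 5000
asr_scores = torch.log_softmax(torch.randn(vocab_size), dim=-1)
lm_scores = torch.log_softmax(torch.randn(vocab_size), dim=-1)
combined = shallow_fusion_score(asr_scores, lm_scores, lm_weight=0.3)
best_token = combined.argmax().item()
```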
Now, let’s get into the nitty-gritty: the training data was divided into two halves to prepare it for the confidence estimation models. One half was used to train both the ASR model and the language model, while the other half served as a validation set. ASR was then performed on the entire training dataset with the trained ASR model and language model, producing the recognition hypotheses that the confidence estimators are trained on.
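As a rough illustration of the split described above, the snippet below divides a list of utterance identifiers into two halves. The seeded random split and the utterance-ID naming are assumptions made for the example, not details from the paper.

```python
import random

def split_for_confidence_training(utterance_ids, seed=0):
    """Split the training utterances into two halves: one half trains the
    ASR and language models, the other serves as a validation set.
    The random, seeded split strategy is an assumption."""
    ids = list(utterance_ids)
    random.Random(seed).shuffle(ids)
    mid = len(ids) // 2
    return ids[:mid], ids[mid:]

# Illustrative usage with hypothetical utterance identifiers:
all_utts = [f"utt_{i:05d}" for i in range(10)]
train_half, valid_half = split_for_confidence_training(all_utts)
print(len(train_half), len(valid_half))  # 5 5
```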
But wait, there’s more! To address the imbalance mentioned earlier, a "class-balanced loss" is introduced. This loss function accounts for the skew between correctly recognized and incorrectly recognized token samples during training by weighting the rarer class more heavily. Training with it makes the confidence estimators more robust to recognition errors and improves their overall performance.
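One common way to realize a class-balanced loss is the "effective number of samples" re-weighting of Cui et al. (2019), sketched below for the binary case of correctly vs. incorrectly recognized tokens. Whether the paper uses exactly this variant, and the class counts and beta value shown, are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def class_balanced_bce(logits: torch.Tensor,
                       targets: torch.Tensor,
                       samples_per_class: tuple,
                       beta: float = 0.999) -> torch.Tensor:
    """Binary cross-entropy weighted by the 'effective number of samples'
    scheme (Cui et al., 2019). Class 0 = incorrectly recognized token,
    class 1 = correctly recognized token; counts and beta are assumed."""
    counts = torch.tensor(samples_per_class, dtype=torch.float)
    effective_num = 1.0 - torch.pow(beta, counts)
    weights = (1.0 - beta) / effective_num
    # Normalize so the weights sum to the number of classes.
    weights = weights / weights.sum() * len(samples_per_class)
    # Pick the weight that corresponds to each target's class.
    per_sample_w = weights[targets.long()]
    return F.binary_cross_entropy_with_logits(
        logits, targets.float(), weight=per_sample_w)

# Illustrative usage: far more correct tokens than incorrect ones.
logits = torch.randn(8)
targets = torch.tensor([1, 1, 1, 1, 1, 1, 0, 1])
loss = class_balanced_bce(logits, targets, samples_per_class=(5_000, 95_000))
```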
In summary, this article provides insights into the experimental settings for attention-based end-to-end speech recognition with confidence estimation. We explored how the training data was divided, how ASR was performed with the ASR model and language model, and how a class-balanced loss addresses the imbalance between correct and incorrect tokens. Together, these techniques improve the reliability of the confidence estimates and the overall performance of end-to-end speech recognition systems.