Automatic Speech Recognition (ASR) has shifted from traditional pipeline methods to end-to-end (E2E) approaches, which have achieved remarkable success in recent years. However, E2E models typically rely on external language models during decoding, which complicates the decoding process. In this article, we propose a new model that matches the performance of large-v2 models while achieving a lower real-time factor (RTF) than the medium versions. Our model is trained on a massive corpus of Japanese speech data and outperforms other models on downstream tasks.
Speech Representation
The mainstream of ASR has shifted towards E2E methods, which use connectionist temporal classification (CTC) or an attention-based encoder-decoder model. Models trained in a self-supervised manner on large amounts of unlabeled speech data have attracted considerable attention. However, these methods typically rely on external language models for decoding, which complicates the decoding process. Our proposed model uses a single encoder-decoder model without any pre-training or decoding-time fusion, and it matches the performance of large-v2 models at a lower RTF.
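To make the CTC objective concrete, the sketch below shows how a CTC loss over encoder log-probabilities can be computed with PyTorch. The shapes, vocabulary size, and random tensors are illustrative assumptions rather than the actual training code.

```python
# A minimal sketch of the CTC objective, assuming a PyTorch setup.
# All shapes and values below are illustrative, not the paper's code.
import torch
import torch.nn as nn

ctc_loss = nn.CTCLoss(blank=0)  # index 0 is reserved for the CTC blank token

# Hypothetical encoder output: (time, batch, vocab) log-probabilities.
T, B, V = 50, 4, 32
log_probs = torch.randn(T, B, V, requires_grad=True).log_softmax(dim=-1)

# Hypothetical target transcripts (label indices 1..V-1) and their lengths.
targets = torch.randint(1, V, (B, 10), dtype=torch.long)
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 10, dtype=torch.long)

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # trainable end-to-end, no external language model required
```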
ReazonSpeech Corpus and Evaluation
Yin et al. (31) created ReazonSpeech, a massive, freely available corpus of Japanese speech. Compared with other corpora, it contains more fillers, disfluencies, and mispronounced words, and its transcription tendencies differ from those of written text. We measured the average RTF on an NVIDIA T4 GPU using the first 100 utterances of JSUT basic5000; our model matched the performance of large-v2 models with a lower RTF than the medium versions.
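For reference, the sketch below shows one way such an average RTF measurement could be implemented: RTF is wall-clock decoding time divided by audio duration, so lower is faster. `model.transcribe` and `load_utterances` are hypothetical placeholders for the decoding call and data loader, which are not specified here.

```python
# A minimal sketch of an average-RTF measurement over an evaluation set.
# RTF = wall-clock decoding time / audio duration (lower is faster).
import time

def average_rtf(model, utterances):
    total_process, total_audio = 0.0, 0.0
    for waveform, duration_sec in utterances:
        start = time.perf_counter()
        model.transcribe(waveform)          # run full decoding (hypothetical API)
        total_process += time.perf_counter() - start
        total_audio += duration_sec
    return total_process / total_audio      # aggregate RTF over the set

# e.g. rtf = average_rtf(model, load_utterances("jsut_basic5000")[:100])
```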
Conclusion
Our proposed model matches the performance of large-v2 models at a lower RTF, and training on a larger speech corpus may close the remaining gap to the large-v3 model. Our approach simplifies decoding by using a single encoder-decoder model without any pre-training or decoding-time fusion, making it more efficient and easier to deploy in practical applications. By leveraging a massive corpus of Japanese speech data, our model also achieves better performance on downstream tasks.