Bridging the gap between complex scientific research and the curious minds eager to explore it.

Audio and Speech Processing, Electrical Engineering and Systems Science

The impact of continuous pre-training on downstream speech recognition (ASR) performance.

The paper explores the use of continuous pre-training (CP) for improving the performance of fusion-based automatic speech recognition (ASR) systems. The authors experiment with various CP strategies and compare their downstream performance across diverse target domains. They find that CP can significantly improve ASR accuracy, especially when combined with end-to-end (E2E) fine-tuning.

CP Strategies

The authors experiment with two CP strategies: attaching a linear CTC head on top of the pre-trained encoder, and using the pre-trained feature extractor as a frozen component during E2E fine-tuning. They find that both strategies improve ASR performance, with the latter yielding better results.
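To make the second strategy concrete, here is a minimal PyTorch sketch (using the Hugging Face Transformers wav2vec 2.0 classes) of attaching a linear CTC head to a pre-trained encoder and freezing the convolutional feature extractor before E2E fine-tuning. The checkpoint name, vocabulary size, and dummy inputs are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch: linear CTC head on a pre-trained encoder, frozen feature extractor.
import torch
from transformers import Wav2Vec2ForCTC

# Load a pre-trained (or continually pre-trained) encoder with a fresh CTC head.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-base",   # placeholder checkpoint
    vocab_size=32,              # placeholder: size of the target character set
    ctc_loss_reduction="mean",
)

# Keep the low-level feature extractor frozen during fine-tuning, so only the
# transformer layers and the linear CTC head are updated.
for param in model.wav2vec2.feature_extractor.parameters():
    param.requires_grad = False

# Forward pass on a dummy batch of raw 16 kHz audio.
waveform = torch.randn(1, 16000)          # one second of audio
labels = torch.randint(1, 32, (1, 20))    # dummy token targets
outputs = model(input_values=waveform, labels=labels)
print(outputs.loss, outputs.logits.shape)  # CTC loss and per-frame logits
```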

Target Domains

The authors evaluate the performance of CP on several target domains, including Wall Street Journal (WSJ) and Switchboard (SWBD). They find that CP performs well across different domains, with an average improvement of 10-15% in word error rate (WER).
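For readers unfamiliar with the metric, the sketch below shows how WER and a relative WER improvement between two systems are typically computed, here using the jiwer package; the reference and hypothesis strings are made up and do not reproduce the paper's numbers.

```python
# Illustrative WER comparison between a baseline and an adapted system.
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
baseline  = "the quick brown fox jump over a lazy dog"    # hypothetical baseline output
adapted   = "the quick brown fox jumps over a lazy dog"   # hypothetical CP-adapted output

wer_baseline = jiwer.wer(reference, baseline)
wer_adapted  = jiwer.wer(reference, adapted)

# Relative WER improvement: the fraction of the baseline's errors removed.
relative_improvement = (wer_baseline - wer_adapted) / wer_baseline
print(f"baseline WER={wer_baseline:.2%}, adapted WER={wer_adapted:.2%}, "
      f"relative improvement={relative_improvement:.1%}")
```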

Performance Comparison

The authors compare the downstream performance of FusDom, a fusion-based ASR system that combines multiple pre-trained models, with Vanilla CP, which uses only the pre-trained features without any fusion. They find that FusDom consistently outperforms Vanilla CP on all target domains, indicating the advantage of combining pre-trained models.
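The paper's exact fusion architecture is not spelled out here, but the following PyTorch sketch illustrates one generic way to fuse hidden states from two pre-trained encoders with cross-attention before a CTC head. The layer sizes, residual wiring, and dummy inputs are assumptions for illustration, not the authors' implementation.

```python
# Sketch: cross-attention fusion of two encoders' hidden states before CTC.
import torch
import torch.nn as nn

class FusionCTCHead(nn.Module):
    def __init__(self, hidden_dim: int = 768, vocab_size: int = 32, num_heads: int = 8):
        super().__init__()
        # Query: features from the domain-adapted encoder; key/value: original encoder.
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)
        self.ctc_head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, adapted_feats: torch.Tensor, original_feats: torch.Tensor) -> torch.Tensor:
        # adapted_feats, original_feats: (batch, time, hidden_dim)
        fused, _ = self.cross_attn(adapted_feats, original_feats, original_feats)
        fused = self.norm(fused + adapted_feats)   # residual connection
        return self.ctc_head(fused)                # per-frame logits for CTC

# Dummy usage with random features standing in for the two encoders' outputs.
adapted  = torch.randn(2, 100, 768)
original = torch.randn(2, 100, 768)
logits = FusionCTCHead()(adapted, original)
print(logits.shape)  # (2, 100, 32)
```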

Conclusion

The paper demonstrates the effectiveness of continuous pre-training for improving the performance of fusion-based ASR systems. By continuing to pre-train existing models and then fine-tuning them on specific target domains, CP can significantly improve ASR accuracy. The findings of this study have important implications for practical applications of ASR systems in various industries.