The paper explores the use of continued pre-training (CP) for improving the performance of fusion-based automatic speech recognition (ASR) systems. The authors experiment with several CP strategies and compare their downstream performance across diverse target domains. They find that CP can significantly improve ASR accuracy, especially when combined with end-to-end (E2E) fine-tuning.
CP Strategies
The authors experiment with two CP strategies: adding a linear CTC head and using the pre-trained feature extractor as a frozen component during E2E fine-tuning. Both strategies improve ASR performance, with the latter yielding better results.
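To make the second setup concrete, here is a minimal PyTorch sketch of a frozen pre-trained encoder with a trainable linear CTC head on top. The encoder interface, dimensions, and blank index are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class LinearCTCHead(nn.Module):
    """Sketch: a frozen pre-trained encoder followed by a linear CTC head.

    `encoder` stands for any pre-trained speech encoder (e.g., a wav2vec 2.0-style
    model) that returns frame-level features of shape (batch, time, hidden_dim);
    names and dimensions are illustrative, not the paper's exact setup.
    """

    def __init__(self, encoder: nn.Module, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():  # freeze the pre-trained encoder
            p.requires_grad = False
        self.ctc_head = nn.Linear(hidden_dim, vocab_size)  # only trainable part

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():               # encoder stays frozen
            hidden = self.encoder(features)  # (batch, time, hidden_dim)
        logits = self.ctc_head(hidden)       # (batch, time, vocab_size)
        # CTC loss expects log-probabilities shaped (time, batch, vocab)
        return logits.log_softmax(dim=-1).transpose(0, 1)

# Usage with a CTC loss (blank index 0 assumed):
# log_probs = model(batch_features)
# loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
```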
Target Domains
The authors evaluate the performance of CP on several target domains, including the Wall Street Journal (WSJ) and Switchboard (SWBD) corpora. They find that CP performs well across different domains, with an average word error rate (WER) reduction of 10-15%.
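For reference, WER counts the word substitutions, deletions, and insertions needed to turn the hypothesis into the reference transcript, divided by the number of reference words. The following self-contained sketch computes it via word-level edit distance; it is an illustration, not the authors' evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))  # ~0.167
```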
Performance Comparison
The authors compare the downstream performance of FusDom, a fusion-based ASR system that combines multiple pre-trained models, with Vanilla CP, which uses only the pre-trained features without any fusion. They find that FusDom consistently outperforms Vanilla CP on all target domains, indicating the advantage of combining pre-trained models.
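As an illustration of what fusing two pre-trained models can look like, the sketch below combines frame-level features from two encoders with cross-attention and a residual connection. FusDom's actual fusion mechanism may differ; all names and dimensions here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    """Illustrative fusion of frame-level representations from two pre-trained
    encoders via cross-attention; treat this as a sketch of the general idea,
    not the paper's architecture."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, in_domain: torch.Tensor, out_domain: torch.Tensor) -> torch.Tensor:
        # in_domain / out_domain: (batch, time, dim) features from the two models.
        # In-domain features query out-of-domain features, so the fused output
        # keeps in-domain structure while borrowing more general knowledge.
        fused, _ = self.cross_attn(query=in_domain, key=out_domain, value=out_domain)
        return self.norm(in_domain + fused)  # residual connection

# Example: fuse 768-dim features from two encoders over a 100-frame utterance.
fusion = FusionModule(dim=768)
a = torch.randn(2, 100, 768)  # features from the continually pre-trained model
b = torch.randn(2, 100, 768)  # features from the original pre-trained model
print(fusion(a, b).shape)     # torch.Size([2, 100, 768])
```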
Conclusion
The paper demonstrates the effectiveness of continued pre-training for improving the performance of fusion-based ASR systems. By leveraging pre-trained models and adapting them to specific target domains, CP can significantly improve ASR accuracy. These findings have practical implications for deploying ASR systems across a range of industries.