Keyword spotting is a technology that detects and identifies specific keywords in audio or video data. Recently, there has been growing interest in open-vocabulary keyword spotting, which lets users define their own keywords instead of relying on a predefined set. In this article, we investigate how keyword length affects system performance by running experiments with keywords of different lengths. Our findings reveal that longer keywords tend to give better overall performance, for two reasons. First, shorter sequences provide less context for accurate prediction, which can lead to false alarms. Second, longer sequences give the model more opportunities to use the full potential of the decoder, making it better at distinguishing between different words.
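To make the length comparison concrete, here is a minimal sketch of how detection trials could be grouped by keyword length and scored. It is an illustration, not the evaluation code behind the experiments: the function name `rates_by_length`, the trial format, the use of character count as the length measure, and the toy trials are all assumptions made for this example.

```python
from collections import defaultdict


def rates_by_length(trials):
    """Group trials by keyword length and compute error rates per group.

    trials: iterable of (keyword, is_present, is_detected) tuples.
    Returns {length: (false_alarm_rate, miss_rate)}. Length is counted in
    characters here (an assumption; phonemes or syllables would also work).
    A rate is None when its group has no trials of the relevant kind.
    """
    counts = defaultdict(lambda: {"fa": 0, "neg": 0, "miss": 0, "pos": 0})
    for keyword, is_present, is_detected in trials:
        c = counts[len(keyword)]
        if is_present:
            c["pos"] += 1
            c["miss"] += int(not is_detected)  # keyword spoken but not detected
        else:
            c["neg"] += 1
            c["fa"] += int(is_detected)  # keyword detected but not spoken
    result = {}
    for length, c in sorted(counts.items()):
        fa_rate = c["fa"] / c["neg"] if c["neg"] else None
        miss_rate = c["miss"] / c["pos"] if c["pos"] else None
        result[length] = (fa_rate, miss_rate)
    return result


# Toy trials, invented purely to exercise the function (not experimental data).
toy_trials = [
    ("hi", False, True),
    ("hi", False, False),
    ("hi", True, True),
    ("hello", False, False),
    ("hello", True, False),
]
rates = rates_by_length(toy_trials)
```

Plotting the false alarm rate against length from such a table is one simple way to check whether the trend described above holds on a given test set.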
We used a baseline CTC model and compared it with two other models: one with an integrated encoder and another with a causal decoder. The results show that the full decoder model performs best among the three, indicating that the extra steps in the decoder provide more context for accurate predictions. Additionally, we found that longer keywords result in fewer false alarms, which benefits users who want to avoid irrelevant notifications.
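For readers unfamiliar with CTC, the sketch below shows the standard CTC forward algorithm, which computes the log-probability that a sequence of per-frame outputs collapses to a given keyword. This is the generic textbook recursion, not necessarily how the baseline in these experiments scores keywords; the function names, the input layout (a list of per-frame log-probability lists), and the choice of index 0 for the blank symbol are assumptions made for illustration.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of log-values."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def ctc_log_score(log_probs, labels, blank=0):
    """Log-probability that the frames collapse to `labels` under CTC.

    log_probs: T x V nested list of per-frame log-probabilities.
    labels: non-empty list of token ids (the keyword), none equal to `blank`.
    """
    # Interleave blanks: [blank, l1, blank, l2, ..., blank]
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    num_states = len(ext)

    # Initialisation: a path may start in the first blank or the first label.
    alpha = [-math.inf] * num_states
    alpha[0] = log_probs[0][ext[0]]
    alpha[1] = log_probs[0][ext[1]]

    for t in range(1, len(log_probs)):
        new_alpha = [-math.inf] * num_states
        for s in range(num_states):
            candidates = [alpha[s]]  # stay in the same state
            if s >= 1:
                candidates.append(alpha[s - 1])  # advance by one state
            # Skip over a blank only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                candidates.append(alpha[s - 2])
            new_alpha[s] = logsumexp(candidates) + log_probs[t][ext[s]]
        alpha = new_alpha

    # Valid paths end in the last label or the trailing blank.
    return logsumexp([alpha[-1], alpha[-2]])


# Two frames, two symbols (blank=0, token=1), uniform probabilities.
uniform = [[math.log(0.5), math.log(0.5)] for _ in range(2)]
score = ctc_log_score(uniform, [1])
```

In the toy case, three of the four possible two-frame paths collapse to the single token, so the probability is 0.75. The extended sequence has two states per keyword token plus one, so a longer keyword constrains the alignment over more frames, which is consistent with the trend reported here, although this sketch does not by itself demonstrate it.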
In summary, our findings demonstrate the importance of keyword length for keyword spotting performance and suggest that longer keywords lead to better accuracy and fewer false alarms. By understanding these factors, developers can improve their systems' performance and create more user-friendly experiences.
Audio and Speech Processing, Electrical Engineering and Systems Science