In this article, we propose a new approach to sound source localization that improves on previous methods by incorporating both local and global context-aware information. Traditional methods rely solely on global graphs, which makes it difficult to pin down the semantics of each node. Our proposed model, a lightweight attention-fused Multi-Level Graph Learning (MLGL) network, addresses this issue by using attention mechanisms to focus on the nodes most relevant to a given context.
To extract node representations with explicit semantic information, we combine the fine-grained and coarse-grained labels of the DeLTA dataset. Together, these labels give a more detailed picture of the different types of environmental sounds and the relationships between them. We then build local context-aware graphs (LcGs) over these node representations, which lets us capture the characteristics unique to each sound source in a particular context.
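To make the graph-building step concrete, here is a minimal sketch of how a local context-aware graph could be constructed from per-node embeddings. It assumes PyTorch; the function name `build_local_graph`, the cosine-similarity criterion, and the `sim_threshold` parameter are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def build_local_graph(node_embeddings: torch.Tensor, sim_threshold: float = 0.5) -> torch.Tensor:
    """node_embeddings: (N, D), one embedding per semantic node (e.g. one per
    sound source present in the current clip). Returns a row-normalised (N, N)
    adjacency matrix describing how strongly the nodes relate in this context."""
    # Pairwise cosine similarity between node embeddings.
    normed = F.normalize(node_embeddings, dim=-1)
    sim = normed @ normed.t()                               # (N, N)
    # Keep only sufficiently similar pairs; always keep self-loops.
    adj = torch.where(sim > sim_threshold, sim, torch.zeros_like(sim))
    adj.fill_diagonal_(1.0)
    # Row-normalise so each node aggregates a weighted average of its neighbours.
    return adj / adj.sum(dim=-1, keepdim=True)

# Example: 24 semantic nodes with 128-dimensional embeddings for one audio clip.
emb = torch.randn(24, 128)
lcg = build_local_graph(emb)
print(lcg.shape)  # torch.Size([24, 24])
```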
The next step is to fuse these LcGs with attention, so that the model selectively focuses on the most relevant nodes when computing the representation of a given sound source. This not only improves the accuracy of localization but also yields a more detailed picture of the relationships between different sound sources.
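The following sketch shows one way such attention-based fusion over a local graph could look, again assuming PyTorch. The layer re-weights each node's neighbours with learned attention scores before aggregation; the class name, the scaled dot-product scoring, and the dimensions are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (N, D) node features; adj: (N, N) adjacency of the local graph.
        q, k, v = self.query(x), self.key(x), self.value(x)
        scores = q @ k.t() / x.size(-1) ** 0.5              # (N, N) raw attention
        # Mask out non-neighbours so each node only attends within its local graph.
        scores = scores.masked_fill(adj == 0, float("-inf"))
        weights = F.softmax(scores, dim=-1)
        # Each node's new representation focuses on its most relevant neighbours.
        return weights @ v

# Usage: 24 nodes, 128-dim features, a random adjacency with self-loops kept.
adj = (torch.rand(24, 24) > 0.5).float()
adj.fill_diagonal_(1.0)
layer = GraphAttentionFusion(128)
fused = layer(torch.randn(24, 128), adj)
print(fused.shape)  # torch.Size([24, 128])
```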
In addition, we propose a hierarchical graph representation learning (HGRL) method that combines local and global information to strengthen the shared representation of the same node across different contexts. This allows the model to capture the complex relationships between sound sources and their contextual variations.
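As a rough illustration of combining local (per-clip) and global (dataset-level) views of the same node, the sketch below blends a context-specific node representation with a shared, learned global embedding through a gate. The gating scheme and all names here are assumptions made for the example, not the paper's exact HGRL formulation.

```python
import torch
import torch.nn as nn

class LocalGlobalCombiner(nn.Module):
    def __init__(self, num_nodes: int, dim: int):
        super().__init__()
        # Global node embeddings: one shared vector per semantic node,
        # learned across all clips and contexts.
        self.global_nodes = nn.Parameter(torch.randn(num_nodes, dim))
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, local_nodes: torch.Tensor) -> torch.Tensor:
        # local_nodes: (N, D) context-specific node representations for one clip.
        g = torch.sigmoid(self.gate(torch.cat([local_nodes, self.global_nodes], dim=-1)))
        # Gated mixture: each node blends its context-specific view with the
        # shared global view of the same semantic node.
        return g * local_nodes + (1 - g) * self.global_nodes

combiner = LocalGlobalCombiner(num_nodes=24, dim=128)
out = combiner(torch.randn(24, 128))
print(out.shape)  # torch.Size([24, 128])
```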
The contributions of this work are threefold. First, MLGL offers higher explainability thanks to its reliance on local and global context-aware graphs, which can help developers better understand how the model makes its predictions. Second, MLGL outperforms traditional CNN-based models by leveraging graph neural networks, which capture the relations between nodes well. Finally, MLGL shows that audio events (AEs) from some sources correlate significantly with the annoyance rating (AR), which is consistent with human perception of these environmental sound sources.
In summary, our proposed MLGL model offers a more comprehensive and accurate approach to sound source localization by incorporating both local and global context-aware information. By using attention mechanisms and hierarchical graph representation learning, we can better capture the complex relationships between different sound sources and their contextual variations.