In this article, we propose a novel approach to multimodal fusion called CMDUP (Cross-Attention with Multi-modal Differentiable Up-sampling). Our method combines the strengths of two existing techniques: cross-attention and differentiable up-sampling. Cross-attention allows one modality to attend to relevant parts of the other, while differentiable up-sampling enables the model to generate high-resolution outputs from low-resolution inputs.
The CMDUP module consists of three main components: (1) query feature extraction, where one input modality is projected onto a common space; (2) key and value feature extraction, where the other modality is projected onto the same space and attention scores are computed between queries and keys; and (3) multi-modal fusion, where the value features are combined according to the attention scores. The final output is obtained by applying a nonlinear transformation to the fused features.
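To make the three components concrete, the following is a minimal PyTorch sketch of the cross-attention fusion step. The class name, layer layout, and dimensions are illustrative assumptions of ours; the description above does not pin down the exact projections.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Minimal sketch of the three-component fusion described above.
    Names and dimensions are illustrative, not the authors' exact design."""

    def __init__(self, dim_a: int, dim_b: int, dim_common: int = 256):
        super().__init__()
        # (1) Query feature extraction: project modality A onto the common space.
        self.to_q = nn.Linear(dim_a, dim_common)
        # (2) Key/value feature extraction: project modality B onto the same space.
        self.to_k = nn.Linear(dim_b, dim_common)
        self.to_v = nn.Linear(dim_b, dim_common)
        # Nonlinear transformation applied to the fused features.
        self.out = nn.Sequential(nn.Linear(dim_common, dim_common), nn.GELU())
        self.scale = dim_common ** -0.5

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        # feat_a: (B, n_a, dim_a); feat_b: (B, n_b, dim_b)
        q = self.to_q(feat_a)                       # (B, n_a, d)
        k = self.to_k(feat_b)                       # (B, n_b, d)
        v = self.to_v(feat_b)                       # (B, n_b, d)
        # Attention scores between queries and keys.
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        # (3) Multi-modal fusion: attention-weighted combination of values.
        fused = attn @ v                            # (B, n_a, d)
        return self.out(fused)
```

Because the attention scores are computed per query token, no pre-alignment between the two modalities' token grids is required, which is the property the next paragraph relies on.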
One of the main advantages of CMDUP is its ability to aggregate information from both modalities without relying on explicit modality alignment. This is achieved through cross-attention, which allows the model to learn a joint representation of the input modalities. Additionally, the differentiable up-sampling mechanism generates high-resolution outputs from low-resolution inputs while remaining end-to-end trainable, in contrast to fixed, non-learned upsampling methods.
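The text does not specify which differentiable up-sampling mechanism is used; the sketch below assumes one common fully differentiable, learnable choice, sub-pixel convolution (PixelShuffle), so gradients from the high-resolution output flow back into the fusion module.

```python
import torch
import torch.nn as nn

class DifferentiableUpsampler(nn.Module):
    """Illustrative up-sampling head (an assumption, not the authors' exact
    mechanism): a 3x3 convolution followed by PixelShuffle, which is
    differentiable end to end and learned jointly with the fusion module."""

    def __init__(self, channels: int, scale: int = 4):
        super().__init__()
        # Expand channels by scale**2 so PixelShuffle can trade them for resolution.
        self.proj = nn.Conv2d(channels, channels * scale ** 2,
                              kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) low-resolution fused features
        # returns: (B, C, H*scale, W*scale) high-resolution output
        return self.shuffle(self.proj(x))
```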
In summary, CMDUP is a powerful tool for multimodal fusion that combines the strengths of cross-attention and differentiable up-sampling. Because it aggregates information from both modalities without relying on explicit modality alignment, it is particularly useful in tasks such as visual tracking, where the model must integrate information from multiple sources.