In this article, we propose a novel approach to multi-scale data transformation for efficient convolutional neural network (CNN) performance. Our method, called multi-scale patching and masking, generates input tokens that carry semantically meaningful information at different resolutions without discarding useful detail. We also introduce cross-modal contrastive learning to maximize consistency among inter-modal groups, retaining useful information while suppressing noise.
Multi-Scale Data Transformation
The traditional approach to generating multi-scale data relies on down-sampling, which can discard important information. Our method is lightweight and adaptive: it uses different patch lengths to obtain input tokens with meaningful information at various resolutions, and it defines adaptive mask ratios according to the scale of the features to be extracted, so that only relevant information is retained.
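The patching-and-masking step described above might be sketched as follows. Note that the function name, the rule that the mask ratio grows with patch length, and the 0.8 cap are illustrative assumptions for this sketch, not the exact formulation of our method.

```python
import numpy as np

def multi_scale_patch_and_mask(series, patch_lengths, base_mask_ratio=0.4, seed=0):
    """Split a 1-D series into patches at several scales and randomly mask
    a scale-dependent fraction of the patches at each scale.

    Assumed rule: longer patches capture coarser features, so we mask a
    larger fraction of them (ratio grows with patch length, capped at 0.8).
    """
    rng = np.random.default_rng(seed)
    outputs = {}
    for length in patch_lengths:
        n_patches = len(series) // length
        # non-overlapping patches of this length (trailing remainder dropped)
        patches = series[: n_patches * length].reshape(n_patches, length)
        # adaptive mask ratio relative to the finest scale (assumed rule)
        mask_ratio = min(base_mask_ratio * (length / patch_lengths[0]), 0.8)
        n_masked = int(round(mask_ratio * n_patches))
        mask = np.zeros(n_patches, dtype=bool)
        mask[rng.choice(n_patches, size=n_masked, replace=False)] = True
        outputs[length] = (patches, mask)
    return outputs

tokens = multi_scale_patch_and_mask(np.arange(128.0), patch_lengths=[4, 8, 16])
```

Each scale thus contributes its own set of tokens and its own mask, so coarse and fine structure are both represented without ever down-sampling the raw series.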
Cross-Modal Contrastive Learning
Most work on contrastive representation learning generates positive pairs by data augmentation, which can introduce faulty samples and degrade performance. Our approach instead takes the same segment from different modalities as a positive pair rather than generating one, ensuring that the paired views are accurate and meaningful. This lets the model learn semantic information shared between modalities while ignoring information that is redundant across segments.
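This pairing scheme can be sketched with an InfoNCE-style objective in which the i-th segment embedding from modality A is positive with the i-th embedding from modality B, and all other segments in the batch serve as negatives. This is a simplified numpy sketch under that assumption; the function name and temperature value are illustrative, not our exact loss.

```python
import numpy as np

def cross_modal_info_nce(z_a, z_b, temperature=0.1):
    """Contrastive loss over aligned segments from two modalities.

    z_a, z_b: (N, D) embeddings of the same N segments from modality A
    and modality B. Row i of z_a is positive with row i of z_b; every
    other row of z_b is a negative.
    """
    # L2-normalize so dot products become cosine similarities
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature            # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # positives lie on the diagonal: same segment, different modality
    return -np.mean(np.diag(log_probs))
```

Because the positives are real co-occurring segments rather than augmented copies, minimizing this loss pulls the two modalities' views of each segment together without risking augmentation artifacts.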
Benefits
Our proposed method offers several benefits over traditional approaches. First, it avoids losing useful information to down-sampling, enabling more accurate representation learning. Second, it selects positive pairs as matching segments from different modalities rather than generating them through augmentation, which yields better performance. Finally, the approach is lightweight and adaptive, making it suitable for real-world applications.
Conclusion
In summary, this article presents a novel approach to multi-scale data transformation for efficient CNN performance. Our method uses adaptive patch lengths and mask ratios to preserve useful information while suppressing noise. In addition, we propose cross-modal contrastive learning to maximize consistency among inter-modal groups, improving performance on downstream tasks. Together, these techniques improve the accuracy of time series classification and enhance our understanding of physiological signals.