In this article, the authors propose a novel deep learning architecture called the Double Swin-Transformer block to improve the efficiency and accuracy of image segmentation in computer vision. The proposed model is designed to address two main limitations of traditional convolutional neural networks (CNNs): their inability to capture long-range dependencies and their loss of spatial information during downsampling.
To overcome these limitations, the authors introduce the Double Swin-Transformer block, which consists of two consecutive Swin-Transformer blocks. Each Swin-Transformer block is composed of an LN (layer normalization) layer, an MSA (multi-headed self-attention) module, a residual connection, and a two-layer MLP (multilayer perceptron) with a GELU activation function. The key innovation of the Double Swin-Transformer block is the use of a shifted-window-based multi-headed self-attention module, which lets information flow between neighboring windows and so enables the model to capture long-range dependencies more effectively.
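The effect of shifting the windows can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the 4×4 grid, the window size of 2, and the helper names `window_partition` and `cyclic_shift` are assumptions chosen for illustration, and the attention masking that shifted-window attention normally applies to wrapped-around windows is omitted.

```python
def window_partition(grid, window_size):
    """Split a 2D grid (list of rows) into non-overlapping square windows.

    Returns a list of windows, each a flat list of the tokens it contains.
    """
    height, width = len(grid), len(grid[0])
    windows = []
    for top in range(0, height, window_size):
        for left in range(0, width, window_size):
            windows.append([
                grid[r][c]
                for r in range(top, top + window_size)
                for c in range(left, left + window_size)
            ])
    return windows


def cyclic_shift(grid, shift):
    """Roll the grid up and to the left by `shift` positions, wrapping around."""
    height, width = len(grid), len(grid[0])
    return [
        [grid[(r + shift) % height][(c + shift) % width] for c in range(width)]
        for r in range(height)
    ]


# 4x4 grid of token ids 0..15 with window size 2 (illustrative values).
grid = [[r * 4 + c for c in range(4)] for r in range(4)]

# Regular windows: attention would be computed inside each group separately.
regular_windows = window_partition(grid, 2)

# Shifted windows: shift by half a window, then partition the same way.
shifted_windows = window_partition(cyclic_shift(grid, 1), 2)

print(regular_windows[0])  # tokens grouped together before the shift
print(shifted_windows[0])  # tokens grouped together after the shift
```

In this toy setup the first shifted window gathers one token from each of the four regular windows, which is how alternating the two partitions lets information cross window boundaries.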
Within each pair, the first Swin-Transformer block uses the W-MSA (window-based multi-headed self-attention) module and the second uses the SW-MSA (shifted-window-based multi-headed self-attention) module. W-MSA computes attention inside fixed, non-overlapping windows, while SW-MSA shifts the window grid so that attention spans the boundaries of the previous windows. Alternating the two allows the model to learn global, long-range semantic interactions more effectively.
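The two consecutive blocks can be written compactly as follows. This follows the standard Swin Transformer formulation rather than anything stated explicitly in the summary: the symbols $z^{l}$ (block output) and $\hat{z}^{l}$ (attention output) are notation introduced here, LN denotes layer normalization, and the placement of LN before each module is an assumption carried over from that formulation.

```latex
\[
\begin{aligned}
\hat{z}^{l}   &= \text{W-MSA}\big(\text{LN}(z^{l-1})\big) + z^{l-1}, \\
z^{l}         &= \text{MLP}\big(\text{LN}(\hat{z}^{l})\big) + \hat{z}^{l}, \\
\hat{z}^{l+1} &= \text{SW-MSA}\big(\text{LN}(z^{l})\big) + z^{l}, \\
z^{l+1}       &= \text{MLP}\big(\text{LN}(\hat{z}^{l+1})\big) + \hat{z}^{l+1}.
\end{aligned}
\]
```

Each line adds its input back to the module output, which is the residual connection listed among the block's components.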
The authors also propose a novel decoder architecture that combines the Double Swin-Transformer block with a patch-expanding layer to compensate for the loss of spatial information caused by downsampling. The patch-expanding layer reshapes the feature maps of adjacent dimensions into higher-resolution feature maps (2× upsampling), while the Swin-Transformer block is responsible for feature representation learning. Finally, the last patch-expanding layer applies 4× upsampling to restore the feature maps to the input resolution (W×H).
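The reshaping step can be sketched as a rearrangement that trades channels for spatial resolution. This is a minimal illustration under stated assumptions, not the authors' layer: the function name `patch_expand_rearrange`, the channel ordering, and the toy sizes are hypothetical, and any learned projection applied before the rearrangement is left out.

```python
def patch_expand_rearrange(feature_map, scale=2):
    """Rearrange an H x W x (scale*scale*c) map into (scale*H) x (scale*W) x c.

    Each input position's channel vector is split into scale*scale chunks,
    and each chunk becomes one pixel of a scale x scale output patch.
    The chunk-to-pixel ordering used here is an assumption.
    """
    height, width = len(feature_map), len(feature_map[0])
    channels = len(feature_map[0][0])
    if channels % (scale * scale) != 0:
        raise ValueError("channel count must be divisible by scale * scale")
    out_channels = channels // (scale * scale)

    output = [[None] * (width * scale) for _ in range(height * scale)]
    for r in range(height):
        for c in range(width):
            vector = feature_map[r][c]
            for dr in range(scale):
                for dc in range(scale):
                    start = (dr * scale + dc) * out_channels
                    output[r * scale + dr][c * scale + dc] = (
                        vector[start:start + out_channels]
                    )
    return output


# 2 x 2 map with 8 channels per position -> 4 x 4 map with 2 channels (2x).
small = [[[(r * 2 + c) * 8 + k for k in range(8)] for c in range(2)]
         for r in range(2)]
expanded = patch_expand_rearrange(small, scale=2)
print(len(expanded), len(expanded[0]), len(expanded[0][0]))

# The same rearrangement with scale=4 mirrors the final 4x upsampling step:
# a 1 x 1 map with 16 channels becomes a 4 x 4 map with 1 channel.
final = patch_expand_rearrange([[list(range(16))]], scale=4)
print(len(final), len(final[0]), len(final[0][0]))
```

No values are created or discarded by the rearrangement itself; it only moves channel entries into new spatial positions, so the total number of elements is unchanged.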
In summary, the article presents the Double Swin-Transformer block as an advance in image segmentation. By combining the strengths of traditional CNNs with attention mechanisms, the authors aim for a more efficient and accurate model that captures long-range dependencies and learns global semantic interactions more effectively. The decoder architecture compensates for the loss of spatial information caused by downsampling, which the authors argue makes the model more robust and practical for real-world applications.
Electrical Engineering and Systems Science, Image and Video Processing