Abstract
Accurate MRI image segmentation is crucial for disease diagnosis, but current Transformer-based methods face two key challenges: limited capability to capture detailed information, leading to blurred boundaries and false localization, and the lack of MRI-specific embedding paradigms for attention modules, which limits their potential and representation capability. To address these challenges, this paper proposes a multi-scheme cross-level attention embedded U-shape Transformer (MSCL-SwinUNet). This model integrates cross-level spatial-wise attention (SW-Attention) to transfer detailed information from encoder to decoder, cross-stage channel-wise attention (CW-Attention) to filter out redundant features and enhance task-related channels, and multi-stage scale-wise attention (ScaleW-Attention) to adaptively process multi-scale features. Extensive experiments on the ACDC, MM-WHS and Synapse datasets demonstrate that the proposed MSCL-SwinUNet surpasses state-of-the-art methods in accuracy and generalizability. Visualization further confirms the superiority of our model in preserving detailed boundaries. This work not only advances Transformer-based segmentation in medical imaging but also provides new insights into designing MRI-specific attention embedding paradigms. Our code is available at https://github.com/waylans/MSCL-SwinUNet.
Introduction
In recent years, medical imaging techniques have developed rapidly and become widely used, including Ultrasound, X-ray, Computed Tomography (CT), and Magnetic Resonance Imaging (MRI). As an important medical imaging technology, MRI plays an important role in the detection and early diagnosis of tumors, heart diseases, cerebrovascular diseases, etc. Compared to other auxiliary examination methods, MRI offers fast scanning speed, high tissue resolution and clearer images, which help doctors “see” lesion regions that are otherwise hard to notice.
With the development of convolutional neural networks (CNNs), computer-aided MRI diagnosis has become increasingly intelligent, and many methods have made huge progress on lesion semantic segmentation1,2,3,4,5,6 of MRI images. As a cutting-edge technique, the Transformer has further promoted the development of medical image segmentation. Many methods7,8,9 have been shown to generalize across semantic segmentation of different-modality medical images, including MRI images. Although Transformer-based segmentation networks have achieved promising results, two critical limitations remain unaddressed in the current literature. First, the self-attention mechanism’s strength in capturing global dependencies comes at the cost of diminished sensitivity to fine-grained anatomical structures, often leading to blurred organ boundaries or incorrect segmentation of small regions. Second, attention modules (e.g., spatial-wise, channel-wise, scale-wise) are often introduced in a generic manner, without adaptation to the distinctive characteristics of MRI data, such as its tissue homogeneity, modality-specific contrast, and noise distribution. These limitations motivate the design of our proposed MSCL-SwinUNet, which integrates three cross-level attention modules to enhance semantic representation, inter-layer alignment, and modality adaptability.
To address the above limitations in MRI segmentation, we propose MSCL-SwinUNet, a structurally enhanced U-shape Transformer that embeds a multi-scheme, cross-level attention mechanism. As illustrated in Fig. 1, three complementary attention modules (spatial-wise, channel-wise, and scale-wise) are systematically placed at skip connections, encoder blocks, and decoder stages, respectively. Unlike prior approaches that embed attention modules in isolation, our coordinated design, built upon Swin Transformer10 and SwinUNet8, enforces semantic alignment across network stages and yields modality-specific feature representations for complex anatomical structures. The high-level features extracted by the encoder are ultimately fed into the decoder, upsampled by the “Patch Expanding” layer, and fused with multi-scale features from the encoder through skip connections. The U-shape structure restores the spatial resolution of the feature map for further prediction.
Based on the Swin-Transformer blocks and the U-shape design, our proposed MSCL-SwinUNet embeds a multi-scheme cross-level attention mechanism to enhance MRI segmentation capability. As shown in Fig. 1(c), we embed cross-level “SW-Attention” modules at different stages of the encoder and decoder. Here, we use a spatial-wise attention model to replace the simple “Skip-Connection” operation. Feature maps of the encoder and decoder at the same stage are not consistent, and directly applying “Skip-Connection” may amplify this feature inconsistency. Therefore, we insert an “SW-Attention” module between the encoder’s and decoder’s feature maps at each stage of the network. This design transfers the detailed information from encoder to decoder in a learnable manner. With our designed cross-level “SW-Attention” modules, the aforementioned first weakness is eased. Besides the cross-level “SW-Attention” modules, we embed cross-stage “CW-Attention” modules between every two neighboring stages of the decoder. Even though Transformer blocks extract long-range relations, channel-wise attention is overlooked. In this paper, we further introduce a channel-wise attention mechanism in the decoder to filter more task-related (MRI-segmentation-related) channels and to adjust the connection between low-level and high-level features of the decoder. Moreover, we further introduce multi-stage “ScaleW-Attention” modules, which better handle predictions from feature maps of different scales through an ensemble learning strategy. Aimed at better segmenting MRI images, we design the cross-stage “CW-Attention” and multi-stage “ScaleW-Attention” modules to further explore the representation potential, which eases the aforementioned second weakness.
Compared with prior Transformer-based medical segmentation methods, MSCL-SwinUNet introduces a more structured integration of attention mechanisms. The cross-level SW-Attention enhances semantic consistency between encoder and decoder; CW-Attention refines intra-decoder communication by emphasizing informative channels; and ScaleW-Attention introduces multi-scale adaptability at the output level. These modules collectively improve segmentation performance, especially in anatomically complex regions. However, these improvements come with additional computational overhead due to the use of multiple attention branches. In later sections, we propose efficiency-aware solutions (e.g., selective attention placement, lightweight variants, and knowledge distillation) to mitigate this complexity. This trade-off between segmentation accuracy and computational efficiency constitutes one of the key novelties of our method.
To evaluate the performance of our proposed MSCL-SwinUNet, we conduct experiments on three benchmark datasets: ACDC (Automatic Cardiac Diagnosis Challenge)11, MM-WHS (Multi-Modality Whole Heart Segmentation)12,13 and Synapse (Synapse Multi-Organ Segmentation Dataset). The first two datasets (ACDC and MM-WHS) are used for the MRI semantic segmentation task. We further use the Synapse dataset to evaluate the generalization capability of MSCL-SwinUNet on the CT image semantic segmentation task. Comparison results and visualization analysis demonstrate that our method outperforms previous notable methods. The main contributions of this paper are summarized as follows.
-
We propose MSCL-SwinUNet for the MRI semantic segmentation task. With the multi-scheme cross-level attention mechanism embedded, the MRI segmentation performance of the U-shape Transformer is clearly improved.
-
We embed cross-level “SW-Attention” modules at different stages of the encoder and decoder to replace the simple “Skip-Connection” operation. With this design, detailed information can be better transferred from encoder to decoder in a consistent manner.
-
Aimed at better segmenting MRI images, we design cross-stage “CW-Attention” and multi-stage “ScaleW-Attention” modules in the decoder to further explore the representation potential. This design also provides insight into formulating an embedding paradigm for different attention modules in a U-shape Transformer.
Related work
Medical image segmentation with U-shape network
In the early stages of medical image segmentation research, contour-based methods were primarily used. With the widespread use and development of deep learning, Fully Convolutional Networks (FCN14) have been applied to medical image segmentation tasks. To further enhance the extraction of detailed information, U-Net4,5 has become favored by many researchers due to its symmetric “encoder-decoder” structure with skip-connections. Consequently, several U-shaped networks have been proposed. Li et al.15 propose H-DenseUNet, a U-Net model with hybrid dense connections for liver and tumor segmentation. Huang et al.16 propose the U-Net3+ model, which combines deep supervision at different scales. Alam et al.17 propose a multi-encoder U-shape network with a shared decoder to ease the impact of signal artefacts appearing in MRI. Yu et al.18 propose an improved U-Net with a cross-modality image augmentation technique to tackle the unsupervised cardiac LGE MRI segmentation task.
Attention mechanism in medical image segmentation
To improve segmentation accuracy, many notable methods have incorporated attention mechanisms into segmentation networks. Schlemper et al.19 propose an attention gate, integrated into the U-shaped model, that automatically learns to focus on target structures of varying shapes for medical image segmentation. Roy et al.20 combine squeeze & excitation (SE) modules to propose the “scSE” module, which runs spatial and channel attention mechanisms in parallel. Gu et al.21 introduce spatial, channel, and scale attention mechanisms in the encoding part of the U-shaped structure to make the model focus more on the foreground area, related feature channels and the most significant scale features. There are also many excellent works22,23,24,25 that mix multiple attention mechanisms. Munia et al.26 propose an attention-guided hierarchical fusion U-Net that integrates uncertainty modeling into segmentation. Their model effectively handles ambiguous boundaries and noisy annotations by leveraging multi-level attention and uncertainty estimation, which aligns with our motivation to improve boundary localization through SW-Attention. Nisha et al.27 develop a multi-scale attention U-Net using an EfficientNetB4 encoder for brain MRI segmentation. This work highlights the efficiency and performance trade-offs enabled by integrating lightweight backbones with multi-scale attention modules, which also motivates our design of the ScaleW-Attention mechanism. Pan et al.28 propose a multi-scale Conv-Attention U-Net that adaptively fuses convolutional features with attention maps at multiple levels. This hybrid fusion strategy further supports the idea of integrating spatial, channel, and scale-aware attention, similar to the multi-scheme design in our MSCL-SwinUNet.
Transformer in medical image segmentation
In recent years, the Transformer structure10,29,30 has achieved tremendous success in the computer vision field. On medical images, many researchers use this cutting-edge structure to improve segmentation performance. TransFuse7 and TransUNet9 both adopt ViT as the encoder to extract features. Cao et al.8 propose SwinUNet, which utilizes a U-shaped model based on the Swin-Transformer structure, achieving global learning and long-range information interaction. Zhang et al.31 develop a sparse Transformer model with multi-scale information fusion (SMTF). It obtains local and global information through combination and convolution calculations. Shen et al.32 utilize the HarDNet68 and Transformer structures to extract local and global information from input features, and use an adaptive fusion module to merge multi-level information, thereby realizing channel information interaction. Chen et al.33 design a dual-axis attention mechanism (IEDA-Trans) model that integrates shallow and deep information by multi-scale fusion. Li et al.34 propose SegTran with a squeeze-and-expansion transformer to explore more diverse information. Liang et al.35 propose MAXFormer, which designs an efficient parallel local–global transformer that can enhance the performance of medical image segmentation for clinical treatment. Zhao et al.6 propose \(\hbox {DS}^2\)Net with a Transformer structure (SegFormer36) to improve the unsupervised domain adaptation segmentation task on ultrasound images. Inspired by the emergence of medical foundation models, Wang et al.37 propose Triad, a vision-transformer-based foundation model for 3D MRI. Pretrained on large-scale datasets, it jointly supports segmentation, registration, and classification tasks. This foundation aligns with our vision of extending MSCL-SwinUNet into a unified backbone for multi-task learning in the future.
Datasets
To evaluate the performance of our proposed MSCL-SwinUNet, we select two widely-applied MRI semantic segmentation datasets, ACDC (https://humanheart-project.creatis.insa-lyon.fr/database/#collection/637218c173e9f0047faa00fb)11 and MM-WHS (https://github.com/FupingWu90/CT_MR_2D_Dataset_DA)12,13. To evaluate the generalization capability of MSCL-SwinUNet, we also select a notable CT dataset, Synapse (Synapse Multi-Organ Segmentation Dataset, https://www.synapse.org/Synapse:syn3193805/wiki/217789). ACDC (Automatic Cardiac Diagnosis Challenge) is collected from 150 heart examinations of different patients using MRI scanners. In the MR images of each patient, the left ventricle (LV), right ventricle (RV), and myocardium (MYO) are accurately labeled. The dataset contains 70 training samples, 10 validation samples, and 20 testing samples. MM-WHS (Multi-Modality Whole Heart Segmentation)12,13 is a well-known multi-modality medical image dataset containing CT and MR images. This dataset contains 52 CT images and 46 MR images, in which the LV, MYO and RV of 20 CT and 20 MR volumes are annotated with gold-standard segmentation masks. In this paper, we use 16 slices of each 3D MR image. 2D slices were extracted from the long-axis view around the center of the left ventricular cavity13. Specifically, 320 MR images are labeled and used for experiments. The Synapse dataset contains 3,779 clinical CT images from 30 abdominal scans. In this paper, 18 cases are used as the training set, and the remaining 12 cases are used as the test set.
Method
Main architecture
To enhance the capability of capturing detailed information and to provide insight into formulating an MRI-specific paradigm for embedding different attention modules into a U-shape Transformer network, we propose MSCL-SwinUNet. It contains three main components: “Cross-Level Spatial-Wise Attention”, “Cross-Stage Channel-Wise Attention” and “Multi-Stage Scale-Wise Attention”.
The encoder of MSCL-SwinUNet
As shown in Fig. 1(c), MRI images are first input into the “Patch Partition” layer (\(f_{p}\)) to generate sequence embeddings. Then, the sequence embeddings are projected by “Linear Embedding” (\(f_{l}\)). Next, the projected image patches pass through multiple Swin-Transformer blocks and “Patch Merging” layers to generate hierarchical feature maps. The encoder of MSCL-SwinUNet has 4 stages, so the output feature maps of the Swin-Transformer blocks at different stages are denoted as \(\{\varvec{F_{e}^{i}}\}_{i=1, 2, 3, 4}\). If we denote the Swin-Transformer block at “Stage 1” as \(g_{e}^{1}\), \(\{\varvec{F_{e}^{i}}\}_{i=1, 2, 3, 4}\) can be formulated in Eq. 1,
where \(\{g_{e}^{i}\}_{i=2, 3, 4}\) indicate the “Patch Merging + Swin-Transformer blocks” of stages 2\(\sim\)4 and \(\varvec{X}\) indicates the input image.
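For readers who prefer code, the hierarchical feature extraction described above can be summarized by the following PyTorch-style sketch. It is a minimal illustration rather than our released implementation: `patch_embed`, `swin_stages` and `patch_merges` are placeholder callables standing in for the Patch Partition/Linear Embedding (\(f_{p}\), \(f_{l}\)), the Swin-Transformer blocks (\(g_{e}^{i}\)) and the Patch Merging layers, and the toy stand-ins at the bottom keep the channel count fixed for brevity.

```python
import torch
import torch.nn as nn

def encoder_forward_sketch(x, patch_embed, swin_stages, patch_merges):
    """Minimal sketch of the 4-stage hierarchical encoder.

    patch_embed  : stands in for Patch Partition (f_p) + Linear Embedding (f_l)
    swin_stages  : list of 4 modules standing in for g_e^1 .. g_e^4
    patch_merges : list of 3 modules applied before stages 2-4 (Patch Merging)
    All module arguments are placeholders, not the authors' implementation.
    """
    feats = []
    h = swin_stages[0](patch_embed(x))      # F_e^1 = g_e^1(f_l(f_p(X)))
    feats.append(h)
    for i in range(1, 4):                   # stages 2-4: merge, then transform
        h = swin_stages[i](patch_merges[i - 1](h))   # F_e^i from F_e^{i-1}
        feats.append(h)
    return feats                            # {F_e^i} for skip paths and the decoder


# Toy stand-ins (channels kept constant for brevity; real Swin doubles them per stage):
x = torch.randn(1, 1, 224, 224)
feats = encoder_forward_sketch(
    x,
    patch_embed=nn.Conv2d(1, 96, kernel_size=4, stride=4),
    swin_stages=[nn.Identity() for _ in range(4)],
    patch_merges=[nn.Conv2d(96, 96, kernel_size=2, stride=2) for _ in range(3)],
)
print([f.shape for f in feats])  # spatial sizes 56, 28, 14, 7
```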
The decoder of MSCL-SwinUNet
As illustrated in Fig. 1(c), the decoder of MSCL-SwinUNet consists of four hierarchical stages. Each stage integrates a Swin Transformer block \(g_d^i\) (\(i=1,2,3,4\)) for contextual modeling. Between every two consecutive stages, we insert a Cross-Stage Channel-Wise Attention (CW-Attention) module \(f_{CW}^i\) to refine features. Additionally, we adopt Patch Expanding layers \(f_{pe}^i\) to progressively increase spatial resolution and reduce channel dimensionality for efficient upsampling.
The decoding process begins from the bottleneck feature map \(\varvec{F_e^4}\) obtained from the final encoder stage. The initial decoder output \(\varvec{F_d^4}\) is computed as:
where \(g_d^4(\cdot )\) denotes the Swin Transformer block at decoder Stage 4, \(f_{CW}^4(\cdot )\) applies cross-stage channel-wise attention to suppress redundant channels and emphasize salient ones, and \(f_{pe}^4(\cdot )\) performs upsampling and channel projection via a linear layer.
The decoder continues to process features through subsequent stages (\(i=3,2,1\)) using both the CW-Attention and Cross-Level Spatial-Wise Attention (SW-Attention) modules. The SW-Attention module fuses encoder and decoder features at the same resolution level to alleviate spatial misalignment and preserve boundary details. Furthermore, the last Patch Expanding layer restores the feature map to the original image resolution to enable pixel-level prediction.
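The overall decoder flow can likewise be sketched as follows. This is an assumed, simplified rendering of the module ordering described above (Swin block, then CW-Attention, then Patch Expanding, with SW-Attention fusing encoder and decoder features before stages 3 to 1); the argument names are placeholders rather than our actual API.

```python
def decoder_forward_sketch(enc_feats, swin_blocks, cw_atts, patch_expands, sw_atts):
    """Minimal sketch of the decoder flow.

    enc_feats     : [F_e^1, F_e^2, F_e^3, F_e^4] from the encoder
    swin_blocks   : decoder Swin blocks g_d^1 .. g_d^4 (index 0..3)
    cw_atts       : cross-stage channel-wise attention modules f_CW^1 .. f_CW^4
    patch_expands : Patch Expanding layers f_pe^1 .. f_pe^4
    sw_atts       : cross-level spatial-wise attention modules (stages 1-3)
    All arguments are placeholder callables; this is not the released code.
    """
    # Stage 4 (bottleneck): F_d^4 = f_pe^4(f_CW^4(g_d^4(F_e^4)))
    d = patch_expands[3](cw_atts[3](swin_blocks[3](enc_feats[3])))
    outs = {4: d}
    # Stages 3 -> 1: fuse encoder/decoder features with SW-Attention first
    for j in (3, 2, 1):
        fused = sw_atts[j - 1](enc_feats[j - 1], d)   # F_ed^j = SW(F_e^j, F_d^{j+1})
        d = patch_expands[j - 1](cw_atts[j - 1](swin_blocks[j - 1](fused)))  # F_d^j
        outs[j] = d
    return outs   # per-stage decoder outputs, later consumed by ScaleW-Attention
```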
Cross-level spatial-wise attention module
MRI image segmentation often suffers from boundary ambiguity due to low tissue contrast and partial volume effects. In such scenarios, encoder and decoder features at the same stage can be spatially misaligned, especially when using traditional skip connections. To address this, we design a Cross-Level Spatial-Wise Attention (SW-Attention) module that explicitly aligns spatial responses between encoder and decoder through adaptive mask selection.
In conventional U-shape networks, skip-connections directly transfer encoder features to the decoder, but this operation overlooks spatial inconsistency, especially in MRI data. Our SW-Attention aims to mitigate this by computing spatial attention masks from both encoder and decoder and adaptively fusing them.
As illustrated in Fig. 2, the encoder feature map \(\varvec{F_{e}^{i-1}}\) and decoder feature map \(\varvec{F_{d}^{i}}\) are first summed and passed through two convolutional layers \(Conv_e\) and \(Conv_d\), followed by channel-wise global pooling to produce spatial attention masks:
where \(\varvec{M_{e}^{i-1}}\) and \(\varvec{M_{d}^{i}}\) are the spatial-wise attention masks corresponding to encoder and decoder, respectively; C is the channel number.
To adaptively balance the contribution from both branches, we apply a spatial-wise softmax operation to normalize the two masks:
where \(\varvec{(M_{e}^{i-1})'}\) and \(\varvec{(M_{d}^{i})'}\) represent the normalized spatial attention weights.
These weights are then used to reweight the original features through element-wise multiplication:
where \(\times\) denotes broadcasted element-wise multiplication across all channels.
The spatially aligned features from encoder and decoder are then fused via element-wise addition:
where \(\varvec{F_{ed}^{i-1}}\) is the fused output that captures both encoder details and decoder context.
Finally, the fused map is passed through the Swin Transformer block \(g_d^{i-1}\), CW-Attention module \(f_{CW}^{i-1}\), and Patch Expanding layer \(f_{pe}^{i-1}\) to produce the decoder output for the next stage:
where \(g_d^{i-1}(\cdot )\) extracts local context, \(f_{CW}^{i-1}(\cdot )\) enhances channel discriminability, and \(f_{pe}^{i-1}(\cdot )\) upsamples and refines the spatial resolution.
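A minimal PyTorch sketch of the SW-Attention computation is given below. Several details are assumptions made for illustration only (3\(\times\)3 convolutions for \(Conv_e\)/\(Conv_d\), a mean over the channel dimension as the channel-wise global pooling, and a per-pixel softmax over the pair of masks); the released code may differ.

```python
import torch
import torch.nn as nn

class SWAttentionSketch(nn.Module):
    """Sketch of the cross-level spatial-wise attention (SW-Attention) module.
    Features are NCHW tensors; the encoder/decoder masks are built from the summed
    features and normalized jointly with a softmax at every spatial location."""

    def __init__(self, channels):
        super().__init__()
        self.conv_e = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv_d = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, f_enc, f_dec):
        s = f_enc + f_dec                               # shared evidence from both branches
        m_e = self.conv_e(s).mean(dim=1, keepdim=True)  # (B, 1, H, W) encoder mask
        m_d = self.conv_d(s).mean(dim=1, keepdim=True)  # (B, 1, H, W) decoder mask
        w = torch.softmax(torch.cat([m_e, m_d], dim=1), dim=1)  # normalize the pair per pixel
        w_e, w_d = w[:, 0:1], w[:, 1:2]
        # Broadcasted reweighting, then element-wise fusion: F_ed = w_e*F_e + w_d*F_d
        return w_e * f_enc + w_d * f_dec


# Usage on toy tensors:
sw = SWAttentionSketch(channels=96)
fused = sw(torch.randn(1, 96, 56, 56), torch.randn(1, 96, 56, 56))
print(fused.shape)  # torch.Size([1, 96, 56, 56])
```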
[Figure caption] Cross-stage channel-wise attention module: \(\varvec{F_{ed}^{j}}\) first passes through the \(j^{th}\) Swin-Transformer block (\(g_{d}^{j}\)), and the output feature map \(\varvec{G_{ed}^{j}}\) serves as input to the CW-Attention block. The CW-Attention module is divided into two parts: “Channel-Selection” and “Channel-Relation”.
Cross-stage channel-wise attention module
In MRI segmentation, redundant channel responses and inter-class similarity are common, particularly due to homogeneous tissue appearance. For example, myocardial walls and adjacent blood pools may exhibit similar intensity patterns. To address this challenge, we propose a Channel-Wise Attention (CW-Attention) module that enhances task-relevant channels while suppressing background noise.
Specifically, CW-Attention is applied between adjacent decoder stages to refine feature propagation. It consists of two parts: a channel selection module to assign importance weights, and a channel relation module to explore inter-channel dependencies.
As shown in Fig. 3, the input to CW-Attention is the output of the Swin Transformer block:
where \(g_d^j(\cdot )\) denotes the \(j^{th}\) decoder block, and \(\varvec{F_{ed}^{j-1}}\) is the fused feature map from the previous stage.
Channel-Selection Part. A global feature vector is first extracted from \(\varvec{G_{ed}^{j}}\) using global average pooling followed by a fully connected layer and sigmoid activation. This vector is used to weight each channel:
where \(H^j\) and \(W^j\) are the height and width of the feature map, and \(\times\) denotes broadcasted channel-wise multiplication.
Channel-Relation Part. To model inter-channel dependencies, we compute a channel affinity matrix via normalized dot-product:
where \(M_{ed}^{j}(m, n)\) is the affinity between channel m and n, and C is the total number of channels.
The refined feature representation from this part is:
where \(\times\) denotes matrix multiplication between the feature map and its channel affinity matrix.
Finally, the two outputs are fused to obtain the enhanced feature map:
where \(\varvec{F_d^j}\) is forwarded to the next stage of the decoder.
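The two branches of CW-Attention can be sketched as follows. The single fully connected layer, the softmax-based normalization of the dot-product affinity and the additive fusion of the two branches are assumptions for illustration, not necessarily the exact choices in our implementation.

```python
import torch
import torch.nn as nn

class CWAttentionSketch(nn.Module):
    """Sketch of the cross-stage channel-wise attention (CW-Attention) module,
    with a channel-selection branch and a channel-relation branch."""

    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Linear(channels, channels)

    def forward(self, x):                      # x: (B, C, H, W) = G_ed^j
        b, c, h, w = x.shape
        # Channel-Selection: global average pooling -> FC -> sigmoid -> channel reweighting
        gap = x.mean(dim=(2, 3))               # (B, C)
        sel = torch.sigmoid(self.fc(gap)).view(b, c, 1, 1)
        selected = sel * x
        # Channel-Relation: normalized dot-product affinity between channels
        flat = x.view(b, c, h * w)             # (B, C, HW)
        affinity = torch.softmax(flat @ flat.transpose(1, 2) / (h * w), dim=-1)  # (B, C, C)
        related = (affinity @ flat).view(b, c, h, w)
        # Fuse the two branches (assumed element-wise sum)
        return selected + related


cw = CWAttentionSketch(channels=192)
out = cw(torch.randn(1, 192, 28, 28))
print(out.shape)  # torch.Size([1, 192, 28, 28])
```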
Multi-stage scale-wise attention module
MRI anatomy typically involves structures of diverse scales, such as small myocardial walls and large ventricular cavities. Standard feature aggregation may bias toward dominant structures, ignoring small yet critical regions. To resolve this, we introduce the Scale-Wise Attention (ScaleW-Attention) module, which adaptively fuses multi-scale features and learns scale-aware channel weights to ensure balanced representation.
As shown in Fig. 2 and Fig. 4, features from decoder stages 1, 2, and 3 are aligned in resolution through upsampling or downsampling:
where \(Up2x(\cdot )\) and \(Up4x(\cdot )\) denote bilinear upsampling by 2 and 4 times, and \(Down2x(\cdot )\) and \(Down4x(\cdot )\) denote downsampling via average pooling.
Attention masks are generated for each fused map by computing channel-wise global averages followed by normalization:
where \(norm(\cdot )\) applies min-max normalization over spatial dimensions.
The original decoder features are then reweighted by their corresponding attention masks:
where \(\times\) denotes element-wise multiplication.
Finally, the segmentation prediction map is obtained by combining reweighted features across scales:
where \(conv_{1}, conv_{2}, conv_{3}\) are \(1 \times 1\) convolution layers for channel alignment, and Up2x, Up4x upsample features to the same resolution. The result \(\varvec{P}\) is passed to the final classifier for pixel-wise prediction.
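The following sketch illustrates one possible realization of ScaleW-Attention, under the assumptions that the three decoder outputs differ in resolution by factors of two, that downsampling uses average pooling and upsampling uses bilinear interpolation, and that the reweighted maps are summed at the finest resolution to form \(\varvec{P}\).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleWAttentionSketch(nn.Module):
    """Sketch of the multi-stage scale-wise attention (ScaleW-Attention) module.
    f1/f2/f3 are decoder outputs at three resolutions (f1 finest, f3 coarsest)."""

    def __init__(self, c1, c2, c3, out_channels):
        super().__init__()
        self.proj1 = nn.Conv2d(c1, out_channels, kernel_size=1)  # 1x1 convs for
        self.proj2 = nn.Conv2d(c2, out_channels, kernel_size=1)  # channel alignment
        self.proj3 = nn.Conv2d(c3, out_channels, kernel_size=1)

    @staticmethod
    def _resize(x, size):
        if x.shape[2] > size[0]:               # downsample via average pooling
            return F.adaptive_avg_pool2d(x, size)
        return F.interpolate(x, size=size, mode="bilinear", align_corners=False)

    @staticmethod
    def _mask(fused):
        m = fused.mean(dim=1, keepdim=True)    # channel-wise global average
        lo = m.amin(dim=(2, 3), keepdim=True)
        hi = m.amax(dim=(2, 3), keepdim=True)
        return (m - lo) / (hi - lo + 1e-6)     # min-max normalization over space

    def forward(self, f1, f2, f3):
        s1, s2, s3 = f1.shape[2:], f2.shape[2:], f3.shape[2:]
        p1, p2, p3 = self.proj1(f1), self.proj2(f2), self.proj3(f3)
        # Fuse all scales at each resolution, derive a mask, reweight that scale
        r1 = self._mask(p1 + self._resize(p2, s1) + self._resize(p3, s1)) * p1
        r2 = self._mask(self._resize(p1, s2) + p2 + self._resize(p3, s2)) * p2
        r3 = self._mask(self._resize(p1, s3) + self._resize(p2, s3) + p3) * p3
        # Upsample coarser maps and combine across scales to obtain P
        return r1 + self._resize(r2, s1) + self._resize(r3, s1)


scale_att = ScaleWAttentionSketch(96, 192, 384, out_channels=64)
P = scale_att(torch.randn(1, 96, 56, 56),
              torch.randn(1, 192, 28, 28),
              torch.randn(1, 384, 14, 14))
print(P.shape)  # torch.Size([1, 64, 56, 56])
```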
Experiments and analysis
Evaluation metrics and implementation details
Evaluation Metrics. We adopt Dice Similarity Coefficient (DSC) and Hausdorff Distance (HD) to quantitatively assess segmentation performance. The metrics are defined as follows:
where \(\varvec{M_{P}}\) and \(\varvec{M_{gt}}\) denote the predicted segmentation mask and the ground-truth mask, respectively, and \(d(\cdot )\) represents the Euclidean distance between two point sets.
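For reference, the two metrics can be computed for binary masks as in the short sketch below (using SciPy's directed Hausdorff distance over foreground point sets); the exact evaluation scripts used in our experiments may differ in implementation details.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice_coefficient(pred, gt):
    """DSC = 2 * |pred AND gt| / (|pred| + |gt|) for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + 1e-8)

def hausdorff_distance(pred, gt):
    """Symmetric Hausdorff distance between the foreground point sets of two masks."""
    p_pts = np.argwhere(pred > 0)
    g_pts = np.argwhere(gt > 0)
    return max(directed_hausdorff(p_pts, g_pts)[0],
               directed_hausdorff(g_pts, p_pts)[0])

# Toy example:
pred = np.zeros((8, 8), dtype=np.uint8); pred[2:6, 2:6] = 1
gt = np.zeros((8, 8), dtype=np.uint8); gt[3:7, 3:7] = 1
print(dice_coefficient(pred, gt), hausdorff_distance(pred, gt))
```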
Training Details. MSCL-SwinUNet is implemented with Python 3.8 and PyTorch 1.11.0. In the training phase, data augmentation is performed on mini-batch samples using flipping and rotation operations. The input image size is 224\(\times\)224. Our model is optimized using the SGD optimizer with a momentum of 0.9 and a weight decay of 1e-5.
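A corresponding training-setup sketch is shown below. The optimizer hyperparameters (momentum 0.9, weight decay 1e-5) and the flip/rotation augmentation follow the description above, while the rotation range, the learning rate and the stand-in model are placeholders.

```python
import torch
import torch.nn as nn
import torchvision.transforms as T

# Flip/rotation augmentation and 224x224 inputs as described above; the rotation
# range and learning rate are placeholders, not values reported in this paper.
train_transform = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.RandomVerticalFlip(p=0.5),
    T.RandomRotation(degrees=15),
    T.Resize((224, 224)),
])

model = nn.Conv2d(1, 4, kernel_size=1)   # stand-in for the MSCL-SwinUNet network
optimizer = torch.optim.SGD(model.parameters(), lr=0.05,
                            momentum=0.9, weight_decay=1e-5)
```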
Experimental results
Efficiency evaluation
To assess the computational efficiency of MSCL-SwinUNet, we measured the number of parameters (Params) and floating-point operations (FLOPs) under an input resolution of \(224 \times 224\). The comparison with SwinUNet is summarized in Table 1.
As shown, after incorporating multi-scheme attention mechanisms, the parameter count of MSCL-SwinUNet increases from 59.02M to 61.97M, and the FLOPs increase from 45.02G to 51.78G. Although this results in a moderate rise in computational cost, the substantial improvement in segmentation performance justifies the overhead. Therefore, the proposed model remains practical and feasible for real-world medical image analysis applications.
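The numbers in Table 1 can be reproduced conceptually with a few lines of code: parameters are counted directly from the model, and FLOPs are typically estimated with a third-party profiler. The snippet below is a sketch with a toy stand-in model; the `thop` profiler mentioned in the comment is one common option, not necessarily the tool we used.

```python
import torch

def count_parameters(model):
    """Number of trainable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

model = torch.nn.Conv2d(1, 4, kernel_size=3, padding=1)   # stand-in for MSCL-SwinUNet
print(f"Params: {count_parameters(model) / 1e6:.2f} M")   # the full model reports 61.97 M

# FLOPs can be estimated with a third-party profiler, e.g. (assuming `thop` is installed):
#   from thop import profile
#   flops, _ = profile(model, inputs=(torch.randn(1, 1, 224, 224),))
```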
Comparison results on ACDC
As shown in Table 2, we first present the experimental results of MSCL-SwinUNet on the ACDC dataset. Compared to previous methods, we analyze the results in the following aspects. (1) Our proposed model is built upon SwinUNet, which can therefore be regarded as our baseline model. From Table 2, we find that MSCL-SwinUNet achieves a 2.33% higher mean DSC, which proves the effectiveness of our proposed components. (2) MSCL-SwinUNet outperforms the previous state-of-the-art method (MAXFormer). This encouraging result shows the overall superiority of MSCL-SwinUNet. (3) On segmenting regions of specific categories, MSCL-SwinUNet achieves the best performance on “MYO” and “RV” and the second-best performance on “LV”.
Comparison results on MM-WHS
To further evaluate the effectiveness of our proposed method, we conduct comparative experiments on both MRI and CT images from the MM-WHS dataset. As the MM-WHS dataset is commonly used for cross-modality medical image segmentation tasks, segmentation results on MRI and CT are rarely reported in previous works. Therefore, we re-implemented SwinUNet (our baseline model) and MAXFormer (a state-of-the-art method on the ACDC dataset) for a fair comparison with our proposed MSCL-SwinUNet.
As shown in Table 3, MSCL-SwinUNet achieves a mean DSC of 90.78% on MRI data, outperforming SwinUNet by 1.97 percentage points, which demonstrates the effectiveness of the three attention modules we introduced. Moreover, compared to MAXFormer, MSCL-SwinUNet also improves the mean DSC by 0.68 percentage points on MRI. In addition, MSCL-SwinUNet achieves the highest performance on CT images as well, with a mean DSC of 91.42%, surpassing both SwinUNet and MAXFormer, further highlighting the superiority and generalizability of our approach.
To evaluate whether the performance improvement of MSCL-SwinUNet over existing methods is statistically significant, we conducted 5-fold cross-validation on the MRI subset of the MM-WHS dataset. The mIoU results were then subjected to a statistical analysis using the Mann-Whitney U test. Compared with the baseline methods SwinUNet and MAXFormer, the p-values obtained were 0.00288 and 0.00155, respectively, both of which are significantly below the standard threshold of 0.05. These results indicate that the performance improvements achieved by MSCL-SwinUNet are statistically significant, further confirming the effectiveness and robustness of the proposed method.
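The statistical test itself is straightforward to reproduce with SciPy, as sketched below; the per-fold mIoU values in the snippet are placeholders for illustration only, not our measured results.

```python
from scipy.stats import mannwhitneyu

# Placeholder per-fold mIoU values; the real values come from the 5-fold
# cross-validation on the MRI subset of MM-WHS described above.
miou_mscl_swinunet = [0.87, 0.88, 0.86, 0.89, 0.88]
miou_baseline = [0.84, 0.85, 0.83, 0.85, 0.84]

stat, p_value = mannwhitneyu(miou_mscl_swinunet, miou_baseline, alternative="two-sided")
print(f"Mann-Whitney U = {stat:.2f}, p = {p_value:.5f}")  # compare against alpha = 0.05
```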
Comparison results on Synapse
To show the generalization ability of our proposed MSCL-SwinUNet, we conduct CT image semantic segmentation experiments on the Synapse dataset. Here, DSC and HD are used as the evaluation metrics for all methods. The specific results are shown in Table 4.
According to Table 4, on the Synapse dataset, the segmentation performance of our proposed MSCL-SwinUNet is superior to that of the currently most advanced model (MAXFormer). Specifically, the DSC of MSCL-SwinUNet reaches 83.82%, which surpasses MAXFormer by 0.16%. Compared to SwinUNet, MSCL-SwinUNet outperforms it by 4.69%. For the HD value, MSCL-SwinUNet is slightly inferior to MAXFormer and achieves the second-best performance. As a whole, the encouraging results shown in Table 4 prove that our method generalizes well to the CT image semantic segmentation task.
Ablation study
To explore the separate impact of the different key components of MSCL-SwinUNet, we conduct an ablation study on the ACDC dataset. As shown in Table 5, we first reimplement the baseline model for a fair comparison. Based on the baseline SwinUNet, we add each attention module in turn to evaluate its individual effectiveness. From the table, we can draw the following points. (1) Each attention module benefits the baseline and yields a performance improvement. (2) When integrating “SW-Att.” and “CW-Att.”, the model gains a large improvement, because the two modules work together to refine the feature maps shown in Fig. 1. (3) “ScaleW-Att.” is relatively independent and effective; it utilizes an ensemble learning strategy to fuse different predictions. (4) When combining the three key attention modules, MSCL-SwinUNet achieves the best performance, which means the three components are compatible.
Visualization and analysis
To further prove the effectiveness of our proposed MSCL-SwinUNet, we conduct visualization and analysis on ACDC and MM-WHS datasets.
As shown in Fig. 5, on the ACDC dataset, we analyze the following points. (1) From the visualization results, we can intuitively observe that MSCL-SwinUNet performs better than SwinUNet and MAXFormer, especially in predicting lesion edges. (2) MSCL-SwinUNet shows obvious superiority in segmenting “MYO”, which coincides with the results shown in Table 2. (3) Compared to SwinUNet, MSCL-SwinUNet shows clearly better performance, which intuitively proves the effectiveness of our proposed three attention modules.
As shown in Fig. 6, we also visualize the results on the MR images of the MM-WHS dataset. On this dataset, MSCL-SwinUNet again shows strong performance, with accurate edges and detailed predictions.
Failure case analysis
Although MSCL-SwinUNet demonstrates superior performance on multiple MRI segmentation benchmarks, certain failure cases still emerge during evaluation. Analyzing these cases helps reveal the model’s limitations and inspires future improvements.
(1) Boundary Ambiguity: In some MRI slices, especially those with low contrast between lesions and surrounding tissues (e.g., myocardium vs. adjacent muscle), the model produces blurred or imprecise boundaries. This limitation arises because high-level features may lack fine-grained spatial detail, and the SW-Attention module, while improving information transfer, may not fully restore sharp edges under such conditions.
(2) Closely Adjacent Structures: When anatomical regions such as the left ventricle (LV) and right ventricle (RV) are very close or partially overlapping, MSCL-SwinUNet occasionally misclassifies these structures as one. This is mainly due to shared feature patterns and insufficient category separation in channel space, despite the use of CW-Attention.
(3) Small Target Omission: For small organs or regions-such as thin myocardial walls or minor structural features-our model may fail to segment them accurately. Although the ScaleW-Attention module facilitates multi-scale fusion, small-scale signals can still be overwhelmed by dominant features during deep decoding.
These observations suggest future improvements could include edge-guided supervision to enhance boundary awareness, instance-aware attention to better separate adjacent structures, and adaptive resolution refinement for better small object detection. We will pursue these directions in our subsequent research.
Limitations and future directions
Although MSCL-SwinUNet has demonstrated state-of-the-art performance across multiple MRI benchmark datasets, several limitations remain that warrant further investigation. First, the introduction of multiple attention mechanisms– including SW-Attention, CW-Attention, and ScaleW-Attention– enhances the model’s representational capacity but also incurs significant computational overhead. This may hinder its deployment in real-time or resource-constrained clinical environments. Second, the datasets used in this study (e.g., ACDC and MM-WHS) are derived from relatively homogeneous patient populations and acquired under consistent imaging conditions, which may introduce implicit biases and limit the model’s generalizability to diverse clinical cohorts and imaging devices. Furthermore, while preliminary cross-modality experiments on the Synapse CT dataset provide evidence of the model’s transferability, comprehensive evaluations on a broader range of modalities– such as ultrasound and X-ray– as well as larger and more heterogeneous datasets are still needed to fully establish its robustness and adaptability.
To address these limitations, future work will proceed along several key directions. First, we plan to conduct systematic cross-modality validation across diverse imaging types, incorporating unsupervised domain adaptation techniques to quantitatively assess the generalizability of the proposed attention mechanisms. Second, inspired by recent advances in medical foundation models, we aim to extend MSCL-SwinUNet into a unified backbone for multi-task learning, enabling large-scale pretraining that jointly optimizes segmentation, detection, and registration tasks across heterogeneous imaging corpora. Third, we intend to incorporate meta-learning techniques to handle noisy or weak labels often found in real-world MRI annotations, such as sample reweighting or learned label correction. Fourth, we will explore continual learning strategies, such as task-incremental learning and selective multi-task coordination, to support progressive model adaptation without catastrophic forgetting.
Lastly, to meet the real-time inference requirements of clinical point-of-care applications, we will develop lightweight and quantized variants of the network, thereby enabling low-latency deployment under stringent computational constraints. To more explicitly address the computational cost and model complexity introduced by our multi-attention design, we propose several concrete optimization strategies as future directions. First, attention modules can be selectively deployed at semantically critical stages, rather than uniformly applied across all layers, thereby reducing redundancy. Second, lightweight Transformer variants (e.g., MobileViT, EfficientFormer) may be employed to replace standard Swin Transformer blocks in early or less feature-rich layers. Third, resolution-aware downsampling of feature maps before attention computation– particularly within SW-Attention and ScaleW-Attention– can significantly reduce spatial complexity with minimal impact on representation quality. Additionally, shared-parameter designs for repeated attention structures and post-training compression techniques such as quantization and structured pruning are also under consideration. Collectively, these strategies aim to improve model efficiency while maintaining high segmentation accuracy, thereby facilitating practical deployment in real-time and resource-constrained clinical environments.
Conclusion
This paper proposes MSCL-SwinUNet, an enhanced variant of the SwinUNet model. The proposed architecture incorporates a Cross-Level Spatial-Wise Attention (SW-Attention) module, which replaces conventional skip connections with a learnable mechanism to more effectively transfer fine-grained details from encoder to decoder. Additionally, a Cross-Stage Channel-Wise Attention (CW-Attention) module is integrated into the decoder to strengthen the connections between low-level and high-level feature representations. Furthermore, a Multi-Stage Scale-Wise Attention (ScaleW-Attention) module is introduced to improve multi-scale prediction through an ensemble learning strategy. Extensive experiments on several benchmark medical image segmentation datasets-including ACDC, MM-WHS, and Synapse-demonstrate that MSCL-SwinUNet significantly outperforms both the baseline SwinUNet and the state-of-the-art MAXFormer in segmentation accuracy, while also exhibiting strong generalization capability on CT modalities. The design of MSCL-SwinUNet is driven by the need to address two major limitations of existing Transformer-based medical segmentation models: inadequate modeling of anatomical details and the absence of MRI-specific attention integration. By embedding SW-Attention, CW-Attention, and ScaleW-Attention modules, the proposed framework enhances cross-level feature learning and enables modality-aware attention embedding. Experimental results on public MRI datasets, supported by attention map visualizations, further confirm that these modules not only improve segmentation performance but also enhance model interpretability and clinical applicability. These findings validate the rationale behind our design and offer new insights into the development of modality-specialized medical image segmentation networks.
Data availability
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.
References
Zhuang, X. & Shen, J. Multi-scale patch and multi-modality atlases for whole heart segmentation of MRI. Med. Image Anal.31, 77–87 (2016).
Borges, P. et al. Acquisition-invariant brain MRI segmentation with informative uncertainties. Med. Image Anal.92, 103058 (2024).
Chen, Z. et al. Enhancing cardiac MRI segmentation via classifier-guided two-stage network and all-slice information fusion transformer. In Medical Image Computing and Computer-Assisted Intervention (MICCAI) 14313, 145–154 (2023).
Ronneberger, O., Fischer, P. & Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI) 9351, 234–241 (2015).
Zhou, Z. et al. Unet++: A nested u-net architecture for medical image segmentation. In Deep Learning in Medical Image Analysis - and - Multimodal Learning for Clinical Decision Support 11045, 3–11 (2018).
Zhao, Q., Lyu, S., Bai, W. et al. A multi-modality ovarian tumor ultrasound image dataset for unsupervised cross-domain semantic segmentation. CoRR abs/2207.06799 (2022).
Zhang, Y., Liu, H. & Hu, Q. Transfuse: Fusing transformers and cnns for medical image segmentation. In Medical Image Computing and Computer Assisted Intervention (MICCAI) 12901, 14–24 (2021).
Cao, H., Wang, Y., Chen, J. et al. Swin-unet: Unet-like pure transformer for medical image segmentation. In European Conference on Computer Vision Workshops (ECCVW) (2022).
Chen, J., Lu, Y., Yu, Q. et al. Transunet: Transformers make strong encoders for medical image segmentation. CoRR abs/2102.04306 (2021).
Liu, Z., Lin, Y., Cao, Y. et al. Swin transformer: Hierarchical vision transformer using shifted windows. In IEEE International Conference on Computer Vision (ICCV), 9992–10002 (2021).
Bernard, O. et al. Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: Is the problem solved?. IEEE Transactions on Med. Imaging (TMI) 37, 2514–2525 (2018).
Zhuang, X. & Shen, J. Multi-scale patch and multi-modality atlases for whole heart segmentation of MRI. Med. Image Anal.31, 77–87 (2016).
Wu, F. & Zhuang, X. CF distance: A new domain discrepancy metric and application to explicit domain adaptation for cross-modality cardiac image segmentation. IEEE Transactions on Med. Imaging (TMI) 39, 4274–4285 (2020).
Shelhamer, E., Long, J. & Darrell, T. Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 39, 640–651 (2017).
Li, X. et al. H-denseunet: Hybrid densely connected unet for liver and tumor segmentation from CT volumes. IEEE Transactions on Med. Imaging (TMI) 37, 2663–2674 (2018).
Huang, H., Lin, L., Tong, R. et al. Unet 3+: A full-scale connected unet for medical image segmentation. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1055–1059 (2020).
Alam, S. et al. Brain tumor segmentation from multiparametric MRI using a multi-encoder u-net architecture. In Medical Image Computing and Computer-Assisted Intervention (MICCAI) 12963, 289–301 (2021).
Yu, X. et al. Cardiac LGE MRI segmentation with cross-modality image augmentation and improved u-net. IEEE Journal of Biomedical and Health Informatics (JBHI) 27, 588–597 (2023).
Schlemper, J. et al. Attention gated networks: Learning to leverage salient regions in medical images. Med. Image Anal.53, 197–207 (2019).
Roy, A. G., Navab, N. & Wachinger, C. Concurrent spatial and channel ’squeeze & excitation’ in fully convolutional networks. In Medical Image Computing and Computer Assisted Intervention (MICCAI) 11070, 421–429 (2018).
Gu, R. et al. Ca-net: Comprehensive attention convolutional neural networks for explainable medical image segmentation. IEEE Transactions on Med. Imaging (TMI) 40, 699–711 (2021).
Kaul, C., Manandhar, S. & Pears, N. E. Focusnet: An attention-based fully convolutional network for medical image segmentation. In International Symposium on Biomedical Imaging (ISBI), 455–458 (2019).
Paschali, M. et al. 3dq: Compact quantized neural networks for volumetric whole brain segmentation. In Medical Image Computing and Computer Assisted Intervention (MICCAI) 11766, 438–446 (2019).
Wang, Z. et al. Non-local u-nets for biomedical image segmentation. Proc. AAAI Conf. Artif. Intell. https://doi.org/10.1609/aaai.v34i04.6100 (2020).
Chen, X., Zhang, R. & Yan, P. Feature fusion encoder decoder network for automatic liver lesion segmentation. In International Symposium on Biomedical Imaging (ISBI), 430–433 (2019).
Munia, A. A. et al. Attention-guided hierarchical fusion u-net for uncertainty-driven medical image segmentation. Inf. Fusion115, 102719 (2025).
Nisha, J. S. Brain tumor segmentation using multi-scale attention U-Net with EfficientNetB4 encoder for enhanced MRI analysis. Sci. Rep. 15, 9914 (2025).
Pan, P., Zhang, C., Sun, J. & Guo, L. Multi-scale conv-attention u-net for medical image segmentation. Sci. Rep. 15, 12041 (2025).
Dosovitskiy, A., Beyer, L., Kolesnikov, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR) (2021).
Liu, Q., Xu, Z., Bertasius, G. et al. Simpleclick: Interactive image segmentation with simple vision transformers. In IEEE International Conference on Computer Vision (ICCV), 22233–22243 (2023).
Zhang, X. et al. SMTF: Sparse transformer with multiscale contextual fusion for medical image segmentation. Biomed. Signal Process. Control87, 105458 (2024).
Shen, T. & Xu, H. Medical image segmentation based on transformer and hardnet structures. IEEE Access 11, 16621–16630 (2023).
Chen, L., Wang, T. & Ge, H. Tmtrans: texture mixed transformers for medical image segmentation. AI Commun. 36, 325–340 (2023).
Li, S., Sui, X., Luo, X. et al. Medical image segmentation using squeeze-and-expansion transformers. In International Joint Conference on Artificial Intelligence (IJCAI), 807–815 (2021).
Liang, Z. et al. Maxformer: Enhanced transformer for medical image segmentation with multi-attention and multi-scale features fusion. Knowl.-Based Syst.280, 110987 (2023).
Xie, E. et al. Segformer: Simple and efficient design for semantic segmentation with transformers. In Advances in Neural Information Processing Systems (NeurIPS) 34, 12077–12090 (2021).
Wang, S. et al. Triad: Vision foundation model for 3d magnetic resonance imaging. arXiv preprint arXiv:2502.14064 (2025).
Liu, Y. et al. Transunet+: Redesigning the skip connection to enhance features in medical image segmentation. Knowl.-Based Syst.256, 109859 (2022).
Wang, H., Xie, S., Lin, L. et al. Mixed transformer u-net for medical image segmentation. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2390–2394 (2022).
Rahman, M. M. & Marculescu, R. Medical image segmentation via cascaded attention decoding. In IEEE Winter Conference on Applications of Computer Vision, 6211–6220 (2023).
Milletari, F., Navab, N. & Ahmadi, S. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In International Conference on 3D Vision (3DV), 565–571 (2016).
Fu, S. et al. Domain adaptive relational reasoning for 3d multi-organ segmentation. In Medical Image Computing and Computer Assisted Intervention (MICCAI) 12261, 656–666 (2020).
Heidari, M., Kazerouni, A., Kadarvish, M. S. et al. Hiformer: Hierarchical multi-scale representations using transformers for medical image segmentation. In IEEE Winter Conference on Applications of Computer Vision (WACV), 6191–6201 (2023).
Azad, R. et al. Dae-former: Dual attention-guided efficient transformer for medical image segmentation. In Medical Image Computing and Computer Assisted Intervention (MICCAI) 14277, 83–95 (2023).
Acknowledgements
This study was supported by the Research Start-up Fund Project for High-level Talents Batch II in 2024 of Chengdu Aeronautic Polytechnic (Grant No. ZZX0624158) and the Sichuan Provincial Science and Technology Achievement Transformation Demonstration Project (Grant No. 2022ZHCG0060).
Author information
Authors and Affiliations
Contributions
Qiang Wang: Conceptualization, Methodology, Writing - Original Draft, Project administration. Yongchong Xue: Conceptualization, Data Curation, Formal analysis, Validation, Visualization, Software.
Corresponding author
Ethics declarations
Competing interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Wang, Q., Xue, Y. Multi-scheme cross-level attention embedded U-shape transformer for MRI semantic segmentation. Sci Rep 15, 22891 (2025). https://doi.org/10.1038/s41598-025-06966-y