Abstract
Despite significant breakthroughs in deep learning, brain tumor segmentation remains a challenging task due to unclear tumor borders and the high degree of accuracy required. To address these challenges, we propose a new segmentation model, SwinCLNet, which integrates window-based multi-head self-attention, shifted window multi-head self-attention, cross-scale dual fusion, and residual large-kernel attention into the 3D U-Net architecture. First, the encoder employs the window-based and shifted window multi-head self-attention modules to capture rich contextual information. Second, the decoder employs the cross-scale dual fusion module, which precisely complements tumor boundary representation by fusing these enhanced features. Third, SwinCLNet applies the residual large-kernel attention module over skip connections, using large-kernel attention to expand the receptive field and capture long-range spatial dependencies. Testing on the BraTS2023 and 2024 datasets demonstrated that the proposed SwinCLNet model achieves excellent performance in terms of the Dice score and Hausdorff distance across all brain tumor segmentation regions. In particular, the proposed model increased the average Dice score by approximately 4.53% and reduced the 95th-percentile Hausdorff distance by approximately 30.89% compared with the average of the benchmark models. These results demonstrate that the SwinCLNet model is particularly effective in the difficult tumor core and enhancing tumor regions.
Introduction
Background
Recent developments in deep learning have enabled models to automatically extract intricate and hierarchical features from image data, and have had a major influence, particularly on medical image segmentation. For example, models based on convolutional neural networks (CNNs) have been used regularly in medical image segmentation1,2 owing to their ability to capture local spatial patterns effectively. More recently, the Transformer3,4, originally introduced in natural language processing, has been gaining attention for its ability to capture long-range global contextual information. These advances have made it possible to characterize brain lesions more accurately.
However, these existing medical image segmentation models still have several limitations: i) they have low practicality in handling high-resolution data, such as 3D medical images, because of their limited receptive field or high computational complexity3, ii) they are still not accurate enough to handle complex and diverse segmentation tasks and are not robust enough to overcome minor noise or changes, iii) their interpretability remains insufficient for seamless clinical adoption, which restricts widespread trust and utilization in healthcare settings, iv) they demand a large amount of computational power to process multimodal data or high-resolution 3D images, which makes it difficult to effectively apply such models to current clinical systems, and v) in particular, brain tumor image segmentation remains a very challenging task. Because brain tumors are highly irregular, intensity varies widely across modalities and boundaries are often ambiguous owing to the simultaneous presence of multiple tumor subregions, such as edema, necrosis, and enhancing core. In clinical settings, robust models that can produce reliable and interpretable results are critical.
Motivations and contributions
In this study, we specifically aimed to improve the accuracy and reliability of brain tumor segmentation with a new model. This new hybrid segmentation model combines the benefits of a CNN-based structure and a transformer-based structure. The main contributions of this study are summarized as follows.
- First, we proposed a novel segmentation model, SwinCLNet, which is based on 3D U-Net and optimizes both local and global information processing. This was accomplished by methodically incorporating an advanced fusion block and a hybrid attention mechanism into the encoder-decoder framework.
- Second, we designed a new attention block that alternates between window-based multi-head self-attention (W-MSA) and shifted window-based multi-head self-attention (SW-MSA), effectively capturing both local and global contextual information through cross-window information exchange. This alternating approach overcomes the limited receptive field of local attention while preserving its computational efficiency.
- Third, we developed a new cross-scale dual fusion (CSDF) module, which fuses multi-resolution features in the decoder, enhancing boundary precision and overall segmentation performance. By adaptively aligning features from the upsampling path and the skip connection, this module improves the accuracy of tumor boundary reconstruction.
- Fourth, we designed a new residual large-kernel attention (RLKA) module and applied it to the skip connections, further enhancing feature representation by expanding the receptive field. With its residual connection, the RLKA module refines features without the risk of information loss.
- Finally, the proposed approach provides a useful hybrid technique that successfully exploits the robust feature integration of the CSDF module, the broad receptive field of the RLKA module, and the detailed feature extraction of the W-MSA/SW-MSA blocks. Through extensive evaluations using the BraTS2023 and 2024 datasets, we showed that the proposed model, SwinCLNet, achieves highly competitive segmentation performance across all tumor subregions in terms of the Dice score and Hausdorff distance 95th percentile (HD95), outperforming the benchmarked models. The proposed model recorded an average Dice score of 88.31%, an increase of approximately 4.53%, and reduced the HD95 by approximately 30.89% compared with the average of the benchmark models. The model’s efficacy was especially noticeable in the difficult tumor core (TC) and enhancing tumor (ET) regions.
Literature review
2D U-Net-based architectures
Recent research has focused on evolving the core U-Net architecture itself, particularly for 2D segmentation tasks. For example, Xiong et al.5 suggested leveraging large-scale foundation models by integrating the powerful Segment Anything Model (SAM)6 as a pre-trained encoder to enhance feature extraction capabilities for both natural and medical images. Research has also focused on modifying the skip connections in the 2D U-Net design. For example, Peng et al.7 introduced U-Net v2, which re-examines the design of skip connections to optimize feature fusion and gradient flow by infusing semantic information from high-level features into low-level features.
3D U-Net-based architectures
A prominent 3D U-Net 8 variant, nnU-Net 2, automatically adapts preprocessing, architecture, and training configurations based on dataset characteristics to achieve strong performance across various benchmarks. Fu et al. 9 designed a modality-specific, multi-encoder nnU-Net by independently processing each magnetic resonance imaging (MRI) modality before fusion, thereby significantly improving the segmentation accuracy for primary brain lymphoma in post-contrast T1-weighted clinical MRI. Spronck et al. 10 adapted nnU-Net for whole-slide image segmentation by incorporating domain-specific hyperparameter tuning, pathology-guided color augmentation, an efficient whole-slide imaging inference pipeline, and uncertainty-aware inference, thereby achieving state-of-the-art (SOTA) performance in computational pathology challenges.
While 3D U-Net architectures built from stacked 3D convolutions8 provide a strong baseline, their inherently local receptive fields limit their ability to capture global contextual information. This emerges as a core shortcoming on the BraTS dataset. BraTS tumors are characterized by high heterogeneity, irregular shapes, and ambiguous boundaries between multiple subregions such as enhancing tumor, necrosis, and peritumoral edema. Distinguishing these complex and often infiltrating structures requires modeling long-range spatial dependencies across the entire volume, which standard CNNs, with their limited receptive fields, struggle to capture effectively.11,12
Transformer-based architectures
Recently, transformer-based architectures have emerged as alternatives to convolutional models owing to their superior capability in modeling global contextual information. TransUNet 3 pioneered the integration of vision transformers with U-Net by replacing the encoder with a transformer to leverage long-range feature dependencies. Chowdary et al. 13 proposed a transformer-based classification architecture by introducing dual-path local–global transformer and spatial attention fusion modules, which, when combined, enhance feature extraction at both the local and global levels, significantly improving the classification performance across multiple medical imaging datasets. Wang et al. 14 developed a hybrid model combining transformer blocks with convolutional encoders and downsamplers, using cross-phase tokens for multi-phase computed tomography (CT) fusion and a novel preprocessing unit to address annotation limitations, achieving approximately 90.9% accuracy in liver lesion classification. Oghbaie et al. 15 proposed an end-to-end transformer-based classification framework that handles variable-length volumetric medical data by introducing a volume-wise resolution randomization strategy and slice-level positional embedding, achieving a 21.96% improvement in the balanced accuracy for retinal optical coherence tomography classification tasks. Liao et al. 16 introduced the transformer-based annotation bias-aware (TAB) model, which employs learnable query-based transformers and multivariate Gaussian modeling to explicitly account for annotator preference and stochastic errors, thereby mitigating annotation bias and improving segmentation consistency on multi-annotator OD/OC datasets. Yang et al. 17 proposed DAST, a differentiable neural architecture search method that jointly optimizes transformer and convolutional operations under graphics processing unit memory constraints, uncovering optimal layer combinations and achieving SOTA performance on the KiTS’19 kidney CT segmentation challenge. Similarly, Zhou et al.18 proposed nnFormer, which integrates interleaved Transformer blocks within the robust nnU-Net framework to effectively capture both local and global dependencies in 3D volumetric data. Although transformer models can successfully learn the global contextual information, current transformers incur significant computational costs for high-resolution imaging such as 3D MRI.
Mamba-based architectures
Recently, State-Space Models (SSMs), particularly Mamba, have emerged as a compelling alternative to Transformers for modeling long-range dependencies. Transformers are limited by quadratic complexity (\(O(N^2)\)) with respect to sequence length, whereas Mamba-based architectures achieve linear complexity (\(O(N)\)) while effectively capturing global context using a selective SSM mechanism. This efficiency is particularly advantageous for high-resolution 3D medical images. Ma et al.19, for instance, embed Mamba blocks within a U-Net structure to enhance long-range feature capture for biomedical images. Gong et al.20 specifically adapt this concept for 3D volumetric segmentation, combining Mamba’s efficiency with the robust nnU-Net2 framework. Liu et al.21 explore VMamba as a general-purpose vision backbone, and Ruan et al.22 propose VM-UNet, a U-shaped architecture for medical image segmentation.
Hybrid architectures
Some studies have combined U-Net with CNNs and transformers to balance local features with long-range dependencies in complex 3D data. Ma et al. 23 improved the performance of U-Net by designing encoders based on 3D shifted local and global transformers, together with 3D channel and spatial attention supervision modules. Liu et al. 24 attempted to improve performance by partially using a transformer-based attention mechanism while preserving the topology of U-Net. Hao et al. 4 designed a memory-efficient transformer-based segmentation architecture by introducing a depthwise separable shuffled convolution (DSPConv), an efficient vector aggregation attention, and a serial multi-head attention module, significantly reducing parameter count, memory usage, and computational complexity while preserving accuracy on the ACDC and Hippocampus datasets. Phan et al. 25 proposed UNest, a transformer architecture for unpaired medical image synthesis that incorporates structural inductive biases by leveraging foreground masks from a segment-anything model-based extractor and applying structural attention within anatomically relevant regions, resulting in up to 19.30% improvements across MRI, CT, and positron emission tomography (PET) synthesis tasks. Yang et al. 26 proposed a region attention transformer, which dynamically segments the input into semantic regions using the segment-anything model and applies region-based multi-head self-attention (MHA) within those partitions. This reduces interference from irrelevant image areas and introduces a focal region loss to enhance restoration in high-difficulty regions, achieving SOTA results in PET synthesis, CT denoising, and pathology super-resolution. Lin et al. 27 introduced a dynamic self-attention sparsification method for medical transformers by grouping similar feature tokens into feature prototypes during attention computation. This dependency distillation mechanism significantly reduces computational complexity while improving model performance as a plug-and-play module. Roy et al. 28 developed MedNeXt, a fully ConvNeXt-inspired 3D encoder-decoder architecture with large-kernel residual blocks and compound scaling, which mirrors transformer-like performance improvements in a convolutional framework. This model achieved SOTA segmentation accuracy on multiple CT and MRI tasks. Xiao et al. 29, for instance, proposed a hybrid model balancing local CNN features with global Transformer modeling. It enhances the U-Net encoder with residual connections and replaces skip connections with a dual attention mechanism to better capture dependencies, while an efficient channel attention fusion method in the decoder suppresses redundant features. Following a similar hybrid philosophy, Zhang et al. 30 introduced CI-UNet, which utilizes ConvNeXt as its encoder to amalgamate computational efficiency and feature extraction, while an advanced cross-dimensional attention mechanism captures intricate global context. Similarly, LPF-Net 31 addresses the limitations of pure CNNs and Transformers by proposing a dual-branch encoder to extract both local details via a CNN and global context dependencies via a Swin Transformer, which are then combined using a progressive fusion strategy.
Some hybrid models focus on sophisticated feature fusion to handle boundary segmentation challenges in multi-scale regions. Xiao et al. 32, for example, proposed a deep fusion of weak edge and context features. Their method uses a novel weak edge detection module (Otsu-WD) alongside a parallel GRU-based network to preserve edge context. A maximum index fusion mechanism then merges these details with multilayer context features, preventing their loss and achieving clearer boundary segmentation. Zhang et al. 33 proposed a unique approach for meningioma grading by integrating ViT and CNN architectures. Their model combines radiomics and deep learning to leverage the potential value of peritumoral edema (PTE) regions, enhancing grading accuracy. PIF-Net 34 uses hierarchical parallel paths with dense cross-path connections to enhance feature representation and filter ambiguities. DFPNet 35 also addresses complex boundaries by using two subnetworks: a top-down feature-steered network to capture context and a bottom-up border network to optimize boundary features.
Other hybrid models focus specifically on redesigning the skip connections to improve feature fusion. Zhang et al. 36, for instance, argue that traditional skip connections may limit contextual information capture. Their proposed MRC-TransUNet combines a lightweight ViT with a U-Net, using skip connections only in the first layer and applying reciprocal attention modules in subsequent layers to compensate for detail loss. Similarly, Wen et al. 37 enhance the skip pathways in their 3D hybrid model, MRCM-UCTransNet, by integrating a multi-scale residual convolution module (MRCM) into the UCTransNet architecture to improve feature extraction.
Together with these advanced segmentation architectures, advancements in optimization for tensor-based medical data include the Step Gravitational Search Algorithm (SGSA) proposed by Fan et al. 38. This metaheuristic addresses inefficient training and high feature redundancy in Support Tensor Train Machines (STTM) by performing synchronous feature selection and parameter optimization on MRI datasets.
Methodology
Overall proposed model architecture
The overall structure of the proposed SwinCLNet model is illustrated in Fig. 1. The proposed model adopts a five-stage encoder-decoder structure based on 3D U-Net. The core of this model is the systematic integration of the proposed W-MSA/SW-MSA, RLKA, and CSDF modules into the encoder, skip connection, and decoder paths, respectively, to maximize feature representation and fusion capabilities. The encoder gradually downsamples the input 3D volume and extracts hierarchical features. At each of the five encoder stages, the feature representation is enhanced by a key attention mechanism that alternates between W-MSA and SW-MSA. The W-MSA captures the rich local context, while the SW-MSA enables cross-window connections to learn the global context. The output of this attention block then passes through the RLKA module in the skip connection path. The RLKA efficiently captures long-range dependencies to further enrich the features, and this refined feature map is delivered to the decoder. The decoder gradually restores the resolution of the feature map through upsampling, with each decoder stage using the CSDF module. The CSDF effectively fuses the features delivered from the lower layer with the features delivered from the RLKA, and the result finally passes through a segmentation layer to generate the tumor subregion prediction for each voxel.
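To make this data flow concrete, the following PyTorch-style sketch outlines how the encoder stages, the RLKA-refined skip connections, and the CSDF-based decoder stages could be wired together. The module interfaces (window-attention encoders, RLKA, CSDF with signature dec(low_res, skip)) and the wiring details are illustrative assumptions, not the authors' released implementation.

```python
import torch.nn as nn

class SwinCLNetSketch(nn.Module):
    """High-level sketch of the SwinCLNet data flow (illustrative only).
    Assumes hypothetical submodules: per-stage encoders that alternate
    W-MSA/SW-MSA, one RLKA block per skip connection, and CSDF-based
    decoder stages with the call signature dec(low_res, skip)."""

    def __init__(self, encoders, rlkas, decoders, seg_head):
        super().__init__()
        self.encoders = nn.ModuleList(encoders)   # downsampling + window attention
        self.rlkas = nn.ModuleList(rlkas)         # one RLKA per skip connection
        self.decoders = nn.ModuleList(decoders)   # CSDF fusion per decoder stage
        self.seg_head = seg_head                  # e.g., 1x1x1 conv to class logits

    def forward(self, x):
        skips = []
        for i, enc in enumerate(self.encoders):   # encoder path (last stage = bottleneck)
            x = enc(x)
            if i < len(self.rlkas):
                skips.append(self.rlkas[i](x))    # refine skip features with RLKA
        for dec, skip in zip(self.decoders, reversed(skips)):
            x = dec(x, skip)                      # CSDF aligns and fuses the two scales
        return self.seg_head(x)                   # per-voxel tumor subregion logits
```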
SW-MSA/W-MSA blocks
The core of the proposed model’s encoder is a 3D window-based self-attention mechanism designed to efficiently capture both local and global contextual information. This is achieved by alternating between two block types, SW-MSA and W-MSA, as shown in Fig. 2. By alternating these W-MSA and SW-MSA blocks, the model effectively learns a rich hierarchy of local and global features.
SW-MSA
To enable cross-window connections, the SW-MSA block applies a full self-attention process on a cyclically shifted version of the feature map. The detailed operation is as follows:
1. Cyclic Shift. Before any other operation, the input feature map is cyclically shifted along the depth, height, and width axes by half the window size, \(\left( \frac{w_d}{2}, \frac{w_h}{2}, \frac{w_w}{2}\right)\).
2. Window Partitioning. The shifted tensor \(\textbf{X} \in \mathbb {R}^{B \times C \times D \times H \times W}\) is partitioned into a batch of windows, resulting in a sequence-like tensor \(\textbf{X}_{\text {win}} \in \mathbb {R}^{(B \cdot n_{\text {win}}) \times N \times C}\), where \(n_{\text {win}}\) is the number of windows and N is the number of voxels per window.
3. Feature Embedding and MHA. The features are first projected from dimension C to \(d_{\text {model}}\) via a linear layer. Then, a standard pre-layer normalization (LN) MHA operation with a residual connection is performed:
4. Gated Depthwise Feed-Forward Network (GDFFN). The output from the attention block is then passed through a GDFFN. Since the proposed GDFFN operates on 3D volumes, the windowed features \(\hat{\textbf{X}}_{\text {emb}}\) are temporarily reverted to their 3D shape, processed by the GDFFN, and then partitioned back into the window sequence format. This step also includes a residual connection:
5. Reverse and Final Residual Connection. The processed features are projected back to the original channel dimension C and reassembled into a full 3D tensor \(\textbf{X}_{\text {proj}}\). A final residual connection merges these learned features with the original input tensor \(\textbf{X}\) via a 3D convolution, enhancing feature integration:
6. Reverse Shift. After the full attention and GDFFN operations, the resulting feature map is shifted back to its original alignment. This process forces attention to be computed across the boundaries of the original, non-shifted windows.
W-MSA
The W-MSA block efficiently computes self-attention within fixed, non-overlapping windows. Its operation is identical to the SW-MSA process detailed above and illustrated in Fig. 2 but omits the initial Cyclic Shift and the final Reverse Shift steps.
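As an illustration of the alternating window attention described above, the sketch below implements 3D window partitioning, the optional cyclic shift via torch.roll, and a pre-LN multi-head attention step with a residual connection. The feature projection to \(d_{\text {model}}\), the GDFFN, the reverse partition/shift, and the attention mask used for shifted windows are omitted for brevity; all names are assumptions for illustration rather than the authors' code.

```python
import torch
import torch.nn as nn

def window_partition_3d(x, ws):
    """Split a (B, C, D, H, W) tensor into non-overlapping ws**3 windows,
    returning a sequence-like tensor of shape (B * n_windows, ws**3, C).
    Assumes D, H, and W are divisible by the window size ws."""
    B, C, D, H, W = x.shape
    x = x.view(B, C, D // ws, ws, H // ws, ws, W // ws, ws)
    x = x.permute(0, 2, 4, 6, 3, 5, 7, 1).contiguous()  # (B, nD, nH, nW, ws, ws, ws, C)
    return x.view(-1, ws ** 3, C)

class WindowMSA3D(nn.Module):
    """Minimal sketch of one (S)W-MSA step: optional cyclic shift, window
    partitioning, and pre-LN multi-head attention with a residual connection."""

    def __init__(self, dim, num_heads, window_size=4, shifted=False, dropout=0.1):
        super().__init__()
        self.ws = window_size
        self.shift = window_size // 2 if shifted else 0
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, dropout=dropout,
                                          batch_first=True)

    def forward(self, x):                        # x: (B, C, D, H, W)
        if self.shift:
            # Cyclic shift by half the window size along depth, height, and width.
            x = torch.roll(x, shifts=(-self.shift,) * 3, dims=(2, 3, 4))
        win = window_partition_3d(x, self.ws)    # (B * n_win, N, C)
        h = self.norm(win)                       # pre-layer normalization
        out, _ = self.attn(h, h, h)              # window-local self-attention
        return win + out                         # residual connection
```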
RLKA on skip connections
To further enhance the contextual representation of features on the skip connections, we introduced an RLKA module, which is shown in Fig. 3a. This module is applied after the W-MSA/SW-MSA blocks and is composed of a large-kernel attention (LKA) block within an external residual framework. The module emulates a self-attention mechanism by aggregating long-range spatial context and projecting channel information. However, standard large-kernel 3D convolutions are computationally expensive. Accordingly, the proposed module employs decomposed convolutions to address this limitation, thereby achieving both high efficiency and an expanded receptive field. The internal operation of the LKA block is a four-step process.
1. Axial Decomposition. The input feature map \(\textbf{t}\) is processed by three parallel large-kernel depthwise convolutions, each operating along a single spatial axis: depth (d), height (h), and width (w).
2. Feature Aggregation. The outputs from these three axial convolutions are aggregated via element-wise summation:
3. Channel Mixing. The aggregated feature map \(\textbf{U}\) is then passed through a \(1 \times 1 \times 1\) point-wise convolution for channel-wise feature fusion.
4. Residual Connections. Finally, the output of the point-wise convolution is added back to the original input tensor \(\textbf{t}\), defining the complete LKA operation, \(\mathcal {F}_{\text {LKA}}(\cdot )\):
This internal residual connection (Eq. 5) is critical for stabilizing the training of this deep transformation path (Steps 1-3), ensuring that the block can easily learn an identity mapping and preventing gradient vanishing. This entire LKA block is wrapped by the main, external residual connection of the RLKA module. The final output \(\textbf{Y}\) is the element-wise sum of the input feature map \(\textbf{t}\) and the output of the LKA function, preserving the original features:
\(\textbf{Y} = \textbf{t} + \mathcal {F}_{\text {LKA}}(\textbf{t}).\)
This external residual connection allows the LKA to learn an additive refinement, focusing only on the long-range spatial context.
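A minimal PyTorch sketch of the RLKA module, under the description above and the reported axial kernel size of 9, is given below. The parallel-then-sum arrangement of the three axial depthwise convolutions follows Steps 1-2; the exact layer ordering and initialization are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class RLKA(nn.Module):
    """Sketch of the residual large-kernel attention (RLKA) module: three
    axial depthwise convolutions (summed), a 1x1x1 point-wise convolution,
    an internal residual inside the LKA, and an external residual around it."""

    def __init__(self, channels, k=9):
        super().__init__()
        p = k // 2
        # Axial large-kernel depthwise convolutions along depth, height, width.
        self.conv_d = nn.Conv3d(channels, channels, (k, 1, 1), padding=(p, 0, 0),
                                groups=channels)
        self.conv_h = nn.Conv3d(channels, channels, (1, k, 1), padding=(0, p, 0),
                                groups=channels)
        self.conv_w = nn.Conv3d(channels, channels, (1, 1, k), padding=(0, 0, p),
                                groups=channels)
        # Point-wise convolution for channel mixing.
        self.pw = nn.Conv3d(channels, channels, 1)

    def lka(self, t):
        u = self.conv_d(t) + self.conv_h(t) + self.conv_w(t)  # axial aggregation
        return t + self.pw(u)                                  # internal residual

    def forward(self, t):
        return t + self.lka(t)                                 # external residual

# Usage: refine a 32-channel skip-connection feature map.
skip = torch.randn(1, 32, 32, 32, 32)
refined = RLKA(32)(skip)
```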
CSDF block
The proposed CSDF block enhances multi-scale feature fusion in the decoder by spatially aligning features from different resolutions. This alignment is essential because the decoder features \(\textbf{F}_{\text {low}}\) carry coarse semantic context from deeper layers, whereas the encoder features \(\textbf{F}_{\text {high}}\) preserve fine-grained spatial detail. By aligning the semantic context with structural details, the CSDF process contributes to more accurate boundary delineation, ultimately improving the segmentation of complex anatomical regions. The architecture of the CSDF module is shown in Fig. 3b.
1. Upsampling Refinement. The low-resolution decoder features \(\textbf{F}_{\text {low}}\) are passed through an upsampling refinement block. This block consists of trilinear interpolation, followed by a 3D convolutional layer and an Instance Normalization layer, to spatially align and refine the semantic features before fusion:
\(\textbf{F}_{\text {refined}} = \textrm{IN}\left( \textrm{Conv3D}\left( \textrm{Up}(\textbf{F}_{\text {low}}) \right) \right),\)
where \(\textrm{Up}(\cdot )\) denotes trilinear upsampling.
2. Feature Concatenation. The resulting refined features \(\textbf{F}_{\text {refined}}\) are then concatenated channel-wise with the high-resolution encoder features \(\textbf{F}_{\text {high}}\):
\(\textbf{F}_{\text {concat}} = \textrm{Concat}\left( \textbf{F}_{\text {refined}}, \textbf{F}_{\text {high}} \right),\)
where \(\textrm{Concat}(\cdot )\) represents channel-wise concatenation.
3. Multi-Scale Merging. Subsequently, the concatenated feature map \(\textbf{F}_{\text {concat}}\) is passed through a standard 3D convolutional block to merge the multi-scale information. This block consists of a 3D convolution, followed by an instance normalization layer and a LeakyReLU non-linear activation function. The entire operation is expressed as
\(\textbf{F}_{\text {csdf}} = \textrm{LeakyReLU}\left( \textrm{IN}\left( \textrm{Conv3D}(\textbf{F}_{\text {concat}}) \right) \right),\)
where \(\textrm{Conv3D}(\cdot )\) is the 3D convolutional layer, and \(\textrm{IN}(\cdot )\) is instance normalization. The resulting feature map \(\textbf{F}_{\text {csdf}}\) is then forwarded to the subsequent decoder stage for progressive refinement.
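The following PyTorch sketch illustrates one possible realization of the CSDF block under the three steps above (trilinear upsampling, Conv3D with instance normalization, channel-wise concatenation, and Conv3D with instance normalization and LeakyReLU). The channel sizes are placeholders; the kernel settings follow the specification reported for Table 3, and everything else is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CSDF(nn.Module):
    """Sketch of the cross-scale dual fusion (CSDF) block."""

    def __init__(self, low_ch, high_ch, out_ch):
        super().__init__()
        # Upsampling refinement: trilinear interpolation -> Conv3D -> InstanceNorm.
        self.refine = nn.Sequential(
            nn.Conv3d(low_ch, high_ch, kernel_size=3, padding=1),
            nn.InstanceNorm3d(high_ch),
        )
        # Multi-scale merging: Conv3D -> InstanceNorm -> LeakyReLU.
        self.merge = nn.Sequential(
            nn.Conv3d(2 * high_ch, out_ch, kernel_size=3, padding=1),
            nn.InstanceNorm3d(out_ch),
            nn.LeakyReLU(inplace=True),
        )

    def forward(self, f_low, f_high):
        up = F.interpolate(f_low, size=f_high.shape[2:], mode="trilinear",
                           align_corners=False)
        refined = self.refine(up)                     # spatially aligned semantics
        concat = torch.cat([refined, f_high], dim=1)  # channel-wise concatenation
        return self.merge(concat)                     # fused multi-scale features

# Usage: fuse a coarse decoder map with a finer RLKA-refined skip feature.
f_low = torch.randn(1, 64, 16, 16, 16)
f_high = torch.randn(1, 32, 32, 32, 32)
fused = CSDF(64, 32, 32)(f_low, f_high)
```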
Dataset and implementation details
Dataset
In this study, we used the official BraTS 2023 and BraTS 2024 datasets 39,40,41, which include 1,251 and 1,350 multi-modal MRI scans (FLAIR, T1, T1Gd, and T2), respectively, from adult glioma cases. Each MRI volume was \(240 \times 240 \times 155\) voxels. We performed a 5-fold cross-validation using the entire training data. To prevent any data leakage and ensure a fair evaluation, this partitioning was performed at the patient-level. This approach was adopted to robustly evaluate the model’s generalization performance, although it limits direct comparability to the official challenge leaderboard.
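For clarity, a minimal sketch of the patient-level 5-fold split is shown below. The patient identifier format and the random seed are hypothetical; only the splitting principle (whole patients assigned to a single fold, so no patient appears in both training and validation) reflects the setup described above.

```python
from sklearn.model_selection import KFold

# Hypothetical patient identifiers; in practice these come from the BraTS
# case directories (one multi-modal scan per patient, 1,251 for BraTS 2023).
patient_ids = [f"case_{i:05d}" for i in range(1251)]

# Patient-level 5-fold split: each fold is defined over whole patients.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(kfold.split(patient_ids)):
    train_patients = [patient_ids[i] for i in train_idx]
    val_patients = [patient_ids[i] for i in val_idx]
    print(f"fold {fold}: {len(train_patients)} train / {len(val_patients)} val")
```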
Preprocessing
For data preprocessing, we used an automated pipeline from nnU-Net, which performed voxel intensity normalization, resampled the images to isotropic resolution, and automatically determined the optimal patch size. During training, the four-channel MRI volume centered on the tumor was cropped into a 128 × 128 × 128 voxel patch.
Evaluation metrics
The performance was evaluated using two quantitative metrics: the Dice similarity coefficient (Dice) and the HD95. Here, \(A\) and \(B\) are the predicted and ground truth (GT) segmentations, respectively.
The Dice score measures the overlap between the predicted and GT segmentations and is defined as
\(\text {Dice}(A, B) = \frac{2 |A \cap B|}{|A| + |B|}.\)
A higher Dice score indicates better segmentation accuracy. In contrast, HD95 measures the spatial distance between the predicted and GT contours. It is defined as
\(\text {HD95}(A, B) = P_{95}\left( \left\{ \min _{b \in \partial B} \Vert a - b \Vert : a \in \partial A \right\} \cup \left\{ \min _{a \in \partial A} \Vert b - a \Vert : b \in \partial B \right\} \right),\)
where \(P_{95}\) represents the 95th percentile of the bidirectional distances between all the points on the predicted segmentation boundary and the nearest points on the GT boundary, with \(\partial A\) and \(\partial B\) denoting the boundaries of \(A\) and \(B\). A lower HD95 value indicates a more accurate boundary prediction.
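A minimal NumPy/SciPy sketch of the two metrics, following the definitions above, is shown below. Surface extraction by binary erosion and the omission of empty-mask handling are simplifications; the official BraTS evaluation tools may differ in these details.

```python
import numpy as np
from scipy import ndimage

def dice_score(pred, gt):
    """Dice similarity coefficient between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + 1e-8)

def hd95(pred, gt, spacing=(1.0, 1.0, 1.0)):
    """95th-percentile Hausdorff distance between two binary masks.
    Note: handling of empty masks is omitted in this sketch."""
    def surface(mask):
        # Surface voxels = mask voxels removed by one erosion step.
        return np.logical_and(mask, np.logical_not(ndimage.binary_erosion(mask)))

    s_pred, s_gt = surface(pred.astype(bool)), surface(gt.astype(bool))
    # Distance from every voxel to the nearest surface voxel of each mask.
    dt_gt = ndimage.distance_transform_edt(np.logical_not(s_gt), sampling=spacing)
    dt_pred = ndimage.distance_transform_edt(np.logical_not(s_pred), sampling=spacing)
    # Bidirectional surface distances, 95th percentile.
    dists = np.concatenate([dt_gt[s_pred], dt_pred[s_gt]])
    return np.percentile(dists, 95)
```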
Network architecture
The proposed SwinCLNet is a 3D U-Net-based architecture with six encoder stages and five decoder stages. The encoder path (Encoder-1 to Encoder-5) utilizes W-MSA/SW-MSA for progressive feature extraction. Encoder-6 serves as the bottleneck with 320-dimensional output. The decoder path uses the proposed CSDF blocks. RLKA modules are applied to all five skip connections before feature fusion in the corresponding decoder stages. The detailed architectural specifications are provided in Table 1.
Module configurations
The specific hyperparameters for the W-MSA/SW-MSA blocks are detailed in Table 2. This table outlines the configuration applied at each encoder stage. The d_model column specifies the feature dimension, which progressively increases from 32 to 320 as the network deepens. The num_heads column indicates the number of parallel attention heads, scaling from 2 to 8. The fixed 3D Window Size of \(4 \times 4 \times 4\) is utilized for all stages, and the Dropout rate of 0.1 is consistently applied for regularization. Here, the W-MSA/SW-MSA expansion ratio was set to 4 for all blocks.
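For reference, the stage-wise configuration described above can be summarized as a simple Python structure. Only the endpoint values (32 to 320 channels, 2 to 8 heads), the fixed \(4 \times 4 \times 4\) window, the dropout rate of 0.1, and the expansion ratio of 4 are taken from the text; the intermediate stage values are assumptions.

```python
# Illustrative per-stage W-MSA/SW-MSA configuration (intermediate values assumed).
ENCODER_STAGES = [
    {"d_model": 32,  "num_heads": 2, "window": (4, 4, 4), "dropout": 0.1, "expansion": 4},
    {"d_model": 64,  "num_heads": 4, "window": (4, 4, 4), "dropout": 0.1, "expansion": 4},
    {"d_model": 128, "num_heads": 4, "window": (4, 4, 4), "dropout": 0.1, "expansion": 4},
    {"d_model": 256, "num_heads": 8, "window": (4, 4, 4), "dropout": 0.1, "expansion": 4},
    {"d_model": 320, "num_heads": 8, "window": (4, 4, 4), "dropout": 0.1, "expansion": 4},
]
```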
RLKA specification
As specified in Table 3, the RLKA module uses a 3D depthwise separable convolution. This is decomposed into three sequential axial-wise depthwise convolutions, each with a Kernel size of 9, Dilation of 1, Stride of 1, and Padding of 4, followed by a 1\(\times\)1\(\times\)1 point-wise convolution.
CSDF specification
In Table 3, CSDF utilizes a 3\(\times\)3\(\times\)3 Kernel, a Dilation of 1, a Stride of 1, and a Padding of 1. It uses Trilinear upsampling to align features, followed by a standard 3D convolution and an Instance Normalization layer to complete the fusion process.
Loss function
We employed a compound loss function (\(L_{\text {total}}\)) combining Dice Loss (\(L_{\text {Dice}}\)) and Cross-Entropy Loss (\(L_{\text {CE}}\)). This combination is standard for addressing the class imbalance inherent in BraTS data. The loss is computed for each foreground class (WT, TC, ET) and averaged.
The Dice loss for a single class c is defined as
\(L_{\text {Dice}, c} = 1 - \frac{2 \sum _{i} p_{i,c}\, g_{i,c} + \epsilon }{\sum _{i} p_{i,c} + \sum _{i} g_{i,c} + \epsilon },\)
where \(p_{i,c}\) is the predicted probability for voxel i of class c, \(g_{i,c}\) is the one-hot encoded ground truth, and \(\epsilon\) is a smoothing factor (1e-5).
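A PyTorch sketch of this compound loss is given below. The equal weighting of the Dice and cross-entropy terms and the softmax-based soft Dice formulation are assumptions; the paper specifies only that the two losses are combined, computed per foreground class, and averaged.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiceCELoss(nn.Module):
    """Sketch of the compound Dice + cross-entropy loss described above."""

    def __init__(self, eps=1e-5):
        super().__init__()
        self.eps = eps

    def forward(self, logits, target):
        # logits: (B, C, D, H, W); target: (B, D, H, W) integer class labels.
        ce = F.cross_entropy(logits, target)
        probs = torch.softmax(logits, dim=1)
        one_hot = F.one_hot(target, num_classes=logits.shape[1])
        one_hot = one_hot.permute(0, 4, 1, 2, 3).float()      # (B, C, D, H, W)
        dims = (0, 2, 3, 4)
        inter = (probs * one_hot).sum(dims)
        denom = probs.sum(dims) + one_hot.sum(dims)
        dice_per_class = 1.0 - (2.0 * inter + self.eps) / (denom + self.eps)
        dice = dice_per_class[1:].mean()   # average over foreground classes only
        return dice + ce                   # equal weighting assumed
```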
Inference pipeline
To generate full-volume predictions from the patch-based model, we used a sliding-window approach. The details of the inference pipeline are summarized in Table 4. This process begins with Z-score normalization applied per-patient and per-modality to standardize input intensities. Inference is performed using a \(96 \times 96 \times 96\) sliding window size with a \(48 \times 48 \times 48\) stride, creating a 50% overlap. The resulting overlapping predictions are combined using a Gaussian weighted average, which assigns higher importance to predictions at the center of the window. Finally, the output segmentation is refined through post-processing. This includes Connected Component Analysis to remove small spurious predictions and Morphological Closing to fill minor holes and smooth contours.
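The sliding-window aggregation can be sketched as follows. The Gaussian sigma scale and the simplified boundary handling are assumptions; per-patient Z-score normalization and the post-processing steps (connected component analysis, morphological closing) are omitted from this sketch.

```python
import numpy as np
import torch

def gaussian_importance(patch_size=(96, 96, 96), sigma_scale=0.125):
    """Gaussian weight map emphasizing the center of each sliding window."""
    grids = np.meshgrid(*[np.arange(s) for s in patch_size], indexing="ij")
    centers = [(s - 1) / 2.0 for s in patch_size]
    sigmas = [s * sigma_scale for s in patch_size]
    w = np.ones(patch_size, dtype=np.float32)
    for g, c, s in zip(grids, centers, sigmas):
        w *= np.exp(-((g - c) ** 2) / (2 * s ** 2))
    return torch.from_numpy(w / w.max())

@torch.no_grad()
def sliding_window_predict(model, volume, n_classes, patch=96, stride=48):
    """Overlapping 96^3 windows with stride 48 (50% overlap), aggregated by a
    Gaussian-weighted average. Note: windows are only placed at stride-aligned
    positions; a complete implementation also places windows flush with each border."""
    _, _, D, H, W = volume.shape
    logits = torch.zeros((1, n_classes, D, H, W))
    weights = torch.zeros((1, 1, D, H, W))
    g = gaussian_importance((patch,) * 3)
    for z in range(0, max(D - patch, 0) + 1, stride):
        for y in range(0, max(H - patch, 0) + 1, stride):
            for x in range(0, max(W - patch, 0) + 1, stride):
                tile = volume[:, :, z:z + patch, y:y + patch, x:x + patch]
                logits[:, :, z:z + patch, y:y + patch, x:x + patch] += model(tile) * g
                weights[:, :, z:z + patch, y:y + patch, x:x + patch] += g
    return (logits / weights.clamp_min(1e-8)).argmax(dim=1)
```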
Experiment results
We evaluated the segmentation performance on the whole tumor (WT), which includes peritumoral edema, necrosis, non-ET, and ET; the TC, which includes necrosis and non-ET; and the ET. All experiments were performed on an Ubuntu 22.04 system using PyTorch 2.6.0 and an NVIDIA A6000 GPU with 48 GB of memory. The batch size and number of epochs were 4 and 300, respectively. The experiments followed the default settings of nnU-Net, and training was conducted using a stochastic gradient descent optimizer with Nesterov momentum. Additionally, the hyperparameters were automatically adjusted according to the default nnU-Net training plan, and the learning rate was set to \(1 \times 10^{-2}\).
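Under the reported training settings (SGD with Nesterov momentum, learning rate 1e-2, batch size 4, 300 epochs), the optimizer setup can be sketched as follows; the momentum value, weight decay, and polynomial learning-rate schedule follow common nnU-Net defaults and are assumptions here.

```python
import torch

# Placeholder model standing in for SwinCLNet in this sketch.
model = torch.nn.Conv3d(4, 4, 3, padding=1)

# SGD with Nesterov momentum, initial learning rate 1e-2 (momentum 0.99 and
# weight decay 3e-5 are assumed nnU-Net-style defaults, not stated in the paper).
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.99,
                            nesterov=True, weight_decay=3e-5)

# Polynomial learning-rate decay over 300 epochs (assumed nnU-Net-style schedule).
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda epoch: (1 - epoch / 300) ** 0.9)
```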
Accuracy evaluation
First, we evaluated the proposed SwinCLNet in terms of segmentation accuracy. Here, we compared the proposed SwinCLNet model with SOTA benchmark models using 5-fold cross-validation (CV); the results are shown in Table 5 and visually confirmed by the box plot distributions in Fig. 4.
Using the BraTS2023 dataset, the proposed SwinCLNet achieved the highest average Dice score of \(88.31\% \pm 2.0\%\). Here, ± denotes standard deviation. This result is not only superior in mean performance but also demonstrates high stability compared to other models. For instance, the strong large-kernel CNN, MedNeXt-M, achieved \(87.30\% \pm 2.3\%\), and the strong VT-UNet baseline achieved \(86.88\% \pm 2.4\%\). The proposed model’s lower standard deviation suggests greater robustness across different data splits. Moreover, for the HD95 metric, SwinCLNet achieved a significantly lower average distance of \(8.95 \text { mm} \pm 3.1 \text { mm}\). This is a clear improvement over MedNeXt-M’s \(11.13 \text { mm} \pm 3.9 \text { mm}\), VT-UNet’s \(11.38 \text { mm} \pm 4.0 \text { mm}\) and nnU-Net’s \(12.37 \text { mm} \pm 4.3 \text { mm}\), again showing superior performance and lower variance.
This trend of superior and more stable performance continued on the BraTS 2024 dataset. The proposed SwinCLNet achieved a mean Dice of \(87.82\% \pm 2.1\%\), while MedNeXt-M recorded \(87.00\% \pm 2.4\%\) and the SOTA Transformer, Swin UNETR, recorded \(85.89\% \pm 2.7\%\). Regarding the HD95 metric, SwinCLNet confirmed its robust boundary delineation with a mean of \(9.71 \text { mm} \pm 3.3 \text { mm}\), clearly outperforming MedNeXt-M at \(11.43 \text { mm} \pm 4.1 \text { mm}\), Swin UNETR at \(13.14 \text { mm} \pm 4.6 \text { mm}\) and nnU-Net at \(13.18 \text { mm} \pm 4.7 \text { mm}\).
Furthermore, to validate that the proposed model’s performance improvement is statistically significant, we conducted a paired Wilcoxon signed-rank test on the 5-fold CV results. The resulting p-values are presented in Table 6. As shown, SwinCLNet demonstrated a statistically meaningful improvement (\(p < 0.05\)) in both Mean Dice and Mean HD95 compared to all benchmarked SOTA models, including nnU-Net, VT-UNet, Swin UNETR, SegMamba, and MedNeXt-M.
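For reproducibility, the paired Wilcoxon signed-rank test can be computed with SciPy as sketched below. The Dice values are synthetic placeholders, since the fold- and case-level scores are not listed in the paper, and the pairing granularity (per case versus per fold) is an assumption.

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical paired per-case mean Dice scores collected over the CV folds
# for SwinCLNet and one benchmark model (synthetic numbers for illustration).
rng = np.random.default_rng(0)
baseline_dice = rng.normal(0.86, 0.05, size=250)
swinclnet_dice = baseline_dice + rng.normal(0.015, 0.01, size=250)

# Paired Wilcoxon signed-rank test on matched cases, as reported in Table 6.
stat, p_value = wilcoxon(swinclnet_dice, baseline_dice)
print(f"Wilcoxon statistic={stat:.1f}, p-value={p_value:.2e}")
```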
Second, an ablation study was conducted to determine the optimal number of enhancement blocks, which is shown in Table 7 and Fig. 5. To verify the stability of these results, each configuration was run with three different random seeds. The Model-0 baseline achieved an average Dice of \(80.79\% \pm 0.59\%\) and a mean HD95 of \(14.33 \text { mm} \pm 1.70 \text { mm}\). Performance consistently improved as enhancement blocks were added. The proposed Model-5, which applied modules to all stages, recorded the highest Dice score of \(88.56\% \pm 0.27\%\) and the lowest mean HD95 of \(8.63 \text { mm} \pm 0.77 \text { mm}\). Notably, the standard deviation progressively decreased as modules were added. Model-5 showed the lowest variance, which indicates that the proposed full model is not only the most accurate but also the most stable with respect to random initialization.
Furthermore, we conducted an ablation study to find the optimal kernel size k for the RLKA module, which is shown in Table 8. As we increased the kernel size, both performance and computational overhead increased accordingly. However, the performance gains saturated at \(k=9\). Therefore, we selected the model with \(k=9\) as our final configuration.
Third, we investigated the individual contributions of each proposed module. This study was also validated using three random seeds. The results are summarized in Table 9 and Fig. 6, while Fig. 7 provides the corresponding qualitative visualization. The Baseline model, which is Model-0 from Table 7, provided a mean Dice of \(80.79\% \pm 0.59\%\). Applying only the CSDF module improved performance to \(83.46\% \pm 0.49\%\). Applying only the W-MSA/SW-MSA module achieved a significant gain to \(86.06\% \pm 0.40\%\). Integrating both W-MSA/SW-MSA and CSDF further improved the mean Dice to \(88.42\% \pm 0.30\%\). Finally, the addition of the RLKA module as a final refinement yielded the highest performance at \(88.56\% \pm 0.27\%\) and the lowest HD95 at \(8.63 \text { mm} \pm 0.77 \text { mm}\). Applying any of the proposed modules produced a notable performance improvement over the baseline. The low standard deviations across all configurations confirm that these additive gains are consistent and not an artifact of random initialization.
Fourth, we provide visual segmentation results for a representative case, which are shown in Fig. 8. The figure compares the segmentation maps generated by the proposed model with the GT and other SOTA methods. In the segmentation maps, the green, red, and blue subregions represent the WT, TC, and ET, respectively. Earlier models such as 3D U-Net1 and Attention U-Net43 roughly identified the tumor but failed to delineate the intricate boundaries of the subregions. More recent models such as Swin UNETR46 and SegMamba47 demonstrated improved results but still showed minor inaccuracies. In contrast, the proposed model showed the best segmentation performance among all compared models. Additionally, we provide a more detailed comparison in Fig. 9. While SegMamba47 and MedNeXt28 capture the main tumor body, they tend to miss or undersegment the small, disconnected blue ET regions. In contrast, the proposed SwinCLNet demonstrates its robustness by more accurately identifying these difficult, sparse features.
Computing cost analysis
Table 10 details the segmentation performance and computational costs. The proposed SwinCLNet achieves a mean Dice score of \(88.31\%\), which exceeds all other listed models, including MedNeXt-M’s \(87.30\%\), VT-UNet’s \(86.88\%\), and Swin UNETR’s \(86.57\%\). The proposed SwinCLNet also shows favorable computational complexity. Although its parameter count and GPU memory usage are higher than those of the benchmark models, it demonstrates efficiency in terms of FLOPs and inference time. Regarding FLOPs, SwinCLNet requires 153.0 GFLOPs, which is substantially lower than MedNeXt-M’s 248.0 GFLOPs, VT-UNet’s 198.5 GFLOPs, and Swin UNETR’s 175.2 GFLOPs, and comparable to SegMamba and nnU-Net. Regarding inference time, SwinCLNet takes 177 ms, which is faster than MedNeXt-M at 225 ms, VT-UNet at 210 ms, and Swin UNETR at 189 ms. In summary, the proposed model achieves superior segmentation performance while maintaining lower FLOPs and faster inference time than the other benchmark models.
Table 11 details the increases in computational cost from the 3D-U-Net baseline for each added module. Adding RLKA increases the number of parameters by 0.36 M, FLOPs by 8.8 G, and inference time by 4 ms. Adding CSDualFusion increases parameters by 4.40 M, FLOPs by 2.3 G, and inference time by 19 ms. Adding SW-MSA increases parameters by 8.24 M, FLOPs by 43.4 G, and inference time by 36 ms.
Discussion, limitations and future work
The proposed model achieved superior segmentation performance. It achieved an average Dice score of 88.31% and 87.82% and an average HD95 of 8.95 mm and 9.71 mm on the BraTS 2023 and 2024 datasets, respectively, demonstrating superior performance compared to benchmark models. To verify statistical significance, a paired Wilcoxon signed-rank test was performed on the 5-fold cross-validation results. The results showed that the proposed SwinCLNet achieved statistically significant improvements (p-value < 0.05) in both average Dice and average HD95 compared to all benchmark models.
However, despite the superior performance and efficiency of the model, the improvement in the WT region was relatively limited compared with that in the TC and ET regions, and the HD95 for the ET region remained comparatively unstable. The model may still face challenges in accurately segmenting the boundaries of highly heterogeneous regions or extremely small lesions. In particular, the WT boundary issue likely stems from such heterogeneity, while the ET region, being small, sparse, and ambiguously bounded, further contributes to instability in the HD95 metric, which is highly sensitive to minor boundary deviations in the BraTS dataset.
The first reason is that we employed a general loss function without incorporating any boundary-specific components, which limited the model’s ability to emphasize ambiguous or heterogeneous regions such as WT. Moreover, no region-targeted optimization, such as region-specific loss weighting or dedicated boundary-aware objectives, was applied, further constraining performance in these challenging areas. To address this, auxiliary loss modules emphasizing boundary-aware features could be introduced to assign higher weights to ambiguous boundary pixels, particularly in the heterogeneous WT region. This approach may incorporate compound loss functions that integrate the main loss with boundary-focused terms such as Boundary Loss or Hausdorff Distance Loss.
The second reason is that we employed a static window in the W-MSA/SW-MSA mechanisms. This static design is computationally efficient but may not be optimal for all sub-regions. For instance, applying larger windows to the WT region, where broad contextual understanding is essential, and smaller windows to the TC and ET regions, where precise boundary delineation is required, could improve segmentation performance. To this end, we plan to explore dynamic or adaptive windowing strategies, including architectures that process multiple window scales in parallel within each block and effectively fuse these multi-scale features. In addition, we will investigate advanced deformable or adaptive attention mechanisms that allow attention windows to flexibly adjust their shape and sampling positions according to input features, enabling more accurate modeling of irregular tumor boundaries.
The proposed model also achieves lower FLOPs and faster inference than the major benchmark models. It recorded relatively low FLOPs of 153.0 GFLOPs, making it more efficient than MedNeXt-M at 248.0 GFLOPs and VT-UNet at 198.5 GFLOPs. The inference time was also faster at 177 ms compared with MedNeXt-M at 225 ms and Swin UNETR at 189 ms. However, it has higher memory usage and a larger number of model parameters. The primary reason for this increase is the expansion of the proposed model’s depth and width: systematically integrating the proposed modules deepened the architecture, which in turn increased both the parameter count and memory usage. To address this, first, model quantization can be applied. Specifically, post-training quantization (PTQ), which converts the model weights and activation values to lower-bit representations, could be utilized. This technique can significantly reduce memory usage while minimizing accuracy degradation. Second, knowledge distillation can be employed. This approach involves training a lighter ‘student’ model to mimic the output of the high-performance ‘teacher’ model. It offers a way to reduce memory consumption while preserving the model’s performance.
Finally, to further enhance applicability and robustness in real clinical settings, we plan to investigate data-efficient unsupervised or semi-supervised learning strategies48,49 and multimodal learning frameworks, along with the integration of large vision–language models (VLMs)50. In addition, we aim to develop more advanced models that remain resilient across different scanners and sequences, varying noise levels, anisotropic resolutions, and even extreme morphological variations.
Conclusion
We proposed a new SwinCLNet model that enhances brain tumor segmentation performance by integrating new W-MSA/SW-MSA, CSDF, and RLKA modules into the existing 3D U-Net framework. First, the proposed encoder block enhanced feature representation by alternating the W-MSA and SW-MSA modules. The SW-MSA module facilitated information sharing across windows via a shifting mechanism, whereas the W-MSA module captured the rich local context within a fixed 3D window. The model effectively learned both local and global contextual information owing to this dual approach. Second, the proposed decoder employed the CSDF module that integrated the features across multiple resolutions. Combining low- and high-level features contributed to improving the precision of tumor boundary delineation. Third, the proposed model applied the RLKA module to skip connections to expand the receptive field and further enhanced the spatial feature representation. Finally, through extensive experiments, we confirmed that the proposed model achieved superior segmentation performance compared to representative benchmark models while achieving lower computational complexity. These results demonstrated that the proposed SwinCLNet is an effective solution for the complex task of brain tumor segmentation.
Data availability
The BraTS 2023 and BraTS 2024 datasets analysed during the current study are available on the Synapse platform. Access to the BraTS 2023 dataset requires approval from the BraTS 2023 Challenge organizers. Similarly, access to the BraTS 2024 dataset requires approval from the BraTS 2024 Challenge organizers. Following approval, the data are available for download by authorized users. The specific datasets can be found by referencing their respective challenges on the platform (https://www.synapse.org/).
References
Çiçek, Ö., Abdulkadir, A., Lienkamp, S. S., Brox, T. & Ronneberger, O. 3d u-net: learning dense volumetric segmentation from sparse annotation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2016: 19th International Conference, Athens, Greece, October 17-21, 2016, Proceedings, Part II 19 424–432 (Springer, 2016).
Isensee, F. et al. nnu-net: Self-adapting framework for u-net-based medical image segmentation. arXiv preprint. arXiv:1809.10486 (2018).
Chen, J. et al. Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint. arXiv:2102.04306 (2021).
Hao, Z., Quan, H. & Lu, Y. Emf-former: An efficient and memory-friendly transformer for medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention 231–241 (Springer, 2024).
Xiong, X. et al. Sam2-unet: Segment anything 2 makes strong encoder for natural and medical image segmentation. arXiv preprint. arXiv:2408.08870 (2024).
Kirillov, A. et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision 4015–4026 (2023).
Peng, Y., Chen, D. Z. & Sonka, M. U-net v2: Rethinking the skip connections of u-net for medical image segmentation. In 2025 IEEE 22nd International Symposium on Biomedical Imaging (ISBI) 1–5 (IEEE, 2025).
Ronneberger, O., Fischer, P. & Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18 234–241 (Springer, 2015).
Fu, G. et al. Comparing foundation models and nnu-net for segmentation of primary brain lymphoma on clinical routine post-contrast t1-weighted mri. In Medical Imaging 2025: Clinical and Biomedical Imaging vol. 13410, 1341019 (SPIE, 2025).
Spronck, J. et al. nnunet meets pathology: bridging the gap for application to whole-slide images and computational biomarkers. In Medical Imaging with Deep Learning (2023).
Abid, M. A. & Munir, K. A systematic review on deep learning implementation in brain tumor segmentation, classification and prediction. Multimedia Tools and Applications 1–40 (2025).
Shamshad, F. et al. Transformers in medical imaging: A survey. Med. Image Anal. 88, 102802 (2023).
Chowdary, G. J. & Yin, Z. Med-former: A transformer based architecture for medical image classification. In International Conference on Medical Image Computing and Computer-Assisted Intervention 448–457 (Springer, 2024).
Wang, X., Ying, H., Xu, X., Cai, X. & Zhang, M. Transliver: A hybrid transformer model for multi-phase liver lesion classification. In International Conference on Medical Image Computing and Computer-Assisted Intervention 329–338 (Springer, 2023).
Oghbaie, M., Araújo, T., Emre, T., Schmidt-Erfurth, U. & Bogunović, H. Transformer-based end-to-end classification of variable-length volumetric data. In International Conference on Medical Image Computing and Computer-Assisted Intervention 358–367 (Springer, 2023).
Liao, Z., Hu, S., Xie, Y. & Xia, Y. Transformer-based annotation bias-aware medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention 24–34 (Springer, 2023).
Yang, D. et al. Dast: Differentiable architecture search with transformer for 3d medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention 747–756 (Springer, 2023).
Zhou, Y. et al. Nnformer: Interleaved transformer for volumetric medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention 217–228 (Springer, 2021).
Ma, J., Li, F. & Wang, B. U-mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv preprint arXiv:2401.04722 (2024).
Gong, H. et al. nnmamba: 3d biomedical image segmentation, classification and landmark detection with state space model. In 2025 IEEE 22nd International Symposium on Biomedical Imaging (ISBI) 1–5 (IEEE, 2025).
Liu, Y. et al. Vmamba: Visual state space model. Adv. Neural. Inf. Process. Syst. 37, 103031–103063 (2024).
Ruan, J., Li, J. & Xiang, S. Vm-unet: Vision mamba unet for medical image segmentation. ACM Transactions on Multimedia Computing, Communications and Applications (2024).
Ma, B. et al. Dtasunet: A local and global dual transformer with the attention supervision u-network for brain tumor segmentation. Sci. Rep. 14, 28379 (2024).
Liu, H., Li, Q., Nie, W., Xu, Z. & Liu, A. Causal intervention for brain tumor segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention 160–170 (Springer, 2024).
Phan, V. M. H. et al. Structural attention: Rethinking transformer for unpaired medical image synthesis. In International Conference on Medical Image Computing and Computer-Assisted Intervention 690–700 (Springer, 2024).
Yang, Z. et al. Region attention transformer for medical image restoration. In International Conference on Medical Image Computing and Computer-Assisted Intervention 603–613 (Springer, 2024).
Lin, X., Wang, Z., Yan, Z. & Yu, L. Revisiting self-attention in medical transformers via dependency sparsification. In International Conference on Medical Image Computing and Computer-Assisted Intervention 555–566 (Springer, 2024).
Roy, S. et al. Mednext: transformer-driven scaling of convnets for medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention 405–415 (Springer, 2023).
Xiao, L., Song, J., Xie, X. & Fan, C. Enhanced medical image segmentation using u-net with residual connections and dual attention mechanism. Eng. Appl. Artif. Intell. 153, 110794 (2025).
Zhang, Z., Wen, Y., Zhang, X. & Ma, Q. Ci-unet: Melding convnext and cross-dimensional attention for robust medical image segmentation. Biomed. Eng. Lett. 14, 341–353 (2024).
Xie, X. et al. Local and long-range progressive fusion network for knee joint segmentation. Biomed. Signal Process. Control 112, 108624 (2026).
Xiao, L., Zhou, B. & Fan, C. Automatic brain mri tumors segmentation based on deep fusion of weak edge and context features. Artif. Intell. Rev. 58, 154 (2025).
Zhang, Z. et al. Deep learning and radiomics-based approach to meningioma grading: Exploring the potential value of peritumoral edema regions. Phys. Med. Biol. 69, 105002 (2024).
Xie, X. et al. Pif-net: A parallel interweave fusion network for knee joint segmentation. Biomed. Signal Process. Control 109, 107967 (2025).
Xie, X. et al. Discriminative features pyramid network for medical image segmentation. Biocybern. Biomed. Eng. 44, 327–340 (2024).
Zhang, Z. et al. A novel deep learning model for medical image segmentation with convolutional neural network and transformer. Interdiscip. Sci. 15, 663–677 (2023).
Wen, X., Liu, Z., Chu, Y., Le, M. & Li, L. Mrcm-uctransnet: Automatic and accurate 3d tooth segmentation network from cone-beam ct images. Int. J. Imaging Syst. Technol. 34, e23139 (2024).
Fan, C., Yang, L. T. & Xiao, L. A step gravitational search algorithm for function optimization and sttm’s synchronous feature selection-parameter optimization. Artif. Intell. Rev. 58, 179 (2025).
Baid, U. et al. The rsna-asnr-miccai brats 2021 benchmark on brain tumor segmentation and radiogenomic classification. arXiv preprint. arXiv:2107.02314 (2021).
Menze, B. H. et al. The multimodal brain tumor image segmentation benchmark (brats). IEEE Trans. Med. Imaging 34, 1993–2024 (2014).
Bakas, S. et al. Advancing the cancer genome atlas glioma mri collections with expert segmentation labels and radiomic features. Sci. Data 4, 1–13 (2017).
Alom, M. Z., Yakopcic, C., Hasan, M., Taha, T. M. & Asari, V. K. Recurrent residual u-net for medical image segmentation. J. Med. Imaging 6, 014006–014006 (2019).
Oktay, O. Attention u-net: Learning where to look for the pancreas. arXiv preprint arXiv:1804.03999 (2018).
Wenxuan, W. et al. Transbts: Multimodal brain tumor segmentation using transformer. In International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer 109–119 (2021).
Peiris, H., Hayat, M., Chen, Z., Egan, G. & Harandi, M. A robust volumetric transformer for accurate 3d tumor segmentation. In International Conference on Medical Image Computing and Computer-assisted Intervention 162–172 (Springer, 2022).
Hatamizadeh, A. et al. Swin unetr: Swin transformers for semantic segmentation of brain tumors in mri images. In International MICCAI Brainlesion Workshop 272–284 (Springer, 2021).
Xing, Z., Ye, T., Yang, Y., Liu, G. & Zhu, L. Segmamba: Long-range sequential modeling mamba for 3d medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention 578–588 (Springer, 2024).
Chen, J., Zhang, J., Debattista, K. & Han, J. Semi-supervised unpaired medical image segmentation through task-affinity consistency. IEEE Trans. Med. Imaging 42, 594–605 (2022).
Chen, J. et al. Dynamic contrastive learning guided by class confidence and confusion degree for medical image segmentation. Pattern Recogn. 145, 109881 (2024).
Chen, J. et al. From gaze to insight: bridging human visual attention and vision language model explanation for weakly-supervised medical image segmentation. IEEE Trans. Med. Imaging (2025).
Funding
This research was supported by the Bio & Medical Technology Development Program of the National Research Foundation (NRF) funded by the Korean government (MSIT) (No. RS-2023-00223501) and the Glocal University Project funded by the Ministry of Education and the National Research Foundation of Korea (GLOCAL-202504310001).
Author information
Authors and Affiliations
Contributions
S.J. (Seyong Jin) conceived the study, developed the methodology, implemented the software, and conducted the experiments. Y.N. (Yeonwoo Noh) supported data curation and the validation process. W.N. (Wonjong Noh) contributed to the formal analysis. W.N., H.M. (Hyeonjoon Moon) and M.L. (Minwoo Lee) provided critical review and were responsible for funding acquisition. W.N. and H.M. also provided supervision and managed the project. All authors reviewed and approved the final manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Accession codes
The BraTS 2023 and BraTS 2024 datasets used in this study are publicly available. The BraTS 2023 data was accessed through the official challenge page on the Synapse platform under Synapse ID syn51156910. The BraTS 2024 dataset was accessed via its respective Synapse portal under Synapse ID syn53708249.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Jin, S., Noh, Y., Moon, H. et al. SwinCLNet: a robust framework for brain tumor segmentation via shifted window attention and cross-scale fusion. Sci Rep 16, 2105 (2026). https://doi.org/10.1038/s41598-025-31937-8