Introduction

A brain tumor is an abnormal growth of cells within the brain or the central spinal canal. These cells multiply at an extremely rapid rate, forming a mass that can disrupt normal brain function. Brain tumor segmentation is the task of locating and labeling a brain tumor's precise position and boundaries in medical images, particularly magnetic resonance imaging (MRI) scans. Depending on the type of scan, such as T1, T2, FLAIR, or T1-CE, tumors appear with distinct contrasts in these detailed images of the head. In these scans, segmentation refers to separating malignant or tumorous tissue from normal brain structure and, in many cases, further dividing the tumor into distinct areas, such as the core, the active edge, and the surrounding swelling1,2.

Three main areas are typically distinguished in brain tumor examination, especially on medical imaging: the core, the active edge, and the surrounding edema3. The core of the tumor, typically called the necrotic core, is composed of tissue that is dead or no longer functioning. This necrosis generally occurs when the tumor outgrows its blood supply, causing central cell death. The active edge, sometimes referred to as the enhancing tumor region, surrounds the core and denotes the region where the malignancy continues to spread actively and encroach on neighboring tissue. This area is frequently the main focus of surgical and medical treatment since it shows up clearly on contrast-enhanced images4. The peripheral swelling, or edema, located beyond the active edge, reflects the tumor's presence but does not itself consist of cancerous cells. This swelling has a significant impact on treatment planning and symptom management because it compresses nearby brain structures and worsens symptoms.

The primary technique for segmenting brain tumors is MRI, which offers remarkable soft-tissue contrast and enables the integration of several imaging sequences that emphasize distinct tissue properties. In contrast to CT scans5, which perform better at visualizing bone, MRI makes it easier to discern the differences among the tumor's center, its actively growing areas, and any adjacent swelling. MRI also spares patients from radiation exposure, making it preferable for repeated use over an extended period6. Each MRI scan category, including T1, T2, FLAIR, and T1 with contrast enhancement, provides distinct visual information that aids in precisely distinguishing different tumor segments. Alternative imaging procedures, such as CT or PET scans, can be employed in specific situations, but they either fail to delineate the intricate layers of brain malignancies or lack the soft-tissue discrimination of MRI7. Thus, for a thorough, safe, and multifaceted evaluation of brain tumors, MRI remains the standard of excellence.

Each scan category provides a different structural view. On T1-weighted MRI, which produces detailed structural images, cerebrospinal fluid (CSF) and fat appear dark8; T1 is mostly employed to depict typical brain anatomy and provides a reference against which other sequences can be compared. T2-weighted MRI, on the other hand, emphasizes fluid, so areas of edema, inflammation, or malignancy appear as bright spots; T2 is therefore very helpful in detecting anomalies such as infection and malignancy. A modified T2 sequence known as FLAIR (Fluid Attenuated Inversion Recovery) suppresses signal from free fluid such as CSF. Eliminating the bright underlying CSF signal improves the visibility of lesions close to the brain's ventricles or other fluid-filled spaces. In T1 with contrast enhancement (T1-CE), a contrast agent, usually gadolinium, is injected; it accumulates where the blood-brain barrier is compromised, which is frequently the case in aggressive or proliferative tumor regions. Effective evaluation and therapy planning depend on this sequence since it aids in locating and defining the tumor's actively enhancing regions9,10. The complementary information these sequences offer when combined is essential for a thorough evaluation of brain tumors.

Brain tumor segmentation from MRI is a critical task in medical image analysis, facilitating early diagnosis, treatment planning, and disease monitoring11,12. However, the inherent complexity of brain tumors, including variations in size, shape, and location across different MRI modalities (T1, T2, FLAIR, and T1-CE), poses significant challenges13,14,15. Traditional segmentation methods, such as thresholding, region-based approaches, and classical machine learning models, often struggle with heterogeneous tumor structures and modality-specific variations. While deep learning-based models like U-Net, nnU-Net, and Attention U-Net have improved segmentation accuracy, they still face limitations in capturing multi-scale spatial features and suppressing irrelevant background noise. Additionally, manual segmentation by radiologists remains time-intensive, subjective, and prone to inter-observer variability.

To surmount these challenges, we propose a Multi-Modal Multi-Scale Contextual Aggregation with Attention Fusion (MM-MSCA-AF) model for brain tumor segmentation. The proposed model integrates multi-modal MRI inputs to capture complementary tissue contrasts and employs Multi-Scale Contextual Aggregation (MSCA) to learn both global and fine-grained spatial features. A gated attention fusion (GAF) mechanism is introduced to selectively enhance tumor-specific features and suppress noise, improving segmentation performance in necrotic, enhancing, and edema regions. The proposed framework is evaluated on the BRATS 2020 dataset, as illustrated in Fig. 1, and achieves a Dice score of 0.8158 (0.7982 for necrotic tumor regions) and a Mean IoU of 0.8589, outperforming state-of-the-art models.

Fig. 1

BRATS dataset and its interpretation: visual representation of the BRATS 2020 dataset used for brain tumor segmentation, highlighting different MRI modalities (T1, T2, FLAIR, T1-CE) and corresponding tumor annotations, including necrotic, edema, and enhancing regions.

Related work

Traditional approaches for brain tumor segmentation

Initial techniques for brain tumor segmentation were largely thresholding approaches, e.g., Otsu's method, wherein an optimal intensity threshold is calculated to separate tumor from non-tumor tissue. The technique is simple and computationally efficient, but it does not generalize across MRI modalities owing to tumor intensity variations, patient-dependent acquisition parameters, and noise in medical images. Region-based methods, including region growing and watershed segmentation, attempted to move beyond thresholding by considering spatial relationships between neighboring pixels. Region growing starts from a seed point and expands the region based on intensity similarity, but it is extremely sensitive to seed position and noise. Similarly, watershed segmentation is effective at delineating tumor boundaries but suffers from over-segmentation when applied to MRI images. Beyond their simplicity, thresholding and region-based methods cannot manage tumor heterogeneity, where different regions of the tumor exhibit disparate intensity values. Furthermore, MRI artifacts such as intensity non-uniformity and partial volume effects also undermine segmentation performance, making these traditional techniques unsuitable for complex medical imaging tasks16,17.
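
To make the classical pipeline concrete, the following is a minimal sketch of threshold-based segmentation on a single 2D slice, using scikit-image's Otsu implementation; the input slice is a random placeholder rather than real BRATS data, and real tumors rarely separate this cleanly.

```python
# Minimal sketch of classical threshold-based segmentation on one 2D MRI
# slice. `slice_2d` is a placeholder; in practice it would be loaded from
# a NIfTI volume.
import numpy as np
from skimage.filters import threshold_otsu
from skimage.measure import label

slice_2d = np.random.rand(240, 240)   # stand-in for a real FLAIR slice

t = threshold_otsu(slice_2d)          # global intensity threshold
mask = slice_2d > t                   # binary tumor/non-tumor hypothesis
components = label(mask)              # connected components for post-processing
```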

Classical machine learning algorithms, such as Support Vector Machines (SVMs)18, Random Forests (RFs), and k-Nearest Neighbors (KNN), were subsequently introduced to improve brain tumor segmentation. These models rely on handcrafted feature extraction, including texture, shape, and intensity-based descriptors, to classify tumor regions. For instance, Gabor filters and the histogram of oriented gradients (HOG) were used to capture local intensity variations and edge information in MRI scans19. These machine learning-based methods performed better than thresholding techniques, but they require extensive feature engineering and struggle to generalize across different datasets. SVMs, for example, have been widely used due to their robustness in handling high-dimensional data, but they are computationally expensive for large MRI datasets and require careful kernel selection for optimal performance20. Similarly, Random Forests, which aggregate multiple decision trees, can mitigate overfitting but still suffer from feature redundancy and class imbalance issues when segmenting tumor regions21. One of the main limitations of classical machine learning techniques is their inability to capture hierarchical spatial dependencies in MRI images. Since these models operate on manually extracted features, they lack the end-to-end learning capability of modern deep learning architectures22. As a result, classical approaches are gradually being replaced by deep convolutional neural networks (CNNs), which automatically learn spatial and contextual features from raw MRI data. To better understand the limitations of traditional approaches, Table 1 provides a comparative analysis of thresholding, region-based methods, and classical machine learning techniques for brain tumor segmentation.
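
The following sketch illustrates this handcrafted-feature workflow with HOG descriptors and a Random Forest; the patches, labels, and parameter choices are placeholder assumptions, not those of the cited studies.

```python
# Hedged sketch of the classical feature-engineering pipeline: one HOG
# descriptor per image patch, classified with a Random Forest.
import numpy as np
from skimage.feature import hog
from sklearn.ensemble import RandomForestClassifier

patches = np.random.rand(200, 32, 32)    # placeholder MRI patches
labels = np.random.randint(0, 2, 200)    # 1 = tumor patch, 0 = background

X = np.array([hog(p, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
              for p in patches])         # handcrafted descriptors
clf = RandomForestClassifier(n_estimators=100).fit(X, labels)
```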

Table 1 Comparison of brain tumor segmentation methods.
Table 2 Comparison of U-Net-based brain tumor segmentation models.

Deep learning-based approaches

One of the most widely used deep learning architectures for biomedical image segmentation is U-Net, a fully convolutional network (FCN) introduced by Ronneberger et al.23,24. The architecture follows an encoder-decoder structure, where the contracting path extracts hierarchical feature representations, and the expanding path progressively refines spatial details using skip connections. These skip connections allow the network to recover fine-grained structural details, making the U-Net highly effective for medical imaging tasks. Despite its success, U-Net has notable limitations when applied to brain tumor segmentation. It struggles to accurately segment small tumor regions, particularly necrotic cores and enhancing tumor areas, due to limited spatial awareness in deeper layers25. Moreover, U-Net lacks an explicit attention mechanism, making it susceptible to background noise and less effective in highlighting tumor boundaries26.
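
The encoder-decoder-with-skip pattern can be summarized in a compact PyTorch sketch; the depth and channel widths below are illustrative assumptions, not the configuration of any cited model.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU())

class TinyUNet(nn.Module):
    """Two-level U-Net: the skip connection carries encoder detail to the decoder."""
    def __init__(self, in_ch=4, n_classes=4):
        super().__init__()
        self.enc = conv_block(in_ch, 32)
        self.down = nn.MaxPool2d(2)
        self.mid = conv_block(32, 64)
        self.up = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec = conv_block(64, 32)      # 64 = 32 (skip) + 32 (upsampled)
        self.head = nn.Conv2d(32, n_classes, 1)

    def forward(self, x):
        e = self.enc(x)                              # contracting path
        m = self.mid(self.down(e))                   # bottleneck
        d = self.dec(torch.cat([self.up(m), e], 1))  # skip connection
        return self.head(d)                          # per-pixel class logits
```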

To overcome this susceptibility to background noise, the proposed MM-MSCA-AF model introduces a GAF mechanism, which selectively enhances tumor-relevant features while suppressing irrelevant information. Furthermore, MSCA allows MM-MSCA-AF to better capture global and local contextual details, significantly improving segmentation accuracy over the standard U-Net. U-Net++ is an extension of U-Net that incorporates nested and dense skip connections to refine feature representations27 by introducing intermediate convolutional layers between the encoder and decoder paths. U-Net++ improves gradient flow, allowing for better preservation of fine details during segmentation. These modifications enhance feature learning, particularly for complex and heterogeneous structures such as brain tumors. However, U-Net++ introduces higher computational complexity due to the increased number of convolutional operations within the skip pathways28. This additional processing increases training time and memory requirements, making the model less efficient for large-scale datasets such as BRATS 2020. In contrast to U-Net++, MM-MSCA-AF leverages hierarchical multi-scale contextual aggregation, which provides a more efficient way to integrate both low-level and high-level features. Instead of relying solely on dense connections, MM-MSCA-AF explicitly processes features at multiple scales, improving the segmentation of varying tumor sizes without adding excessive computational overhead.

The nnU-Net (no-new U-Net) is an automated segmentation framework that dynamically adapts the network architecture, preprocessing, and postprocessing based on the characteristics of the dataset29. Unlike standard U-Net variants, nnU-Net optimizes hyperparameters, normalizes input data, and fine-tunes model depth without requiring manual intervention. This self-configuring approach has made nnU-Net one of the most robust baseline models for medical image segmentation, delivering strong performance. However, it does not explicitly incorporate multi-modal fusion, limiting its effectiveness for multi-sequence MRI analysis30. Brain tumors exhibit distinct contrasts across MRI modalities (T1, T2, FLAIR, T1-CE), and ignoring this complementary information can lead to suboptimal segmentation performance. MM-MSCA-AF addresses this limitation by explicitly fusing information from multiple MRI sequences, enhancing feature representation across different tissue contrasts. The use of multi-modal integration and contextual attention mechanisms allows MM-MSCA-AF to better delineate tumor boundaries, particularly in challenging cases where tumor regions blend with surrounding tissues.

Attention U-Net extends the original U-Net architecture by introducing attention gates (AGs) to selectively focus on salient tumor features while suppressing irrelevant background regions31. These gates learn to assign higher weights to important anatomical structures, enhancing segmentation performance, particularly in medical imaging tasks where contrast variations are subtle. While attention mechanisms improve localization accuracy, standard Attention U-Nets are computationally expensive, as they require additional multiplicative operations at multiple feature scales31. Furthermore, conventional attention mechanisms may still fail to distinguish between tumor regions and other high-intensity areas, leading to false positives in segmentation results.
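
The gating idea can be rendered schematically in PyTorch as below; the published Attention U-Net additionally resamples the gating signal and varies channel counts per scale, which this sketch omits.

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Additive attention gate in the spirit of Attention U-Net: the decoder
    signal g re-weights the encoder skip features x before concatenation."""
    def __init__(self, ch):
        super().__init__()
        self.w_g = nn.Conv2d(ch, ch, 1)   # project gating signal
        self.w_x = nn.Conv2d(ch, ch, 1)   # project skip features
        self.psi = nn.Conv2d(ch, 1, 1)    # collapse to one attention map

    def forward(self, x, g):
        a = torch.sigmoid(self.psi(torch.relu(self.w_g(g) + self.w_x(x))))
        return x * a   # suppress background, keep salient skip features
```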

To address the aforementioned challenges, MM-MSCA-AF introduces GAF, which dynamically adjusts feature importance across different MRI modalities and spatial scales. Unlike standard attention gates, GAF learns adaptive attention weights, allowing the model to selectively enhance critical features while maintaining computational efficiency. This leads to superior segmentation accuracy with reduced false positive rates compared to Attention U-Net. To summarize the advantages of MM-MSCA-AF over existing models, Table 2 presents a comparative analysis of the deep learning approaches discussed above.

Hybrid and transformer-based approaches

SegNet is an encoder-decoder-based deep learning architecture originally designed for semantic segmentation32. It follows a similar structure to U-Net, where the encoder extracts hierarchical feature representations and the decoder progressively reconstructs segmentation masks. However, unlike U-Net, SegNet does not incorporate skip connections, relying instead on pooling indices from the encoder to retain spatial information33. This reduces memory usage and makes SegNet computationally efficient, making it a viable option for medical image segmentation. Although effective, SegNet has serious limitations in tumor segmentation. Due to the absence of advanced contextual aggregation mechanisms, it struggles to model both global and local tumor information, particularly in heterogeneous brain tumors. In addition, the absence of explicit attention mechanisms makes SegNet less effective at distinguishing between tumor boundaries and background noise, leading to false positives in complex medical images24. In comparison, MM-MSCA-AF overcomes these restrictions by introducing MSCA, which captures both fine-grained spatial details and overall tumor shape. In addition, the GAF mechanism enables feature refinement, making MM-MSCA-AF more effective than SegNet at segmenting tumor regions.

Hierarchical Context Aggregation (HCA) networks strive to improve spatial information aggregation by combining multi-level feature representations34. Pyramid pooling and hierarchical feature extraction are used in these networks, which enable better segmentation of objects of various structures and sizes33. HCA has been used effectively in high-resolution medical imaging applications where global and local information must be preserved34. However, HCA-based models lack dedicated attention mechanisms, limiting their ability to selectively focus on tumor-relevant features and disregard background noise29. Since brain tumor segmentation requires fine-grained feature selection, HCA models can mislabel small tumor regions or fail to differentiate between tumor subtypes.

The MM-MSCA-AF model builds upon HCA principles by integrating MSCA, which not only aggregates hierarchical spatial features but also incorporates attention-based fusion through GAF. This allows MM-MSCA-AF to selectively enhance important tumor features, improving segmentation performance without increasing computational complexity.

Table 3 Comparison of hybrid and transformer-based brain tumor segmentation models.

Transformer-based architectures, such as Swin-UNETR and TransBTS, have recently gained popularity in medical image segmentation due to their ability to model long-range dependencies and spatial relationships35. Unlike CNNs, which rely on localized convolutional operations, Vision Transformers (ViTs) process entire image patches, capturing global contextual information more effectively35. Swin-UNETR combines Swin Transformers with an encoder-decoder U-Net structure, enabling hierarchical feature extraction while preserving spatial consistency. TransBTS integrates Transformer layers with CNN backbones, achieving state-of-the-art performance in brain tumor segmentation by leveraging both local and global feature representations22.

Recent works have proposed refined encoder-decoder architectures and hybrid CNN-Transformer networks to overcome limitations in conventional U-Net-based models. For instance, DCSSGA-UNet integrates DenseNet201 as the encoder and introduces both channel spatial attention and semantic guidance attention modules, which enhance feature discrimination and reduce semantic gaps during decoding, resulting in state-of-the-art performance on multiple medical image datasets36. MAGRes-UNet addresses the limitations in contextual modeling by incorporating multi-attention gates and residual blocks, effectively capturing small-scale tumor features and promoting better gradient flow during training37. To balance local detail extraction with global context, EFFResNet-ViT fuses CNN backbones (EfficientNet-B0, ResNet-50) with a Vision Transformer module, improving classification accuracy and interpretability through Grad-CAM and t-SNE visualization38. In another direction, a YOLOv8-based pipeline for elbow fracture detection combines region-of-interest localization with advanced CNN and transformer variants (ResNet, SeResNet, ViT), and demonstrates the effectiveness of attention-guided ROI refinement and enhancement techniques in orthopedic diagnosis39. These approaches exemplify recent trends in leveraging attention, residual learning, and transformer-CNN hybrids to improve medical image segmentation and classification.

TransBTS is a hybrid brain tumor segmentation framework that integrates CNN-based feature extraction with a Transformer module to capture long-range dependencies and contextual relationships. It uses a ResNet encoder to learn spatial features, followed by a Bottleneck Transformer that operates in the latent space to model global context. This design improves segmentation accuracy in regions where conventional CNNs struggle due to limited receptive fields. The decoder mirrors U-Net-style skip connections for spatial detail recovery. While TransBTS demonstrates strong performance on BRATS datasets, its computational overhead is higher than purely CNN-based models, limiting its efficiency in real-time or resource-constrained clinical environments.

Despite their advantages, transformer-based models require large-scale training datasets to generalize effectively. Medical imaging datasets, including BRATS 2020, are relatively small compared to datasets used for natural image processing, which limits the effectiveness of pure transformer-based models40. Additionally, transformers introduce higher computational costs, making them less practical for real-time clinical applications40.

In comparison, MM-MSCA-AF balances computational efficiency and accuracy by combining CNN-based multi-scale feature extraction with attention-based refinement, making it less resource-intensive than transformers while still achieving high segmentation precision. To highlight the differences between MM-MSCA-AF and existing hybrid and transformer-based models, Table 3 presents a comparative analysis.

Research gaps and motivation for MM-MSCA-AF

Brain tumor segmentation remains a challenging task due to the complex morphological variations of tumors and the need for high-precision delineation. Traditional and deep learning-based models have improved segmentation performance, but they still exhibit significant limitations. Conventionally, CNN-based architectures, including U-Net, nnU-Net, and SegNet, process single-modal MRI scans, limiting their ability to leverage complementary information from different MRI sequences (T1, T2, FLAIR, and T1-CE). Each modality highlights distinct tissue contrasts, making multi-modal fusion essential for accurately segmenting tumor regions40. Models that do incorporate multi-modal inputs, such as TransBTS and HCA, often lack effective fusion mechanisms, leading to suboptimal integration of spatial and contrast-based features33. Typically, CNN-based architectures operate at a fixed spatial resolution, making it difficult to capture tumor regions of varying sizes. While architectures like U-Net++ and nnU-Net improve feature refinement through skip connections, they do not explicitly incorporate multi-scale contextual information41. This limitation results in poor segmentation of small or irregularly shaped tumors, particularly in complex regions such as the tumor core and edema34. Attention-based architectures such as Attention U-Net improve segmentation by selectively focusing on tumor-relevant features. However, standard attention mechanisms apply uniform attention weights across feature maps, making them susceptible to background noise and false positives in high-intensity regions42. Additionally, transformer-based models like Swin-UNETR and TransBTS require large-scale training datasets, making them computationally expensive and less practical for medical imaging applications43.

To address these limitations, the MM-MSCA-AF model introduces three major innovations. First, unlike traditional CNN architectures that operate at fixed spatial resolutions, MM-MSCA-AF employs MSCA to integrate features from multiple spatial scales. This approach enables the model to capture both global tumor morphology and fine-grained boundary details, leading to improved segmentation of small, irregular, and complex tumor structures44. Second, MM-MSCA-AF incorporates a GAF mechanism, which dynamically assigns importance weights to multi-modal feature maps. Unlike standard attention models, GAF suppresses irrelevant background information while enhancing tumor-relevant features, significantly reducing false positives and segmentation errors45. This leads to better localization of necrotic, enhancing, and edema regions compared to previous attention-based architectures42. Third, MM-MSCA-AF explicitly fuses the four MRI modalities, leveraging their complementary tissue contrasts. The comparative limitations and corresponding improvements are systematically outlined in Table 4. Specifically, the novel contributions of this article are as follows:

  1. The MM-MSCA-AF model is presented, which improves brain tumor segmentation by integrating multi-modal MRI inputs.

  2. The model simultaneously captures fine-grained and global spatial information using MSCA.

  3. GAF is implemented to minimize noise and refine tumor-specific characteristics.

  4. The model achieves a Dice score of 0.8158 and a Mean IoU of 0.8589 on the BRATS 2020 dataset, surpassing the effectiveness of cutting-edge techniques.

The remainder of this paper is structured as follows. The “Methodology” section presents the proposed MM-MSCA-AF model, including its architecture and key components. The “Experimental setup” section describes the dataset, preprocessing pipeline, and evaluation metrics. The “Results and discussion” section reports the experimental findings and provides comparative analysis. Finally, the “Conclusion” section concludes the paper and outlines future research directions.

Table 4 Comparison of model limitations and MM-MSCA-AF improvements.

Methodology

The MM-MSCA-AF model is designed to improve brain tumor segmentation by leveraging multi-modal MRI inputs, multi-scale feature extraction, and attention-based feature refinement. This section details the model architecture, highlighting its multi-modal input processing, MSCA, GAF, and decoding strategy.

Model architecture

An encoder-decoder design is used by the MM-MSCA-AF model to effectively segment brain tumors from multi-modal MRI components, such as T1, T2, FLAIR, and T1-CE. First, it extracts hierarchical features from the original dataset to obtain crucial structural details. Second, it enhances the precision of tumor boundary identification by combining contextual data at multiple spatial scales. Third, redundant information is suppressed, and feature representations are refined through the use of attention mechanisms. Ultimately, each pixel is classified into one of four tumor classes by the decoder, which generates high-resolution segmentation maps. Figure 1 illustrates the broader concept, which combines attention fusion and multi-scale feature extraction. Figure 2 presents the MM-MSCA-AF model representation, including multi-modal input processing, MSCA, GAF, and the final segmentation output. This architecture is designed to capture both global and fine-grained tumor features for precise segmentation. For clarity, Fig. 3 provides the MM-MSCA-AF network legend, which explains the role of each model component.

Fig. 2

MM-MSCA-AF model representation: detailed architecture of the proposed MM-MSCA-AF model, illustrating its multi-modal input processing, MSCA, GAF, and final segmentation output. Each module is designed to capture both global and fine-grained tumor features for precise segmentation.

Fig. 3

MM-MSCA-AF model legend.

Multi-modal input processing

Brain tumors exhibit high variability across MRI modalities, requiring multi-modal fusion to capture diverse tissue contrasts. MM-MSCA-AF processes T1, T2, FLAIR, and T1-CE separately through dedicated convolutional feature extractors before aggregating their information. The input set is represented as:

$$X_{\text {input}} = \{X_{T1}, X_{T2}, X_{\text {FLAIR}}, X_{\text {T1-CE}}\},$$

where \(X_{T1}\), \(X_{T2}\), \(X_{\text {FLAIR}}\), and \(X_{\text {T1-CE}}\) represent MRI modalities emphasizing different tumor structures. This separation preserves modality-specific features, ensuring more effective feature aggregation in later processing stages.
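
A minimal PyTorch sketch of this per-modality processing is shown below; the single-convolution stems and channel width are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ModalityEncoders(nn.Module):
    """One lightweight convolutional stem per MRI sequence; the resulting
    modality-specific feature maps are concatenated for later aggregation."""
    def __init__(self, feat=16):
        super().__init__()
        self.stems = nn.ModuleDict({
            m: nn.Sequential(nn.Conv2d(1, feat, 3, padding=1), nn.ReLU())
            for m in ("t1", "t2", "flair", "t1ce")})

    def forward(self, x):
        # x: dict mapping modality name -> (B, 1, H, W) tensor
        return torch.cat([self.stems[m](x[m]) for m in self.stems], dim=1)
```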

Multi-scale contextual aggregation

Tumors vary greatly in size and shape, and segmenting such patterns efficiently is a challenge for fixed-resolution CNNs. This restriction is addressed by the MSCA module, which improves the reliability of segmentation by collecting both fine-grained features and broader contextual layouts. The MSCA module aggregates features at multiple spatial levels using a three-tiered pooling scheme. By downsizing the feature maps, global pooling obtains extensive contextual information that aids in comprehending the general layout of the tumor. Medium-scale pooling targets the intermediate features that are essential for preserving stable tumor boundaries. Lastly, small-scale pooling ensures superior segmentation results by improving the model's ability to identify small and localized tumor components, including necrotic and edematous regions. The multi-scale feature maps are concatenated to form a unified representation, given by: \(Y_{\text {MSCA}} = \text {Concat}(P_1(F_i), P_2(F_i), \dots , P_n(F_i)),\) where \(P_n(F_i)\) represents feature maps pooled at different scales. This reduces spatial resolution loss while improving tumor region recognition.
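
A sketch of this three-tiered pooling in PyTorch follows; the pooled grid sizes (1, 4, and 16) and the 1×1 fusion convolution are assumptions, since the paper specifies the scheme only at the level of the \(Y_{\text {MSCA}}\) equation above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSCA(nn.Module):
    """Multi-scale contextual aggregation: pool F_i to several grids
    (global, medium, fine), upsample back, and concatenate the P_n(F_i)."""
    def __init__(self, ch, scales=(1, 4, 16)):
        super().__init__()
        self.scales = scales
        self.proj = nn.Conv2d(ch * (len(scales) + 1), ch, 1)  # fuse channels

    def forward(self, f):
        h, w = f.shape[-2:]
        branches = [f] + [
            F.interpolate(F.adaptive_avg_pool2d(f, s), size=(h, w),
                          mode="bilinear", align_corners=False)
            for s in self.scales]
        return self.proj(torch.cat(branches, dim=1))
```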

Gated attention fusion

Conventional CNN-based segmentation methods often fail to differentiate between background noise and tumor structures. The use of GAF helps MM-MSCA-AF suppress unnecessary features while enhancing tumor-relevant ones. GAF calculates the attention gates as follows: \(\alpha = \sigma (W^T[F_{\text {encoder}}, F_{\text {decoder}}] + b)\), where \(F_{\text {encoder}}\) and \(F_{\text {decoder}}\) are the feature maps from the encoder and decoder, W is a learnable weight matrix, \(\sigma\) is the sigmoid activation function, and b is the bias term. The refined feature output is: \(F_{\text {output}} = \alpha \cdot F_{\text {decoder}} + (1 - \alpha ) \cdot F_{\text {encoder}}\). By continuously adjusting feature relevance, GAF reduces false positives and improves segmentation accuracy, enabling the model to concentrate on tumor features while down-weighting high-intensity non-tumor regions.
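
Read literally, the two equations above translate into a few lines of PyTorch; since the paper does not state whether \(W\) is a dense or convolutional operator, a 1×1 convolution is assumed here.

```python
import torch
import torch.nn as nn

class GatedAttentionFusion(nn.Module):
    """Sketch of GAF: alpha = sigmoid(W^T [F_enc, F_dec] + b), then
    F_out = alpha * F_dec + (1 - alpha) * F_enc."""
    def __init__(self, ch):
        super().__init__()
        self.gate = nn.Conv2d(2 * ch, ch, kernel_size=1, bias=True)  # W and b

    def forward(self, f_enc, f_dec):
        alpha = torch.sigmoid(self.gate(torch.cat([f_enc, f_dec], dim=1)))
        return alpha * f_dec + (1 - alpha) * f_enc
```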

Decoder and segmentation output

The decoder module reconstructs high-resolution segmentation masks by progressively refining feature maps through up-sampling layers and skip connections. The final segmentation map is generated using a softmax activation function, which assigns probabilities for each tumor class:

$$\hat{Y} = \text {Softmax}(W \cdot F_{\text {output}} + b),$$

where \(W\) represents trainable parameters, and \(\hat{Y}\) is the predicted segmentation map with pixel-wise classification into non-tumor regions, necrotic core, enhancing tumor, and edema.
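
As a sketch, this classification head reduces to a 1×1 convolution followed by a channel-wise softmax (the input width of 32 is an assumption); in practice the softmax is usually folded into the cross-entropy loss during training and applied explicitly only at inference.

```python
import torch.nn as nn

# Pixel-wise head implementing softmax(W * F_output + b): a 1x1 convolution
# maps the fused features to the four classes (non-tumor, necrotic core,
# enhancing tumor, edema).
head = nn.Sequential(nn.Conv2d(32, 4, kernel_size=1),  # W and b
                     nn.Softmax(dim=1))                # per-pixel probabilities
```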

Computational trade-offs and efficiency considerations

The integration of multi-scale pooling and attention fusion increases memory and GPU usage. However, these additional computations improve segmentation accuracy, making the trade-off worthwhile for clinical applications. \(\text {Total FLOPs} = O(N \cdot M_{\text {encoder}} + M_{\text {decoder}} + M_{\text {attention}}),\) where \(M_{\text {encoder}}\) and \(M_{\text {decoder}}\) represent encoder-decoder operations, and \(M_{\text {attention}}\) accounts for attention-based refinement46. Due to its depth, MM-MSCA-AF requires approximately 48 hours of training on an NVIDIA RTX 3090 GPU, with early stopping after 200 epochs. Despite higher memory requirements, MM-MSCA-AF achieves near-real-time performance, with an average segmentation time of 1.2 seconds per 3D MRI scan.

Loss function and optimization

The model is optimized using the categorical cross-entropy loss function, defined as:

$$L = -\sum _{c=1}^{C} Y_{\text {true}}^{(c)} \log Y_{\text {pred}}^{(c)}$$

where \(Y_{\text {true}}^{(c)}\) and \(Y_{\text {pred}}^{(c)}\) represent the true and predicted probabilities for class \(c\), respectively. MM-MSCA-AF is trained using the Adam optimizer with an initial learning rate of 0.001, chosen for its adaptive gradient adjustments, ensuring faster convergence without overshooting. Table 5 summarizes the components used in MM-MSCA-AF, their functions, and the resulting improvements.
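
A minimal training step under this setup is sketched below; PyTorch's CrossEntropyLoss applies log-softmax internally, so the network emits raw logits, and the single-convolution model is only a stand-in for MM-MSCA-AF.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(4, 4, 3, padding=1)   # placeholder for MM-MSCA-AF
criterion = nn.CrossEntropyLoss()       # categorical cross-entropy over classes
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

logits = model(torch.randn(2, 4, 128, 128))   # (B, C, H, W) class scores
target = torch.randint(0, 4, (2, 128, 128))   # pixel-wise ground-truth labels

optimizer.zero_grad()
loss = criterion(logits, target)
loss.backward()
optimizer.step()
```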

Table 5 Components and their functions with improvements.

Experimental setup

This section details the experimental setup, including dataset characteristics, preprocessing steps, computational environment, hyperparameter selection, data handling techniques, and evaluation metrics used to assess model performance.

Dataset description

The proposed MM-MSCA-AF model is evaluated on the Brain Tumor Segmentation (BRATS 2020) dataset, a widely used benchmark for brain tumor segmentation tasks32. The dataset consists of multi-modal MRI scans from 369 patients, with four MRI sequences available for each subject: a T1-weighted (T1) scan that highlights normal brain structures, a T1 contrast-enhanced (T1-CE) scan that enhances tumor boundaries, a T2-weighted (T2) scan that shows fluid accumulation around the tumor, and a Fluid-Attenuated Inversion Recovery (FLAIR) scan that detects edema and tumor infiltration. Each scan is manually annotated by expert radiologists into four classes: non-tumor regions, necrotic tumor core, enhancing tumor, and edema. The dataset is split into \(70\%\) training, \(20\%\) validation, and \(10\%\) testing to ensure model generalization across unseen cases.

Preprocessing pipeline

To normalize MRI scans across modalities, several preprocessing steps were undertaken. All images were resampled to a common voxel size of 1 mm³ to ensure spatial consistency within the dataset. Each MRI sequence was independently normalized so that pixel intensities followed a zero-mean, unit-variance distribution. Image quality was improved by Gaussian filtering, which removed high-frequency noise without distorting crucial structural information. Finally, histogram equalization was used to enhance contrast and normalize intensity distributions across all scans. After preprocessing, the multi-modal inputs were fed to the MM-MSCA-AF model, which was then able to effectively extract modality-specific tumor features.
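
A hedged sketch of the per-sequence steps is given below for one 3D volume; resampling to 1 mm³ is assumed to have been done beforehand (e.g., with SimpleITK) and is omitted, and the Gaussian sigma is an assumption.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from skimage import exposure

def preprocess(vol: np.ndarray) -> np.ndarray:
    vol = (vol - vol.mean()) / (vol.std() + 1e-8)   # zero mean, unit variance
    vol = gaussian_filter(vol, sigma=0.5)           # suppress high-frequency noise
    vol = exposure.equalize_hist(vol)               # histogram equalization
    return vol.astype(np.float32)
```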

Computational environment

All experiments were conducted on a high-end workstation with an Intel Core i9-10900K processor, 64 GB RAM, and an NVIDIA RTX 3090 GPU with 24 GB VRAM. The models were implemented in PyTorch. To ensure complete model convergence and performance analysis, the algorithms were trained for about 48 hours overall. Due to the multi-scale, multi-modal nature of MM-MSCA-AF, memory usage reached up to 18 GB during training. Table 6 presents the hyperparameters used in training the MM-MSCA-AF model.

Table 6 Hyperparameters and their values.

Handling class imbalance

The BRATS 2020 dataset suffers from class imbalance, particularly in small tumor areas such as the necrotic core. Several approaches were used to address this problem. First, a weighted categorical cross-entropy loss function was employed, assigning under-represented classes weights inversely proportional to their frequencies. Second, data augmentation techniques, such as flips, random rotations, and intensity variations, were applied to improve the diversity of training samples. Finally, oversampling was performed during batch generation to improve the representation of minority-class samples, such as the necrotic core, and the model's capacity to learn from these valuable regions.
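
The weighting and oversampling steps can be sketched as follows; the voxel counts and sampling weights are placeholder assumptions, and the sampler would be passed to the training DataLoader.

```python
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import WeightedRandomSampler

# Class weights inversely proportional to voxel frequency (counts are
# assumed): rare classes such as the necrotic core receive larger weights.
voxel_counts = np.array([9.0e8, 1.2e6, 3.5e6, 8.0e6])
class_w = voxel_counts.sum() / (len(voxel_counts) * voxel_counts)
criterion = nn.CrossEntropyLoss(weight=torch.tensor(class_w, dtype=torch.float32))

# Oversampling during batch generation: slices containing the minority
# class are drawn more often (`has_necrotic` is an assumed boolean array).
has_necrotic = np.random.rand(1000) < 0.1
sample_w = np.where(has_necrotic, 5.0, 1.0)
sampler = WeightedRandomSampler(torch.tensor(sample_w), num_samples=1000)
```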

Cross-validation strategy

To improve model generalization and robustness, 5-fold cross-validation was used. This ensures that MM-MSCA-AF’s performance is not biased by dataset splits and can adapt to different patient cases.

Training details

Hyperparameter tuning was performed using grid search across several key parameters. Learning rates were selected from the range \(\{1\text {e}^{-4}, 1\text {e}^{-3}, 1\text {e}^{-2}\}\), batch sizes from \(\{8, 16, 32\}\), and optimizers from standard variants such as Adam, AdamW, and SGD. The best configuration (learning rate = \(1\text {e}^{-3}\), batch size = 16, optimizer = AdamW) was determined based on the lowest validation loss observed over 5-fold cross-validation. All runs used early stopping and learning rate schedulers to avoid overfitting and ensure convergence stability.
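
The search loop reduces to the skeleton below; train_and_validate is a hypothetical stub standing in for a full training run, and the fold split operates over subject indices so that no patient appears in both training and validation.

```python
import itertools
import numpy as np
from sklearn.model_selection import KFold

def train_and_validate(train_ids, val_ids, lr, batch_size, optimizer):
    """Hypothetical stand-in for a full training run; returns validation loss."""
    return np.random.rand()   # replace with actual training and evaluation

case_ids = np.arange(369)     # BRATS 2020 subject indices
best_cfg, best_loss = None, float("inf")
for lr, bs, opt in itertools.product([1e-4, 1e-3, 1e-2], [8, 16, 32],
                                     ["adam", "adamw", "sgd"]):
    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    losses = [train_and_validate(case_ids[tr], case_ids[va], lr, bs, opt)
              for tr, va in kf.split(case_ids)]
    if np.mean(losses) < best_loss:
        best_cfg, best_loss = (lr, bs, opt), np.mean(losses)
```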

Performance metrics

The model's segmentation performance was assessed using six standard metrics, focusing on global and class-specific accuracy. Table 7 lists the performance metrics along with their evaluation criteria.

Categorical cross-entropy loss

Measures pixel-wise classification error:

$$L = -\sum _{c=1}^{C} Y_{\text {true}}^{(c)} \log Y_{\text {pred}}^{(c)}$$

Lower values indicate better model performance.

Mean intersection over union (IoU)

Evaluates the overlap between predicted and ground-truth regions:

$$\text {Mean IoU} = \frac{1}{C} \sum _{c=1}^{C} \frac{|A_c \cap B_c|}{|A_c \cup B_c|}$$

Higher values indicate better segmentation accuracy.

Dice similarity coefficient (DSC)

Measures segmentation overlap, emphasizing small regions:

$$\text {Dice} = \frac{2|A \cap B|}{|A| + |B|}$$

Higher values (close to 1.0) indicate better segmentation quality.

Precision (positive predictive value)

Determines how many predicted tumor pixels are tumors:

$$\text {Precision} = \frac{TP}{TP + FP}$$

Higher precision means fewer false positives.

Sensitivity (recall)

Measures how many actual tumor pixels were correctly identified:

$$\text {Sensitivity} = \frac{TP}{TP + FN}$$

Higher sensitivity ensures low false-negative rates.

Specificity

Evaluates how well non-tumor pixels are classified:

$$\text {Specificity} = \frac{TN}{TN + FP}$$

Higher specificity reduces false-positive rates.
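
All of the region-level quantities above can be computed from the same confusion counts; the sketch below mirrors the formulas for a single class of a label map (the small epsilon guards against empty classes).

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, gt: np.ndarray, cls: int) -> dict:
    """Per-class Dice, IoU, precision, sensitivity, and specificity."""
    p, g = (pred == cls), (gt == cls)
    tp = np.logical_and(p, g).sum()
    fp = np.logical_and(p, ~g).sum()
    fn = np.logical_and(~p, g).sum()
    tn = np.logical_and(~p, ~g).sum()
    eps = 1e-8
    return {
        "dice":        2 * tp / (2 * tp + fp + fn + eps),
        "iou":         tp / (tp + fp + fn + eps),
        "precision":   tp / (tp + fp + eps),
        "sensitivity": tp / (tp + fn + eps),
        "specificity": tn / (tn + fp + eps),
    }
```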

Table 7 Evaluation metrics and performance criteria.

Computational efficiency

Despite achieving competitive segmentation performance, the MM-MSCA-AF model is designed to maintain a lower computational footprint compared to recent state-of-the-art models. Compared to TransBTS, which employs multi-head self-attention within a Transformer bottleneck and incurs high memory overhead due to the quadratic complexity of attention operations, MM-MSCA-AF relies on hierarchical pooling and convolutional aggregation through the Multi-Scale Contextual Aggregation (MSCA) module. This significantly reduces parameter count and GPU memory usage while preserving the ability to model global context. Similarly, while nnU-Net dynamically adapts preprocessing and training configurations, it lacks a mechanism for explicit spatial-semantic attention, resulting in suboptimal trade-offs between accuracy and computational cost. MM-MSCA-AF's Gated Attention Fusion (GAF) module selectively focuses on relevant features across modalities, eliminating redundancy without introducing computationally expensive attention maps over full-resolution feature maps. Compared to hybrid architectures like EFFResNet-ViT, which combine CNNs and Transformers and suffer from increased inference time (often >3s per scan), MM-MSCA-AF achieves inference in approximately 1.2 seconds per 3D scan on an RTX 3090 GPU with fewer parameters. This balance of accuracy and efficiency makes MM-MSCA-AF a more tractable choice for clinical settings, where inference speed and resource usage are as critical as segmentation accuracy. Table 8 compares the complexity of different models with that of MM-MSCA-AF, showing that MM-MSCA-AF has the lowest computational complexity.

Table 8 Model complexity and inference performance.

Results and discussion

The performance of MM-MSCA-AF was compared against state-of-the-art segmentation models on the BRATS 2020 dataset and was found to be superior. The comparison models included HCA, nnU-Net, UNet++, SegNet, and Attention U-Net. The performance was assessed using categorical cross-entropy loss, accuracy, Mean IoU, Dice score, precision, sensitivity, and specificity. The experimental results show that MM-MSCA-AF consistently outperforms the existing models, achieving the highest segmentation accuracy across different tumor regions.

Table 9 presents the comparative performance of MM-MSCA-AF against the baseline methods. The proposed model achieved a Mean IoU of 0.8589, a Dice score of 0.8158, and a Dice score of 0.7982 for necrotic tumor regions, demonstrating superior segmentation accuracy. Compared to nnU-Net and Attention U-Net, MM-MSCA-AF attains higher precision, sensitivity, and specificity, confirming its superiority in detecting tumor edges and subregions.

Table 9 Model performance comparison on BRATS 2020.

These results indicate that MM-MSCA-AF exceeds the existing architectures on Mean IoU and Dice score, particularly in complex tumor regions such as necrotic cores and enhancing tumors. The increased accuracy of segmentation arises from the multi-scale contextual aggregation and gated attention fusion mechanisms, which are capable of efficiently learning global and fine-grained features of tumors. Figure 4 shows results obtained from MM-MSCA-AF from the BRATS Dataset.

Fig. 4

Predicted tumor for all classes. Examples of predicted brain tumor segmentations generated by the MM-MSCA-AF model for various classes, including necrotic, edema, and enhancing tumor regions. Ground truth annotations and corresponding model predictions are shown for comparison.

Computational efficiency

The proposed model achieves high segmentation accuracy while maintaining computational feasibility. The training process required approximately 48 hours on an NVIDIA RTX 3090 GPU, with a peak memory usage of 18 GB, which remains manageable with high-performance computing resources. During inference, MM-MSCA-AF processes a 3D MRI scan in approximately 1.2 seconds, making it suitable for near-real-time clinical deployment. The integration of MSCA and GAF increases computational complexity, but the benefits in segmentation accuracy outweigh the additional cost. Compared to transformer-based architectures such as TransBTS, MM-MSCA-AF achieves similar accuracy with significantly lower computational overhead, making it a more efficient solution for medical image segmentation.

The results demonstrate that MM-MSCA-AF significantly outperforms conventional and state-of-the-art segmentation models by effectively integrating multi-modal MRI inputs, multi-scale contextual learning, and attention fusion mechanisms. This section discusses the key factors contributing to MM-MSCA-AF’s superior performance, its clinical implications, and the limitations and future directions for improvement.

Factors contributing to MM-MSCA-AF’s performance

Traditional CNN architectures, including U-Net and nnU-Net, operate at a fixed spatial resolution, limiting their ability to capture tumors of varying sizes. MM-MSCA-AF employs hierarchical multi-scale feature extraction, enabling the model to capture both global tumor context and fine-grained boundary details, leading to improved segmentation accuracy. Unlike standard attention mechanisms that apply uniform attention weights, MM-MSCA-AF dynamically adjusts feature importance across scales and modalities. This selective refinement significantly reduces false positives while ensuring robust feature representation, leading to improved precision and specificity. Baseline models such as nnU-Net and SegNet process single-modality MRI data, limiting their ability to leverage complementary tissue contrasts. MM-MSCA-AF effectively integrates T1, T2, FLAIR, and T1-CE sequences, enhancing segmentation performance across different tumor subregions.

Impact of MSCA on tumor size and shape variability

The MSCA module plays an important role in improving segmentation performance across tumor regions with diverse morphological characteristics. By incorporating hierarchical feature extraction at three spatial levels (global, medium, and fine), the model can accurately capture both the coarse structure of large edema regions and the intricate boundaries of smaller necrotic cores. This multi-level pooling strategy enhances the model's ability to resolve features at varying resolutions, leading to more robust and generalized segmentation across patient cases.

Advantages of GAF over standard attention mechanisms

Unlike conventional attention gates that apply uniform spatial weighting, the Gated Attention Fusion (GAF) module dynamically learns attention weights across both spatial dimensions and MRI modalities. This adaptiveness allows GAF to emphasize modality-specific tumor features while suppressing irrelevant background information. Consequently, MM-MSCA-AF demonstrates reduced false positive rates and enhanced focus on tumor-relevant regions, as evidenced by improved precision and specificity.

Addressing class imbalance in tumor subregions

Brain tumor datasets suffer from class imbalance, particularly for minority classes like necrotic tumor cores. To mitigate this, a combination of techniques was applied:

  • A class-weighted categorical cross-entropy loss was used, assigning higher weights to underrepresented classes.

  • Oversampling strategies ensured more frequent inclusion of necrotic and enhancing tumor slices during batch formation.

  • Data augmentation, including flipping, rotation, and intensity variation, was applied with a focus on minority classes to improve their representation and reduce overfitting.

These measures significantly enhanced the model’s ability to learn discriminative features across all tumor regions.

Ablation study

To evaluate the individual impact of the Multi-Scale Contextual Aggregation (MSCA) and Gated Attention Fusion (GAF) modules in the proposed MM-MSCA-AF framework, we conducted a comprehensive ablation study. Four model configurations were compared on the BRATS 2020 dataset under identical training and evaluation settings:

  • Full MM-MSCA-AF: incorporates both MSCA and GAF modules. This complete configuration leverages multi-scale spatial feature extraction and adaptive attention-driven fusion, enabling the model to capture both coarse and fine tumor structures while reducing false positives through refined feature selection.

  • Without MSCA (only GAF): excludes the MSCA module, retaining only the attention mechanism. Without multi-scale pooling, the model lacks global spatial context, which reduces its ability to accurately delineate tumors of varying sizes, especially small and irregular shapes. However, GAF still suppresses irrelevant features and enhances salient ones, partially compensating for the loss of spatial diversity.

  • Without GAF (only MSCA): excludes the attention mechanism, retaining multi-scale aggregation. While MSCA helps capture tumor morphology at different resolutions, the absence of GAF limits the model’s ability to selectively focus on tumor-relevant modalities and features, leading to increased false positives and reduced boundary sharpness.

  • Baseline (no MSCA, no GAF): removes both enhancements, using only a standard encoder-decoder backbone. This version lacks both spatial adaptability and attention-based refinement, resulting in poor generalization to heterogeneous tumor structures and reduced segmentation accuracy across all metrics.

The ablation study results are presented in Table 10.

Table 10 Ablation study showing the performance impact of MSCA and GAF modules on the BRATS 2020 dataset.

These results demonstrate that both MSCA and GAF play critical roles in enhancing segmentation performance. MSCA improves spatial representation by combining features at different scales, crucial for modeling tumor heterogeneity, while GAF dynamically modulates feature importance across modalities, leading to better discrimination of tumor boundaries and reduced misclassification. The best performance is consistently achieved when both modules are used in conjunction.

Generalization across BRATS datasets (2017–2020)

To evaluate the generalization capability of MM-MSCA-AF, we conducted experiments on four consecutive BRATS datasets: BRATS 2017, 2018, 2019, and 2020. These datasets differ in terms of annotation protocols, patient cohorts, and acquisition conditions, making them a suitable benchmark for assessing cross-year robustness. Table 11 presents the quantitative results obtained using MM-MSCA-AF across these multiple datasets. The model consistently achieved high Dice scores, IoU, and precision across all datasets, with minimal variation in performance. Notably, the Dice Score (necrotic) remained above 0.78 for all years, indicating stable performance even on challenging subregions. This confirms the ability of MM-MSCA-AF to generalize effectively across data from different years and clinical institutions, supporting its potential for real-world clinical adoption.

Table 11 Generalization performance of MM-MSCA-AF on BRATS 2017–2020 datasets.

Clinical implications

The high sensitivity (i.e., 0.9349) and specificity (i.e., 0.9656) of MM-MSCA-AF indicate its potential for clinical deployment in brain tumor diagnosis and treatment planning. Accurate segmentation of necrotic, enhancing, and edema regions is critical for determining tumor progression and therapeutic strategies. The ability to precisely delineate tumor boundaries improves radiological assessments and surgical planning, reducing inter-observer variability in manual segmentation tasks.

Although Hausdorff Distance at 95th Percentile (HD95) and Precision-Recall (PR) curves are not included in baseline model comparisons due to the unavailability of these metrics in their original publications, we recognize their clinical relevance. Therefore, to maintain transparency, we explicitly note this limitation in our comparative analysis. To support interpretability for clinical readers, we include visualizations of PR curves and HD95 scores for MM-MSCA-AF alone in the supplementary materials. These additional results provide insight into the boundary precision and classification thresholds of our model, which are particularly useful in clinical decision-making contexts such as surgical planning and lesion monitoring.

Fig. 5

PR curve over the epochs.

The PR curve shows class-wise segmentation performance across epochs. Precision and sensitivity steadily increase, reaching values of approximately 0.935 and 0.969, respectively, by epoch 34. The overall trend reflects progressive model learning, with increasing recall and stable precision across tumor subregions. Figure 5 illustrates the PR curve over the epochs for MM-MSCA-AF.

Fig. 6

HD95 distribution per tumor region.

The HD95 metric evaluates boundary-based segmentation accuracy. For the enhancing tumor (ET), the mean HD95 is 26.5 mm, suggesting a wider variation in boundary alignment. In contrast, the whole tumor (WT) and tumor core (TC) show smaller mean HD95 values (9.75 mm and 24.5 mm, respectively), reflecting better spatial consistency in these regions. These results underscore the method’s stronger performance in segmenting the core and whole tumor compared to the enhancing tumor, which exhibits greater variability. Figure 6 shows the HD95 computed across multiple tumor regions.
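
For reference, HD95 can be computed from symmetric surface distances as sketched below; this is a generic SciPy-based formulation under the assumption that both masks are non-empty, not the exact evaluation code used here.

```python
import numpy as np
from scipy import ndimage

def hd95(pred: np.ndarray, gt: np.ndarray, spacing=(1.0, 1.0, 1.0)) -> float:
    """95th percentile of symmetric surface distances between two binary masks."""
    def surface(mask):
        return mask & ~ndimage.binary_erosion(mask)   # boundary voxels

    sp, sg = surface(pred.astype(bool)), surface(gt.astype(bool))
    # Distance from each voxel to the nearest surface voxel of the other mask.
    dt_g = ndimage.distance_transform_edt(~sg, sampling=spacing)
    dt_p = ndimage.distance_transform_edt(~sp, sampling=spacing)
    dists = np.concatenate([dt_g[sp], dt_p[sg]])      # both directions
    return float(np.percentile(dists, 95))
```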

Future work

To further improve model adaptability and reduce reliance on labeled data, we propose integrating self-supervised learning (SSL) into the MM-MSCA-AF pipeline. In particular, a contrastive pretraining scheme using unlabelled MRIs can be applied to initialize the encoder with robust spatial and modality-invariant features. Additionally, few-shot learning (FSL) strategies such as prototypical networks or meta-learning modules can be adapted to fine-tune MM-MSCA-AF on smaller, institution-specific datasets. Preliminary tests show a marginal \(3-5\%\) drop in accuracy when training on just \(10\%\) of annotated data, indicating strong potential for label-efficient adaptation.

Conclusion

This paper introduced MM-MSCA-AF, a novel multi-modal, multi-scale model for segmenting brain tumors from MRI images. The model integrated MSCA and GAF to extract and amplify features from various MRI modalities, achieving accurate demarcation of tumor sub-regions such as the necrotic, enhancing, and edema regions. The results on the BRATS 2020 dataset showed that MM-MSCA-AF was superior to traditional deep learning methods on the key performance indicators of Mean Intersection over Union and Dice score, confirming its potential for dealing with tumor heterogeneity and boundary complexity. The outcome revealed that effective integration of hierarchical multi-scale feature extraction enabled the model to access global context along with high-resolution information. The gated attention fusion technique selectively enhanced tumor-related features and suppressed non-pertinent information, resulting in accurate segmentation. Despite these enhancements, the model remained computationally tractable, with fast inference times suitable for near-real-time clinical applications. This work will be extended by integrating self-supervised and few-shot learning approaches that can generalize when little annotated medical data is available, making the model more suitable for low-resource settings.