Introduction

A brain tumor is an abnormal growth of cells within the brain or the central spinal canal. These cells multiply at an extremely rapid rate, forming a mass that can disrupt normal brain function. Brain tumor segmentation is the task of locating and labeling a brain tumor's precise position and boundaries in medical images, particularly magnetic resonance imaging (MRI) scans. Depending on the type of scan, such as T1, T2, FLAIR, or T1-CE, tumors appear with distinct contrasts in these detailed images of the head. In these scans, segmentation refers to separating malignant or tumorous tissue from normal brain structure and, in many cases, further dividing the tumor into distinct areas, such as the core, the active edge, and the surrounding swelling1,2.

Three main areas are typically distinguished in brain tumor examination, especially on medical imaging: the core, the active edge, and the surrounding edema3. The core of the tumor, typically called the necrotic core, is composed of tissue that is dead or no longer functioning. This necrosis generally occurs when the tumor outgrows its blood supply, causing central cell death. The active edge, sometimes referred to as the enhancing tumor region, surrounds the core and denotes the region where the malignancy continues to spread actively and encroach on neighboring tissue. This area is frequently the main focus of surgical and medical treatment since it shows up clearly on contrast-enhanced images4. The peripheral swelling, or edema, located beyond the active edge, reflects the tumor's presence but does not itself consist of cancerous cells. This swelling has a significant impact on treatment planning and symptom management because it compresses nearby brain structures and worsens symptoms.

The primary technique for segmenting brain tumors is MRI, which offers remarkable soft-tissue contrast and enables the integration of several imaging sequences that emphasize distinct tissue properties. In contrast to CT scans5, which perform better at visualizing bone, MRI makes it easier to discern the differences among the tumor's center, its actively growing areas, and any adjacent swelling. MRI also spares patients from radiation exposure, making it preferable for repeated use over an extended period6. Each MRI scan category, including T1, T2, FLAIR, and T1 with contrast enhancement, provides distinct visual information that aids in precisely distinguishing different tumor segments. Alternative imaging procedures, such as CT or PET scans, can be employed in specific situations, but they either fail to delineate the intricate layers of brain malignancies or lack the soft-tissue discrimination of MRI7. Thus, for a thorough, safe, and multifaceted evaluation of brain tumors, MRI remains the standard of excellence.

Each scan category provides a different structural view. On T1-weighted MRI, which produces detailed structural images, cerebrospinal fluid (CSF) and fat appear dark8; T1 is mostly employed to depict typical brain anatomy and provides a reference against which other sequences can be compared. T2-weighted MRI, on the other hand, emphasizes fluid, so areas of edema, inflammation, or malignancy appear as bright spots; T2 is therefore very helpful in detecting anomalies such as infection and malignancy. A modified T2 sequence known as FLAIR (Fluid Attenuated Inversion Recovery) suppresses signal from free fluid such as CSF. Eliminating the bright underlying CSF signal improves the visibility of lesions close to the brain's ventricles or other fluid-filled spaces. In T1 with contrast enhancement (T1-CE), a contrast agent, usually gadolinium, is injected; it accumulates where the blood-brain barrier is compromised, which is frequently the case in aggressive or proliferative tumor regions. Effective evaluation and therapy planning depend on this sequence since it aids in locating and defining the tumor's actively enhancing regions9,10. The complementary information these sequences offer when combined is essential for a thorough evaluation of brain tumors.

Brain tumor segmentation from MRI is a critical task in medical image analysis, facilitating early diagnosis, treatment planning, and disease monitoring11,12. However, the inherent complexity of brain tumors, including variations in size, shape, and location across different MRI modalities (T1, T2, FLAIR, and T1-CE), poses significant challenges13,14,15. Traditional segmentation methods, such as thresholding, region-based approaches, and classical machine learning models, often struggle with heterogeneous tumor structures and modality-specific variations. While deep learning-based models like U-Net, nnU-Net, and Attention U-Net have improved segmentation accuracy, they still face limitations in capturing multi-scale spatial features and suppressing irrelevant background noise. Additionally, manual segmentation by radiologists remains time-intensive, subjective, and prone to inter-observer variability.

To surmount these challenges, we propose a Multi-Modal Multi-Scale Contextual Aggregation with Attention Fusion (MM-MSCA-AF) model for brain tumor segmentation. The proposed model integrates multi-modal MRI inputs to capture complementary tissue contrasts and employs Multi-Scale Contextual Aggregation (MSCA) to learn both global and fine-grained spatial features. A gated attention fusion (GAF) mechanism is introduced to selectively enhance tumor-specific features and suppress noise, improving segmentation performance in necrotic, enhancing, and edema regions. The proposed framework is evaluated on the BRATS 2020 dataset, as illustrated in Fig. 1, and achieves a Dice score of 0.8158 (0.7982 for necrotic tumor regions) and a Mean IoU of 0.8589, outperforming state-of-the-art models.

Fig. 1

BRATS dataset and its interpretation: visual representation of the BRATS 2020 dataset used for brain tumor segmentation, highlighting different MRI modalities (T1, T2, FLAIR, T1-CE) and corresponding tumor annotations, including necrotic, edema, and enhancing regions.

Related work

Traditional approaches for brain tumor segmentation

Initial techniques for brain tumor segmentation were largely thresholding approaches, e.g., Otsu's method, wherein an optimal intensity threshold is calculated to separate tumor from non-tumor tissue. The technique is simple and computationally efficient, but it does not generalize across MRI modalities owing to tumor intensity variations, patient-dependent acquisition parameters, and noise in medical images. Region-based methods, including region growing and watershed segmentation, attempted to move beyond thresholding by considering spatial relationships between neighboring pixels. Region growing starts from a seed point and expands the region based on intensity similarity, but it is extremely sensitive to seed position and noise. Similarly, watershed segmentation is effective at delineating tumor boundaries but suffers from over-segmentation when applied to MRI images. Beyond their simplicity, thresholding and region-based methods cannot manage tumor heterogeneity, where different regions of the tumor exhibit disparate intensity values. Furthermore, MRI artifacts such as intensity non-uniformity and partial volume effects also undermine segmentation performance, making these traditional techniques unsuitable for complex medical imaging tasks16,17.
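
To make the classical pipeline concrete, the following is a minimal sketch of threshold-based segmentation on a single 2D slice, using scikit-image's Otsu implementation; the input slice is a random placeholder rather than real BRATS data, and real tumors rarely separate this cleanly.

```python
# Minimal sketch of classical threshold-based segmentation on one 2D MRI
# slice. `slice_2d` is a placeholder; in practice it would be loaded from
# a NIfTI volume.
import numpy as np
from skimage.filters import threshold_otsu
from skimage.measure import label

slice_2d = np.random.rand(240, 240)   # stand-in for a real FLAIR slice

t = threshold_otsu(slice_2d)          # global intensity threshold
mask = slice_2d > t                   # binary tumor/non-tumor hypothesis
components = label(mask)              # connected components for post-processing
```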

Classical machine learning algorithms, such as Support Vector Machines (SVMs)18, Random Forests (RFs), and k-Nearest Neighbors (KNN), were subsequently introduced to improve brain tumor segmentation. These models rely on handcrafted feature extraction, including texture, shape, and intensity-based descriptors, to classify tumor regions. For instance, Gabor filters and the histogram of oriented gradients (HOG) were used to capture local intensity variations and edge information in MRI scans19. These machine learning-based methods performed better than thresholding techniques, but they require extensive feature engineering and struggle to generalize across different datasets. SVMs, for example, have been widely used due to their robustness in handling high-dimensional data, but they are computationally expensive for large MRI datasets and require careful kernel selection for optimal performance20. Similarly, Random Forests, which aggregate multiple decision trees, can mitigate overfitting but still suffer from feature redundancy and class imbalance issues when segmenting tumor regions21. One of the main limitations of classical machine learning techniques is their inability to capture hierarchical spatial dependencies in MRI images. Since these models operate on manually extracted features, they lack the end-to-end learning capability of modern deep learning architectures22. As a result, classical approaches are gradually being replaced by deep convolutional neural networks (CNNs), which automatically learn spatial and contextual features from raw MRI data. To better understand the limitations of traditional approaches, Table 1 provides a comparative analysis of thresholding, region-based methods, and classical machine learning techniques for brain tumor segmentation.
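
The following sketch illustrates this handcrafted-feature workflow with HOG descriptors and a Random Forest; the patches, labels, and parameter choices are placeholder assumptions, not those of the cited studies.

```python
# Hedged sketch of the classical feature-engineering pipeline: one HOG
# descriptor per image patch, classified with a Random Forest.
import numpy as np
from skimage.feature import hog
from sklearn.ensemble import RandomForestClassifier

patches = np.random.rand(200, 32, 32)    # placeholder MRI patches
labels = np.random.randint(0, 2, 200)    # 1 = tumor patch, 0 = background

X = np.array([hog(p, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
              for p in patches])         # handcrafted descriptors
clf = RandomForestClassifier(n_estimators=100).fit(X, labels)
```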

Table 1 Comparison of brain tumor segmentation methods.
Table 2 Comparison of U-Net-based brain tumor segmentation models.

Deep learning-based approaches

One of the most widely used deep learning architectures for biomedical image segmentation is U-Net, a fully convolutional network (FCN) introduced by Ronneberger et al.23,24. The architecture follows an encoder-decoder structure, where the contracting path extracts hierarchical feature representations, and the expanding path progressively refines spatial details using skip connections. These skip connections allow the network to recover fine-grained structural details, making the U-Net highly effective for medical imaging tasks. Despite its success, U-Net has notable limitations when applied to brain tumor segmentation. It struggles to accurately segment small tumor regions, particularly necrotic cores and enhancing tumor areas, due to limited spatial awareness in deeper layers25. Moreover, U-Net lacks an explicit attention mechanism, making it susceptible to background noise and less effective in highlighting tumor boundaries26.
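
The encoder-decoder-with-skip pattern can be summarized in a compact PyTorch sketch; the depth and channel widths below are illustrative assumptions, not the configuration of any cited model.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU())

class TinyUNet(nn.Module):
    """Two-level U-Net: the skip connection carries encoder detail to the decoder."""
    def __init__(self, in_ch=4, n_classes=4):
        super().__init__()
        self.enc = conv_block(in_ch, 32)
        self.down = nn.MaxPool2d(2)
        self.mid = conv_block(32, 64)
        self.up = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec = conv_block(64, 32)      # 64 = 32 (skip) + 32 (upsampled)
        self.head = nn.Conv2d(32, n_classes, 1)

    def forward(self, x):
        e = self.enc(x)                              # contracting path
        m = self.mid(self.down(e))                   # bottleneck
        d = self.dec(torch.cat([self.up(m), e], 1))  # skip connection
        return self.head(d)                          # per-pixel class logits
```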

To overcome this susceptibility to background noise, the proposed MM-MSCA-AF model introduces a GAF mechanism, which selectively enhances tumor-relevant features while suppressing irrelevant information. Furthermore, MSCA allows MM-MSCA-AF to better capture global and local contextual details, significantly improving segmentation accuracy over the standard U-Net. U-Net++ is an extension of U-Net that incorporates nested and dense skip connections to refine feature representations27 by introducing intermediate convolutional layers between the encoder and decoder paths. U-Net++ improves gradient flow, allowing for better preservation of fine details during segmentation. These modifications enhance feature learning, particularly for complex and heterogeneous structures such as brain tumors. However, U-Net++ introduces higher computational complexity due to the increased number of convolutional operations within the skip pathways28. This additional processing increases training time and memory requirements, making the model less efficient for large-scale datasets such as BRATS 2020. In contrast to U-Net++, MM-MSCA-AF leverages hierarchical multi-scale contextual aggregation, which provides a more efficient way to integrate both low-level and high-level features. Instead of relying solely on dense connections, MM-MSCA-AF explicitly processes features at multiple scales, improving the segmentation of varying tumor sizes without adding excessive computational overhead.

The nnU-Net (no-new U-Net) is an automated segmentation framework that dynamically adapts the network architecture, preprocessing, and postprocessing based on the characteristics of the dataset29. Unlike standard U-Net variants, nnU-Net optimizes hyperparameters, normalizes input data, and fine-tunes model depth without requiring manual intervention. This self-configuring approach has made nnU-Net one of the most robust baseline models for medical image segmentation, delivering strong performance. However, it does not explicitly incorporate multi-modal fusion, limiting its effectiveness for multi-sequence MRI analysis30. Brain tumors exhibit distinct contrasts across MRI modalities (T1, T2, FLAIR, T1-CE), and ignoring this complementary information can lead to suboptimal segmentation performance. MM-MSCA-AF addresses this limitation by explicitly fusing information from multiple MRI sequences, enhancing feature representation across different tissue contrasts. The use of multi-modal integration and contextual attention mechanisms allows MM-MSCA-AF to better delineate tumor boundaries, particularly in challenging cases where tumor regions blend with surrounding tissues.

Attention U-Net extends the original U-Net architecture by introducing attention gates (AGs) to selectively focus on salient tumor features while suppressing irrelevant background regions31. These gates learn to assign higher weights to important anatomical structures, enhancing segmentation performance, particularly in medical imaging tasks where contrast variations are subtle. While attention mechanisms improve localization accuracy, standard Attention U-Nets are computationally expensive, as they require additional multiplicative operations at multiple feature scales31. Furthermore, conventional attention mechanisms may still fail to distinguish between tumor regions and other high-intensity areas, leading to false positives in segmentation results.
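
The gating idea can be rendered schematically in PyTorch as below; the published Attention U-Net additionally resamples the gating signal and varies channel counts per scale, which this sketch omits.

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Additive attention gate in the spirit of Attention U-Net: the decoder
    signal g re-weights the encoder skip features x before concatenation."""
    def __init__(self, ch):
        super().__init__()
        self.w_g = nn.Conv2d(ch, ch, 1)   # project gating signal
        self.w_x = nn.Conv2d(ch, ch, 1)   # project skip features
        self.psi = nn.Conv2d(ch, 1, 1)    # collapse to one attention map

    def forward(self, x, g):
        a = torch.sigmoid(self.psi(torch.relu(self.w_g(g) + self.w_x(x))))
        return x * a   # suppress background, keep salient skip features
```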

To address the aforementioned challenges, MM-MSCA-AF introduces GAF, which dynamically adjusts feature importance across different MRI modalities and spatial scales. Unlike standard attention gates, GAF learns adaptive attention weights, allowing the model to selectively enhance critical features while maintaining computational efficiency. This leads to superior segmentation accuracy with reduced false positive rates compared to Attention U-Net. To summarize the advantages of MM-MSCA-AF over existing models, Table 2 presents a comparative analysis of the deep learning approaches discussed above.

Hybrid and transformer-based approaches

SegNet is an encoder-decoder-based deep learning architecture originally designed for semantic segmentation32. It follows a similar structure to U-Net, where the encoder extracts hierarchical feature representations and the decoder progressively reconstructs segmentation masks. However, unlike U-Net, SegNet does not incorporate skip connections, relying instead on pooling indices from the encoder to retain spatial information33. This reduces memory usage and makes SegNet computationally efficient, making it a viable option for medical image segmentation. Although effective, SegNet has serious limitations in tumor segmentation. Due to the absence of advanced contextual aggregation mechanisms, it struggles to model both global and local tumor information, particularly in heterogeneous brain tumors. In addition, the absence of explicit attention mechanisms makes SegNet less effective at distinguishing between tumor boundaries and background noise, leading to false positives in complex medical images24. In comparison, MM-MSCA-AF overcomes these restrictions by introducing MSCA, which captures both fine-grained spatial details and overall tumor shape. In addition, the GAF mechanism enables feature refinement, making MM-MSCA-AF more effective than SegNet at segmenting tumor regions.

Hierarchical Context Aggregation (HCA) networks strive to improve spatial information aggregation by combining multi-level feature representations34. Pyramid pooling and hierarchical feature extraction are used in these networks, which enable better segmentation of objects of various structures and sizes33. HCA has been used effectively in high-resolution medical imaging applications where global and local information must be preserved34. However, HCA-based models lack dedicated attention mechanisms, limiting their ability to selectively focus on tumor-relevant features and disregard background noise29. Since brain tumor segmentation requires fine-grained feature selection, HCA models can mislabel small tumor regions or fail to differentiate between tumor subtypes.

The MM-MSCA-AF model builds upon HCA principles by integrating MSCA, which not only aggregates hierarchical spatial features but also incorporates attention-based fusion through GAF. This allows MM-MSCA-AF to selectively enhance important tumor features, improving segmentation performance without increasing computational complexity.

Table 3 Comparison of hybrid and transformer-based brain tumor segmentation models.

Transformer-based architectures, such as Swin-UNETR and TransBTS, have recently gained popularity in medical image segmentation due to their ability to model long-range dependencies and spatial relationships35. Unlike CNNs, which rely on localized convolutional operations, Vision Transformers (ViTs) process entire image patches, capturing global contextual information more effectively35. Swin-UNETR combines Swin Transformers with an encoder-decoder U-Net structure, enabling hierarchical feature extraction while preserving spatial consistency. TransBTS integrates Transformer layers with CNN backbones, achieving state-of-the-art performance in brain tumor segmentation by leveraging both local and global feature representations22.

Recent works have proposed refined encoder-decoder architectures and hybrid CNN-Transformer networks to overcome limitations in conventional U-Net-based models. For instance, DCSSGA-UNet integrates DenseNet201 as the encoder and introduces both channel spatial attention and semantic guidance attention modules, which enhance feature discrimination and reduce semantic gaps during decoding, resulting in state-of-the-art performance on multiple medical image datasets36. MAGRes-UNet addresses the limitations in contextual modeling by incorporating multi-attention gates and residual blocks, effectively capturing small-scale tumor features and promoting better gradient flow during training37. To balance local detail extraction with global context, EFFResNet-ViT fuses CNN backbones (EfficientNet-B0, ResNet-50) with a Vision Transformer module, improving classification accuracy and interpretability through Grad-CAM and t-SNE visualization38. In another direction, a YOLOv8-based pipeline for elbow fracture detection combines region-of-interest localization with advanced CNN and transformer variants (ResNet, SeResNet, ViT), and demonstrates the effectiveness of attention-guided ROI refinement and enhancement techniques in orthopedic diagnosis39. These approaches exemplify recent trends in leveraging attention, residual learning, and transformer-CNN hybrids to improve medical image segmentation and classification.

TransBTS is a hybrid brain tumor segmentation framework that integrates CNN-based feature extraction with a Transformer module to capture long-range dependencies and contextual relationships. It uses a ResNet encoder to learn spatial features, followed by a Bottleneck Transformer that operates in the latent space to model global context. This design improves segmentation accuracy in regions where conventional CNNs struggle due to limited receptive fields. The decoder mirrors U-Net-style skip connections for spatial detail recovery. While TransBTS demonstrates strong performance on BRATS datasets, its computational overhead is higher than purely CNN-based models, limiting its efficiency in real-time or resource-constrained clinical environments.

Despite their advantages, transformer-based models require large-scale training datasets to generalize effectively. Medical imaging datasets, including BRATS 2020, are relatively small compared to datasets used for natural image processing, which limits the effectiveness of pure transformer-based models40. Additionally, transformers introduce higher computational costs, making them less practical for real-time clinical applications40.

In comparison, MM-MSCA-AF balances computational efficiency and accuracy by combining CNN-based multi-scale feature extraction with attention-based refinement, making it less resource-intensive than transformers while still achieving high segmentation precision. To highlight the differences between MM-MSCA-AF and existing hybrid and transformer-based models, Table 3 presents a comparative analysis.

Research gaps and motivation for MM-MSCA-AF

Brain tumor segmentation remains a challenging task due to the complex morphological variations of tumors and the need for high-precision delineation. Traditional and deep learning-based models have improved segmentation performance, but they still exhibit significant limitations. Conventionally, CNN-based architectures, including U-Net, nnU-Net, and SegNet, process single-modal MRI scans, limiting their ability to leverage complementary information from different MRI sequences (T1, T2, FLAIR, and T1-CE). Each modality highlights distinct tissue contrasts, making multi-modal fusion essential for accurately segmenting tumor regions40. Models that do incorporate multi-modal inputs, such as TransBTS and HCA, often lack effective fusion mechanisms, leading to suboptimal integration of spatial and contrast-based features33. Typically, CNN-based architectures operate at a fixed spatial resolution, making it difficult to capture tumor regions of varying sizes. While architectures like U-Net++ and nnU-Net improve feature refinement through skip connections, they do not explicitly incorporate multi-scale contextual information41. This limitation results in poor segmentation of small or irregularly shaped tumors, particularly in complex regions such as the tumor core and edema34. Attention-based architectures such as Attention U-Net improve segmentation by selectively focusing on tumor-relevant features. However, standard attention mechanisms apply uniform attention weights across feature maps, making them susceptible to background noise and false positives in high-intensity regions42. Additionally, transformer-based models like Swin-UNETR and TransBTS require large-scale training datasets, making them computationally expensive and less practical for medical imaging applications43.

To address these limitations, the MM-MSCA-AF model introduces three major innovations. First, unlike traditional CNN architectures that operate at fixed spatial resolutions, MM-MSCA-AF employs MSCA to integrate features from multiple spatial scales. This approach enables the model to capture both global tumor morphology and fine-grained boundary details, leading to improved segmentation of small, irregular, and complex tumor structures44. Second, MM-MSCA-AF incorporates a GAF mechanism, which dynamically assigns importance weights to multi-modal feature maps. Unlike standard attention models, GAF suppresses irrelevant background information while enhancing tumor-relevant features, significantly reducing false positives and segmentation errors45. This leads to better localization of necrotic, enhancing, and edema regions compared to previous attention-based architectures42. Third, MM-MSCA-AF explicitly fuses the four MRI modalities, leveraging their complementary tissue contrasts. The comparative limitations and corresponding improvements are systematically outlined in Table 4. Specifically, the novel contributions of this article are as follows:

  1. The MM-MSCA-AF model is presented, which improves brain tumor segmentation by integrating multi-modal MRI inputs.

  2. The model simultaneously captures fine-grained and global spatial information using MSCA.

  3. GAF is implemented to minimize noise and refine tumor-specific characteristics.

  4. The model achieves a Dice score of 0.8158 and a Mean IoU of 0.8589 on the BRATS 2020 dataset, surpassing the effectiveness of cutting-edge techniques.

The remainder of this paper is structured as follows. The “Methodology” section presents the proposed MM-MSCA-AF model, including its architecture and key components. The “Experimental setup” section describes the dataset, preprocessing pipeline, and evaluation metrics. The “Results and discussion” section reports the experimental findings and provides comparative analysis. Finally, the “Conclusion” section concludes the paper and outlines future research directions.

Table 4 Comparison of model limitations and MM-MSCA-AF improvements.

Methodology

The MM-MSCA-AF model is designed to improve brain tumor segmentation by leveraging multi-modal MRI inputs, multi-scale feature extraction, and attention-based feature refinement. This section details the model architecture, highlighting its multi-modal input processing, MSCA, GAF, and decoding strategy.

Model architecture

An encoder-decoder design is used by the MM-MSCA-AF model to effectively segment brain tumors from multi-modal MRI components, such as T1, T2, FLAIR, and T1-CE. First, it extracts hierarchical features from the original dataset to obtain crucial structural details. Second, it enhances the precision of tumor boundary identification by combining contextual data at multiple spatial scales. Third, redundant information is suppressed, and feature representations are refined through the use of attention mechanisms. Ultimately, each pixel is classified into one of four tumor classes by the decoder, which generates high-resolution segmentation maps. Figure 1 illustrates the broader concept, which combines attention fusion and multi-scale feature extraction. Figure 2 presents the MM-MSCA-AF model representation, including multi-modal input processing, MSCA, GAF, and the final segmentation output. This architecture is designed to capture both global and fine-grained tumor features for precise segmentation. For clarity, Fig. 3 provides the MM-MSCA-AF network legend, which explains the role of each model component.

Fig. 2

MM-MSCA-AF model representation: detailed architecture of the proposed MM-MSCA-AF model, illustrating its multi-modal input processing, MSCA, GAF, and final segmentation output. Each module is designed to capture both global and fine-grained tumor features for precise segmentation.

Fig. 3

MM-MSCA-AF model legend.

Multi-modal input processing

Brain tumors exhibit high variability across MRI modalities, requiring multi-modal fusion to capture diverse tissue contrasts. MM-MSCA-AF processes T1, T2, FLAIR, and T1-CE separately through dedicated convolutional feature extractors before aggregating their information. The input set is represented as:

$$X_{\text {input}} = \{X_{T1}, X_{T2}, X_{\text {FLAIR}}, X_{\text {T1-CE}}\},$$

where \(X_{T1}\), \(X_{T2}\), \(X_{\text {FLAIR}}\), and \(X_{\text {T1-CE}}\) represent MRI modalities emphasizing different tumor structures. This separation preserves modality-specific features, ensuring more effective feature aggregation in later processing stages.
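
A minimal PyTorch sketch of this per-modality processing is shown below; the single-convolution stems and channel width are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ModalityEncoders(nn.Module):
    """One lightweight convolutional stem per MRI sequence; the resulting
    modality-specific feature maps are concatenated for later aggregation."""
    def __init__(self, feat=16):
        super().__init__()
        self.stems = nn.ModuleDict({
            m: nn.Sequential(nn.Conv2d(1, feat, 3, padding=1), nn.ReLU())
            for m in ("t1", "t2", "flair", "t1ce")})

    def forward(self, x):
        # x: dict mapping modality name -> (B, 1, H, W) tensor
        return torch.cat([self.stems[m](x[m]) for m in self.stems], dim=1)
```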

Multi-scale contextual aggregation

Tumors vary greatly in size and shape, and segmenting such patterns efficiently is a challenge for fixed-resolution CNNs. This restriction is addressed by the MSCA module, which improves the reliability of segmentation by collecting both fine-grained features and broader contextual layouts. The MSCA module aggregates features at multiple spatial levels using a three-tiered pooling scheme. By downsizing the feature maps, global pooling obtains extensive contextual information that aids in comprehending the general layout of the tumor. Medium-scale pooling targets the intermediate features that are essential for preserving stable tumor boundaries. Lastly, small-scale pooling ensures superior segmentation results by improving the model's ability to identify small and localized tumor components, including necrotic and edematous regions. The multi-scale feature maps are concatenated to form a unified representation, given by: \(Y_{\text {MSCA}} = \text {Concat}(P_1(F_i), P_2(F_i), \dots , P_n(F_i)),\) where \(P_n(F_i)\) represents feature maps pooled at different scales. This reduces spatial resolution loss while improving tumor region recognition.
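
A sketch of this three-tiered pooling in PyTorch follows; the pooled grid sizes (1, 4, and 16) and the 1×1 fusion convolution are assumptions, since the paper specifies the scheme only at the level of the \(Y_{\text {MSCA}}\) equation above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSCA(nn.Module):
    """Multi-scale contextual aggregation: pool F_i to several grids
    (global, medium, fine), upsample back, and concatenate the P_n(F_i)."""
    def __init__(self, ch, scales=(1, 4, 16)):
        super().__init__()
        self.scales = scales
        self.proj = nn.Conv2d(ch * (len(scales) + 1), ch, 1)  # fuse channels

    def forward(self, f):
        h, w = f.shape[-2:]
        branches = [f] + [
            F.interpolate(F.adaptive_avg_pool2d(f, s), size=(h, w),
                          mode="bilinear", align_corners=False)
            for s in self.scales]
        return self.proj(torch.cat(branches, dim=1))
```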

Gated attention fusion

Conventional CNN-based segmentation methods often fail to differentiate between background noise and tumor structures. The use of GAF helps MM-MSCA-AF suppress unnecessary features while enhancing tumor-relevant ones. GAF calculates the attention gates as follows: \(\alpha = \sigma (W^T[F_{\text {encoder}}, F_{\text {decoder}}] + b)\), where \(F_{\text {encoder}}\) and \(F_{\text {decoder}}\) are the feature maps from the encoder and decoder, W is a learnable weight matrix, \(\sigma\) is the sigmoid activation function, and b is the bias term. The refined feature output is: \(F_{\text {output}} = \alpha \cdot F_{\text {decoder}} + (1 - \alpha ) \cdot F_{\text {encoder}}\). By continuously adjusting feature relevance, GAF reduces false positives and improves segmentation accuracy, enabling the model to concentrate on tumor features while down-weighting high-intensity non-tumor regions.
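
Read literally, the two equations above translate into a few lines of PyTorch; since the paper does not state whether \(W\) is a dense or convolutional operator, a 1×1 convolution is assumed here.

```python
import torch
import torch.nn as nn

class GatedAttentionFusion(nn.Module):
    """Sketch of GAF: alpha = sigmoid(W^T [F_enc, F_dec] + b), then
    F_out = alpha * F_dec + (1 - alpha) * F_enc."""
    def __init__(self, ch):
        super().__init__()
        self.gate = nn.Conv2d(2 * ch, ch, kernel_size=1, bias=True)  # W and b

    def forward(self, f_enc, f_dec):
        alpha = torch.sigmoid(self.gate(torch.cat([f_enc, f_dec], dim=1)))
        return alpha * f_dec + (1 - alpha) * f_enc
```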

Decoder and segmentation output

The decoder module reconstructs high-resolution segmentation masks by progressively refining feature maps through up-sampling layers and skip connections. The final segmentation map is generated using a softmax activation function, which assigns probabilities for each tumor class:

$$\hat{Y} = \text {Softmax}(W \cdot F_{\text {output}} + b),$$

where \(W\) represents trainable parameters, and \(\hat{Y}\) is the predicted segmentation map with pixel-wise classification into non-tumor regions, necrotic core, enhancing tumor, and edema.
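
As a sketch, this classification head reduces to a 1×1 convolution followed by a channel-wise softmax (the input width of 32 is an assumption); in practice the softmax is usually folded into the cross-entropy loss during training and applied explicitly only at inference.

```python
import torch.nn as nn

# Pixel-wise head implementing softmax(W * F_output + b): a 1x1 convolution
# maps the fused features to the four classes (non-tumor, necrotic core,
# enhancing tumor, edema).
head = nn.Sequential(nn.Conv2d(32, 4, kernel_size=1),  # W and b
                     nn.Softmax(dim=1))                # per-pixel probabilities
```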

Computational trade-offs and efficiency considerations

The integration of multi-scale pooling and attention fusion increases memory and GPU usage. However, these additional computations improve segmentation accuracy, making the trade-off worthwhile for clinical applications. \(\text {Total FLOPs} = O(N \cdot M_{\text {encoder}} + M_{\text {decoder}} + M_{\text {attention}}),\) where \(M_{\text {encoder}}\) and \(M_{\text {decoder}}\) represent encoder-decoder operations, and \(M_{\text {attention}}\) accounts for attention-based refinement46. Due to its depth, MM-MSCA-AF requires approximately 48 hours of training on an NVIDIA RTX 3090 GPU, with early stopping after 200 epochs. Despite higher memory requirements, MM-MSCA-AF achieves near-real-time performance, with an average segmentation time of 1.2 seconds per 3D MRI scan.

Loss function and optimization

The model is optimized using the categorical cross-entropy loss function, defined as:

$$L = -\sum _{c=1}^{C} Y_{\text {true}}^{(c)} \log Y_{\text {pred}}^{(c)}$$

where \(Y_{\text {true}}^{(c)}\) and \(Y_{\text {pred}}^{(c)}\) represent the true and predicted probabilities for class \(c\), respectively. MM-MSCA-AF is trained using the Adam optimizer with an initial learning rate of 0.001, chosen for its adaptive gradient adjustments, ensuring faster convergence without overshooting. Table 5 summarizes the components used in MM-MSCA-AF, their functions, and the resulting improvements.
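
A minimal training step under this setup is sketched below; PyTorch's CrossEntropyLoss applies log-softmax internally, so the network emits raw logits, and the single-convolution model is only a stand-in for MM-MSCA-AF.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(4, 4, 3, padding=1)   # placeholder for MM-MSCA-AF
criterion = nn.CrossEntropyLoss()       # categorical cross-entropy over classes
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

logits = model(torch.randn(2, 4, 128, 128))   # (B, C, H, W) class scores
target = torch.randint(0, 4, (2, 128, 128))   # pixel-wise ground-truth labels

optimizer.zero_grad()
loss = criterion(logits, target)
loss.backward()
optimizer.step()
```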

Table 5 Components and their functions with improvements.

Experimental setup

This section details the experimental setup, including dataset characteristics, preprocessing steps, computational environment, hyperparameter selection, data handling techniques, and evaluation metrics used to assess model performance.

Dataset description

The proposed MM-MSCA-AF model is evaluated on the Brain Tumor Segmentation (BRATS 2020) dataset, a widely used benchmark for brain tumor segmentation tasks32. The dataset consists of multi-modal MRI scans from 369 patients, with four MRI sequences available for each subject: a T1-weighted (T1) scan that highlights normal brain structures, a T1 contrast-enhanced (T1-CE) scan that enhances tumor boundaries, a T2-weighted (T2) scan that shows fluid accumulation around the tumor, and a Fluid-Attenuated Inversion Recovery (FLAIR) scan that detects edema and tumor infiltration. Each scan is manually annotated by expert radiologists into four classes: non-tumor regions, necrotic tumor core, enhancing tumor, and edema. The dataset is split into \(70\%\) training, \(20\%\) validation, and \(10\%\) testing to ensure model generalization across unseen cases.

Preprocessing pipeline

To normalize MRI scans across modalities, several preprocessing steps were undertaken. All images were resampled to a common voxel size of 1 mm³ to ensure spatial consistency within the dataset. Each MRI sequence was independently normalized so that pixel intensities followed a zero-mean, unit-variance distribution. Image quality was improved by Gaussian filtering, which removed high-frequency noise without distorting crucial structural information. Finally, histogram equalization was used to enhance contrast and normalize intensity distributions across all scans. After preprocessing, the multi-modal inputs were fed to the MM-MSCA-AF model, which was then able to effectively extract modality-specific tumor features.
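
A hedged sketch of the per-sequence steps is given below for one 3D volume; resampling to 1 mm³ is assumed to have been done beforehand (e.g., with SimpleITK) and is omitted, and the Gaussian sigma is an assumption.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from skimage import exposure

def preprocess(vol: np.ndarray) -> np.ndarray:
    vol = (vol - vol.mean()) / (vol.std() + 1e-8)   # zero mean, unit variance
    vol = gaussian_filter(vol, sigma=0.5)           # suppress high-frequency noise
    vol = exposure.equalize_hist(vol)               # histogram equalization
    return vol.astype(np.float32)
```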

Computational environment

All experiments were conducted on a high-end workstation with an Intel Core i9-10900K processor, 64 GB RAM, and an NVIDIA RTX 3090 GPU with 24 GB VRAM. The models were implemented in PyTorch. To ensure complete model convergence and performance analysis, the algorithms were trained for about 48 hours overall. Due to the multi-scale, multi-modal nature of MM-MSCA-AF, memory usage reached up to 18 GB during training. Table 6 presents the hyperparameters used in training the MM-MSCA-AF model.

Table 6 Hyperparameters and their values.

Handling class imbalance

The BRATS 2020 dataset suffers from class imbalance, particularly in small tumor areas such as the necrotic core. Several approaches were used to address this problem. First, a weighted categorical cross-entropy loss function was employed, assigning under-represented classes weights inversely proportional to their frequencies. Second, data augmentation techniques, such as flips, random rotations, and intensity variations, were applied to improve the diversity of training samples. Finally, oversampling was performed during batch generation to improve the representation of minority-class samples, such as the necrotic core, and the model's capacity to learn from these valuable regions.
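
The weighting and oversampling steps can be sketched as follows; the voxel counts and sampling weights are placeholder assumptions, and the sampler would be passed to the training DataLoader.

```python
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import WeightedRandomSampler

# Class weights inversely proportional to voxel frequency (counts are
# assumed): rare classes such as the necrotic core receive larger weights.
voxel_counts = np.array([9.0e8, 1.2e6, 3.5e6, 8.0e6])
class_w = voxel_counts.sum() / (len(voxel_counts) * voxel_counts)
criterion = nn.CrossEntropyLoss(weight=torch.tensor(class_w, dtype=torch.float32))

# Oversampling during batch generation: slices containing the minority
# class are drawn more often (`has_necrotic` is an assumed boolean array).
has_necrotic = np.random.rand(1000) < 0.1
sample_w = np.where(has_necrotic, 5.0, 1.0)
sampler = WeightedRandomSampler(torch.tensor(sample_w), num_samples=1000)
```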

Cross-validation strategy

To improve model generalization and robustness, 5-fold cross-validation was used. This ensures that MM-MSCA-AF’s performance is not biased by dataset splits and can adapt to different patient cases.

Training details

Hyperparameter tuning was performed using grid search across several key parameters. Learning rates were selected from the range \(\{1\text {e}^{-4}, 1\text {e}^{-3}, 1\text {e}^{-2}\}\), batch sizes from \(\{8, 16, 32\}\), and optimizers from standard variants such as Adam, AdamW, and SGD. The best configuration (learning rate = \(1\text {e}^{-3}\), batch size = 16, optimizer = AdamW) was determined based on the lowest validation loss observed over 5-fold cross-validation. All runs used early stopping and learning rate schedulers to avoid overfitting and ensure convergence stability.
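
The search loop reduces to the skeleton below; train_and_validate is a hypothetical stub standing in for a full training run, and the fold split operates over subject indices so that no patient appears in both training and validation.

```python
import itertools
import numpy as np
from sklearn.model_selection import KFold

def train_and_validate(train_ids, val_ids, lr, batch_size, optimizer):
    """Hypothetical stand-in for a full training run; returns validation loss."""
    return np.random.rand()   # replace with actual training and evaluation

case_ids = np.arange(369)     # BRATS 2020 subject indices
best_cfg, best_loss = None, float("inf")
for lr, bs, opt in itertools.product([1e-4, 1e-3, 1e-2], [8, 16, 32],
                                     ["adam", "adamw", "sgd"]):
    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    losses = [train_and_validate(case_ids[tr], case_ids[va], lr, bs, opt)
              for tr, va in kf.split(case_ids)]
    if np.mean(losses) < best_loss:
        best_cfg, best_loss = (lr, bs, opt), np.mean(losses)
```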

Performance metrics

The model's segmentation performance was assessed using six standard metrics, focusing on global and class-specific accuracy. Table 7 lists the performance metrics along with their evaluation criteria.

Categorical cross-entropy loss

Measures pixel-wise classification error:

$$L = -\sum _{c=1}^{C} Y_{\text {true}}^{(c)} \log Y_{\text {pred}}^{(c)}$$

Lower values indicate better model performance.

Mean intersection over union (IoU)

Evaluates the overlap between predicted and ground-truth regions:

$$\text {Mean IoU} = \frac{1}{C} \sum _{c=1}^{C} \frac{|A_c \cap B_c|}{|A_c \cup B_c|}$$

Higher values indicate better segmentation accuracy.

Dice similarity coefficient (DSC)

Measures segmentation overlap, emphasizing small regions:

$$\text {Dice} = \frac{2|A \cap B|}{|A| + |B|}$$

Higher values (close to 1.0) indicate better segmentation quality.

Precision (positive predictive value)

Determines how many predicted tumor pixels are tumors:

$$\text {Precision} = \frac{TP}{TP + FP}$$

Higher precision means fewer false positives.

Sensitivity (recall)

Measures how many actual tumor pixels were correctly identified:

$$\text {Sensitivity} = \frac{TP}{TP + FN}$$

Higher sensitivity ensures low false-negative rates.

Specificity

Evaluates how well non-tumor pixels are classified:

$$\text {Specificity} = \frac{TN}{TN + FP}$$

Higher specificity reduces false-positive rates.
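
All of the region-level quantities above can be computed from the same confusion counts; the sketch below mirrors the formulas for a single class of a label map (the small epsilon guards against empty classes).

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, gt: np.ndarray, cls: int) -> dict:
    """Per-class Dice, IoU, precision, sensitivity, and specificity."""
    p, g = (pred == cls), (gt == cls)
    tp = np.logical_and(p, g).sum()
    fp = np.logical_and(p, ~g).sum()
    fn = np.logical_and(~p, g).sum()
    tn = np.logical_and(~p, ~g).sum()
    eps = 1e-8
    return {
        "dice":        2 * tp / (2 * tp + fp + fn + eps),
        "iou":         tp / (tp + fp + fn + eps),
        "precision":   tp / (tp + fp + eps),
        "sensitivity": tp / (tp + fn + eps),
        "specificity": tn / (tn + fp + eps),
    }
```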

Table 7 Evaluation metrics and performance criteria.

Computational efficiency

Despite achieving competitive segmentation performance, the MM-MSCA-AF model is designed to maintain a lower computational footprint compared to recent state-of-the-art models. Compared to TransBTS, which employs multi-head self-attention within a Transformer bottleneck and incurs high memory overhead due to the quadratic complexity of attention operations, MM-MSCA-AF relies on hierarchical pooling and convolutional aggregation through the Multi-Scale Contextual Aggregation (MSCA) module. This significantly reduces parameter count and GPU memory usage while preserving the ability to model global context. Similarly, while nnU-Net dynamically adapts preprocessing and training configurations, it lacks a mechanism for explicit spatial-semantic attention, resulting in suboptimal trade-offs between accuracy and computational cost. MM-MSCA-AF's Gated Attention Fusion (GAF) module selectively focuses on relevant features across modalities, eliminating redundancy without introducing computationally expensive attention maps over full-resolution feature maps. Compared to hybrid architectures like EFFResNet-ViT, which combine CNNs and Transformers and suffer from increased inference time (often >3s per scan), MM-MSCA-AF achieves inference in approximately 1.2 seconds per 3D scan on an RTX 3090 GPU with fewer parameters. This balance of accuracy and efficiency makes MM-MSCA-AF a more tractable choice for clinical settings, where inference speed and resource usage are as critical as segmentation accuracy. Table 8 compares the complexity of different models with that of MM-MSCA-AF, showing that MM-MSCA-AF has the lowest computational complexity.

Table 8 Model complexity and inference performance.

Results and discussion

The performance of MM-MSCA-AF was compared against state-of-the-art segmentation models on the BRATS 2020 dataset and was found to be superior. The comparison models included HCA, nnU-Net, UNet++, SegNet, and Attention U-Net. The performance was assessed using categorical cross-entropy loss, accuracy, Mean IoU, Dice score, precision, sensitivity, and specificity. The experimental results show that MM-MSCA-AF consistently outperforms the existing models, achieving the highest segmentation accuracy across different tumor regions.

Table 9 presents the comparative performance of MM-MSCA-AF against the baseline methods. The proposed model achieved a Mean IoU of 0.8589, a Dice score of 0.8158, and a Dice score of 0.7982 for necrotic tumor regions, demonstrating superior segmentation accuracy. Compared to nnU-Net and Attention U-Net, MM-MSCA-AF attains higher precision, sensitivity, and specificity, confirming its superiority in detecting tumor edges and subregions.

Table 9 Model performance comparison on BRATS 2020.

These results indicate that MM-MSCA-AF exceeds the existing architectures on Mean IoU and Dice score, particularly in complex tumor regions such as necrotic cores and enhancing tumors. The increased accuracy of segmentation arises from the multi-scale contextual aggregation and gated attention fusion mechanisms, which are capable of efficiently learning global and fine-grained features of tumors. Figure 4 shows results obtained from MM-MSCA-AF from the BRATS Dataset.

Fig. 4

Predicted tumor for all classes. Examples of predicted brain tumor segmentations generated by the MM-MSCA-AF model for various classes, including necrotic, edema, and enhancing tumor regions. Ground truth annotations and corresponding model predictions are shown for comparison.

Computational efficiency

The proposed model achieves high segmentation accuracy while maintaining computational feasibility. The training process required approximately 48 hours on an NVIDIA RTX 3090 GPU, with a peak memory usage of 18 GB, which remains manageable with high-performance computing resources. During inference, MM-MSCA-AF processes a 3D MRI scan in approximately 1.2 seconds, making it suitable for near-real-time clinical deployment. The integration of MSCA and GAF increases computational complexity, but the benefits in segmentation accuracy outweigh the additional cost. Compared to transformer-based architectures such as TransBTS, MM-MSCA-AF achieves similar accuracy with significantly lower computational overhead, making it a more efficient solution for medical image segmentation.

The results demonstrate that MM-MSCA-AF significantly outperforms conventional and state-of-the-art segmentation models by effectively integrating multi-modal MRI inputs, multi-scale contextual learning, and attention fusion mechanisms. This section discusses the key factors contributing to MM-MSCA-AF’s superior performance, its clinical implications, and the limitations and future directions for improvement.

Factors contributing to MM-MSCA-AF’s performance

Traditional CNN architectures, including U-Net and nnU-Net, operate at a fixed spatial resolution, limiting their ability to capture tumors of varying sizes. MM-MSCA-AF employs hierarchical multi-scale feature extraction, enabling the model to capture both global tumor context and fine-grained boundary details, leading to improved segmentation accuracy. Unlike standard attention mechanisms that apply uniform attention weights, MM-MSCA-AF dynamically adjusts feature importance across scales and modalities. This selective refinement significantly reduces false positives while ensuring robust feature representation, leading to improved precision and specificity. Baseline models such as nnU-Net and SegNet process single-modality MRI data, limiting their ability to leverage complementary tissue contrasts. MM-MSCA-AF effectively integrates T1, T2, FLAIR, and T1-CE sequences, enhancing segmentation performance across different tumor subregions.

Impact of MSCA on tumor size and shape variability

The MSCA module plays an important role in improving segmentation performance across tumor regions with diverse morphological characteristics. By incorporating hierarchical feature extraction at three spatial levels (global, medium, and fine), the model can accurately capture both the coarse structure of large edema regions and the intricate boundaries of smaller necrotic cores. This multi-level pooling strategy enhances the model's ability to resolve features at varying resolutions, leading to more robust and generalized segmentation across patient cases.

Advantages of GAF over standard attention mechanisms

Unlike conventional attention gates that apply uniform spatial weighting, the Gated Attention Fusion (GAF) module dynamically learns attention weights across both spatial dimensions and MRI modalities. This adaptiveness allows GAF to emphasize modality-specific tumor features while suppressing irrelevant background information. Consequently, MM-MSCA-AF demonstrates reduced false positive rates and enhanced focus on tumor-relevant regions, as evidenced by improved precision and specificity.

Addressing class imbalance in tumor subregions

Brain tumor datasets suffer from class imbalance, particularly for minority classes like necrotic tumor cores. To mitigate this, a combination of techniques was applied:

  • A class-weighted categorical cross-entropy loss was used, assigning higher weights to underrepresented classes.

  • Oversampling strategies ensured more frequent inclusion of necrotic and enhancing tumor slices during batch formation.

  • Data augmentation, including flipping, rotation, and intensity variation, was applied with a focus on minority classes to improve their representation and reduce overfitting.

These measures significantly enhanced the model’s ability to learn discriminative features across all tumor regions.

Ablation study

To evaluate the individual impact of the Multi-Scale Contextual Aggregation (MSCA) and Gated Attention Fusion (GAF) modules in the proposed MM-MSCA-AF framework, we conducted a comprehensive ablation study. Four model configurations were compared on the BRATS 2020 dataset under identical training and evaluation settings:

  • Full MM-MSCA-AF: incorporates both MSCA and GAF modules. This complete configuration leverages multi-scale spatial feature extraction and adaptive attention-driven fusion, enabling the model to capture both coarse and fine tumor structures while reducing false positives through refined feature selection.

  • Without MSCA (only GAF): excludes the MSCA module, retaining only the attention mechanism. Without multi-scale pooling, the model lacks global spatial context, which reduces its ability to accurately delineate tumors of varying sizes, especially small and irregular shapes. However, GAF still suppresses irrelevant features and enhances salient ones, partially compensating for the loss of spatial diversity.

  • Without GAF (only MSCA): excludes the attention mechanism, retaining multi-scale aggregation. While MSCA helps capture tumor morphology at different resolutions, the absence of GAF limits the model’s ability to selectively focus on tumor-relevant modalities and features, leading to increased false positives and reduced boundary sharpness.

  • Baseline (no MSCA, no GAF): removes both enhancements, using only a standard encoder-decoder backbone. This version lacks both spatial adaptability and attention-based refinement, resulting in poor generalization to heterogeneous tumor structures and reduced segmentation accuracy across all metrics.

The ablation study results are presented in Table 10.

Table 10 Ablation study showing the performance impact of MSCA and GAF modules on the BRATS 2020 dataset.

These results demonstrate that both MSCA and GAF play critical roles in enhancing segmentation performance. MSCA improves spatial representation by combining features at different scales, crucial for modeling tumor heterogeneity, while GAF dynamically modulates feature importance across modalities, leading to better discrimination of tumor boundaries and reduced misclassification. The best performance is consistently achieved when both modules are used in conjunction.

Generalization across BRATS datasets (2017–2020)

To evaluate the generalization capability of MM-MSCA-AF, we conducted experiments on four consecutive BRATS datasets: BRATS 2017, 2018, 2019, and 2020. These datasets differ in terms of annotation protocols, patient cohorts, and acquisition conditions, making them a suitable benchmark for assessing cross-year robustness. Table 11 presents the quantitative results obtained using MM-MSCA-AF across these multiple datasets. The model consistently achieved high Dice scores, IoU, and precision across all datasets, with minimal variation in performance. Notably, the Dice Score (necrotic) remained above 0.78 for all years, indicating stable performance even on challenging subregions. This confirms the ability of MM-MSCA-AF to generalize effectively across data from different years and clinical institutions, supporting its potential for real-world clinical adoption.

Table 11 Generalization performance of MM-MSCA-AF on BRATS 2017–2020 datasets.

Clinical implications

The high sensitivity (i.e., 0.9349) and specificity (i.e., 0.9656) of MM-MSCA-AF indicate its potential for clinical deployment in brain tumor diagnosis and treatment planning. Accurate segmentation of necrotic, enhancing, and edema regions is critical for determining tumor progression and therapeutic strategies. The ability to precisely delineate tumor boundaries improves radiological assessments and surgical planning, reducing inter-observer variability in manual segmentation tasks.

Although Hausdorff Distance at 95th Percentile (HD95) and Precision-Recall (PR) curves are not included in baseline model comparisons due to the unavailability of these metrics in their original publications, we recognize their clinical relevance. Therefore, to maintain transparency, we explicitly note this limitation in our comparative analysis. To support interpretability for clinical readers, we include visualizations of PR curves and HD95 scores for MM-MSCA-AF alone in the supplementary materials. These additional results provide insight into the boundary precision and classification thresholds of our model, which are particularly useful in clinical decision-making contexts such as surgical planning and lesion monitoring.

Fig. 5

PR curve over the epochs.

The PR curve shows class-wise segmentation performance across epochs. Precision and sensitivity steadily increase, reaching values of approximately 0.935 and 0.969, respectively, by epoch 34. The overall trend reflects progressive model learning, with increasing recall and stable precision across tumor subregions. Figure 5 illustrates the PR curve over the epochs for MM-MSCA-AF.

Fig. 6

HD95 distribution per tumor region.

The HD95 metric evaluates boundary-based segmentation accuracy. For the enhancing tumor (ET), the mean HD95 is 26.5 mm, suggesting a wider variation in boundary alignment. In contrast, the whole tumor (WT) and tumor core (TC) show smaller mean HD95 values (9.75 mm and 24.5 mm, respectively), reflecting better spatial consistency in these regions. These results underscore the method’s stronger performance in segmenting the core and whole tumor compared to the enhancing tumor, which exhibits greater variability. Figure 6 shows the HD95 computed across multiple tumor regions.
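
For reference, HD95 can be computed from symmetric surface distances as sketched below; this is a generic SciPy-based formulation under the assumption that both masks are non-empty, not the exact evaluation code used here.

```python
import numpy as np
from scipy import ndimage

def hd95(pred: np.ndarray, gt: np.ndarray, spacing=(1.0, 1.0, 1.0)) -> float:
    """95th percentile of symmetric surface distances between two binary masks."""
    def surface(mask):
        return mask & ~ndimage.binary_erosion(mask)   # boundary voxels

    sp, sg = surface(pred.astype(bool)), surface(gt.astype(bool))
    # Distance from each voxel to the nearest surface voxel of the other mask.
    dt_g = ndimage.distance_transform_edt(~sg, sampling=spacing)
    dt_p = ndimage.distance_transform_edt(~sp, sampling=spacing)
    dists = np.concatenate([dt_g[sp], dt_p[sg]])      # both directions
    return float(np.percentile(dists, 95))
```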

Future work

To further improve model adaptability and reduce reliance on labeled data, we propose integrating self-supervised learning (SSL) into the MM-MSCA-AF pipeline. In particular, a contrastive pretraining scheme using unlabelled MRIs can be applied to initialize the encoder with robust spatial and modality-invariant features. Additionally, few-shot learning (FSL) strategies such as prototypical networks or meta-learning modules can be adapted to fine-tune MM-MSCA-AF on smaller, institution-specific datasets. Preliminary tests show a marginal \(3-5\%\) drop in accuracy when training on just \(10\%\) of annotated data, indicating strong potential for label-efficient adaptation.

Conclusion

This paper introduced MM-MSCA-AF, a novel multi-modal, multi-scale model for segmenting brain tumors from MRI images. The model integrated MSCA and GAF to extract and amplify features from various MRI modalities, achieving accurate demarcation of tumor sub-regions such as the necrotic, enhancing, and edema regions. The results on the BRATS 2020 dataset showed that MM-MSCA-AF was superior to traditional deep learning methods on the key performance indicators of Mean Intersection over Union and Dice score, confirming its potential for dealing with tumor heterogeneity and boundary complexity. The outcome revealed that effective integration of hierarchical multi-scale feature extraction enabled the model to access global context along with high-resolution information. The gated attention fusion technique selectively enhanced tumor-related features and suppressed non-pertinent information, resulting in accurate segmentation. Despite these enhancements, the model remained computationally tractable, with fast inference times suitable for near-real-time clinical applications. This work will be extended by integrating self-supervised and few-shot learning approaches that can generalize when little annotated medical data is available, making the model more suitable for low-resource settings.