Introduction

Brain tumors are among the most devastating diseases of the central nervous system, characterized by high incidence and mortality rates worldwide1. According to the World Health Organization (WHO), brain tumors are classified into benign and malignant categories2. Among them, malignant gliomas, particularly glioblastoma (GBM), represent the most aggressive subtype due to their invasive growth, high recurrence rate, and poor prognosis3. Consequently, early detection, accurate diagnosis, and timely treatment are essential to improving patient survival and therapeutic outcomes4. Magnetic resonance imaging (MRI) serves as the primary modality for brain tumor detection owing to its superior soft-tissue contrast and absence of ionizing radiation5. Traditional diagnostic practice relies heavily on radiologists’ manual inspection of MRI scans, which is not only subjective and time-consuming but also prone to diagnostic inconsistency. Although conventional computer-aided diagnosis (CAD) systems incorporating handcrafted features and classical machine learning algorithms6–9 have been explored, their limited feature representation capability constrains diagnostic performance10. Recent breakthroughs in deep learning (DL), particularly convolutional neural networks (CNNs), have revolutionized medical image analysis by enabling automated hierarchical feature extraction11,12, thereby substantially enhancing recognition accuracy13.

Despite significant progress, current DL-based brain tumor detection approaches still face notable limitations. For instance, Shawon et al.14 addressed data imbalance but suffered from limited model interpretability; Disci et al.15 achieved four-class classification yet exhibited low recall for certain tumor types; Ishaq et al.16 optimized EfficientNet but lacked cross-modality validation; Tariq et al.17 combined EfficientNetV2 with Vision Transformers (ViT) but incurred high computational complexity; Arshad et al.18,19 proposed frameworks that faced regional generalization challenges; Mao et al.20 employed DenseNet-121 but did not exploit 3D or multimodal MRI fusion; and Sima et al.21 developed InFeNet, which risked overfitting on small datasets.

In summary, existing algorithms face three key challenges: (1) difficulty capturing fine-grained local and medium-range features in complex tumor tissues; (2) high computational cost and training instability of traditional normalization layers; (3) suboptimal fusion of high-level and low-level features, crucial for detecting small or blurred-boundary lesions. To overcome these limitations, this paper proposes an enhanced YOLOv12n-based detection framework featuring the following key innovations:

  1. A2C2f-Mona Module: A parallel architecture employing multi-scale depthwise convolution and residual connections to improve local and mid-range feature extraction, thereby enhancing feature diversity and representational robustness.

  2. C2PSA-DyT Module: A lightweight normalization substitution strategy based on element-wise operations that reduces computational overhead while improving training stability and deep feature consistency.

  3. CGAFusion Module: A channel-attention-driven fusion mechanism that adaptively integrates low- and high-level features, effectively enhancing the detection of small tumors and lesions with blurred boundaries.

Related work

Normalization layer evolution

Normalization techniques have evolved from Batch Normalization (BN)22 to various improved methods and eventually partial “de-normalization.” BN standardizes features along the mini-batch dimension but performs poorly with small batches. Layer Normalization (LN)23 normalizes across features but is less effective in CNNs. Instance Normalization (IN)24 and Group Normalization (GN)25 offer alternatives for specific scenarios. Switchable Normalization (SN)26 and Batch-Instance Normalization (BIN)27 dynamically balance multiple strategies. With the rise of large-scale models, RMSNorm28 improved efficiency through simplification. Recent methods like EvoNorm29 and NFNets30 further enhanced stability or removed normalization entirely.

Attention mechanism

Attention mechanisms have expanded from natural language processing to computer vision and other domains. In vision, Vision Transformer31 models global dependencies by dividing images into patches, while Swin Transformer32 introduces shifted windows for efficient computation. In time-series modeling, Autoformer33 and Informer34 capture long-range dependencies efficiently. Graph Attention Networks (GAT) and GATv2 incorporate node-level attention for graph-structured data. These developments provide valuable insights for attention-based feature enhancement in medical image analysis35,36.

YOLOv12 algorithm

YOLOv1237 is an attention-centered real-time object detector that employs R-ELAN as its backbone together with regional and hybrid attention mechanisms. Its streamlined detection head consists of bounding-box prediction and class prediction branches. The model uses an improved CIoU loss with a shape-similarity penalty, Focal-EIoU loss for class imbalance, and an attention-guided loss for optimization. A multi-stage training strategy freezes the backbone initially before end-to-end fine-tuning. YOLOv12 achieves outstanding performance on the COCO dataset with high precision, recall, and inference speed, making it suitable for various applications, including medical imaging.

Method design

This study improves the YOLOv12n model for medical image analysis by addressing three key deficiencies. First, the A2C2f-Mona module enhances multi-scale feature extraction using parallel multi-scale depthwise convolutions and residual connections to capture local details and mid-range context more effectively. Second, the C2PSA-DyT module replaces traditional normalization with element-wise operations, stabilizing feature distribution and improving training stability. Third, the CGAFusion module employs a channel attention mechanism to adaptively fuse high- and low-level features, significantly improving detection of blurred boundaries and small lesions. The collaborative integration of these modules enhances the model’s overall detection accuracy and reliability. The improved YOLOv12n network structure is shown in Fig. 1.

Fig. 1

Network structure diagram of the improved YOLOv12n.

A2C2f-Mona module

The original YOLOv12n backbone struggles to capture multi-scale contextual information, critical for detecting brain tumors of diverse shapes and sizes. To address this, we introduce the A2C2f-Mona module. Its core component is the Mona (Multi-Cognitive Visual Adapter) block, inspired by Yin et al.38, which employs parallel convolutional kernels of different sizes to capture features across multiple receptive fields simultaneously. This architecture enriches feature diversity and robustness, substantially improving detection performance, particularly for tumors exhibiting subtle morphological variations.

As shown in Fig. 2, the overall workflow of the A2C2f-Mona module proceeds as follows. The input first passes through a convolutional layer for preliminary feature extraction:

$$X^{\prime}=Conv(X).$$
(1)

The extracted features are then fed into the core Mona block, where multiple parallel branches of deep convolutions with different kernel sizes run simultaneously to capture local and mid-range contextual information.

$${F_3}=DWCon{v_{3 \times 3}}(X^{\prime}),$$
(2)
$${F_5}=DWCon{v_{5 \times 5}}(X^{\prime}),$$
(3)
$${F_7}=DWCon{v_{7 \times 7}}(X^{\prime}),$$
(4)

An average pooling operation is then performed over the branch outputs:

$${F_{pool}}=AvgPool({F_3},{F_5},{F_7}),$$
(5)

The outputs of these branches are then aggregated and a 1 × 1 convolution is applied to fuse multi-scale features while controlling the channel dimensions. To ensure stable information propagation, residual connections are introduced between the input and output of the module:

$${X_{output}}=Con{v_{1 \times 1}}(GeLU(Con{v_{1 \times 1}}(Concat(X^{\prime},{F_{pool}})))+X^{\prime})+X,$$
(6)

where X represents the input feature map, \(DWConv\) represents Depthwise Convolution, \(AvgPool\) represents average pooling, and \({X_{output}}\) represents the final output result.
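For clarity, a minimal PyTorch sketch of the computation in Eqs. (1)–(6) is given below. The channel dimensions, the module name, and the reading of Eq. (5) as an element-wise average of the three branch outputs are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn


class MonaMultiScaleBlock(nn.Module):
    """Minimal sketch of the multi-scale depthwise branch in Eqs. (1)-(6).

    Kernel sizes, channel counts, and the placement inside A2C2f are
    assumptions made for illustration only.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.pre_conv = nn.Conv2d(channels, channels, 1)                          # Eq. (1)
        self.dw3 = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)   # Eq. (2)
        self.dw5 = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)   # Eq. (3)
        self.dw7 = nn.Conv2d(channels, channels, 7, padding=3, groups=channels)   # Eq. (4)
        self.fuse_in = nn.Conv2d(2 * channels, channels, 1)                       # fuses Concat(X', F_pool)
        self.fuse_out = nn.Conv2d(channels, channels, 1)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_p = self.pre_conv(x)                                        # X' = Conv(X)
        # Eq. (5), interpreted here as averaging the three branch outputs element-wise.
        f_pool = (self.dw3(x_p) + self.dw5(x_p) + self.dw7(x_p)) / 3.0
        # Eq. (6): 1x1 conv -> GELU -> residual with X' -> 1x1 conv -> residual with X.
        fused = self.fuse_out(self.act(self.fuse_in(torch.cat([x_p, f_pool], dim=1))) + x_p)
        return fused + x
```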

Fig. 2

A2C2f-Mona module.

C2PSA-DyT module

Conventional normalization layers are a bottleneck for YOLOv12n in medical imaging, as they become unstable and computationally costly with small batch sizes. We propose the C2PSA-DyT module, whose core is the DyT (Dynamic Tanh) block39. This block replaces normalization with a lightweight element-wise transformation that stabilizes training, significantly reduces computational overhead, and frees network capacity for more complex modules.

As shown in Fig. 3, the overall workflow of the C2PSA-DyT module proceeds as follows. The input features first pass through a convolutional layer that performs preliminary feature extraction and adjusts the channel dimensions:

$${X_0}=Conv(X),$$
(7)

The generated features are then divided into two branches:

$${X_1},{X_2}=Split({X_0}),$$
(8)

The first branch \({X_1}\) serves as a residual path to preserve the original information, while the second branch \({X_2}\) (the main path) performs deep feature learning through the stacked \(PSABlock - DyT\) modules:

$${F_1}=PSABlock\_DyT({X_2}),{F_2}=PSABlock\_DyT({F_1}),$$
(9)

Inside each \(PSABlock\), the \(DyT\) block replaces the conventional normalization layer and employs an efficient element-wise transformation to stabilize activations:

$$Y=DyT(x)=\gamma \cdot \tanh (\alpha x)+\beta ,$$
(10)

where \(\alpha\) is a learnable scalar parameter; \(\gamma\) and \(\beta\) are learnable vector parameters used for final scaling and shifting; and \(\tanh\) is the hyperbolic tangent function, which compresses extreme values into the range [−1, 1] while processing typical values approximately linearly. Finally, the outputs from the residual path and the main path are concatenated:

$${F_{concat}}=Concat({X_1},{F_2}),$$
(11)

A convolutional layer is then applied to integrate and compress the fused features, producing the final output:

$${X_{output}}=Conv({F_{concat}}).$$
(12)
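As a reference for the element-wise transformation in Eq. (10), a minimal PyTorch sketch of the DyT block is given below; the per-channel parameter shapes and the initial value of \(\alpha\) are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class DyT(nn.Module):
    """Minimal sketch of Eq. (10): y = gamma * tanh(alpha * x) + beta."""

    def __init__(self, channels: int, init_alpha: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(init_alpha))        # learnable scalar
        self.gamma = nn.Parameter(torch.ones(1, channels, 1, 1))   # learnable scaling vector
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))   # learnable shifting vector

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Element-wise: compress extreme activations into [-1, 1], keep typical values near-linear.
        return self.gamma * torch.tanh(self.alpha * x) + self.beta
```

In the design described above, a block of this form takes the place of the conventional normalization layer inside each PSABlock.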
Fig. 3

C2PSA-DyT module.

CGAFusion module

YOLOv12n struggles to fuse high-level and low-level features, hurting accuracy for small or indistinct tumors. We introduce the CGAFusion module40, which uses adaptive, attention-guided fusion to create more precise, context-aware representations for these complex structures.

The architectural design of the CGAFusion module is illustrated in Fig. 4. First, the low-level features and high-level features are added together to achieve an initial fusion of the two feature types, as shown in Eq. (13).

$${F_{{\text{sum}}}}={F_{{\text{low}}}}+{F_{{\text{high}}}}$$
(13)

where \({F_{{\text{low}}}}\) represents the low-level features and \({F_{{\text{high}}}}\) represents the high-level features.

Next, the channel attention mechanism (CGA) is applied to compute the weight W, while \(1 - W\) is calculated as another weight to balance the features. These weights are used to measure the importance of different channels, as shown in Eq. (14).

$$W={\text{CGA}}({F_{{\text{sum}}}})$$
(14)

The weight \(W\) is applied to \({F_{{\text{low}}}}\) for weighting, and \(1 - W\) is applied to \({F_{{\text{high}}}}\) for a complementary weighting. The two weighted features are then added together to obtain the final fused feature, as shown in Eqs. (15)–(17).

$${F_{\text{1}}}=W \times {F_{{\text{low}}}}$$
(15)
$${F_{\text{2}}}=(1 - W) \times {F_{{\text{high}}}}$$
(16)
$${F_{{\text{re-fuse}}}}={F_{\text{1}}}+{F_{\text{2}}}$$
(17)

where \({F_{\text{1}}}\) and \({F_{\text{2}}}\) represent the weighted features, and \({F_{{\text{re-fuse}}}}\) denotes the re-fused feature.

Finally, a 1 × 1 convolution is applied to the fused features for further processing, either to reduce dimensionality or to integrate the features. The final output is \({F_{{\text{fuse}}}}\), as shown in Eq. (18).

$${F_{{\text{fuse}}}}={\text{Con}}{{\text{v}}_{1 \times 1}}({F_{{\text{re-fuse}}}})$$
(18)

where \({\text{Con}}{{\text{v}}_{1 \times 1}}\) denotes the 1 × 1 convolution operation, and \({F_{{\text{fuse}}}}\) represents the final fused feature.

This process enhances feature representation by combining low-level and high-level features and leveraging the channel attention mechanism, thereby improving the overall performance of the model.
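A minimal PyTorch sketch of the fusion logic in Eqs. (13)–(18) is given below. The internal structure of the CGA branch is not detailed above, so a squeeze-and-excitation style channel gate is used here as a hypothetical stand-in.

```python
import torch
import torch.nn as nn


class CGAFusionSketch(nn.Module):
    """Minimal sketch of the weighted fusion in Eqs. (13)-(18)."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Hypothetical channel-attention branch standing in for CGA, Eq. (14).
        self.cga = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        self.out_conv = nn.Conv2d(channels, channels, 1)   # Eq. (18)

    def forward(self, f_low: torch.Tensor, f_high: torch.Tensor) -> torch.Tensor:
        f_sum = f_low + f_high                              # Eq. (13)
        w = self.cga(f_sum)                                 # Eq. (14)
        f_refuse = w * f_low + (1.0 - w) * f_high           # Eqs. (15)-(17)
        return self.out_conv(f_refuse)                      # Eq. (18)
```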

Fig. 4

CGAFusion module.

Experiment

Dataset

The brain tumor detection experiments in this study were conducted using a publicly available medical imaging dataset, namely the Medical Image Dataset: Brain Tumor Detection from the Kaggle platform (access link: https://www.kaggle.com/datasets/pkdarabi/medical-image-dataset-brain-tumor-detection)41. As shown in Fig. 5, this dataset comprises MRI images of three common clinical brain tumor types: Meningioma, Pituitary Tumor, and Glioma. All samples were subjected to rigorous anonymization and preprocessing, ensuring that personally identifiable information, including patient names, ages, and medical record numbers, was completely removed. This procedure guarantees full compliance with ethical research standards and data privacy regulations, thereby ensuring the legitimacy and safety of the conducted experiments.

For model training and performance evaluation, the dataset was carefully preprocessed and systematically divided into two subsets. A total of 2451 images were used for training, serving model parameter optimization and feature learning, while 613 images were allocated for testing to objectively assess the generalization capability of the model on unseen data. The class distribution across both subsets was maintained in strict accordance with the proportions of the three tumor categories in the original dataset, effectively preventing class imbalance from biasing the experimental outcomes.
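As an illustration of how such a class-proportional split can be reproduced, the following sketch uses scikit-learn's stratified splitting; the file paths and per-class counts are hypothetical placeholders, not the actual dataset listing.

```python
from sklearn.model_selection import train_test_split

# Hypothetical lists; in practice these are read from the downloaded Kaggle dataset.
image_paths = [f"images/case_{i:04d}.jpg" for i in range(3064)]
labels = (["meningioma"] * 1200) + (["pituitary"] * 1000) + (["glioma"] * 864)

# test_size=0.2 of a 3064-image pool yields the reported 2451/613 split;
# stratify keeps the three tumor classes in the same proportions in both subsets.
train_paths, test_paths, train_labels, test_labels = train_test_split(
    image_paths, labels, test_size=0.2, stratify=labels, random_state=0
)
print(len(train_paths), len(test_paths))  # 2451 613
```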

Fig. 5

Image samples of the dataset.

Experimental platform and hyperparameter setting

All experiments in this study were conducted on a high-performance computing platform. The software environment was configured with Ubuntu 20.04 as the operating system, Python 3.8, and the PyTorch 1.10.0 deep learning framework, while CUDA 11.3 was utilized to enable GPU acceleration. In terms of hardware, the platform was equipped with an NVIDIA RTX 4090 GPU (24 GB VRAM), an AMD EPYC 7T83 64-core processor, and 90 GB of system memory, providing sufficient computational capacity for efficient data loading and model training. For the hyperparameter configuration, the input image resolution was fixed at 640 × 640, with a batch size of 64, and the model was trained for 200 epochs using an initial learning rate of 0.01 and 8 data-loading threads. The Stochastic Gradient Descent (SGD) optimizer was adopted, with a momentum factor of 0.937 and weight decay of 0.0005, ensuring stable convergence and improved generalization performance.
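The training configuration can be expressed, for example, through an Ultralytics-style interface as sketched below; the framework choice and the model/dataset YAML names are assumptions, while the hyperparameter values mirror those reported above.

```python
from ultralytics import YOLO

# Hypothetical YAML describing the improved network; not the released configuration.
model = YOLO("yolov12n-mona-dyt-cga.yaml")

model.train(
    data="brain_tumor.yaml",   # hypothetical dataset config
    imgsz=640,                 # input resolution 640 x 640
    epochs=200,
    batch=64,
    lr0=0.01,                  # initial learning rate
    optimizer="SGD",
    momentum=0.937,
    weight_decay=0.0005,
    workers=8,                 # data-loading threads
    device=0,                  # single RTX 4090 in the reported setup
)
```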

Evaluation indicators

In this study, the evaluation metrics include F1-score, Precision (P), Recall (R), Average Precision (AP), and mean Average Precision (mAP)42. In addition, the number of Parameters is also considered as a reference indicator. Their mathematical formulations are given as follows:

$${\text{Precision}}=\frac{{T{\text{p}}}}{{T{\text{p}}+F{\text{p}}}},$$
(19)
$${\text{Recall}}=\frac{{T{\text{p}}}}{{T{\text{p}}+F{\text{N}}}},$$
(20)
$${\text{AP}}=\int_{0}^{1} {P(R){\text{d}}R} ,$$
(21)
$${\text{mAP}}=\frac{1}{n}\sum\limits_{i=1}^{n} {AP(i)} ,$$
(22)
$${\text{F}}1=\frac{{2 \times {\text{Precision}} \times {\text{Recall}}}}{{{\text{Precision}}+{\text{Recall}}}},$$
(23)

where \(T{\text{p}}\) denotes the number of correctly detected targets; \(F{\text{p}}\) represents the number of falsely detected targets; \(F{\text{N}}\) indicates the number of missed targets; n refers to the total number of classes; and \(AP(i)\) denotes the average precision of the i-th target class.
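The following sketch shows how Eqs. (19)–(23) can be computed from detection counts and per-threshold precision–recall values; the numerical integration of Eq. (21) is simplified (no interpolation) for brevity.

```python
import numpy as np


def precision_recall_f1(tp: int, fp: int, fn: int):
    """Eqs. (19), (20), and (23) for a single class at a fixed threshold."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def average_precision(precisions: np.ndarray, recalls: np.ndarray) -> float:
    """Eq. (21): area under the precision-recall curve, approximated numerically.

    Inputs are per-threshold values sorted by increasing recall; real detectors
    usually interpolate the curve first, which is omitted here.
    """
    return float(np.trapz(precisions, recalls))


# Eq. (22): mAP is the mean of the per-class APs, e.g.
# mAP = np.mean([ap_meningioma, ap_pituitary, ap_glioma])
```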

Experimental analysis

Comparison between ours and the basic model

To verify the effectiveness of the proposed approach, its detection performance was evaluated on the brain tumor dataset and compared with the baseline model YOLOv12n. The key quantitative metrics are summarized in Table 1.

Table 1 Comparison of detection performance of YOLOv12n and Ours on brain tumor datasets.

As shown in Table 1, the proposed method outperforms YOLOv12n across all metrics and tumor categories. The overall Precision, Recall, and mAP@0.5 improved by 1.8%, 1.5%, and 1.4%, respectively. The most notable gain in Recall is observed for Pituitary_tumor (↑4.0%), indicating enhanced sensitivity to small or boundary-blurred lesions.

Fig. 6

Normalized Confusion matrix: (a) YOLOv12n; (b) Ours.

As shown in Fig. 6, our proposed method comprehensively outperforms YOLOv12n in brain tumor classification. Our method improved recognition rates for all three tumor categories. Notably, “Pituitary_Tumor” accuracy significantly increased from 0.92 to 0.96, demonstrating stronger robustness for cases with blurred boundaries or subtle features. For the “Glioma” category, YOLOv12n had a 0.16 misclassification rate, which our method reduced to 0.15. Overall, our approach not only improves accuracy for critical tumor categories but also reduces confusion between tumor and background regions, showing its potential for complex medical applications.

Fig. 7

F1-Confidence Curve: (a) YOLOv12n; (b) Ours.

As illustrated in Fig. 7, the F1-confidence curve of our model (b) reaches a higher peak F1-score than that of YOLOv12n (a) and maintains robust performance over a wider range of confidence thresholds, demonstrating better stability and generalization. The corresponding precision-recall behavior of the two models is analyzed in Fig. 8.

Fig. 8

Precision-Recall Curve: (a) YOLOv12n; (b) Ours.

As illustrated in Fig. 8, the PR curves show our method outperforms YOLOv12n on the brain tumor classification task. Our mAP@0.5 reached 0.940, surpassing YOLOv12n’s 0.926. At the category level, Meningioma and Pituitary_Tumor maintained high precision. More importantly, for Glioma, YOLOv12n’s most challenging category, the score significantly increased from 0.825 to 0.858. Furthermore, our PR curves are smoother in the recall range, indicating stable precision at high recall. In summary, our method exceeds YOLOv12n in both overall accuracy and robustness, with clear advantages in hard-to-classify categories.

Fig. 9

Detection effect on the dataset: (a) YOLOv12n; (b) Ours.

As illustrated in Fig. 9, notable differences can be observed between the detection results obtained by YOLOv12n and the proposed method on brain MRI images. For Glioma, YOLOv12n exhibited issues such as duplicate bounding boxes and relatively low confidence scores (0.81 and 0.55). In the case of Meningioma, the confidence score was only 0.69, reflecting instability in lesion boundary localization. In contrast, the proposed method produced more precise and concise detection outcomes, successfully eliminating duplicate bounding boxes and substantially improving confidence scores: 0.83 and 0.90 for Glioma, and 0.93 for Meningioma. Overall, the proposed approach surpasses YOLOv12n in localization accuracy, confidence, and result consistency, clearly demonstrating its advantages and reliability for medical image detection tasks.

Comparison of results from multiple different models

To validate the effectiveness of the proposed model, several representative algorithms, including the YOLO series43,44,45, RT-DETR-r1846, YOLOv5m-ESA47, FALS-YOLO48, and Vision Transformer, were selected for comparative evaluation. The corresponding performance results are summarized in Table 2, where the differences in detection accuracy and inference speed among the models are clearly demonstrated.

Table 2 Comparison of results from multiple different Models.

Table 2 presents the results of accuracy-related evaluation metrics. Precision reflects the accuracy of target prediction, with the highest value achieved by the proposed method, followed by RT-DETR-r18, YOLOv12n, and YOLOv5m-ESA. Vision Transformer achieves a Precision of 85.0%, which is lower than the YOLO series models, reflecting the challenges of applying a generic large-scale architecture directly to medical image detection without specialized adaptation. This result indicates that the proposed model possesses the strongest capability for reducing false detections. Recall, which measures the ability to correctly identify all true targets, reaches 88.0% for the proposed method, slightly higher than YOLOv5m-ESA, demonstrating its effectiveness in minimizing missed detections. Notably, Vision Transformer yields a Recall of 83.0%, the lowest among all compared models, further underscoring its relative weakness in capturing all true lesions within the medical imaging context. As a comprehensive measure of detection performance, mAP@0.5 attains the highest value of 94.0% for the proposed method, followed by YOLOv11n and RT-DETR-r18. Vision Transformer attains an mAP@0.5 of 90.0%, which, while respectable, lags behind the YOLO series and the proposed method. Overall, across all three detection accuracy metrics, the proposed approach consistently outperforms the comparative models, reflecting superior object detection capability. In terms of inference speed, YOLOv12n achieves the highest rate at 625 FPS, while YOLOv11n, YOLOv10n, and YOLOv8n range between 400 and 450 FPS. The proposed method achieves 370.3 FPS, which, although slightly slower than the YOLO series, remains significantly faster than RT-DETR-r18. In summary, the proposed approach delivers substantial improvements in detection accuracy while maintaining high computational efficiency, thereby achieving an excellent balance between speed and precision. This balance makes the method particularly suitable for real-time medical imaging applications that demand both reliability and high detection accuracy.

Table 3 Comparison results of multiple algorithms in average precision (AP/%).

As presented in Table 3, the proposed method achieves the highest detection accuracy across all three brain tumor categories. For Meningioma, the Average Precision (AP) reaches 99.5%, surpassing both RT-DETR-r18 and YOLOv12n, and demonstrating outstanding robustness and precision. In the case of Pituitary_Tumor, the proposed approach attains the highest accuracy of 98.0%, outperforming all other models and indicating superior capability in identifying lesions with indistinct boundaries or small volumes. For the most challenging category, Glioma, the method achieves an AP of 84.5%, exceeding the results of YOLOv11n, YOLOv12n, and RT-DETR-r18, thereby effectively mitigating the detection limitations observed in existing models. Overall, the proposed framework not only maintains near-perfect accuracy for easily distinguishable tumor types but also delivers notable improvements for the difficult-to-classify Glioma category, demonstrating enhanced robustness and improved inter-class performance balance.

Table 4 Comparison results of multiple algorithms in precision (P%).

As presented in Table 4, the proposed method achieves the highest Precision across all three brain tumor categories. For Meningioma, the Precision reaches 98.0%, exceeding that of YOLOv11n and YOLOv12n, and reflecting improved stability in reducing false-positive detections. In the Pituitary_Tumor category, the proposed model attains a Precision of 98.5%, significantly surpassing other methods and demonstrating superior capability in accurately identifying lesions with indistinct boundaries or small volumes. For the most challenging category, Glioma, the method achieves a leading Precision of 85.0%, outperforming YOLOv12n and RT-DETR-r18, and considerably higher than YOLOv10n, confirming effectiveness in mitigating high false-positive rates for difficult-to-classify cases. Overall, the proposed approach consistently achieves the best Precision across all tumor categories, showing strong robustness in common classes and clear advantages in complex scenarios such as Glioma. These findings highlight the superiority of the method in minimizing false detections and improving diagnostic reliability in medical image analysis.

Table 5 Comparison results of multiple algorithms on recall (R%).

As shown in Table 5, the proposed method achieves the highest Recall across all three brain tumor categories. For Meningioma, the Recall reaches 97.6%, exceeding that of YOLOv12n and RT-DETR-r18, thereby further reducing the incidence of missed detections. In the Pituitary_Tumor category, the proposed model attains a Recall of 92.5%, which is notably higher than that of the other models (89.7%–91.5%), demonstrating enhanced capability in detecting lesions with small volumes or blurred boundaries. For the most challenging category, Glioma, the method achieves a Recall of 74.0%; although the improvement is relatively modest, it still surpasses YOLOv11n, YOLOv12n, and RT-DETR-r18, reflecting consistent progress in this difficult-to-detect class. Overall, the proposed approach achieves the best Recall across all categories, effectively minimizing the risk of missed detections. The marked improvements in challenging tumor types such as Pituitary_Tumor and Glioma further underscore the robustness and potential clinical applicability of the proposed model.

Table 6 Results of experiments conducted on the PASCAL VOC and VisDrone datasets.

Table 6 presents the experimental results on the PASCAL VOC and VisDrone datasets, demonstrating that the proposed method consistently outperforms YOLOv8n and YOLOv10n across all evaluation metrics. On the PASCAL VOC dataset, the proposed model achieves a recall of 43.1, an mAP@0.5 of 45.1, and an mAP@0.5–0.95 of 27.6, significantly surpassing the results of both YOLO baselines. Similarly, on the VisDrone dataset, the model attains a recall of 34.1, an mAP@0.5 of 34.0, and an mAP@0.5–0.95 of 19.3, again outperforming YOLOv8n and YOLOv10n. While YOLOv8n demonstrates competitive performance, it remains inferior to the proposed approach, and YOLOv10n exhibits the weakest results, with lower recall and mAP scores across both datasets. These findings confirm that the proposed method delivers superior object detection accuracy and robustness under varying IoU thresholds, highlighting its effectiveness and generalization capability across different benchmark datasets.

As illustrated in Fig. 10, panel (a) presents the original MRI scans of representative brain tumor cases, whereas panel (b) shows the corresponding Grad-CAM visualization results produced by the improved model. The generated heatmaps distinctly highlight the key regions of attention involved in tumor detection, indicating that the model accurately concentrates on lesion areas. The localization of these highlighted regions aligns closely with the actual tumor positions. Overall, the model demonstrates high accuracy and stability in regional discrimination, reflecting strong interpretability and reliable feature learning behavior. These visualization results further validate the effectiveness of the A2C2f-Mona, C2PSA-DyT, and CGAFusion modules in enhancing feature extraction, representation, and fusion capabilities.
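For reference, a generic Grad-CAM sketch in PyTorch is shown below. The choice of the scalar target to backpropagate for a detector (here, the summed raw output) and the target layer are illustrative assumptions; the exact visualization pipeline used in this study is not reproduced.

```python
import torch
import torch.nn.functional as F


def grad_cam(model: torch.nn.Module, image: torch.Tensor, target_layer: torch.nn.Module) -> torch.Tensor:
    """Minimal Grad-CAM sketch for one image of shape (1, C, H, W)."""
    feats, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))

    model.eval()
    out = model(image)
    # Scalar target for backprop: summing the raw output is an assumption for illustration.
    score = out.sum() if torch.is_tensor(out) else sum(o.sum() for o in out)
    model.zero_grad()
    score.backward()
    h1.remove()
    h2.remove()

    weights = grads[0].mean(dim=(2, 3), keepdim=True)             # channel-wise importance
    cam = F.relu((weights * feats[0]).sum(dim=1, keepdim=True))   # weighted activation map
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)     # normalise to [0, 1]
```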

Fig. 10

Grad-CAM visualization results of the improved model for brain tumor detection; (a) Original Image; (b) Ours.

To provide a more intuitive evaluation of model performance in brain tumor detection, a visual comparison of detection results on MRI images was carried out using RT-DETR-r18, YOLOv8n, YOLOv10n, YOLOv11n, YOLOv12n, and the proposed method. In contrast to using numerical metrics alone, the visual results more clearly reveal differences in tumor localization accuracy, confidence levels, and boundary delineation. These visual comparisons enable a more comprehensive and interpretable assessment of the detection capability of each model and offer stronger evidence for their potential applicability in clinical practice.

Fig. 11

Comparison of detection results of different models on multi-category brain tumor images: (a) RT-DETR-r18; (b) YOLOv8n; (c) YOLOv10n; (d) YOLOv11n; (e) YOLOv12n; (f) Ours.

Figure 11 presents the visual detection results of multiple models on brain MRI images, revealing pronounced differences in detection accuracy and stability. For RT-DETR-r18, overlapping bounding boxes appear in the detection of Glioma, indicating redundancy and instability in lesion boundary localization. Its confidence scores reach 0.86 for Pituitary_Tumor and only 0.91 and 0.92 for Meningioma, reflecting limited overall reliability. YOLOv8n produces cleaner and more stable results, achieving relatively accurate recognition of Glioma (0.81), Pituitary_Tumor, and Meningioma, and demonstrating greater consistency compared with RT-DETR. However, YOLOv10n exhibits reduced accuracy for Glioma (0.59), despite satisfactory outcomes for Pituitary_Tumor (0.92) and Meningioma (0.88, 0.95). YOLOv11n shows marginal improvement for Glioma (0.64) and achieves confidence scores of 0.90 and 0.92 for Pituitary_Tumor and Meningioma, respectively, though inter-category imbalance remains evident. In contrast, YOLOv12n performs notably worse: it fails to detect some Glioma cases, and its confidence scores for Pituitary_Tumor and Meningioma are limited to 0.86, 0.89, and 0.90, all lower than those of earlier models. By comparison, the proposed method demonstrates clear superiority across all three tumor categories. For Glioma, it achieves a confidence score of 0.83, significantly outperforming YOLOv10n, YOLOv11n, and RT-DETR-r18, while effectively eliminating missed detections and redundant bounding boxes. For Pituitary_Tumor, the confidence score rises to 0.93, exceeding all baseline models. For Meningioma, dual detections with 0.95 confidence are observed, producing sharper boundary delineations and highly consistent results. Overall, the proposed approach surpasses existing models in accuracy, stability, and robustness. It not only delivers high-confidence detection results but also markedly enhances performance in challenging categories such as Glioma, demonstrating greater clinical applicability and practical value in medical image detection.

The study initially relied on a single dataset from the Kaggle platform, which may limit the model’s generalization ability and clinical applicability. To evaluate performance across different data sources, we therefore introduced a brain CT image dataset from Radiopaedia (https://radiopaedia.org/search?scope=cases&sort=date_of_publication) for the tumor detection task. This dataset includes brain CT scan images from various patients, covering different tumor types such as gliomas, meningiomas, and pituitary tumors.

Fig. 12

Validation results on brain tumor CT scans.

Figure 12 shows multiple brain CT scan images from the dataset, with bounding boxes highlighting different types of brain tumors, accompanied by predicted class labels and confidence scores. The confidence score next to each bounding box reflects the model’s confidence in its prediction. Higher confidence values (e.g., 0.92 for meningioma) indicate greater certainty in the tumor type identification. This dataset not only helps evaluate the model’s accuracy in recognizing and localizing brain tumors but also serves as a benchmark for validating tumor detection models in other datasets, enhancing the model’s clinical relevance and generalization ability. By validating across different datasets, we can more comprehensively assess the model’s adaptability in diverse clinical settings, further strengthening its reliability and effectiveness in real-world clinical applications.

Ablation experiment

To assess the contribution of each proposed enhancement module to the overall model performance, a series of ablation experiments were conducted using YOLOv12n as the baseline. The experimental results are summarized in Table 7. By comparing detection metrics under different module configurations, the individual and combined effects of the proposed components on improving model Precision and Recall can be intuitively evaluated.

Table 7 Results of ablation Experiment.

According to the ablation study results summarized in Table 7, the detection performance of the model improves progressively as each enhancement module is integrated. The baseline YOLOv12n model achieves Precision, Recall, and mAP@0.5 values of 92.0%, 86.5%, and 92.6%, respectively. With the addition of the A2C2f-Mona module, the Recall increases to 88.4%, and mAP@0.5 rises to 93.6%. Incorporation of the CGAFusion module further enhances Precision to 93.2%, while mAP@0.5 reaches 93.5%. When all three modules, A2C2f-Mona, CGAFusion, and C2PSA-DyT, are combined, the model achieves the highest overall performance, reaching Precision, Recall, and mAP@0.5 values of 93.8%, 88.0%, and 94.0%, respectively. This represents a 1.4% improvement in mAP compared with the baseline. In general, the three modules provide complementary advantages in improving both detection accuracy and recall. Their joint optimization significantly enhances the object detection capability and overall robustness of the YOLOv12n framework.

Statistical significance analysis

To quantitatively assess the significance of performance improvements, we conducted bootstrapping to compute 95% confidence intervals for Precision, Recall, and mAP@0.5. Additionally, paired t-tests were performed comparing our method with YOLOv12n across all test samples.
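A sketch of the bootstrap confidence intervals and the paired t-test is given below; the per-image score arrays are hypothetical placeholders standing in for the metrics actually collected on the test set.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)


def bootstrap_ci(per_image_scores: np.ndarray, n_boot: int = 10000, alpha: float = 0.05):
    """95% percentile bootstrap confidence interval for the mean of a per-image metric."""
    n = len(per_image_scores)
    means = np.array([per_image_scores[rng.integers(0, n, n)].mean() for _ in range(n_boot)])
    return np.quantile(means, alpha / 2), np.quantile(means, 1 - alpha / 2)


# Hypothetical per-image scores for the two models on the same 613 test images.
scores_ours = rng.normal(0.94, 0.05, 613)
scores_baseline = rng.normal(0.926, 0.05, 613)

t_stat, p_value = stats.ttest_rel(scores_ours, scores_baseline)   # paired t-test
ci_low, ci_high = bootstrap_ci(scores_ours - scores_baseline)     # CI of the mean improvement
print(f"p = {p_value:.4f}, 95% CI of improvement = [{ci_low:.3f}, {ci_high:.3f}]")
```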

Table 8 Comparison of detection performance of YOLOv12n and Ours on brain tumor datasets (with 95% CIs).

Table 8 presents the results of the statistical analysis, confirming significant improvements for the proposed method. The overall mAP@0.5 increased from 92.6% to 94.0%, a statistically significant difference (p < 0.01) with a 95% confidence interval of 0.8% to 2.0%. For Glioma, mAP@0.5 improved from 82.5% to 84.5%; although the absolute gain is modest, the difference remains statistically significant (p < 0.05), with a 95% confidence interval of 0.5% to 3.5%. For Pituitary Tumor, Recall improved significantly from 89.7% to 93.7% (p < 0.001), with a 95% confidence interval of 2.5% to 5.5%.

Discussion

Performance insights and error analysis

The proposed model demonstrates consistent improvements across key detection metrics, particularly for challenging tumor types such as pituitary tumors and gliomas. The significant recall gain for pituitary tumors (89.7% to 93.7%) underscores the model’s enhanced sensitivity to small and boundary-ambiguous lesions, a common diagnostic difficulty in clinical practice. Conversely, the more modest improvement in glioma detection suggests that certain intrinsic characteristics, such as high morphological heterogeneity, diffuse infiltration, and intensity similarity to normal brain tissue, remain challenging even for advanced architectures.

Error analysis reveals that most false negatives for glioma arise from cases with iso-intense appearance on T2-weighted images or lesions smaller than 10 mm in diameter. False positives occasionally occur in periventricular regions where normal anatomical structures exhibit similar texture patterns to low-grade gliomas. These observations highlight the need for incorporating multi-sequence MRI (e.g., T1-weighted, FLAIR, contrast-enhanced T1) to improve specificity and boundary delineation.

Failure cases

Although the proposed method in this paper demonstrates excellent overall detection performance, false positives and missed detections still occur in some challenging cases. Through systematic analysis of the erroneous samples in the test set, we identified the following typical failure cases:

(1) Missed Detection of Small and Fuzzy Tumors:

In some cases of pituitary microadenomas and low-grade gliomas, where the tumor diameter is less than 5 mm or the boundary has a very low contrast with the surrounding normal brain tissue, the model fails to generate an effective detection box. In these cases, the signal intensity on T2-weighted images is similar to that of the surrounding brain parenchyma, making it difficult for the model to distinguish the lesion from the background. Despite the enhancement of multi-scale feature fusion through the channel attention mechanism of the CGAFusion module, the model still lacks sufficient semantic discriminative ability for very low-contrast small lesions.

(2) Misidentification of Adjacent Structures as Tumors:

In some meningioma cases, the model occasionally misidentifies dural enhancement foci or vascular cross-sections as tumor regions, leading to false positives. These errors often occur when the tumor is adjacent to the dura mater or major blood vessels. Due to the similarity in local texture and intensity features, the model struggles to make an accurate distinction using a single MRI sequence. This misidentification suggests that incorporating multi-sequence information (such as T1-enhanced or FLAIR sequences) in the future could improve specificity.

(3) Missed Detection in Multiple Lesions:

In a few cases of multifocal gliomas, the model may detect only some of the lesions, while missing smaller or poorly defined lesions. This reflects the model’s attention-distraction issue in complex multi-target scenarios, where multiple lesions are spatially dispersed or exhibit significant morphological differences. The model may focus too much on certain prominent areas while neglecting others.

Interpretability and attention alignment

The Grad-CAM visualizations (Fig. 10) confirm that the model’s attention aligns well with radiologically relevant regions. The A2C2f-Mona module’s multi-scale receptive fields enable the network to capture both localized tumor cores and peripheral infiltrative patterns, which is critical for glioma grading. The channel attention in CGAFusion selectively amplifies feature channels corresponding to edge and texture information, improving boundary localization. Nevertheless, attention maps alone do not fully explain the model’s decision logic. Future work should integrate more intrinsic interpretability methods, such as attention rollout or symbolic reasoning layers, to provide radiologists with actionable explanations for uncertain cases.

Clinical implications and real-world constraints

While the proposed framework achieves high accuracy on a curated public dataset, several practical barriers must be addressed before clinical deployment:

(1) dataset heterogeneity

Real-world MRI data vary widely in scanner type, imaging protocol, resolution, and patient population. Performance may degrade when applied to external institutions without fine-tuning. Multi-center validation with harmonized imaging protocols is essential to assess generalizability.

(2) multimodal and 3D context

Current work uses 2D slices, discarding the rich 3D spatial context available in volumetric MRI. Extending the architecture to 3D convolutions or transformer-based volumetric encoders could better capture tumor morphology and spatial relationships with critical brain structures.

(3) integration into clinical workflow

The model must operate within existing PACS/RIS environments, providing seamless integration with radiologists’ reading stations. Inference speed, although reduced compared to YOLOv12n (370 FPS vs. 625 FPS), remains sufficient for real-time batch processing in clinical settings. However, model compression techniques (pruning, quantization) could further optimize deployment on edge devices.

(4) regulatory and ethical considerations

Regulatory approval (e.g., FDA, CE marking) requires rigorous validation across diverse populations, transparency in decision-making, and adherence to data-protection standards (HIPAA, GDPR). The proposed system should be positioned as a decision-support tool, not a replacement for radiologist judgment, to mitigate liability and ensure human-in-the-loop oversight.

Explainability evaluation and clinical validation

To further assess the decision-making basis of the model, in addition to providing Grad-CAM heatmaps, this study also conducted quantitative explainability evaluations and preliminary clinical expert validation:

(1) Quantitative Explainability Metrics:

The IoU-based Localization Accuracy (ILA) was used to quantitatively align and analyze the significant activation regions in the Grad-CAM heatmaps with the actual tumor regions. On the test set, the model’s average ILA score was 0.83 ± 0.07, indicating that the regions highlighted in the heatmap were highly consistent with the spatial location of the tumors.
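A minimal sketch of the ILA computation is shown below; the 0.5 activation threshold and the use of a filled bounding box as the ground-truth region are assumptions made for illustration, as the exact protocol is not detailed above.

```python
import numpy as np


def ila_score(cam: np.ndarray, gt_mask: np.ndarray, threshold: float = 0.5) -> float:
    """IoU-based Localization Accuracy between a Grad-CAM heatmap and the tumor region.

    `cam` is a heatmap normalised to [0, 1]; `gt_mask` is a binary mask of the
    annotated tumor area (e.g. a filled bounding box in a detection setting).
    """
    hot = cam >= threshold
    gt = gt_mask.astype(bool)
    inter = np.logical_and(hot, gt).sum()
    union = np.logical_or(hot, gt).sum()
    return float(inter / union) if union else 0.0
```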

Spearman correlation analysis was used to examine the relationship between the model’s prediction confidence and the medical image features (such as edge gradients and local texture contrast) in the heatmap-focused areas. The results showed that the model’s prediction confidence was relatively low for tumor areas with blurred boundaries and low contrast, consistent with the activation intensity distribution in the heatmap, reflecting the model’s intrinsic “uncertainty perception” for difficult cases.
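The correlation analysis can be reproduced with scipy, as sketched below; the confidence and boundary-contrast values are hypothetical placeholders for the per-lesion measurements described above.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-lesion arrays: model confidence vs. a boundary-contrast proxy
# (e.g. mean edge-gradient magnitude inside the heatmap-focused region).
confidences = np.array([0.93, 0.88, 0.61, 0.55, 0.84, 0.47])
edge_contrast = np.array([0.71, 0.66, 0.32, 0.28, 0.58, 0.21])

rho, p = spearmanr(confidences, edge_contrast)
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")
```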

(2) Clinical Expert Validation:

Two experienced neuro-radiologists were invited to perform a blinded evaluation of the Grad-CAM heatmaps for 100 randomly selected test samples.

The experts rated the heatmaps on a 5-point scale for the following questions: "Does the heatmap effectively focus on the tumor region?", "Does it help in understanding the model's decision?", and "Can it assist in clinical area focus?".

The evaluation results showed an average expert score of 4.2 (range: 3.5–4.8), indicating that most heatmaps were considered to have good clinical reference value. Experts particularly noted that in cases of multifocal tumors and blurred boundaries, the heatmaps were helpful for quickly locating the lesions and reducing missed diagnoses due to visual fatigue.

However, the current explainability methods are still primarily focused on the “region of interest” level, lacking higher-level semantic explanations for tumor subtypes, malignancy grade, etc. Future work will further integrate visual-semantic joint explanation methods, such as combining natural language generation modules to provide preliminary text-based diagnostic hints for model predictions, thus building a more transparent and trustworthy clinical AI-assisted diagnostic system.

Conclusion

In this study, we proposed an enhanced YOLOv12n framework for brain tumor detection, designed to overcome key limitations in existing models. By integrating three novel modules, A2C2f-Mona, C2PSA-DyT, and CGAFusion, we successfully addressed challenges in multi-scale extraction, training stability, and cross-level feature fusion. Experimental results demonstrated the framework’s superior performance over the baseline YOLOv12n, achieving a 94.0% mAP@0.5. The method showed significant gains in pituitary tumor recall and glioma robustness, proving particularly effective for small and boundary-ambiguous lesions. Despite these promising results, this study has two main limitations. The evaluation was restricted to a single public dataset, which may limit the model’s generalizability to diverse clinical scanners and protocols. Additionally, the model operates on 2D slices, thereby ignoring the crucial 3D volumetric context of the tumors and the spatial continuity between slices. Future work will directly address these limitations. We plan to conduct rigorous multi-center validation using diverse, real-world datasets to ensure robustness. Furthermore, we will prioritize extending the framework to 3D architectures to fully leverage volumetric MRI data. Finally, we aim to enhance model interpretability to build clinical trust and better understand the model’s decision-making process.