Introduction

In the medical field, leukemia is a major threat. This form of cancer primarily affects the blood and bone marrow, causing unchecked growth of abnormal white blood cells (WBCs). These abnormal cells, often called leukemic cells, undermine the body’s immune system because they do not function as healthy WBCs should. Due to this disruption, patients become more susceptible to infections and other complications. Leukemia is a complex disease that manifests in a variety of ways, with acute and chronic forms, lymphocytic and myelogenous subtypes, and other presentations requiring diverse diagnostic and therapeutic methods1,2,3. Acute Lymphoblastic Leukemia (ALL) remains one of the most common childhood cancers worldwide. According to the World Health Organization (WHO, 2024), approximately 85,000 new cases and 41,000 deaths were recorded globally, with the highest incidence observed in children under 15 years of age (≈ 3.2 cases per 100,000). Early diagnosis and accurate classification are critical, as prompt treatment increases survival rates to more than 90% in high-income countries. These statistics highlight the necessity of developing automated diagnostic systems to assist clinical decision-making and improve outcomes. The abnormal behavior of white blood cells is at the center of leukemia, which takes its name from the Greek words “leukos” (white) and “haima” (blood). The four primary subtypes—ALL, CLL, AML, and CML—are summarized in Fig. 1, which illustrates the classification of blood and leukemia cells4.

Fig. 1
figure 1

Blood and Leukemia classification4.

Leukemia diagnosis using conventional methods is laborious, error-prone, and time-consuming because it requires human interpretation of images of blood cells4,5,6. Although CNN-based methods have shown success in medical imaging7,8,9, existing works suffer from dataset imbalance, limited generalizability, and lack of ensemble optimization across multiple classifiers. Moreover, few studies provide statistical validation of improvements, which restricts their clinical reliability.

Research questions:

  • RQ1: Can an ensemble framework that integrates pre-trained CNNs (DenseNet-121, ResNet-34) with multiple machine learning classifiers outperform standalone models for ALL prediction?

  • RQ2: Are the observed improvements statistically significant, ensuring robustness for clinical use?

Motivation

Automated systems are essential in the face of growing healthcare challenges, playing a key role in predicting and categorizing acute lymphoblastic leukemia so that patients receive the care they need and risks are reduced10. Prompt and accurate detection of this cancer has a significant impact on the patient’s treatment and recovery. Artificial intelligence-based machine learning (ML) is helpful for anticipating healthcare problems11,12,13.

Researchers have proposed a number of ML and deep learning techniques for early ALL detection; however, these systems frequently overlook important aspects, such as assessing clinical applicability through statistical testing, or fail to integrate CNN feature extraction with ensemble classifiers14,15,16. Building a reliable and accurate CNN–ML ensemble architecture for real-time ALL prediction is, therefore, the primary objective of this research.

Contribution

This study makes several significant contributions to medical diagnosis, particularly in the area of leukemia prediction:

  • Feature-level fusion: DenseNet-121 and ResNet-34 deep embeddings were fused to form a hybrid representation, improving sensitivity and AUC on the C-NMC ALL dataset.

  • Hybrid classical ML integration: Fused embeddings were classified using optimized SVM, RF, AdaBoost, KNN, and BPN models; statistical tests confirmed that SVM achieved the highest accuracy and recall (p < 0.05).

  • Comprehensive validation: Stratified 5-fold cross-validation, ROC/PR analyses, and calibration (Platt & isotonic regression) were performed to ensure robustness and interpretability.

  • Comparative and ablation studies: The effects of individual CNN features, fusion strategies, and PCA-based dimensionality reduction were evaluated to justify the performance gains.

  • Reproducibility: The paper provides implementation parameters, dataset references, and computational specifications to facilitate replication.

Organization of the paper

Section “Introduction” outlines the significance of leukemia prediction, emphasizing the need for efficient predictive techniques and presenting the proposed ensemble approach. Section “Literature review” summarizes existing research on leukemia prediction and automated methods, highlighting their limitations. Section “Proposed methodology” details dataset characteristics, pre-processing steps, segmentation, feature extraction, and classification. Section “Results and discussion” presents experimental outcomes and critical analysis. Finally, section “Conclusion” summarizes contributions and outlines future directions.

Literature review

The task of detecting leukemia through microscopic blood smear analysis has attracted significant research attention in recent years. Traditional pipelines typically involved a sequence of pre-processing, segmentation, handcrafted feature extraction, and classification steps17. Early works demonstrated the importance of color space conversion and thresholding to enhance white blood cell (WBC) visibility. For example, converting images to CMYK highlighted leukocytes more effectively18,19,20, while HSV-based transformations with fuzzy C-means clustering improved nucleus segmentation21. Similarly, Otsu’s thresholding proved effective for isolating WBCs22, and the use of CIELab color space coupled with SVM achieved accuracies as high as 96%23. While these techniques established the foundation for automated leukemia diagnosis, they relied heavily on handcrafted features, making them less generalizable across different staining protocols, imaging conditions, and datasets.

Machine learning classifiers, including Decision Trees (DT), Support Vector Machines (SVM), and K-Nearest Neighbor (KNN), were widely employed for leukemia prediction tasks24,25. For instance, DT classifiers achieved accuracies of up to 98% on hospital datasets26. Ensemble learning techniques such as Random Forest and AdaBoost further improved diagnostic reliability by reducing overfitting27. However, these approaches were constrained by limited dataset sizes and often ignored class imbalance, which restricted their ability to generalize to real-world clinical applications28,29,30,31,32,33. The advent of deep learning, specifically convolutional neural networks (CNNs), transformed medical image analysis. Because CNN-based models can learn hierarchical representations directly from raw images, they demonstrated superior feature extraction capabilities. Transfer learning approaches such as the VGG architectures attained strong performance, although overfitting remains a risk: the Xception model, for example, achieved 99% accuracy on training data but only 91% on testing data. Subsequent studies investigated hybrid models and image enhancement to improve robustness. For example, one work34 proposed adaptive image sharpening combined with CNNs, while another35 evaluated VGG16, VGG19, and Xception with data augmentation, achieving an F1-score of 92.6%. Moreover, IoMT-based approaches utilizing ResNet-34 and DenseNet-121 were introduced for real-time leukemia detection36, although practical deployment has yet to be validated.

Recent advancements demonstrate an increasing shift toward explainability, multimodal integration, and privacy-preserving training. One study37 introduced an explainable vision transformer (ViT) model for WBC classification, combining high accuracy with heatmap-based localization and thus enhancing interpretability for clinical use. Another38 proposed CoTCoNet, a coupled transformer–convolutional network, showing improved feature representation and leukemia detection performance. Similarly, Vision Transformer features have been employed for leukemia classification39, demonstrating the power of transformer-based architectures for hematology imaging. On the systems side, federated learning has gained attention as a means to address privacy and generalization concerns: one review40 discussed how federated learning in medical imaging facilitates collaboration across institutions without requiring the exchange of sensitive patient data, while another41 detailed the scalability and communication-efficiency challenges encountered when integrating federated learning with medical image analysis. Beyond imaging, multimodal deep learning frameworks that combine imaging with genomic or clinical data are emerging, as highlighted in42, which demonstrated prognostic improvements in tumor prediction using multimodal fusion. These directions indicate a trend toward clinically robust, explainable, and privacy-preserving leukemia detection systems.

Despite these advances, significant gaps remain. First, many existing studies focus on single models, either CNN-based or transformer-based, without leveraging ensemble strategies that integrate multiple feature extraction methods and classifiers. Second, while accuracy metrics are commonly reported, few works conduct formal statistical validation (e.g., p-values or confidence intervals) to confirm the reliability of improvements. Third, class imbalance and limited dataset sizes are rarely addressed explicitly, leading to biased models. Finally, claims of clinical deployment often lack real-time performance testing, federated multi-institutional validation, or integration with explainable AI frameworks.

The present work addresses these gaps by designing a CNN–ML ensemble framework. Features are extracted using DenseNet-121 and ResNet-34 and subsequently classified using multiple machine learning algorithms (SVM, KNN, RF, AdaBoost, BPN). An ensemble strategy is then applied, and performance is validated through accuracy, precision, recall, F1-score, and statistical significance tests. By benchmarking against prior methods and highlighting clinical deployment potential, this study positions itself as a significant step toward reliable and generalizable automated leukemia detection. Table 1 compares selected existing studies on leukemia detection, highlighting datasets, methodologies, performance metrics, and identified limitations; the proposed study demonstrates competitive accuracy and F1-score while addressing ensemble learning and statistical validation gaps.

Table 1 Comparison of existing studies.

Proposed methodology

The proposed method for detecting acute lymphoblastic leukemia (ALL) is an ensemble strategy involving pre-processing, segmentation, deep feature extraction with pretrained convolutional neural networks (CNNs), and classification. Figure 2 shows the sequence of steps in the overall workflow.

Fig. 2
figure 2

Proposed method.

Dataset

In this study, the suggested methodology is tested using the C-NMC leukemia dataset, which is available in the Kaggle repository. The dataset’s goal is the prediction of leukemia from microscopic blood smear images. Kaggle allows users to freely search for and submit datasets, collaborate with other machine learning and deep learning specialists, and build and test models for various data science tasks43. The dataset consists of 10,661 labeled images, categorized into two groups:

  • ALL (leukemia-positive) labeled as 1 (7,272 images).

  • Healthy (hem) labeled as 0 (3,389 images).

To facilitate reproducibility, the dataset was split into 70% training (7,462 images), 15% validation (1,599 images), and 15% testing (1,600 images), with stratification to preserve the class distribution. The dataset shows a class imbalance ratio of ~ 68% ALL to 32% healthy, which was handled through augmentation and balanced sampling strategies.

Representative examples of images from both categories (healthy vs. ALL cells) are shown in Fig. 3, where leukemic cells exhibit morphological abnormalities in nuclei compared to regular white blood cells.

Fig. 3
figure 3

Representative examples from the C-NMC datasets.

Data augmentation was applied to improve model robustness and address imbalance. Augmentation techniques included:

  • Random rotation (± 20°).

  • Horizontal and vertical flipping (probability = 0.5).

  • Contrast adjustment.

  • Gaussian noise injection.

  • Normalization to zero mean and unit variance.

These augmentations expanded the effective dataset size and reduced overfitting risks. Although augmentation increased the size and diversity of the training set, the validation and test sets retain the original class imbalance (≈ 68% ALL). Therefore, imbalance remains a challenge during evaluation, and performance metrics such as Precision–Recall curves are reported to better reflect classifier behavior under skewed class distributions.
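As an illustrative sketch (not the authors’ exact pipeline, which would typically use torchvision transforms), the listed augmentations can be approximated with NumPy alone; random rotation is omitted here to keep the example dependency-free, and the image, noise scale, and contrast factor are assumed values:

```python
import numpy as np

def augment(img, rng):
    """Toy augmentation mirroring the list above: random flips (p = 0.5),
    contrast adjustment, Gaussian noise, then zero-mean/unit-variance
    normalization. `img` is a 2-D array in [0, 1]."""
    if rng.random() < 0.5:                                  # horizontal flip
        img = img[:, ::-1]
    if rng.random() < 0.5:                                  # vertical flip
        img = img[::-1, :]
    img = np.clip(1.2 * (img - img.mean()) + img.mean(), 0, 1)  # contrast
    img = img + rng.normal(0.0, 0.01, img.shape)            # Gaussian noise
    return (img - img.mean()) / (img.std() + 1e-8)          # normalize

rng = np.random.default_rng(5)
img = rng.random((32, 32))        # stand-in for a blood smear patch
aug = augment(img, rng)
```

Applying such transforms on the fly during training yields a different augmented view of each image every epoch, which is what expands the effective dataset size.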

Ethical considerations are addressed since the dataset is publicly available on Kaggle, originally curated for research under open-access terms. No personally identifiable information (PII) is associated with the samples, ensuring compliance with privacy and data protection standards. Nevertheless, it must be noted that while the dataset reflects real medical images, clinical deployment would require validation across diverse multi-institutional cohorts to ensure reliability, fairness across demographic groups, and bias minimization.

Pre-processing

In the first stage of pre-processing, the data is prepared for analysis using a variety of procedures intended to improve image quality. To ensure dataset consistency and boost image quality, this involves standardizing image sizes, denoising, normalization, and a variety of image enhancement techniques. Several variable elements, including differing lighting conditions, staining times, blood film thickness, and potential film flaws, come into play during data gathering. These elements may introduce undesired visual artifacts and alter how colors are distributed throughout the images44.

To improve the contrast of particular blood components and correct likely artifacts in microscopic images, a pre-processing technique is introduced. Gamma correction is mathematically defined as:

$$I_{\text{corrected}}(x,y)=I(x,y)^{\gamma},\;\;\gamma \in \left[ 0.8,\,1.2 \right]$$
(1)
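As a minimal sketch of Eq. (1), gamma correction can be applied element-wise to an image normalized to [0, 1]; the patch values and γ settings below are illustrative, not from the study:

```python
import numpy as np

def gamma_correct(img, gamma):
    """Apply Eq. (1): I_corrected(x, y) = I(x, y) ** gamma,
    for an image normalized to the [0, 1] range."""
    assert 0.8 <= gamma <= 1.2, "the paper restricts gamma to [0.8, 1.2]"
    return np.clip(img, 0.0, 1.0) ** gamma

# gamma < 1 brightens mid-tones; gamma > 1 darkens them
patch = np.array([[0.25, 0.50], [0.75, 1.00]])
brightened = gamma_correct(patch, gamma=0.8)
darkened = gamma_correct(patch, gamma=1.2)
```

Because inputs lie in [0, 1], exponents below 1 raise every intermediate intensity and exponents above 1 lower it, leaving black (0) and white (1) fixed.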

Segmentation

The categorization of image elements into discrete groups is a crucial process known as image segmentation. Many researchers in this area have focused on clustered image segmentation, with particular emphasis on the widely used Enhanced Multi-Parameter Clustering Algorithm.

The segmentation step can be expressed as a clustering optimization:

$$\mathop {\arg \hbox{min} }\limits_{C} \sum\limits_{{i=1}}^{k} {\sum\limits_{{x \in {C_i}}} {{{\left\| {x - {\mu _i}} \right\|}^2}} }$$
(2)

where Ci is the cluster and µi its centroid. This ensures WBC nuclei are isolated for subsequent feature extraction.
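The clustering objective in Eq. (2) is the standard k-means criterion, so a sketch with scikit-learn illustrates the idea; the two synthetic color clusters (background vs. nucleus pixels) and their means are invented for demonstration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy "stained smear": pixels drawn from two tight color clusters,
# standing in for background vs. WBC-nucleus pixels (RGB in [0, 1])
rng = np.random.default_rng(0)
background = rng.normal(loc=[0.9, 0.8, 0.8], scale=0.02, size=(500, 3))
nuclei = rng.normal(loc=[0.3, 0.2, 0.6], scale=0.02, size=(100, 3))
pixels = np.vstack([background, nuclei])

# KMeans minimizes Eq. (2): the sum of squared distances to centroids
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pixels)
labels = km.labels_

# Heuristic: the smaller cluster is taken as the nucleus mask
nucleus_label = int(np.argmin(np.bincount(labels)))
mask = labels == nucleus_label
```

In practice the same fit would be applied to every pixel of a pre-processed smear image, and the boolean mask reshaped back to the image grid to isolate WBC nuclei.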

Feature extraction

In the realms of computer vision and image processing, feature extraction plays a crucial role. Feature extraction in this study is performed using the ResNet-34 and DenseNet-121 architectures, which are pre-trained convolutional neural networks.

A convolutional layer performs:

$${f_{i,j}}^{{(l)}}=\sigma \left( {\sum\limits_{{m,n}} {{W_{m,n}}^{{(l)}} \cdot {X_{i+m,j+n}}+{b^{(l)}}} } \right)$$
(3)

where fi,j(l) is the activation at layer l, W the convolution kernel, b the bias, and σ the ReLU activation function.
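A minimal single-channel sketch of Eq. (3) (a valid cross-correlation followed by ReLU); the 4×4 input and 2×2 kernel are toy values, and real CNN layers sum over many channels and kernels:

```python
import numpy as np

def conv2d_relu(X, W, b):
    """Eq. (3) for one channel: slide kernel W over X ("valid" mode),
    add bias b, then apply sigma = ReLU."""
    kh, kw = W.shape
    out_h = X.shape[0] - kh + 1
    out_w = X.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = (W * X[i:i + kh, j:j + kw]).sum() + b
    return np.maximum(out, 0.0)   # ReLU nonlinearity

X = np.arange(16, dtype=float).reshape(4, 4)
K = np.array([[1.0, 0.0], [0.0, -1.0]])   # a simple diagonal-difference kernel
out = conv2d_relu(X, K, b=0.0)
```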

DenseNet-121 connectivity:

$${x_l}={H_l}\left( {\left[ {{x_0},{x_1}, \ldots ,{x_{l - 1}}} \right]} \right)$$
(4)

ResNet-34 residual mapping:

$${y_l}=F\left( {{x_l},{W_l}} \right)+{x_l}$$
(5)

Global Average Pooling (GAP) condenses each network’s feature maps into a fixed-length vector; the two vectors are then concatenated:

$$\mathbf{f}=\left[ {{\mathbf{f}_{{\text{DenseNet}}}},{\mathbf{f}_{{\text{ResNet}}}}} \right] \in {{\mathbb{R}}^{1536}}$$
(6)
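The GAP-and-concatenate step of Eq. (6) reduces each backbone’s final feature maps (channels × H × W) to a per-channel mean, then joins the two embeddings; the random feature maps and the 7×7 spatial size below are placeholders for real DenseNet-121/ResNet-34 outputs:

```python
import numpy as np

def gap(feature_maps):
    """Global Average Pooling: (channels, H, W) -> (channels,) vector,
    averaging each map over its spatial dimensions."""
    return feature_maps.mean(axis=(1, 2))

rng = np.random.default_rng(1)
dense_maps = rng.standard_normal((1024, 7, 7))   # stand-in DenseNet-121 maps
res_maps = rng.standard_normal((512, 7, 7))      # stand-in ResNet-34 maps

f_dense = gap(dense_maps)                 # 1024-dim embedding
f_res = gap(res_maps)                     # 512-dim embedding
f = np.concatenate([f_dense, f_res])      # fused 1536-dim vector, Eq. (6)
```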

Feature representation and fusion

To generate a robust feature representation for classification, deep embeddings were extracted from two pretrained CNN backbones—DenseNet-121 and ResNet-34. After Global Average Pooling (GAP), DenseNet-121 produced a 1024-dimensional feature vector, while ResNet-34 generated a 512-dimensional vector.

Feature fusion:

The two embeddings were concatenated to form a fused 1536-dimensional hybrid feature vector (1024 + 512). This fusion preserves complementary representations: DenseNet captures dense hierarchical features, while ResNet contributes residual-based structural information.

Scaling and normalization:

To ensure consistent scale across networks, features from each backbone were standardized using zero-mean, unit-variance normalization computed from the training set. After concatenation, a final L2-normalization was applied to the fused vector to stabilize classifier training and prevent dominance by high-magnitude coefficients.

Dimensionality reduction (PCA):

To reduce redundancy and mitigate sparsity in the fused feature space, Principal Component Analysis (PCA) was optionally applied. PCA retained 98% of the variance, resulting in approximately 480 principal components. These reduced representations improved computational efficiency and enhanced classifier generalization. This fused and normalized feature representation serves as the input to all downstream classifiers used in this study.
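The fusion steps above (per-backbone standardization, concatenation, L2-normalization, optional PCA at 98% variance) can be sketched with scikit-learn; the random embeddings and the sample count are stand-ins for the real CNN features, and in practice the scalers and PCA would be fitted on the training split only:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X_dense = rng.standard_normal((200, 1024))   # stand-in DenseNet embeddings
X_res = rng.standard_normal((200, 512))      # stand-in ResNet embeddings

# 1) Zero-mean, unit-variance scaling, computed per backbone
X_dense = StandardScaler().fit_transform(X_dense)
X_res = StandardScaler().fit_transform(X_res)

# 2) Concatenate into the 1536-dim hybrid vector, then L2-normalize rows
X_fused = normalize(np.hstack([X_dense, X_res]), norm="l2")

# 3) Optional PCA retaining 98% of the variance
pca = PCA(n_components=0.98, random_state=0)
X_reduced = pca.fit_transform(X_fused)
```

Passing a float to `n_components` makes scikit-learn keep the smallest number of components whose cumulative explained variance reaches that fraction.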

Table 2 Deep feature extraction and fusion summary.

Table 2 summarizes the feature dimensions, fusion strategy, and normalization steps used to construct the final hybrid feature representation for classification.

Classification

After feature extraction, the extracted features are used to train machine learning and ensemble classification methods. The classification algorithms employed are K-Nearest Neighbor, AdaBoost, Support Vector Machine, Random Forest, and Back Propagation Network. These classifiers are trained on the obtained features to distinguish between healthy cells and distinct leukemia subtypes. An ensemble approach then combines the predictions of the individual classifiers (KNN, AdaBoost, SVM, RF, and BPN) to produce a more trustworthy and accurate prediction. Ensemble techniques usually improve predictive accuracy by maximizing the strengths of individual classifiers and minimizing their weaknesses.

SVM:

$$f(x)={\text{sign}}\left( {\sum\limits_{{i=1}}^{N} {{\alpha _i}{y_i}K\left( {{x_i},x} \right)} +b} \right)$$
(7)
  • KNN: majority vote among k = 5 nearest neighbors.

  • RF: ensemble of 100 decision trees with majority voting.

  • AdaBoost: reweighting of misclassified samples across iterations.

  • BPN: feed-forward NN with one hidden layer (64 neurons, ReLU).
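A hedged scikit-learn sketch of the five classifiers with the hyperparameters stated above (KNN k = 5, RF with 100 trees, 64-neuron ReLU hidden layer for the BPN); all other arguments are library defaults and assumptions, and the synthetic features stand in for the fused CNN embeddings:

```python
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

classifiers = {
    "SVM": SVC(kernel="rbf", probability=True, random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),          # k = 5 neighbors
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "BPN": MLPClassifier(hidden_layer_sizes=(64,), activation="relu",
                         max_iter=500, random_state=0),  # 64-neuron hidden layer
}

# Toy stand-in for the fused deep features and binary ALL/healthy labels
X, y = make_classification(n_samples=400, n_features=50, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

scores = {name: clf.fit(X_tr, y_tr).score(X_te, y_te)
          for name, clf in classifiers.items()}
```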

Ensemble method

Predictions from classifiers are combined using weighted voting:

$$y=\mathop {\arg \hbox{max} }\limits_{c} \sum\limits_{{i=1}}^{M} {{w_i} \cdot 1\left\{ {{h_i}\left( x \right)=c} \right\}}$$
(8)

where hi(x) is the prediction of classifier i, wi is its weight based on validation accuracy, and c is the class label. This balances classifier contributions and reduces bias from weaker models.
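Equation (8) can be sketched directly in NumPy; the three toy classifiers, their predictions, and the validation-accuracy weights below are invented for illustration:

```python
import numpy as np

def weighted_vote(predictions, weights):
    """Eq. (8): `predictions` is an (M, n_samples) array of class labels
    from M classifiers; `weights` holds each classifier's validation-
    accuracy weight. Each class's score is the total weight of the
    classifiers voting for it; argmax picks the winner per sample."""
    predictions = np.asarray(predictions)
    weights = np.asarray(weights, dtype=float)
    classes = np.unique(predictions)
    scores = np.stack([(weights[:, None] * (predictions == c)).sum(axis=0)
                       for c in classes])
    return classes[np.argmax(scores, axis=0)]

# Three classifiers, four samples; the two heavier voters outweigh the third
preds = [[1, 0, 1, 1],
         [1, 0, 0, 1],
         [0, 1, 0, 0]]
ensemble = weighted_vote(preds, weights=[0.92, 0.87, 0.84])
# -> [1, 0, 0, 1]
```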

Training setup and environment

  • Optimizer: Adam, learning rate = 1e− 4.

  • Batch size: 32, Epochs: 50, with early stopping (patience = 10).

  • Weight decay: 0.001.

  • Cross-validation: 5-fold stratified CV.

  • Hardware/Software: NVIDIA RTX 3090 GPU (24 GB VRAM), Intel Xeon CPU, 128 GB RAM, PyTorch 1.12, CUDA 11.6, Ubuntu 20.04.
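The early-stopping rule (patience = 10) listed above can be sketched in plain Python; the surrounding Adam training loop is omitted, and the validation-loss trace is invented for illustration:

```python
def early_stop_epoch(val_losses, patience=10):
    """Return (stop_epoch, best_epoch): stop once the validation loss
    has failed to improve for `patience` consecutive epochs."""
    best = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Loss improves for three epochs, then plateaus: training halts 10
# epochs after the last improvement, and the epoch-2 weights are kept
losses = [1.00, 0.90, 0.80] + [0.85] * 20
stop, best = early_stop_epoch(losses, patience=10)
# -> stop == 12, best == 2
```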

Results and discussion

Classification performance

The performance of the proposed technique was tested using five ML classifiers trained on features derived from DenseNet-121 and ResNet-34. To obtain a full picture of classification quality, we used metrics including accuracy, precision, recall, and F1-score. Table 3 summarizes the performance of all classifiers: the Support Vector Machine (SVM) achieved the best overall performance with 92.5% accuracy, 91.3% precision, 96.4% recall, and 93.1% F1-score. These figures show that SVM reliably captures positive leukemia cases. AdaBoost and BPN also demonstrated competitive performance, achieving accuracies of 86.9% and 87.0%, respectively, with balanced precision–recall trade-offs. KNN achieved an accuracy of 83.8% but showed slightly lower recall compared to precision, whereas RF maintained moderate performance with 82.1% accuracy and an F1-score of 88.9%.

Table 3 Performance of the proposed machine learning algorithms on the C-NMC dataset.

To further analyze classification errors, Fig. 4 illustrates the confusion matrices for individual classifiers. These visualizations provide insight into the misclassification patterns, particularly highlighting cases where healthy cells were misclassified as ALL and vice versa. The SVM matrix demonstrates the lowest misclassification rate compared to other classifiers, reinforcing its superior performance.

Fig. 4
figure 4

Confusion matrices for individual classifiers. (a) KNN. (b) AdaBoost. (c) SVM. (d) RF. (e) BPN.

Additionally, Fig. 5 presents the confusion matrix of the proposed ensemble model, which integrates the outputs of all classifiers. The ensemble method reduces false negatives compared to individual classifiers, thereby improving robustness for clinical application.

Fig. 5
figure 5

Confusion matrix of the proposed model.

ROC and AUC analysis

Receiver Operating Characteristic (ROC) curves and the associated area under the ROC curve (AUC) are widely used to quantify binary classifier discrimination ability across all classification thresholds. ROC shows the trade-off between the true positive rate (sensitivity or recall) and the false positive rate (1 − specificity). AUC summarizes the ROC into a single scalar: an AUC of 1.0 indicates perfect discrimination, 0.5 corresponds to random guessing, and values between these extremes reflect varying discrimination power. Figure 6 displays the ROC curves for the five individual classifiers (KNN, AdaBoost, SVM, RF, BPN) and for the proposed ensemble. The ensemble curve is consistently above the individual curves, indicating superior discrimination across thresholds.

Fig. 6
figure 6

ROC curves of individual classifiers.

Numerical estimates of the AUCs with 95% bootstrap confidence intervals are reported in Fig. 7 and summarized below: KNN: AUC = 0.899 (95% CI: 0.880–0.917); AdaBoost: AUC = 0.933 (95% CI: 0.917–0.948); SVM: AUC = 0.960 (95% CI: 0.947–0.971); RF: AUC = 0.898 (95% CI: 0.880–0.918); BPN: AUC = 0.926 (95% CI: 0.910–0.941); Ensemble: AUC = 0.975 (95% CI: 0.965–0.984).

Fig. 7
figure 7

AUC comparison across classifiers.

The ensemble attains the highest AUC (0.975), with a narrow 95% bootstrap CI, indicating both high discriminative power and stable performance across resampled test sets. To quantify whether the ensemble’s improvement was statistically meaningful, a bootstrap-based paired comparison between the Ensemble and SVM AUCs was performed. The resulting p-value was 0.022 (two-sided), suggesting that the ensemble’s higher AUC is statistically significant at the 5% level (bootstrap test).
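A percentile-bootstrap AUC confidence interval of the kind used above can be sketched as follows; the synthetic scores, sample sizes, and 2,000 resamples are assumptions standing in for one classifier's real test-set outputs:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y, scores, n_boot=2000, seed=0):
    """95% percentile-bootstrap CI for the AUC: resample the test set
    with replacement, recompute the AUC, take the 2.5/97.5 percentiles."""
    rng = np.random.default_rng(seed)
    n = len(y)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        if np.unique(y[idx]).size < 2:   # resample must contain both classes
            continue
        aucs.append(roc_auc_score(y[idx], scores[idx]))
    return float(np.percentile(aucs, 2.5)), float(np.percentile(aucs, 97.5))

# Synthetic scores: positives tend to score higher than negatives
rng = np.random.default_rng(6)
y_true = np.concatenate([np.ones(300), np.zeros(150)])
y_score = np.concatenate([rng.normal(0.70, 0.15, 300),
                          rng.normal(0.35, 0.15, 150)])

auc_point = roc_auc_score(y_true, y_score)
lo, hi = bootstrap_auc_ci(y_true, y_score)
```

A paired comparison between two classifiers follows the same recipe: compute both AUCs on each resample and examine the distribution of their differences.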

Although AUC quantifies global performance, an operational decision threshold must be chosen for clinical use. Using Youden’s J statistic (maximizing sensitivity + specificity − 1) we computed optimal decision thresholds from the ROC curves. The optimal thresholds for individual classifiers were approximately 0.30–0.31; the ensemble’s optimal threshold was ~ 0.299. In a clinical setting where missing a leukemia case (false negative) is more harmful than a false alarm, threshold selection should prioritize sensitivity (recall). For example, moving the threshold slightly lower than the Youden optimum will increase sensitivity further at the cost of more false positives — a clinically acceptable trade-off if it reduces missed diagnoses that delay treatment.
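Selecting the operating point via Youden's J can be sketched with scikit-learn's `roc_curve`; the synthetic probabilities and the 0.05 threshold shift are illustrative, not the study's values:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Synthetic probabilities standing in for a classifier's test outputs
rng = np.random.default_rng(3)
y_true = np.concatenate([np.ones(300), np.zeros(150)])
y_prob = np.concatenate([rng.normal(0.70, 0.15, 300),
                         rng.normal(0.35, 0.15, 150)]).clip(0, 1)

fpr, tpr, thresholds = roc_curve(y_true, y_prob)
j = tpr - fpr                    # Youden's J = sensitivity + specificity - 1
best = int(np.argmax(j))
youden_threshold = float(thresholds[best])

# Moving below the Youden optimum raises sensitivity (fewer missed ALL
# cases) at the cost of additional false positives
clinical_threshold = youden_threshold - 0.05
sensitivity_at_clinical = (y_prob[y_true == 1] >= clinical_threshold).mean()
```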

The SVM and BPN models already demonstrate strong discrimination (AUCs of 0.960 and 0.926, respectively), matching the high accuracy reported in Table 3. The ensemble’s AUC (0.975) and its superior ROC curve indicate that combining complementary classifiers (weighted voting of SVM, RF, KNN, AdaBoost, BPN) yields a measurable boost in sensitivity/specificity across thresholds. Figure 7 visually confirms this, and the bootstrap CIs indicate the improvement is robust.

Figure 8 shows precision–recall (PR) curves for the individual classifiers (KNN, AdaBoost, SVM, RF, and BPN) and the proposed ensemble model on the independent test set. The ensemble achieves the highest Average Precision (AP = 0.994), followed by SVM (AP = 0.987), indicating superior discrimination under class imbalance. Because the C-NMC dataset contains a higher proportion of ALL samples (≈ 68%), PR curves offer a more reliable assessment than ROC curves, as precision directly reflects the effect of class prevalence.

In addition to PR analysis, probability calibration was evaluated using Brier scores and isotonic regression. Calibration reduced the Brier score from 0.072 to 0.056, demonstrating improved reliability of predicted probabilities. These results highlight that the ensemble not only performs well in terms of recall and precision but also produces more trustworthy probability estimates for clinical decision-support applications.
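Brier-score evaluation around isotonic calibration can be sketched as below; the synthetic overconfident probabilities are invented, and fitting and scoring on the same data is purely for demonstration (in practice calibration is fitted on a held-out set):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import brier_score_loss

# Synthetic, systematically overconfident probabilities (illustrative only)
rng = np.random.default_rng(4)
y_true = (rng.random(2000) < 0.6).astype(int)
raw = np.clip(0.5 + 0.45 * (2 * y_true - 1) + rng.normal(0, 0.2, 2000), 0, 1)

brier_before = brier_score_loss(y_true, raw)

# Isotonic regression learns a monotone map from raw scores to
# calibrated probabilities, lowering the (training) Brier score
iso = IsotonicRegression(out_of_bounds="clip")
calibrated = iso.fit_transform(raw, y_true)
brier_after = brier_score_loss(y_true, calibrated)
```

The Brier score is the mean squared difference between predicted probabilities and the 0/1 outcomes, so lower values indicate better-calibrated probabilities.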

Fig. 8
figure 8

Precision–recall curves.

Ablation study

To evaluate the contribution of different deep learning feature extractors and to quantify the benefit of the ensemble framework, an ablation study was conducted using features from DenseNet-121 alone, ResNet-34 alone, and the proposed ensemble (DenseNet + ResNet + ML classifiers). Each configuration was evaluated using the same dataset split and identical hyperparameter settings to ensure fairness. Table 4 presents the results. DenseNet-121 achieved an accuracy of 89.7% with balanced precision and recall, while ResNet-34 performed slightly better with 90.2% accuracy. Both architectures demonstrated strong capability in capturing discriminative features, but their performance was still lower than the integrated ensemble model. The proposed ensemble framework significantly improved classification, reaching 92.5% accuracy and an F1-score of 93.1%, outperforming the individual CNN-based feature sets.

Table 4 Ablation study results comparing DenseNet-121, ResNet-34, and ensemble framework.

Figure 9 visualizes the ablation study, highlighting the performance gains obtained when combining DenseNet and ResNet features within the ensemble framework. The ensemble not only increases overall accuracy but also enhances recall, which is particularly crucial in medical diagnosis since minimizing false negatives (missed leukemia cases) is a priority.

Fig. 9
figure 9

Ablation study visualization (DenseNet vs. ResNet vs. Ensemble).

The ablation results reveal how feature-level design choices influence classification performance. DenseNet-121 and ResNet-34 individually capture different spatial hierarchies, with DenseNet focusing on dense feature reuse and ResNet leveraging residual mappings for stable deep learning. Their complementary nature explains why the fused 1536-dimensional feature vector consistently improves accuracy, recall, and F1-score compared to either model alone. Additionally, applying PCA to the fused vector helps reduce redundancy and noise, producing more compact representations that enhance classifier generalization, particularly for SVM, which showed improved calibration and stability across folds. The use of standardization and L2-normalization further contributed to stable decision boundaries, especially for SVM and KNN, by preventing feature-magnitude imbalance. These effects collectively justify the superior performance of the fused feature representation and are consistent with the trends reported in Table 2.

The ablation results confirm that while single CNN backbones are effective for feature extraction, they capture complementary feature representations. DenseNet excels in feature reuse through dense connectivity, while ResNet benefits from residual mappings that stabilize deeper learning. By fusing features and applying an ensemble of ML classifiers, the proposed framework leverages the strengths of both architectures. This synergy is reflected in the statistically significant performance gains of the ensemble compared to either network alone.

Statistical significance testing

To validate that the performance improvements of the proposed ensemble framework are not due to random variation, statistical hypothesis testing was carried out. A paired t-test and the non-parametric Wilcoxon signed-rank test were conducted, comparing the ensemble against each baseline classifier across 5-fold cross-validation splits. Table 5 summarizes the results. For both tests, p < 0.05 indicates that the ensemble outperforms the baseline significantly.

Table 5 Statistical tests comparing ensemble vs. baseline classifiers.

The results confirm that the ensemble’s improvement is statistically significant compared to all individual classifiers. Even though SVM already performed strongly, the ensemble still achieved a measurable and significant improvement (p ≈ 0.02). This provides rigorous evidence for the robustness of the proposed framework.
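The paired comparisons described above can be sketched with SciPy; the per-fold accuracies below are invented for illustration (they are not the study's actual fold-level scores):

```python
import numpy as np
from scipy.stats import ttest_rel, wilcoxon

# Hypothetical accuracies over the five stratified CV folds
ensemble_acc = np.array([0.951, 0.948, 0.955, 0.946, 0.952])
svm_acc = np.array([0.9240, 0.9215, 0.9265, 0.9205, 0.9275])

t_stat, p_t = ttest_rel(ensemble_acc, svm_acc)   # paired t-test
w_stat, p_w = wilcoxon(ensemble_acc, svm_acc)    # Wilcoxon signed-rank test
```

With only five paired folds, the Wilcoxon test has very few distinct outcomes, so its p-value is coarse; the paired t-test provides the finer-grained comparison at this sample size.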

Comparison with State-of-the-Art methods

To ensure fairness and demonstrate novelty, we compared the proposed method against recent works that also used public leukemia datasets. The comparison (Table 6) considered dataset used, classification model, reported accuracy, and AUC scores when available.

Table 6 Comparison with recent state-of-the-art methods.

Transformer-based models37,38,39 show competitive results, particularly on smaller datasets, but often require large-scale pretraining and heavy computational resources. The proposed ensemble achieves comparable or superior AUC (0.975) with a balanced trade-off between accuracy and computational efficiency. Unlike prior works, this study includes statistical validation and an ablation study, strengthening the clinical relevance of the findings.

Discussion of findings and limitations

The results clearly demonstrate that the proposed ensemble framework outperforms individual classifiers in terms of accuracy, recall, F1-score, and AUC. Among the individual methods, SVM achieved the highest accuracy (92.5%) and recall (96.4%), reflecting its strong suitability for leukemia prediction tasks. However, the ensemble approach further improved performance (AUC = 0.975), reducing false negatives and achieving statistical significance compared to all baseline classifiers (p < 0.05 in both t-test and Wilcoxon test). This is particularly important in the clinical context, where minimizing missed leukemia cases directly impacts patient survival outcomes. The ablation study confirmed that while DenseNet-121 and ResNet-34 individually capture valuable discriminative features, their integration with multiple machine learning classifiers provides complementary strengths, leading to improved robustness and reliability. By combining DenseNet and ResNet embeddings, the model captures complementary low-level texture information and high-level structural cues, resulting in a richer and more discriminative feature representation that directly contributes to improved recall, AUC, and overall robustness. ROC and AUC analyses further established the ensemble’s superiority across thresholds, making it more adaptable to different clinical operating points. The comparative analysis with recent state-of-the-art methods (2022–2025) highlights that the proposed ensemble is competitive with or superior to CNN and Transformer-based models. Importantly, this study distinguishes itself by incorporating statistical validation and ablation testing, which are often overlooked in prior works, thereby enhancing the credibility of its findings for real-world clinical application.

Limitations

Despite these promising results, several limitations must be acknowledged:

  • Dataset dependency: The model was trained and tested on the C-NMC Kaggle dataset only. Although representative, broader validation across multi-institutional datasets is needed to confirm generalizability.

  • Computational cost: Feature extraction using DenseNet-121 and ResNet-34 requires significant computational resources (GPU-enabled training). For real-time deployment in clinical or IoMT settings, lightweight or optimized versions may be necessary.

  • Limited calibration analysis: although probability calibration (Platt scaling and isotonic regression) was applied, it was assessed only on the C-NMC test set; calibration under distribution shift across institutions, which is important for clinical decision support, remains to be evaluated.

  • Explainability: Although high-performing, the ensemble lacks built-in interpretability compared to recent explainable AI approaches such as Vision Transformers.

Future directions

To address these limitations, future research will explore:

  • Federated learning across multiple hospitals for privacy-preserving, large-scale validation.

  • Model compression and pruning to reduce computational load for IoMT and point-of-care deployment.

  • Probability calibration and explainability techniques to increase clinical trust and adoption.

  • Integration with multimodal data (e.g., genomic + imaging) to further improve diagnostic accuracy.

Conclusion

This study introduced an ensemble framework for the automated detection of Acute Lymphoblastic Leukemia (ALL) that integrates pre-trained CNNs (DenseNet-121, ResNet-34) with multiple machine learning classifiers. The framework achieved 92.5% accuracy, 93.1% F1-score, and an AUC of 0.975, significantly outperforming the individual CNN baselines (89.7% and 90.2% accuracy in the ablation study). Statistical validation through paired t-tests and Wilcoxon tests confirmed the improvements as significant (p < 0.05). These findings underscore both numerical superiority and clinical value, particularly by reducing false negatives, which is critical for early diagnosis and improved survival outcomes. Beyond accuracy, the proposed method demonstrates strong scalability and potential for real-time deployment in healthcare workflows. Its design supports integration into automated decision-support systems, reducing manual interpretation errors and expediting diagnosis. Furthermore, incorporating explainable AI (XAI) can enhance transparency and clinician trust, enabling smoother adoption in medical practice.