Introduction

Diabetes is a metabolic disorder characterized by elevated blood glucose levels caused by insulin resistance or insufficient insulin production. Insulin deficiency prevents glucose uptake and raises blood glucose, which over time can damage organs, including the eyes, kidneys, nerves, heart, and blood vessels1. According to International Diabetes Federation (IDF) statistics, diabetes affected approximately 537 million adults worldwide as of 20212, a number predicted to rise to 643 million by 2030 and 783 million by 20453. Diabetes is classified into three distinct types: type 1, type 2, and gestational diabetes. Type 1 diabetes is usually diagnosed in young people and is characterized by insufficient insulin production. Type 2 diabetes typically affects adults between 45 and 60 years of age and results from metabolic disturbances that increase blood glucose levels. Gestational diabetes occurs during pregnancy due to hormonal changes that raise blood glucose levels4; it affects about 7% of pregnancies each year and poses life-threatening risks to both mother and fetus5. Type 2 diabetes is a chronic condition which, if left uncontrolled, may lead to serious and disabling complications, including cardiovascular disease (such as heart attack and stroke), kidney failure, diabetic retinopathy and blindness, peripheral neuropathy, diabetic foot ulcers, and potential limb amputation. The disease also exacerbates high blood pressure, obesity, and elevated cholesterol levels, significantly reducing patients’ quality of life6,7,8. If diabetes is not diagnosed in its early stages, it may progress to kidney failure, diabetic retinopathy, or other eye diseases. Given the increasing prevalence of diabetes, particularly type 2, the need for early and accurate diagnostic methods for better disease control is more pressing than ever.
Recent advances in artificial intelligence (AI) have produced notable success across various domains such as medical image analysis, disease detection, and classification9. One approach to early detection of diabetes is the application of machine learning and deep learning techniques10,11.

Machine learning (ML) and artificial intelligence (AI) models have high potential for developing personalized predictive systems for diabetes. Researchers have employed machine learning and data mining techniques in various diabetes-related areas12,13, including identifying diagnostic and predictive factors in diabetes onset, predicting diabetes incidence, analyzing diabetic complications, developing drugs and treatments related to diabetes, and investigating the influence of genetic and environmental factors on the onset and progression of the disease14. By analyzing vast amounts of diabetes-related data, machine learning models can transform raw data into valuable knowledge and open new pathways for prognosis, diagnosis, and more effective treatment of diabetes15,16. Deep neural networks (DNNs) can also automatically identify complex patterns in clinical and imaging data related to diabetes17. With multiple linear and nonlinear layers, these networks have strong data-processing and classification capabilities and can achieve highly accurate performance in detecting, identifying, and predicting diabetic complications such as retinopathy or neuropathy. In addition, the DNN structure allows models trained on existing data to generalize to new patients and, together with other deep learning models such as CNNs or RNNs, can be used to improve prediction accuracy and screening.

Explainable AI (XAI) enhances the transparency and trustworthiness of neural networks (NNs) and machine learning (ML) models in type 2 diabetes care by elucidating how input features influence outcomes, thereby supporting both diagnosis and treatment. For instance, DeepNetX2, a DNN-based diagnostic model, integrates XAI methods such as LIME and SHAP to make prediction mechanisms interpretable while maintaining high accuracy (up to ~97%) across multiple datasets18. Similarly, combining AutoML with XAI for diabetes risk prediction using techniques like SHAP, LIME, integrated gradients, and counterfactual analysis enables clinicians to understand and trust predictive factors such as glucose levels and BMI19.

According to the authors’ preliminary review, no study to date has used a transparent ensemble-learning approach for type 2 diabetes prediction and classification that combines three machine learning models via stacking and voting methods and employs the explainability techniques LIME and SHAP to identify important features. Previous studies have relied on neural-network-based algorithms such as the multilayer perceptron (MLP)20, DNNs21, or conventional machine learning (CML) models22. Therefore, the aim of this study is to develop a transparent and explainable hybrid model for predicting type 2 diabetes, in which three machine learning algorithms (RF, KNN, and NN) are combined using stacking and voting methods. In addition, to increase model interpretability and gain the trust of clinical experts, the explainability techniques LIME and SHAP are used to identify and analyze the features that affect disease diagnosis.

Method

Dataset

In this study, the type 2 diabetes prediction dataset was collected from the Pima Indians Diabetes database. This dataset includes 768 samples and 8 features: age, body mass index, skin thickness, blood pressure, glucose, insulin, pregnancies, and family history of diabetes. Figure 1 shows the number of samples in each of the three classes: “Normal”, “Prediabetes”, and “Diabetes”.

Fig. 1
figure 1

Number of samples across the three classes: normal, prediabetic, and diabetic.

Figure 2 illustrates the percentage of each class (“Normal”, “Prediabetes”, and “Diabetes”) across the overall dataset and the training, validation, and test splits. The distribution remains consistent, preserving the real-world imbalance for a realistic evaluation. As shown in Fig. 2, the imbalance is maintained across all data splits as follows:

  • Overall dataset: 23.57% of the samples belong to the “Normal” class, 35.68% to the “Pre-diabetes” class, and 40.76% to the “Diabetes” class.

  • Training set: 23.62% of instances are “Normal”, 35.67% are “Pre-diabetes”, and 40.72% are “Diabetes”.

  • Validation set: 23.38% of cases correspond to “Normal”, 35.06% to “Pre-diabetes”, and 41.56% to “Diabetes”.

  • Test set: 23.38% of samples are “Normal”, 36.36% are “Pre-diabetes”, and 40.26% are “Diabetes”.

Fig. 2
figure 2

Class distribution across dataset splits.

Data preprocessing

In this study, several preprocessing techniques were applied, including initial transformation, data cleaning, and missing-value handling. Initially, the dataset contained various physiological features and medical tests. Based on glucose level, a new “output” column was introduced and individuals were classified into “Normal”, “Prediabetes”, and “Diabetes” categories according to the American Diabetes Association criteria15: if fasting glucose is less than 100 mg/dL, the status is considered normal; if fasting glucose is between 100 and 125 mg/dL, it is classified as prediabetes; and if fasting glucose is 126 mg/dL or higher, it is classified as diabetes. As a precaution, low glucose levels (i.e., less than 70 mg/dL) were also classified as “Diabetes”. Table 1 lists the diagnostic standards of the American Diabetes Association (ADA).

Table 1 Diabetes diagnostic standards.

After categorizing the data into “Normal”, “Prediabetes”, and “Diabetes”, data cleaning was performed. To clean the data, implausible zero values were identified for several features, such as glucose level, blood pressure, insulin, body mass index, and age. These zero values could distort the analysis and model predictions. To address this issue, a systematic approach was applied in which these zero values were replaced with the mean of each feature. Exploratory analysis indicated that these variables (e.g., glucose, blood pressure, BMI, insulin, age) exhibited near-symmetric distributions, making mean imputation statistically appropriate. The mean imputation strategy was also selected to ensure methodological consistency with prior baseline clinical studies on the Pima dataset, which employed mean replacement for continuous attributes. Mean replacement preserves the global central tendency without distorting the scale of the distribution, an important consideration when using parametric classifiers. The mean value for each feature was computed according to Eq. (1):

$$\mu =\frac{1}{n}\sum_{i=1}^{n}{x}_{i}$$
(1)
  • μ: mean of the feature

  • xi: value of feature x for the i-th sample

  • n: total number of samples in the dataset

Next, data normalization was performed. After replacing zero values with the mean of each feature, Min–Max Scaling was used to place all variables into a common range [0, 1]. For each feature (column), the normalized value x′ was computed from the original value x according to formula (2), where:

  • x is the original feature value for each sample.

  • xmin is the minimum value in that column (after imputation, i.e., without zeros and missing values).

  • xmax is the maximum value in that column.

    $$x^{\prime } = \frac{x - x_{\min } }{x_{\max } - x_{\min } }$$
    (2)

After normalization, all values in each column were between 0 and 1. Min–Max scaling was retained to preserve normalized ranges corresponding to bounded clinical variables (e.g., glucose, BMI, blood pressure) and to maintain compatibility with algorithms sensitive to feature magnitude such as neural networks and kernel-based models. This approach was explicitly adopted to replicate the baseline model configuration used in previous studies, ensuring methodological continuity and interpretability.
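The imputation and scaling steps above can be sketched in a few lines (a minimal illustration on synthetic values; whether zeros are excluded when computing the replacement mean is not stated in the text, so excluding them is an assumption here):

```python
import pandas as pd

def impute_zeros_with_mean(df: pd.DataFrame, cols) -> pd.DataFrame:
    """Replace physiologically impossible zeros with the column mean (Eq. 1)."""
    out = df.copy()
    for c in cols:
        # Assumption: the mean is taken over the non-zero entries only
        nonzero_mean = out.loc[out[c] != 0, c].mean()
        out.loc[out[c] == 0, c] = nonzero_mean
    return out

def min_max_scale(df: pd.DataFrame) -> pd.DataFrame:
    """Scale every column to the [0, 1] range (Eq. 2)."""
    return (df - df.min()) / (df.max() - df.min())

# Toy frame with an impossible zero in 'glucose'
df = pd.DataFrame({"glucose": [0.0, 100.0, 120.0, 140.0],
                   "bmi": [22.0, 30.0, 26.0, 34.0]})
clean = impute_zeros_with_mean(df, ["glucose", "bmi"])
scaled = min_max_scale(clean)
```

In the study’s pipeline, both steps are fit on the training portion of each fold only and then applied to the validation/test folds.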

To evaluate the robustness of our preprocessing pipeline, we analyzed the effect of different imputation, scaling, and feature selection methods on model performance. Figure 3 presents the results across seven panels, showing ROC-AUC (panels A–C) and Macro-F1 (panels D–G) scores for various preprocessing strategies. Specifically, panels A and D–E illustrate performance across imputation methods (mean, median), panels B and F–G show performance across scaling methods (min–max, standard, robust), and panel C shows ROC-AUC across feature selection methods (spearman, mutual info, RFE). Overall, the results indicate that minor variations exist among different preprocessing combinations, but the overall performance remains stable, confirming the reliability and reproducibility of the selected imputation, scaling, and feature selection strategies. These results further support the feature selection choices and confirm the stability of the preprocessing pipeline used in this study.

Fig. 3
figure 3

Model performance across different imputation and scaling methods.

Feature selection

In this step, all numeric features (except the output) were considered candidates for selection using the Spearman rank correlation test, and for each feature the correlation coefficient ρ and p-value with respect to the target variable were calculated. This method was chosen because of its ability to detect nonlinear relationships, suitability for ordinal data, and simplicity of implementation compared to methods like RFE. After calculation, features were sorted by the absolute value of the correlation coefficient and only features whose p-value was less than or equal to 0.0001 were selected as the final features. The Spearman rank correlation coefficient is defined in formula (3):

$$\rho =1-\frac{6\sum_{i=1}^{n}{d}_{i}^{2}}{n\left({n}^{2}-1\right)}$$
(3)

where di is the difference between the ranks of the i-th sample on the two variables and n is the number of samples.

A strict significance threshold (p ≤ 10⁻⁴) was applied to ensure statistical rigor, and a Bonferroni correction was implemented to control for multiple comparisons. Spearman correlation was selected for its robustness to monotonic yet non-linear relationships between physiological parameters and diabetic outcomes. The retained features (glucose, blood pressure, skin thickness, insulin, BMI, and age) align closely with clinically validated indicators of glycemic status, reflecting both statistical and physiological relevance. Table 2 summarizes which features were selected by each method, highlighting the consistency between Spearman filtering and Mutual Information.

Table 2 Features selected by each feature selection method.

In the end, the output set contained that subset of parameters which statistically showed the strongest association with diabetes status, an approach that both reduces the dimensionality of the problem and removes noise, making the final model faster and more interpretable. Table 3 presents the comparative performance of different preprocessing configurations, indicating that the baseline pipeline achieved competitive results with minimal variance across alternatives.
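The Spearman filter described above can be sketched as follows (scipy-based; the data here are synthetic, and the p ≤ 10⁻⁴ threshold matches the text):

```python
import numpy as np
from scipy.stats import spearmanr

def spearman_select(X: np.ndarray, y: np.ndarray, names, p_threshold=1e-4):
    """Rank features by |rho| against the target; keep those with p <= threshold."""
    results = []
    for j, name in enumerate(names):
        rho, p = spearmanr(X[:, j], y)
        results.append((name, rho, p))
    # Sort by absolute correlation, strongest first, then apply the p-value filter
    results.sort(key=lambda t: abs(t[1]), reverse=True)
    return [name for name, rho, p in results if p <= p_threshold]

rng = np.random.default_rng(0)
n = 300
glucose = rng.normal(110, 25, n)          # synthetic glucose-like feature
noise = rng.normal(0, 1, n)               # irrelevant feature
y = (glucose > 125).astype(int)           # target strongly tied to glucose
X = np.column_stack([glucose, noise])
selected = spearman_select(X, y, ["glucose", "noise"])
```

On this toy data only the informative feature survives the filter, mirroring how weakly associated variables are discarded in the study.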

Table 3 Performance comparison of preprocessing pipelines across different imputers, scalers, and feature selectors.

Hybrid model framework

In this hybrid method, which is based on voting and stacking strategies, three base models (RF, KNN, and a NN) with distinct architectures and hyperparameters are used within a coherent, integrated framework. The data were split into training, validation, and test sets in an 80:10:10 ratio. To ensure a robust evaluation and prevent data leakage, a critical concern given the unusually high accuracies (≈98–99%) reported on the Pima dataset, a nested cross-validation (CV) framework was rigorously employed. In this framework:

  • The outer loop consisted of 5 folds, used for unbiased model evaluation.

  • The inner loop also consisted of 5 folds, used for hyperparameter optimization and stacking of the base models (Random Forest, K-Nearest Neighbors, and Deep Neural Network)23.

All preprocessing steps including imputation, scaling, and feature selection were fit exclusively on the training portion of each fold and then applied to the validation/test folds without refitting24,25. Hyperparameter tuning and stacking were performed strictly within the inner training folds. Table 4 presents the model performance across the five outer folds of the nested cross-validation. For each fold, key metrics including accuracy, precision, recall, F1-score, and ROC-AUC are reported. The consistently high values across folds indicate the robustness and stability of the stacking model.
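A condensed sketch of this nested scheme using scikit-learn (synthetic three-class data; the NN base learner and the full hyperparameter grids are omitted for brevity, so this is a structural illustration rather than the exact configuration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=42)

# Base learners feed out-of-fold probabilities to a logistic-regression meta-learner;
# scaling lives inside the pipeline so it is refit on each training fold only.
stack = make_pipeline(
    MinMaxScaler(),
    StackingClassifier(
        estimators=[("rf", RandomForestClassifier(random_state=42)),
                    ("knn", KNeighborsClassifier())],
        final_estimator=LogisticRegression(max_iter=1000),
        cv=5,  # inner folds used to generate the meta-features
    ),
)

# Inner loop: hyperparameter tuning over a toy grid; outer loop: 5-fold evaluation.
inner = GridSearchCV(stack, {"stackingclassifier__rf__n_estimators": [50, 100]}, cv=5)
outer_scores = cross_val_score(inner, X, y, cv=5)
```

Because preprocessing sits inside the pipeline, it is refit on each training fold and never sees the held-out fold, which is what prevents leakage.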

Table 4 Performance metrics of the model across different folds.

Figure 4 illustrates the complete nested cross-validation and model stacking pipeline. The data is first split into 5 outer folds for final evaluation. All preprocessing and hyperparameter tuning are performed exclusively on each outer training fold, and the final stacking model is evaluated on the completely held-out test data. This rigorous approach prevents any data leakage, ensuring the validity and reliability of the reported high-performance metrics.

Fig. 4
figure 4

Nested cross-validation pipeline with stacking for the Pima Indians diabetes dataset.

In addition to the stacking-based hybrid model (built with the StackingClassifier class), a model using a VotingClassifier was also defined, with voting weights [1, 2, 2] assigned to the [NN, KNN, RF] models, respectively. The aim of this design is to compare the performance of the two aggregation strategies, stacking and voting. Formula (4) presents the proposed architecture. In this method, the base models NN, KNN, and RF first make predictions on the training data; the outputs of these models are used as new features, and the meta-model then combines these features for the final prediction. If the outputs of the base models are probabilities, then for each sample x and each model m, the following probability vector is produced:

$${P}_{\text{meta}}\left(c \mid x\right)=\text{softmax}\left(\mathbf{W}\cdot \text{concat}\left({\mathbf{p}}_{1},{\mathbf{p}}_{2},\dots ,{\mathbf{p}}_{m}\right)+\mathbf{b}\right)$$
(4)

These vectors for each model are concatenated and given as input to the meta-model. The meta-model (LR) uses a logistic regression function to predict the final class, where W and b are the meta-model parameters learned during training.
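In code, formula (4) corresponds to concatenating the base learners’ probability vectors and fitting a logistic-regression meta-model on them. The sketch below uses only RF and KNN as bases and, for brevity, fits the meta-model on in-sample probabilities (the study’s StackingClassifier instead uses out-of-fold predictions to avoid leakage):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0,
                                          stratify=y)

# Base models produce per-class probability vectors p_1 ... p_m
bases = [RandomForestClassifier(random_state=0).fit(X_tr, y_tr),
         KNeighborsClassifier().fit(X_tr, y_tr)]

def meta_features(models, X):
    """concat(p_1, ..., p_m): one row of stacked probabilities per sample."""
    return np.hstack([m.predict_proba(X) for m in models])

# The meta-model's W and b are learned by logistic regression, whose
# softmax output realizes P_meta(c | x) in formula (4).
meta = LogisticRegression(max_iter=1000).fit(meta_features(bases, X_tr), y_tr)
pred = meta.predict(meta_features(bases, X_te))
```

With two base models and three classes, each meta-feature row has 2 × 3 = 6 entries, one concatenated probability vector per base model.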

All analyses were conducted in Python version 3.12.4. The following major packages and versions were used: numpy (1.26.4), pandas (2.2.1), matplotlib (3.8.3), seaborn (0.13.2), and scikit-learn (1.5.2).

Model performance evaluation metrics

To evaluate the performance of the different models (RF, KNN, NN, voting, and stacking classifiers), the following metrics were used. These metrics examine different aspects of model performance and help identify the best model. The formulas are given below, where:

  • TP (True positive) The number of samples correctly classified as positive.

  • TN (True negative) The number of samples correctly classified as negative.

  • FP (False positive) The number of samples incorrectly classified as positive.

  • FN (False negative) The number of samples incorrectly classified as negative.

\(\text{Accuracy }=\frac{\text{TP}+\text{TN}}{\text{TP}+\text{TN}+\text{FP}+\text{FN}}\)

\(\text{Precision }=\frac{\text{TP}}{\text{TP}+\text{FP}}\)

\(\text{Recall }=\frac{\text{TP}}{\text{TP}+\text{FN}}\)

\(\text{Specificity}=\frac{\text{TN}}{\text{TN}+\text{FP}}\)

\(\text{Macro AUROC}=\frac{1}{K}\sum\limits_{k = 1}^{K} {\text{AUROC}}_{k}\)

\(1-B=1-\frac{1}{N}\sum\limits_{i = 1}^{N} {\left( {y_{i} - p_{i} } \right)}^{2}\)

\(1-E=1-\sum\limits_{m = 1}^{M} \frac{{\left| {B_{m} } \right|}}{N}\left| {{\text{acc}}(B_{m} ) - {\text{conf}}(B_{m} )} \right|\)

Here AUROC\(_k\) is the one-vs-rest AUROC for class k of the K classes, B is the Brier score (with \(y_i\) the true label and \(p_i\) the predicted probability for sample i), and E is the expected calibration error computed over M confidence bins \(B_m\); the complements 1−B and 1−E are reported so that higher values indicate better calibration.

\(\text{F}1\text{ Score}=\frac{2\times \text{Precision }\times \text{Recall}}{\text{Precision }+\text{Recall}}\)
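The count-based metrics above reduce to a few lines of arithmetic; the numbers below come from a hypothetical binary confusion matrix, not from the study’s results:

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute the count-based metrics defined above for one class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }

# Hypothetical counts: 40 true positives, 45 true negatives, 5 FP, 10 FN
m = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

For these counts, accuracy is 85/100 = 0.85, precision 40/45, recall 40/50 = 0.8, specificity 45/50 = 0.9, and F1 the harmonic mean of precision and recall.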

Statistical evaluation of model performance

We performed a comprehensive statistical evaluation to assess whether the observed improvements from the stacking ensemble and other models were statistically meaningful. Specifically, macro-F1 and macro-AUROC metrics were computed across 200 bootstrap resamples (sampling with replacement) of the test set and repeated cross-validation (CV) folds. For each classifier, the 95% confidence interval (CI) was estimated using the standard error method. Additionally, paired t-tests and Wilcoxon signed-rank tests were conducted to compare the stacking ensemble against all strong baselines (NN, KNN, RF, and voting ensemble).
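The bootstrap portion of this evaluation can be sketched as follows (percentile intervals are shown for simplicity, whereas the study reports standard-error intervals; the labels and predictions here are simulated):

```python
import numpy as np
from sklearn.metrics import f1_score

def bootstrap_ci(y_true, y_pred, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap CI for macro-F1 over resamples of the test set."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)           # sample with replacement
        scores.append(f1_score(y_true[idx], y_pred[idx], average="macro"))
    lo, hi = np.percentile(scores, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)

rng = np.random.default_rng(1)
y_true = rng.integers(0, 3, 200)
# Simulated classifier that is correct about 80% of the time
y_pred = np.where(rng.random(200) < 0.8, y_true, (y_true + 1) % 3)
lo, hi = bootstrap_ci(y_true, y_pred)
```

The same per-resample scores can then feed paired tests (e.g. scipy’s `ttest_rel` or `wilcoxon`) when two models are scored on identical resamples.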

Meta-learner evaluation

Within the stacking ensemble, the impact of different meta-learners was evaluated by comparing LR, RF, and Gradient Boosting Decision Tree (GBDT). Each meta-learner received as input the out-of-fold predicted probabilities generated by the base learners (NN, RF, and KNN) during training. The same training, validation, and test splits were used across all meta-learners to ensure consistency.

Performance was assessed using Macro-F1 and Macro-AUROC, and 95% confidence intervals were calculated via 1000 bootstrap resamples of the test set. This analysis allowed us to identify LR as the most stable and interpretable meta-learner, preserving both classification balance (Macro-F1) and discriminative power (Macro-AUROC).

Explainable AI

To interpret the predictions of the hybrid machine learning model, both local (LIME) and global (SHAP) explainability techniques were employed. The LIME technique provides local and interpretable explanations for each prediction by creating small perturbations in the input data around each sample and approximating the complex model’s prediction with simple models. This method helps clarify and interpret the outputs of machine learning models at the individual sample level. In Formula (5), \(\hat{f}\left( x \right)\) denotes the local explanation of the model’s prediction at input x. Here, g(x′) is the interpretable model’s prediction on the perturbed sample x′, h_j(x) represents interpretable features, ω_j are the learned coefficients, and M is the number of interpretable features. In the hybrid model presented in this study, LIME is used as an interpretability technique to provide local, interpretable explanations for why the model made a particular prediction.

$$\hat{f}\left( x \right) = g\left( {x^{\prime } } \right) + \sum\limits_{j = 1}^{M} \omega_{j} \cdot h_{j} \left( x \right)$$
(5)

SHAP values provide a comprehensive framework for interpreting the output of any machine learning model and indicate the importance of each feature in the prediction process. These values are based on Shapley values from cooperative game theory and offer a principled approach to allocating the “contribution” of each input feature to the model’s prediction. In Formula (6), ϕ_i is the SHAP value for feature i, K is the number of samples, f(x_k) is the model’s prediction for sample k, E(f(x_k)) is the expected prediction for sample k, and ϕ^{(k)}_i denotes the contribution of feature i in sample k. In the hybrid model presented in this study, SHAP values are used to explain feature importance at the dataset-wide level and to determine which features have a major impact on the model’s predictions.

$$\phi_{i} = \frac{1}{K}\sum\limits_{k = 1}^{K} {\left[ {f(x_{k} ) - E(f(x_{k} ))} \right]} \cdot \phi_{i}^{(k)}$$
(6)

Global SHAP summary and class-wise plots were generated to compare feature importance across “Normal”, “Prediabetes”, and “Diabetes” classes. To reconcile potential discrepancies between LIME and SHAP explanations, results were examined jointly: LIME provided local insights for specific cases, while SHAP identified globally consistent feature contributions. Additionally, clinically meaningful interpretations were derived by defining decision thresholds and feature regions where model predictions are likely to change, and by evaluating model calibration and decision curves to assess clinical reliability and utility.

Results

As Fig. 5 shows, there are significant differences in the distribution of data across the different classes. These differences are particularly evident in the features age, glucose, blood pressure, and insulin.

Fig. 5
figure 5

Feature distribution plot.

As mentioned in the methods section, feature selection was performed using Spearman’s correlation coefficient, and only variables with p-value ≤ 0.0001 were chosen as selected features, indicating very high confidence in their true association with the outcome. The correlation matrix plot shows, in addition to the relationships between features and the outcome, the correlations among the features themselves. This method retains features with the greatest predictive power and the lowest likelihood of random association, and it helps prevent overfitting. In this study and based on the Spearman correlation criterion with significance level p-value ≤ 0.0001, the features glucose, blood pressure, body mass index (BMI), age, and insulin were selected as the variables for modeling. Figure 6 shows the list of top features.

Fig. 6
figure 6

List of top features.

Table 5 summarizes the RF hyperparameter search space: number of trees from 50 to 300; maximum tree depth from 3 to 15; minimum samples to split a node from 2 to 10; and number of random features considered at each node, ‘sqrt’, ‘log2’, or a float between 0.1 and 1.0. With these settings, the RF creates high diversity among trees through sampling with replacement and random feature selection at each node, reduces variance error, and is robust to noise.

Table 5 Summary of RF model architecture.

Table 6 summarizes the KNN hyperparameter search space: number of neighbors between 3 and 30; leaf size between 20 and 60; weighting method ‘uniform’ or ‘distance’; and distance metric ‘manhattan’ or ‘minkowski’. With these settings, KNN remains simple while capturing local patterns well and performing acceptably at nonlinear decision boundaries.

Table 6 Summary of KNN model architecture.

Table 7 summarizes the NN architecture: three consecutive hidden layers with 64, 128, and 64 neurons, ReLU activation in all hidden layers, a Dropout layer after each hidden layer (rates [0.20, 0.30, 0.20]) to prevent overfitting, and a Softmax output layer for multi-class classification (Fig. 7). This three-layer NN with ReLU activations and Dropout combines the ability to learn complex nonlinear relationships with appropriate control of overfitting. Figure 7 shows the structure of the neural network used in the modeling.
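The described network can be sketched as a plain-NumPy forward pass with random, untrained weights (a structural illustration only; in practice a deep-learning framework would be used, but the layer sizes, ReLU activations, dropout rates, and softmax output match the description above; the input width of 5 is an assumption matching the selected features):

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [5, 64, 128, 64, 3]          # 5 inputs -> hidden 64/128/64 -> 3 classes
drop_rates = [0.20, 0.30, 0.20]      # dropout after each hidden layer

# Random (untrained) weights, only to exercise the architecture
weights = [rng.normal(0, 0.1, (a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]

def forward(x, training=False):
    h = x
    for i, (W, b) in enumerate(zip(weights, biases)):
        h = h @ W + b
        if i < len(weights) - 1:                  # hidden layers: ReLU + dropout
            h = np.maximum(h, 0.0)
            if training:                          # inverted dropout at train time
                keep = 1.0 - drop_rates[i]
                h *= rng.random(h.shape) < keep
                h /= keep
    e = np.exp(h - h.max(axis=-1, keepdims=True))  # softmax output layer
    return e / e.sum(axis=-1, keepdims=True)

probs = forward(rng.random((4, 5)))                # batch of 4 samples
```

Dropout is active only during training; at inference the full network is used, and the softmax rows always sum to 1.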

Table 7 Summary of NN model architecture.
Fig. 7
figure 7

Illustration of the NNs structure used in modeling.

Stacking ensemble architecture

After each of the three base models was trained on the training data with its best hyperparameter settings, their probabilistic outputs were passed as inputs to a simple LR model, which serves as the meta-learner in the stacking structure. The stacked model was trained using scikit-learn’s StackingClassifier class with internal fivefold cross-validation. This arrangement is designed to prevent overfitting and to obtain an optimal combination of the base models’ output weights. Table 8 summarizes the stacking ensemble architecture.

Table 8 Summary of the stacking ensemble architecture.

Confusion matrix of model based on data splits

Figure 8 is presented as a confusion matrix for each model, showing the number of correct and incorrect predictions of the models. This matrix is an appropriate tool for analyzing how the models perform in classifying samples.

Fig. 8
figure 8

Confusion matrix of each model based on data splits.

Model performance evaluation

Based on the comprehensive evaluation in Table 9, the stacking ensemble model confirmed its superiority. It achieved perfect scores (100%) across all classification metrics (accuracy, precision, recall, specificity, F1 score, and AUROC) for class 0. Furthermore, it demonstrated the best calibration, with a Brier score complement (1−B) of 99.9% and a high calibration complement (1−E) of 98.7% for class 0. The RF model also delivered exceptionally strong results, particularly for classes 0 and 2, where it achieved perfect scores on several key metrics. In stark contrast, the KNN model consistently underperformed across all classes and metrics, highlighted by a low F1 score of 83.3% for class 0. The neural network and voting ensemble both provided robust and balanced performance; the neural network notably achieved 100% recall for class 0, while the voting ensemble produced strong overall results, though slightly trailing the leading stacking and RF models.

Table 9 Model performance evaluation.

Figure 9 displays the results of various model performance metrics, including accuracy, precision, recall, specificity, and F1 score.

Fig. 9
figure 9figure 9

Model evaluation metrics chart.

Macro-averaged ROC curves for all models

Figure 10 presents the macro-averaged Receiver Operating Characteristic (ROC) curves, comparing the overall discriminative ability of all models across all classes. The neural network achieves the highest AUC (99.8%), closely followed by the voting ensemble (99.6%) and the stacking ensemble (99.5%). All ensemble methods and the NN demonstrate excellent classification performance (AUC > 0.99), while KNN shows relatively lower but still strong performance (AUC = 98.9%).

Fig. 10
figure 10

Macro-averaged ROC curves for all models.

Calibration error comparison across models

Figure 11 presents the expected calibration error (ECE) for all evaluated models, with lower values indicating better reliability of predicted probabilities. The stacking ensemble demonstrates the best calibration performance (ECE: 0.87%), followed closely by the voting ensemble (1.09%) and RF (2.60%). In contrast, KNN exhibits the highest calibration error (10.36%), reflecting poor probabilistic consistency. These results highlight that the stacking ensemble not only achieves superior classification accuracy but also provides well-calibrated probability estimates, essential for risk stratification and clinical decision-making.

Fig. 11
figure 11

Calibration error comparison across models.

Effect of meta-learner choice

Figure 12 compares the performance of different meta‑learners (LR, RF, and GBDT) within the stacking ensemble framework. All three models achieved very high Macro‑AUROC values (greater than 0.99), confirming excellent discriminative ability. However, differences emerged in Macro‑F1 scores: LR and RF maintained balanced performance at approximately 98.9%, while GBDT dropped to around 92.7%, indicating lower precision‑recall balance and potential overfitting. Overall, LR emerged as the optimal meta‑learner, offering a more stable and interpretable configuration that preserves both classification balance (Macro‑F1) and discriminative power (Macro‑AUROC).

Fig. 12
figure 12

Impact of meta-learner choice (LR, RF, GBDT) on stacking model performance.

Statistical evaluation of model performance

The results of the statistical significance analysis are summarized in Fig. 13 (“Bootstrap 95% CI & p-values vs stacking model”). Darker bars represent Macro-F1, lighter bars correspond to Macro-AUROC, and the vertical error bars visualize the 95% confidence intervals. Values annotated atop each bar show the p-value of the paired comparison against stacking. All comparisons yielded p < 0.05, demonstrating statistically significant differences. In particular, the stacking ensemble achieved the highest mean performance with the narrowest confidence intervals, confirming its superiority and robustness. The voting ensemble (p = 0.026) was the closest competitor but still significantly inferior. In summary, these results provide strong statistical evidence that the stacking ensemble significantly outperforms all baseline models in both classification accuracy and discrimination ability.

Fig. 13
figure 13

Comparison of Macro-F1 and AUROC with Bootstrap CIs and p-values vs stacking.

Explainable AI

The LIME explanation identified the condition “0.36 < Glucose ≤ 0.47” as the primary reason for the “Pre-diabetes” prediction and assigned it a high weight of 0.88. However, the individual’s actual glucose value was 0.36, which did not satisfy this condition as it required a value strictly greater than 0.36 (Fig. 14).

Fig. 14
figure 14

LIME explanations for diabetes prediction model.

Figure 15 illustrates the global SHAP feature importance for the top predictive variables. The findings demonstrate that blood pressure, insulin, and glucose exerted the highest overall influence on the model’s predictions, reflecting the well-established physiological interdependence between insulin and glucose in metabolic regulation.

Fig. 15
figure 15

Global SHAP feature importance of top predictive variables.

Local LIME probabilities (0.96 probability for Pre-diabetes) are consistent with SHAP directional explanations: the sample’s glucose value (0.36 normalized) lies within the SHAP-defined “decision band” (0.35–0.47) that shifts model output toward Pre-diabetes. While LIME focuses on local feature perturbation, SHAP aggregates contributions under a game-theoretic framework, leading to smoother gradients across individuals. Both frameworks converge in their primary attribution of glucose as the decisive marker for intermediate dysregulation (Fig. 16A,B).

Fig. 16
figure 16

SHAP vs. LIME for pre-diabetes detection.

The SHAP dependency plot (Fig. 17A) identifies the transition zone between “Normal” and “Pre-diabetes” near normalized glucose values of 0.35–0.47. Within this region, SHAP values switch sharply from negative to positive, confirming the model’s internal threshold for early metabolic risk escalation. Clinically, this corresponds approximately to plasma glucose levels of 100–125 mg/dL (WHO/ADA criteria for prediabetes), indicating the model’s sound alignment with established diagnostic reference ranges.

Fig. 17
figure 17

Clinical validation and utility of a stacking model for diabetes risk stratification.
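The sign switch described above can also be located programmatically. For a model driven by a single feature, the Shapley value of that feature reduces exactly to f(x) minus the mean prediction, so the transition zone is wherever that quantity crosses zero. A toy sketch with a synthetic normalized-glucose variable and an assumed risk cutoff of 0.4 (illustrative only, not the study's data):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)
glucose = rng.uniform(0.0, 1.0, 400).reshape(-1, 1)   # normalized glucose
# Noisy labels with a true risk cutoff near 0.4 (an assumption for this demo).
y = (glucose[:, 0] + rng.normal(scale=0.1, size=400) > 0.4).astype(int)
model = GradientBoostingClassifier(random_state=0).fit(glucose, y)

# With a single feature, its Shapley value is exactly f(x) - E[f(X)],
# here measured in log-odds space.
logit = model.decision_function(glucose)
phi = logit - logit.mean()

order = np.argsort(glucose[:, 0])
g_sorted, phi_sorted = glucose[order, 0], phi[order]
# First glucose value whose attribution turns positive approximates the
# model's internal risk threshold, analogous to the transition zone above.
threshold = g_sorted[np.argmax(phi_sorted > 0)]
```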

Finally, probability calibration analysis (ECE = 1.09%, Brier = 0.87%) and decision curve evaluation (Fig. 17B) both confirm that the stacking model provides well-calibrated and clinically reliable probability outputs, improving net benefit across plausible risk thresholds compared to baseline models (RF, NN, KNN).
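Both calibration metrics reported here are simple to compute: the Brier score is the mean squared error of the predicted probabilities, and ECE bins predictions by confidence and averages the gap between confidence and observed frequency. A minimal sketch with toy labels and probabilities (the numbers below are illustrative, not the study's outputs):

```python
import numpy as np

def expected_calibration_error(y_true, p, n_bins=10):
    """Bin-weighted gap between mean confidence and observed accuracy."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (p > lo) & (p <= hi)
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - p[mask].mean())
    return ece

def brier_score(y_true, p):
    """Mean squared error of the predicted probabilities."""
    return np.mean((p - y_true) ** 2)

y = np.array([0, 0, 1, 1, 1, 0, 1, 0])
p = np.array([0.1, 0.2, 0.8, 0.7, 0.9, 0.3, 0.6, 0.4])
ece, brier = expected_calibration_error(y, p), brier_score(y, p)
```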

Discussion

In this study, prediction of type 2 diabetes was performed by combining three models (RF, KNN, and NN), and two explainable AI techniques, LIME and SHAP, were used to identify the features most important for diagnosing type 2 diabetes. As the findings showed, the stacking model achieved the best performance of all models across every evaluation metric, including accuracy, sensitivity, specificity, and F1 score. The voting model also demonstrated strong performance, with high accuracy and appreciable sensitivity, specificity, and F1 scores compared to the other models. In contrast, the KNN model performed relatively poorly, achieving lower results than the others across all evaluation metrics.

Studies have consistently demonstrated that combining multiple models through ensemble learning techniques, particularly stacking, can effectively integrate the complementary strengths of diverse algorithms, yielding better predictive performance and greater robustness than individual models26. Stacking operates by training a meta-learner that optimally combines the outputs of several base learners, enabling it to capture both linear and non-linear relationships that single models might overlook. This layered architecture balances the variance and bias trade-offs inherent in different algorithms, producing a more generalized model that performs well on unseen data. Notably, stacking has been shown to yield marked improvements in multi-class diabetes classification tasks27, highlighting its potential for complex biomedical prediction problems where patterns are often subtle and heterogeneous.
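In scikit-learn, the architecture described above, RF, KNN, and NN base learners whose out-of-fold outputs feed a meta-learner, might be sketched as follows (synthetic data; the logistic-regression meta-learner and hyperparameters are assumptions, not necessarily the study's exact setup):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=600, n_features=8, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        # KNN and the NN are scale-sensitive, so they get a scaler in-pipeline.
        ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
        ("nn", make_pipeline(StandardScaler(),
                             MLPClassifier(max_iter=1000, random_state=0))),
    ],
    final_estimator=LogisticRegression(),  # meta-learner over base outputs
    cv=5,  # out-of-fold base predictions prevent meta-learner leakage
)
stack.fit(Xtr, ytr)
acc = stack.score(Xte, yte)
```

The `cv` argument is what makes stacking safe: the meta-learner is trained only on predictions the base models made for samples they never saw.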

Similarly, ensemble strategies such as majority voting and bagging have proven capable of increasing accuracy and overall performance compared to single classifiers28. In recent research, using stacking to combine classical machine learning models (e.g., logistic regression, support vector machines) with neural networks on the PIMA dataset and local datasets achieved accuracies between 92 and 95%29. These findings confirm the superiority of combined approaches, particularly when models with distinct learning mechanisms, such as probabilistic, tree-based, and deep learning methods, are integrated to leverage their complementary perspectives on the data. The concordance of these results with the present study underscores the idea that hybrid ensemble architectures can deliver more accurate and stable predictions in early diabetes diagnosis. Such approaches not only enhance predictive performance but also improve model reliability and interpretability when coupled with tools such as SHAP or LIME, which elucidate the contribution of each feature to the final decision, an essential factor for clinical deployment.

Previous investigations have further emphasized the advantage of hybrid approaches and advanced learning architectures in diabetes diagnosis. For example, a hybrid model combining deep neural networks and random forest achieved an accuracy of 94.81%20, while more sophisticated frameworks incorporating clustering and autoencoder-based feature extraction reached accuracies as high as 99.13%21. Although these methods tend to involve more complex implementation and computational requirements, their results indicate the substantial potential of combining deep feature learning with structured ensemble modeling.

The observed improvement of stacking-based models can be attributed to their capacity to integrate the diverse feature representations learned by different base models, effectively combining the unique strengths of each. By combining the outputs of multiple learners, stacking reduces overfitting and enhances robustness, while the meta-learning layer optimally weights the contributions of individual models to improve generalization on unseen data. Furthermore, this approach captures complex nonlinear dependencies and intricate interactions among predictors that single models might overlook. Consequently, stacking offers a powerful and flexible framework for medical prediction tasks, achieving an optimal balance between accuracy, stability, and interpretability, qualities that are essential for developing trustworthy, AI-driven diagnostic systems suitable for clinical practice.

The strong performance of the voting model in our study aligns with a growing body of evidence highlighting the advantages of ensemble learning over single classifiers in diabetes prediction. Ensemble methods, particularly voting, reduce the weaknesses of individual models by aggregating their strengths, thereby achieving higher accuracy and stability. For example, one study demonstrated that while individual classifiers such as Random Forest, SVM, and Naïve Bayes showed moderate predictive power, combining the better-performing models through a voting algorithm achieved higher accuracy (82%) and more robust results across both imbalanced and balanced datasets30. Similarly, researchers emphasized that approaches relying solely on a single classifier may lack robustness compared to voting-based strategies; integrating diverse models (SVM, ANN, and Naïve Bayes) improved sensitivity and overall diagnostic performance30. This advantage was also confirmed in large-scale evaluations using NHANES data, where an ensemble voting model significantly enhanced predictive accuracy and discrimination ability (AUC = 0.75)31.

Further evidence supports that both hard and soft voting strategies enhance model reliability compared to standalone algorithms. One study showed that ensemble classifiers consistently outperformed individual models in terms of accuracy and probability estimation, strengthening diagnostic decision-making in diabetes prediction32. Another investigation in Mexico applied a hard voting ensemble of GLM, SVM, and ANN for T2DM detection and achieved an AUC of 90% ± 3%, demonstrating that voting provides a fast, effective, and clinically valuable method for early disease identification33. Similarly, a soft voting approach integrating RF, LR, and Naïve Bayes achieved the highest performance across accuracy, precision, recall, and F1 score compared to state-of-the-art classifiers on the PIMA dataset, further underscoring the effectiveness of ensemble voting34. Finally, a systematic review combining eleven well-known ML algorithms into an ensemble voting framework reported nearly 86% accuracy on the PIMA dataset, showing that ensembles consistently outperform single models even after extensive hyperparameter tuning and cross-validation34. Taken together, these studies highlight that voting-based ensemble approaches provide superior predictive power by balancing variance, reducing overfitting, and leveraging the complementary strengths of diverse classifiers. This justifies why the voting model in our work demonstrated stronger performance, with high accuracy and favorable sensitivity, specificity, and F1 score compared to the individual models.
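As a concrete illustration of soft voting of the kind cited above (e.g., the RF/LR/NB ensemble34), a scikit-learn sketch on synthetic data follows; the hyperparameters are illustrative, not drawn from any of the cited studies:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=600, n_features=8, random_state=0)

vote = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("lr", LogisticRegression(max_iter=1000)),
                ("nb", GaussianNB())],
    voting="soft",  # average predicted probabilities, then take the argmax
)
score = cross_val_score(vote, X, y, cv=5).mean()
```

Setting `voting="hard"` instead would take a majority vote over the predicted labels rather than averaging probabilities.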

The consistently inferior performance of the KNN model across all evaluation metrics (accuracy, sensitivity, specificity, and F1 score) observed in this study can be attributed to several inherent limitations of the algorithm, which are particularly exacerbated on clinical and questionnaire-based datasets such as those used for diabetes prediction. KNN is a distance-based algorithm that relies on measures such as Euclidean distance to compute similarity between data points. A fundamental weakness is its susceptibility to the “curse of dimensionality”: its performance deteriorates as the number of features increases. In high-dimensional spaces, which are common in medical datasets (with features such as age, BMI, glucose level, and blood pressure), points tend to become equidistant from one another. This makes the concept of a “nearest neighbor” increasingly meaningless and degrades the model’s predictive power35. Unlike tree-based models such as RF that perform implicit feature selection, KNN weighs all features equally. This allows irrelevant or redundant features, which are common in initial feature sets, to distort the distance calculations and introduce noise, leading to poor generalization on unseen data36,37.

Moreover, the performance of distance-based algorithms is highly contingent upon feature scaling. Unless features are normalized, those on a larger scale (e.g., insulin levels) disproportionately dominate the distance calculation compared to features on a smaller scale (e.g., number of pregnancies). Failure to apply rigorous feature scaling puts KNN at a significant disadvantage compared to models such as decision trees (DT) or Naive Bayes (NB), which are inherently scale-invariant38. Furthermore, KNN is notoriously sensitive to class imbalance, a pervasive issue in medical datasets where non-diabetic cases often vastly outnumber diabetic ones. In such scenarios, the majority class tends to overwhelm the decision boundary in the local neighborhood of any query point, leading to many false negatives and consequently low sensitivity, a critical metric in disease diagnosis39.
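The scaling effect described above is easy to demonstrate: inflating the scale of a single uninformative feature lets it dominate KNN's distance computation, and standardization restores performance. A sketch on synthetic data (the x1000 factor and feature layout are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# shuffle=False keeps informative features first; the last columns are noise.
X, y = make_classification(n_samples=500, n_features=6, n_informative=2,
                           n_redundant=2, shuffle=False, random_state=0)
X = X.copy()
X[:, -1] *= 1000   # a noise feature on a vastly larger scale (cf. raw insulin)

raw = cross_val_score(KNeighborsClassifier(), X, y, cv=5).mean()
scaled = cross_val_score(
    make_pipeline(StandardScaler(), KNeighborsClassifier()), X, y, cv=5).mean()
```

Without scaling, the Euclidean distance is effectively computed on the inflated noise column alone, so accuracy collapses toward chance; standardization makes all six features contribute comparably again.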

KNN is an instance-based (“lazy learner”) algorithm. It does not learn a discriminative model from the training data but rather memorizes the entire dataset and makes predictions based on local approximations. This lack of a generalized model makes it highly prone to overfitting, especially in the presence of noisy or outlier data points36,38. Clinical data often contains noise, measurement errors, and incomplete entries, even after preprocessing. A single anomalous sample can significantly skew the prediction for a new data point. In contrast, ensemble methods like Random Forest and Gradient Boosting aggregate many decision trees to create a robust model that is far more resilient to such noise and variance40.

Therefore, the poor performance of the KNN model is not an anomaly but an expected outcome of its fundamental algorithmic characteristics. Its well-documented limitations in handling high-dimensional data, its critical dependence on meticulous preprocessing, and its vulnerability to noise and class imbalance make it a generally suboptimal choice for complex classification tasks such as diabetes prediction, especially compared with more sophisticated ensemble methods or algorithms capable of learning complex, non-linear decision boundaries while managing variance. Its lower scores across all metrics thus align with established machine learning theory and prior research in clinical informatics.

Limitations of the study

The data used in this research were extracted from the Pima Indians Diabetes dataset, which is publicly available via Kaggle. This dataset has limitations, including its relatively small size and insufficient diversity in features and sample population, which may reduce the generalizability of the results to other populations or real-world conditions. To increase the generalizability and validity of the developed models, future studies should use larger and more diverse datasets, including multicenter clinical data drawn from different populations.

Conclusion

This study demonstrates that the Stacking Ensemble model achieves superior performance in type 2 diabetes prediction, outperforming all other models across key evaluation metrics. While Random Forest also delivered strong results, the stacking approach proved most reliable and well-calibrated across all patient classes. By integrating diverse algorithms and leveraging explainability tools, clinically relevant features such as glucose level, BMI, and blood pressure were identified as key predictive factors. This accurate and interpretable framework provides a practical decision-support tool for early diabetes detection, particularly in resource-constrained settings, and can be adapted for other chronic diseases to enhance preventive healthcare and improve clinical outcomes. Future work will focus on validating the proposed framework on larger and more diverse populations, incorporating longitudinal patient data, and extending the approach to other chronic disease domains.