Introduction

Obesity is a major global public health issue, with its prevalence reaching epidemic levels in recent times1. It is a complex condition influenced by genetic, environmental, and lifestyle factors. Clinically, obesity is classified into three categories of increasing severity based on body mass index (BMI). Obesity Type I (BMI 30–34.9 kg/m²) represents the mildest stage but already increases the risks of hypertension, dyslipidemia, and early metabolic dysfunction. Obesity Type II (BMI 35–39.9 kg/m²) is more severe, with a substantially higher likelihood of developing type 2 diabetes, cardiovascular disease, and impaired physical functioning2. Obesity Type III (BMI ≥ 40 kg/m²), also referred to as morbid obesity, denotes the most advanced stage and is strongly associated with life-threatening complications, diminished quality of life, and significantly increased mortality risk.

Differentiating among these categories has important clinical relevance, as it guides the urgency and intensity of management strategies—ranging from lifestyle counselling and pharmacological interventions in Type I to consideration of bariatric surgery and intensive treatment protocols in Types II and III. In general, accurate projections of obesity can significantly impact intervention and prevention strategies, allowing the implementation of targeted interventions and personalized healthcare. The emergence of sophisticated machine learning methodologies and the accessibility of extensive lifestyle data present a promising opportunity for developing effective obesity prediction models3. By employing advanced algorithms in machine learning and artificial intelligence, it is feasible to construct models that can precisely identify individuals at risk of obesity4. These models can analyze vast datasets, discern patterns, and subsequently generate predictions5.

Employing computational intelligence methods for obesity prediction can provide several benefits6. It facilitates early disease identification, enabling prompt intervention and the implementation of management strategies. Early identification can enhance individuals’ overall health outcomes by preventing or postponing obesity-related issues7. Furthermore, by accurately predicting obesity, medical professionals can offer high-risk individuals personalized treatment plans and preventive measures. These may involve lifestyle modifications, dietary interventions, exercise regimens, and effective medication management to control blood sugar levels.

Nevertheless, the volume of healthcare data is increasing significantly, and conventional machine learning methods have proven insufficient for effectively processing such large amounts of data for precise disease predictions8. Ensemble learning strategies provide superior performance in this regard9. One reason ensemble models have become popular in predictive modelling is that they integrate various models to compensate for each one’s shortcomings while maximizing their strengths. Compared with single classifiers, ensemble learning offers enhanced predictive accuracy, robustness, stability, and generalization, along with reduced variance and overfitting and greater model diversity10. It also improves feature importance estimation, effectively handles diverse data types, and addresses class imbalance more efficiently. These properties make ensemble methods highly applicable in healthcare-related prediction tasks, where reliability and generalizability are critical11. Ensemble learning has been used for various real-world problems and has attained considerable prominence in healthcare, owing to its efficacy in predicting, detecting, diagnosing, and prognosticating many diseases.

However, it is crucial to acknowledge that employing ensemble learning in a straightforward manner does not necessarily provide the expected benefit. Problem specifics, dataset properties, and computational resources should guide the selection of ensemble techniques and base models12. When applied correctly, however, ensemble learning can significantly improve the performance of ML models13. Although ensemble models enhance predictive accuracy, they often lack interpretability, which can obscure how predictions are generated. In clinical decision-making, where transparency and explainability are crucial, explainable AI (XAI) techniques such as SHAP and LIME play a vital role.

SHAP and LIME enhance model interpretability by identifying and quantifying the influence of certain attributes on the model’s predictions. SHAP values provide both global and local insights into feature significance, facilitating a comprehensive understanding of the factors influencing obesity projections. LIME, on the other hand, offers localized explanations, helping to interpret specific predictions on an instance-by-instance basis. By integrating these explainability techniques, we ensure statistical rigor in performance validation, making our approach more reliable and explainable for clinical adoption.

In this study, we focus on designing and assessing suitable ensemble models that accurately predict obesity using lifestyle data. This research has three main objectives:

  1. To build and refine different ensemble models that efficiently utilize lifestyle information for predicting obesity.

  2. To rigorously evaluate the effectiveness of these models using extensive assessment criteria and cross-validation techniques.

  3. To enhance model interpretability by employing SHAP and LIME to identify and analyze the significance of individual and collective features, ensuring transparency and trust in model predictions.

The key contributions of this work are as follows:

  a) Constructing diverse ensemble models using multiple algorithms and methodologies.

  b) Developing hybrid stacking and voting models by leveraging insights from ensemble model performance.

  c) Rigorously validating the proposed hybrid models through comparative analysis with existing ensemble techniques and related research.

  d) Performing comprehensive statistical evaluations to assess the predictive significance of the proposed models.

  e) Enhancing model interpretability using XAI techniques, specifically SHAP for global explanations and LIME for local, instance-based insights.

The rest of the paper is organized as follows. First, we review related work in the field. This is followed by a discussion of the research methodology, including a brief overview of the considered ensemble methods. Next, we describe the dataset and its analysis, along with the preprocessing steps undertaken. The subsequent section presents and analyzes the experimental details and results. We then conduct statistical analyses of the models’ prediction performance. An interpretability analysis of the models using SHAP and LIME is provided thereafter, highlighting the contributions of individual features. This is followed by a comparison of the proposed model with state-of-the-art approaches. A critical discussion then addresses the clinical implications of the study. Finally, we conclude with a summary of findings, achievements, limitations, and future research directions. The acronyms used in this paper are listed in Table 1.

Table 1 List of acronyms.

Related work

Machine learning techniques have increasingly been adopted for obesity prediction and classification due to their ability to handle complex, high-dimensional data related to lifestyle, demographics, and clinical factors14,15. Recent works have explored a wide spectrum of algorithms, with ensemble methods receiving particular attention owing to their robustness and predictive accuracy. However, while the literature demonstrates promising results, it is fragmented in terms of methodological approaches, interpretability, and validation rigor. Below, we synthesize key contributions, draw methodological comparisons, and highlight existing limitations that motivate our proposed approach.

Traditional ensemble methods for obesity prediction

Several studies have applied classical ensemble methods such as bagging, boosting, and RFs. For instance, Kaur et al.16 used GB, RF, SVM, and XGB on the OCPM dataset, achieving 97.79% accuracy with XGB. They further incorporated lifestyle and anthropometric attributes, providing risk factor insights and personalized meal planning. Similarly, Ferdowsy et al.17 evaluated multiple classifiers (KNN, LR, RF, MLP, GB), where LR achieved 97.09% accuracy on obesity risk levels. Maria et al.18 also found GB to be the most effective (97.08%) using the “Obesity and Lifestyle” dataset from Kaggle. Jindal et al.19 applied GLM, PLS, and LR to demographic and anthropometric features, producing an average prediction accuracy of 89.68% while emphasizing personalized assessment over static BMI-based thresholds. Khodadadi et al.20 achieved 98.18% with XGB on a collected dataset, while Bag et al.21 reported 98.79% using LR, RF, and XGB. Balbir and Hissam22 applied MLP, achieving 92% accuracy on adolescent BMI data from the UK Millennium Cohort Study (MCS). Lim et al.23 utilised national panel data from Korea, with RF achieving 74% accuracy and an AUC of 0.82, identifying child and maternal factors as significant predictors. These works confirm the reliability of boosting-based ensembles but focus primarily on accuracy metrics, often without deeper interpretability or statistical validation. While these methods demonstrate strong predictive performance, they often lack robustness checks, such as cross-cohort validation or sensitivity analyses, and most treat interpretability as secondary, which limits their direct clinical translatability.

Large-scale population studies

At the population level, Thamrin et al.24 used Indonesian health survey data (RISKESDAS) with LR achieving the best accuracy (72%) and AUC (0.798). Pang et al.25 leveraged pediatric EHR data (PBD) with XGB, achieving 66.14% accuracy, identifying demographic and physiological predictors. Jeon et al.26 employed KNHANES (Korea National Health and Nutrition Examination Survey) data to identify age- and gender-specific risk factors, finding MLP superior across most groups. These works demonstrate scalability but also expose challenges: the predictive models struggle to maintain high performance in heterogeneous populations, and interpretability often remains secondary. Despite leveraging large and heterogeneous datasets, these works reveal trade-offs between scalability and model robustness, with reduced accuracy and limited attention to interpretability across demographic subgroups.

Multiclass obesity classification

An important subset of research focuses specifically on multiclass obesity classification, which is particularly relevant to our study. Khater et al.27 used CPM data with 17 features to classify individuals into seven obesity levels, deliberately excluding weight, height, and family history. Although XGB outperformed DT and RF, accuracy was limited to 75%, illustrating the challenges of relying solely on lifestyle features. Rodríguez et al.28, using survey-based data labeled according to WHO BMI guidelines, achieved ≈ 78% accuracy with RF, with balanced precision, recall, and F1-score. Jeon et al.29 applied 3D body scans and genetic feature selection for nine lifestyle characteristics, achieving 80% accuracy with recall and precision in the mid-70s to 80s. Suresh et al.30 implemented a multiclass classification framework in a web-based application, predicting seven weight categories and associated conditions. Their models performed strongly, with RF, SVM, and DT achieving accuracies of 98.48%, 96.21%, and 96.96%, respectively, while KNN lagged behind at 78.97%. These works demonstrate the feasibility of multiclass obesity prediction but also reveal limitations, including restricted datasets, a heavy reliance on either lifestyle-only or BMI-based features, and a limited focus on interpretability or statistical robustness. Collectively, these multiclass studies demonstrate feasibility but remain constrained by narrow datasets and limited methodological rigor; they underscore the need for models that balance high predictive accuracy and consistency across multiple metrics while providing transparent explanations, particularly in multi-class contexts.

Hybrid voting and stacking-based approaches

More recent efforts have explored hybrid ensembles. Diayasa et al.31 showed that stacking with GB as a meta-learner outperformed single models (97.87%). Solomon et al.32 proposed a hybrid majority voting model integrating GB, XGB, and MLP, reaching 97.16%. Choudhuri33 advanced this line of work with a hybrid of ERT, MLP, and XGB, achieving 99.4% accuracy. Ganie et al.34 considered voting, bagging, and boosting while achieving the highest accuracy of 98.10% with XGB. These studies underscore the promise of hybrid ensembles but tend to emphasize only predictive metrics, with limited analysis of robustness, generalizability, or interpretability. Moreover, most report performance on limited datasets (e.g., OCPM or small survey-based sets), which restricts external validity. While these hybrid models outperform single learners in accuracy, they rarely address stability across datasets or clinical interpretability, making their applicability in real-world health contexts uncertain.

Toward explainable models

A smaller body of work has begun integrating explainability. Lin et al.35 applied SHAP values to interpret CatBoost predictions, demonstrating how systolic blood pressure, waist circumference, and sex contributed to outcomes. However, most other works lack interpretability, treating the models as black boxes. Thus, comprehensive frameworks combining global explanations (e.g., SHAP) and local explanations (e.g., LIME) remain scarce. Overall, while the field has begun to acknowledge interpretability, current approaches remain fragmented—either emphasizing predictive accuracy without robustness or introducing partial explainability without demonstrating generalizability.

Limitations of current literature

Synthesizing across existing studies reveals several persistent gaps. First, most works emphasize accuracy as the primary performance indicator while neglecting statistical significance testing, calibration, or robustness checks, thereby limiting confidence in reported improvements. Second, while hybrid strategies such as stacking and voting have been attempted, they are often only partially explored and seldom benchmarked systematically against boosting or bagging using multiple complementary evaluation metrics. Third, interpretability frameworks remain underdeveloped—when included, they are typically restricted to a single explanation method (e.g., SHAP), with little attention to local interpretability (e.g., LIME), stability validation, or clinical contextualization. Finally, many studies rely on relatively small, homogeneous, or survey-based datasets, which restrict generalizability and external validity, raising concerns about performance in diverse real-world populations.

Research scope and our contribution

To address these limitations, our study develops hybrid stacking and voting ensembles that systematically integrate multiple base classifiers, leveraging the complementary strengths of boosting, bagging, and tree-based methods. Unlike prior works that emphasize single ensemble learners or partial hybrids, our framework explicitly balances predictive accuracy with methodological rigor. Evaluation is extended beyond raw accuracy to include precision, recall, F1-score, MCC, and AUC, with statistical reliability established through Friedman and Holm tests, thereby ensuring robustness and reproducibility. Crucially, we embed both global interpretability (via SHAP) and local interpretability (via LIME), enabling insights that are transparent at both population and patient-specific levels, and directly addressing the black-box limitation of prior models. By applying this framework to a multi-class obesity dataset, we demonstrate not only superior predictive performance but also enhanced robustness, interpretability, and generalizability. Collectively, our work contributes a reliable, transparent, and clinically relevant methodology for obesity prediction that bridges the gap between algorithmic advances and actionable healthcare applications.

Research methodology

This section presents a comprehensive overview of the research procedures conducted and the ensemble learning methods applied during the experiment.

Research workflow

Figure 1 illustrates the procedural flow of this experimental study. The experiment can be approached in the following manner:

Data collection and manipulation

For this investigation, a publicly accessible obesity dataset from Kaggle was utilized. To enhance the dataset’s quality, exploratory data analysis was performed, addressing any erroneous data, outliers, and missing values. The dataset was normalized and standardized before being split into training and testing portions at a 70:30 ratio.

Building various ensemble models

For the extensive experiment, we initially constructed five boosting ensemble models (CB, XGB, GB, ADB, and LGBM) and five bagging ensemble models (DT, BDT, RF, ET, and BME), along with voting and stacking models built from LR, KNN, MLP, SVM, and NB. Hyperparameter tuning was conducted to optimize the models. Furthermore, the significance of the features was evaluated for potential feature elimination. The final models were assessed rigorously using several performance metrics.

Building hybrid stacking and voting models

In the final phase, to achieve even greater prediction accuracy, we developed two hybrid stacking and voting models. For the constituent models, we selected the top-performing ensemble models from both the boosting and bagging techniques, as tested in the second phase. In this stage, hyperparameter tuning and feature assessment were conducted to optimize the hybrid models. The optimal models were again thoroughly evaluated using performance metrics, as previously described. Furthermore, the hybrid voting and stacking models were compared with the top-performing models for each metric.

Analysis and interpretation

An extensive statistical analysis was conducted on hybrid voting and stacking models to evaluate the statistical significance of their predictive results. Furthermore, by employing SHAP and LIME, we sought to explore the inner workings of the models and identify the features that contributed to the final prediction (Fig. 1).

Fig. 1. Workflow diagram.

Ensemble learning methods

Ensemble learning improves prediction accuracy by integrating multiple independent models, called weak models11. Weak learners are elementary machine-learning models that perform better than random chance on a task. While not particularly accurate alone, they provide a foundation for complex models13. Each weak model is trained using distinct data subsets or various algorithms36. The final prediction is obtained through voting or averaging predictions from individual models, improving overall performance37. In this study, we experimented with major ensemble techniques to identify the most suitable one for obesity prediction based on the lifestyle dataset.

Bagging

Bagging, or bootstrap aggregating, is an ensemble learning technique that enhances model stability and accuracy by training multiple base models independently on different random samples of the training data drawn with replacement38. The predictions of these models are then combined through majority voting (for classification) or averaging (for regression) to generate a final output. Key bagging algorithms include DT, which are interpretable, nonparametric models that partition data recursively39; BDT, where multiple DTs are trained on bootstrap samples and combined to reduce variance40; RF, which improves on BDTs by also randomly selecting subsets of features at each split to lower overfitting and boost robustness41; ET, which further randomize feature selection and splits for faster training and decreased variance42; and BME, which trains multiple classifiers on random subsets and combines their predictions, enhancing variance reduction and enabling parallel training40. Collectively, these methods harness the diversity created by resampling and randomization to deliver more reliable, accurate, and generalisable predictions, particularly on noisy or high-dimensional datasets.
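
The following minimal sketch illustrates how bagging-family learners of this kind can be assembled with scikit-learn; the synthetic stand-in data, model choices, and hyperparameters are illustrative assumptions, not the exact configuration used in this study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in; in the study, X and y come from the preprocessed obesity dataset.
X, y = make_classification(n_samples=2111, n_features=16, n_informative=10,
                           n_classes=7, random_state=42)

models = {
    "DT": DecisionTreeClassifier(random_state=42),
    "BDT": BaggingClassifier(n_estimators=100, random_state=42),   # bagged decision trees (default base learner)
    "RF": RandomForestClassifier(n_estimators=200, random_state=42),
    "ET": ExtraTreesClassifier(n_estimators=200, random_state=42),
}

for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=10, scoring="accuracy").mean()
    print(f"{name}: mean CV accuracy = {acc:.3f}")
```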

Boosting

Boosting is an ensemble learning technique where base models are trained sequentially, with each iteration focusing more on misclassified samples to correct errors from previous models43. In this study, we analysed five key boosting algorithms: GB builds an ensemble by iteratively optimising a loss function with weak models added to minimise residual errors, effectively capturing complex nonlinear relationships44. XGB improves on GB by efficiently handling large, high-dimensional datasets using gradient descent, regularisation, and parallel processing to reduce overfitting45. CB specialises in processing categorical variables directly without encoding, employing tailored splitting rules and GPU acceleration for improved speed and high-dimensional data handling46. LGBM speeds up training through gradient-based one-sided sampling and excels at handling sparse and categorical features with robust performance47. AdaBoost assembles a strong classifier by dynamically adjusting the weights of weak learners, emphasising misclassified instances at each iteration, making it effective for both binary and multiclass problems while reducing the risk of overfitting48. Collectively, these boosting methods enhance predictive accuracy through sequential error correction and targeted learning from challenging samples, making them suitable for complex datasets in both classification and regression contexts.
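
A comparable sketch for the boosting family is given below (GB and AdaBoost from scikit-learn plus XGBoost; LightGBM and CatBoost follow the same pattern through their own classifier classes). The data and hyperparameters are again placeholders rather than the tuned settings reported later.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier  # third-party booster; install separately

# Synthetic stand-in for the preprocessed obesity data.
X, y = make_classification(n_samples=2111, n_features=16, n_informative=10,
                           n_classes=7, random_state=42)

boosters = {
    "GB": GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, random_state=42),
    "ADB": AdaBoostClassifier(n_estimators=200, random_state=42),
    "XGB": XGBClassifier(n_estimators=200, learning_rate=0.1, random_state=42),
}

for name, model in boosters.items():
    acc = cross_val_score(model, X, y, cv=10, scoring="accuracy").mean()
    print(f"{name}: mean CV accuracy = {acc:.3f}")
```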

Voting and stacking

Voting and stacking are popular ensemble techniques used to enhance model performance by combining multiple base models. Voting aggregates predictions by either majority vote (hard voting) or by averaging class probabilities (soft voting), providing an intuitive and straightforward way to combine different classifiers49. It can use homogeneous or heterogeneous base learners and effectively boosts overall accuracy by leveraging diverse models. Stacking, on the other hand, is a two-layer approach where a meta-learner is trained on the outputs of several base models (level-0 learners) to learn how to combine their predictions best50. This method captures complex relationships among base models and often results in improved predictive performance compared to simple voting51. We employed LR, KNN, MLP, NB, and SVM as base learners for developing the stacking and voting models. LR models the relationship between features and class probabilities52, KNN classifies based on the nearest neighbours’ labels53, MLP uses multiple interconnected layers to learn nonlinear mappings54, NB applies probabilistic reasoning under feature independence assumptions55, and SVM finds optimal hyperplanes for class separation even in non-linear spaces via kernel methods56. Both voting and stacking benefit from the diversity among these base models, which improves robustness and accuracy in classification tasks.
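
The sketch below shows how voting and stacking ensembles built from these five base learners could be composed with scikit-learn; the soft-voting setting and the logistic-regression meta-learner are illustrative assumptions rather than the exact configuration reported here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed obesity data.
X, y = make_classification(n_samples=2111, n_features=16, n_informative=10,
                           n_classes=7, random_state=42)

base_learners = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("knn", KNeighborsClassifier()),
    ("mlp", MLPClassifier(max_iter=500, random_state=42)),
    ("nb", GaussianNB()),
    ("svm", SVC(probability=True, random_state=42)),  # probability=True enables soft voting
]

voting = VotingClassifier(estimators=base_learners, voting="soft")
stacking = StackingClassifier(estimators=base_learners,
                              final_estimator=LogisticRegression(max_iter=1000), cv=5)

for name, model in [("voting", voting), ("stacking", stacking)]:
    acc = cross_val_score(model, X, y, cv=10, scoring="accuracy").mean()
    print(f"{name}: mean CV accuracy = {acc:.3f}")
```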

Dataset analysis and preprocessing

For this study, an obesity dataset was obtained from Kaggle (https://www.kaggle.com/datasets/ankurbajaj9/obesity-levels). The dataset contains obesity information on individuals from Colombia, Peru, and Mexico, aged between 14 and 61 years, with diverse eating habits and physical conditions57. The data contain 2,111 records with 17 attributes labelled with seven obesity classes, as shown in Fig. 2. Table 2 presents detailed family-wise attribute information for the dataset.

Fig. 2. Distribution of weight classes in the obesity dataset, comprising 2111 records.

Table 2 The family-wise attribute information of the obesity dataset.

Exploratory data analysis

This section examines data distributions and relationships to uncover patterns and variable interactions. Kernel Density Estimation and Correlation Coefficient Analysis are used for visualizing distributions and identifying feature dependencies, respectively.

Histograms of attributes

Figure 3 presents histograms of the attributes in the obesity dataset, illustrating the underlying distribution of demographic and lifestyle factors. Each subplot corresponds to a specific attribute. Most patients in the dataset are between the ages of 15 and 30. Additionally, the target class distribution (OB) is relatively balanced.

Fig. 3. Histogram plotting of the dataset attributes.

Correlation coefficient analysis

A matrix representing the correlation coefficients for the dataset’s independent and dependent variables is shown in Fig. 4. We utilized Cramér’s V method to evaluate the correlation among attributes within the obesity dataset. This technique, based on the chi-square statistic, provides a normalised measure of the relationship between categorical or nominal variables. The correlation values range from 0 to 1, where 0 indicates no association and 1 represents perfect association. Values near 1 suggest a stronger relationship, while those near 0 imply a weaker connection. The Cramér’s V correlation matrix is used to understand the association of predictor variables in forecasting obesity.
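
For reference, Cramér’s V can be computed from the chi-square statistic as in the hedged sketch below; the toy DataFrame and column names are hypothetical, and continuous attributes would need to be binned before applying the measure.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Unadjusted Cramér's V between two categorical series (0 = no association, 1 = perfect)."""
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table)[0]
    n = table.to_numpy().sum()
    r, k = table.shape
    return float(np.sqrt(chi2 / (n * (min(r, k) - 1))))

# Toy example with two hypothetical columns (GD = gender, OB = obesity class)
df = pd.DataFrame({"GD": ["M", "F", "F", "M", "F", "M"], "OB": [1, 2, 2, 1, 2, 0]})
print(cramers_v(df["GD"], df["OB"]))

# A full matrix can be built by applying cramers_v to every pair of columns, e.g.:
# matrix = pd.DataFrame({a: {b: cramers_v(df[a], df[b]) for b in df.columns} for a in df.columns})
```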

Figure 4 suggests that Obesity (OB) is most strongly associated with weight (WT), height (HT), and age (AG), which influence BMI. Lifestyle factors, including calorie monitoring (CC), physical activity (FA), alcohol intake (CA), and time spent on technology (TD), exhibit high correlations, indicating their importance in predicting obesity. Dietary habits, including the number of meals (NM), vegetable intake (FV), and snacking, show moderate associations, while family history (FH) has a smaller effect. Fast-food consumption, smoking (SK), gender (GD), and transportation mode show weak correlations. Results indicate that anthropometric measures are the strongest predictors, while lifestyle and diet provide valuable but secondary contributions to the risk of obesity.

Fig. 4. Cramér’s V correlation analysis for the dataset.

Data preprocessing

This section outlines the steps taken to prepare raw data for modeling by ensuring consistency, comparability, and proper representation.

Checking for missing values and outliers

The dataset contains no missing values and is already synthesized to balance the target classes (0–6). Outlier detection using the Z-score method was applied exclusively to the numeric attributes, specifically AG and NM, as no other attributes warranted this approach. Here, AG is a continuous numeric attribute, while NM is a numeric attribute scaled from 1 to 4. The Z-score approach is defined by Eq. 1, where x denotes the observed value, µ represents the sample mean, and σ represents the sample standard deviation. Figure 5 displays the IQR plots of AG and NM before and after outlier handling.

Fig. 5. Before and after outlier handling of attributes AG and NM.

$$Z=\frac{x-\mu}{\sigma}$$
(1)
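
A minimal sketch of Eq. 1 applied to the numeric attributes is given below; the toy values and the choice to cap flagged outliers at the column median are illustrative assumptions, since the exact replacement strategy is not specified here.

```python
import pandas as pd

def zscore_outliers(series: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Boolean mask of values whose absolute Z-score (Eq. 1) exceeds the threshold."""
    z = (series - series.mean()) / series.std()
    return z.abs() > threshold

# Hypothetical usage on the numeric attributes AG (age) and NM (number of meals)
df = pd.DataFrame({"AG": [18, 22, 25, 30, 61, 120], "NM": [3, 3, 2, 4, 1, 3]})
for col in ["AG", "NM"]:
    mask = zscore_outliers(df[col])
    # Illustrative handling choice: replace flagged outliers with the column median
    df.loc[mask, col] = df[col].median()
print(df)
```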

Standardization and normalization

To scale numeric features, the MinMaxScaler() function was applied. Standardization was performed by adjusting the data according to Eq. 2. Here, N, X, xi, xmin, and xmax denote the total number of data samples, the attribute, the ith value of that attribute, and the minimum and maximum of each attribute, respectively.

$$N\left(X\right)=\frac{\sum_{i=1}^{N}x_{i}-x_{min}}{x_{max}-x_{min}}$$
(2)

Normalization—a key part of feature scaling—placed the data within a pre-specified range using the min-max algorithm, as in Eq. 3. This scales each attribute to the interval [0,1], where x denotes the scaled value, and xmin and xmax are the minimum and maximum values for each attribute.

$$x_{scaled}=\frac{x-x_{min}}{x_{max}-x_{min}}$$
(3)

For categorical variables, label encoding was used to convert non-numeric features into a numeric form suitable for machine learning algorithms; the encoding details are provided in Table 3.

Table 3 Attributes that are label-encoded.
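
The scaling and encoding steps above can be reproduced roughly as in the following sketch; the toy records, column abbreviations, and category strings are hypothetical stand-ins for the actual dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Toy frame standing in for the obesity data (column abbreviations follow the paper's notation)
df = pd.DataFrame({
    "AG": [21, 34, 45, 19],
    "WT": [60.0, 95.5, 120.0, 55.0],
    "GD": ["Male", "Female", "Female", "Male"],
    "OB": ["Normal_Weight", "Obesity_Type_I", "Obesity_Type_III", "Insufficient_Weight"],
})

# Min-max scaling of numeric attributes to [0, 1] (Eq. 3)
num_cols = ["AG", "WT"]
df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])

# Label encoding of categorical attributes, including the seven-class target
for col in ["GD", "OB"]:
    df[col] = LabelEncoder().fit_transform(df[col])
print(df)
```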

Experiment and results

This section provides detailed information on the experimental procedures used to develop and evaluate various ensemble models in two phases for obesity prediction. Table 4 displays the specifics of the hardware and software resources for the experiment.

Table 4 Hardware and software used to conduct the experiment.

K-fold cross validation

K-fold cross-validation is commonly employed to minimize bias in a dataset. This method entails splitting the dataset into k “folds,” or subsets, of roughly equal size. To create the ensemble models in this experiment, the training dataset underwent the initial k-fold cross-validation. Through testing, k = 10 was determined to be the optimal number of folds for this process. The steps involved in k-fold cross-validation are illustrated in Fig. 6.
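
A brief sketch of the 10-fold procedure with scikit-learn follows; the stratified variant, the GB estimator, and the synthetic data are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the training portion of the obesity data.
X, y = make_classification(n_samples=2111, n_features=16, n_informative=10,
                           n_classes=7, random_state=42)

# 10-fold cross-validation; stratification keeps class proportions similar in every fold
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(GradientBoostingClassifier(random_state=42), X, y,
                         cv=cv, scoring="accuracy")
print("per-fold accuracy:", scores.round(3))
print(f"mean ± std: {scores.mean():.3f} ± {scores.std():.3f}")
```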

Fig. 6. K-fold cross-validation process.

Assessing feature importance

To enhance the predictive accuracy of the model, it is essential to conduct a systematic assessment of feature importance within the dataset, as irrelevant or weakly contributing attributes may adversely affect model performance. Accordingly, non-informative features should ideally be excluded during model training. In this study, we employed recursive feature elimination in conjunction with the feature significance score (F-score), a Gini-based statistical measure that evaluates the discriminative capacity of individual features across classes. The results indicated that all features made significant contributions to the prediction of obesity across the models investigated; consequently, no features were removed from the final analysis.
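
The sketch below illustrates recursive feature elimination driven by Gini-based importances; the estimator, the number of features to retain, and the synthetic data are assumptions for demonstration only (in this study, no features were ultimately removed).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in for the obesity feature matrix.
X, y = make_classification(n_samples=2111, n_features=16, n_informative=10,
                           n_classes=7, random_state=42)

# Recursive feature elimination guided by tree-based (Gini) importance scores
selector = RFE(RandomForestClassifier(n_estimators=200, random_state=42),
               n_features_to_select=10, step=1)
selector.fit(X, y)

print("kept features:", selector.support_)        # boolean mask of retained features
print("elimination ranking:", selector.ranking_)  # 1 = retained, larger = dropped earlier
# Gini importances of the retained features, from the final fitted estimator
print("importances:", selector.estimator_.feature_importances_.round(3))
```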

Hyperparameter tuning

Optimizing hyperparameters is essential, as it determines the behavior of the training algorithm and significantly influences the model’s performance. We optimized the hyperparameters using both grid search and random search techniques to obtain the best performance from the developed models. Grid search yielded better results; hence, only the grid search method was used for the final models. Table 5 presents detailed information regarding the hyperparameters for each model.

Table 5 Hyperparameters for the considered algorithms.
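
A hedged sketch of the grid-search procedure is shown below; the parameter grid is a placeholder, with the actual search spaces listed in Table 5.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=2111, n_features=16, n_informative=10,
                           n_classes=7, random_state=42)

# Illustrative grid; the study's actual search spaces differ
param_grid = {
    "n_estimators": [100, 200, 300],
    "learning_rate": [0.05, 0.1, 0.2],
    "max_depth": [3, 5, 7],
}
search = GridSearchCV(GradientBoostingClassifier(random_state=42), param_grid,
                      cv=10, scoring="accuracy", n_jobs=-1)
search.fit(X, y)
print("best parameters:", search.best_params_)
print(f"best CV accuracy: {search.best_score_:.3f}")
```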

Evaluation metrics

The performances of the prediction models for predicting different obesity levels were evaluated using several standard metrics, as described in Table 6.

Table 6 Performance evaluation metrics.

Phase I: Performance of the ensemble models

The experimental outcomes of the ensemble models that were considered are detailed in this section. From Fig. 7, we observe that among the twelve experimented models, GB achieved the highest accuracy of 91.95%, while BDT had the lowest at 77.6%. GB was also the best performer in terms of precision, recall, F1-score, MCC, and Kappa. Only in the case of AUC is GB slightly behind XGB and ET. BDT performed the worst in all the tests. It is worth noting that, except for AUC, the stacking model also performed well across all metrics. In summary, GB, XGB, RF, ET, and stacking can be adjudged the top five performing models.

Fig. 7. Performance comparison of the ensemble models.

Phase II: Building the hybrid stacking and voting models

To create a highly effective ensemble model, our objective was to identify the optimal mix of base models. Initially, we conducted experiments by developing various ensemble models using different algorithms, as described in the preceding section. We then attempted to build hybrid stacking and voting models, drawing on models from both boosting and bagging, to produce a better model. We tried different permutations and combinations, as illustrated in Fig. 8. Initially, we formed two combinations by randomly selecting eight and six models. Finally, we selected the top four overall performers (GB, XGB, RF, and ET) identified in Phase I. This third combination yielded the best results for both hybrid stacking and voting. The processes of building the hybrid stacking and voting pipelines using the final combination are illustrated in Fig. 9. For stacking, we tried three different meta-learners (SVM, NB, and LR); LR was used in the final combination. The optimal hyperparameter setups for both models are detailed in Table 7.

Fig. 8. Selection of the constituent models for the proposed hybrid stacking and voting models.

Fig. 9. Model building process of (a) hybrid stacking and (b) hybrid voting.

Table 7 Hyperparameters for the hybrid stacking and voting models.
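
The construction of the two hybrid models from the top four Phase I performers can be sketched as follows; the soft-voting setting and the placeholder hyperparameters are assumptions, with the tuned values given in Table 7.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier  # third-party booster; install separately

# Synthetic stand-in for the preprocessed obesity data.
X, y = make_classification(n_samples=2111, n_features=16, n_informative=10,
                           n_classes=7, random_state=42)

# Top four Phase I performers as constituents (hyperparameters are placeholders; see Table 7)
constituents = [
    ("gb", GradientBoostingClassifier(random_state=42)),
    ("xgb", XGBClassifier(random_state=42)),
    ("rf", RandomForestClassifier(n_estimators=200, random_state=42)),
    ("et", ExtraTreesClassifier(n_estimators=200, random_state=42)),
]

hybrid_stacking = StackingClassifier(estimators=constituents,
                                     final_estimator=LogisticRegression(max_iter=1000), cv=10)
hybrid_voting = VotingClassifier(estimators=constituents, voting="soft")

for name, model in [("hybrid stacking", hybrid_stacking), ("hybrid voting", hybrid_voting)]:
    acc = cross_val_score(model, X, y, cv=10, scoring="accuracy").mean()
    print(f"{name}: mean CV accuracy = {acc:.3f}")
```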

The confusion matrices for the hybrid stacking and voting models are shown in Fig. 10. The hybrid stacking model yields fewer misclassifications. Figure 11 shows the comparative performance of the hybrid stacking and voting models for each of the ten folds with respect to the considered evaluation metrics. The hybrid stacking model outperforms hybrid voting across all metrics at each fold.

The mean performances of all the metrics are shown in Fig. 12(a). The graph confirms the dominance of the stacking model over the hybrid voting model for all metrics. Figure 12(b) shows the performance deviations of the hybrid stacking and voting models across ten folds for each metric. It is observed that stacking has been more consistent for each metric across all folds.

Fig. 10. Confusion matrices of (a) hybrid stacking and (b) hybrid voting models.

Fig. 11. Comparing the performance of the proposed hybrid stacking and voting models for each of the ten folds: (a) accuracy, (b) precision, (c) recall, (d) F1-score, (e) MCC, and (f) kappa.

Fig. 12. The (a) mean and (b) standard deviation of all the folds for the hybrid stacking and voting models.

The AUC-ROC curves for the hybrid stacking and voting models are shown in Fig. 13. Overall, hybrid stacking performed better than hybrid voting. However, hybrid voting slightly outperforms hybrid stacking in classifying the underweight (0) class. The AUPRCs of both models are shown in Fig. 14. Here also, the hybrid stacking model has a better PR score (0.99) than the hybrid voting model (0.96).

Fig. 13. AUC-ROC curves for (a) hybrid stacking and (b) hybrid voting models.

Fig. 14. AUPRCs of (a) hybrid stacking and (b) hybrid voting models.

Performance comparison of the ensemble models

In this section, we compare the proposed hybrid stacking and voting models with the three best performers for each metric. For instance, Fig. 15(a-f) shows that GB, XGB, and RF are among the top models in terms of accuracy, precision, recall, F1-score, MCC, and Kappa in the Phase I experiment. In each case, the proposed hybrid stacking and voting models outperformed them, with the hybrid stacking model consistently remaining the best performer. In Fig. 15(g), XGB, ET, and CB showed the top AUC values and are therefore compared with the proposed models. Here, too, the proposed hybrid models had a better AUC than the others; however, no difference was observed between the hybrid stacking and voting models in terms of AUC.

Fig. 15. Comparing the performance of the proposed hybrid stacking and voting models with the top performers from the Phase I experiment: (a) accuracy, (b) precision, (c) recall, (d) F1-score, (e) MCC, (f) kappa, and (g) AUC.

Statistical analysis

To evaluate the statistical significance of the proposed models for obesity prediction, we applied the nonparametric Friedman’s aligned ranks test58 across each performance metric. This was followed by post hoc pairwise comparisons using the Holm correction method59, with a significance level set to 0.05. The analysis was conducted on key evaluation metrics, including accuracy, precision, recall, F1-score, MCC, Kappa, and AUC. All statistical testing was carried out using the STAC (Statistical Tests for Algorithms Comparison) web-based platform (https://tec.citius.usc.es/stac/index.html).

Friedman’s aligned ranks test

To determine whether observed performance differences between the proposed hybrid stacking and voting models and three other top-performing models (per metric) were statistically significant, we employed Friedman’s aligned ranks test. This nonparametric test is specifically designed for comparing multiple algorithms evaluated on the same task(s) and thereby accommodates the repeated-measures structure inherent in algorithm comparisons. Unlike ANOVA-based procedures, it does not assume normality or homogeneity of variances—assumptions seldom met by classifier performance data—and the aligned-ranks variant increases sensitivity by removing block effects before ranking, yielding a fair, distribution-free comparison when models perform closely.
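
For illustration, the sketch below runs the classical Friedman test on hypothetical per-fold scores using SciPy; the aligned-ranks variant reported here was computed on the STAC platform, and the numbers shown are synthetic.

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Hypothetical per-fold accuracies for five models (10 folds each); values are synthetic.
rng = np.random.default_rng(0)
scores = {
    "hybrid_stacking": rng.normal(0.969, 0.005, 10),
    "hybrid_voting":   rng.normal(0.955, 0.006, 10),
    "GB":              rng.normal(0.920, 0.008, 10),
    "XGB":             rng.normal(0.915, 0.008, 10),
    "RF":              rng.normal(0.905, 0.009, 10),
}

stat, p = friedmanchisquare(*scores.values())
print(f"Friedman statistic = {stat:.3f}, p-value = {p:.5f}")
# p >= 0.05 -> retain H0 (no detectable performance difference among the compared models)
```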

The test was applied separately to each performance metric (accuracy, precision, recall, F1-score, MCC, Kappa, and AUC). Table 8 reports the Friedman’s statistics and p-values together with the decision on H₀​ and the corresponding average ranks for each model. For accuracy, precision, recall, F1-score, MCC, and Kappa, the Friedman’s statistic was 4.00000 (p = 0.40601); for AUC, it was 3.80000 (p = 0.43370). In all cases, the p-value exceeded the 0.05 threshold, so H₀​ (no performance differences among models) was retained for each metric. Crucially, non-rejection of H₀​ should not be read as evidence of model equivalence; rather, under the present experimental conditions, the observed gaps were not large enough to achieve statistical significance. Given the close clustering of modern ensemble methods and the characteristics of the OCPM dataset, modest effect sizes are expected.

Interpreting the rank structure adds practical context. The hybrid stacking model consistently attains the highest rank (5) across accuracy, precision, recall, F1-score, MCC, and Kappa, with hybrid voting close behind at rank 4. Traditional ensembles (RF, XGB, GB) occasionally lead on individual metrics but lack cross-metric stability, indicating potential trade-offs. This pattern suggests that, even without statistically significant separation, the hybrid ensembles—especially hybrid stacking—are more uniformly reliable across criteria that matter jointly in deployment (e.g., maintaining balance between sensitivity, precision, and agreement measures such as MCC/Kappa). For AUC, CB and ET hold ranks 1–2, with XGB at 3 and hybrid stacking/voting tied at 4.5, indicating that while CB/ET offer slightly stronger discrimination, hybrid stacking retains a competitive ranking and strong overall classification power.

To further quantify the degree of agreement among rankings, Kendall’s coefficient of concordance (W) was calculated. Across six primary performance metrics (accuracy, precision, recall, F1-score, MCC, and Kappa), W = 0.972, reflecting near-perfect agreement among models. When AUC was included, W decreased to 0.703, indicating substantial but weaker concordance due to divergence in discrimination ability across models. These values suggest that, although the Friedman’s test did not reveal statistically significant differences, the consistently high concordance supports the practical reliability of the hybrid stacking model across most evaluation criteria.

Table 8 Friedman’s aligned ranks test of the experimented models.

Post hoc analysis

To further investigate pairwise performance differences, a post hoc analysis was conducted using the Holm step-down procedure. This method was selected because, when conducting multiple hypothesis tests simultaneously, the risk of false positives (Type I errors) increases. Traditional Bonferroni correction is overly conservative and often reduces statistical power. By contrast, Holm’s method provides a balance between controlling the family-wise error rate and maintaining sufficient sensitivity to detect genuine differences. This makes Holm particularly suitable in comparative algorithm studies, where many pairwise comparisons are required across multiple metrics.
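
A minimal sketch of the Holm step-down correction, assuming hypothetical unadjusted p-values, is shown below using statsmodels.

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical unadjusted p-values from pairwise comparisons of hybrid stacking vs. baselines
raw_p = {"vs RF": 0.12, "vs XGB": 0.18, "vs GB": 0.22,
         "vs CB": 0.40, "vs ET": 0.45, "vs hybrid voting": 0.95}

reject, p_adj, _, _ = multipletests(list(raw_p.values()), alpha=0.05, method="holm")
for name, p, rej in zip(raw_p, p_adj, reject):
    print(f"{name}: Holm-adjusted p = {p:.3f}, reject H0 = {rej}")
```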

The analysis compared the proposed hybrid stacking model against other strong baselines—RF, XGB, GB, CB, ET, and the hybrid voting model—across all seven performance metrics. Since Hybrid stacking consistently ranked above hybrid voting, their direct comparison was also emphasized. The results (Table 9) indicate that, for all metrics, the adjusted p-values remained above the 0.05 threshold, leading to acceptance of the null hypothesis (H₀). This means that, under current data conditions, the superior ranking of Hybrid stacking over RF, XGB, and GB, as well as its competitiveness with CB and ET, cannot be deemed statistically significant.

However, the lack of significance does not negate the practical value of the findings. Across accuracy, precision, recall, F1-score, MCC, and Kappa, Hybrid stacking consistently secured higher rankings than RF, XGB, and GB, while only narrowly trailing CB and ET in AUC. The adjusted p-values—many of which are close to 1.0 when comparing hybrid stacking and voting—reflect extremely small performance differences between these two hybrids, indicating that both are highly robust and balanced models. From a practical perspective, this stability is critical: in healthcare applications, consistent superiority across multiple metrics is often more meaningful than achieving statistical separation, especially when differences between top models are inherently small.

Interpreting non-significant results more critically, one can argue that the limited sample size of the obesity dataset and the maturity of modern ensemble methods contribute to the inability to detect significant differences. When algorithms are all highly optimized, observed performance gaps are subtle and may not cross the statistical threshold, even though they carry practical consequences in real-world applications. Therefore, the Holm correction results reinforce that while Hybrid stacking’s advantage is not statistically confirmed, its ranking stability across metrics provides strong evidence of reliability and generalization.

To complement the Holm correction analysis, effect sizes were computed using Cliff’s Delta (δ) for the pairwise comparisons between the hybrid stacking model and alternative baselines. Across accuracy, precision, recall, F1-score, MCC, and Kappa, δ values consistently favored the Hybrid stacking model against RF, XGB, and GB, with effect sizes ranging from small to medium. Comparisons between hybrid stacking and voting yielded δ values close to zero, confirming the negligible differences already suggested by the adjusted p-values. For AUC, δ values indicated only marginal differences between Hybrid stacking and CB/ET, consistent with the near-tied rankings. These results highlight that, although Holm-corrected p-values did not indicate statistical significance, the effect size analysis demonstrates that Hybrid stacking provides practical and measurable performance advantages over traditional ensembles.
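
Cliff’s delta can be computed directly from two per-fold score vectors, as in the following sketch with hypothetical values.

```python
import numpy as np

def cliffs_delta(a, b) -> float:
    """Cliff's delta effect size between two score samples (range -1 to 1)."""
    a, b = np.asarray(a), np.asarray(b)
    greater = (a[:, None] > b[None, :]).sum()
    less = (a[:, None] < b[None, :]).sum()
    return (greater - less) / (len(a) * len(b))

# Hypothetical per-fold accuracies for two models
stacking = [0.968, 0.971, 0.965, 0.970, 0.969, 0.972, 0.966, 0.970, 0.968, 0.971]
rf       = [0.905, 0.912, 0.898, 0.909, 0.903, 0.915, 0.901, 0.907, 0.910, 0.904]
print(f"Cliff's delta (stacking vs RF) = {cliffs_delta(stacking, rf):.3f}")
```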

Table 9 Post hoc test of the proposed hybrid stacking model with the other top four models for each metric.

Model interpretation

Building on the insights from the previous section, it is crucial to explore the impact of clinical and demographic factors on the predictive performance of ensemble learning models in assessing obesity risk. This section reviews the hybrid stacking and voting models by examining their learning curves and utilising XAI techniques. These approaches not only demonstrate the models’ performance behavior but also clarify the contribution of individual features, enhancing our understanding of how specific predictors influence the overall prediction outcomes.

Using learning curves

The learning curves illustrate how the model’s performance (measured by its score) changes on the training and cross-validation datasets as the number of training samples increases. These curves help visualize how the model improves with additional data or iterations, providing insights into whether it is impacted by overfitting or underfitting. The progression of the training and validation scores offers a clear view of the model’s learning behavior and the reliability of its generalization over time.
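
For illustration, such curves can be generated with scikit-learn as in the brief sketch below; the synthetic data and substitute estimator stand in for the fitted hybrid models.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the obesity data and hybrid model.
X, y = make_classification(n_samples=2111, n_features=16, n_informative=10,
                           n_classes=7, random_state=42)

sizes, train_scores, val_scores = learning_curve(
    GradientBoostingClassifier(random_state=42), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=10, scoring="accuracy")

# Training vs. cross-validation score at each training-set size
for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"{n:4d} samples: train = {tr:.3f}, validation = {va:.3f}")
```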

Figure 16 establishes the reliability of the hybrid models’ learning patterns. The learning curve illustrates how the model performs with additional data or iterations, aiding in the determination of whether it is overfitting or underfitting. From the figure, it can be observed that the validation curves for both models exhibit smooth (almost linear) trajectories, indicating the absence of overfitting and underfitting.

Fig. 16. Learning curves of (a) hybrid stacking and (b) hybrid voting models.

Using XAI

XAI comprises techniques that enhance the transparency and interpretability of AI models, ensuring their decisions are understandable to human experts60. It supports both global and local explanations, which are essential for making models trustworthy and practical in healthcare. Global explanations identify key factors influencing disease outcomes, aiding clinicians and researchers, while local explanations clarify how these factors affect individual patients, bridging research and clinical practice.

To enhance the interpretability of the proposed hybrid stacking and voting models for obesity prediction, this study utilizes the SHAP method. Based on Shapley values from cooperative game theory, SHAP provides a consistent framework for quantifying each feature’s contribution to individual predictions, making it a widely adopted tool for explaining complex machine learning models61. Additionally, we used LIME for local or instantaneous feature interpretation62, allowing for quick, case-specific insights into model predictions, which is particularly useful for real-time decision-making.

Global explanation

Global explanations offer a comprehensive understanding of an AI model’s behaviour across an entire patient population by identifying key features—such as age, genetic markers, and lab results—that influence predictions. This comprehensive analysis ensures alignment with medical knowledge, validates model decisions, and identifies inconsistencies that necessitate refinement.

Beyond validation, global explanations help identify biases, ensure fairness across demographic groups, and support compliance with ethical and regulatory standards, such as the GDPR, HIPAA, and FDA guidelines. This transparency fosters trust in AI-driven medical decision-making.

In this study, mean absolute SHAP feature importance was employed to rank features based on their overall impact on predictions. By focusing on absolute values, this method highlights the strength of each feature’s influence, facilitating clearer comparisons and enhancing model interpretability.
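
The sketch below outlines how mean absolute SHAP importances can be derived with a model-agnostic KernelExplainer; the synthetic data and substitute classifier stand in for the fitted hybrid models, and the small background and sample sizes are chosen purely for tractability.

```python
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in; the same pattern applies to a stacking or voting pipeline via predict_proba.
X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           n_classes=3, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

background = shap.sample(X, 50)                      # small background set keeps runtime manageable
explainer = shap.KernelExplainer(model.predict_proba, background)
sv = np.asarray(explainer.shap_values(X[:20], nsamples=200))  # per-class SHAP values

# Global importance: mean |SHAP| per feature, averaged over samples and classes
feat_axis = list(sv.shape).index(X.shape[1])         # locate the feature dimension
mean_abs = np.abs(sv).mean(axis=tuple(ax for ax in range(sv.ndim) if ax != feat_axis))
for idx in np.argsort(mean_abs)[::-1]:
    print(f"feature {idx}: mean |SHAP| = {mean_abs[idx]:.4f}")
```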

Figure 17 highlights the significance of various features in predicting obesity using hybrid models. The features are ranked in descending order according to their mean absolute SHAP values, which reflect the overall impact of each feature on the model’s predictions, regardless of whether the influence is positive or negative. The x-axis represents the mean SHAP value, indicating the magnitude of a feature’s contribution. While both models demonstrate that all features are involved in the prediction process, the relative importance of these features differs significantly between the two hybrid approaches.

Both models suggest that the feature WT (weight) is by far the most influential predictor of obesity risk, exhibiting significantly higher SHAP values than other features. On the other hand, SK (smoking) and CC (calorie monitoring) have minimal influence on the prediction. Other features, such as GD (gender), FH (family history), HT (height), and AG (age), also contribute meaningfully.

Fig. 17. Absolute mean SHAP for (a) hybrid stacking and (b) hybrid voting.

Local explanation

Local explanations offer critical insights into individual model predictions, particularly in healthcare, where decisions must be tailored to patient-specific characteristics. By identifying influential factors such as biomarkers or medical history, these explanations enhance transparency, support personalised treatment planning, and foster trust between clinicians and patients. Additionally, they assist in detecting and correcting potential errors by revealing the key features behind misclassifications.

In this study, LIME plots were utilised to interpret predictions from the Hybrid stacking model for obesity prediction. LIME generates localised explanations by approximating model behaviour around specific instances, making it efficient for real-time applications. Compared to SHAP, which provides precise but computationally intensive explanations using Shapley values, LIME offers quicker, more flexible insights suitable for exploratory analysis. The complementary use of LIME and SHAP ensures a balance between interpretability depth and computational efficiency, enabling informed, patient-centric clinical decision-making.
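
A hedged sketch of the LIME workflow for a single record follows; the synthetic data, substitute classifier, and generic feature names are stand-ins for the actual obesity attributes and fitted hybrid models.

```python
import numpy as np
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in; any classifier exposing predict_proba can be explained the same way.
X, y = make_classification(n_samples=500, n_features=8, n_informative=5,
                           n_classes=3, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

feature_names = [f"feat_{i}" for i in range(X.shape[1])]   # e.g. WT, HT, AG, FV, FA, ...
explainer = LimeTabularExplainer(X, feature_names=feature_names,
                                 class_names=[str(c) for c in np.unique(y)],
                                 mode="classification")

# Local explanation for a single individual (here, the fourth record)
exp = explainer.explain_instance(X[3], model.predict_proba, num_features=8, top_labels=1)
predicted_class = exp.top_labels[0]
print(exp.as_list(label=predicted_class))   # (feature rule, signed contribution) pairs
```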

The LIME outputs of hybrid stacking and voting models are shown in Fig. 18. It provides insights into feature contributions, prediction probabilities, and actual feature values for a specific individual (the fourth patient in our case). The visualization demonstrates the influence of individual features on the model’s decision-making process. On the left, the plot displays the predicted probabilities for various obesity categories. The middle section presents the contribution of each feature to the prediction, while the right side lists the parameters along with their corresponding values.

Figure 18(a) shows that the model predicts class 6 (overweight level II) with high confidence and a probability of 67%. The next closest prediction is class 1 (normal weight), with a probability of 21%, which could indicate a related obesity class. Meanwhile, class 5 (overweight level I) and class 2 (obesity type I) have a probability of 7% and 4%, respectively, while the other classes (underweight, obesity type II, and obesity type III) have the lowest prediction probability of 1%. For example, feature weight (WT ≤ 107.43) influences whether the prediction falls under class 1 (normal weight) or not. Also, fruit and vegetable consumption (FV ≤ 2.00) and gender (GD ≤ 1.00) impact the classification towards obesity. Age (AG ≥ 22.78) also contributes to the classification.

The hybrid voting model, too, exhibits somewhat similar feature behaviour. As shown in Fig. 18(b), the model predicts class 6 (overweight level II) with the highest confidence score of 42%, indicating that this individual is most likely in this obesity category. Also, class 1 (normal weight) is the second most probable classification, with a probability of 25%, followed by class 5 (overweight level I) at 18%, class 2 (obesity type I) at 11%, and other classes at 4%. The feature weight (WT ≤ 107.43) plays a major role in distinguishing between obesity categories, whereas less fruit and vegetable consumption (FV ≤ 2.00) is linked with obesity. Similarly, AG ≥ 22.78 indicates that older individuals are more likely to be classified under obesity categories. Moreover, FA ≤ 0.32 indicates that a lack of exercise may contribute to obesity.

Fig. 18. LIME analysis of the (a) hybrid stacking and (b) hybrid voting models.

Comparing the proposed model with the state-of-the-art

The effectiveness of the proposed model was assessed through a comparative analysis with similar studies, utilising a range of performance metrics, as presented in Table 10. For this comparison, only studies that employed at least one ensemble learning approach for obesity prediction were included. We used our proposed hybrid stacking model for this comparison, as it consistently outperformed the proposed hybrid voting model in our experiment.

The proposed hybrid stacking model demonstrates competitive and well-rounded performance compared to state-of-the-art methods reported in the literature. Unlike earlier works that primarily focused on individual ensemble learners (e.g., XGB, GB, RF, or MLP), this work integrates a diverse set of base classifiers (LR, KNN, MLP, SVM, and NB) within a hybrid stacking framework, supported by multiple ensemble learners (CB, XGB, GB, ADB, LGBM, RF, ET, BME, DT). This diversity ensures a balance between bias and variance reduction, enhancing robustness across metrics.

The proposed hybrid stacking approach achieved 96.88% accuracy, which is comparable to the highest values reported in the literature (e.g., 99.4% by Choudhuri33 and 98.79% by Bag et al.21). However, unlike these models that reported only accuracy, this work offers a comprehensive evaluation across precision (97.01%), recall (96.88%), F1-score (96.87%), MCC (96.38%), and AUC (99.42%), showcasing its balanced performance. Many prior studies either omitted key metrics (e.g., 18,32) or achieved imbalanced trade-offs between accuracy and other indicators. In contrast, the proposed model demonstrates consistency across all evaluation dimensions, an essential requirement for reliable obesity prediction.

A major advancement of this work lies in integrating statistical validation and XAI, using SHAP and LIME. Most prior studies (e.g., 16,22,28,31,33) lacked interpretability, limiting clinical applicability. Even studies that included XAI (e.g., 35) did not combine it with rigorous statistical validation. The dual emphasis on interpretability and statistical robustness sets this work apart, ensuring both scientific validity and practical usability.

The model is validated on the widely used OCPM dataset with seven obesity classes, making it more challenging than binary-class setups in other works (e.g., 22,24,25,26,35). Achieving near state-of-the-art accuracy in such a complex multi-class setting underscores the generalization capability of the proposed model.

The comparison results validate the efficacy and competitiveness of the hybrid stacking model in accurately classifying and predicting varying levels of obesity. The improved performance of our model using hybrid stacking may be attributed to the selection of constituent models, the choice of meta-learner, effective cross-validation, and meticulous hyperparameter tuning.

Table 10 Comparison of the proposed work with recent literature.

Discussion, clinical relevance, and practical applications

The experimental results presented in the previous sections confirm the efficacy of ensemble learning. Specifically, our designed hybrid stacking and voting models achieved superior results compared to the individual ensemble and machine learning models. Hybrid stacking outperformed hybrid voting across all metrics and folds, showcasing its ability to effectively balance bias and variance. This aligns with the theoretical advantage of hybrid stacking, which utilises diverse base models to create a more generalised and accurate meta-model. In general, hybrid stacking and voting excel at harnessing the strengths of diverse models, while bagging and boosting focus on enhancing individual weak learners. Boosting aims to reduce bias, while bagging and hybrid voting primarily minimise variance. Hybrid stacking can address both aspects, depending on the chosen base models and the meta-model. Therefore, as anticipated, hybrid stacking demonstrated superior results compared to hybrid voting in predicting various obesity levels.

The learning curves for the hybrid models indicated smooth trajectories for both training and validation, suggesting that neither overfitting nor underfitting was a significant issue. This serves as a positive indicator of the models’ reliability and generalisability, particularly given the complexity of the hybrid stacking approach. The consistent performance across folds further reinforces the robustness of the hybrid ensemble models.

The statistical analysis using the Friedman’s aligned ranks test and the post hoc Holm method revealed that while the hybrid stacking and voting models consistently outperformed traditional ensemble models such as RF, GB, and XGB, the differences were not statistically significant. This suggests that although hybrid stacking and voting offer practical advantages in terms of balanced performance across multiple metrics, their superiority over established models like GB and XGB may not be conclusive. This finding emphasises the importance of context in model selection: while hybrid stacking may be optimal for achieving high accuracy and generalisability, simpler models such as GB or XGB may suffice in situations where computational efficiency and interpretability are crucial.

The study’s use of SHAP and LIME for interpretability represents a significant strength. SHAP analysis demonstrated that weight is the most influential feature in predicting obesity, aligning with clinical and epidemiological understanding. Other features, such as height, age, and gender, also contributed meaningfully, whereas smoking and caloric consciousness had minimal impact. This finding underscores the role of lifestyle factors like diet and physical activity in obesity, while also emphasising the limited influence of certain behavioural factors, such as smoking. LIME’s local interpretability provided deeper insights into individual predictions. For instance, the hybrid stacking model predicted overweight level II with 67% confidence for a particular individual, with weight and fruit/vegetable consumption as key contributors. This level of detailed interpretability is vital for healthcare applications, where understanding the rationale behind predictions can guide personalised interventions.
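The sketch below illustrates how such global (SHAP) and local (LIME) explanations can be generated. The feature names, model, and synthetic data are placeholders and do not correspond to the OCPM dataset or the fitted hybrid ensembles.

```python
# Illustrative interpretability workflow with SHAP (global) and LIME (local);
# data, model, and feature names are placeholders, not the study's artefacts.
import shap
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, n_clusters_per_class=1, random_state=0)
feature_names = ["weight", "height", "age", "veg_intake", "activity", "smoking"]
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Global view: SHAP values rank feature influence across the whole sample.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
# shap.summary_plot(shap_values, X, feature_names=feature_names)

# Local view: LIME explains the prediction for a single individual.
lime_explainer = LimeTabularExplainer(X, feature_names=feature_names,
                                      mode="classification")
explanation = lime_explainer.explain_instance(X[0], model.predict_proba,
                                              num_features=4)
print(explanation.as_list())
```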

The slight performance difference between the hybrid ensemble models indicates that hybrid stacking may provide a marginally superior optimisation of decision boundaries in comparison to hybrid voting-based aggregation. Nonetheless, both models consistently surpass other conventional ensemble techniques, validating their use for complex predictive tasks.

The study’s findings hold significant implications for the design of ensemble models in healthcare and other fields. While hybrid stacking delivers the best overall performance, its higher computational cost and complexity may restrict its practicality in real-time or resource-limited applications. In such instances, hybrid voting or even individual models like GB or XGB might be more appropriate, particularly if the trade-off in accuracy is deemed acceptable. If interpretability and simplicity are priorities, hybrid voting could be a more fitting choice, accepting a potential trade-off in accuracy. Furthermore, the interpretability offered by SHAP and LIME enhances the practical utility of these models, rendering them valuable tools for decision-making in clinical and public health settings.

Among the reported evaluation measures, recall (sensitivity) carries particular clinical significance, as it directly reflects the model’s ability to correctly identify individuals at risk of obesity and thereby minimize false negatives. In a healthcare context, failing to detect at-risk individuals can delay early interventions and increase the likelihood of progression to severe obesity and associated comorbidities. The high recall values achieved by both the stacking (89.85%) and voting (80.71%) models therefore indicate their reliability in capturing at-risk cases, which is essential for preventive counselling and timely management. Precision and the F1-score complement recall, ensuring that predictions are not only sensitive but also balanced, reducing the chances of over-alerting or unnecessary interventions. The strong AUC performance further demonstrates the robustness of these models in distinguishing between risk categories across varying thresholds, reinforcing their suitability for deployment in clinical decision support. Taken together, these clinically relevant metrics underline the potential of the proposed models to support early detection, guide targeted interventions, and improve patient outcomes in obesity prevention and management.
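These clinically relevant metrics can be computed as sketched below, assuming weighted averaging over the obesity classes (the averaging scheme used in this study may differ); y_true, y_pred, and y_proba denote hypothetical held-out labels, predicted labels, and class-probability estimates.

```python
# Sketch of the clinically relevant metric computation for a multi-class
# obesity classifier; weighted averaging is an assumption for illustration.
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             matthews_corrcoef, roc_auc_score)

def clinical_report(y_true, y_pred, y_proba):
    """Return recall, precision, F1, MCC, and multi-class AUC in one dict."""
    return {
        "recall":    recall_score(y_true, y_pred, average="weighted"),
        "precision": precision_score(y_true, y_pred, average="weighted"),
        "f1":        f1_score(y_true, y_pred, average="weighted"),
        "mcc":       matthews_corrcoef(y_true, y_pred),
        "auc":       roc_auc_score(y_true, y_proba, multi_class="ovr",
                                   average="weighted"),
    }

# Example usage (hypothetical fitted model and held-out split):
# report = clinical_report(y_test, model.predict(X_test),
#                          model.predict_proba(X_test))
```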

The SHAP and LIME analyses of our hybrid models not only enhance interpretability but also highlight practical pathways for clinical and public health interventions in obesity prevention and management. SHAP results identified weight as the most influential predictor, reaffirming its central role in clinical assessments, while lifestyle factors such as fruit and vegetable consumption, physical activity, and age emerged as key modifiable determinants. Conversely, features like smoking and calorie consciousness were found to have minimal influence, suggesting that interventions may yield greater impact when prioritizing dietary and activity-related behaviors. From a clinical standpoint, these findings support targeted counseling, such as encouraging increased fruit and vegetable intake, structured exercise regimens, and weight management strategies tailored to the patient’s demographic and physiological profile. LIME complements this by offering patient-specific explanations—demonstrating, for example, how a particular individual’s low physical activity or limited fruit and vegetable intake shifted the model’s prediction toward higher obesity risk. Such individualized insights can be integrated into clinical dashboards, allowing healthcare providers to discuss tangible lifestyle changes with patients, thereby fostering engagement and adherence.

Beyond individual care, these interpretable models can inform public health initiatives by identifying high-impact behaviors for population-level interventions. Campaigns focused on nutritional education, promoting regular physical activity, and age-specific obesity prevention strategies could be prioritized based on these findings. Moreover, transparent explanations from SHAP and LIME enhance trust among clinicians and patients, ensuring that model recommendations are not perceived as opaque “black-box” outputs but rather as evidence-supported guidance consistent with medical knowledge. This interpretability bridges the gap between algorithmic predictions and actionable decisions, advancing precision obesity management at both the individual and community level.

While the proposed ensemble models showed strong and consistent predictive performance, it is important to recognise the potential for algorithmic bias. Predictive models trained on lifestyle and demographic features may inadvertently reflect patterns related to gender, age, or socioeconomic status, which could result in biased outcomes if applied across diverse clinical populations. In this study, the dataset was derived solely from Latin American cohorts, which limits demographic diversity and may restrict the generalisability of the findings to other ethnic or socioeconomic groups. Additionally, features such as education level, dietary access, or income, although not directly included, may still be indirectly reflected in lifestyle proxies, thereby risking the reinforcement of existing health disparities if applied without critical assessment.

Although steps such as employing interpretable methods (SHAP and LIME) enable greater transparency in identifying influential features, they do not fully eliminate the risk of bias. Therefore, the outputs of the proposed models should be viewed as decision-support tools rather than standalone diagnostic systems. Future work should focus on external validation with more diverse populations, systematic bias audits, and fairness-aware modelling approaches to mitigate disproportionate impacts across vulnerable groups.

Conclusions, limitations, and further scope

This study highlights the effectiveness of ensemble learning in predicting obesity status through lifestyle and anthropometric data. By combining different learners, ensemble methods consistently outperformed individual algorithms, with hybrid stacking exhibiting the most promising predictive abilities. The success of hybrid ensembles emphasises the importance of integrating complementary modelling strategies to achieve robust and reliable results in health prediction tasks. In particular, the inclusion of modifiable lifestyle variables—such as dietary habits and alcohol consumption—demonstrates the potential of these models to inform prevention strategies that can be customised for individuals, thereby increasing their practical utility in clinical and public health settings.

Beyond predictive accuracy, the comparative analysis reveals actionable insights into the strengths of hybrid ensemble methods. Hybrid stacking and voting approaches not only achieved consistently high rankings across evaluation metrics but also demonstrated resilience against overfitting and variability. Although statistical tests did not reveal significant differences among competing models, the stable performance of hybrid ensembles suggests their practical reliability, implying that such methods can support decision-making processes in real-world applications even in the absence of strong statistical distinctions. Importantly, the integration of SHAP and LIME enhances model transparency by highlighting the contribution of key lifestyle and physical determinants. Such interpretability bridges the gap between algorithmic predictions and actionable health recommendations, making these models well-suited for adoption by clinicians and policymakers. Overall, the study both advances the methodology of ensemble-based obesity prediction and provides practical insights into designing interpretable, reliable, and user-oriented decision-support tools.

Despite its strengths, the study has certain limitations. The lack of statistically significant differences between the hybrid models and traditional ensemble models suggests that further testing with larger datasets or varied configurations may be needed to confirm the findings. A notable limitation concerns the dataset itself, which was derived solely from a Latin American cohort (Colombia, Peru, and Mexico). While the models demonstrated strong predictive performance within this group, the exclusion of other ethnicities, age groups, and socioeconomic backgrounds limits the broader applicability of the results. Thus, it remains uncertain whether the proposed interpretable ensemble models would achieve similar accuracy and reliability in more diverse populations. Future research should validate these approaches using larger and more diverse datasets to ensure broader relevance and robustness across different demographic contexts. Additionally, the study focused on lifestyle data; including other factors, such as genetic or environmental data, could improve model performance and interpretability. Some overfitting was observed in certain classes during training, though not in testing. Future work should prioritise external validation, bias assessment, and responsible deployment, ideally integrating these ensemble predictors into decision-support systems rather than keeping them as standalone diagnostic tools. This research can be extended by integrating the proposed model with wearable devices for real-time monitoring and early intervention in obesity management. Further studies may also explore the use of deep learning models or more advanced meta-learners in hybrid stacking to enhance predictive accuracy.