Abstract
Obesity is a major public health concern. Predicting obesity risk from lifestyle data can guide targeted interventions, but current models remain limited. This study first evaluates ensemble learning methods and then combines approaches to improve prediction accuracy and generalizability. Four ensemble techniques—boosting, bagging, stacking, and voting—were tested. Five boosting and five bagging models were constructed alongside voting and stacking models. Hyperparameter tuning optimized performance, and feature importance analysis guided potential feature elimination. In phase two, hybrid stacking and voting models integrated the best-performing boosting and bagging models to enhance predictive capability. Model robustness was ensured through k-fold cross-validation and statistical validation. SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations) improved interpretability by analyzing feature contributions. Hybrid stacking and voting models outperformed other ensemble methods, with stacking achieving the best performance (accuracy: 96.88%, precision: 97.01%, and recall: 96.88%). Feature importance analysis identified key predictors, including sex, weight, food habits, and alcohol consumption. The results demonstrated that hybrid ensembles significantly improved obesity risk prediction while preserving interpretability. Integrating multiple ensemble and explainability techniques provides a reliable framework for obesity prediction, supporting clinical decisions and personalized healthcare strategies to mitigate obesity risk.
Introduction
Obesity is a major global public health issue, with its prevalence reaching epidemic levels in recent times1. This complex condition is influenced by various genetic, environmental, and lifestyle factors. Clinically, obesity is classified into three categories of increasing severity based on body mass index (BMI). Obesity Type I (BMI 30–34.9 kg/m²) represents the mildest stage but already increases the risks of hypertension, dyslipidemia, and early metabolic dysfunction. Obesity Type II (BMI 35–39.9 kg/m²) is more severe, with a substantially higher likelihood of developing type 2 diabetes, cardiovascular disease, and impaired physical functioning2. Obesity Type III (BMI ≥ 40 kg/m²), also referred to as morbid obesity, denotes the most advanced stage and is strongly associated with life-threatening complications, diminished quality of life, and significantly increased mortality risk.
Differentiating among these categories has important clinical relevance, as it guides the urgency and intensity of management strategies—ranging from lifestyle counselling and pharmacological interventions in Type I to consideration of bariatric surgery and intensive treatment protocols in Types II and III. In general, accurate projections of obesity can significantly impact intervention and prevention strategies, allowing the implementation of targeted interventions and personalized healthcare. The emergence of sophisticated machine learning methodologies and the accessibility of extensive lifestyle data present a promising opportunity for developing effective obesity prediction models3. By employing advanced algorithms in machine learning and artificial intelligence, it is feasible to construct models that can precisely identify individuals at risk of obesity4. These models can analyze vast datasets, discern patterns, and subsequently generate predictions5.
Employing computational intelligence methods for obesity prediction can provide several benefits6. It facilitates early disease identification, enabling prompt intervention and the implementation of management strategies. Early identification can enhance individuals’ overall health outcomes by preventing or postponing obesity-related issues7. Furthermore, by accurately predicting obesity, medical professionals can offer high-risk individuals personalized treatment plans and preventative measures. These may involve lifestyle modifications, dietary interventions, exercise regimens, and effective medication management to control blood sugar levels.
Nevertheless, the volume of healthcare data is increasing significantly, and conventional machine learning methods have proven insufficient for effectively processing such large amounts of data for precise disease predictions8. Ensemble learning strategies provide superior performance in this regard9. One reason ensemble models have become popular in predictive modelling is that they can integrate various models to address each one’s shortcomings while maximizing their strengths. Ensemble learning offers numerous advantages over single classifiers, including enhanced predictive performance, improved robustness, stability, and generalization, reduced overfitting and variance, and greater model diversity10. It also sharpens feature importance estimation, effectively handles diverse data types, and addresses class imbalance more efficiently. These properties make ensemble methods highly applicable in healthcare-related prediction tasks, where reliability and generalizability are critical11. Ensemble learning has been used for various real-world problems and has attained considerable prominence in healthcare, owing to its efficacy in predicting, detecting, diagnosing, and prognosticating many diseases.
However, it is crucial to acknowledge that employing ensemble learning in a straightforward manner does not necessarily provide the expected benefit. Problem specifics, dataset properties, and computational resources should guide the selection of ensemble techniques and base models12. When applied correctly, though, ensemble learning can significantly improve the performance of ML models13. Although ensemble models enhance predictive accuracy, they often lack interpretability, which can obscure how predictions are made. In clinical decision-making, where transparency and explainability are crucial, explainable AI (XAI) techniques such as SHAP and LIME play a vital role.
SHAP and LIME enhance model interpretability by identifying and quantifying the influence of certain attributes on the model’s predictions. SHAP values provide both global and local insights into feature significance, facilitating a comprehensive understanding of the factors influencing obesity projections. LIME, on the other hand, offers localized explanations, helping to interpret specific predictions on an instance-by-instance basis. By integrating these explainability techniques, we ensure statistical rigor in performance validation, making our approach more reliable and explainable for clinical adoption.
In this study, we focus on designing and assessing suitable ensemble models that accurately predict obesity using lifestyle data. This research has three main objectives:
1) To build and refine different ensemble models that efficiently utilize lifestyle information for predicting obesity.
2) To rigorously evaluate the effectiveness of these models using extensive assessment criteria and cross-validation techniques.
3) To enhance model interpretability by employing SHAP and LIME to identify and analyze the significance of individual and collective features, ensuring transparency and trust in model predictions.
The key contributions of this work are as follows:
a) Constructing diverse ensemble models using multiple algorithms and methodologies.
b) Developing hybrid stacking and voting models by leveraging insights from ensemble model performance.
c) Rigorously validating the proposed hybrid models through comparative analysis with existing ensemble techniques and related research.
d) Performing comprehensive statistical evaluations to assess the predictive significance of the proposed models.
e) Enhancing model interpretability using XAI techniques, specifically SHAP for global explanations and LIME for local, instance-based insights.
The rest of the paper is organized as follows. First, we review related work in the field. This is followed by a discussion of the research methodology, including a brief overview of the considered ensemble methods. Next, we describe the dataset and its analysis, along with the preprocessing steps undertaken. The subsequent section presents and analyzes the experimental details and results. We then conduct statistical analyses of the models’ prediction performance. An interpretability analysis of the models using SHAP and LIME is provided thereafter, highlighting the contributions of individual features. This is followed by a comparison of the proposed model with state-of-the-art approaches. A critical discussion then addresses the clinical implications of the study. Finally, we conclude with a summary of findings, achievements, limitations, and future research directions. The acronyms used in this paper are listed in Table 1.
Related work
Machine learning techniques have increasingly been adopted for obesity prediction and classification due to their ability to handle complex, high-dimensional data related to lifestyle, demographics, and clinical factors14,15. Recent works have explored a wide spectrum of algorithms, with ensemble methods receiving particular attention owing to their robustness and predictive accuracy. However, while the literature demonstrates promising results, it is fragmented in terms of methodological approaches, interpretability, and validation rigor. Below, we synthesize key contributions, draw methodological comparisons, and highlight existing limitations that motivate our proposed approach.
Traditional ensemble methods for obesity prediction
Several studies have applied classical ensemble methods such as bagging, boosting, and RFs. For instance, Kaur et al.16 used GB, RF, SVM, and XGB on the OCPM dataset, achieving 97.79% accuracy with XGB. They further incorporated lifestyle and anthropometric attributes, providing risk factor insights and personalized meal planning. Similarly, Ferdowsy et al.17 evaluated multiple classifiers (KNN, LR, RF, MLP, GB), where LR achieved 97.09% accuracy on obesity risk levels. Maria et al.18 also found GB to be the most effective (97.08%) using the “Obesity and Lifestyle” dataset from Kaggle. Jindal et al.19 applied GLM, PLS, and LR to demographic and anthropometric features, producing an average prediction accuracy of 89.68% while emphasizing personalized assessment over static BMI-based thresholds. Khodadadi et al.20 achieved 98.18% with XGB on a collected dataset, while Bag et al.21 reported 98.79% using LR, RF, and XGB. Balbir and Hissam22 applied MLP, achieving 92% accuracy on adolescent BMI data from the UK MCS (Millennium Cohort Study) cohort. Lim et al.23 utilised national panel data from Korea, with RF achieving 74% accuracy and an AUC of 0.82, identifying child and maternal factors as significant predictors. These works confirm the reliability of boosting-based ensembles but focus primarily on accuracy metrics, often without deeper interpretability or statistical validation. While these methods demonstrate strong predictive performance, they often lack robustness checks, such as cross-cohort validation or sensitivity analyses, and most treat interpretability as secondary, which limits their direct clinical translatability.
Large-scale population studies
At the population level, Thamrin et al.24 used Indonesian health survey data (RISKESDAS) with LR achieving the best accuracy (72%) and AUC (0.798). Pang et al.25 leveraged pediatric EHR data (PBD) with XGB, achieving 66.14% accuracy, identifying demographic and physiological predictors. Jeon et al.26 employed KNHANES (Korea National Health and Nutrition Examination Survey) data to identify age- and gender-specific risk factors, finding MLP superior across most groups. These works demonstrate scalability but also expose challenges: the predictive models struggle to maintain high performance in heterogeneous populations, and interpretability often remains secondary. Despite leveraging large and heterogeneous datasets, these works reveal trade-offs between scalability and model robustness, with reduced accuracy and limited attention to interpretability across demographic subgroups.
Multiclass obesity classification
An important subset of research focuses specifically on multiclass obesity classification, which is particularly relevant to our study. Khater et al.27 used CPM data with 17 features to classify individuals into seven obesity levels, deliberately excluding weight, height, and family history. Although XGB outperformed DT and RF, accuracy was limited to 75%, illustrating the challenges of relying solely on lifestyle features. Rodríguez et al.28, using survey-based data labeled according to WHO BMI guidelines, achieved ≈ 78% accuracy with RF, with balanced precision, recall, and F1-score. Jeon et al.29 applied 3D body scans and genetic feature selection for nine lifestyle characteristics, achieving 80% accuracy with recall and precision in the mid-70s to 80s. Suresh et al.30 implemented a multiclass classification framework in a web-based application, predicting seven weight categories and associated conditions. Their models performed strongly, with RF, SVM, and DT achieving accuracies of 98.48%, 96.21%, and 96.96%, respectively, while KNN lagged behind at 78.97%. These works demonstrate the feasibility of multiclass obesity prediction but also reveal limitations, including restricted datasets, a heavy reliance on either lifestyle-only or BMI-based features, and a limited focus on interpretability or statistical robustness. Collectively, these studies underscore the need for models that balance high predictive accuracy with consistency across multiple metrics and transparent explanations for predictions, particularly in multi-class contexts.
Hybrid voting and stacking-based approaches
More recent efforts have explored hybrid ensembles. Diayasa et al.31 showed that stacking with GB as a meta-learner outperformed single models (97.87%). Solomon et al.32 proposed a hybrid majority voting model integrating GB, XGB, and MLP, reaching 97.16%. Choudhuri33 advanced this line of work with a hybrid of ERT, MLP, and XGB, achieving 99.4% accuracy. Ganie et al.34 considered voting, bagging, and boosting while achieving the highest accuracy of 98.10% with XGB. These studies underscore the promise of hybrid ensembles but tend to emphasize only predictive metrics, with limited analysis of robustness, generalizability, or interpretability. Moreover, most report performance on limited datasets (e.g., OCPM or small survey-based sets), which restricts external validity. While these hybrid models outperform single learners in accuracy, they rarely address stability across datasets or clinical interpretability, making their applicability in real-world health contexts uncertain.
Toward explainable models
A smaller body of work has begun integrating explainability. Lin et al.35 applied SHAP values to interpret CatBoost predictions, demonstrating how systolic blood pressure, waist circumference, and sex contributed to outcomes. However, most other works lack interpretability, treating the models as black boxes. Thus, comprehensive frameworks combining global explanations (e.g., SHAP) and local explanations (e.g., LIME) remain scarce. In short, while the field has begun to acknowledge interpretability, current approaches remain fragmented—either emphasizing predictive accuracy without robustness or introducing partial explainability without demonstrating generalizability.
Limitations of current literature
Synthesizing across existing studies reveals several persistent gaps. First, most works emphasize accuracy as the primary performance indicator while neglecting statistical significance testing, calibration, or robustness checks, thereby limiting confidence in reported improvements. Second, while hybrid strategies such as stacking and voting have been attempted, they are often only partially explored and seldom benchmarked systematically against boosting or bagging using multiple complementary evaluation metrics. Third, interpretability frameworks remain underdeveloped—when included, they are typically restricted to a single explanation method (e.g., SHAP), with little attention to local interpretability (e.g., LIME), stability validation, or clinical contextualization. Finally, many studies rely on relatively small, homogeneous, or survey-based datasets, which restrict generalizability and external validity, raising concerns about performance in diverse real-world populations.
Research scope and our contribution
To address these limitations, our study develops hybrid stacking and voting ensembles that systematically integrate multiple base classifiers, leveraging the complementary strengths of boosting, bagging, and tree-based methods. Unlike prior works that emphasize single ensemble learners or partial hybrids, our framework explicitly balances predictive accuracy with methodological rigor. Evaluation is extended beyond raw accuracy to include precision, recall, F1-score, MCC, and AUC, with statistical reliability established through Friedman and Holm tests, thereby ensuring robustness and reproducibility. Crucially, we embed both global interpretability (via SHAP) and local interpretability (via LIME), enabling insights that are transparent at both population and patient-specific levels, and directly addressing the black-box limitation of prior models. By applying this framework to a multi-class obesity dataset, we demonstrate not only superior predictive performance but also enhanced robustness, interpretability, and generalizability. Collectively, our work contributes a reliable, transparent, and clinically relevant methodology for obesity prediction that bridges the gap between algorithmic advances and actionable healthcare applications.
Research methodology
This section presents a comprehensive overview of the research procedures conducted and the ensemble learning methods applied during the experiment.
Research workflow
Figure 1 illustrates the procedural flow of this experimental study. The experiment can be approached in the following manner:
Data collection and manipulation
For this investigation, a publicly accessible obesity dataset from Kaggle was utilized. To enhance the dataset’s quality, exploratory data analysis was performed, addressing any erroneous data, outliers, and missing values. The dataset was normalized and standardized before being split into training and testing portions at a 70:30 ratio.
Building various ensemble models
For the extensive experiment, we initially constructed five boosting ensemble models (CB, XGB, GB, ADB, and LGBM) and five bagging ensemble models (DT, BDT, RF, ET, and BME), along with voting and stacking models, using LR, KNN, MLP, SVM, and NB. Hyperparameter tuning was conducted to optimize the models. Furthermore, the significance of the features was evaluated for potential feature elimination. The final models were assessed rigorously using several performance metrics.
Building hybrid stacking and voting models
In the final phase, to achieve even greater prediction accuracy, we developed two hybrid stacking and voting models. For the constituent models, we selected the top-performing ensemble models from both the boosting and bagging techniques, as tested in the second phase. In this stage, hyperparameter tuning and feature assessment were conducted to optimize the hybrid models. The optimal models were again thoroughly evaluated using performance metrics, as previously described. Furthermore, the hybrid voting and stacking models were compared with the top-performing models for each metric.
Analysis and interpretation
An extensive statistical analysis was conducted on hybrid voting and stacking models to evaluate the statistical significance of their predictive results. Furthermore, by employing SHAP and LIME, we sought to explore the inner workings of the models and identify the features that contributed to the final prediction (Fig. 1).
Ensemble learning methods
Ensemble learning improves prediction accuracy by integrating multiple independent models, called weak models11. Weak learners are elementary machine-learning models that perform better than random chance on a task. While not particularly accurate alone, they provide a foundation for complex models13. Each weak model is trained using distinct data subsets or various algorithms36. The final prediction is obtained through voting or averaging predictions from individual models, improving overall performance37. In this study, we experimented with major ensemble techniques to identify the most suitable one for obesity prediction based on the lifestyle dataset.
Bagging
Bagging, or bootstrap aggregating, is an ensemble learning technique that enhances model stability and accuracy by training multiple base models independently on different random samples of the training data drawn with replacement38. The predictions of these models are then combined through majority voting (for classification) or averaging (for regression) to generate a final output. Key bagging algorithms include DT, which are interpretable, nonparametric models that partition data recursively39; BDT, where multiple DTs are trained on bootstrap samples and combined to reduce variance40; RF, which improve BDTs by also randomly selecting subsets of features at each split to lower overfitting and boost robustness41; ET, which further randomize feature selection and splits for faster training and decreased variance42; and BME, which trains multiple classifiers on random subsets and combines their predictions, enhancing variance reduction and enabling parallel training40. Collectively, these methods harness the diversity created by resampling and randomization to deliver more reliable, accurate, and generalisable predictions, particularly on noisy or high-dimensional datasets.
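A minimal sketch of these bagging-family learners in scikit-learn is given below; the synthetic data and hyperparameters are placeholders rather than the tuned configurations of Table 5, and the estimator keyword of BaggingClassifier assumes scikit-learn ≥ 1.2.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (BaggingClassifier, RandomForestClassifier,
                              ExtraTreesClassifier)

# Placeholder data standing in for the 7-class obesity dataset
X, y = make_classification(n_samples=500, n_features=16, n_informative=10,
                           n_classes=7, random_state=42)

bagging_models = {
    "DT": DecisionTreeClassifier(random_state=42),
    # BDT: decision trees trained on bootstrap samples, combined by vote
    "BDT": BaggingClassifier(estimator=DecisionTreeClassifier(),
                             n_estimators=100, random_state=42),
    # RF: adds random feature subsets at each split to decorrelate trees
    "RF": RandomForestClassifier(n_estimators=100, random_state=42),
    # ET: also randomizes the split thresholds themselves
    "ET": ExtraTreesClassifier(n_estimators=100, random_state=42),
}
for name, model in bagging_models.items():
    print(name, model.fit(X, y).score(X, y))
```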
Boosting
Boosting is an ensemble learning technique where base models are trained sequentially, with each iteration focusing more on misclassified samples to correct errors from previous models43. In this study, we analysed five key boosting algorithms: GB builds an ensemble by iteratively optimising a loss function with weak models added to minimise residual errors, effectively capturing complex nonlinear relationships44. XGB improves on GB by efficiently handling large, high-dimensional datasets using gradient descent, regularisation, and parallel processing to reduce overfitting45. CB specialises in processing categorical variables directly without encoding, employing tailored splitting rules and GPU acceleration for improved speed and high-dimensional data handling46. LGBM speeds up training through gradient-based one-sided sampling and excels at handling sparse and categorical features with robust performance47. AdaBoost assembles a strong classifier by dynamically adjusting the weights of weak learners, emphasising misclassified instances at each iteration, making it effective for both binary and multiclass problems while reducing the risk of overfitting48. Collectively, these boosting methods enhance predictive accuracy through sequential error correction and targeted learning from challenging samples, making them suitable for complex datasets in both classification and regression contexts.
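Under the same placeholder-data assumptions, the sketch below instantiates the five boosting learners; xgboost, lightgbm, and catboost are third-party packages, and the settings shown are illustrative defaults rather than the tuned values of Table 5.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

X, y = make_classification(n_samples=500, n_features=16, n_informative=10,
                           n_classes=7, random_state=42)

boosting_models = {
    "GB": GradientBoostingClassifier(random_state=42),
    "XGB": XGBClassifier(eval_metric="mlogloss", random_state=42),
    "LGBM": LGBMClassifier(random_state=42, verbose=-1),
    "CB": CatBoostClassifier(verbose=0, random_state=42),
    "ADB": AdaBoostClassifier(random_state=42),
}
for name, model in boosting_models.items():
    # each learner fits weak models sequentially, emphasising hard samples
    print(name, model.fit(X, y).score(X, y))
```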
Voting and stacking
Voting and stacking are popular ensemble techniques used to enhance model performance by combining multiple base models. Voting aggregates predictions by either majority vote (hard voting) or by averaging class probabilities (soft voting), providing an intuitive and straightforward way to combine different classifiers49. It can use homogeneous or heterogeneous base learners and effectively boosts overall accuracy by leveraging diverse models. Stacking, on the other hand, is a two-layer approach where a meta-learner is trained on the outputs of several base models (level-0 learners) to learn how to combine their predictions best50. This method captures complex relationships among base models and often results in improved predictive performance compared to simple voting51. We employed LR, KNN, MLP, NB, and SVM as base learners for the stacking and voting models. LR models the relationship between features and class probabilities52, KNN classifies based on the labels of the nearest neighbours53, MLP uses multiple interconnected layers to learn nonlinear mappings54, NB applies probabilistic reasoning under feature independence assumptions55, and SVM finds optimal hyperplanes for class separation even in non-linear spaces via kernel methods56. Both voting and stacking benefit from the diversity among these base models, which improves robustness and accuracy in classification tasks.
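A minimal sketch of both combiners over the five base learners named above follows; probability=True on SVC is needed for soft voting, and all settings are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier, StackingClassifier

X, y = make_classification(n_samples=500, n_features=16, n_informative=10,
                           n_classes=7, random_state=42)

base = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("knn", KNeighborsClassifier()),
    ("mlp", MLPClassifier(max_iter=500, random_state=42)),
    ("nb", GaussianNB()),
    ("svm", SVC(probability=True, random_state=42)),  # enables soft voting
]
# Soft voting averages the predicted class probabilities of the base models
voting = VotingClassifier(estimators=base, voting="soft").fit(X, y)
# Stacking trains an LR meta-learner on the base models' outputs
stacking = StackingClassifier(estimators=base,
                              final_estimator=LogisticRegression(max_iter=1000),
                              cv=5).fit(X, y)
print(voting.score(X, y), stacking.score(X, y))
```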
Dataset analysis and preprocessing
For this study, an obesity dataset was obtained from Kaggle (https://www.kaggle.com/datasets/ankurbajaj9/obesity-levels). The dataset contains obesity information of individuals from Colombia, Peru, and Mexico, aged between 14 and 61 years, with diverse eating habits and physical conditions57. The data contains 2,111 records with 17 attributes labelled with seven obesity classes, as shown in Fig. 2. Table 2 presents detailed family-wise attribute information of the dataset.
Exploratory data analysis
This section examines data distributions and relationships to uncover patterns and variable interactions. Kernel Density Estimation and Correlation Coefficient Analysis are used for visualizing distributions and identifying feature dependencies, respectively.
Histograms of attributes
Figure 3 presents histograms of the attributes in the obesity dataset, illustrating the underlying distribution of demographic and lifestyle factors. Each subplot corresponds to a specific attribute. Most patients in the dataset are between the ages of 15 and 30. Additionally, the target class distribution (OB) is relatively balanced.
Correlation coefficient analysis
A matrix representing the correlation coefficients for the dataset’s independent and dependent variables is shown in Fig. 4. We utilized Cramér’s V method to evaluate correlation among attributes within the obesity dataset. This technique, based on the chi-square statistic, provides a normalised measure of the relationship between categorical or nominal variables. The correlation values range from 0 to 1, where 0 indicates no association and 1 represents perfect association. Values near 1 suggest a stronger relationship, while those near 0 imply a weaker connection. The Cramér’s V correlation matrix is used to understand the association of predictor variables in forecasting obesity.
Figure 4 suggests that Obesity (OB) is most strongly associated with weight (WT), height (HT), and age (AG), which influence BMI. Lifestyle factors, including calorie monitoring (CC), physical activity (FA), alcohol intake (CA), and time spent on technology (TD), exhibit high correlations, indicating their importance in predicting obesity. Dietary habits, including the number of meals (NM), vegetable intake (FV), and snacking, show moderate associations, while family history (FH) has a smaller effect. Fast-food consumption, smoking (SK), gender (GD), and transportation mode show weak correlations. Results indicate that anthropometric measures are the strongest predictors, while lifestyle and diet provide valuable but secondary contributions to the risk of obesity.
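A common way to compute Cramér's V from a contingency table is sketched below; the binning of continuous attributes such as weight is an assumption for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V association between two categorical series (0 = none, 1 = perfect)."""
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table)[0]
    n = table.to_numpy().sum()
    r, k = table.shape
    return float(np.sqrt((chi2 / n) / min(r - 1, k - 1)))

# e.g., association between a binned weight attribute and the obesity class,
# assuming a DataFrame df with columns "WT" and "OB":
# cramers_v(pd.cut(df["WT"], bins=10), df["OB"])
```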
Data preprocessing
This section outlines the steps taken to prepare raw data for modeling by ensuring consistency, comparability, and proper representation.
Checking for missing values and outliers
The dataset contains no missing values and is already synthesized to balance the target classes (0–6). Outlier detection using the Z-score method was applied exclusively to numeric attributes, specifically AG and NM, as no other attributes warranted this approach. Here, AG is a continuous numeric attribute, while NM is a numeric attribute scaled from 1 to 4. The Z-score approach is defined by Eq. 1, z = (x − µ)/σ, where x denotes the observed value, µ the sample mean, and σ the sample standard deviation. Figure 5 displays the IQR plots of AG and NM before and after outlier handling.
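A small sketch of the Eq. 1 screening applied to AG and NM follows; the |z| > 3 cutoff and the clipping strategy are assumptions, as the handling rule is not specified above.

```python
import numpy as np
import pandas as pd

def clip_zscore_outliers(s: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Clip values whose Z-score (Eq. 1) exceeds the threshold in magnitude."""
    z = (s - s.mean()) / s.std()
    inliers = s[np.abs(z) <= threshold]
    return s.clip(inliers.min(), inliers.max())

# Applied to the two numeric attributes of the obesity dataset:
# df["AG"] = clip_zscore_outliers(df["AG"])
# df["NM"] = clip_zscore_outliers(df["NM"])
```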
Standardization and normalization
Standardization was performed by adjusting the data according to Eq. 2, x′ = (xᵢ − x̄)/s, ensuring each feature had zero mean and unit variance. Here, N, X, xᵢ, x̄, s², xmin, and xmax denote the total number of data samples, an attribute, its ith value, the sample mean, the sample variance, and the minimum and maximum of each attribute, respectively.
Normalization—a key part of feature scaling—placed the data within a pre-specified range using the min-max algorithm, as in Eq. 3: x′ = (x − xmin)/(xmax − xmin). Implemented via the MinMaxScaler() function, this scales each attribute to the interval [0, 1], where x denotes the raw value, x′ the scaled value, and xmin and xmax the minimum and maximum values of each attribute.
For categorical variables, label encoding was used to convert non-numeric features into a numeric form suitable for machine learning algorithms; the encoding details are provided in Table 3.
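Putting these steps together, a sketch of the full preprocessing pipeline might look as follows; the file name and the target column "NObeyesdad" are assumptions based on the Kaggle obesity-levels dataset, and the resulting objects are reused in the later sketches.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split

df = pd.read_csv("ObesityDataSet.csv")  # file name assumed

# Label-encode all categorical columns (this also maps the categorical
# target to the integer classes 0-6 used throughout the paper)
for col in df.select_dtypes(include="object").columns:
    df[col] = LabelEncoder().fit_transform(df[col])

X, y = df.drop(columns=["NObeyesdad"]), df["NObeyesdad"]  # target name assumed

# 70:30 split as described above; stratification preserves class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

# Min-max scaling (Eq. 3), fitted on the training portion only
scaler = MinMaxScaler().fit(X_train)
X_train = pd.DataFrame(scaler.transform(X_train), columns=X.columns)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X.columns)
```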
Experiment and results
This section provides detailed information on the experimental procedures used to develop and evaluate various ensemble models in two phases for obesity prediction. Table 4 displays the specifics of the hardware and software resources for the experiment.
K-fold cross validation
K-fold cross-validation is commonly employed to minimize bias in performance estimation. This method entails splitting the dataset into k “folds,” or subsets, of roughly equal size. To create the ensemble models in this experiment, the training dataset underwent the initial k-fold cross-validation. Through testing, k = 10 was determined to be the optimal number of folds for this process. The steps involved in k-fold cross-validation are illustrated in Fig. 6.
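Continuing from the preprocessing sketch, the 10-fold procedure can be expressed as below; the stratified variant is an assumption made to preserve class proportions in each fold.

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import GradientBoostingClassifier

# k = 10 folds, as determined through testing
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(GradientBoostingClassifier(random_state=42),
                         X_train, y_train, cv=cv, scoring="accuracy")
print(f"mean accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")
```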
Assessing feature importance
To enhance the predictive accuracy of the model, it is essential to conduct a systematic assessment of feature importance within the dataset, as irrelevant or weakly contributing attributes may adversely affect model performance. Accordingly, non-informative features should ideally be excluded during model training. In this study, we employed recursive feature elimination in conjunction with the feature significance score (F-score), a Gini-based statistical measure that evaluates the discriminative capacity of individual features across classes. The results indicated that all features made significant contributions to the prediction of obesity across the models investigated; consequently, no features were removed from the final analysis.
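A sketch of this elimination step is shown below, pairing recursive feature elimination with Gini-based importances from an RF; RFECV is used here so that cross-validated accuracy decides how many features to keep, mirroring the reported outcome that none were dropped.

```python
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier

# Gini-based importances of the forest drive the recursive elimination
selector = RFECV(RandomForestClassifier(n_estimators=100, random_state=42),
                 step=1, cv=5, scoring="accuracy")
selector.fit(X_train, y_train)
print("features retained:", selector.n_features_, "of", X_train.shape[1])
```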
Hyperparameter tuning
Optimizing hyperparameters is essential as it determines the behavior of the training algorithm and significantly influences the evaluation of the model’s performance. We optimized the hyperparameters by employing both grid search and random search techniques to obtain the best performance from the developed models. Since grid search produced better results, it alone was used for the final models. Table 5 presents detailed information regarding the hyperparameters for each model.
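A minimal grid-search sketch for a single model (GB) follows; the grid shown is illustrative and much smaller than the search spaces of Table 5.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier

param_grid = {"n_estimators": [100, 200, 300],
              "learning_rate": [0.05, 0.1, 0.2],
              "max_depth": [3, 5, 7]}
# Exhaustively evaluates every grid point with 10-fold cross-validation
search = GridSearchCV(GradientBoostingClassifier(random_state=42),
                      param_grid, cv=10, scoring="accuracy", n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```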
Evaluation metrics
The performances of the prediction models for predicting different obesity levels were evaluated using several standard metrics, as described in Table 6.
Phase I: Performance of the ensemble models
The experimental outcomes of the ensemble models that were considered are detailed in this section. From Fig. 7, we observe that among the twelve experimented models, GB achieved the highest accuracy of 91.95%, while BDT had the lowest at 77.60%. GB was also the best performer in terms of precision, recall, F1-score, MCC, and Kappa. Only in the case of AUC was GB slightly behind XGB and ET. BDT performed the worst in all the tests. It is worth noting that, except for AUC, the stacking model also performed well in all metrics. In summary, GB, XGB, RF, ET, and stacking can be adjudged as the top five performing models.
Phase II: Building the hybrid stacking and voting models
To create a highly effective ensemble model, our objective was to identify the optimal mix of base models. Initially, we conducted experiments by developing various ensemble models using different algorithms, as mentioned in the preceding section. We further attempted to build hybrid stacking and voting models to produce a better model, considering the models from both boosting and bagging. We tried different permutations and combinations, as illustrated in Fig. 8. Initially, we made two combinations by randomly selecting eight and six models. Finally, we selected the top four overall performers (GB, XGB, RF, and ET) as identified in Phase I. We got the best result from this third combination for both hybrid stacking and voting. The processes of building hybrid stacking and voting pipelines using the final combination are illustrated in Fig. 9. For stacking, we used three different meta-learners (SVM, NB, and LR). In the final combination, LR was used. The optimal hyperparameter setups for both models are detailed in Table 7.
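A sketch of the final hybrid pipelines follows, assuming X_train and y_train from the preprocessing sketch; the base-model settings are illustrative, with the tuned values given in Table 7.

```python
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              ExtraTreesClassifier, StackingClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

# Top four Phase-I performers as base models
base = [
    ("gb", GradientBoostingClassifier(random_state=42)),
    ("xgb", XGBClassifier(eval_metric="mlogloss", random_state=42)),
    ("rf", RandomForestClassifier(n_estimators=200, random_state=42)),
    ("et", ExtraTreesClassifier(n_estimators=200, random_state=42)),
]
# Hybrid stacking: LR meta-learner over 10-fold out-of-fold predictions
hybrid_stacking = StackingClassifier(
    estimators=base, final_estimator=LogisticRegression(max_iter=1000),
    cv=10, n_jobs=-1).fit(X_train, y_train)
# Hybrid voting: soft vote over the same base models
hybrid_voting = VotingClassifier(estimators=base, voting="soft",
                                 n_jobs=-1).fit(X_train, y_train)
```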
The confusion matrices for the hybrid stacking and voting models are shown in Fig. 10. The hybrid stacking model produces fewer incorrect classifications. Figure 11 shows the comparative performance of the hybrid stacking and voting models for each of the ten folds with respect to the considered evaluation metrics. The hybrid stacking model outperforms hybrid voting across all metrics at each fold.
The mean performances of all the metrics are shown in Fig. 12(a). The graph confirms the dominance of the stacking model over the hybrid voting model for all metrics. Figure 12(b) shows the performance deviations of the hybrid stacking and voting models across ten folds for each metric. It is observed that stacking has been more consistent for each metric across all folds.
The AUC-ROC curves for the hybrid stacking and voting models are shown in Fig. 13. Overall, hybrid stacking performed better than hybrid voting. However, hybrid voting slightly outperforms hybrid stacking in classifying the underweight (0) class. The AUPRCs of both models are shown in Fig. 14. Here also, the hybrid stacking model has a better PR score (0.99) than the hybrid voting model (0.96).
Performance comparison of the ensemble models
In this section, we compare the proposed hybrid stacking and voting models with the three best performers for each metric. For instance, Fig. 15(a-f) suggests that GB, XGB, and RF are among the top models having better accuracy, precision, recall, F1-score, MCC, and Kappa in the experiment conducted in Phase I. In each case, the proposed hybrid stacking and voting models outperformed them, while the hybrid stacking model consistently remained the best performer. In Fig. 15(g), XGB, ET, and CB show the top AUC values; therefore, they are compared with the proposed models. Here, too, the proposed hybrid models had a better AUC than the others; however, no difference was observed between the hybrid stacking and voting models in terms of AUC.
Statistical analysis
To evaluate the statistical significance of the proposed models for obesity prediction, we applied the nonparametric Friedman’s aligned ranks test58 across each performance metric. This was followed by post hoc pairwise comparisons using the Holm correction method59, with a significance level set to 0.05. The analysis was conducted on key evaluation metrics, including accuracy, precision, recall, F1-score, MCC, Kappa, and AUC. All statistical testing was carried out using the STAC (Statistical Tests for Algorithms Comparison) web-based platform (https://tec.citius.usc.es/stac/index.html).
Friedman’s aligned ranks test
To determine whether observed performance differences between the proposed hybrid stacking and voting models and three other top-performing models (per metric) were statistically significant, we employed Friedman’s aligned ranks test. This nonparametric test is specifically designed for comparing multiple algorithms evaluated on the same task(s) and thereby accommodates the repeated-measures structure inherent in algorithm comparisons. Unlike ANOVA-based procedures, it does not assume normality or homogeneity of variances—assumptions seldom met by classifier performance data—and the aligned-ranks variant increases sensitivity by removing block effects before ranking, yielding a fair, distribution-free comparison when models perform closely.
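As a quick check, the classical Friedman test is available in scipy, sketched below with placeholder per-fold accuracies; the aligned-ranks variant reported here was computed with the STAC platform, which has no direct scipy equivalent.

```python
from scipy.stats import friedmanchisquare

# Per-fold accuracies for five models (placeholder values, 10 folds each)
acc = {
    "stacking": [0.97, 0.96, 0.97, 0.97, 0.96, 0.97, 0.97, 0.96, 0.97, 0.97],
    "voting":   [0.95, 0.95, 0.96, 0.95, 0.94, 0.95, 0.96, 0.95, 0.95, 0.95],
    "gb":       [0.92, 0.91, 0.92, 0.93, 0.91, 0.92, 0.92, 0.91, 0.92, 0.92],
    "xgb":      [0.91, 0.92, 0.91, 0.92, 0.91, 0.91, 0.92, 0.91, 0.91, 0.92],
    "rf":       [0.90, 0.91, 0.90, 0.91, 0.90, 0.90, 0.91, 0.90, 0.90, 0.91],
}
stat, p = friedmanchisquare(*acc.values())
print(f"Friedman statistic = {stat:.3f}, p = {p:.5f}")
```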
The test was applied separately to each performance metric (accuracy, precision, recall, F1-score, MCC, Kappa, and AUC). Table 8 reports the Friedman statistics and p-values together with the decision on H₀ and the corresponding average ranks for each model. For accuracy, precision, recall, F1-score, MCC, and Kappa, the Friedman statistic was 4.00000 (p = 0.40601); for AUC, it was 3.80000 (p = 0.43370). In all cases, the p-value exceeded the 0.05 threshold, so H₀ (no performance differences among models) was retained for each metric. Crucially, non-rejection of H₀ should not be read as evidence of model equivalence; rather, under the present experimental conditions, the observed gaps were not large enough to achieve statistical significance. Given the close clustering of modern ensemble methods and the characteristics of the OCPM dataset, modest effect sizes are expected.
Interpreting the rank structure adds practical context. The hybrid stacking model consistently attains the highest rank (5) across accuracy, precision, recall, F1-score, MCC, and Kappa, with the hybrid voting close behind at rank 4. Traditional ensembles (RF, XGB, GB) occasionally lead on individual metrics but lack cross-metric stability, indicating potential trade-offs. This pattern suggests that, even without statistically significant separation, the hybrid ensembles—especially hybrid stacking—are more uniformly reliable across criteria that matter jointly in deployment (e.g., maintaining balance between sensitivity, precision, and agreement measures such as MCC/Kappa). For AUC, CB and ET hold ranks 1–2, with XGB at 3 and hybrid stacking/voting tied at 4.5, indicating that while CB/ET offer slightly stronger discrimination, hybrid stacking retains a competitive ranking and strong overall classification power.
To further quantify the degree of agreement among rankings, Kendall’s coefficient of concordance (W) was calculated. Across six primary performance metrics (accuracy, precision, recall, F1-score, MCC, and Kappa), W = 0.972, reflecting near-perfect agreement among models. When AUC was included, W decreased to 0.703, indicating substantial but weaker concordance due to divergence in discrimination ability across models. These values suggest that, although the Friedman’s test did not reveal statistically significant differences, the consistently high concordance supports the practical reliability of the hybrid stacking model across most evaluation criteria.
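Kendall's W can be computed directly from the rank matrix, as in the sketch below with placeholder ranks; the tie-corrected form would adjust the denominator.

```python
import numpy as np

def kendalls_w(ranks: np.ndarray) -> float:
    """Kendall's coefficient of concordance for an (m raters x n items) rank matrix."""
    m, n = ranks.shape
    rank_sums = ranks.sum(axis=0)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12.0 * s / (m ** 2 * (n ** 3 - n))

# e.g., six metrics each ranking five models (placeholder ranks, 1 = worst)
ranks = np.array([[5, 4, 3, 2, 1]] * 6)
print(kendalls_w(ranks))  # identical rankings across metrics give W = 1.0
```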
Post hoc analysis
To further investigate pairwise performance differences, a post hoc analysis was conducted using the Holm step-down procedure. This method was selected because, when conducting multiple hypothesis tests simultaneously, the risk of false positives (Type I errors) increases. Traditional Bonferroni correction is overly conservative and often reduces statistical power. By contrast, Holm’s method provides a balance between controlling the family-wise error rate and maintaining sufficient sensitivity to detect genuine differences. This makes Holm particularly suitable in comparative algorithm studies, where many pairwise comparisons are required across multiple metrics.
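The step-down procedure itself is available in statsmodels, sketched below with placeholder raw p-values rather than those underlying Table 9.

```python
from statsmodels.stats.multitest import multipletests

# Raw p-values from pairwise comparisons of hybrid stacking vs. baselines
raw_p = [0.21, 0.34, 0.47, 0.61, 0.88, 0.95]  # placeholder values
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method="holm")
# Holm orders p-values, then applies successively looser Bonferroni factors
print(list(zip(adj_p.round(3), reject)))
```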
The analysis compared the proposed hybrid stacking model against other strong baselines—RF, XGB, GB, CB, ET, and the hybrid voting model—across all seven performance metrics. Since hybrid stacking consistently ranked above hybrid voting, their direct comparison was also emphasized. The results (Table 9) indicate that, for all metrics, the adjusted p-values remained above the 0.05 threshold, leading to retention of the null hypothesis (H₀). This means that, under current data conditions, the superior ranking of hybrid stacking over RF, XGB, and GB, as well as its competitiveness with CB and ET, cannot be deemed statistically significant.
However, the lack of significance does not negate the practical value of the findings. Across accuracy, precision, recall, F1-score, MCC, and Kappa, hybrid stacking consistently secured higher rankings than RF, XGB, and GB, while only narrowly trailing CB and ET in AUC. The adjusted p-values—many of which are close to 1.0 when comparing hybrid stacking and voting—reflect extremely small performance differences between these two hybrids, indicating that both are highly robust and balanced models. From a practical perspective, this stability is critical: in healthcare applications, consistent superiority across multiple metrics is often more meaningful than achieving statistical separation, especially when differences between top models are inherently small.
Interpreting non-significant results more critically, one can argue that the limited sample size of the obesity dataset and the maturity of modern ensemble methods contribute to the inability to detect significant differences. When algorithms are all highly optimized, observed performance gaps are subtle and may not cross the statistical threshold, even though they carry practical consequences in real-world applications. Therefore, the Holm correction results reinforce that while hybrid stacking’s advantage is not statistically confirmed, its ranking stability across metrics provides strong evidence of reliability and generalization.
To complement the Holm correction analysis, effect sizes were computed using Cliff’s Delta (δ) for the pairwise comparisons between the hybrid stacking model and alternative baselines. Across accuracy, precision, recall, F1-score, MCC, and Kappa, δ values consistently favored the hybrid stacking model against RF, XGB, and GB, with effect sizes ranging from small to medium. Comparisons between hybrid stacking and voting yielded δ values close to zero, confirming the negligible differences already suggested by the adjusted p-values. For AUC, δ values indicated only marginal differences between hybrid stacking and CB/ET, consistent with the near-tied rankings. These results highlight that, although Holm-corrected p-values did not indicate statistical significance, the effect size analysis demonstrates that hybrid stacking provides practical and measurable performance advantages over traditional ensembles.
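A direct implementation of Cliff's delta is compact, as sketched below; |δ| below roughly 0.147 is conventionally read as negligible.

```python
import numpy as np

def cliffs_delta(a, b) -> float:
    """Cliff's delta in [-1, 1]: P(a > b) - P(a < b) over all sample pairs."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    diff = a[:, None] - b[None, :]
    return float((np.sum(diff > 0) - np.sum(diff < 0)) / diff.size)

# e.g., per-fold accuracies of hybrid stacking vs. GB (placeholder values)
print(cliffs_delta([0.97, 0.96, 0.97], [0.92, 0.93, 0.92]))  # -> 1.0
```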
Model interpretation
Building on the insights from the previous section, it is crucial to explore the impact of clinical and demographic factors on the predictive performance of ensemble learning models in assessing obesity risk. This section reviews the hybrid stacking and voting models by examining their learning curves and utilising XAI techniques. These approaches not only demonstrate the models’ performance behavior but also clarify the contribution of individual features, enhancing our understanding of how specific predictors influence the overall prediction outcomes.
Using learning curves
The learning curves illustrate how the model’s performance (measured by its score) changes on the training and cross-validation datasets as the number of training samples increases. These curves help visualize how the model improves with additional data or iterations, providing insights into whether it is impacted by overfitting or underfitting. The progression of the training and validation scores offers a clear view of the model’s learning behavior and the reliability of its generalization over time.
Figure 16 establishes the reliability of the hybrid models’ learning patterns. The validation curves for both models exhibit smooth (almost linear) trajectories, indicating the absence of both overfitting and underfitting.
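The curves of Fig. 16 can be reproduced with scikit-learn's learning_curve utility, sketched below for the hybrid_stacking estimator from the earlier sketch.

```python
import numpy as np
from sklearn.model_selection import learning_curve

# Scores on growing training subsets, each evaluated with 10-fold CV
sizes, train_scores, val_scores = learning_curve(
    hybrid_stacking, X_train, y_train, cv=10,
    train_sizes=np.linspace(0.1, 1.0, 5), scoring="accuracy", n_jobs=-1)
print("train:", train_scores.mean(axis=1).round(4))
print("validation:", val_scores.mean(axis=1).round(4))
```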
Using XAI
XAI comprises techniques that enhance the transparency and interpretability of AI models, ensuring their decisions are understandable to human experts60. It supports both global and local explanations, which are essential for making models trustworthy and practical in healthcare. Global explanations identify key factors influencing disease outcomes, aiding clinicians and researchers, while local explanations clarify how these factors affect individual patients, bridging research and clinical practice.
To enhance the interpretability of the proposed hybrid stacking and voting models for obesity prediction, this study utilizes the SHAP method. Based on Shapley values from cooperative game theory, SHAP provides a consistent framework for quantifying each feature’s contribution to individual predictions, making it a widely adopted tool for explaining complex machine learning models61. Additionally, we used LIME for local or instantaneous feature interpretation62, allowing for quick, case-specific insights into model predictions, which is particularly useful for real-time decision-making.
Global explanation
Global explanations offer a comprehensive understanding of an AI model’s behaviour across an entire patient population by identifying key features—such as age, genetic markers, and lab results—that influence predictions. This comprehensive analysis ensures alignment with medical knowledge, validates model decisions, and identifies inconsistencies that necessitate refinement.
Beyond validation, global explanations help identify biases, ensure fairness across demographic groups, and support compliance with ethical and regulatory standards, such as the GDPR, HIPAA, and FDA guidelines. This transparency fosters trust in AI-driven medical decision-making.
In this study, mean absolute SHAP feature importance was employed to rank features based on their overall impact on predictions. By focusing on absolute values, this method highlights the strength of each feature’s influence, facilitating clearer comparisons and enhancing model interpretability.
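A sketch of this mean-|SHAP| computation follows; since the hybrid ensembles are not plain tree models, the model-agnostic KernelExplainer is assumed, with small background and evaluation samples to keep it tractable, and the output shape varies across shap versions.

```python
import numpy as np
import shap

background = shap.sample(X_train, 100, random_state=42)
explainer = shap.KernelExplainer(hybrid_stacking.predict_proba, background)
shap_values = explainer.shap_values(shap.sample(X_test, 50, random_state=42))

# Average |SHAP| over samples and classes to get one importance per feature
if isinstance(shap_values, list):            # older shap: one array per class
    mean_abs = np.mean([np.abs(sv).mean(axis=0) for sv in shap_values], axis=0)
else:                                        # newer shap: (samples, features, classes)
    mean_abs = np.abs(shap_values).mean(axis=(0, 2))

for name, value in sorted(zip(X.columns, mean_abs), key=lambda t: -t[1]):
    print(f"{name}: {value:.4f}")
```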
Figure 17 highlights the significance of various features in predicting obesity using hybrid models. The features are ranked in descending order according to their mean absolute SHAP values, which reflect the overall impact of each feature on the model’s predictions, regardless of whether the influence is positive or negative. The x-axis represents the mean SHAP value, indicating the magnitude of a feature’s contribution. While both models demonstrate that all features are involved in the prediction process, the relative importance of these features differs significantly between the two hybrid approaches.
Both models suggest that the feature WT (weight) is by far the most influential predictor of obesity risk, exhibiting significantly higher SHAP values than other features. On the other hand, SK (smoking) and CC (calorie consciousness) have minimal influence on the prediction. Other features, such as GD (gender), FH (family history), HT (height), and AG (age), also contribute substantially.
Local explanation
Local explanations offer critical insights into individual model predictions, particularly in healthcare, where decisions must be tailored to patient-specific characteristics. By identifying influential factors such as biomarkers or medical history, these explanations enhance transparency, support personalised treatment planning, and foster trust between clinicians and patients. Additionally, they assist in detecting and correcting potential errors by revealing the key features behind misclassifications.
In this study, LIME plots were utilised to interpret predictions from the hybrid stacking model for obesity prediction. LIME generates localised explanations by approximating model behaviour around specific instances, making it efficient for real-time applications. Compared to SHAP, which provides precise but computationally intensive explanations using Shapley values, LIME offers quicker, more flexible insights suitable for exploratory analysis. The complementary use of LIME and SHAP ensures a balance between interpretability depth and computational efficiency, enabling informed, patient-centric clinical decision-making.
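A sketch of the corresponding LIME call for the fourth test patient of Fig. 18 follows; the class names and the number of displayed features are assumptions.

```python
import numpy as np
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    training_data=np.asarray(X_train),
    feature_names=list(X.columns),
    class_names=[str(c) for c in np.unique(y)],
    mode="classification")

# Explain the fourth test instance (index 3) around its local neighbourhood
exp = explainer.explain_instance(np.asarray(X_test)[3],
                                 hybrid_stacking.predict_proba,
                                 num_features=10, top_labels=3)
print(exp.as_list(label=exp.available_labels()[0]))
```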
The LIME outputs of hybrid stacking and voting models are shown in Fig. 18. It provides insights into feature contributions, prediction probabilities, and actual feature values for a specific individual (the fourth patient in our case). The visualization demonstrates the influence of individual features on the model’s decision-making process. On the left, the plot displays the predicted probabilities for various obesity categories. The middle section presents the contribution of each feature to the prediction, while the right side lists the parameters along with their corresponding values.
Figure 18(a) shows that the model predicts class 6 (overweight level II) with high confidence and a probability of 67%. The next closest prediction is class 1 (normal weight), with a probability of 21%, which could indicate a related obesity class. Meanwhile, class 5 (overweight level I) and class 2 (obesity type I) have probabilities of 7% and 4%, respectively, while the other classes (underweight, obesity type II, and obesity type III) have the lowest prediction probability of 1%. For example, the weight feature (WT ≤ 107.43) influences whether the prediction falls under class 1 (normal weight) or not. Also, fruit and vegetable consumption (FV ≤ 2.00) and gender (GD ≤ 1.00) impact the classification towards obesity. Age (AG ≥ 22.78) also contributes to the classification.
The hybrid voting model, too, exhibits somewhat similar feature behaviour. As shown in Fig. 18(b), the model predicts class 6 (overweight level II) with the highest confidence score of 42%, indicating that this individual is most likely in this obesity category. Also, class 1 (normal weight) is the second most probable classification, with a probability of 25%, followed by class 5 (overweight level I) at 18%, class 2 (obesity type I) at 11%, and other classes at 4%. The features weight (WT ≤ 107.43) plays a major role in distinguishing between obesity categories, whereas less fruit and vegetable consumption (FV ≤ 2.00) is linked with obesity. Similarly, AG ≥ 22.78 indicates that older individuals are more likely to be classified under obesity categories. Moreover, FA ≤ 0.32 indicates that a lack of exercise may cause obesity.
Comparing the proposed model with the state-of-the-art
The effectiveness of the proposed model was assessed through a comparative analysis with similar studies, utilising a range of performance metrics, as presented in Table 10. For this comparison, only studies that employed at least one ensemble learning approach for obesity prediction were included. We compared our proposed hybrid stacking model, which consistently outperformed the proposed hybrid voting model in our experiment.
The proposed hybrid stacking model demonstrates competitive and well-rounded performance compared to state-of-the-art methods reported in the literature. Unlike earlier works that primarily focused on individual ensemble learners (e.g., XGB, GB, RF, or MLP), this work integrates a diverse set of base classifiers (LR, KNN, MLP, SVM, and NB) within a hybrid stacking framework, supported by multiple ensemble learners (CB, XGB, GB, ADB, LGBM, RF, ET, BME, DT). This diversity ensures a balance between bias and variance reduction, enhancing robustness across metrics.
The proposed hybrid stacking approach achieved 96.88% accuracy, which is comparable to the highest values reported in the literature (e.g., 99.4% by Choudhuri33 and 98.79% by Bag et al.21). However, unlike these models that reported only accuracy, this work offers a comprehensive evaluation across precision (97.01%), recall (96.88%), F1-score (96.87%), MCC (96.38%), and AUC (99.42%), showcasing its balanced performance. Many prior studies either omitted key metrics (e.g., 18,32) or achieved imbalanced trade-offs between accuracy and other indicators. In contrast, the proposed model demonstrates consistency across all evaluation dimensions, an essential requirement for reliable obesity prediction.
A major advancement of this work lies in integrating statistical validation and XAI, using SHAP and LIME. Most prior studies (e.g., 16,22,28,31,33) lacked interpretability, limiting clinical applicability. Even studies that included XAI (e.g., 35) did not combine it with rigorous statistical validation. The dual emphasis on interpretability and statistical robustness sets this work apart, ensuring both scientific validity and practical usability.
The model is validated on the widely used OCPM dataset with seven obesity classes, making it more challenging than binary-class setups in other works (e.g., 22,24,25,26,35). Achieving near state-of-the-art accuracy in such a complex multi-class setting underscores the generalization capability of the proposed model.
The comparison results validate the efficacy and competitiveness of the hybrid stacking model in accurately classifying and predicting varying levels of obesity. The improved performance of our model using hybrid stacking may be attributed to the selection of constituent models, the choice of meta-learner, effective cross-validation, and meticulous hyperparameter tuning.
Discussion, clinical relevance, and practical applications
The experimental results presented in the previous sections confirm the efficacy of ensemble learning. Specifically, our designed hybrid stacking and voting models achieved superior results compared to the individual ensemble and machine learning models. Hybrid stacking outperformed hybrid voting across all metrics and folds, showcasing its ability to effectively balance bias and variance. This aligns with the theoretical advantage of hybrid stacking, which utilises diverse base models to create a more generalised and accurate meta-model. In general, hybrid stacking and voting excel at harnessing the strengths of diverse models, while bagging and boosting focus on enhancing individual weak learners. Boosting aims to reduce bias, while bagging and hybrid voting primarily minimise variance. Hybrid stacking can address both aspects, depending on the chosen base models and the meta-model. Therefore, as anticipated, hybrid stacking demonstrated superior results compared to hybrid voting in predicting various obesity levels.
The learning curves for the hybrid models indicated smooth trajectories for both training and validation, suggesting that neither overfitting nor underfitting was a significant issue. This serves as a positive indicator of the models’ reliability and generalisability, particularly given the complexity of the hybrid stacking approach. The consistent performance across folds further reinforces the robustness of the hybrid ensemble models.
The statistical analysis using the Friedman’s aligned ranks test and the post hoc Holm method revealed that while the hybrid stacking and voting models consistently outperformed traditional ensemble models such as RF, GB, and XGB, the differences were not statistically significant. This suggests that although hybrid stacking and voting offer practical advantages in terms of balanced performance across multiple metrics, their superiority over established models like GB and XGB may not be conclusive. This finding emphasises the importance of context in model selection: while hybrid stacking may be optimal for achieving high accuracy and generalisability, simpler models such as GB or XGB may suffice in situations where computational efficiency and interpretability are crucial.
The study’s use of SHAP and LIME for interpretability represents a significant strength. SHAP analysis demonstrated that weight is the most influential feature in predicting obesity, aligning with clinical and epidemiological understanding. Other features, such as height, age, and gender, also contributed meaningfully, whereas smoking and caloric consciousness had minimal impact. This finding underscores the role of lifestyle factors like diet and physical activity in obesity, while also emphasising the limited influence of certain behavioural factors, such as smoking. LIME’s local interpretability provided deeper insights into individual predictions. For instance, the hybrid stacking model predicted overweight level II with 67% confidence for a particular individual, with weight and fruit/vegetable consumption as key contributors. This level of detailed interpretability is vital for healthcare applications, where understanding the rationale behind predictions can guide personalised interventions.
The slight performance difference between the hybrid ensemble models indicates that hybrid stacking may provide a marginally superior optimization of decision boundaries in comparison to hybrid voting-based aggregation. Nonetheless, both models consistently surpass other conventional ensemble techniques, validating their use for complex predictive tasks.
The study’s findings hold significant implications for the design of ensemble models in healthcare and other fields. While hybrid stacking delivers the best overall performance, its higher computational cost and complexity may restrict its practicality in real-time or resource-limited applications. In such instances, where interpretability, simplicity, or computational efficiency are the priorities and a modest trade-off in accuracy is acceptable, hybrid voting or even individual models like GB or XGB might be more appropriate. Furthermore, the interpretability offered by SHAP and LIME enhances the practical utility of these models, rendering them valuable tools for decision-making in clinical and public health settings.
Among the reported evaluation measures, recall (sensitivity) carries particular clinical significance, as it directly reflects the model’s ability to correctly identify individuals at risk of obesity and thereby minimize false negatives. In a healthcare context, failing to detect at-risk individuals can delay early interventions and increase the likelihood of progression to severe obesity and associated comorbidities. The high recall values achieved by the stacking (89.85%) and voting (80.71%) models therefore indicate their reliability in capturing at-risk cases, which is essential for preventive counselling and timely management. Complementing recall, precision and F1-score ensure that predictions are not only sensitive but also balanced, reducing the chances of over-alerting or unnecessary interventions. The strong performance in AUC further demonstrates the robustness of these models in distinguishing between risk categories across varying thresholds, reinforcing their suitability for deployment in clinical decision support. Taken together, these clinically relevant metrics underline the potential of the proposed models to support early detection, guide targeted interventions, and improve patient outcomes in obesity prevention and management.
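The clinically oriented metrics discussed above can be computed as in the following sketch, with a stand-in model and synthetic data in place of the fitted hybrid ensembles.

```python
# Sketch of the clinically oriented evaluation: macro recall (sensitivity),
# precision, F1, and one-vs-rest AUC for a multiclass classifier. Data and
# model are illustrative stand-ins.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (classification_report, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1500, n_features=16, n_informative=8,
                           n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
y_pred, y_proba = model.predict(X_te), model.predict_proba(X_te)

print("recall   :", recall_score(y_te, y_pred, average="macro"))   # high = few missed at-risk cases
print("precision:", precision_score(y_te, y_pred, average="macro"))
print("F1       :", f1_score(y_te, y_pred, average="macro"))
print("AUC (OvR):", roc_auc_score(y_te, y_proba, multi_class="ovr"))
print(classification_report(y_te, y_pred))  # per-class sensitivity by obesity level
```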
The SHAP and LIME analyses of our hybrid models not only enhance interpretability but also highlight practical pathways for clinical and public health interventions in obesity prevention and management. SHAP results identified weight as the most influential predictor, reaffirming its central role in clinical assessments, while lifestyle factors such as fruit and vegetable consumption, physical activity, and age emerged as key modifiable determinants. Conversely, features like smoking and calorie consciousness were found to have minimal influence, suggesting that interventions may yield greater impact when prioritizing dietary and activity-related behaviors. From a clinical standpoint, these findings support targeted counseling, such as encouraging increased fruit and vegetable intake, structured exercise regimens, and weight management strategies tailored to the patient’s demographic and physiological profile. LIME complements this by offering patient-specific explanations—demonstrating, for example, how a particular individual’s low physical activity or limited fruit and vegetable intake shifted the model’s prediction toward higher obesity risk. Such individualized insights can be integrated into clinical dashboards, allowing healthcare providers to discuss tangible lifestyle changes with patients, thereby fostering engagement and adherence.
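A LIME explanation of the patient-level kind illustrated above can be produced roughly as follows; the dataset, feature names, and class names are hypothetical stand-ins rather than the study’s actual variables.

```python
# Sketch of a LIME local explanation for a single individual; all names and
# data below are hypothetical stand-ins.
import numpy as np
from lime.lime_tabular import LimeTabularExplainer
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
feature_names = ["weight", "height", "age", "veg_intake", "activity"]
class_names = ["normal", "overweight_I", "overweight_II", "obesity_I"]
X = rng.normal(size=(400, len(feature_names)))
y = rng.integers(0, len(class_names), size=400)
model = RandomForestClassifier(random_state=0).fit(X, y)

explainer = LimeTabularExplainer(X, feature_names=feature_names,
                                 class_names=class_names, mode="classification")
# Explain one individual's prediction: which features pushed it up or down.
exp = explainer.explain_instance(X[0], model.predict_proba, num_features=5)
print(exp.as_list())  # e.g., [('weight > 0.61', 0.12), ...] per-feature weights
```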
Beyond individual care, these interpretable models can inform public health initiatives by identifying high-impact behaviors for population-level interventions. Campaigns focused on nutritional education, promoting regular physical activity, and age-specific obesity prevention strategies could be prioritized based on these findings. Moreover, transparent explanations from SHAP and LIME enhance trust among clinicians and patients, ensuring that model recommendations are not perceived as opaque “black-box” outputs but rather as evidence-supported guidance consistent with medical knowledge. This interpretability bridges the gap between algorithmic predictions and actionable decisions, advancing precision obesity management at both the individual and community level.
While the proposed ensemble models showed strong and consistent predictive performance, it is important to recognize the potential for algorithmic bias. Predictive models trained on lifestyle and demographic features may inadvertently reflect patterns related to gender, age, or socioeconomic status, which could result in biased outcomes if applied across diverse clinical populations. In this study, the dataset was solely derived from Latin American cohorts, which limits demographic diversity and may restrict the generalizability of the findings to other ethnic or socioeconomic groups. Additionally, features such as education level, dietary access, or income, although not directly included, may still be indirectly captured through lifestyle proxies, thereby risking the reinforcement of existing health disparities if applied without critical assessment.
Although steps such as employing interpretable methods (SHAP and LIME) enable greater transparency in identifying influential features, they do not fully eliminate the risk of bias. Therefore, the outputs of the proposed models should be viewed as decision-support tools rather than standalone diagnostic systems. Future work should focus on external validation with more diverse populations, systematic bias audits, and fairness-aware modeling approaches to mitigate disproportionate impacts across vulnerable groups.
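A basic subgroup audit of the kind recommended here can be sketched as follows, comparing macro recall across a sensitive attribute; all values are synthetic placeholders, and in practice `y_true` and `y_pred` would come from the fitted ensemble.

```python
# Minimal sketch of a subgroup bias audit: compare macro recall across a
# sensitive attribute (a hypothetical sex column). All data are synthetic.
import numpy as np
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 4, size=500)                 # placeholder labels
y_pred = rng.integers(0, 4, size=500)                 # placeholder predictions
sex = rng.choice(["female", "male"], size=500)        # hypothetical attribute

for group in np.unique(sex):
    mask = sex == group
    r = recall_score(y_true[mask], y_pred[mask], average="macro")
    print(f"{group}: macro recall = {r:.3f}")  # large gaps flag potential bias
```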
Conclusions, limitations, and further scope
This study highlights the effectiveness of ensemble learning in predicting obesity status through lifestyle and anthropometric data. By combining different learners, ensemble methods consistently outperformed individual algorithms, with hybrid stacking exhibiting the most promising predictive abilities. The success of hybrid ensembles emphasizes the importance of integrating complementary modeling strategies to achieve robust and reliable results in health prediction tasks. In particular, the inclusion of modifiable lifestyle variables—such as dietary habits and alcohol consumption—demonstrates the potential of these models to inform prevention strategies that can be customized for individuals, thereby increasing their practical utility in clinical and public health settings.
Beyond predictive accuracy, the comparative analysis reveals actionable insights into the strengths of hybrid ensemble methods. Hybrid stacking and voting approaches not only achieved consistently high rankings across evaluation metrics but also demonstrated resilience against overfitting and variability. Although statistical tests did not reveal significant differences among competing models, the stable performance of hybrid ensembles points to their practical reliability: in real-world applications, such methods can support decision-making processes even in the absence of strong statistical distinctions. Importantly, the integration of SHAP and LIME enhances model transparency by highlighting the contribution of key lifestyle and physical determinants. Such interpretability bridges the gap between algorithmic predictions and actionable health recommendations, making these models well-suited for adoption by clinicians and policymakers. Overall, the study makes methodological advancements in ensemble-based obesity prediction and provides practical insights into designing interpretable, reliable, and user-oriented decision-support tools.
Despite its strengths, the study has certain limitations. The lack of statistically significant differences between the hybrid models and traditional ensemble models suggests that further testing with larger datasets or varied configurations may be needed to confirm the findings. A notable limitation concerns the dataset itself, which was derived solely from a Latin American cohort (Colombia, Peru, and Mexico). While the models demonstrated strong predictive performance within this group, the exclusion of other ethnicities, age groups, and socioeconomic backgrounds limits the broader applicability of the results. Thus, it remains uncertain whether the proposed interpretable ensemble models would achieve similar accuracy and reliability in more diverse populations. Future research should validate these approaches using larger and more diverse datasets to ensure broader relevance and robustness across different demographic contexts. Additionally, the study focused on lifestyle data; including other factors, such as genetic or environmental data, could improve model performance and interpretability. Some overfitting was observed in certain classes during training, though not in testing. Future work should prioritize external validation, bias assessment, and responsible deployment, ideally integrating these ensemble predictors into decision-support systems rather than deploying them as standalone diagnostic tools. This research could be extended by coupling the proposed model with wearable devices for real-time monitoring and early intervention in obesity management. Further studies may also explore deep learning models or more advanced meta-learners in hybrid stacking to enhance predictive accuracy.
Data availability
The datasets used during the current study are available in the Kaggle repository: https://www.kaggle.com/datasets/ankurbajaj9/obesity-levels.
Code availability
The code used in this study is openly available on GitHub at the following repository: https://github.com/Shahid92-Phd/Obesity-Prediction.git.
References
World Health Organization. Obesity and overweight (1 March 2024). https://www.who.int/news-room/fact-sheets/detail/obesity-and-overweight (accessed 5 April 2024).
Yao, Z. et al. Associations between class I, II, or III obesity and health outcomes. NEJM Evidence, 4 (4), EVIDoa2400229 (2025).
Ferreras, A. et al. Systematic review of machine learning applied to the prediction of obesity and overweight. J. Med. Syst. 47 (8) (2023).
DeGregory, K. W. et al. A review of machine learning in obesity. Obes. Rev. 19 (5), 668–685 (2018).
Ganie, S. M., Pramanik, P. K. D., Mallik, S. & Zhao, Z. Chronic kidney disease prediction using boosting techniques based on clinical parameters. PLoS ONE. 18 (12), e0295234 (2023).
Rautiainen, I. & Äyrämö, S. Predicting overweight and obesity in later life from childhood data: A review of predictive modeling approaches. In Computational Sciences and Artificial Intelligence in Industry. Intelligent Systems, Control and Automation: Science and Engineering Vol. 76 (eds Tuovinen, T. et al.) 203–220 (Springer, 2022).
Safaei, M., Sundararajan, E. A., Driss, M., Boulila, W. & Shapi’i, A. A systematic literature review on obesity: Understanding the causes & consequences of obesity and reviewing various machine learning approaches used to predict obesity. Comput. Biol. Med. 136, 104754 (2021).
Ganie, S. M., Pramanik, P. K. D., Malik, M. B., Mallik, S. & Qin, H. An ensemble learning approach for diabetes prediction using boosting techniques. Front. Genet. 14 (2023).
Ganie, S. M., Pramanik, P. K. D., Malik, M. B., Nayyar, A. & Kwak, K. S. An improved ensemble learning approach for heart disease prediction using boosting algorithms. Comput. Syst. Sci. Eng. 46 (3), 3993–4006 (2023).
Ganie, S. M., Pramanik, P. K. D. & Zhao, Z. Ensemble learning with explainable AI for improved heart disease prediction based on multiple datasets. Sci. Rep. 15, 13912 (2025).
Mienye, I. D. & Sun, Y. A survey of ensemble learning: Concepts, algorithms, applications, and prospects. IEEE Access 10, 99129–99149 (2022).
Ganie, S. M. & Pramanik, P. K. D. A comparative analysis of boosting algorithms for chronic liver disease prediction. Healthc. Anal. 5, 100313 (2024).
Ganie, S. M. & Pramanik, P. K. D. Interpretable lung cancer risk prediction using ensemble learning and XAI based on lifestyle and demographic data. Comput. Biol. Chem. 117, 108438 (2025).
Dutta, R. R., Mukherjee, I. & Chakraborty, C. Obesity disease risk prediction using machine learning. Int. J. Data Sci. Anal. 19, 709–718 (2025).
Osadchiy, V. et al. Machine learning model to predict obesity using gut metabolite and brain microstructure data. Sci. Rep. 13, 5488 (2023).
Kaur, R., Kumar, R. & Gupta, M. Predicting risk of obesity and meal planning to reduce the obese in adulthood using artificial intelligence. Endocrine 78, 458–469 (2022).
Ferdowsy, F., Rahi, K. S. A., Jabiullah, M. I. & Habib, M. T. A machine learning approach for obesity risk prediction. Curr. Res. Behav. Sci. 2, 100053 (2021).
Maria, A. S., Sunder, R. & Kumar, R. S. Obesity risk prediction using machine learning approach. In International Conference on Networking and Communications (ICNWC), Chennai, India (2023).
Jindal, K., Baliyan, N. & Rana, P. S. Obesity prediction using ensemble machine learning approaches. In Recent Findings in Intelligent Computing Techniques. Advances in Intelligent Systems and Computing Vol. 708, 355–362 (Springer, 2018).
Khodadadi, N., Saber, M. & Abotaleb, M. A data-driven approach for obesity classification using machine learning. J. Artif. Intell. Metaheuristics 3 (2), 8–17 (2023).
Bag, H. G. G. et al. Estimation of obesity levels through the proposed predictive approach based on physical activity and nutritional habits. Diagnostics 13 (18), 2949 (2023).
Singh, B. & Tawfik, H. Machine learning approach for the early prediction of the risk of overweight and obesity in young people. In Computational Science (ICCS 2020). Lecture Notes in Computer Science Vol. 12140 (eds Krzhizhanovskaya, V. V. et al.) 523–535 (Springer, 2020).
Lim, H., Lee, H. & Kim, J. A prediction model for childhood obesity risk using the machine learning method: A panel study on Korean children. Sci. Rep. 13, 10122 (2023).
Thamrin, S. A., Arsyad, D. S., Kuswanto, H., Lawi, A. & Nasir, S. Predicting obesity in adults using machine learning techniques: An analysis of Indonesian basic health research 2018. Front. Nutr. 8, 669155 (2021).
Pang, X., Forrest, C. B., Lê-Scherban, F. & Masino, A. J. Prediction of early childhood obesity with machine learning and electronic health record data. Int. J. Med. Inform. 150, 104454 (2021).
Jeon, J., Lee, S. & Oh, C. Age-specific risk factors for the prediction of obesity using a machine learning approach. Front. Public. Health. 10, 998782 (2022).
Khater, T., Tawfik, H. & Singh, B. Machine learning for the classification of obesity levels based on lifestyle factors. In 7th International Conference on Cloud and Big Data Computing, Manchester, UK (2023).
Rodríguez, E., Rodríguez, E., Nascimento, L., Silva, A. & Marins, F. Machine learning techniques to predict overweight or obesity. In 4th International Conference on Informatics & Data-Driven Medicine, Valencia, Spain (2021).
Jeon, S., Kim, M., Yoon, J., Lee, S. & Youm, S. Machine learning-based obesity classification considering 3D body scanner measurements. Sci. Rep. 13, 3299 (2023).
Suresh, C. et al. Obesity prediction based on daily lifestyle habits and other factors using different machine learning algorithms. In Proceedings of Second International Conference on Advances in Computer Engineering and Communication Systems. Algorithms for Intelligent Systems (eds Reddy, A. B., Kiranmayee, B., Mukkamala, R. & Srujan Raju, K.) 397–407 (Springer, 2022).
Diayasa, I. G. S. M., Idhom, M., Fauzi, A. & Damaliana, A. T. Stacking ensemble methods to predict obesity levels in adults. In 8th Information Technology International Seminar (ITIS), Surabaya, Indonesia (2022).
Solomon, D. D. et al. Hybrid majority voting: Prediction and classification model for obesity. Diagnostics 13 (15), 2610 (2023).
Choudhuri, A. A hybrid machine learning model for Estimation of obesity levels. In Data Management, Analytics and Innovation. ICDMAI 2022. Lecture Notes on Data Engineering and Communications Technologies Vol. 137 (eds Goswami, S. et al.) 315–329 (Springer, 2023).
Ganie, S. M., Reddy, B. B., Hemachandran, K. & Rege, M. An investigation of ensemble learning techniques for obesity risk prediction using lifestyle data. Decis. Analytics J. 14, 100539 (2025).
Lin, W., Shi, S., Huang, H., Wen, J. & Chen, G. Predicting risk of obesity in overweight adults using interpretable machine learning algorithms. Front. Endocrinol. 14 (2023).
Zhou, Z. H. Ensemble Methods: Foundations and Algorithms (Chapman & Hall/CRC, 2012).
Sarmah, U., Borah, P. & Bhattacharyya, D. K. Ensemble learning methods: an empirical study. SN Comput. Sci. 5, 924 (2024).
Liu, Z. Ensemble learning. In Artificial Intelligence for Engineers 221–242 (Springer, 2025).
Freund, Y. & Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55 (1), 119–139 (1997).
Breiman, L. Bagging predictors. Mach. Learn. 24, 123–140 (1996).
Breiman, L. Random forests. Mach. Learn. 45 (1), 5–32 (2001).
Geurts, P., Ernst, D. & Wehenkel, L. Extremely randomized trees. Mach. Learn. 63, 3–42 (2006).
Zhao, C. et al. BoostTree and BoostForest for ensemble learning. IEEE Trans. Pattern Anal. Mach. Intell. 45 (7), 8110–8126 (2023).
Aziz, N. et al. A study on gradient boosting algorithms for development of AI monitoring and prediction systems. In International Conference on Computational Intelligence (ICCI), Malaysia (2020).
Chen, T. & Guestrin, C. XGBoost: A scalable tree boosting system. In 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, USA (2016).
Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V. & Gulin, A. CatBoost: Unbiased boosting with categorical features. Advances in Neural Information Processing Systems (NeurIPS 2018) 31, 6237–6249 (2018).
Ke, G. et al. LightGBM: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems (NIPS 2017) 30, 3146–3154 (2017).
Freund, Y. & Schapire, R. E. A short introduction to boosting. J. Japanese Soc. Artif. Intell. 14 (5), 771–780 (1999).
Zhang, Y., Zhang, H., Cai, J. & Yang, B. A weighted voting classifier based on differential evolution. Abstr. Appl. Anal. 2014, 376950 (2014).
Giraud-Carrier, C. Combining Base-Learners into ensembles. In Metalearning. Cognitive Technologies (eds. Brazdil, P., Jan N. van Rijn, Soares, C., Vanschoren, J.) 169–188 (Springer, 2022).
Ganie, S. M., Pramanik, P. K. D. & Zhao, Z. Enhanced and interpretable prediction of multiple cancer types using a stacking ensemble approach with SHAP analysis. Bioengineering 15, 472 (2025).
Hastie, T., Tibshirani, R. & Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Springer, 2009).
Cover, T. & Hart, P. E. Nearest neighbor pattern classification. IEEE Trans. Inf. Theory. 13 (1), 21–27 (1967).
Murtagh, F. Multilayer perceptrons for classification and regression. Neurocomputing 2 (5–6), 183–197 (1991).
McCallum, A. & Nigam, K. A comparison of event models for naive Bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization 752 (1), 41–48 (1998).
Schapire, R. E. & Singer, Y. Improved boosting algorithms using confidence-rated predictions. Mach. Learn. 37, 297–336 (1999).
Palechor, F. M. & Manotas, A. Dataset for estimation of obesity levels based on eating habits and physical condition in individuals from Colombia, Peru and Mexico. Data Brief 25, 104344 (2019).
Hodges, J. L. Jr. & Lehmann, E. L. Rank methods for combination of independent experiments in analysis of variance. In Selected Works of E. L. Lehmann. Selected Works in Probability and Statistics (ed. Rojo, J.) 403–418 (Springer, 2012).
Holm, S. A simple sequentially rejective multiple test procedure. Scand. J. Stat. 6 (2), 65–70 (1979).
Nauta, M. et al. From anecdotal evidence to quantitative evaluation methods: A systematic review on evaluating explainable AI. ACM Comput. Surv. 55 (13s), 1–42 (2023).
Lundberg, S. M. & Lee, S. I. A unified approach to interpreting model predictions. In 31st International Conference on Neural Information Processing Systems (NIPS’17), Long Beach, California (2017).
Ribeiro, M. T., Singh, S. & Guestrin, C. “Why should I trust you?” Explaining the predictions of any classifier. In 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16), San Francisco, California (2016).
Acknowledgements
This work was supported by the Deanship of Scientific Research, the Vice Presidency for Graduate Studies and Scientific Research, King Faisal University, Saudi Arabia, under the project (KFU253314).
Funding
ZZ is partially funded by his startup fund at The University of Texas Health Science Center at Houston, Houston, Texas, USA.
Author information
Authors and Affiliations
Contributions
Study conception and design: SMG, PKDP. Design and implementation: SMG. Analysis: SMG, PKDP. Manuscript draft: PKDP. Supervision: ZZ. Funding: ZZ. Manuscript edits and finalization: all authors.
Corresponding authors
Ethics declarations
Declaration of generative AI and AI-assisted technologies in the writing process: During the preparation of this work, the authors used GPT-4o to improve the readability and language of the manuscript. After using this tool/service, the authors reviewed and edited the content as needed and take full responsibility for the content of the published article.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Ganie, S.M., Pramanik, P.K.D. & Zhao, Z. Lifestyle data-based multiclass obesity prediction with interpretable ensemble models incorporating SHAP and LIME analysis. Sci Rep 15, 36916 (2025). https://doi.org/10.1038/s41598-025-20936-4