Abstract
Obesity is closely associated with various chronic diseases.Therefore, accurate, reliable and cost-effective methods for preventing its occurrence and progression are required. In this study, we developed a visualized obesity risk prediction system based on machine learning techniques, aiming to achieve personalized comprehensive health management for obesity. The system utilized a dataset consisting of 1678 anonymized health examination records, including individual lifestyle factors, body composition, blood routine, and biochemical tests. Ten multi-classification machine learning models, including Random Forest and XGBoost, were constructed to identify non-obese individuals (BMI < 25), class 1 obese individuals (25 ≤ BMI < 30), and class 2 obese individuals (30 ≤ BMI). By evaluating the performance of each model on the test set, we selected XGBoost as the best model and built the visualized obesity risk prediction system based on it. The system exhibited good predictive performance and interpretability, directly providing users with their obesity risk levels and determining corresponding intervention priorities. In conclusion, the developed obesity risk prediction system possesses high accuracy and interactivity, aiding physicians in formulating personalized health management plans and achieving comprehensive and accurate obesity management.
Similar content being viewed by others
Introduction
Obesity has become a significant public health issue worldwide, not only reducing individuals’ quality of life but also serving as a major risk factor for conditions such as diabetes, hypertension, and cardiovascular diseases1,2,3. In China, approximately one-fifth of children and half of adults suffer from overweight or obesity problems, making it the country with the highest number of overweight or obese people in the world4. However, most Chinese adults perceive obesity merely as a cosmetic problem rather than a life-threatening health issue, resulting in inadequate attention to obesity prevention and management5. To address the obesity epidemic, healthcare professionals need the ability to identify individuals at high risk of obesity, conduct necessary health education, and formulate personalized health management goals6. In this context, finding effective methods to predict the risk of obesity, identify individual obesity-related factors, becomes crucial for effective prevention and management of obesity.
The lifestyle and blood biochemical indicators of individuals are important bases for predicting the risk of obesity occurrence7, researchers mostly use traditional statistical methods such as correlation analysis to predict the risk of obesity occurrence8,9. With the development of artificial intelligence technology, machine learning, as a powerful tool for data analysis and prediction10, has provided new opportunities for predicting the risk of chronic diseases11,12,13. Previous studies, such as Islam et al., predicted hypertension risk using four machine learning algorithms, including Random Forest, after screening key risk factors with Least Absolute Shrinkage and Selection Operator (LASSO)14. Artzi et al. predicted gestational diabetes in the early stages of pregnancy based on electronic health records15. Davide et al. used Neural Network models to predict the incidence of various neonatal diseases16. From this, it can be seen that machine learning techniques can analyze the lifestyle and blood biochemical indicators of individuals to discover correlations and trends, predicting the risk of obesity for individuals. In addition, previous studies often only constructed machine learning models, and these models had a ‘black box’ characteristic, meaning it was difficult to explain which variables determined the prediction results. This posed challenges to formulating intervention measures to reduce adverse events and greatly limited the application of research results17,18.
To overcome these limitations, accurately predict individual obesity risk, and provide reasonable explanations for the prediction results, this study evaluated the performance of Random Forest (RF), XGBoost (XGB), Support Vector Machine (SVM), LightGBM (LGB), Decision Tree (DT), Gradient Boosting Tree (GBT), Multi-Layer Perceptron (MLP), K-Nearest Neighbors (KNN), Backpropagation Neural Network (BPNN), and Logistic Regression (LR) models in obesity prediction. Obtain the feature importance of the machine learning model with the best performance and combine it with SHapley Additive Explanation (SHAP) to develop an online obesity risk prediction system (Fig. 1). The developed online obesity risk prediction system is deployed on a webpage, allowing information input through various means to predict the risk probability of class 1 obesity in non-obese populations or class 2 obesity in class 1 obese populations. At the same time, the system can also use SHAP technology to explain individual obesity risk prediction results, rank the importance of features influencing obesity risk, and make the results interpretable. Therefore, the developed obesity risk prediction system possesses the advantages of accuracy, comprehensiveness, and practicality, assisting physicians in achieving personalized comprehensive health management for obesity and exhibiting broad potential for application and development.
Results
Research subjects
The dataset contains electronic medical examination records of 1678 individuals. Table S1 shows the distribution of lifestyle and blood biochemical indicators. According to BMI, 865 (51.5%) individuals in the dataset belonged to the non-obese group, 716 individuals (42.6%) belonged to class 1obese group, and 97 individuals belonged to class 2obese group (5.7%). The dataset comprised 68 variables, including gender, age, family medical history, eight lifestyle variables, two blood pressure variables, and 43 blood biochemical variables.
Classification performance
ROC curve is a method for evaluating the performance of classification models, and the area under the curve (AUC) represents the model’s performance, with a larger AUC indicating better performance. As shown in Fig. 2, the AUC values of the macro-average and micro-average ROC curves for RF, XGB, LGB, GBT, BPNN, and LR models were all around 0.95, indicating good predictive performance with high true positive rates and low false positive rates. Accuracy is the ratio of correctly classified samples to total samples. Compared to other models, the XGB model has the highest accuracy (Fig. 3). At the same time, we calculated the precision and recall rates for all models, and F1-Score is the harmonic mean of precision and recall rate (Table S2). The XGB model also has the highest average F1-Score (Fig. 3). These results indicate that the XGB model has the best overall performance among the ten machine learning models.
The ROC curves of different machine learning models for predicting different obesity class. The blue curve represents the non-obesity (class 0), the yellow curve represents the class 1 obese (class 1), and the red curve represents the class 2 obesity (class 2). The micro-average curve is a weighted average of true positive and false positive rates for all samples, represented by a red dashed line. The macro-average curve is the average of the ROC curves for the three class, represented by a blue dashed line.
We analyzed the misclassifications produced by the XGB model using the confusion matrix (Fig. 4). Among the 262 non-obese individuals, 83.97% were correctly classified, while the remaining 16.03% were misclassified as class 1 obese. Among the 208 class 1 obese individuals, 83.65% were correctly classified, 15.38% were misclassified as non-obese, and 0.97% were misclassified as class 2 obese. Among the 34 class 2 obese individuals, 61.76% were correctly classified, while the remaining 38.24% were misclassified as class 1 obese.
The mix matrix of the XGB model classification results based on the test set. The vertical coor-dinates class 0, class 1, and class 2 correspond to the actual number of people who are non-obese, class 1 obese, and class 2 obese, respectively, the horizontal coordinates class 0, class 1, and class 2 correspond to the number of people classified by the XGB model as non-obese, class 1 obese, and class 2 obese, respectively.
Feature importance
The five most important feature indicators of the XGB model obtained by the importance_scores algorithm are: ‘Hip circumference’, ‘Chest circumference’, ‘Body fat mass’, ‘Diet’ and ‘Triglycerides’. These are similar to the contributions of each feature to the prediction results obtained by SHAP (Fig. 5). SHAP is an interpretable artificial intelligence method based on game theory, it adopts the concept of Shapley value from cooperative game theory to assign an importance score to each feature, representing the contribution of each feature to the risk prediction model19,20. In the SHAP plot, the y-axis represents feature sorting, indicating the strength of each feature’s impact on the model, the color of the data points represents the feature values, where red corresponds to high values and blue corresponds to low values. We found that hip circumference was the most important feature in predicting individual obesity risk. Additionally, body fat percentage, chest circumference, triglycerides, dietary habits, and glycated hemoglobin A1 were also important features in the predicting obesity class. Among them, dietary habits had higher predictive value for class 2 obesity and some predictive value for class 1 obesity but no predictive value for non-obesity.
Risk prediction system
In the visualized prediction system, the left side is the information input area, where information can be entered one by one (Fig. 6A) or imported through a file (Fig. 6B). When entering information one by one, continuous variables can be entered by dragging (e.g., age), and categorical variables can be selected by clicking (e.g., gender). The right side of the system is the output window. The upper part displays the current BMI’s obesity class and the prediction result for the next obesity class, while the lower part provides personalized result analysis for intervention strategy development.
Discussion
Accurate and interpretable risk assessment is crucial for obesity prevention and intervention21,22. The proposed obesity prediction system in this study is an important health management tool that can assist physicians in deciding whether to intervene and develop personalized intervention plans.
We first assessed the performance of ten machine learning models in obesity prediction based on health examination data and established an obesity risk assessment model with good predictive performance. In contrast to similar studies23,24,25,26, we did not simply classify the population into obese and non-obese groups, we further divided the obese population into class 1 and class 2 obesity. Additionally, the predictive accuracy of our trained machine learning models on the test set was higher than that in similar studies. This maybe due to the fact that our best model was selected from a larger pool of machine learning models, and we employed a Monte Carlo Cross-Validation algorithm during the training process. The ten machine learning models we used included tree models, deep learning models, and traditional statistical models, and the tree model XGB demonstrated the best predictive performance. This may be because traditional statistical models like LR are more suitable for linear or normally distributed problems, while deep learning models are better suited for image or natural language processing tasks and have poorer predictive performance on small-sample tabular data27.
On the basis of the XGB model, we constructed a visual obesity risk prediction system using the SHAP algorithm, making the output results of the machine learning model interpretable. In this study, in addition to incorporating features such as age, gender, lifestyle, and blood routine, we also included body composition data such as total body water, protein content, and basal metabolic rate as variables. According to the SHAP interpretation results (Fig. 5C), bone mineral content was important predictors of class 2 obesity. This is consistent with the findings of Hwaung et al., which indicate that obesity is not only characterized by excessive adipose tissue but also by changes in characteristics such as protein content in skin and visceral organs28. Among the variables related to blood biochemistry, elevated triglycerides levels increased the risk of class 1 obesity, while elevated glycated hemoglobin and uric acid levels increased the risk of class 2 obesity. Conversely, decreased triglyceride and glycated hemoglobin A1 levels increased the probability of non-obesity (Fig. 5). These findings are in line with the research by Jeon et al., which identified triglycerides, glycated hemoglobin, and uric acid as important features for assessing obesity risk24. Indicators such as triglycerides and glycated hemoglobin are important features for assessing obesity risk, this may be because there is a close relationship between high triglyceride levels and insulin resistance, insulin resistance can lead to fluctuations in blood sugar levels, further stimulating excessive secretion of insulin, elevated insulin levels can promote fat synthesis and storage29,30. Therefore, when predicting obesity risk, it is necessary to consider not only common features such as lifestyle but also body composition and blood biochemical indicators in order to provide early intervention for individuals at high risk of obesity.
In this study, we constructed an obesity risk prediction system based on the XGB model and SHAP technology, which is accessible on a webpage. To explain the usage of this system, we presented an example of its use in Fig. 6. After inputting information into the left-hand input interface, the system indicates that the BMI of the examined individual does not currently reach the obesity level, but the risk probability of class 1 obesity is 34.77%. In the SHAP plot below, the length of the feature bar indicates the strength of its influence on the risk probability, where red represents positive influence and blue represents negative influence. Factors contributing positively to the risk of class 1 obesity include hip circumference, alcohol consumption, and glycated hemoglobin A1, while factors contributing negatively include triglycerides, lymphocyte percentage, red blood cell distribution width, and diet. Therefore, targeted control of factors such as hip circumference, alcohol consumption, and glycated hemoglobin A1 can reduce the risk of class 1 obesity. Our constructed obesity risk prediction system allows information input through individual entries and file reading, providing personalized obesity risk assessment for examination personnel in a more convenient and user-friendly manner, laying the foundation for the practical application of future prediction systems.
We are aware that our study has some limitations. Although we randomly divided the dataset into a training set and a test set, the results may still be influenced by the source of the data due to its single source. Additionally, the class 2 obese population was relatively less in the dataset, although we applied the SMOTE algorithm for over-sampling during model training, the F1-score for class 2 obesity by the best model was still lower than the other two classes, indicating relatively lower predictive ability for class 2 obesity. Lastly, the obesity risk prediction system developed in this study was based on the XGB model. While the XGB model demonstrated the best overall performance, its recall rate for the non-obese class was lower than that of the LGB and BPNN models, suggesting a potential need to improve the model’s predictive performance for the non-obese class through optimization algorithms or other approaches.
Conclusions
In summary, we have constructed an obesity risk assessment model with good predictive performance based on health management center examination data and ten machine learning models. Using SHAP to interpret the model’s output results, we have built an obesity risk prediction system accessible on a webpage. This system not only predicts the risk probability of class 1 obesity in non-obese populations but also forecasts the risk probability of class 2 obesity in class 1 obese populations, enabling comprehensive management of obesity issues. Additionally, our constructed obesity risk prediction system can output the important factors influencing individual obesity risk through SHAP technology, aiding healthcare professionals in providing targeted obesity management and achieving personalized services.
Methods
Study population
The study was conducted in accordance with the Declaration of Helsinki, and approved by the Ethics Committee of Hangzhou Normal University Affiliated Hospital (protocol code 2023E2-KS-110 and date of July 2023). Due to the research involving previously collected anonymous data and the absence of a direct risk of individual privacy disclosure, in accordance with Article 39 of the ‘Ethical Review of Biomedical Research Involving Humans’ in China, the Ethics Committee of Hangzhou Normal University Affiliated Hospital has waived the requirement for informed consent. Throughout the entire research process, we rigorously adhere to ethical principles, maintain transparency, integrity, and comply with all relevant regulations. This study utilized an anonymized dataset of 1678 individuals from the Health Management Center of Hangzhou Normal University Affiliated Hospital, covering the period from May 31, 2022, to May 31, 2023. Individuals who undergo health check-ups are usually those seeking active health management and prevention, and they are not necessarily patients who have already been diagnosed with specific diseases. The study population comprised individuals aged 18 and above who underwent health examinations, excluding pregnant women and individuals with physical disabilities, resulting in a dataset with 1678 health examination records. According to the Asia-Pacific criteria, we divided the study population into non-obese individuals (BMI < 25, class 0), class 1 obese individuals (25 ≤ BMI < 30, class 1), and class 2 obese individuals (30 ≤ BMI, class 2)31. The dataset contained 42.6% class 1 obese individuals and 5.7% class 2 obese individuals.
Data preprocessing
The collected data included information on individuals’ age, gender, lifestyle, blood routine, and biochemical test results, involving a total of 80 variables (the missing rate of 12 variables is more than 30%, the missing rate of 9 variables is less than 30%, and the remaining variables have no missing values). To ensure data quality, we utilized the Multiple Imputation by Chained Equations (MICE) algorithm to impute variables with a missing rate below 30% and removed variables with a missing rate over 30%, ultimately retaining 68 variables32.
Machine learning models
Machine learning models were designed using Python 3.11 in this study. The dataset was randomly divided into a training set (70%) and a test set (30%). To address the issue of imbalanced sample distribution in the training set, SMOTE oversampling technique was applied. RF, XGB, SVM, LGB, DT, GBT, MLP, KNN, BPNN and LR models were constructed, and the parameters of the models were selected using Monte Carlo Cross-Validation.
Model evaluation
To evaluate the performance of the machine learning models, we calculated the Accuracy, Precision, Recall, and F1-score of the models in classifying non-obese, class 1 obese, and class 2 obese individuals in the test set33. We also plotted macro-average and micro-average ROC curves.
Visualized system
Based on the above analysis, we selected the machine learning model with the best predictive performance as the optimal model. We calculated its feature importance index through the importance_scores algorithm and SHAP, respectively, and constructed an online prediction system34.
Data availability
The dataset generated and/or analyzed during the current study are not publicly available but are available from the corresponding author and the first author on reasonable request. The code was based on Python3.11 programming language (http://www.python.org). The codes are available from the corresponding author on reasonable request.
References
Safaei, M., Sundararajan, E. A., Driss, M. & Boulila, W. & Shapi’i, A. A systematic literature review on obesity: Understanding the causes & consequences of obesity and reviewing various machine learning approaches used to predict obesity. Comput. Biol. Med.136 (2021).
Rosen, E. D. & Spiegelman, B. M. What we talk about when we talk about Fat. CELL. 156, 20–44 (2014).
Zhang, P. et al. Association between Obesity Type and common vascular and metabolic diseases: a cross-sectional study. Front. Endocrinol. 10 (2020).
Wang, Y., Zhao, L., Gao, L., Pan, A. & Xue, H. Obesity in China 3 Health policy and public health implications of obesity in China. LANCET DIABETES Endocrinol.9, 446–461 (2021).
Li, J. et al. Obesity pandemic in China: epidemiology, burden, challenges, and opportunities. Chin. Med. J.135, 1328–1330 (2022).
De Nucci, S. et al. Associations between serum biomarkers and non-alcoholic liver disease: results of a clinical study of Mediterranean patients with obesity. Front. Nutr.9, 1002669 (2022).
Kakudi, H. A., Loo, C. K., Moy, F. M., Kau, L. C. & Pasupa, K. DIAGNOSIS OF METABOLIC SYNDROME USING MACHINE LEARNING, STATISTICAL AND RISK QUANTIFICATION TECHNIQUES: A SYSTEMATIC LITERATURE REVIEW. MALAYSIAN J. Comput. Sci.34, 221–241 (2021).
TÜRKAY, İ. K. Examınatıon of The Relatıonshıp of Blood Parameters ın Overweıght Female and Male Indıvıduals. hypertension 1, 15.
Knazicka, Z. et al. Blood Concentration of Macro- and Microelements in Women Who Are Overweight/Obesity and Their Associations with Serum Biochemistry. LIFE-BASEL 14 (2024).
Panagoulias, D. P., Virvou, M. & Tsihrintzis, G. A. A novel framework for artificial intelligence explainability via the Technology Acceptance Model and Rapid Estimate of Adult Literacy in Medicine using machine learning. Expert Syst. Appl. 248 (2024).
Du, J. et al. Developing a hypertension visualization risk prediction system utilizing machine learning and health check-up data. Sci. Rep. 13 (2023).
Panagoulias, D. P., Sotiropoulos, D. N. & Tsihrintzis, G. A. SVM-Based Blood Exam Classification for Predicting Defining Factors in Metabolic Syndrome Diagnosis. ELECTRONICS 11 (2022).
Chen, H. et al. Using blood indexes to predict overweight statuses: an Extreme Learning Machine-Based Approach. PLOS ONE 10 (2015).
Islam, M. M. et al. Machine learning algorithm for characterizing risks of hypertension, at an early stage in Bangladesh. DIABETES & METABOLIC SYNDROME-CLINICAL RESEARCH & REVIEWS 15, 877–884 (2021).
Artzi, N. S. et al. Prediction of gestational diabetes based on nationwide electronic health records. Nat. Med.26, 71– (2020).
De Francesco, D. et al. A data-driven health index for neonatal morbidities. ISCIENCE25 (2022).
Linardatos, P., Papastefanopoulos, V. & Kotsiantis, S. Explainable AI: A Review of Machine Learning Interpretability Methods. ENTROPY 23 (2021).
Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell.1, 206–215 (2019).
Aas, K., Jullum, M. & Løland, A. Explaining individual predictions when features are dependent: more accurate approximations to Shapley values. Artif. Intell.298, 103502 (2021).
Belle, V. & Papantonis, I. Principles and practice of explainable machine learning. Front. big Data. 4, 688969 (2021).
Zupo, R. et al. Associations between nutritional frailty and 8-year all‐cause mortality in older adults: the Salus in Apulia Study. J. Intern. Med.290, 1071–1082 (2021).
Zupo, R. et al. Beverages consumption and oral health in the aging population: a systematic review. Front. Nutr.8, 762383 (2021).
Thamrin, S. A., Arsyad, D. S., Kuswanto, H., Lawi, A. & Nasir, S. Predicting obesity in adults using machine learning techniques: An Analysis of Indonesian Basic Health Research 2018. Front. Nutr.8, 669155 (2021).
Jeon, J., Lee, S. & Oh, C. Age-specific risk factors for the prediction of obesity using a machine learning approach. Front. Public. Health. 10, 998782 (2023).
Parra, D. et al. Predicting the risk of overweight and obesity in Madrid-A Binary classification Approach with Evolutionary Feature Selection. Appl. sciences-basel. 12, 8251 (2022).
Kaur, R., Kumar, R. & Gupta, M. Predicting risk of obesity and meal planning to reduce the obese in adulthood using artificial intelligence. ENDOCRINE. 78, 458–469 (2022).
Yan, J. et al. A clinical decision support system for predicting coronary artery stenosis in patients with suspected coronary heart disease. Comput. Biol. Med. 151 (2022).
Hwaung, P. et al. Obesity tissue: composition, Energy expenditure, and Energy Content in adult humans. Obesity. 27, 1472–1481 (2019).
Yang, L. et al. Association of serum adipose triglyceride lipase levels with obesity and diabetes. Genet. Mol. Res.13, 6746–6751 (2014).
Weyman-Vela, Y., Guerrero-Romero, F. & Simental-Mendia, L. E. The triglycerides and glucose index is more strongly associated with metabolically healthy obesity phenotype than the lipid and obesity indices. J. Endocrinol. Investig.47, 865–871 (2024).
Lim, J. U. et al. Comparison of World Health Organization and Asia-Pacific body mass index classifications in COPD patients. Int. J. Chronic Obstr. Pulm. Dis.12, 2465–2475 (2017).
White, I. R., Royston, P. & Wood, A. M. Multiple imputation using chained equations: issues and guidance for practice. Stat. Med.30, 377–399 (2011).
Trigka, M. & Dritsas, E. Long-Term Coronary Artery Disease Risk Prediction with Machine Learning Models. SENSORS 23 (2023).
Lundberg, S. M. & Lee, S. I. A unified approach to interpreting model predictions. Adv. Neural. Inf. Process. Syst. 30 (2017).
Acknowledgements
This work was supported by grants from the Innovation Practice and Local Service Program at Hangzhou Normal University to J.D., and Zhejiang Provincial Natural Science Foundation of China (Grant No.LQ23H270015) to S.W.
Author information
Authors and Affiliations
Contributions
Conceptualization, J.D.; methodology, J.D.; validation, S.Y.; data curation, Y.Z. and C.Y.; writing—original draft preparation, X.C.; writing—review and editing, J.D.; resources, S.W. and J.D.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Du, J., Yang, S., Zeng, Y. et al. Visualization obesity risk prediction system based on machine learning. Sci Rep 14, 22424 (2024). https://doi.org/10.1038/s41598-024-73826-6
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41598-024-73826-6
This article is cited by
-
Machine learning framework for predicting susceptibility to obesity
Scientific Reports (2025)