Machine learning-aided risk prediction for metabolic syndrome based on 3 years study

Yang, Haizhen; Yu, Baoxian; OUYang, Ping; Li, Xiaoxi; Lai, Xiaoying; Zhang, Guishan; Zhang, Han

doi:10.1038/s41598-022-06235-2

Published: 10 February 2022

Scientific Reports volume 12, Article number: 2248 (2022)

Abstract

Metabolic syndrome (MetS) is a cluster of metabolic disorders that may increase the risk of diabetes, cardiovascular disease and other diseases. It is therefore of great significance to predict the onset of MetS and to identify the corresponding risk factors. In this study, we investigate risk prediction for MetS using a data set of 67,730 samples with physical examination records from three consecutive years, provided by the Department of Health Management, Nanfang Hospital, Southern Medical University, P.R. China. Specifically, the prediction uses the numerical features of the examination records together with differential features derived from the records of the past two consecutive years, namely the differential numerical feature (DNF) and the differential state feature (DSF). The risk factors associated with these features are statistically analyzed with respect to age and gender. Numerical results show that the proposed DSF, in addition to the numerical features of the examination records, contributes significantly to the risk prediction of MetS. Moreover, the proposed scheme outperforms the state-of-the-art MetS prediction model, which suggests its potential for effective prescreening of MetS.

Introduction

Metabolic syndrome (MetS) is a series of metabolic disorders of proteins, fats, carbohydrates and other natural substances¹. It has a high prevalence worldwide and the morbidity is still increasing^2,3. The aetiology of MetS is complex, and it has been widely recognized that the formation of MetS is related to insulin resistance, obesity, hypertension, and dyslipidemia^4,5. Besides, it has been pointed out^6,7,8,9 that MetS may increase the risk of diabetes, cardiovascular diseases (CVDs), chronic kidney diseases and cancers, all of which seriously endanger human health due to their high mortality¹⁰. Therefore, it is important to predict the onset of MetS in advance, so that early intervention and treatment can prevent it from evolving into more serious diseases.

Statistical methods have been widely used to identify the risk factors of MetS in various perspectives. Risk ratio is a commonly used method. Scuteri et al.¹¹ used a logistic regression model to derive relative risk (RR) of demographics and MetS components, and obtained that waist circumference (WC), triglyceride (TG), high density lipoprotein cholesterol (HDL-C) are the independent predictors of MetS. Wu¹² considered the odds ratio (OR) of cardiopulmonary fitness data to the risk of MetS 2 years later in the Taiwan military population. One traditional method for risk prediction is to set risk rules artificially. Taking an example of MetS risk prediction, Zou et al.¹³ set different risk scores for 4 MetS-related risk variables based on hazard ratio (HR) obtained from multiple logistic regression model, and then provided a risk model corresponding to the cumulative risk of these indicators, with the area under the receiver operating characteristic curve (AUC) of 0.690. Another traditional risk prediction method is based on the cut-off value of a single variable. For example, Jowitt et al.¹⁴ obtained the cut-off point of body mass index (BMI), WC, waist to hip ratio (WHR), waist to height ratio (WHtR) and total body fat (TBF) from previous studies, by which to determine the risk to MetS, and further to predict the occurrence of diabetes and CVDs. These models provided broad perspectives on the risk factors of MetS, but the prediction for the onset is not accurate enough for practical purposes due to the simple binary division of each variable. To address the above issue, Jeong et al.¹⁵ proposed an areal similarity degree-based model to identify the high-risk group of MetS using a weighted radar chart, where different importance of each variable as well as continuous numerical input was considered.

Machine learning has been regarded as a promising technique due to its powerful learning capability^16,17. With the help of machine learning, non-invasive indicators that require no blood draw can be used to predict MetS, enabling early diagnosis even in areas with poor medical conditions^18,19. In addition, machine learning has extended MetS prediction to less common fields, such as the metabolic spectrum²⁰ and FibroScan ultrasonic elastography equipment²¹. The above works can achieve accurate identification of MetS. Since MetS is often accompanied by various complications^22,23, it is important to provide potential MetS patients with effective risk prediction in advance.

Empowered by machine learning, research on the risk prediction of MetS has attracted wide attention in recent years. Farzaneh et al.²⁴ predicted the risk of MetS 7 years ahead using anthropometric measurements and commonly used MetS-related clinical examination indicators, and concluded that TG, blood pressure (BP) and BMI are the most important risk factors. Lee et al.²⁵ constructed a 2-year risk prediction model of MetS and showed how weight control in different BMI groups relates to the reduction of the MetS predictive index (MPI) 2 years later. Genetic information was considered in²⁶ and²⁷, but the results demonstrated that diet, lifestyle and clinical information still play a leading role in the risk prediction of MetS. Based on this fact, Lee et al.²⁸ incorporated “Sasang constitutional (SC) types” features, which involve facial expressions and body posture, to achieve a long-range prediction of MetS over 14 years. Li et al.²⁹ studied the relationship between children’s retinol binding protein 4 (RBP4) and the 10-year risk of MetS. Although the above-mentioned models demonstrated that some key clinical variables, such as TG, BP and BMI, are important for the risk prediction of MetS, the impact of the numerical and state changes of such clinical variables on MetS has not yet been reported.

To address the above issues, this paper presents a machine learning-aided longitudinal study on the risk prediction of MetS using examination records from three consecutive years for a total of 67,730 individuals. Specifically, in addition to the numerical features of the examination records, the numerical changes and the normal/abnormal state changes over the past two consecutive years are employed as classification features to predict MetS in the forthcoming year. To the best of the authors’ knowledge, this is the largest number of samples used for MetS risk prediction. Numerical results show that the proposed risk prediction model outperforms the state-of-the-art methods. More importantly, we show that the differential state features (DSFs) of the clinical variables TG, WC, BP and BMI, in addition to the numerical features of the examination records, are significant for the risk of MetS, demonstrating that a long-term unhealthy lifestyle over 2 years, regardless of age and gender, leads to a high incidence of MetS.

Results

Performance of differential features with different classifiers

Table 1 compares the performance of MetS prediction models using three different classifiers with and without the proposed differential numerical features (DNFs) and DSFs. For a fair comparison, all examination indicators of the previous 2 years are considered in the experiments, both with and without DNFs and DSFs. A 10-fold cross-validation experiment is carried out, where the AUC is reported as mean ± standard deviation (STD), and the best performance in each metric is marked in bold. In addition, we plot the receiver operating characteristic (ROC) curves of the proposed MetS prediction model with and without the DNFs and DSFs (Fig. 1). It can be seen from both Table 1 and Fig. 1 that the proposed MetS predictive models, with and without differential features, perform robustly, with a very small STD in terms of AUC. This result is reasonable, since the dataset employed in this work contains 67,730 individuals, which is larger than those reported in existing contributions. Furthermore, the performance with DNFs and DSFs is superior to that without differential features in terms of all metrics. This result demonstrates that the variations of examination indicators over 2 consecutive years can be viewed as effective features for predicting MetS in the forthcoming year. In addition, XGBoost performs the best in terms of AUC, Accuracy, Precision, F1-score, Specificity and F2-score, yielding an AUC and Accuracy of up to 0.930 and 0.849, respectively. It is worth noting that the Precision and F1-score are 0.43 and 0.58, respectively. This result is similar to that of existing studies^14,19,25,27,30 and is expected, since the number of positive samples is significantly smaller than that of negative ones. Consequently, we select XGBoost as the classifier for the remaining experiments unless otherwise indicated.

Table 1 Results based on three models with and without differential features.


Risk factors of MetS

As shown in Fig. 2a, we only plot the top 20 important features from all 72 features, since these top 20 features contribute over 90% to the predictive performance of the model.

The 20 features can be divided into two categories: clinical variables and DSFs. The clinical variables include TG, WC, BMI, HDL-C, WHR, FL, SBP, FGLU, DBP, BP, and the DSFs include TG, WC, BP, BMI, SBP, FGLU, FL, HGB, DBP. Notably, the DSFs show strong robustness in the classification results, accounting for 9 out of the top 20 features and 6 out of the top 10 features. However, the DNF shows no obvious contribution to MetS model.

To further analyze the contribution of the top 20 features to the prediction of MetS, we provide an explainability analysis using the SHAP tool³¹. As shown in Fig. 2b, among all 9 DSFs, the state changes in FGLU and TG contribute the most to the prediction of MetS. Similarly, the examination indicators FGLU and TG are the two features with the highest contribution to MetS. In addition to FGLU and TG, both the state changes and the examination indicators of WC and BMI are also important, suggesting that either a value exceeding the normal upper limit or an N2A or A2A state change over the past 2 years could significantly increase the risk of MetS. It is also noted that the N2A and A2A state changes of HGB are important features that increase the risk of MetS, which has not yet been reported.

In view of this, we further analyze the impact on MetS of abnormality in important clinical variables and of the two differential states (N2A and A2A) of important DSFs, in different gender and age groups (age groups are divided according to the World Health Organization).

Impact of important clinical variables on MetS risk in different gender and age groups

First, we statistically analyze the risk of MetS in different gender and age groups. As shown in Table 2, the prevalence of MetS for both genders grows with age and is higher in males than in females³⁰, but the difference gradually narrows with age. For example, in the group aged 18–44, the prevalence of MetS in males is approximately 8 times higher than that in females. In the elderly group aged over 60, the prevalence of MetS in males and females is comparable, i.e., 25.41% and 19.07%, respectively. These results are expected, and indicate that roughly 20–25% of elderly people develop MetS.

Table 2 Prevalence of MetS in the forthcoming year for different gender and age groups.


Then, we statistically analyze the contribution of the clinical variables in different gender and age groups by calculating the odds ratio (OR) of each feature’s abnormality to MetS risk in the next year (the largest OR values in each age group are marked in bold). As can be seen from Table 3, the main risk factors of MetS in males aged 18–44 and 45–59 are abnormal TG and BMI. In addition, WC and FL also contribute to the risk of MetS in men under 44 years old. For males over 60, the risk of MetS, in addition to BMI, is mainly due to the abnormality of FL. The abnormalities of TG and WHR are also relatively important for this group.

Table 3 The OR of feature’s abnormality to MetS by age groups in male.


Interestingly, Table 4 shows that the most important risk factors of MetS for females aged 18–59 are TG, BMI, FL and FGLU. As age grows, WHR contributes more than BMI to the risk of MetS for females aged $\ge$ 45. For the elderly group aged $\ge$ 60, the most important clinical variables are HDL-C and WHR. Across age groups, it is observed that (1) the impact of clinical variables on younger females (i.e., < 45) is more significant than that on older ones, and (2) the impact of clinical variables on the risk of MetS for females is more significant than that for males of the same age group.

Table 4 The OR of feature’s abnormality to MetS by age groups in female.


The above observations are expected and can be explained as follows. Older females, in comparison with younger females, generally suffer from more concomitant diseases, whose influence could neutralize the contribution of any single clinical variable to the risk of MetS. Similarly, the prevalence of MetS in males is higher than that in females of the same age group, and thus the contribution of clinical variables is less pronounced in males than in females.

The results in Tables 2, 3 and 4 show that the risk of MetS in females with abnormal clinical variables is higher than that in males of the same age group, while the actual prevalence of MetS in females is lower than that in males. A potential reason is that males, in comparison with females of the same age group, generally have irregular diets and unhealthy lifestyles¹⁶, such as drinking and smoking. Besides, for young and middle-aged females, the protective effect of estrogen^16,32 is also an important reason for the low prevalence of MetS.

Impact of important DSFs on MetS risk in different gender and age groups

Next, we statistically analyze the impact that DSFs’ abnormalities have on the MetS of different gender and age groups. The results are shown in Tables 5 and 6, respectively.

Recalling the definition of DSF in Eq. (2), the states include N2A (the clinical variable is abnormal only in the most recent year), A2A (the clinical variable is abnormal in both of the past 2 years) and N2N (the clinical variable is normal in both of the past 2 years). We evaluate the OR of the abnormal DSF states (N2A and A2A) for different gender and age groups by taking the N2N state as the control group. For ease of analysis, the two largest OR values w.r.t. N2A and A2A in each age group are marked in bold, and the OR values for which N2A is higher than A2A are underlined.
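As a rough illustration of this comparison, the sketch below computes the OR of an abnormal DSF state against the N2N control group from a 2×2 table of counts. The function name and all counts are hypothetical and chosen only for illustration; they are not taken from Tables 5 or 6.

```python
def odds_ratio(exposed_cases, exposed_noncases, control_cases, control_noncases):
    """Odds ratio of MetS for an exposed DSF state versus the N2N control group."""
    if exposed_noncases == 0 or control_cases == 0:
        raise ValueError("odds ratio is undefined when a denominator cell is zero")
    return (exposed_cases * control_noncases) / (exposed_noncases * control_cases)


# Made-up counts for one indicator in one gender/age group (illustration only).
counts = {
    "N2N": {"mets": 50, "no_mets": 950},  # control group
    "N2A": {"mets": 30, "no_mets": 170},  # abnormal only in the most recent year
    "A2A": {"mets": 60, "no_mets": 140},  # abnormal in both years
}

control = counts["N2N"]
or_by_state = {
    state: odds_ratio(c["mets"], c["no_mets"], control["mets"], control["no_mets"])
    for state, c in counts.items()
    if state != "N2N"
}

for state, value in or_by_state.items():
    print(f"OR({state} vs N2N) = {value:.2f}")
```

With these made-up counts, the A2A state has a larger OR than the N2A state, which is the pattern the text describes for most indicators.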

For males aged 18–44, TG and BMI in the N2A state carry a relatively high risk of MetS, and they carry the highest risk in the A2A state. In addition, all features show that, compared with abnormality in only the most recent year, the risk for people with abnormality in both years is significantly increased. This still applies to males over 45 years old. The difference is that, with increasing age, the risk associated with BMI in the A2A state decreases significantly, even falling below that in the N2A state. FGLU shows similar characteristics in males over 60 years old. This suggests that middle-aged and elderly males may commonly have abnormal body weight, and that its contribution to MetS is relatively stable when this feature does not change significantly. Similarly, elderly males should also be aware of significant changes in FGLU. The A2A states of TG and HGB carry the highest risks in this age group.

Table 5 The OR of the persistent abnormality (A2A) compared to sudden abnormal state (N2A) in male.


Table 6 The OR of the persistent abnormality (A2A) compared to sudden abnormal state (N2A) in female.


It can be seen from Table 6 that, for females aged 18 to 44, abnormalities of TG, BMI, FGLU and FL lead to a higher risk of MetS than other clinical variables. When TG, FGLU and FL are abnormal for two consecutive years, the risk of MetS increases significantly. It is also noted that the impact of the abnormal DSFs of TG, FGLU and FL on females aged 45 to 59 is similar to that of the clinical variables on females aged 18 to 44. This suggests that, benefiting from the protection of estrogen, the incidence of abnormal endocrine indicators in females $\le$ 59 is lower than that in males. Meanwhile, when TG and FL are abnormal for two consecutive years, this indicates that the endocrine disorder has exceeded the capacity for self-protection through the regulation of estrogen levels, leading to a significant increase in the risk of MetS. For females aged over 60, persistent obesity (associated with abnormalities of both WC and BMI) and abnormal FL are also important risk factors of MetS.

In summary, the results shown in both Tables 5 and 6 demonstrate that, regardless of age and gender, the abnormal clinical variables of two consecutive years lead to higher MetS risk than that of only a single year. Clearly, the results encourage people to carry out necessary measures to avoid abnormal clinical variables for two consecutive years.

Finally, Table 7 compares the proposed MetS predictive model with state-of-the-art studies. It can be seen from Table 7 that the proposed method, by taking advantage of the differential features of examination indicators over the past 2 consecutive years, yields the highest performance, with an AUC of up to 0.930. Moreover, it is worth noting that the dataset analyzed in this work contains 67,730 samples, which is larger than any previously reported. Such a large dataset supports the robustness of the risk prediction of MetS.

Table 7 Comparison between the proposed MetS model and the state-of-the-art contributions.


Discussion

Studies have shown that MetS is a major cause of diseases such as diabetes and CVDs. Based on a three-consecutive years longitudinal study, this paper studied the risk prediction by taking advantage of the examination records of the current year as well as the differential features of the past two consecutive years.

Based on the XGBoost classifier, the impact of the 10 clinical variables most important to the risk of MetS is statistically analyzed for different gender and age groups. Specific observations are summarized as follows. Owing to a relatively irregular lifestyle, males have a higher prevalence of MetS than females across age groups, suggesting that males should pay more attention to the risk of MetS. Thanks to the protective mechanism of estrogen, the proportion of young females with MetS is significantly lower than that of other age groups. For elderly females aged $\ge$ 60, the prevalence of MetS is approximately equal to that of males. For males, BMI^21,33 and FL^30,34,35 are critical to the risk of MetS in all age groups. In particular, the prevalence of MetS in the young group is sensitive to abnormal weight (in terms of BMI, WHR, WC and FL), suggesting that males $\le$ 44 years old should pay more attention to controlling their weight and body shape. For females, abnormalities of endocrine clinical variables (in terms of TG, FL and FGLU) are highly related to the prevalence of MetS, especially in the young group, i.e., females $\le$ 44 years old. BMI is also important to the risk of MetS. In addition, the abnormality of WHR becomes increasingly important to the risk of MetS as age grows, suggesting that middle-aged and older females should pay more attention to changes in body shape. Owing to the interaction of concomitant diseases, the importance of clinical variable abnormalities to the risk of MetS is lower in the elderly than in the young and middle-aged groups.

Furthermore, we take advantage of the DSF w.r.t. the abnormality of clinical variables over the past 2 years, aiming to assess the relationship between the DSF of specific clinical variables and the risk of MetS. Statistical results in terms of OR values w.r.t. specific DSFs show that most abnormal states persisting over the past 2 years (A2A) lead to a higher risk of MetS than abnormal states that occurred only in the most recent year (N2A). This observation suggests that any possible intervention should be carried out to prevent clinical variables from remaining abnormal over 2 consecutive years. Additionally, it is observed that an abnormality of HGB lasting for 2 consecutive years significantly increases the risk of MetS for males aged over 45. This result has not yet been reported, and may be explained by the correlation between HGB abnormalities and the occurrence of insulin resistance or MetS reported in^36,37.

More importantly, it is noted that, for BMI and FGLU in middle and old-aged groups (i.e., aged $\ge$ 45), the state N2A yields a higher risk of MetS than A2A, suggesting people of such age groups with normal weight and blood glucose should pay special attention to the abnormal state changes of such clinical variables.

In conclusion, with the help of physical examination records from three consecutive years, this paper analyzed the risk of MetS in different age and gender groups using machine learning algorithms. The statistical results relating the onset of MetS to specific clinical variables (with the corresponding state changes over the past 2 consecutive years) could help in understanding the relationship between lifestyle and the pathogenesis of MetS.

Finally, this study has the following two limitations. First, in view of the normal range of each examination indicator, the considered DNFs, which use only the numerical difference between two consecutive years, may not be sufficient without a non-uniform mapping w.r.t. the specific range. This could be the reason why the contributions of DNFs to the prediction of MetS are trivial. In further study, a non-uniform mapping of DNFs w.r.t. the numerical range will be examined. Second, all samples in the dataset of this study are from Guangdong Province, China, and thus the experimental results may have regional characteristics.

Methods

Diagnostic criteria for MetS

According to the Chinese Guidelines for the Prevention and Treatment of Type 2 Diabetes (2017 edition), people with three or more of the following five conditions can be diagnosed as MetS patients: (1) Abdominal obesity: WC $\ge$ 90/85 cm (male/ female). (2) Hyperglycemia: FGLU $\ge$ 6.1 mmol/L or 2-h postprandial blood glucose (PG) $\ge$ 7.8 mmol/L and (or) treatment of previously diagnosed diabetes. (3) Hypertension: BP $\ge$ 130/85 mmHg and (or) treatment of previously diagnosed hypertension. (4) Fasting TG $\ge$ 1.70 mmol/L. (5) Fasting HDL-C < 1.04 mmol/L.
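The five criteria above can be sketched as a simple rule check. This is a minimal sketch: the function and parameter names are hypothetical, and reading "BP $\ge$ 130/85 mmHg" as "systolic $\ge$ 130 or diastolic $\ge$ 85" is an assumption about how the threshold is applied.

```python
def has_mets(sex, wc_cm, fglu, sbp, dbp, tg, hdl_c,
             pg_2h=None, treated_diabetes=False, treated_hypertension=False):
    """Return True when at least three of the five listed conditions hold.

    Units: wc_cm in cm; fglu, pg_2h, tg, hdl_c in mmol/L; sbp, dbp in mmHg.
    """
    wc_limit = 90 if sex == "male" else 85
    conditions = [
        # (1) abdominal obesity
        wc_cm >= wc_limit,
        # (2) hyperglycemia
        fglu >= 6.1 or (pg_2h is not None and pg_2h >= 7.8) or treated_diabetes,
        # (3) hypertension (assumption: either systolic or diastolic over the limit)
        sbp >= 130 or dbp >= 85 or treated_hypertension,
        # (4) elevated fasting TG
        tg >= 1.70,
        # (5) low fasting HDL-C
        hdl_c < 1.04,
    ]
    return sum(conditions) >= 3


# Made-up examination values for illustration.
print(has_mets("male", wc_cm=92, fglu=6.3, sbp=120, dbp=80, tg=1.9, hdl_c=1.2))
print(has_mets("female", wc_cm=80, fglu=5.0, sbp=135, dbp=80, tg=1.0, hdl_c=1.3))
```

The first made-up record meets three conditions (WC, FGLU and TG) and is flagged; the second meets only the blood pressure condition and is not.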

Dataset

The data of this study is from the Department of Health Management, Nanfang Hospital, Southern Medical University, P.R. China. It contains 546,918 individuals who participated in physical examinations from 2009 to 2019, with a total of 1,039,564 medical records covering several cities in southern China, including Guangzhou, Foshan, Qingyuan, etc. In this data set, 32% of individuals have more than 1 record, 18% of individuals have 3 or more records.

Since some of the indicators were recorded manually from a large number of physical examination reports, some errors are inevitable. We therefore filtered outliers using upper and lower thresholds set by doctors according to their experience.

After desensitization, integration and cleaning, we obtained the usable structured data (537,283 records for males, and 403,899 records for females). The detailed statistical characteristics are shown in Table 8. There are 32 raw indicators collected in the examination, including anthropometry, blood parameters, other biochemical indicators, medical histories, gender and age.

Table 8 Basic statistical characteristics of the raw data set.


The study was conducted under the approval of the Academic Committee of South China Normal University (Approval No.: SCNU-PHY-2020-063). All methods used in the study adhered to the relevant ethical guidelines and regulations (Declaration of Helsinki). All subjects signed an informed consent form before inclusion in the present study.

Longitudinal MetS risk prediction model

The risk prediction model for MetS is shown in Fig. 3 (MS_result indicates whether the individual suffers from MetS; MS_result = 1 and 0 represent the status with and without MetS, respectively). Unlike conventional methods, we take into consideration both the indicators of the current year and those of the latest record before the current year, in order to obtain features of physical change over time. The prediction can be regarded as a supervised classification task, where the status of suffering from MetS in the next year is labeled as “1”, and the records of the current year together with the differential features extracted from the past two records serve as the model input. Thus, a sample contains three records in the model.

Since the risk prediction of MetS concerns the transition from a healthy state to MetS, the first two of the three records should be in the healthy state. Considering the variation in the timing of physical examinations (usually in the first or third quarter of a year in China), we set the maximum time interval between the first two records and the third one to 540 days.
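One possible way to assemble such three-record samples is sketched below. The record layout and function name are hypothetical. The text does not state exactly which pair of records the 540-day limit applies to, so applying it to each consecutive gap is an assumption of this sketch.

```python
from datetime import date

MAX_GAP_DAYS = 540  # limit stated in the text


def build_samples(records):
    """Build (previous, current, label) samples for one individual.

    records: list of (exam_date, has_mets) tuples, in any order.
    """
    records = sorted(records, key=lambda r: r[0])
    samples = []
    # Slide over consecutive triples of examination records.
    for prev, curr, nxt in zip(records, records[1:], records[2:]):
        healthy_history = not prev[1] and not curr[1]
        # Assumption: the 540-day limit applies to each consecutive gap.
        gaps_ok = ((curr[0] - prev[0]).days <= MAX_GAP_DAYS
                   and (nxt[0] - curr[0]).days <= MAX_GAP_DAYS)
        if healthy_history and gaps_ok:
            samples.append({
                "previous": prev[0],
                "current": curr[0],
                "label": int(nxt[1]),  # 1 = MetS at the third record
            })
    return samples


# Made-up examination history for one individual.
history = [
    (date(2015, 3, 1), False),
    (date(2016, 3, 10), False),
    (date(2017, 2, 20), True),
    (date(2019, 9, 1), False),
]
samples = build_samples(history)
print(samples)
```

In this made-up history only the first triple qualifies: the second triple is rejected because its "current" record already shows MetS.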

After the above processing, 67,730 usable samples were obtained, in which the samples with/without MetS are 7971 and 59,759, respectively. For all samples, male and female account for 56% and 44% respectively.

Feature extraction

Features play a significant role for task classification. In this section, two kinds of differential features in time are proposed, characterizing the deviation of the value and state transition of indicators, respectively.

Differential numerical feature (DNF)

The differential numerical feature can be characterized as

$$\begin{aligned} I_{DNF} = \Delta _I= I_0 - I_{-1} \end{aligned}$$

(1)

where $I_0$ and $I_{-1}$ denote the values of specific indicator I of current year and that of the latest record before current year, respectively.

As a consequence, $I_{DNF}$ describes the absolute numerical difference of an indicator over years, which may be an increment, a decrement, no change, or a missing value. This kind of feature is extracted from the indicators with numerical values, and thus 21 features are extracted.
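A minimal sketch of Eq. (1) follows. The function name, the dictionary layout and the use of `None` to mark a missing value are assumptions made for illustration; the indicator values are made up.

```python
def dnf(current, previous):
    """Eq. (1): I_DNF = I_0 - I_{-1}; None marks a missing value."""
    if current is None or previous is None:
        return None
    return current - previous


# Made-up indicator values for one individual.
current_year = {"TG": 1.9, "WC": 88.0, "BMI": None}
previous_year = {"TG": 1.4, "WC": 88.0, "BMI": 24.1}

dnf_features = {
    name: dnf(current_year[name], previous_year[name])
    for name in current_year
}
print(dnf_features)
```

Here TG shows an increment, WC shows no change, and the DNF of BMI is missing because one of the two records lacks a value.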

Differential state feature (DSF)

DSF describes the state change process of indicator I over the past two examination records, and it can be characterized as

$$\begin{aligned} I_{DSF} = S(I_{-1})\rightarrow S(I_0) \end{aligned}$$

(2)

where $S(I_{-1})$ and $S(I_0)$ represent the state of indicator I in the latest record before the current year and in the current year, respectively, each taking the value normal, abnormal or null. For all indicators except HDL-C, we set the upper limit of the clinical reference range as the threshold and define values beyond the threshold as the “abnormal” state, since an increase in these indicators is associated with the risk of MetS. In particular, we set the threshold of BMI to 28 kg/m². The “abnormal” state of HDL-C is defined as a value lower than its clinical range, since this indicator is protective against MetS.

The status of $I_{DSF}$ can be normal-to-normal (N2N, the indicator is normal in both of the past 2 years), normal-to-abnormal (N2A, the indicator is abnormal only in the most recent year), abnormal-to-normal (A2N, the indicator changes from abnormal to normal), abnormal-to-abnormal (A2A, the indicator is abnormal in both of the past 2 years) or missing (the indicator is empty in either record of the past 2 years). DSFs are extracted for all indicators except gender, age, hip and the three medical histories, giving 26 DSFs in this paper.
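The state mapping of Eq. (2) can be sketched as follows. The BMI threshold of 28 kg/m² comes from the text; the HDL-C lower limit of 1.04 mmol/L is borrowed from the diagnostic criteria as a stand-in for the clinical reference range, and the strict inequality at the threshold is likewise an assumption. Function names are hypothetical.

```python
def state(value, threshold, low_is_abnormal=False):
    """Map a numeric indicator to 'N' (normal), 'A' (abnormal) or None (missing)."""
    if value is None:
        return None
    abnormal = value < threshold if low_is_abnormal else value > threshold
    return "A" if abnormal else "N"


def dsf(previous, current, threshold, low_is_abnormal=False):
    """Eq. (2): I_DSF = S(I_{-1}) -> S(I_0), encoded as N2N, N2A, A2N, A2A or missing."""
    s_prev = state(previous, threshold, low_is_abnormal)
    s_curr = state(current, threshold, low_is_abnormal)
    if s_prev is None or s_curr is None:
        return "missing"
    return f"{s_prev}2{s_curr}"


BMI_LIMIT = 28.0    # kg/m^2, stated in the text
HDL_C_LIMIT = 1.04  # mmol/L, assumed lower limit for illustration

print(dsf(26.0, 29.5, BMI_LIMIT))                         # N2A
print(dsf(29.0, 30.2, BMI_LIMIT))                         # A2A
print(dsf(1.2, 0.9, HDL_C_LIMIT, low_is_abnormal=True))   # N2A
print(dsf(None, 25.0, BMI_LIMIT))                         # missing
```

Encoding the transition as a single categorical value keeps the missing case explicit, which matches the fixed-value treatment of non-numeric features described in the text.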

Dealing with missing value and normalization

The regular physical examination generally involves a fixed part of the items, so the presence of missing values is common in the records, which bring challenges to MetS prediction. In this study, we propose to fill the missing values of indicators based on the following criteria in terms of missing rate, data type and distribution.

If the proportion of missing values is large (70% or more of the data is missing), delete the feature directly (in this case, the indicators HBA1c, PG and SMK_H are removed from the dataset).
For numerical features, fill the missing values with the mean when the values of the indicator follow a normal distribution (BMI, CR, DBP, FGLU, HGB, Hip, LDL-C, PLT, RBC, SBP, TC, UA, WBC and WC are filled accordingly). If the values follow a skewed distribution, fill the missing values with the median (Age, ALT, AST, HDL-C and TG).
For non-numeric data, retain the missing-value status and fill in a fixed value (for example, DM_H, HYT_H, FL, TN, HM, MGH, UALB and the DSFs).

For features deleted due to the high missing rate, the corresponding DNF and DSF are also deleted. After the above processing, there are 72 features in total, including 29 raw features, 19 DNFs and 24 DSFs.
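The three filling rules above can be sketched with the standard library as follows. The column-oriented table, the `"missing"` token and the tiny made-up data are assumptions for illustration; only one indicator per rule is shown rather than the full lists given in the text.

```python
from statistics import mean, median

DROP_RATE = 0.70           # features with >= 70% missing values are deleted
MEAN_FILLED = {"BMI"}      # subset of the normally distributed indicators
MEDIAN_FILLED = {"TG"}     # subset of the skewed indicators
MISSING_TOKEN = "missing"  # hypothetical fixed value for non-numeric data


def missing_rate(column):
    return sum(v is None for v in column) / len(column)


def fill_missing(table):
    """Apply the drop / mean / median / fixed-value rules column by column."""
    cleaned = {}
    for name, column in table.items():
        if missing_rate(column) >= DROP_RATE:
            continue  # rule 1: drop the feature entirely
        observed = [v for v in column if v is not None]
        if name in MEAN_FILLED:
            fill = mean(observed)      # rule 2a: mean for normal distributions
        elif name in MEDIAN_FILLED:
            fill = median(observed)    # rule 2b: median for skewed distributions
        else:
            fill = MISSING_TOKEN       # rule 3: keep missingness as its own value
        cleaned[name] = [fill if v is None else v for v in column]
    return cleaned


# Made-up table with four individuals.
table = {
    "BMI": [22.0, None, 26.0, 24.0],
    "TG": [1.0, None, 1.2, 5.0],
    "FL": ["yes", None, "no", "no"],
    "PG": [None, None, None, 7.0],
}
cleaned = fill_missing(table)
print(cleaned)
```

In this example PG is dropped (75% missing), BMI is filled with the mean, TG with the median, and FL keeps an explicit missing category.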

Finally, we use the standard deviation normalization for features to normalize the contributions of different features to the model. Figure 4 shows the framework of predictive model for MetS based on machine learning techniques.
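Assuming that "standard deviation normalization" refers to the usual z-score standardization (subtract the mean, divide by the standard deviation), a minimal sketch is given below. The function name and the handling of constant features are choices made for this sketch.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score standardization: (x - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Made-up feature values.
z = standardize([1.0, 2.0, 3.0])
print(z)
```

In practice the mean and standard deviation would typically be estimated on the training set only and then reused for the test set, so that no test information leaks into the model.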

Experimental setup

In the experiments, the data set is randomly divided into a training set and a test set at a ratio of 7 to 3. To validate the generalization ability of the model, the age and gender distributions of the samples in the test set and the training set are kept at the same level.
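One way to obtain a 7:3 split with matching age and gender distributions is stratified sampling, sketched below. The text does not say how the balance was achieved, so stratification is an assumption here, as are the function names, the sample layout and the age bins (18–44, 45–59, 60 and over, following the groups used in the Results).

```python
import random
from collections import defaultdict


def age_group(age):
    if age < 45:
        return "18-44"
    if age < 60:
        return "45-59"
    return "60+"


def stratified_split(samples, test_fraction=0.3, seed=0):
    """Split samples so each (gender, age group) stratum keeps the same ratio."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for sample in samples:
        strata[(sample["gender"], age_group(sample["age"]))].append(sample)
    train, test = [], []
    for members in strata.values():
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Made-up samples: 10 young males and 10 elderly females.
samples = (
    [{"id": i, "gender": "male", "age": 30} for i in range(10)]
    + [{"id": 10 + i, "gender": "female", "age": 65} for i in range(10)]
)
train, test = stratified_split(samples)
print(len(train), len(test))
```

Each stratum contributes 3 of its 10 members to the test set, so both sets keep the same gender and age composition.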

We use three commonly used decision tree-based ensemble classification algorithms, namely Random Forest (criterion = ‘entropy’, max_depth = 8, max_features = ‘sqrt’, n_estimators = 500), XGBoost (max_depth = 4, n_estimators = 500, learning_rate = 0.03, colsample_bytree = 0.5) and Stacking (a combination of the above two algorithms), to perform the prediction of MetS. A probability threshold must be set for the final decision. In the experiments, the maximum Youden index criterion is employed to determine the optimal threshold.
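The Youden index is J = Sensitivity + Specificity − 1, and the threshold that maximizes J is selected. The sketch below scans every predicted probability as a candidate threshold. The function name, the labels and the probabilities are made up, and the classifier that produces the probabilities is outside the scope of this sketch.

```python
def youden_threshold(y_true, y_prob):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    A sample is predicted positive when its probability is >= threshold.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best_threshold, best_j = None, -1.0
    for t in sorted(set(y_prob)):
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= t)
        tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < t)
        j = tp / positives + tn / negatives - 1
        if j > best_j:
            best_threshold, best_j = t, j
    return best_threshold, best_j


# Made-up labels (1 = MetS) and predicted probabilities.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.90]

threshold, j = youden_threshold(y_true, y_prob)
print(threshold, round(j, 3))
```

Because J weights sensitivity and specificity equally, the chosen threshold can fall well below 0.5 when positive samples are rare, as they are in this data set.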

We assess the performance of the proposed MetS prediction model using Accuracy, Precision, Recall (Sensitivity), Specificity, F1-score and F2-score (which favors Recall over Precision), given as

$$\begin{aligned} Accuracy &= \frac{TP+TN}{TP+FP+FN+TN} \end{aligned}$$

(3)

$$\begin{aligned} Precision & = \frac{TP}{TP+FP} \end{aligned}$$

(4)

$$\begin{aligned} Recall & = \frac{TP}{TP+FN} \end{aligned}$$

(5)

$$\begin{aligned} Specificity &= \frac{TN}{TN+FP} \end{aligned}$$

(6)

$$\begin{aligned} F1{\text{-}}score & = 2 \times \frac{Precision \times Recall}{Precision + Recall} \end{aligned}$$

(7)

$$\begin{aligned} F2{\text{-}}score & = 5 \times \frac{Precision \times Recall}{4 \times Precision + Recall} \end{aligned}$$

(8)

where TP (true positive), TN (true negative), FP (false positive) and FN (false negative) are the entries of the confusion matrix.
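As a worked example, Eqs. (3) to (8) can be evaluated directly from the four confusion-matrix counts. The counts below are made up for illustration and are not results from the study; the function name is hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the metrics of Eqs. (3)-(8) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called sensitivity
    specificity = tn / (tn + fp)

    def f_beta(beta):
        # General F-beta score; beta = 1 gives Eq. (7), beta = 2 gives Eq. (8).
        b2 = beta * beta
        return (1 + b2) * precision * recall / (b2 * precision + recall)

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f_beta(1),
        "f2": f_beta(2),
    }

# Hypothetical counts: 120 actual positives and 180 actual negatives.
metrics = classification_metrics(tp=80, fp=20, fn=40, tn=160)
```

Here Precision is 0.8 and Recall is about 0.667, so the F2-score (about 0.690) is lower than the F1-score (about 0.727), reflecting the heavier penalty that F2 places on missed positives.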

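In addition, the area under the receiver operating characteristic curve (AUC) is used to evaluate the performance of risk prediction. The value of AUC ranges from 0 to 1, where $AUC = 1$ denotes perfect classification and $AUC = 0.5$ corresponds to random guessing.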
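AUC can be read as the probability that a randomly chosen positive sample receives a higher predicted score than a randomly chosen negative one. The sketch below computes it by direct pairwise comparison, which is simple but quadratic in the number of samples, so it is suited to illustration rather than to a data set of this size. The function name and toy values are hypothetical.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    Assumes y_true contains both classes (0 and 1).
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy labels and scores (made-up values).
y_true = [0, 0, 0, 1, 0, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8]
auc = roc_auc(y_true, y_score)
```

In this example 11 of the 12 positive-negative pairs are ordered correctly, so the AUC is $11/12 \approx 0.917$.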


Acknowledgements

This work is supported by Blue Fire Innovation Project of the Ministry of Education (Huizhou), No. CXZJHZ201803, Natural Science Foundation of Guangdong Province, No. 2019A1515011940, Science & Technology Project of Guangzhou, No. 202002030353.

Author information

Authors and Affiliations

School of Physics and Telecommunication Engineering, South China Normal University (SCNU), Guangzhou, 510006, China
Haizhen Yang, Baoxian Yu & Han Zhang
School of Electronics and Information Engineering, SCNU, Foshan, 528225, China
Haizhen Yang, Baoxian Yu & Han Zhang
Guangdong Provincial Engineering Technology Research Center of Cardiovascular Individual Medicine & Big Data, SCNU, Guangzhou, 510006, China
Haizhen Yang, Baoxian Yu & Han Zhang
Department of Health Management, Nanfang Hospital, Southern Medical University, Guangzhou, 510515, China
Ping OUYang, Xiaoxi Li & Xiaoying Lai
Key Laboratory of Digital Signal and Image Processing of Guangdong Provincial, College of Engineering, Shantou University, Shantou, 515063, China
Guishan Zhang

Contributions

H.Z., P.O., B.Y., Xx.L. and Xy.L. designed the study. P.O., Xx.L. and Xy.L. were responsible for the management, collection and pretreatment of the data. H.Y. conducted the experiment and drafted the initial manuscript. H.Z., B.Y. and G.Z. validated the results. All authors critically revised the manuscript and approved the final manuscript version.

Corresponding authors

Correspondence to Baoxian Yu, Ping OUYang or Han Zhang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Yang, H., Yu, B., OUYang, P. et al. Machine learning-aided risk prediction for metabolic syndrome based on 3 years study. Sci Rep 12, 2248 (2022). https://doi.org/10.1038/s41598-022-06235-2

Received: 13 September 2021
Accepted: 20 January 2022
Published: 10 February 2022
Version of record: 10 February 2022
DOI: https://doi.org/10.1038/s41598-022-06235-2
