Introduction

With the accelerating pace of population aging in China, multimorbidity among older adults is becoming increasingly prominent. Multimorbidity refers to the co-existence of two or more chronic diseases or conditions1. Among individuals aged 65 years and older, the prevalence of multimorbidity reaches 64.7%2. The co-occurrence of coronary heart disease (CHD) and diabetes mellitus (DM) is one of the most common combinations in the elderly population. CHD is a major subtype within the spectrum of cardiovascular diseases. Its underlying mechanism involves organic stenosis or obstruction of the coronary arteries, leading to myocardial ischemia, hypoxia, and even necrosis; it is therefore often referred to as ischemic heart disease3. Clinical manifestations include angina pectoris, arrhythmias, myocardial infarction, and even sudden cardiac death4.

Hyperglycemia, or diabetes mellitus, is one of the major modifiable risk factors for CHD that can be addressed at the population level5. In recent years, type 2 diabetes mellitus (T2DM) has become one of the most significant comorbidities of CHD, with a steadily increasing incidence6 and an association with patient mortality7. Conventional diagnostic techniques include coronary angiography, coronary CT angiography (CTA), electrocardiography, and echocardiography. However, these methods require specialized equipment and trained personnel, making them costly and less accessible. Therefore, developing low-cost, convenient, and effective non-invasive diagnostic tools is crucial for the early detection of CHD with comorbid T2DM (CHD-DM2) and may reduce patient mortality.

Data indicate that glucose metabolism disorders are common among patients undergoing coronary angiography. Among 1040 patients with CHD, 62.2% exhibited abnormal glucose metabolism. The integrated management of CHD and T2DM, along with the identification of patients at risk for multiple comorbidities, is a high priority in clinical practice8. Machine learning algorithms have been shown to be highly effective in predicting cardiovascular diseases9. In the medical field, machine learning is rapidly entering all aspects of clinical practice, from preprocessing clinical data to precise patient stratification and the personalization of treatment strategies. Specifically, it has played a significant role in disease diagnosis, treatment risk assessment, drug development, and medical data analysis10.

At present, no dedicated model exists for predicting the risk of diabetes in patients with CHD. This study employs machine learning algorithms to develop a clinical risk prediction model for CHD-DM2. By deeply mining and integrating clinical data from CHD-DM2 patients and systematically analyzing the key factors contributing to disease development, this model provides a solid foundation for early intervention and treatment. The incorporation of machine learning enables more precise individual risk assessment and offers new perspectives and possibilities for disease management, highlighting promising clinical applications.

Materials and methods

Study population

A retrospective collection of clinical data was conducted on 29,960 cardiovascular disease patients admitted to the First Affiliated Hospital of Xinjiang Medical University between January 1, 2001, and December 31, 2018. The collected data included: basic demographic information (gender, age, education level, occupation); personal lifestyle history (smoking, alcohol consumption); family history (presence of diabetes, hypertension, hyperlipidemia); and laboratory tests such as complete blood count (WBC, NE, LY, MO, EO, BA, NE1, LY1, etc.).

Inclusion criteria were as follows. (1) CHD patients: diagnosed with CHD by coronary angiography (CAG) or CTA, with clear clinical manifestations such as angina pectoris or other ischemic symptoms; age ≥ 18 years. (2) CHD-DM2 patients: met all inclusion criteria for CHD; diagnosed with T2DM based on indicators such as C-peptide level, islet autoantibody testing, or age at diabetes onset; complete glycemic control records were available, including data on glycated hemoglobin (HbA1c).

Exclusion criteria were as follows. (1) Incomplete or erroneous data: patients with missing key clinical information, such as diagnostic records, or with significant data inconsistencies. (2) Severe comorbidities: patients with serious hepatic or renal dysfunction, or other systemic diseases that could interfere with study validity. (3) Individuals with active malignancies receiving chemotherapy or radiotherapy. Based on the above inclusion and exclusion criteria, a total of 12,400 eligible patients with CHD and CHD-DM2 were ultimately included in the study.

This study is a retrospective analysis, with data sourced from the medical records of cardiovascular disease patients admitted to the First Affiliated Hospital of Xinjiang Medical University from January 1, 2001, to December 31, 2018. The study protocol was reviewed and approved by the Ethics Committee of Xinjiang Medical University (Approval Number: XJYKDXR20250515001), and exemption from informed consent was granted. The decision to waive informed consent was based on the following criteria.

1. International ethical guidelines. In accordance with Article 32 of the Declaration of Helsinki (2013 revised edition), informed consent may be waived in the following circumstances. (a) The research risk is extremely low: this study only involves statistical analysis of anonymized medical record data and does not involve any intervention measures. According to the ethics committee’s risk assessment, the risk level of this study is “lowest,” which complies with the core principle of the Declaration that “the research risk is no greater than the minimum risk.” (b) The secondary use of data is reasonable: the research data is derived from medical records generated during the diagnostic and treatment process, constituting lawful and compliant secondary use. The data has undergone double anonymization (removal of direct and indirect identifiers such as names, ID numbers, and hospital admission numbers), ensuring it cannot be traced back to individuals, in accordance with the Declaration’s requirement that “exemption from informed consent shall not adversely affect the rights and health of research participants.” (c) There are objective limitations in contacting participants: due to the study’s long time span of 18 years (2001–2018), after verification by the hospital’s medical records department, over 85% of patients’ contact information was found to be invalid or changed, meeting the exception in the Declaration that “if requiring informed consent would prevent the study from being conducted.”

2. Legal basis in China. According to Article 23 of the “Ethical Review Measures for Life Sciences and Medical Research Involving Human Subjects” jointly issued by the National Health Commission and three other ministries in 2023 (National Health Commission Science and Education Development [2023] No. 4), an ethics committee may approve research exempt from informed consent if the following conditions are met. (i) Risk controllability: the research poses extremely low risk and participants cannot be contacted (in this study, patients became unreachable due to the time span), meeting the criteria of Item (i) of this provision. (ii) Anonymization standards: data has been thoroughly anonymized in accordance with the requirements of the Personal Information Protection Law, with all identifiable information removed, meeting the standard of “cannot be traced back to an individual” specified in Item (ii) of this provision. (iii) Protective measures for rights and interests: the study does not involve risks of personal privacy breaches or conflicts of commercial interest, meeting the requirements of Item (iii) of this provision.

Therefore, this study strictly adheres to the principle of “minimizing risks and maximizing rights and interests,” and the decision to waive informed consent complies with international ethical guidelines and Chinese laws and regulations.

Data preprocessing

To ensure the integrity and accuracy of the analysis, we used the dplyr package in R (version 3.6.1) to identify variables with more than 30% missing values (e.g., Age, Educational Level), which were excluded from the final dataset. For variables with a missing rate below 30%, the mice package was used to perform multiple imputation, estimating and replacing missing data to preserve the completeness and continuity of the dataset (Fig. 1).

The dummyVars function from the caret package in R was used to generate dummy variables for categorical data, coded as follows: Male (female = 0, male = 1); Educational Level (below high school = 1, high school or GED = 2, vocational school = 3, university = 4); Professional (mental worker = 0, manual worker = 1); Current Smoker (no = 0, yes = 1); Current Drinker (no = 0, yes = 1); Hypertensive History (no = 0, yes = 1); Diabetes History (no = 0, yes = 1); Pro (negative = 0, positive = 1); and Glu (negative = 0, positive = 1).

In clinical data research, missing values can degrade model accuracy and may even result in misleading conclusions. Moreover, due to objective differences in disease prevalence, imbalanced distributions between positive and negative cases are common in medical datasets11, leading to poor classification performance for minority class samples12. To address data imbalance, this study applied the SMOTENC algorithm, as implemented in the themis package in R, to generate synthetic samples for the minority class and balance the class distribution within the dataset.
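The 30% missing-rate rule described above can be illustrated with a minimal sketch. The study itself used dplyr and mice in R; the Python below is only an illustration under assumptions, and the toy records, column names, and the `missing_rate` helper are hypothetical. Multiple imputation of the retained variables is not shown.

```python
# Hypothetical toy records; None marks a missing value.
records = [
    {"WBC": 6.1, "HbA1c": 5.9, "Occupation": None},
    {"WBC": 7.4, "HbA1c": None, "Occupation": None},
    {"WBC": None, "HbA1c": 7.2, "Occupation": "manual"},
    {"WBC": 5.8, "HbA1c": 6.4, "Occupation": None},
]

MISSING_THRESHOLD = 0.30  # variables above this missing rate are dropped


def missing_rate(rows, column):
    """Fraction of rows in which `column` is missing."""
    return sum(row[column] is None for row in rows) / len(rows)


rates = {col: missing_rate(records, col) for col in records[0]}
# Variables at or below the threshold are kept for imputation.
kept = [col for col, rate in rates.items() if rate <= MISSING_THRESHOLD]
dropped = [col for col, rate in rates.items() if rate > MISSING_THRESHOLD]
```

Here `kept` lists the variables that would go on to imputation and `dropped` those that would be excluded. How a variable with exactly 30% missingness is treated is an assumption of this sketch.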

Fig. 1

Visualization of missing value patterns.

Feature selection

Feature selection is an important and commonly used dimensionality reduction technique that aims to identify an optimal subset of features by removing irrelevant and redundant information from the dataset13,14. By interpreting the most relevant features, deeper insights into the problem can be obtained15. The LASSO regression algorithm enables dimensionality reduction and variable selection for high-dimensional data16. In this study, a combined approach of univariate analysis and LASSO regression was employed. Potential candidate features were initially identified using univariate analysis with the nortest package in R, selecting variables with statistical significance (P < 0.05). These candidates were further refined using LASSO regression via the glmnet package in R, which introduces a penalty term to reduce model complexity and ultimately selects the most predictive feature set.
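For reference, the penalty term mentioned above can be written out. The expression below is the standard L1-penalized logistic regression objective, assuming a binary outcome; the symbols \(\beta_0\), \(\beta\), \(\lambda\), and \(p_i\) are notation introduced here for illustration and do not appear in the original description.

$$\min_{\beta_0,\beta}\; -\frac{1}{n}\sum_{i=1}^{n}\left[ y_i \log p_i + \left(1-y_i\right)\log\left(1-p_i\right) \right] + \lambda \sum_{j=1}^{m}\left|\beta_j\right|, \qquad p_i = \frac{1}{1+e^{-\left(\beta_0+\beta^{T}x_i\right)}}$$

Larger values of \(\lambda\) shrink more coefficients exactly to zero, which is what makes the method usable for variable selection.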

Model construction

In recent years, Machine Learning (ML) techniques have been widely applied in medical research, leveraging large datasets to uncover complex patterns that may not be readily identifiable by human observers, thereby offering a promising alternative approach17. In this study, seven machine learning models—XGBoost, Random Forest (RF), LightGBM, Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Logistic Regression, and Logistic_Lasso—were used to construct predictive models. These algorithms are commonly applied to binary classification tasks in coronary heart disease and diabetes-related research18.

XGBoost algorithm

XGBoost is an extended implementation of the Boosting ensemble algorithm. By integrating multiple weak classifiers, it constructs an efficient decision tree–based ensemble learning framework, significantly reducing computational complexity and runtime while improving algorithmic efficiency. The objective function consists of two key components: a loss term that measures the difference between predicted and actual values, and a regularization term that controls model complexity and prevents overfitting. The objective function is expressed as:

$${\text{Obj}}\left( {{\uptheta }} \right) = \sum\limits_{{{\text{i}} = 1}}^{{\text{n}}} {\text{l}} \left( {{\text{y}}_{{\text{i}}} ,{{\hat{\text{y}}}}_{{\text{i}}} } \right) + \sum\limits_{{{\text{k}} = 1}}^{{\text{K}}} \Omega \left( {{\text{f}}_{{\text{k}}} } \right)$$
(1)

Here, \({\text{y}}_{\text{i}}\) denotes the true value, and \({\hat{\text{y}}}_{\text{i}}\) represents the predicted value. The function l refers to the loss function, commonly mean squared error (MSE) or log loss. \({\text{f}}_{\text{k}}\) denotes the k-th decision tree, and \(\Omega\) is the regularization term used to control model complexity.
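In the standard XGBoost formulation, the regularization term for a single tree is usually written as below. The symbols \(T_k\) (number of leaves of the k-th tree), \(w_{k,j}\) (weight of its j-th leaf), \(\gamma\), and \(\lambda\) are introduced here for illustration; the hyperparameter values used in this study are not stated in the text.

$$\Omega\left(f_k\right) = \gamma T_k + \frac{1}{2}\lambda \sum_{j=1}^{T_k} w_{k,j}^{2}$$

The first term penalizes trees with many leaves, and the second penalizes large leaf weights.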

RF algorithm

RF is an ensemble learning algorithm composed of multiple decision trees. It enhances model stability and predictive performance by constructing a multitude of trees. The core idea is to aggregate several weak classifiers (decision trees) into a strong classifier, improving prediction accuracy through majority voting or averaging. It can handle various types of features and reduces overfitting by randomly selecting features and training data subsets for each tree19. The RF model can be expressed as:

$$\text{f}\left(\text{x}\right)=\frac{1}{\text{B}}{\sum \limits_{\text{i}=1}^{\text{B}}}{\text{h}}_{\text{i}}\left(\text{x}\right)$$
(2)

Here, B is the number of decision trees in the forest, and \({h}_{i}\)(x) indicates the output of the i-th tree for a given input x.
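The aggregation step in Eq. (2) can be sketched as follows. This is a minimal illustration in Python rather than the R implementation used in the study; the tree outputs are hypothetical and the helper names are assumptions.

```python
from collections import Counter


def average_vote(tree_outputs):
    """Equation (2): mean of the B tree outputs h_i(x)."""
    return sum(tree_outputs) / len(tree_outputs)


def majority_vote(tree_labels):
    """Most common class label among the trees."""
    return Counter(tree_labels).most_common(1)[0][0]


# Hypothetical outputs of B = 5 trees for one patient (1 = CHD-DM2, 0 = CHD).
tree_labels = [1, 0, 1, 1, 0]
forest_score = average_vote(tree_labels)   # fraction of trees voting for class 1
forest_label = majority_vote(tree_labels)  # class chosen by most trees
```

With 0/1 labels, the average is simply the share of trees voting for the positive class, so averaging and majority voting agree whenever that share is not exactly one half.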

LightGBM algorithm

LightGBM is based on the Gradient Boosting Decision Tree (GBDT) framework, an ensemble learning method that iteratively adds new trees to correct errors made by the previous ones, thereby enhancing predictive performance. By leveraging histogram-based algorithms, leaf-wise growth strategies with depth constraints, and parallel optimization, LightGBM achieves significant advantages in training speed, memory efficiency, and scalability for large-scale distributed data processing20. The objective function is defined as:

$${\text{L}}\left( {\uptheta } \right) = \mathop \sum \limits_{{{\text{i}} = 1}}^{{\text{N}}} {\text{l}}\left( {{\text{y}}_{{\text{i}}} ,{\hat{\text{y}}}_{{\text{i}}} } \right) + \Omega \;({\text{T}})$$
(3)

In this expression, \({\text{l}}\left( {{\text{y}}_{{\text{i}}} ,{{\hat{\text{y}}}}_{{\text{i}}} } \right)\) refers to the individual sample loss, while Ω \(\left(\text{T}\right)\) serves as a regularization component to control model complexity and reduce the risk of overfitting.

SVM algorithm

SVM is a classification method based on the principle of structural risk minimization. Its core idea is to find the optimal hyperplane that maximizes the margin between different classes, effectively separating them while maximizing the distance from the hyperplane to the nearest data points—known as support vectors. For datasets that are not linearly separable in the original feature space, SVM employs kernel functions to map the data into a higher-dimensional space where linear separation becomes feasible21. The SVM model can be expressed as:

$$\text{y}\left(\text{x}\right)={\text{W}}^{\text{T}}\phi \left(\text{x}\right)+\text{b}$$
(4)

Here, W is the weight vector, \(\phi \left(\text{x}\right)\) denotes the mapping function that transforms the input sample into a high-dimensional feature space, b is the bias term, and \(\text{y}\left(\text{x}\right)\) represents the predicted value for sample x.

KNN algorithm

The KNN algorithm is a non-parametric and intuitive supervised learning method. Its core idea is that if the majority of a sample’s k nearest neighbors in the feature space belong to a particular class, the sample is assigned to that class as well. The nearest neighbors are found by calculating the distance between the query sample and all samples in the training dataset22. The distance between test and training samples is computed using the following formula:

$$\text{d}\left(\text{X},{\text{X}}_{\text{i}}\right)=\sqrt{{\sum }_{\text{j}=1}^{\text{m}}{\left({\text{x}}_{\text{j}}-{\text{x}}_{\text{ij}}\right)}^{2}}$$
(5)

Here, m denotes the number of features. After computing the distances, the k nearest samples with the smallest distances are selected, and their labels or values are used to predict the outcome of the test sample.
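A minimal sketch of the procedure, assuming Euclidean distance as in Eq. (5) and simple majority voting. The sample data are hypothetical and the function names are illustrative, not taken from the study's R code.

```python
import math
from collections import Counter


def euclidean(x, x_i):
    """Equation (5): distance between two feature vectors of length m."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_i)))


def knn_predict(query, train_x, train_y, k):
    """Majority label among the k training samples closest to `query`."""
    order = sorted(range(len(train_x)), key=lambda i: euclidean(query, train_x[i]))
    nearest_labels = [train_y[i] for i in order[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical, already-scaled two-feature samples (1 = CHD-DM2, 0 = CHD).
train_x = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1), (1.2, 0.8)]
train_y = [0, 0, 1, 1, 1]
prediction = knn_predict((0.95, 0.9), train_x, train_y, k=3)
```

Because the distance sums squared differences across features, features on larger scales dominate unless the data are standardized first; the sketch assumes this has already been done.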

Logistic regression algorithm

Logistic regression is a widely used statistical model for binary classification tasks. Its core concept involves applying the sigmoid function to a linear combination of input features, thereby transforming regression outputs into probability values between 0 and 1 and effectively converting a regression problem into a classification task23.

$${h}_{\theta }\left(x\right)=\frac{1}{1+{e}^{-{\theta }^{T}x}}$$
(6)

In this expression, \({h}_{\theta }\left(x\right)\) represents the predicted probability for the input sample, x is the feature vector, and \(\theta\) is the parameter vector of the model, where each parameter corresponds to the weight of an input feature.
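As a worked illustration of Eq. (6), the sketch below computes a predicted probability for one feature vector. The coefficients, the feature values, and the 0.5 classification cut-off are assumptions chosen for the example, not values from the study.

```python
import math


def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(theta, x):
    """Equation (6): sigmoid of the linear combination theta^T x."""
    z = sum(t * v for t, v in zip(theta, x))
    return sigmoid(z)


# Hypothetical coefficients and one feature vector (leading 1.0 = intercept term).
theta = [-1.0, 2.0, 0.5]
x = [1.0, 0.5, 0.0]
probability = predict_proba(theta, x)   # linear term: -1.0 + 1.0 + 0.0 = 0.0
label = int(probability >= 0.5)         # the 0.5 cut-off is an assumption
```

A linear term of zero corresponds to a probability of exactly one half, which is why the decision boundary of logistic regression is linear in the features.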

Model performance assessment

To assess performance and robustness, ten-fold cross-validation was applied to the training dataset, and performance metrics were averaged across folds to give a more reliable estimate. Predictive ability was evaluated using the confusion matrix, the receiver operating characteristic (ROC) curve, the area under the ROC curve (AUC), sensitivity, specificity, and precision. Clinical utility was further assessed using decision curve analysis (DCA), which quantifies net benefit. Feature importance analysis was conducted for the selected models to determine the contribution of each variable to the prediction24, supporting the evaluation of model reliability and clinical applicability.
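The metrics listed above follow directly from the confusion matrix and from the ranking of predicted probabilities. The sketch below shows one way to compute them in plain Python; the labels, scores, and the 0.5 threshold are hypothetical, and the pairwise AUC computation is a simple illustration rather than the routine used in the study.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels coded 0/1."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn


def auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities for eight patients.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
y_pred = [int(s >= 0.5) for s in scores]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
precision = tp / (tp + fp)
accuracy = (tp + tn) / len(y_true)
area = auc(y_true, scores)
```

Sensitivity, specificity, precision, and accuracy depend on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds.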

Model interpretability

SHAP was used to analyze and interpret the results of the machine learning models. As an interpretable machine learning framework, SHAP provides detailed explanations for individual model predictions, enhancing the transparency of ML models and facilitating the adoption of AI technologies in clinical practice25. Its capabilities include quantifying the overall contribution of each feature, illustrating the specific influence of each feature on individual predictions, examining feature interactions, and analyzing the combined effects of feature dependencies26. It supports visualization of feature importance and comprehensive interpretation of model behavior.

In R, the average absolute SHAP values were visualized to rank the relative importance of each variable in the model, providing a comprehensive understanding of their individual contributions to the predictions27. The SHAP beeswarm plot offers an intuitive visualization of how all variables influence model predictions. The SHAP waterfall plot illustrates the direction and magnitude of each feature’s contribution to the final prediction for an individual case. The SHAP dependence plot allows exploration of the relationship between a given variable and its SHAP value, as well as interactions between variables. These visualizations allow the impact of individual variable values on model output to be considered separately for each variable28.
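The quantity behind these plots is the Shapley value. For a very small model it can be computed exactly by enumerating feature subsets, as in the sketch below. This is an illustration of the definition only: SHAP libraries use faster model-specific algorithms, and the toy value function and its numbers are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(value, n_features):
    """Exact Shapley values for a set function `value(frozenset) -> float`."""
    players = range(n_features)
    phi = [0.0] * n_features
    for j in players:
        others = [p for p in players if p != j]
        for size in range(len(others) + 1):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n_features - size - 1) / factorial(n_features)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[j] += weight * (value(s | {j}) - value(s))
    return phi


# Hypothetical toy model: output when the features in `present` are known.
def toy_value(present):
    out = 1.5                       # base value when no feature is known
    if 0 in present:
        out += 0.2
    if 1 in present:
        out -= 0.1
    if 0 in present and 2 in present:
        out += 0.1                  # interaction between features 0 and 2
    return out


phi = shapley_values(toy_value, 3)
full = toy_value(frozenset({0, 1, 2}))
base = toy_value(frozenset())
```

The interaction effect is split equally between the two interacting features, and the contributions add up to the difference between the full prediction and the base value.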

Results

Baseline characteristics analysis

A total of 29,960 patients with cardiovascular disease were screened according to strict inclusion and exclusion criteria and subjected to multiple imputation (see Fig. 1). Baseline analysis was conducted on the remaining 12,400 eligible patients. Among them, 10,257 patients (82.7%) were in the CHD group, and 2,143 patients (17.3%) were in the CHD-DM2 group. Detailed results are presented in Table 1.

Table 1 Baseline characteristics of study population.

Baseline characteristics were compared between the CHD-DM2 group and the CHD group, and 62 indicators differed significantly between the two groups. The CHD-DM2 group had a lower median weight than the CHD group (74 kg [65.0; 83.0] vs 75 kg [66.0; 82.0], P < 0.05). The proportion of mental workers was higher in the CHD-DM2 group than in the CHD group (81.9% vs 77.8%, P < 0.01), which may be related to differences in lifestyle, dietary habits, and occupational stress among specific professional groups.

The CHD-DM2 group showed a markedly higher prevalence of hypertensive history than the CHD group (62.2% vs 46.0%, P < 0.001). Additional analyses identified significant intergroup differences in laboratory measures such as WBC count, MO1 count, hemoglobin levels, MCV, and MCH (all P < 0.05). These findings may reflect the multifaceted influence of T2DM on the physiological status of individuals with CHD.

MPV and PDW levels were significantly elevated in the CHD-DM2 group compared with the CHD group (P < 0.001), indicating potential alterations in platelet function or activity associated with diabetes. Furthermore, the CHD-DM2 group showed higher positivity rates for Pro and Glu than the CHD group (8.91% vs 5.43% and 27.5% vs 5.87%, respectively; both P < 0.001), consistent with effects of T2DM on renal function and glycometabolic regulation.

In terms of coagulation function, the CHD-DM2 group had a higher PT.Activity than the CHD group (105% vs 102%, P < 0.001), which may be associated with diabetes-related imbalance in the coagulation and fibrinolytic systems. Regarding biochemical markers, BUN levels differed slightly between the groups (5.36 mmol/L vs 5.20 mmol/L, P < 0.001), suggesting differences in metabolic status. Finally, BG levels were significantly higher in the CHD-DM2 group than in the CHD group (6.81 mmol/L vs 5.05 mmol/L, P < 0.001), reflecting the diagnostic characteristics of diabetes and the need for glycemic control. Detailed data are presented in Table 1.

Class imbalance handling

Table 2 compares the original dataset with the balanced dataset obtained using the SMOTENC algorithm. The original dataset contained 10,257 CHD cases and 2,143 CHD-DM2 cases, a substantial class imbalance with an imbalance ratio of 4.786. To address this issue and limit bias in the results, we used the smotenc() function from the themis package in R (version 3.6.1) to balance the data. The number of neighbors k was set to 10 after different k values were tested, and the parameter over_ratio was set to 1, the target ratio of minority-class to majority-class frequencies. Oversampling was applied to the minority class and undersampling to the majority class to obtain a balanced distribution of CHD and CHD-DM2 cases. Detailed distributions of class observations in both balanced and imbalanced training sets are shown in Table 2 and Fig. 2.
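The imbalance ratio quoted above, and the core interpolation step of SMOTE-type oversampling, can be sketched as follows. This is an illustration in Python, not the themis implementation: it covers continuous features only, whereas SMOTENC additionally assigns each categorical feature the most frequent category among the neighbors. The feature values and the `smote_point` helper are hypothetical.

```python
import random

n_chd, n_chd_dm2 = 10_257, 2_143          # class counts reported in the text
imbalance_ratio = n_chd / n_chd_dm2       # majority count / minority count


def smote_point(sample, neighbor, rng):
    """SMOTE step for continuous features: a random point on the segment
    between a minority sample and one of its minority-class neighbors."""
    gap = rng.random()
    return [a + gap * (b - a) for a, b in zip(sample, neighbor)]


rng = random.Random(0)
# Hypothetical minority-class samples described by two continuous features.
synthetic = smote_point([6.8, 7.1], [7.4, 7.9], rng)
```

Each synthetic sample lies between two real minority samples, so oversampling fills in the minority region rather than duplicating existing records.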

Table 2 Description of the original and balanced data.
Fig. 2

Visualization of the original and balanced datasets using the SMOTENC algorithm.

Feature selection

We combined univariate analysis with LASSO regression for feature selection. Univariate analysis served as a preliminary screening step, narrowing the original pool of variables to 62 candidates, with the error rate controlled during this process. LASSO regression was then used to refine the feature set further, with ten-fold cross-validation used to evaluate model performance across different data subsets. Features with non-zero regression coefficients under cross-validation were retained, resulting in a final set of 25 key predictors out of the 62 candidates. These predictors, including Hypertensive.history, Diabetes.history, Weight, MCH, MCHC, PDW, Pro, ISR, Glu, BG, HbA1c, TG, LDL, APO.B, MVE, LVES, MVA, MVEA, SV, LAD, LCX, OM, and RCA, were identified as having the most significant influence on model predictive performance (Fig. 3A and B).
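Why LASSO leaves some coefficients at exactly zero can be seen from the soft-thresholding operator, which gives the LASSO solution in the idealized case of uncorrelated, standardized predictors. The sketch below is an illustration under that assumption and is not the glmnet procedure used in the study; the coefficient values, variable names, and penalty strength are hypothetical.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


# Hypothetical unpenalized coefficients for four standardized predictors.
raw = {"x1": 0.80, "x2": -0.45, "x3": 0.05, "x4": -0.08}
lam = 0.10  # penalty strength; in practice chosen by cross-validation

shrunk = {name: soft_threshold(beta, lam) for name, beta in raw.items()}
# Only predictors with non-zero coefficients are retained.
selected = [name for name, beta in shrunk.items() if beta != 0.0]
```

Predictors whose unpenalized effect is smaller than the penalty drop out entirely, while the remaining coefficients are shrunk toward zero.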

Fig. 3

Importance of each predictor in the LASSO model. (A) Coefficient profile plot for the LASSO regression model. (B) Optimal parameter selection via ten-fold cross-validation.

Model construction and evaluation

The study population (n = 12,400) was randomly divided into a training set (n = 8,460) and a test set (n = 3,680) in a 7:3 ratio. The training set was used to develop predictive models, and the test set was used for further validation. Participants were categorized into a CHD group (n = 10,257) and a CHD-DM2 group (n = 2,143), and the models predicted CHD-DM2 status. The overall research workflow is illustrated in Fig. 4, which outlines the steps of data partitioning, model construction, and validation.

Fig. 4

Workflow diagram of the machine learning procedures in this study.

Seven widely used machine learning models were constructed and validated using the training and test datasets: Logistic, Logistic_Lasso, SVM, KNN, XGBoost, RF, and LightGBM. When models were trained and tested on the original dataset (Table 3), the RF model performed best, with AUCs of 1.0000 and 0.9985 and accuracies of 0.9502 and 0.9282 in the training and test sets, respectively. The XGBoost model also outperformed the remaining models in both sets. On the balanced dataset (Table 4), RF achieved the highest values, with AUC, accuracy, sensitivity, and specificity of 1 on both the training and test sets, indicating strong discriminative capability. LightGBM and XGBoost followed: LightGBM achieved AUCs of 0.9816 and 0.9683 and accuracies of 0.9289 and 0.8367 on the training and test sets, respectively, and XGBoost achieved AUCs of 0.9869 and 0.91 and accuracies of 0.9439 and 0.8703. Figure 5 presents radar plots that show the effect of data balancing on model performance. Compared with the imbalanced dataset, all seven models exhibited markedly reduced errors when trained on the balanced dataset, underscoring the role of class balancing in improving model performance. On the balanced dataset, the RF model showed well-balanced and high values across accuracy, AUC, specificity, sensitivity (recall), and precision, indicating strong overall performance.

Table 3 Performance metrics of seven machine learning models on training and test sets using the original dataset.
Table 4 Performance metrics of seven machine learning models on training and test sets using the balanced dataset.
Fig. 5

Radar plots of seven machine learning models. (A) Radar plot of models on the training set from the original dataset; (B) Radar plot of models on the test set from the original dataset; (C) Radar plot of models on the training set from the balanced dataset; (D) Radar plot of models on the test set from the balanced dataset.

Figure 6 shows the ROC curves of five machine learning models trained on both balanced and imbalanced datasets. All models performed better when trained on the balanced dataset. On the imbalanced dataset, the XGBoost model achieved an AUC of 0.8706, outperforming the other four models (Fig. 6). On the balanced dataset, the AUC of the XGBoost model increased to 0.9594, again surpassing the other four models (Fig. 6). The DCA curves illustrate the clinical utility of the predictive models by evaluating net benefits across different probability thresholds29.

Fig. 6

ROC curve analysis of seven machine learning models. (A) ROC curves of models trained on the original dataset; (B) ROC curves of models tested on the original dataset; (C) ROC curves of models trained on the balanced dataset; (D) ROC curves of models tested on the balanced dataset.

Figure 7 displays the DCA curves of the seven models trained on both the original and balanced datasets. Compared to the original dataset, all models trained on the balanced dataset exhibited improved performance. Each of the seven models is represented by a distinct colored curve in the figure. Two benchmark lines—“All” (assuming all patients are high-risk) and “None” (assuming all patients are low-risk)—are also shown. A higher curve indicates greater net benefit at a given threshold. The RF (pink), LightGBM (yellow), and XGBoost (green) models showed higher curves, suggesting superior clinical utility at corresponding thresholds.
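The net benefit plotted in a decision curve is commonly computed as the true-positive rate per patient minus the false-positive rate per patient weighted by the odds of the threshold. The sketch below illustrates this for a single threshold; the counts, prevalence, and threshold are hypothetical and not taken from the study.

```python
def net_benefit(tp, fp, n, threshold):
    """Net benefit of a model at a given probability threshold."""
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(prevalence, threshold):
    """The "All" reference line: every patient is classified as high-risk."""
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Hypothetical counts for 1,000 patients at a 20% threshold.
nb_model = net_benefit(tp=150, fp=100, n=1000, threshold=0.2)
nb_all = net_benefit_treat_all(prevalence=0.17, threshold=0.2)
nb_none = 0.0   # the "None" reference line: no patient is classified as high-risk
```

A model is useful at a given threshold only if its net benefit exceeds both reference lines, which is how the curves in a DCA plot are read.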

Fig. 7

Decision curve analysis (DCA) of seven machine learning models. (A) DCA curves for models trained on the original dataset; (B) DCA curves for models tested on the original dataset; (C) DCA curves for models trained on the balanced dataset; (D) DCA curves for models tested on the balanced dataset.

Analysis of feature importance

Figure 8 illustrates the contributions of the variables across the seven predictive models trained on the balanced dataset. In the KNN model, Diabetes history, BG, HbA1c, LAD, RCA, LCX, Glu, Hypertensive history, MVEA, and MVA were identified as the most important predictors, and the same ten variables were the critical predictive factors in the SVM model. In the XGBoost model, Diabetes history, BG, HbA1c, and PDW emerged as key predictive factors. By contrast, the logistic regression model highlighted Diabetes history, LAD, Glu, LCX, Pro, and TG, and the Logistic_Lasso model highlighted Diabetes history, LAD, Glu, LCX, RCA, and PDW.

Fig. 8

Variable importance across seven machine learning models trained on the balanced dataset.

In summary, variables such as Diabetes history, BG level, HbA1c, and LAD consistently emerged as important across different models. This highlights their significance in influencing the risk of CHD-DM2 and suggests they are key targets for risk assessment and potential intervention strategies.

Interpretability of the model

To provide a more intuitive understanding of how the selected variables contribute to predicting CHD-DM2, we applied SHAP-based interpretability analysis. Figure 9 displays the top ten features ranked by importance, with Diabetes history, LVES, ISR, BG, and HbA1c identified as key factors, and LAD, Weight, CH, and MVA as additional important features. In the SHAP summary plot, variables are listed in descending order of importance, and each dot represents an individual patient, with the x-axis indicating the Shapley value. The grey vertical line separates positive from negative impacts: positive Shapley values push the prediction upward, while negative values push it downward. Dot color encodes the feature value, with darker purple denoting higher values and yellow lower values. The top three most influential variables in the model were Diabetes history, LVES, and ISR.

Fig. 9

Visualization of SHAP values. (A) SHAP beeswarm plot displaying feature contributions across all samples; (B) Bar plot ranking features by mean absolute SHAP values.
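The ranking used in a bar plot of mean absolute SHAP values can be sketched in a few lines. The example below is illustrative only: the matrix `toy_shap` and the feature names are hypothetical placeholders, not values from this study, and the helper `rank_by_mean_abs_shap` is a name introduced here.

```python
def rank_by_mean_abs_shap(shap_matrix, feature_names):
    """Rank features by the mean of |SHAP value| over all samples.

    shap_matrix: one row per sample, one column per feature.
    """
    n_rows = len(shap_matrix)
    scores = {}
    for j, name in enumerate(feature_names):
        scores[name] = sum(abs(row[j]) for row in shap_matrix) / n_rows
    # Largest mean |SHAP| first, as in a SHAP importance bar plot.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical SHAP values for three samples and three features.
toy_shap = [
    [0.30, -0.10, 0.05],
    [-0.25, 0.15, -0.02],
    [0.40, -0.05, 0.01],
]
toy_names = ["feature_a", "feature_b", "feature_c"]

ranking = rank_by_mean_abs_shap(toy_shap, toy_names)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Taking absolute values before averaging matters: a feature whose contributions are large but of mixed sign (such as `feature_a` here) would otherwise appear unimportant.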

To assess the contribution of each feature to the model output at the individual level, SHAP waterfall plots were generated for two representative patients, one with CHD and one with CHD-DM2. The y-axis lists feature names, while the x-axis represents SHAP values. In Fig. 10, purple bars denote negative contributions to the predicted value, whereas yellow bars indicate positive contributions. Labels on the bars show how far each feature shifts the prediction from the model’s base value. Features are ranked in descending order of absolute contribution. In Fig. 10A, the ISR feature value is 2.31, contributing +0.248 relative to a base value of 1.5. In Fig. 10B, Diabetes.history contributes −0.153 to the prediction.

Fig. 10

SHAP waterfall plots illustrating individual-level model predictions. (A) SHAP-based explanation of a representative CHD patient. (B) SHAP-based explanation of a representative CHD-DM2 patient.
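To make the logic behind a waterfall plot concrete, the following sketch computes exact Shapley values for a single instance by enumerating all feature coalitions, then orders them by absolute contribution. It is a minimal sketch under stated assumptions, not the analysis pipeline used in this study: `toy_model`, the background rows, and the patient values are invented for illustration, and features outside a coalition are filled in from background rows (one common choice of value function).

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, background):
    """Exact Shapley values for one instance via coalition enumeration."""
    n = len(x)

    def value(coalition):
        # Features in the coalition take the instance's values; the rest
        # are taken from each background row, and predictions are averaged.
        total = 0.0
        for row in background:
            mixed = [x[j] if j in coalition else row[j] for j in range(n)]
            total += predict(mixed)
        return total / len(background)

    phi = []
    for j in range(n):
        others = [k for k in range(n) if k != j]
        contrib = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                contrib += weight * (value(s | {j}) - value(s))
        phi.append(contrib)
    return value(set()), phi  # (base value, per-feature contributions)


# Hypothetical model and data, for illustration only.
def toy_model(v):
    return 1.0 + 0.5 * v[0] + 0.1 * v[1] + 0.2 * v[0] * v[2]


names = ["Diabetes history", "BG", "ISR"]
background = [[0, 5.0, 1.0], [1, 7.5, 2.0], [0, 6.0, 1.5], [1, 8.0, 2.5]]
patient = [1, 9.0, 2.31]

base, phi = shapley_values(toy_model, patient, background)

# Waterfall order: largest absolute contribution first.
waterfall = sorted(zip(names, phi), key=lambda p: abs(p[1]), reverse=True)
print(f"base value = {base:.3f}")
for name, contribution in waterfall:
    print(f"{name:18s} {contribution:+.3f}")
print(f"prediction = {base + sum(phi):.3f}")
```

Exhaustive enumeration scales exponentially with the number of features, so it is practical only for toy cases like this one; SHAP software relies on model-specific or sampling-based approximations instead.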

In Fig. 11, the baseline value is \(E[f(X)]=1.5\), which, as in the waterfall plots above, approximates the average predicted value across the entire training dataset. The model output \(f(X)=1.71\) is also shown in the figure. Feature names and their values are displayed at the top of the plot, above the corresponding bars. Yellow bars on the left represent features that shift the prediction upward from the baseline, whereas purplish-red bars on the right represent features that shift it downward. The length of each bar reflects the magnitude of that feature’s contribution.

Fig. 11

SHAP force plot indicating feature-level contributions for a case predicted to survive.
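The link between the baseline and the model output in the force plot follows from the additivity (local accuracy) property of SHAP values. As a brief worked check using the two values reported for Fig. 11, with \(\phi_j\) denoting the SHAP value of feature \(j\) and \(M\) the number of features (notation introduced here for illustration):

\[
f(X) = E[f(X)] + \sum_{j=1}^{M} \phi_j
\quad\Longrightarrow\quad
\sum_{j=1}^{M} \phi_j = 1.71 - 1.5 = 0.21
\]

That is, the upward and downward contributions in the plot should net to approximately +0.21 for this case.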

Discussion

In recent years, large-scale bioinformatics analyses have increasingly focused on identifying key biomarkers, garnering unprecedented attention30,31. The growing scale and inherent complexity of biological data have driven the widespread application of machine learning in biological research32. Artificial intelligence (AI) approaches offer considerable advantages in predictive accuracy and operational efficiency over traditional and domain-specific methods33. As a powerful family of algorithms, machine learning enables the precise identification of complex patterns in data to perform key tasks such as prediction and classification. With the explosive growth of data availability and computing power, machine learning has been extensively applied and deeply integrated across both industry and natural sciences, demonstrating its irreplaceable value and potential34,35.

In this study, the SMOTENC algorithm was combined with state-of-the-art machine learning approaches to mitigate class imbalance and reduce bias in the analysis. Because it synthesizes new minority-class instances rather than redundantly resampling existing ones, the algorithm lowers the likelihood of overfitting36. Comparative analyses of the seven machine learning models (kNN, SVM, XGBoost, Logistic_Lasso, Logistic, RF, and LightGBM) before and after data balancing demonstrated substantial improvements in predictive performance once balancing was applied. Notably, the RF, LightGBM, and XGBoost models exhibited the best predictive metrics on the balanced dataset, achieving the highest precision, recall, and AUC.
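For readers unfamiliar with SMOTE-NC, the sketch below illustrates its core idea on mixed numeric and categorical data: a synthetic minority row is created by interpolating the numeric features between a minority sample and one of its nearest minority neighbours, and by assigning each categorical feature the most common value among those neighbours. This is a simplified illustration under stated assumptions, not the implementation used in this study or that of any particular library; the function `smotenc_sketch`, its parameters, and the toy rows (ordered as BG, HbA1c, diabetes history) are hypothetical.

```python
import random
from collections import Counter
from statistics import median, pstdev


def smotenc_sketch(minority, cat_idx, n_new, k=3, seed=0):
    """Generate n_new synthetic minority rows (simplified SMOTE-NC)."""
    rng = random.Random(seed)
    n_features = len(minority[0])
    num_idx = [j for j in range(n_features) if j not in cat_idx]
    # A categorical mismatch is penalised by the median standard
    # deviation of the numeric features in the minority class.
    penalty = median([pstdev([row[j] for row in minority]) for j in num_idx])

    def distance(a, b):
        d = sum((a[j] - b[j]) ** 2 for j in num_idx)
        d += sum(penalty ** 2 for j in cat_idx if a[j] != b[j])
        return d ** 0.5

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (r for r in minority if r is not base),
            key=lambda r: distance(base, r),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # interpolation fraction in [0, 1)
        new_row = list(base)
        for j in num_idx:
            new_row[j] = base[j] + gap * (mate[j] - base[j])
        for j in cat_idx:
            votes = Counter(r[j] for r in neighbours)
            new_row[j] = votes.most_common(1)[0][0]
        synthetic.append(new_row)
    return synthetic


# Hypothetical minority-class rows: [BG, HbA1c, diabetes history (0/1)].
minority = [
    [7.8, 6.9, 1],
    [8.4, 7.2, 1],
    [6.9, 6.1, 0],
    [9.1, 7.8, 1],
    [7.2, 6.5, 1],
]
synthetic = smotenc_sketch(minority, cat_idx={2}, n_new=4)
for row in synthetic:
    print([round(v, 2) for v in row])
```

Because each synthetic row lies between two existing minority rows rather than duplicating one, the classifier sees a broader minority region instead of repeated copies of the same points.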

CHD develops through a multifactorial process shaped by interactions between genetic predisposition and environmental exposures, with numerous variables contributing simultaneously. Extensive prior research has identified many critical CHD risk factors, such as age, gender, hypertension, diabetes, smoking behavior, and abnormal lipid profiles37. To gain insight into how the selected variables contribute to the CHD-DM2 prediction model, we applied SHAP (SHapley Additive exPlanations) to interpret feature importance. SHAP values quantify each feature’s contribution to the model’s prediction, thereby revealing the most predictive variables. This method not only improves interpretability but also guides subsequent investigation and clinical strategies38. Our findings highlight diabetes history, BG, and HbA1c as key determinants of CHD-DM2 risk. Given their influence, more attention should be devoted to mitigating cardiovascular risk in T2DM populations39. T2DM is marked by chronic hyperglycemia and metabolic dysregulation due to insulin resistance and deficient insulin production, frequently resulting in sustained increases in blood pressure, lipid levels, and glucose, thereby impairing systemic metabolic balance. Epidemiological data indicate that elevated glucose levels, even below the diabetic threshold, significantly raise the risk of cardiovascular events40. In this context, effective glycemic and lipid management in individuals with CHD-DM2 is essential. Maintaining blood glucose within target levels through consistent monitoring and dietary control can limit the damage caused by chronic hyperglycemia and hyperlipidemia, ultimately improving patient outcomes and delaying disease progression.

In recent years, both domestic and international experts have widely acknowledged the strong association of CHD with hypertension and hyperglycemia, prompting active research into therapeutic strategies for CHD-DM2. These advances warrant urgent clinical attention and translation into practice. Specifically, clinicians should closely monitor and document high-risk factors in patients with CHD-DM2 and implement tailored interventions accordingly. With continued improvements in medical infrastructure and research, we anticipate the development of more effective, evidence-based strategies for the control and treatment of CHD-DM2, ultimately improving patient care and outcomes.