Abbreviations.

CAD, coronary artery disease;

CAG, coronary angiogram;

CCTA, coronary computed tomography angiography;

EMRs, electronic medical records;

LDL-C, low density lipoprotein cholesterol;

ML, machine learning;

Low density lipoprotein cholesterol (LDL-C) levels, measured in the blood, play a crucial role in cardiovascular disease development1. Lowering LDL-C levels is an effective strategy for reducing mortality risks and complications associated with these diseases2,3.

To prevent the recurrence of cardiovascular events, clinicians often prescribe high-dose statins to high-risk patients, particularly those with a history of coronary artery disease (CAD)4. Multiple studies have highlighted the effectiveness of high-dose statin therapy in reducing LDL-C levels5. Nevertheless, some studies have revealed that high-dose statin usage may increase side effect risks, including diabetes and liver dysfunction, with such adverse effects being particularly pronounced in Asian populations6,7. Notably, genetic factors enable substantial LDL-C reductions in Asians, even with lower doses of statins8. Tailoring statin doses to individual patient characteristics might be imperative in clinical practice.

Efforts toward achieving real-world clinical applicability have led to increased use of electronic medical records (EMRs) in research. EMRs store a broad range of patient medical data digitally, including diagnoses, medication prescriptions, and test outcomes. Machine learning (ML) techniques can effectively identify complex relationships within EMR datasets9. Prior studies demonstrated the efficacy of combining EMRs and ML across medical settings, including disease prediction, treatment response assessment, and clinical decision support10,11.

The objective of this study is to develop an ML model to predict the likelihood of LDL-C target achievement in CAD patients, with a particular focus on Asian populations. By integrating a range of clinical variables that reflect patient characteristics, the model aims to provide insights into the factors influencing patient outcomes. The model is intended to support clinical decision-making regarding appropriate statin dosages, enhancing patient safety and offering a tool for optimizing LDL-C management.

Methods

Ethical approval

The Institutional Review Board (IRB) of Asan Medical Center (AMC) approved the study protocols (No. 2021-0303) in accordance with the Declaration of Helsinki (2008). This study utilized data from the Asan Biomedical Research Environment (ABLE), a de-identified electronic medical record (EMR) database maintained by AMC, a major tertiary hospital in South Korea12. Because the ABLE database comprises anonymized information, the IRB exempted the study from the requirement for informed consent. All experiments were conducted in compliance with pertinent guidelines and regulations. Patient data, including diagnoses, laboratory test results, and reports, were extracted for patients with CAD admitted to AMC from January 1, 2000, to December 30, 2020.

Study population

The study included patients hospitalized for CAD at AMC from January 1, 2000, to December 31, 2020, who had recorded LDL-C measurements. CAD included myocardial infarction (MI) or unstable angina, stable angina, and asymptomatic CAD (ACAD); patients who underwent coronary revascularization were also included. For patients with multiple CAD-related hospitalizations, only the first hospitalization was analyzed.

The study included patients who had a second LDL-C test conducted 6 to 18 months after discharge. Patients with a history of statin or proprotein convertase subtilisin/kexin type 9 (PCSK9) inhibitor use prior to hospitalization were excluded to avoid confounding treatment effects from concomitant medications. Additionally, patients without statin prescriptions during hospitalization were excluded. Patients who received high-dose or low-dose statins during hospitalization were also excluded, because these regimens made it challenging to provide individualized treatment. Figure 1 outlines the inclusion and exclusion criteria for study subjects.

Fig. 1

Flow chart of study population selection criteria. Figure 1 illustrates the process of selecting the study participants. Initially, LDL-C measurements were taken during hospitalization for patients with CAD. Subsequent measurements were recorded 6–18 months post discharge. After applying exclusion criteria, a total of 9,402 patients were included in the final cohort. Abbreviations: CAD, coronary artery disease; LDL-C, low density lipoprotein cholesterol; PCSK9, proprotein convertase subtilisin/kexin type 9.

The classification of statin dose was established in accordance with the American College of Cardiology and American Heart Association (ACC/AHA) guidelines and through discussions with clinical experts, taking into account the characteristics of both the hospital and the patients13. A detailed summary of this classification is provided in Table 1. To prevent underestimation of statin usage, if multiple doses were recorded on the same day, the highest dose was selected.
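The same-day rule described above can be sketched as follows. The record layout (date, dose in mg) and the function name are illustrative assumptions rather than the study's extraction code, and the sketch assumes all records refer to a single statin, since milligram doses of different statins are not directly comparable.

```python
from collections import defaultdict
from datetime import date


def highest_daily_dose(records):
    """Collapse same-day prescriptions to the highest recorded dose.

    records: iterable of (prescription_date, dose_mg) pairs (hypothetical layout).
    Returns a dict mapping each date to the largest dose seen on that date.
    """
    daily = defaultdict(float)  # unseen dates start at 0.0 mg
    for day, dose_mg in records:
        daily[day] = max(daily[day], dose_mg)
    return dict(daily)


example = [
    (date(2020, 1, 1), 10),
    (date(2020, 1, 1), 20),  # second record on the same day: the higher dose is kept
    (date(2020, 1, 2), 10),
]
per_day = highest_daily_dose(example)
```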

Table 1 Statin dosage intensity groupings.

The study population was divided into two groups based on whether the LDL-C target levels were achieved at the time of the second LDL-C measurement following discharge. According to the ACC guidelines, the target LDL-C level for patients with high-risk cardiovascular disease is less than 70 mg/dL or a reduction of 50% or more14. However, this study is a retrospective analysis conducted at a single institution, and it did not account for LDL-C levels prior to presentation at AMC or the history of statin therapy. To avoid bias resulting from the exclusion of these variables, the requirement for a 50% reduction in LDL-C levels was omitted. Furthermore, based on the guidelines for the management of dyslipidemia in Korea, this study ultimately defined the target LDL-C level as less than 70 mg/dL15.

LDL-C values were based on direct measurements; in the absence of such data, they were estimated using the Friedewald formula16: LDL-C = (total cholesterol − HDL-C) − (triglycerides / 5), where HDL-C is high-density lipoprotein cholesterol and all concentrations are in mg/dL. When LDL-C was measured multiple times on the same day, the mean value was used.
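As a minimal sketch of the rules in the two paragraphs above, the following functions prefer direct measurements, fall back to the Friedewald estimate, average same-day values, and label the outcome against the 70 mg/dL target. The function and constant names are hypothetical, and all inputs are assumed to be in mg/dL.

```python
from statistics import mean

LDL_TARGET_MG_DL = 70.0  # study definition: target achieved if LDL-C < 70 mg/dL


def friedewald_ldl(total_chol, hdl, triglycerides):
    """Friedewald estimate: LDL-C = (TC - HDL-C) - TG / 5, all in mg/dL."""
    return (total_chol - hdl) - triglycerides / 5.0


def ldl_for_day(direct_values, total_chol=None, hdl=None, triglycerides=None):
    """Return one LDL-C value for a day.

    Direct measurements take priority and are averaged when repeated;
    otherwise the Friedewald estimate is used if its inputs are available.
    """
    if direct_values:
        return mean(direct_values)
    if total_chol is None or hdl is None or triglycerides is None:
        return None  # neither a direct value nor a full lipid panel
    return friedewald_ldl(total_chol, hdl, triglycerides)


def achieved_target(follow_up_ldl):
    """Binary outcome label used for prediction."""
    return follow_up_ldl < LDL_TARGET_MG_DL
```

For example, a lipid panel with total cholesterol 200, HDL-C 50, and triglycerides 150 mg/dL gives an estimated LDL-C of 120 mg/dL, which would be labeled as not achieving the target.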

Study design

Patients hospitalized for their initial CAD event from January 2000 to December 2020 were analyzed. All patients received moderate-dose statins during hospitalization for CAD and had at least two LDL-C measurements: one during hospitalization for CAD and another 6 to 18 months after the first measurement. To ensure precise clinical data capture, relevant clinical data were extracted and preprocessed for model feature construction. Three ML models were developed to predict attainment of the 70 mg/dL target LDL-C level. The study design is summarized in Fig. 2.

Fig. 2

Overview of the study. Figure 2 outlines the study process: (1) We extracted diagnoses, medication data, and lab test results from the EMRs of patients hospitalized with CAD who were prescribed moderate-dose statins. (2) We performed data preprocessing, which included one-hot encoding and categorization, to prepare the dataset. (3) We used an ML model to predict whether the patients’ second LDL-C values reached the target levels.

Data processing

The dataset for model development and validation was obtained from the EMRs of AMC. Diagnosis codes were defined according to International Classification of Diseases (ICD-10) standards. To reflect all relevant sub-diagnoses, three-character diagnosis codes were used, considering the hierarchical structure of ICD-10 codes.
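One way to collapse full ICD-10 codes to their three-character category is sketched below. The function name and the assumption that codes may arrive with or without a dot are illustrative; the study's actual mapping is not described.

```python
def icd10_category(code: str) -> str:
    """Return the three-character ICD-10 category of a full code.

    Sub-diagnoses such as I25.10 and I25.2 both map to I25, so they
    contribute to the same model feature.
    """
    cleaned = code.strip().upper().replace(".", "")
    if len(cleaned) < 3:
        raise ValueError(f"not a valid ICD-10 code: {code!r}")
    return cleaned[:3]
```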

The dataset was constructed based on the timing of CAD admission. The status of patients included diagnosis, prescriptions, gender, age, body mass index (BMI), systolic blood pressure (SBP), diastolic blood pressure (DBP), and CAD hospitalization duration. Prescription details reflected the total number of medication days prescribed, indicative of intake extent. Medical history included laboratory test results, smoking status, and whether coronary angiography and coronary CT angiography were performed. Laboratory test results represented median values within one year prior to CAD hospitalization. Diagnosis, prescription, and laboratory test data were defined as model features by selecting information commonly occurring within the patient cohort. Consequently, a dataset comprising 9,402 patients was compiled, including 26 current diagnosis characteristics, 19 past diagnosis characteristics, 30 prescription characteristics, 45 test result characteristics, and 9 baseline characteristics.
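The laboratory feature described above (the median of values in the year before admission) might be computed as in this sketch. The 365-day window, the exclusion of values taken on the admission date itself, and the function name are assumptions, since the text does not specify these boundaries.

```python
from datetime import date, timedelta
from statistics import median


def prior_year_median(measurements, admission_date, window_days=365):
    """Median of lab values measured in the window before admission.

    measurements: iterable of (measurement_date, value) pairs (hypothetical layout).
    Returns None when no value falls inside the window.
    """
    start = admission_date - timedelta(days=window_days)
    values = [v for d, v in measurements if start <= d < admission_date]
    return median(values) if values else None


labs = [
    (date(2018, 1, 1), 300),   # too old, ignored
    (date(2019, 12, 1), 180),
    (date(2020, 3, 1), 200),
    (date(2020, 5, 1), 190),
    (date(2020, 6, 2), 100),   # after admission, ignored
]
feature = prior_year_median(labs, date(2020, 6, 1))
```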

One-hot encoding was used to indicate patients’ conditions. For instance, a heart failure diagnosis was encoded as 1, while its absence was encoded as 0. Variables such as smoking status were coded based on records closest to the CAD admission date, with smokers coded as 1 and ex-smokers or nonsmokers coded as 0. Gender was coded as 1 for males and 0 for females. Categorical variables were transformed into numerical vectors for model use.
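The binary coding scheme above can be illustrated with a short sketch. The feature names, the string values for smoking status and sex, and the diagnosis vocabulary are hypothetical placeholders for whatever the EMR extract provides.

```python
def encode_patient(diagnoses, smoking_status, sex, diagnosis_vocabulary):
    """Encode one patient as a dict of 0/1 features.

    diagnoses: set of three-character diagnosis codes present for the patient.
    diagnosis_vocabulary: ordered list of codes selected as model features.
    """
    features = {f"dx_{code}": int(code in diagnoses) for code in diagnosis_vocabulary}
    # Only current smokers are coded 1; ex-smokers and nonsmokers are coded 0.
    features["current_smoker"] = int(smoking_status == "current")
    # Gender is coded 1 for males and 0 for females.
    features["male"] = int(sex == "M")
    return features


row = encode_patient({"I50"}, "former", "M", ["I50", "E11"])
```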

Numeric variables were preprocessed to fit model characteristics. Missing values were replaced by −1 to avoid confusion with actual measurement values coded as 0. For the logistic regression (LR) model, numerical range normalization between 0 and 1 was performed to mitigate bias from substantial numerical values17.
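The two numeric preprocessing steps can be sketched as separate helpers. The names are hypothetical, and the text does not state how the −1 sentinel is treated during normalization, so the sketch keeps the steps independent rather than assuming an order.

```python
MISSING_SENTINEL = -1.0  # value used in place of missing measurements


def fill_missing(values):
    """Replace missing entries (None) with the sentinel value."""
    return [MISSING_SENTINEL if v is None else float(v) for v in values]


def min_max_scale(values):
    """Rescale a non-empty column to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: avoid division by zero
    return [(v - lo) / (hi - lo) for v in values]
```

Tree-based models such as XGBoost and Random Forest split on thresholds and do not depend on feature scale, which is consistent with normalization being applied only for LR.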

Model development and evaluation

To prevent overfitting and improve general performance, all models underwent stratified K-fold cross-validation18. The training dataset comprised 70% of the data from the target patients. The model performance was calculated as the average over 5-fold cross-validation and evaluated using AUC, depicted through receiver operating characteristic (ROC) curves.
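To make the idea of stratified K-fold cross-validation concrete, the sketch below builds folds that preserve the class ratio using only the standard library. It is an illustrative re-implementation under simple assumptions (round-robin assignment after a seeded shuffle), not the routine used in the study.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_indices, validation_indices) pairs for k stratified folds.

    Indices of each class are shuffled and dealt round-robin into the folds,
    so every fold has approximately the same class proportions.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)

    for i in range(k):
        validation = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, validation


labels = [1] * 10 + [0] * 15
splits = list(stratified_k_fold(labels, k=5))
```

Each validation fold in this example contains two positive and three negative cases, mirroring the 10:15 ratio of the full label list.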

Three models were utilized to predict the attainment of target LDL-C levels: extreme gradient boosting (XGBoost)19, random forest (RF)20, and logistic regression (LR)21. Each model was trained using 129 features and optimized using GridSearch to identify optimal hyperparameters. For a comprehensive overview of the parameters utilized and the values selected for model tuning, see Supplementary Table S1. Additionally, confidence intervals (CI) were calculated using Bootstrap resampling22, and six evaluation metrics were assessed.
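A bootstrap confidence interval for the AUC can be sketched as follows. The percentile method, the number of resamples, and the function names are assumptions, as the text only states that bootstrap resampling was used.

```python
import random


def auc(y_true, scores):
    """AUC as the probability that a positive case outranks a negative one."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(y_true, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AUC (resampling cases with replacement)."""
    auc(y_true, scores)  # fails early if only one class is present
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        sample = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in sample]
        if len(set(ys)) < 2:
            continue  # skip resamples that contain a single class
        stats.append(auc(ys, [scores[i] for i in sample]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

The pairwise AUC computation is quadratic in the number of cases, which is acceptable for illustration but slow for a cohort of several thousand patients; a rank-based formula would be preferable there.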

Nevertheless, an excess of features may prove detrimental to the model’s efficacy. To achieve optimal performance, recursive feature elimination (RFE) was used to select an appropriate number of features23. RFE is a feature selection technique that iteratively removes the least important features with the objective of improving model performance. Determining the optimal number of features reduced the dimensionality of the final models.
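The elimination loop at the core of RFE can be sketched independently of any particular model. In this sketch the model is hidden behind two hypothetical callables: `fit_importances`, which refits a model on the remaining features and returns an importance per feature, and `evaluate`, which returns a validation score. How the study scored each subset is not described, so keeping the best-scoring subset is an assumption.

```python
def recursive_feature_elimination(features, fit_importances, evaluate, min_features=1):
    """Drop the least important feature repeatedly and keep the best subset.

    fit_importances(remaining) -> dict mapping feature name to importance,
        obtained by refitting the model on the remaining features.
    evaluate(remaining) -> validation score (higher is better).
    """
    remaining = list(features)
    best_subset, best_score = list(remaining), evaluate(remaining)
    while len(remaining) > min_features:
        importances = fit_importances(remaining)
        weakest = min(remaining, key=lambda name: importances[name])
        remaining.remove(weakest)
        score = evaluate(remaining)
        if score >= best_score:  # ties favor the smaller subset
            best_subset, best_score = list(remaining), score
    return best_subset, best_score
```

In practice, `fit_importances` would wrap the training of XGBoost, Random Forest, or LR and return gain-based importances or absolute coefficients, which is why each model can end up with a different number of retained features.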

Calibration curves and Brier scores were used to demonstrate the applicability of the final models. The calibration curves evaluated the agreement between predicted probabilities and actual outcomes for the three models, and the Brier scores were reported along with CIs.
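Both quantities have simple definitions, sketched here with the standard library. The equal-width binning and the default of ten bins are assumptions, since the binning used for the calibration curves is not stated.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome (0 or 1)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def calibration_curve(y_true, y_prob, n_bins=10):
    """Return (mean predicted probability, observed event rate) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[index].append((y, p))
    curve = []
    for members in bins:
        if members:
            mean_pred = sum(p for _, p in members) / len(members)
            observed = sum(y for y, _ in members) / len(members)
            curve.append((mean_pred, observed))
    return curve
```

As a point of reference, a model that always predicts a probability of 0.5 has a Brier score of exactly 0.25 regardless of the outcomes, so lower values indicate predictions that carry more information than a coin flip.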

Model interpretations

Shapley additive explanations (SHAP) were employed to assess the significance of each feature in the prediction process and to demonstrate the explainability of the models24. This method calculates and visualizes the importance score of each feature, providing insights into their contribution to the prediction. In contrast to common feature contribution methods, SHAP consistently attributes feature importance, presents intuitive results, and recovers influential features well25. Furthermore, SHAP waterfall plots were employed to illustrate individual patient data, thereby furnishing information on characteristics that exert a positive or negative influence on predictions.
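The Shapley value underlying SHAP can be illustrated by brute-force enumeration over feature subsets. This is a didactic sketch with a hypothetical `value_fn`, not the algorithm used in the study: SHAP implementations rely on model-specific methods (such as TreeSHAP for tree ensembles) because exact enumeration grows exponentially with the number of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, features):
    """Exact Shapley values for a set function.

    value_fn(frozenset_of_features) -> model output when only those
    features are known. Feasible only for a handful of features.
    """
    n = len(features)
    phi = {}
    for feature in features:
        others = [g for g in features if g != feature]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value_fn(s | {feature}) - value_fn(s))
        phi[feature] = total
    return phi
```

By construction, the values sum to the difference between the output with all features known and the output with none known, which is the additivity property that waterfall plots display.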

Statistical analysis

Baseline characteristics were compared using the t-test for continuous variables and the chi-square test for categorical variables. All statistical analyses were performed using R software version 4.2.3 (R Foundation for Statistical Computing, Vienna, Austria; www.r-project.org). A P value < 0.05 was considered to indicate statistical significance.

Results

Baseline characteristics

After applying exclusion criteria to a total of 53,369 patients hospitalized for coronary artery disease at the AMC from 2000 to 2020, a final cohort of 9,402 patients was formed. The patient cohort was divided into two groups: those with LDL-C levels below 70 mg/dL (n = 4,525, achieved LDL-C target group) and those who did not achieve the target LDL-C levels (n = 4,877, non-achieved LDL-C target group). Table 2 summarizes the demographic data, comorbidities, medication history, and laboratory test results of the patient cohort. The study subjects had an average age of 61.58 years (standard deviation 10.54), and 70.61% (6,639) were male.

Table 2 Baseline characteristics of the patient cohort.

ML model performance

The ML models predicted LDL-C goal achievement in patients prescribed moderate-dose statins. Each algorithm was trained using the same randomly split training set (70% of the total patients) and evaluated using the test set (30% of the total patients). The models were optimized using 5-fold cross-validation, and the performance was evaluated to three decimal places. The three models were evaluated across six performance metrics, which are summarized along with CIs for each model in Table 3. Statistically significant differences were observed in specificity, accuracy, and positive predictive value (PPV). Notably, XGBoost demonstrated specificity at 0.644 (95% CI: [0.628–0.661]), outperforming Random Forest at 0.286 (95% CI: [0.263–0.310]) and Logistic Regression at 0.293 (95% CI: [0.268–0.316]). XGBoost also achieved the highest accuracy at 0.659 (95% CI: [0.641–0.677]) and recorded a PPV of 0.642 (95% CI: [0.617–0.669]), both of which showed statistically significant differences compared to the other two models. The results of the three models, including CI and ROC curves for the test set, are summarized in Fig. 3. Additionally, the AUC and CI for patients receiving low- or high-dose statins, who were excluded from the main analysis, are summarized in Supplementary Table S1.
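The threshold-based metrics named above follow directly from the confusion matrix, as in this sketch. The function name and the inclusion of sensitivity and negative predictive value are assumptions; the text names only specificity, accuracy, PPV, and AUC among the six reported metrics.

```python
def classification_metrics(y_true, y_pred):
    """Confusion-matrix metrics for binary labels (1 = target achieved)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # achieved cases correctly predicted
        "specificity": ratio(tn, tn + fp),  # non-achieved cases correctly predicted
        "ppv": ratio(tp, tp + fp),          # predicted achievers who truly achieved
        "npv": ratio(tn, tn + fn),
        "accuracy": ratio(tp + tn, len(pairs)),
    }


metrics = classification_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
```

A low specificity, as reported for Random Forest and Logistic Regression, means that many patients who did not reach the target were predicted to reach it.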

Table 3 Comparison of the three ML models by metric.

Fig. 3

ROC curves of the three models. Figure 3 illustrates the ROC curves for the XGBoost, Random Forest, and Logistic Regression models. Abbreviations: LR, logistic regression; RF, random forest; ROC, receiver operating characteristic; XGBoost, extreme gradient boosting.

Feature reduction ML model

RFE was applied to all three models. While the original models were trained on 129 features, the RFE-applied models utilized a reduced set of features optimized for each model. Specifically, the XGBoost model was trained with 62 features, and the Random Forest and Logistic Regression models were each trained with 73 features. The final models were trained on 70% of the patient dataset using 5-fold cross-validation, and the remaining 30% of the data was used to evaluate performance. The performance metrics for each model are summarized in Table 4. Despite the reduction in features, all three models exhibited comparable or enhanced performance across most evaluation metrics.
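The relative size of each reduction follows from the feature counts above. The short calculation below uses only those counts; the variable names are illustrative.

```python
TOTAL_FEATURES = 129
RETAINED = {"XGBoost": 62, "Random Forest": 73, "Logistic Regression": 73}


def percent_reduction(total, kept):
    """Share of features removed, as a percentage of the original count."""
    return 100.0 * (total - kept) / total


reductions = {name: percent_reduction(TOTAL_FEATURES, kept) for name, kept in RETAINED.items()}
average_reduction = sum(reductions.values()) / len(reductions)
```

Rounded to whole percentages, this gives 52% for XGBoost, 43% for Random Forest and Logistic Regression, and an average of 46% across the three models.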

Table 4 Performance evaluation of three models with RFE applied.

ML model calibration

The calibration curves for the RFE-applied XGBoost, Random Forest, and Logistic Regression models are presented in Fig. 4. Each line represents the calibration curve for a specific model, with Brier scores and CIs shown in parentheses. Lower Brier scores indicate better alignment between predicted probabilities and actual outcomes. The results demonstrate that the XGBoost model has the lowest Brier score and exhibits consistent calibration. The XGBoost model achieved a Brier score of 0.218 (95% CI: [0.215–0.221]), outperforming the Random Forest model with a Brier score of 0.236 (95% CI: [0.235–0.238]) and the Logistic Regression model with a Brier score of 0.243 (95% CI: [0.239–0.246]).

Fig. 4

Calibration curves and Brier scores of the RFE-applied models. Figure 4 illustrates the calibration curves and Brier scores of RFE-applied models for predicting LDL-C goal achievement. Each line represents the calibration curve for a specific model, including XGBoost, Random Forest, and Logistic Regression. The figures in parentheses indicate the Brier scores of each model. A lower Brier score indicates a greater degree of agreement between the predicted probabilities and the actual outcomes. The perfect calibration line is shown as a dashed line. Predicted probabilities that are lower than the actual results fall below this line, while those that are higher fall above it. Abbreviations: LDL-C, low density lipoprotein cholesterol; RFE, recursive feature elimination; XGB, extreme gradient boosting.

Explainable ML model

The SHAP analysis results are primarily presented using the XGBoost model as an illustrative example to demonstrate the impact of features on model predictions. Comprehensive results for other models, including the RFE-applied Random Forest and RFE-applied Logistic Regression, are available in Supplementary Figure S2 and Supplementary Figure S3. Figure 5 illustrates the impact of each feature on model predictions using a SHAP summary plot. Each dot represents individual patient data, colored by patient characteristics. Density and color distribution analysis aids interpretation of model predictions. The model was trained using a total of 62 features, and the plot highlights the importance of the top 10 most significant features. The results indicate that patients with lower total cholesterol levels and those prescribed ezetimibe/rosuvastatin therapy are more likely to achieve target LDL-C levels. SHAP waterfall plots in Fig. 6 summarize the characteristics influencing each patient’s predicted outcomes and their contributions, using data from four randomly selected patients in the validation set. Red and blue bars indicate positive and negative impacts, respectively, on attainment of target LDL-C levels. Numbers on bars represent the characteristics’ contributions.

Fig. 5

SHAP values of the top 10 features in the RFE-applied XGBoost model. Figure 5 illustrates the features influencing predictions in the RFE-applied XGBoost model using SHAP values. “(L)” denotes laboratory tests, whereas “(M)” represents medications. Each point represents a data point, with the X-axis indicating the SHAP value and the Y-axis listing the features. Red points indicate high feature values, while blue points indicate low feature values. The SHAP value of each point shows how the actual value of a feature affected the model’s prediction. For example, a positive SHAP value indicates that the feature increased the predicted probability, while a negative SHAP value indicates that it decreased the predicted probability.

Fig. 6

Individual SHAP waterfall plots of feature contributions. Figure 6 presents SHAP waterfall plots showing that the characteristics affecting a patient’s predicted LDL-C goal attainment differ in their type and contribution. “(L)” denotes laboratory tests, whereas “(M)” represents medications. f(x) represents the individual model output for each patient, while E[f(x)] (−0.077) represents the average predicted value, which is the model output for the entire dataset. Patient 5a exhibited a total cholesterol level of 128 mg/dL and received ezetimibe/rosuvastatin therapy. Patient 5b had a total cholesterol level of 124 mg/dL. According to the model, these characteristics are associated with a positive impact on achieving the target LDL-C levels. In contrast, Patient 5c had a total cholesterol level of 213 mg/dL and did not receive rosuvastatin, whereas Patient 5d, with a total cholesterol level of 210 mg/dL, had taken rosuvastatin. Although the use of rosuvastatin in Patient 5d is associated with a positive impact on LDL-C goal attainment, the combined assessment of other clinical measures suggests that this patient is less likely to achieve the target LDL-C level. Abbreviations: DBP, diastolic blood pressure; HDL, high density lipoprotein; LDL-C, low density lipoprotein cholesterol; SBP, systolic blood pressure.
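The relationship between E[f(x)], the per-feature contributions, and f(x) in a waterfall plot can be sketched as below. The sketch assumes that the model output is on the log-odds scale, as is typical for gradient-boosted binary classifiers, although the caption does not state the scale. The contribution values are hypothetical and are not read from Fig. 6.

```python
from math import exp


def sigmoid(z):
    """Convert a log-odds value to a probability."""
    return 1.0 / (1.0 + exp(-z))


def waterfall_output(base_value, contributions):
    """f(x) = E[f(x)] + sum of the per-feature SHAP contributions."""
    return base_value + sum(contributions.values())


BASE_VALUE = -0.077  # E[f(x)] reported in the caption (assumed to be log-odds)

# Hypothetical contributions for one patient, for illustration only.
example_contributions = {
    "total cholesterol (L)": 0.60,
    "ezetimibe/rosuvastatin (M)": 0.35,
    "all other features": -0.10,
}

f_x = waterfall_output(BASE_VALUE, example_contributions)
probability = sigmoid(f_x)
baseline_probability = sigmoid(BASE_VALUE)
```

Under the log-odds assumption, the base value of −0.077 corresponds to a probability of about 0.48, which is close to the share of the cohort that achieved the target (4,525 of 9,402 patients).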

Discussion

To prevent cardiovascular disease, statins are a primary therapeutic approach for lowering LDL-C levels26. Moderate-dose statins are recommended as initial therapy due to their lower risk of side effects compared to high-dose statins27, and they are prescribed to the majority of high-risk cardiovascular patients in Korea28. Despite the high efficacy of statins in the Asian population, existing Korean studies indicate that LDL-C goal achievement rates among high-risk cardiovascular patients remain low29,30. These results suggest the need for tailored statin treatment regimens that comprehensively reflect patient characteristics. This study aims to develop a predictive model to identify CAD patients who can achieve LDL-C targets with moderate-dose statins, utilizing clinical data from the patients.

ML models are capable of effectively capturing the diversity and complexity of EMR data, thus making them applicable in various medical fields31,32. This study integrates clinical variables such as diagnosis, medication history, and laboratory test results from real-world patient data. During this process, interactions between specific diagnoses or concomitant medications and statin treatment may exist33,34. To effectively account for these interactions, ML models were employed to automatically model the complex clinical characteristics among various medications and specific diagnoses, incorporating these into the predictive outcomes.

ML models effectively capture complex relationships between outcomes and predictors that are often difficult to model using traditional statistical methods35,36. While conventional methods are valuable for elucidating variable relationships, ML models frequently achieve higher accuracy and better account for intricate interactions among clinical variables. Consequently, ML models have been widely applied in clinical prediction studies to enhance performance37,38. In this study, we conducted a comparative analysis of the predictive performance of traditional LR models and ML models. Although the three models demonstrated similar AUROC values, XGBoost exhibited superior performance in terms of specificity (0.644, 95% CI: [0.628–0.661]), accuracy (0.659, 95% CI: [0.641–0.677]), and PPV (0.642, 95% CI: [0.617–0.669]). By accounting for complex feature interactions, this approach facilitates comprehensive and accurate predictions that are challenging for clinicians to evaluate manually.

Feature selection was executed using the RFE process for the three ML models, all of which successfully preserved comparable AUROC values following feature selection. The XGBoost model achieved a 52% reduction in the number of features, while the Random Forest and Logistic Regression models achieved reductions of 43% each. This reduction efficiently removed non-essential variables, minimizing bias and enabling the integration of diverse clinical information. The reliability and clinical applicability of the RFE-applied models were evaluated using calibration curves and Brier scores, demonstrating their ability to mitigate overfitting and provide accurate probability estimates39,40. When comparing the three models, XGBoost achieved the lowest Brier score (0.218, 95% CI: [0.215–0.221]), outperforming Random Forest (0.236, 95% CI: [0.235–0.238]) and Logistic Regression (0.243, 95% CI: [0.239–0.246]). These findings suggest that, in comparison with the other two models, XGBoost provides more reliable and consistent predictions for achieving LDL-C goals in patients prescribed moderate-dose statins.

Previous studies have employed machine learning to achieve treatment goals for patients with high cardiovascular risk41. However, despite achieving high predictive accuracy, these models frequently lacked interpretability. To address the “black box” problem inherent in ML models, we employed SHAP to identify key variables influencing predictions. SHAP provides consistent insights into each feature’s contribution and offers visualizations of complex outcomes, improving the interpretability of ML models, especially when addressing intricate variable interactions42. In this study, SHAP analysis was applied to the RFE-applied XGBoost model, showing that factors such as medications and lipid profiles substantially impacted predictions. These findings for key factors, such as total cholesterol, align with previous statistical findings on LDL-C levels43,44, further supporting that the model accurately captures clinically relevant characteristics. Additionally, SHAP enabled the identification and visualization of features influencing individual predictions, enhancing clinicians’ understanding of the model’s mechanisms and supporting its potential use as a decision-making tool in clinical practice45,46.

By analyzing various patient characteristics to predict the achievement of target LDL-C levels with moderate-dose statin treatment, this study contributes to identifying patients who would benefit from moderate-dose statins, thereby mitigating the potential risks associated with high-dose statins. The integration of a wide range of clinical variables enables more accurate and evidence-based outcome predictions. Additionally, the identification of patients likely to achieve LDL-C targets with moderate-dose statin monotherapy can be utilized to determine the necessity for supplementary combination therapy. This study supports clinical decision-making by providing detailed explanations for individual patients, thereby enhancing patient safety and ensuring effective LDL-C management.

This study, based on a retrospective cohort design with an Asian population and single-center data, may introduce biases in baseline characteristics, as it does not account for variations in patient demographics, treatment adherence, and medical records from other hospitals. Additionally, given that LDL-C values were maintained as averages through discussions with specialists to reflect the characteristics of Seoul Asan Medical Center, this approach may limit the generalizability of the findings to other clinical settings.

The performance and Brier scores of the ML models underscore the potential for further improvement through the integration of advanced algorithms and expanded datasets. Enhancing algorithmic capabilities and incorporating multicenter data are essential for achieving more robust and generalizable research outcomes. Accordingly, we will continue to refine the models and methodologies to advance the quality and applicability of future studies.

Conclusion

In this study, ML-based models were employed to predict the likelihood of achieving LDL-C target levels in CAD patients treated with moderate-dose statins using EMR data from a tertiary hospital. Predictive experiments were conducted using models such as XGBoost, Random Forest, and Logistic Regression, achieving an AUROC of up to 0.709 despite an average feature reduction of 46%. The SHAP results confirmed the potential to improve the interpretability of ML-based clinical prediction models. These findings suggest the possibility of facilitating clinical decision-making and improving patient safety and treatment outcomes.