Abstract
Low-density lipoprotein cholesterol (LDL-C) is an important factor in the development of cardiovascular disease, making its management a key aspect of cardiovascular health. While high-dose statin therapy is often recommended for LDL-C reduction, careful consideration is needed due to patient-specific factors and potential side effects. This study aimed to develop a machine learning (ML) model to estimate the likelihood of achieving target LDL-C levels in patients hospitalized for coronary artery disease and treated with moderate-dose statins. The predictive performance of three ML models, including Extreme Gradient Boosting (XGBoost), Random Forest, and Logistic Regression, was evaluated using electronic medical records from the Asan Medical Center in Seoul across six performance metrics. Additionally, all three models achieved an average AUROC of 0.695 despite reducing features by over 43%. SHAP analysis was conducted to identify key features influencing model predictions, aiming insights into patient characteristics associated with achieving LDL-C targets. This study suggests that ML-based approaches may help identify patients likely to benefit from moderate-dose statins, potentially supporting personalized treatment strategies and clinical decision-making for LDL-C management.
Similar content being viewed by others
Abbreviations.
CAD, coronary artery disease;
CAG, coronary angiogram;
CCTA, coronary computed tomography angiography;
EMRs, electronic medical records;
LDL-C, low density lipoprotein cholesterol;
ML, machine learning;
Low density lipoprotein cholesterol (LDL-C) levels, measured in the blood, play a crucial role in cardiovascular disease development1. Lowering LDL-C levels is an effective strategy for reducing mortality risks and complications associated with these diseases2,3.
To prevent the recurrence of cardiovascular events, clinicians often prescribe high-dose statins to high-risk patients, particularly those with a history of coronary artery disease (CAD)4. Multiple studies have highlighted the effectiveness of high-dose statin therapy in reducing LDL-C levels5. Nevertheless, some studies have revealed that high-dose statin usage may increase side effect risks, including diabetes and liver dysfunction, with such adverse effects being particularly pronounced in Asian populations6,7. Notably, genetic factors enable substantial LDL-C reductions in Asians, even with lower doses of statins8. Tailoring statin doses to individual patient characteristics might be imperative in clinical practice.
Efforts toward achieving real-world clinical applicability have led to increased electronic medical record (EMRs) utilization in research. EMRs store a broad range of patient medical data digitally, including diagnoses, medication prescriptions, and test outcomes. Machine learning (ML) techniques can effectively identify complex relationships within EMR datasets9. Prior studies demonstrated the efficacy of combining EMRs and ML across medical settings, including disease prediction, treatment response assessment, and clinical decision support10,11.
The objective of this study is to develop a ML model to predict the likelihood of LDL-C target achievement in CAD patients, with a particular focus on Asian populations. By integrating a range of clinical variables that reflect patient characteristics, the model aims to provide insights into the factors influencing patient outcomes. Through this endeavor, the model will support clinical decision-making regarding appropriate statin dosages, enhancing patient safety and offering a tool for optimizing LDL-C management.
Methods
Ethical approval
The Institutional Review Board (IRB) of Asan Medical Center (AMC) approved the study protocols (No. 2021 − 0303) in accordance with the Declaration of Helsinki (2008). This study utilized data from the Asan Biomedical Research Environment (ABLE), a de-identified electronic medical record (EMR) database maintained by AMC, a major tertiary hospital in South Korea12. The ABLE database is comprised of anonymized information, and as a result, the study was exempt from the requirement for informed consent by the IRB. All experiments were conducted in compliance with pertinent guidelines and regulations. Patient data, including diagnoses, laboratory test results, and reports, were extracted for patients with CAD admitted to AMC from January 1, 2000, to December 30, 2020.
Study population
The study included patients hospitalized for CAD at AMC from January 1, 2000, to December 31, 2020, who had recorded LDL-C measurements. CAD included myocardial infarction (MI) or unstable angina, stable angina, and asymptomatic CAD (ACAD), with patients who underwent coronary revascularization also included. Only first hospitalizations were analyzed for patients with multiple CAD related hospitalizations.
The study included patients who had a second LDL-C test conducted 6 to 18 months after discharge. Patients with a history of statin or PCSK9 inhibitor use prior to hospitalization were excluded to avoid confounding treatment effects from concomitant medications. Additionally, patients without statin prescriptions during hospitalization were excluded. Patients who were on high-dose statins or low-dose statins during hospitalization, which made it challenging to provide individualized treatment, were also excluded. Figure 1 outlines the inclusion and exclusion criteria for study subjects.
Flow chart of study population selection criteria. Figure 1 illustrates the process of selecting the study participants. Initially, LDL-C measurements were taken during hospitalization for patients with CAD. Subsequent measurements were recorded 6–18 months post discharge. After applying exclusion criteria, a total of 9,402 patients were included in the final cohort. Abbreviations: CAD, coronary artery disease; LDL-C, low density lipoprotein cholesterol; PCSK9, proprotein convertase subtilisin/kexin type 9;.
The classification of statin dose was established in accordance with the American College of Cardiology and American Heart Association (ACC/AHA) guidelines and through discussions with clinical experts, taking into account the characteristics of both the hospital and the patients13. A detailed summary of this classification is provided in Table 1. To prevent underestimation of statin usage, if multiple doses were recorded on the same day, the highest dose was selected.
The study population was divided into two groups based on whether the LDL-C target levels were achieved at the time of the second LDL-C measurement following discharge. According to the ACC guidelines, the target LDL-C level for patients with high-risk cardiovascular disease is less than 70 mg/dL or a reduction of 50% or more14. However, this study is a retrospective analysis conducted at a single institution, and it did not account for LDL-C levels prior to presentation at AMC or the history of statin therapy. To avoid bias resulting from the exclusion of these variables, the requirement for a 50% reduction in LDL-C levels was omitted. Furthermore, based on the guidelines for the management of dyslipidemia in Korea, this study ultimately defined the target LDL-C level as less than 70 mg/dL15.
The LDL-C values were based on directly measured values; in the absence of such data, they were estimated using the Friedewald formula. The Friedewald formula is expressed as follows: LDL-C = [Total cholesterol - High-density lipoprotein cholesterol (HDL-C)] - (Triglycerides / 5)16. In instances where LDL-C was measured multiple times on the same day, the mean value was employed.
Study design
Patients hospitalized for their initial CAD event from January 2000 to December 2020 were analyzed. All patients received moderate-dose statins during hospitalization for CAD and had at least two LDL-C measurements: one during hospitalization for CAD and another 6 to 18 months after the first measurement. To ensure precise clinical data capture, relevant clinical data were extracted and preprocessed for model feature construction. Three ML models were developed to predict attainment of the 70 mg/dL target LDL-C level. The study design is summarized in Fig. 2.
Overview of the study. Figure 2 outlines the study process: (1) We extracted diagnoses, medication data, and lab test results from the EMRs of patients hospitalized with CAD who were prescribed moderate-dose of statins. (2) We performed data preprocessing, which included one-hot encoding and categorization, to prepare the dataset. (3) We used a ML model to predict whether the patients’ second LDL-C values reached the target levels.
Data processing
The dataset for model development and validation was obtained from the EMRs of AMC. Diagnosis codes were defined according to International Classification of Diseases (ICD-10) standards. To reflect all relevant sub diagnoses, three-digit diagnosis codes were used, considering the hierarchical structure of ICD-10 codes.
The dataset was constructed based on the timing of CAD admission. The status of patients included diagnosis, prescriptions, gender, age, body mass index (BMI), systolic blood pressure (SBP), diastolic blood pressure (DBP), and CAD hospitalization duration. Prescription details reflected the total number of medication days prescribed, indicative of intake extent. Medical history included laboratory test results, smoking status, and whether coronary angiography and coronary CT angiography were performed. Laboratory test results represented median values within one year prior to CAD hospitalization. Diagnosis, prescription, and laboratory test data were defined as model features by selecting information commonly occurring within the patient cohort. Consequently, a dataset comprising 9,402 patients was compiled, including 26 current diagnosis characteristics, 19 past diagnosis characteristics, 30 prescription characteristics, 45 test result characteristics, and 9 baseline characteristics.
One-hot encoding was used to indicate patients’ conditions. For instance, a heart failure diagnosis was encoded as 1, while its absence was encoded as 0. Variables such as smoking status were coded based on records closest to the CAD admission date, with smokers coded as 1 and ex-smokers or nonsmokers coded as 0. Gender was coded as 1 for males and 0 for females. Categorical variables were transformed into numerical vectors for model use.
Numeric variables were preprocessed to fit model characteristics. Missing values were replaced by − 1 to avoid confusion with actual measurement values coded as 0. For the logistic regression (LR) model, numerical range normalization between 0 and 1 was performed to mitigate bias from substantial numerical values17.
Model development and evaluation
To prevent overfitting and improve general performance, all models underwent stratified K-fold cross-validation18. The training dataset comprised 70% of the data from the target patients. The model performance was calculated as the average over 5-fold cross-validation and evaluated using AUC, depicted through receiver operating characteristic (ROC) curves.
Three models were utilized to predict the attainment of target LDL-C levels: extreme gradient boosting (XGBoost)19, random forest (RF)20, and logistic regression (LR)21. Each model was trained using 129 features and optimized using GridSearch to identify optimal hyperparameters. For a comprehensive overview of the parameters utilized and the values selected for model tuning, please refer to Supplementary Table S1.Additionally, confidence intervals (CI) were calculated using Bootstrap resampling22, and six evaluation metrics were assessed.
Nevertheless, an excess of features may prove detrimental to the model’s efficacy. To achieve optimal performance, the recursive feature elimination (RFE) process is undertaken to select an appropriate number of features23. RFE is a feature selection technique that employs an iterative process of removing less important features with the objective of improving model performance. By determining the optimal number of features, the dimensionality of the final model is reduced.
Calibration curves and Brier scores were used to demonstrate the applicability of the final models. The calibration curves evaluated the agreement between predicted probabilities and actual outcomes for the three models, the Brier scores were reported along with CI.
Model interpretations
Shapley additive explanations (SHAP) were employed to comprehend the significance of each feature in the prediction process and to demonstrate explainability of model24. This method calculates and visualizes the importance score of each feature, providing insights into their contribution to the prediction. In contrast to common feature contribution methods, SHAP consistently attributes feature importance, presents intuitive results, and restores influential features well25. Furthermore, SHAP waterfall plots were employed to illustrate individual patient data, thereby furnishing information on characteristics that exert a positive or negative influence on predictions.
Statistical analysis
For comparisons of baseline characteristics, the chi-square test and the t-test were employed. Continuous variables were compared by using the T-test. The Chi-square tests were used for categorical variables. All statistical analysis was performed using R software version 4.2.3 (R Foundation for Statistical Computing, Vienna, Austria; www.r-project.org). P value < 0.05 was considered to indicate statistical significance.
Results
Baseline characteristics.
After applying exclusion criteria to a total of 53,369 patients hospitalized for coronary artery disease at the AMC from 2000 to 2020, a final cohort of 9,402 patients was formed. The patient cohort was divided into two groups: those with LDL-C levels below 70 mg/dL (n = 4,525, achieved LDL-C target group) and those who did not achieve the target LDL-C levels (n = 4,877, non-achieved LDL-C target group). Table 2 summarizes the demographic data, comorbidities, medication history, and laboratory test results of the patient cohort. The study subjects had an average age of 61.58 years (standard deviation 10.54), and 70.61% (6,639) were male.
ML model performance
This model predicted LDL-C goal achievement in patients prescribed moderate-dose statins using ML. Each algorithm was trained using the same randomly split training set (70% of the total patients) and evaluated using the test set (30% of the total patients). The models were optimized using 5-fold cross-validation, and the performance was evaluated to three decimal places. The three models were evaluated across six performance metrics, which are summarized along with CIs for each model in Table 3. Statistically significant differences were observed in specificity, accuracy, and PPV (Positive Predictive Value). Notably, XGBoost demonstrated specificity at 0.644 (95% CI: [0.628–0.661]), outperforming Random Forest at 0.286 (95% CI: [0.263–0.310]) and Logistic Regression at 0.293 (95% CI: [0.268–0.316]). XGBoost also achieved the highest accuracy at 0.659 (95% CI: [0.641–0.677]) and recorded a PPV of 0.642 (95% CI: [0.617–0.669]), both of which showed statistically significant differences compared to the other two models. The results of the three models, including CI and ROC curves for the test set, are summarized in Fig. 3. Additionally, the AUC and CI for patients receiving low or high statins, who were excluded from the main analysis, are summarized in Supplementary Table S1.
ROC curves of three model performance. Figure 3 illustrates the ROC curves for the XGBoost, Random Forest and Logistic Regression. Abbreviations: LR, logistic regression; RF, random forest; ROC, receiver operating characteristic; XGBoost, extreme gradient boosting;.
Feature reduction ML model
RFE was applied to all three models. While the original models were trained on 129 features, the RFE-applied models utilized a reduced set of features optimized for each model. Specifically, the XGBoost model was trained with 62 features, the Random Forest model and Logistic Regression with 73 features. The final model was trained on 70% of the patient dataset using 5-fold cross-validation, and the remaining 30% of the data was used to evaluate the performance. The performance metrics for each model are summarized in Table 4. Despite a reduction in features, all three models exhibited comparable or enhanced performance across most evaluation metrics.
ML model calibration
The calibration curve for the RFE-applied XGBoost, Random Forest, and Logistic Regression models are presented in Fig. 4. Each line represents the calibration curve for a specific model, with brier scores and CI shown in parentheses. Lower brier scores indicate a better alignment between predicted probabilities and actual outcomes. The results demonstrate that the XGBoost model has the lowest Brier score and exhibits consistent calibration. The XGBoost model achieved a brier score of 0.218 (95% CI: [0.215–0.221]), outperforming the Random Forest model with a brier score of 0.236 (95% CI: [0.235–0.238]) and the Logistic Regression model with a brier score of 0.243 (95% CI: [0.239–0.246]).
Calibration curves and brier scores of the RFE-applied models. Figure 4 illustrates the calibration curves and brier scores of RFE-applied models for predicting LDL-C goal achievement. Each line represents the calibration curve for a specific model, including XGBoost, Random Forest, and Logistic regression. The figures in parentheses indicate the brier scores of each model. A lower Brier score indicates a greater degree of agreement between the predicted probabilities and the actual outcomes. The perfect calibration line is shown as a dashed line. Predicted probabilities that are lower than the actual results fall below this line, while those that are higher fall above it. Abbreviations: LDL-C, low density lipoprotein cholesterol; RFE, the recursive feature elimination; XGB, extreme gradient boosting;.
Explainable ML model
The SHAP analysis results are primarily presented using the XGBoost model as an illustrative example to demonstrate the impact of features on model predictions. Comprehensive results for other models, including the RFE-applied Random Forest and RFE-applied Logistic Regression, are available in Supplementary Figure S2 and Supplementary Figure S3. Figure 5 illustrates the impact of each feature on model predictions using a SHAP summary plot. Each dot represents individual patient data, colored by patient characteristics. Density and color distribution analysis aids interpretation of model predictions. The model was trained using a total of 62 features, and the plot highlights the importance of the top 10 most significant features. The results indicate that patients with lower total cholesterol levels and those prescribed ezetimibe/rosuvastatin therapy are more likely to achieve target LDL-C levels. SHAP waterfall plots in Fig. 6 summarize the characteristics influencing each patient’s predicted outcomes and their contributions, using data from four randomly selected patients in the validation set. Red and blue bars indicate positive and negative impacts, respectively, on attainment of target LDL-C levels. Numbers on bars represent the characteristics’ contributions.
SHAP values of the top 10 features in the RFE-applied XGBoost model. Figure 5 illustrates the features influencing predictions in the RFE-applied XGBoost model using SHAP values. “(L)” denotes laboratory tests, whereas “(M)” represents medications. Each point represents a data point, with the X-axis indicating SHAP value magnitude and the Y-axis showing the feature values. Red points indicate high feature values, while blue points indicate low feature values. The SHAP value of each point shows how the actual value of a feature affected the model’s prediction. For example, a positive SHAP value indicates that the feature increased the predicted probability, while a negative SHAP value indicates that it decreased the predicted probability.
Individual SHAP waterfall plots of feature contributions. Figure 6 presents SHAP waterfall plots confirming that the characteristics affecting a patient’s predicted LDL-C value differ in their type and contribution. “(L)” denotes laboratory tests, whereas “(M)” represents medications. f(x) represents the individual model output for each patient, while E[f(x)] (−0.077) represent the average predicted value, which is the model output for the entire dataset. Patient 5a exhibited a total cholesterol level of 128 mg/dL and received ezetimibe/rosuvastatin therapy. Patient 5b had a total cholesterol level of 124 mg/dL. According to the model, these characteristics are associated with a positive impact on achieving the target LDL-C levels. In contrast, Patient 5c had a total cholesterol level of 213 mg/dL and did not receive rosuvastatin, whereas Patient 5d, with a total cholesterol level of 210 mg/dL, had taken rosuvastatin. Although the use of rosuvastatin in Patient 5d is associated with a positive impact on LDL-C goal attainment, the combined assessment of other clinical measures suggests that this patient is less likely to achieve the target LDL-C level. Abbreviations: DBP, diastolic blood pressure; HDL, high density lipoprotein; LDL-C, low density lipoprotein cholesterol; SBP, systolic blood pressure;.
Discussion
To prevent cardiovascular disease, statins are a primary therapeutic approach for lowering LDL-C levels26. Moderate-dose statins are recommended as initial therapy due to their lower risk of side effects compared to high-dose statins27, and they are prescribed to the majority of high-risk cardiovascular patients in Korea28. Despite the high efficacy of statins in the Asian population, existing Korean studies indicate that LDL-C goal achievement rates among high-risk cardiovascular patients remain low29,30. These results suggest the need for tailored statin treatment regimens that comprehensively reflect patient characteristics. This study aims to develop a predictive model to identify CAD patients who can achieve LDL-C targets with moderate-dose statins, utilizing clinical data from the patients.
ML models are capable of effectively capturing the diversity and complexity of EMR data, thus making them applicable in various medical fields31,32. This study integrates clinical variables such as diagnosis, medication history, and laboratory test results from real-world patient data. During this process, interactions between specific diagnoses or concomitant medications and statin treatment may exist33,34. To effectively account for these interactions, ML models were employed to automatically model the complex clinical characteristics among various medications and specific diagnoses, incorporating these into the predictive outcomes.
ML models effectively capture complex relationships between outcomes and predictors that are often difficult to model using traditional statistical methods35,36. While conventional methods are valuable for elucidating variable relationships, ML models frequently achieve higher accuracy and better account for intricate interactions among clinical variables. Consequently, ML models have been widely applied in clinical prediction studies to enhance performance37,38. In this study, we conducted a comparative analysis of the predictive performance of traditional LR models and ML models. Although the three models demonstrated similar AUROC values, XGBoost exhibited superior performance in terms of specificity (0.644, 95% CI: [0.628–0.661]) and accuracy (0.659, 95% CI: [0.641–0.677]), and PPV (0.642, 95% CI: [0.617–0.669]). By accounting for complex feature interactions, this approach facilitates comprehensive and accurate predictions that are challenging for clinicians to evaluate manually.
Feature selection was executed using the RFE process for the three ML models, all of which successfully preserved comparable AUROC values following feature selection. The XGBoost model achieved a 52% reduction in the number of features, while the Random Forest and Logistic Regression models achieved reductions of 43% each. This reduction efficiently removed non-essential variables, minimizing bias and enabling the integration of diverse clinical information. The reliability and clinical applicability of the RFE-applied models were evaluated using calibration curves and Brier scores, demonstrating their ability to mitigate overfitting and provide accurate probability estimates39,40. When comparing the three models, XGBoost achieved the lowest Brier score (0.218, 95% CI: [0.215–0.221]), outperforming Random Forest (0.236, 95% CI: [0.235–0.238]) and Logistic Regression (0.243, 95% CI: [0.239–0.246]). These findings suggest that, in comparison with the other two models, XGBoost provides more reliable and consistent predictions for achieving LDL-C goals in patients prescribed moderate-dose statins.
Previous studies have employed machine learning to achieve treatment goals for patients with high cardiovascular risk41. However, despite achieving high predictive accuracy, these models frequently lacked interpretability. To address the “black box” problem inherent in ML models, we employed SHAP to identify key variables influencing predictions. SHAP provides consistent insights into each feature’s contribution and offers visualizations of complex outcomes, improving the interpretability of ML models, especially when addressing intricate variable interactions42. In this study, SHAP analysis was applied to the RFE-applied XGBoost model, showing that factors such as medications and lipid profiles substantially impacted predictions. These findings for key factors, such as total cholesterol, align with previous statistical findings on LDL-C levels43,44, further supporting that the model accurately captures clinically relevant characteristics. Additionally, SHAP enabled the identification and visualization of features influencing individual predictions, enhancing clinicians understand the model’s mechanisms and supporting its potential use as a decision-making tool in clinical practice45,46.
By analyzing various patient characteristics to predict the achievement of target LDL-C levels with moderate-dose statin treatment, this study contributes to identify patients who would benefit from moderate-dose statins, thereby mitigating the potential risks associated with high-dose statins. The integration of a wide range of clinical variables enables more accurate and evidence-based outcome predictions. Additionally, the identification of patients likely to achieve LDL-C targets with moderate-dose statin monotherapy can be utilized to determine the necessity for supplementary combination therapy. This study supports clinical decision-making by providing detailed explanations for individual patients, thereby enhancing patient safety and ensuring effective LDL-C management.
This study, based on a retrospective cohort design with an Asian population and single-center data, may introduce biases in baseline characteristics, as it does not account for variations in patient demographics, treatment adherence, and medical records from other hospitals. Additionally, given that LDL-C values were maintained as averages through discussions with specialists to reflect the characteristics of Seoul Asan Medical Center, this approach may limit the generalizability of the findings to other clinical settings.
The performance and Brier scores of the ML models underscore the potential for further improvement through the integration of advanced algorithms and expanded datasets. It is emphasized that enhancing algorithmic capabilities and incorporating multicenter data are essential for achieving more robust and generalizable research outcomes. Accordingly, efforts will be made to continuously refine the models and methodologies to advance the quality and applicability of future studies.
Conclusion
In this study, ML-based models were employed to predict the likelihood of achieving LDL-C target levels in CAD patients treated with moderate-dose statins using EMR data from a tertiary hospital. Predictive experiments were conducted using models such as XGBoost, Random Forest, and Logistic Regression, achieving an AUROC of up to 0.709 despite an average feature reduction of 46%. The SHAP results confirmed the potential to improve the interpretability of ML-based clinical prediction models. These findings suggest the possibility of facilitating clinical decision-making and improving patient safety and treatment outcomes.
Data availability
The data that support the findings of this study are available from the corresponding author on reasonable request owing to ethical concerns and confidentiality agreements.
References
Ridker, P. M. LDL cholesterol: controversies and future therapeutic directions. Lancet 384, 607–617 (2014).
Nissen, S. E. et al. Statin therapy, LDL cholesterol, C-reactive protein, and coronary artery disease. N Engl. J. Med. 352, 29–38 (2005).
Heart Protection Study Collaborative Group. MRC/BHF Heart Protection study of cholesterol lowering with simvastatin in 20,536 high-risk individuals: a randomised placebo-controlled trial. Lancet 360, 7–22 (2002).
Stone, N. J. et al. 2013 ACC/AHA guideline on the treatment of blood cholesterol to reduce atherosclerotic cardiovascular risk in adults: a report of the American College of Cardiology/American Heart Association Task Force on Practice guidelines. Circulation 129 (Suppl. 2), S1–S45 (2014).
Jones, P. H., McKenney, J. M., Karalis, D. G. & Downey, J. NASDAC investigators. Comparison of the efficacy and safety of atorvastatin initiated at different starting doses in patients with dyslipidemia. Am. Heart J. 149, e1–e8 (2005).
Ko, M. J. et al. Time-and dose‐dependent association of statin use with risk of clinically relevant new‐onset diabetes mellitus in primary prevention: a nationwide observational cohort study. J. Am. Heart Association. 8, e011320 (2019).
Silva, M. et al. Meta-analysis of drug-induced adverse events associated with intensive-dose statin therapy. Clin. Ther. 29, 253–260 (2007).
Lee, J. H. et al. Effects of ezetimibe/simvastatin 10/20 mg vs. atorvastatin 20 mg on apolipoprotein B/apolipoprotein A1 in Korean patients with type 2 diabetes mellitus: results of a randomized controlled trial. Am. J. Cardiovasc. Drugs. 13, 343–351 (2013).
Segura-Bedmar, I., Colon-Ruiz, C., Tejedor-Alonso, M. Á. & Moro-Moro, M. Predicting of anaphylaxis in big data EMR by exploring machine learning approaches. J. Biomed. Inform. 87, 50–59 (2018).
Gultepe, E. et al. From vital signs to clinical outcomes for patients with sepsis: a machine learning basis for a clinical decision support system. J. Am. Med. Inform. Assoc. 21, 315–325 (2014).
Ridgway, J. P., Lee, A., Devlin, S., Kerman, J. & Mayampurath, A. Machine learning and clinical informatics for improving HIV care continuum outcomes. Curr. HIV/AIDS. Rep. 18, 229–236 (2021).
Shin, S. et al. Lessons learned from development of de-identification system for biomedical research in a Korean Tertiary Hospital. Healthc. Inf. Res. 19, 102–109 (2013).
Stone, N. J. et al. Treatment of blood cholesterol to reduce atherosclerotic cardiovascular disease risk in adults: synopsis of the 2013 American College of Cardiology/American Heart Association cholesterol guideline. Ann. Intern. Med. 160, 339–343 (2014).
Lloyd-Jones, D. M. et al. 2022 ACC expert consensus decision pathway on the role of nonstatin therapies for LDL-cholesterol lowering in the management of atherosclerotic cardiovascular disease risk: a report of the American College of Cardiology Solution Set Oversight Committee. J. Am. Coll. Cardiol. 80, 1366–1418 (2022).
Rhee, E. J. et al. 2018 guidelines for the management of dyslipidemia in Korea. J. Lipid Atherosclerosis. 8, 78 (2019).
Friedewald, W. T., Levy, R. I. & Fredrickson, D. S. Estimation of the concentration of low-density lipoprotein cholesterol in plasma, without use of the preparative ultracentrifuge. Clin. Chem. 18, 499–502 (1972).
Singh, D. & Singh, B. Investigating the impact of data normalization on classification performance. Appl. Soft Comput. 97, 105524 (2020).
Kohavi, R. A study of cross-validation and bootstrap for accuracy estimation and model selection. Proc. 14th Int. Joint Conf. Artif. Intell. 2, 1137–1145 (1995).
Chen, T. & Guestrin, C. XGBoost: a scalable tree boosting system. In Proc. of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 785–794, ACM, (2016).
Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
Hosmer, D. W. & Lemeshow, S. Applied Logistic Regression 2nd edn (John Wiley and Sons, Inc., 2000).
Carpenter, J. & Bithell, J. Bootstrap confidence intervals: when, which, what? A practical guide for medical statisticians. Stat. Med. 19, 1141–1164 (2000).
Guyon, I., Weston, J., Barnhill, S. & Vapnik, V. Gene selection for cancer classification using support vector machines. Mach. Learn. 46, 389–422 (2002).
Barredo Arrieta, A. et al. Explainable Artificial Intelligence (XAI): concepts, taxonomies, opportunities and challenges toward responsible. AI Inf. Fusion. 58, 82–115 (2020).
Lundberg, S. M., Erion, G. G. & Lee, S. I. Consistent individualized feature attribution for tree ensembles. Preprint at (2018). https://arxiv.org/abs/1802.03888v3
Morofuji, Y. et al. Beyond lipid-lowering: effects of statins on cardiovascular and cerebrovascular diseases and cancer. Pharmaceuticals 15, 151 (2022).
Arnold, M. J., O’Malley, P. G. & Downs, J. R. Key recommendations on managing dyslipidemia for cardiovascular risk reduction: stopping where the evidence does. Am. Family Phys. 103, 455–458 (2021).
Park, M. W. et al. Moderate-intensity versus high-intensity statin therapy in Korean patients with angina undergoing percutaneous coronary intervention with drug-eluting stents: a propensity-score matching analysis. PLoS One. 13, e0207889 (2018).
Yang, Y. S., Yang, B. R., Kim, M. S., Hwang, Y. & Choi, S. H. Low-density lipoprotein cholesterol goal attainment rates in high-risk patients with cardiovascular diseases and diabetes mellitus in Korea: a retrospective cohort study. Lipids Health Dis. 19, 1–13 (2020).
Kwon, O. et al. Cardiovascular event rates in statin-treated Korean patients with cardiovascular disease: estimates from a real-world population using electronic medical record data. Cardiovasc. Drugs Ther. 35, 1–12 (2021).
Rajula, H. S. R., Verlato, G., Manchia, M., Antonucci, N. & Fanos, V. Comparison of conventional statistical methods with machine learning in medicine: diagnosis, drug development, and treatment. Medicina 56, 455 (2020).
Yamashita, T. et al. Machine learning for classification of postoperative patient status using standardized medical data. Comput. Methods Programs Biomed. 214, 106583 (2022).
Koh, K. K., Han, S. H., Oh, P. C., Shin, E. K. & Quon, M. J. Combination therapy for treatment or prevention of atherosclerosis: focus on the lipid-RAAS interaction. Atherosclerosis 209, 307–313 (2010).
Liao, Y. et al. Lipid metabolism patterns and relevant clinical and molecular features of coronary artery disease patients: an integrated bioinformatic analysis. Lipids Health Dis. 21, 87 (2022).
Han, K. et al. A review of approaches for predicting drug–drug interactions based on machine learning. Front. Pharmacol. 12, 814858 (2022).
Subramani, S. et al. Cardiovascular diseases prediction by machine learning incorporation with deep learning. Front. Med. 10, 1150933 (2023).
Belsti, Y. et al. Comparison of machine learning and conventional logistic regression-based prediction models for gestational diabetes in an ethnically diverse population; the Monash GDM Machine learning model. Int. J. Med. Informatics. 179, 105228 (2023).
Churpek, M. M. et al. Multicenter comparison of machine learning methods and conventional regression for predicting clinical deterioration on the wards. Crit. Care Med. 44, 368–374 (2016).
Huang, Y., Li, W., Macheret, F. & Gabriel, R. A. Ohno-Machado, L. A tutorial on calibration measurements and calibration models for clinical prediction models. J. Am. Med. Inf. Assoc. 27, 621–633 (2020).
Licher, S. et al. External validation of four dementia prediction models for use in the general community-dwelling population: a comparative analysis from the Rotterdam Study. Eur. J. Epidemiol. 33, 645–655 (2018).
Krentz, A. J., Haddon-Hill, G., Zou, X., Pankova, N. & Jaun, A. Machine learning applied to cholesterol-lowering pharmacotherapy: proof-of-concept in high-risk patients treated in primary care. Metab. Syndr. Relat. Disord. 21, 453–459 (2023).
Wang, K. et al. Interpretable prediction of 3-year all-cause mortality in patients with heart failure caused by coronary heart disease based on machine learning and SHAP. Comput. Biol. Med. 137, 104813 (2021).
PP, A. et al. Machine learning predictive models of LDL-C in the population of eastern India and its comparison with directly measured and calculated LDL-C. Ann. Clin. Biochem. 59, 76–86 (2022).
Yu, M., Liang, C., Kong, Q., Wang, Y. & Li, M. Efficacy of combination therapy with ezetimibe and statins versus a double dose of statin monotherapy in participants with hypercholesterolemia: a meta-analysis of literature. Lipids Health Dis. 19, 1–7 (2020).
Zhao, P. et al. An explainable machine-learning model to analyze the effects of a PCSK9 inhibitor on thrombolysis in STEMI patients. J. Med. Biol. Eng. 43, 339–349 (2023).
Zhao, P. et al. Using machine learning to predict the in-hospital mortality in women with ST-segment elevation myocardial infarction. Rev. Cardiovasc. Med. 24, 126 (2023).
Acknowledgements
This work was supported by the Korea Medical Device Development Fund grant funded by the Korea government (the Ministry of Science and ICT, the Ministry of Trade, Industry and Energy, the Ministry of Health & Welfare, the Ministry of Food and Drug Safety) (Project Number: 1711195603, RS-2020-KD000097, 50%). Additional support was provided by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number: HR20C0026, 50%).
Author information
Authors and Affiliations
Contributions
J.H designed the study and extracted and analyzed the data and wrote the manuscript. T.J. and Y.-H.K. supervised the study and revised this manuscript. Y.K., H.K., J.S., H.C., M.K., J.H., G.K., S.P., S.K., H.J., B.K. reviewed this manuscript. All authors read and approved the final version of the manuscript before submission.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Abbreviations
RFE, the recursive feature elimination; XGBoost, extreme gradient boosting;
Tables:
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Han, J., Kim, Y., Kang, H.J. et al. Predicting low density lipoprotein cholesterol target attainment using machine learning in patients with coronary artery disease receiving moderate-dose statin therapy. Sci Rep 15, 5346 (2025). https://doi.org/10.1038/s41598-025-88693-y
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41598-025-88693-y
Keywords
This article is cited by
-
Leveraging BERT for embedding ICD codes from large scale cardiovascular EMR data to understand patient diagnostic patterns
BMC Medical Informatics and Decision Making (2025)