Predicting low density lipoprotein cholesterol target attainment using machine learning in patients with coronary artery disease receiving moderate-dose statin therapy

Han, Jiye; Kim, Yunha; Kang, Hee Jun; Seo, Jiahn; Choi, Heejung; Kim, Minkyoung; Kee, Gaeun; Park, Seohyun; Ko, Soyoung; Jung, HyoJe; Kim, Byeolhee; Jun, Tae Joon; Kim, Young-Hak

doi:10.1038/s41598-025-88693-y


Article
Open access
Published: 13 February 2025



Scientific Reports volume 15, Article number: 5346 (2025)


Abstract

Low-density lipoprotein cholesterol (LDL-C) is an important factor in the development of cardiovascular disease, making its management a key aspect of cardiovascular health. While high-dose statin therapy is often recommended for LDL-C reduction, careful consideration is needed due to patient-specific factors and potential side effects. This study aimed to develop a machine learning (ML) model to estimate the likelihood of achieving target LDL-C levels in patients hospitalized for coronary artery disease and treated with moderate-dose statins. Using electronic medical records from the Asan Medical Center in Seoul, the predictive performance of three ML models, Extreme Gradient Boosting (XGBoost), Random Forest, and Logistic Regression, was evaluated across six performance metrics. After the number of features was reduced by more than 43%, the three models still achieved an average AUROC of 0.695. SHAP analysis was conducted to identify key features influencing model predictions, providing insights into patient characteristics associated with achieving LDL-C targets. This study suggests that ML-based approaches may help identify patients likely to benefit from moderate-dose statins, potentially supporting personalized treatment strategies and clinical decision-making for LDL-C management.


Abbreviations.

CAD, coronary artery disease;

CAG, coronary angiogram;

CCTA, coronary computed tomography angiography;

EMRs, electronic medical records;

LDL-C, low density lipoprotein cholesterol;

ML, machine learning;

Low density lipoprotein cholesterol (LDL-C) levels, measured in the blood, play a crucial role in cardiovascular disease development¹. Lowering LDL-C levels is an effective strategy for reducing mortality risks and complications associated with these diseases^2,3.

To prevent the recurrence of cardiovascular events, clinicians often prescribe high-dose statins to high-risk patients, particularly those with a history of coronary artery disease (CAD)⁴. Multiple studies have highlighted the effectiveness of high-dose statin therapy in reducing LDL-C levels⁵. Nevertheless, some studies have revealed that high-dose statin usage may increase side effect risks, including diabetes and liver dysfunction, with such adverse effects being particularly pronounced in Asian populations^6,7. Notably, genetic factors enable substantial LDL-C reductions in Asians, even with lower doses of statins⁸. Tailoring statin doses to individual patient characteristics may therefore be important in clinical practice.

Efforts toward real-world clinical applicability have led to increased use of electronic medical records (EMRs) in research. EMRs digitally store a broad range of patient medical data, including diagnoses, medication prescriptions, and test outcomes. Machine learning (ML) techniques can effectively identify complex relationships within EMR datasets⁹. Prior studies demonstrated the efficacy of combining EMRs and ML across medical settings, including disease prediction, treatment response assessment, and clinical decision support^10,11.

The objective of this study is to develop an ML model to predict the likelihood of LDL-C target achievement in CAD patients, with a particular focus on Asian populations. By integrating a range of clinical variables that reflect patient characteristics, the model aims to provide insights into the factors influencing patient outcomes. The model is intended to support clinical decision-making regarding appropriate statin dosages, enhancing patient safety and offering a tool for optimizing LDL-C management.

Methods

Ethical approval

The Institutional Review Board (IRB) of Asan Medical Center (AMC) approved the study protocols (No. 2021-0303) in accordance with the Declaration of Helsinki (2008). This study utilized data from the Asan Biomedical Research Environment (ABLE), a de-identified electronic medical record (EMR) database maintained by AMC, a major tertiary hospital in South Korea¹². The ABLE database comprises anonymized information; as a result, the IRB waived the requirement for informed consent. All experiments were conducted in compliance with pertinent guidelines and regulations. Patient data, including diagnoses, laboratory test results, and reports, were extracted for patients with CAD admitted to AMC from January 1, 2000, to December 30, 2020.

Study population

The study included patients hospitalized for CAD at AMC from January 1, 2000, to December 31, 2020, who had recorded LDL-C measurements. CAD included myocardial infarction (MI) or unstable angina, stable angina, and asymptomatic CAD (ACAD); patients who underwent coronary revascularization were also included. For patients with multiple CAD-related hospitalizations, only the first hospitalization was analyzed.

The study included patients who had a second LDL-C test conducted 6 to 18 months after discharge. Patients with a history of statin or PCSK9 inhibitor use prior to hospitalization were excluded to avoid confounding treatment effects from concomitant medications. Additionally, patients without statin prescriptions during hospitalization were excluded. Patients who received high-dose or low-dose statins during hospitalization were also excluded, because individualized treatment was considered difficult to provide for these groups. Figure 1 outlines the inclusion and exclusion criteria for study subjects.

The classification of statin dose was established in accordance with the American College of Cardiology and American Heart Association (ACC/AHA) guidelines and through discussions with clinical experts, taking into account the characteristics of both the hospital and the patients¹³. A detailed summary of this classification is provided in Table 1. To prevent underestimation of statin usage, if multiple doses were recorded on the same day, the highest dose was selected.

Table 1 Statin dosage intensity groupings.


The study population was divided into two groups based on whether the LDL-C target levels were achieved at the time of the second LDL-C measurement following discharge. According to the ACC guidelines, the target LDL-C level for patients with high-risk cardiovascular disease is less than 70 mg/dL or a reduction of 50% or more¹⁴. However, this study is a retrospective analysis conducted at a single institution, and it did not account for LDL-C levels prior to presentation at AMC or the history of statin therapy. To avoid bias resulting from the exclusion of these variables, the requirement for a 50% reduction in LDL-C levels was omitted. Furthermore, based on the guidelines for the management of dyslipidemia in Korea, this study ultimately defined the target LDL-C level as less than 70 mg/dL¹⁵.

The LDL-C values were based on directly measured values; in the absence of such data, they were estimated using the Friedewald formula. The Friedewald formula is expressed as follows: LDL-C = [Total cholesterol - High-density lipoprotein cholesterol (HDL-C)] - (Triglycerides / 5)¹⁶. In instances where LDL-C was measured multiple times on the same day, the mean value was employed.
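The rules in this paragraph can be sketched in a few lines of Python. This is a minimal illustration rather than the study's pipeline: the function names and the example values are hypothetical, all concentrations are assumed to be in mg/dL, and the target is the 70 mg/dL threshold defined above.

```python
from statistics import mean
from typing import Optional, Sequence

LDL_TARGET_MG_DL = 70.0  # target used in the study: LDL-C < 70 mg/dL


def friedewald_ldl(total_chol: float, hdl: float, triglycerides: float) -> float:
    """Estimate LDL-C (mg/dL) as (total cholesterol - HDL-C) - triglycerides / 5."""
    return (total_chol - hdl) - triglycerides / 5.0


def ldl_for_day(direct: Sequence[float],
                total_chol: Optional[float] = None,
                hdl: Optional[float] = None,
                triglycerides: Optional[float] = None) -> Optional[float]:
    """Mean of same-day direct measurements; otherwise the Friedewald estimate."""
    if direct:
        return mean(direct)
    if None in (total_chol, hdl, triglycerides):
        return None  # no direct value and not enough data to estimate
    return friedewald_ldl(total_chol, hdl, triglycerides)


def achieved_target(ldl: float) -> bool:
    """True when LDL-C is below the 70 mg/dL target."""
    return ldl < LDL_TARGET_MG_DL


# Hypothetical example values, not patient data.
estimated = friedewald_ldl(160.0, 50.0, 100.0)   # (160 - 50) - 100 / 5
same_day = ldl_for_day([66.0, 70.0])             # two direct measurements on one day
fallback = ldl_for_day([], 160.0, 50.0, 100.0)   # no direct measurement available
```

The sketch does not apply any triglyceride cut-off for the Friedewald estimate, because the text does not state one.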

Study design

Patients hospitalized for their initial CAD event from January 2000 to December 2020 were analyzed. All patients received moderate-dose statins during hospitalization for CAD and had at least two LDL-C measurements: one during hospitalization for CAD and another 6 to 18 months after the first measurement. To ensure precise clinical data capture, relevant clinical data were extracted and preprocessed for model feature construction. Three ML models were developed to predict attainment of the 70 mg/dL target LDL-C level. The study design is summarized in Fig. 2.

Data processing

The dataset for model development and validation was obtained from the EMRs of AMC. Diagnosis codes were defined according to the International Classification of Diseases, 10th Revision (ICD-10). To capture all relevant subdiagnoses, three-character (category-level) diagnosis codes were used, reflecting the hierarchical structure of ICD-10 codes.
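As an illustration of grouping at the category level, the following sketch reduces full ICD-10 codes to their first three characters (one letter and two digits). The helper name and the example codes are hypothetical and are not taken from the study's code base.

```python
def icd10_category(code: str) -> str:
    """Return the three-character ICD-10 category, e.g. 'I21.4' -> 'I21'."""
    cleaned = code.strip().upper().replace(".", "")
    if len(cleaned) < 3:
        raise ValueError(f"not a full ICD-10 code: {code!r}")
    return cleaned[:3]


# Hypothetical example codes; subcodes collapse into their parent category.
codes = ["I21.4", "I21.9", "I20.0", "i251", "E11.9"]
categories = sorted({icd10_category(c) for c in codes})
```

With this grouping, I21.4 and I21.9 both contribute to a single I21 feature.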

The dataset was constructed based on the timing of CAD admission. The status of patients included diagnosis, prescriptions, gender, age, body mass index (BMI), systolic blood pressure (SBP), diastolic blood pressure (DBP), and CAD hospitalization duration. Prescription details reflected the total number of medication days prescribed, indicative of intake extent. Medical history included laboratory test results, smoking status, and whether coronary angiography and coronary CT angiography were performed. Laboratory test results represented median values within one year prior to CAD hospitalization. Diagnosis, prescription, and laboratory test data were defined as model features by selecting information commonly occurring within the patient cohort. Consequently, a dataset comprising 9,402 patients was compiled, including 26 current diagnosis characteristics, 19 past diagnosis characteristics, 30 prescription characteristics, 45 test result characteristics, and 9 baseline characteristics.

One-hot encoding was used to indicate patients’ conditions. For instance, a heart failure diagnosis was encoded as 1, while its absence was encoded as 0. Variables such as smoking status were coded based on records closest to the CAD admission date, with smokers coded as 1 and ex-smokers or nonsmokers coded as 0. Gender was coded as 1 for males and 0 for females. Categorical variables were transformed into numerical vectors for model use.

Numeric variables were preprocessed to fit model characteristics. Missing values were replaced by −1 to avoid confusion with actual measurement values coded as 0. For the logistic regression (LR) model, numeric variables were normalized to the range 0 to 1 to mitigate bias from variables with large numerical values¹⁷.
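A minimal sketch of these preprocessing steps follows. The function names and values are hypothetical. The text does not say how the −1 sentinel interacts with normalization, so the sketch assumes that scaling is applied after filling, which maps the sentinel to 0.

```python
from typing import List, Optional

MISSING = -1.0  # sentinel described in the text for missing numeric values


def encode_smoking(status: str) -> int:
    """Current smokers -> 1; ex-smokers and nonsmokers -> 0."""
    return 1 if status == "current" else 0


def fill_missing(values: List[Optional[float]]) -> List[float]:
    """Replace missing entries (None) with the -1 sentinel."""
    return [MISSING if v is None else float(v) for v in values]


def min_max_scale(values: List[float]) -> List[float]:
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column carries no information
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical systolic blood pressure column with one missing value.
sbp = [120.0, None, 140.0, 100.0]
filled = fill_missing(sbp)
scaled = min_max_scale(filled)  # assumption: scaling happens after filling
```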

Model development and evaluation

To prevent overfitting and improve generalization performance, all models underwent stratified K-fold cross-validation¹⁸. The training dataset comprised 70% of the data from the target patients. Model performance was calculated as the average over 5-fold cross-validation and evaluated using the area under the receiver operating characteristic (ROC) curve (AUC).
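The splitting scheme can be sketched with the standard library alone. The sketch assumes that the 5 folds are formed inside the 70% training portion and that the remaining 30% is held out; the function names, the seed, and the toy labels are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_split(labels: Sequence[int], train_frac: float,
                     seed: int) -> Tuple[List[int], List[int]]:
    """Split indices into train/test while keeping the class ratio in each part."""
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    train: List[int] = []
    test: List[int] = []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        cut = round(len(idxs) * train_frac)
        train.extend(idxs[:cut])
        test.extend(idxs[cut:])
    return sorted(train), sorted(test)


def stratified_folds(indices: Sequence[int], labels: Sequence[int],
                     k: int, seed: int) -> List[List[int]]:
    """Deal the indices of each class round-robin into k folds."""
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = defaultdict(list)
    for idx in indices:
        by_class[labels[idx]].append(idx)
    folds: List[List[int]] = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for pos, idx in enumerate(idxs):
            folds[pos % k].append(idx)
    return folds


# Toy labels: 48 patients who reached the target (1) and 52 who did not (0).
labels = [1] * 48 + [0] * 52
train_idx, test_idx = stratified_split(labels, train_frac=0.7, seed=0)
folds = stratified_folds(train_idx, labels, k=5, seed=0)
```

Each fold would serve once as the validation part while the other four are used for fitting, and the fold-level AUC values would then be averaged.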

Three models were utilized to predict the attainment of target LDL-C levels: extreme gradient boosting (XGBoost)¹⁹, random forest (RF)²⁰, and logistic regression (LR)²¹. Each model was trained using 129 features and optimized using grid search to identify optimal hyperparameters. For a comprehensive overview of the parameters utilized and the values selected for model tuning, please refer to Supplementary Table S1. Additionally, confidence intervals (CI) were calculated using bootstrap resampling²², and six evaluation metrics were assessed.
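Bootstrap resampling for a confidence interval can be illustrated as follows. The text does not specify which bootstrap variant was used, so this sketch assumes the simple percentile method; the number of resamples, the seed, and the toy predictions are likewise assumptions.

```python
import random
from typing import Callable, List, Sequence, Tuple


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def bootstrap_ci(y_true: Sequence[int], y_pred: Sequence[int],
                 metric: Callable[[Sequence[int], Sequence[int]], float],
                 n_boot: int = 1000, alpha: float = 0.05,
                 seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI: resample patients with replacement, recompute the metric."""
    rng = random.Random(seed)
    n = len(y_true)
    stats: List[float] = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lower = stats[round(alpha / 2 * n_boot)]
    upper = stats[round((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy data: 200 cases, every third prediction is wrong.
y_true = [i % 2 for i in range(200)]
y_pred = [t if i % 3 else 1 - t for i, t in enumerate(y_true)]
point = accuracy(y_true, y_pred)
ci_low, ci_high = bootstrap_ci(y_true, y_pred, accuracy)
```

The same function accepts any metric with the signature `metric(y_true, y_pred)`, so it could wrap specificity or PPV in the same way.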

Nevertheless, an excess of features may be detrimental to a model’s efficacy. To achieve optimal performance, recursive feature elimination (RFE) was undertaken to select an appropriate number of features²³. RFE is a feature selection technique that iteratively removes less important features with the objective of improving model performance. By determining the optimal number of features, the dimensionality of the final model was reduced.

Calibration curves and Brier scores were used to demonstrate the applicability of the final models. The calibration curves evaluated the agreement between predicted probabilities and actual outcomes for the three models, and the Brier scores were reported along with CIs.
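For readers unfamiliar with these two tools, the sketch below computes a Brier score (the mean squared difference between predicted probability and outcome) and the points of a calibration curve from equal-width probability bins. The bin count and the example values are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and observed outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def calibration_bins(y_true: Sequence[int], y_prob: Sequence[float],
                     n_bins: int = 5) -> List[Tuple[float, float, int]]:
    """Return (mean predicted probability, observed event rate, count) per non-empty bin."""
    sums = [[0.0, 0.0, 0] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls into the last bin
        sums[b][0] += p
        sums[b][1] += y
        sums[b][2] += 1
    return [(s[0] / s[2], s[1] / s[2], s[2]) for s in sums if s[2]]


# Hypothetical predictions for four patients.
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.1, 0.6, 0.4]
score = brier_score(y_true, y_prob)
curve = calibration_bins(y_true, y_prob)

# Reference point: always predicting 0.5 gives a Brier score of exactly 0.25.
uninformative = brier_score(y_true, [0.5] * len(y_true))
```

A well-calibrated model produces bins whose observed event rate is close to the mean predicted probability, which places the curve near the diagonal.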

Model interpretations

Shapley additive explanations (SHAP) were employed to assess the significance of each feature in the prediction process and to demonstrate the explainability of the models²⁴. This method calculates and visualizes the importance score of each feature, providing insights into their contribution to the prediction. In contrast to common feature contribution methods, SHAP attributes feature importance consistently, presents intuitive results, and recovers influential features well²⁵. Furthermore, SHAP waterfall plots were employed to illustrate individual patient data, showing which characteristics exert a positive or negative influence on predictions.
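The idea behind these attributions can be shown with an exact, brute-force Shapley computation on a toy model. This is not the SHAP library and not the study's model: the scoring function, its coefficients, the baseline, and the patient values are all hypothetical, and absent features are simply replaced by baseline values.

```python
from itertools import combinations
from math import factorial
from typing import Callable, List, Sequence, Set


def shapley_values(f: Callable[[Sequence[float]], float],
                   x: Sequence[float],
                   baseline: Sequence[float]) -> List[float]:
    """Exact Shapley values of f at x relative to a baseline input."""
    n = len(x)

    def value(subset: Set[int]) -> float:
        # Features in `subset` take the patient's values; the rest stay at baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return f(z)

    phis: List[float] = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi += weight * (value(s | {i}) - value(s))
        phis.append(phi)
    return phis


def toy_score(z: Sequence[float]) -> float:
    """Hypothetical score: lower total cholesterol and ezetimibe use raise it."""
    total_chol, ezetimibe = z
    return 2.0 - 0.01 * total_chol + 0.5 * ezetimibe


patient = [150.0, 1.0]    # total cholesterol (mg/dL), ezetimibe prescribed (0/1)
reference = [190.0, 0.0]  # hypothetical baseline patient
phi = shapley_values(toy_score, patient, reference)
```

The contributions add up to the difference between the patient's score and the baseline score, which is the additive property that a waterfall plot visualizes. Exact enumeration grows exponentially with the number of features, so it is practical only for small examples like this one.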

Statistical analysis

Baseline characteristics were compared using the t-test for continuous variables and the chi-square test for categorical variables. All statistical analyses were performed using R software version 4.2.3 (R Foundation for Statistical Computing, Vienna, Austria; www.r-project.org). A P value < 0.05 was considered to indicate statistical significance.

Results

Baseline characteristics

After applying exclusion criteria to a total of 53,369 patients hospitalized for coronary artery disease at the AMC from 2000 to 2020, a final cohort of 9,402 patients was formed. The patient cohort was divided into two groups: those with LDL-C levels below 70 mg/dL (n = 4,525, achieved LDL-C target group) and those who did not achieve the target LDL-C levels (n = 4,877, non-achieved LDL-C target group). Table 2 summarizes the demographic data, comorbidities, medication history, and laboratory test results of the patient cohort. The study subjects had an average age of 61.58 years (standard deviation 10.54), and 70.61% (6,639) were male.

Table 2 Baseline characteristics of the patient cohort.


ML model performance

The ML models predicted LDL-C goal achievement in patients prescribed moderate-dose statins. Each algorithm was trained using the same randomly split training set (70% of the total patients) and evaluated using the test set (30% of the total patients). The models were optimized using 5-fold cross-validation, and performance was reported to three decimal places. The three models were evaluated across six performance metrics, which are summarized along with CIs for each model in Table 3. Statistically significant differences were observed in specificity, accuracy, and positive predictive value (PPV). Notably, XGBoost demonstrated a specificity of 0.644 (95% CI: [0.628–0.661]), outperforming Random Forest at 0.286 (95% CI: [0.263–0.310]) and Logistic Regression at 0.293 (95% CI: [0.268–0.316]). XGBoost also achieved the highest accuracy at 0.659 (95% CI: [0.641–0.677]) and recorded a PPV of 0.642 (95% CI: [0.617–0.669]), both of which showed statistically significant differences compared to the other two models. The results of the three models, including CI and ROC curves for the test set, are summarized in Fig. 3. Additionally, the AUC and CI for patients receiving low-dose or high-dose statins, who were excluded from the main analysis, are summarized in Supplementary Table S1.
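The metrics named in this paragraph can be computed from predicted scores as sketched below. The text names AUC, specificity, accuracy, and PPV; sensitivity and NPV are included here as an assumption about the remaining metrics in Table 3. The 0.5 decision threshold and the example scores are also assumptions.

```python
from typing import Dict, Sequence


def threshold_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Confusion-matrix metrics for binary labels (1 = LDL-C target achieved)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(y_true),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


def auroc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive case scores higher than a random negative one."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical scores for eight patients.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.7, 0.55]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5
metrics = threshold_metrics(y_true, y_pred)
auc = auroc(y_true, scores)
```

Because specificity, accuracy, and PPV depend on the chosen threshold while AUC does not, models with similar AUC values can still differ markedly on the threshold-based metrics.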

Table 3 Comparison of the three ML models by metric.


Feature reduction ML model

RFE was applied to all three models. While the original models were trained on 129 features, the RFE-applied models utilized a reduced set of features optimized for each model. Specifically, the XGBoost model was trained with 62 features, and the Random Forest and Logistic Regression models were each trained with 73 features. The final models were trained on 70% of the patient dataset using 5-fold cross-validation, and the remaining 30% of the data was used to evaluate performance. The performance metrics for each model are summarized in Table 4. Despite the reduction in features, all three models exhibited comparable or enhanced performance across most evaluation metrics.

Table 4 Performance evaluation of three models with RFE applied.


ML model calibration

The calibration curves for the RFE-applied XGBoost, Random Forest, and Logistic Regression models are presented in Fig. 4. Each line represents the calibration curve for a specific model, with Brier scores and CIs shown in parentheses. Lower Brier scores indicate a better alignment between predicted probabilities and actual outcomes. The results demonstrate that the XGBoost model has the lowest Brier score and exhibits consistent calibration. The XGBoost model achieved a Brier score of 0.218 (95% CI: [0.215–0.221]), outperforming the Random Forest model with a Brier score of 0.236 (95% CI: [0.235–0.238]) and the Logistic Regression model with a Brier score of 0.243 (95% CI: [0.239–0.246]).

Explainable ML model

The SHAP analysis results are primarily presented using the XGBoost model as an illustrative example to demonstrate the impact of features on model predictions. Comprehensive results for other models, including the RFE-applied Random Forest and RFE-applied Logistic Regression, are available in Supplementary Figure S2 and Supplementary Figure S3. Figure 5 illustrates the impact of each feature on model predictions using a SHAP summary plot. Each dot represents individual patient data, colored by patient characteristics. Density and color distribution analysis aids interpretation of model predictions. The model was trained using a total of 62 features, and the plot highlights the importance of the top 10 most significant features. The results indicate that patients with lower total cholesterol levels and those prescribed ezetimibe/rosuvastatin therapy are more likely to achieve target LDL-C levels. SHAP waterfall plots in Fig. 6 summarize the characteristics influencing each patient’s predicted outcomes and their contributions, using data from four randomly selected patients in the validation set. Red and blue bars indicate positive and negative impacts, respectively, on attainment of target LDL-C levels. Numbers on bars represent the characteristics’ contributions.

Discussion

To prevent cardiovascular disease, statins are a primary therapeutic approach for lowering LDL-C levels²⁶. Moderate-dose statins are recommended as initial therapy due to their lower risk of side effects compared to high-dose statins²⁷, and they are prescribed to the majority of high-risk cardiovascular patients in Korea²⁸. Despite the high efficacy of statins in the Asian population, existing Korean studies indicate that LDL-C goal achievement rates among high-risk cardiovascular patients remain low^29,30. These results suggest the need for tailored statin treatment regimens that comprehensively reflect patient characteristics. This study aims to develop a predictive model to identify CAD patients who can achieve LDL-C targets with moderate-dose statins, utilizing clinical data from the patients.

ML models are capable of effectively capturing the diversity and complexity of EMR data, thus making them applicable in various medical fields^31,32. This study integrates clinical variables such as diagnosis, medication history, and laboratory test results from real-world patient data. During this process, interactions between specific diagnoses or concomitant medications and statin treatment may exist^33,34. To effectively account for these interactions, ML models were employed to automatically model the complex clinical characteristics among various medications and specific diagnoses, incorporating these into the predictive outcomes.

ML models effectively capture complex relationships between outcomes and predictors that are often difficult to model using traditional statistical methods^35,36. While conventional methods are valuable for elucidating variable relationships, ML models frequently achieve higher accuracy and better account for intricate interactions among clinical variables. Consequently, ML models have been widely applied in clinical prediction studies to enhance performance^37,38. In this study, we conducted a comparative analysis of the predictive performance of traditional LR models and ML models. Although the three models demonstrated similar AUROC values, XGBoost exhibited superior performance in terms of specificity (0.644, 95% CI: [0.628–0.661]), accuracy (0.659, 95% CI: [0.641–0.677]), and PPV (0.642, 95% CI: [0.617–0.669]). By accounting for complex feature interactions, this approach facilitates comprehensive and accurate predictions that are challenging for clinicians to evaluate manually.

Feature selection was executed using the RFE process for the three ML models, all of which successfully preserved comparable AUROC values following feature selection. The XGBoost model achieved a 52% reduction in the number of features, while the Random Forest and Logistic Regression models achieved reductions of 43% each. This reduction efficiently removed non-essential variables, minimizing bias and enabling the integration of diverse clinical information. The reliability and clinical applicability of the RFE-applied models were evaluated using calibration curves and Brier scores, demonstrating their ability to mitigate overfitting and provide accurate probability estimates^39,40. When comparing the three models, XGBoost achieved the lowest Brier score (0.218, 95% CI: [0.215–0.221]), outperforming Random Forest (0.236, 95% CI: [0.235–0.238]) and Logistic Regression (0.243, 95% CI: [0.239–0.246]). These findings suggest that, in comparison with the other two models, XGBoost provides more reliable and consistent predictions for achieving LDL-C goals in patients prescribed moderate-dose statins.
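The reduction percentages follow directly from the feature counts reported in the Results (129 original features; 62 retained for XGBoost and 73 each for Random Forest and Logistic Regression). A short check, with variable names chosen only for illustration:

```python
ORIGINAL_FEATURES = 129  # features before RFE, as reported in the Methods

# Features retained after RFE, as reported in the Results.
retained = {"XGBoost": 62, "Random Forest": 73, "Logistic Regression": 73}

# Percentage of features removed for each model.
reduction_pct = {
    name: 100 * (ORIGINAL_FEATURES - kept) / ORIGINAL_FEATURES
    for name, kept in retained.items()
}

# Mean reduction across the three models.
average_pct = sum(reduction_pct.values()) / len(reduction_pct)

for name, pct in reduction_pct.items():
    print(f"{name}: {pct:.1f}% fewer features")
print(f"Average: {average_pct:.1f}%")
```

Rounded to whole percentages, these values correspond to the 52% and 43% reductions stated here and to the 46% average reduction cited in the Conclusion.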

Previous studies have employed machine learning to predict the achievement of treatment goals in patients with high cardiovascular risk⁴¹. However, despite achieving high predictive accuracy, these models frequently lacked interpretability. To address the “black box” problem inherent in ML models, we employed SHAP to identify key variables influencing predictions. SHAP provides consistent insights into each feature’s contribution and offers visualizations of complex outcomes, improving the interpretability of ML models, especially when addressing intricate variable interactions⁴². In this study, SHAP analysis was applied to the RFE-applied XGBoost model, showing that factors such as medications and lipid profiles substantially impacted predictions. The findings for key factors, such as total cholesterol, align with previous statistical findings on LDL-C levels^43,44, further supporting that the model captures clinically relevant characteristics. Additionally, SHAP enabled the identification and visualization of features influencing individual predictions, helping clinicians understand the model’s mechanisms and supporting its potential use as a decision-making tool in clinical practice^45,46.

By analyzing various patient characteristics to predict the achievement of target LDL-C levels with moderate-dose statin treatment, this study helps identify patients who would benefit from moderate-dose statins, thereby mitigating the potential risks associated with high-dose statins. The integration of a wide range of clinical variables enables more accurate and evidence-based outcome predictions. Additionally, the identification of patients likely to achieve LDL-C targets with moderate-dose statin monotherapy can be used to determine whether supplementary combination therapy is necessary. This study supports clinical decision-making by providing detailed explanations for individual patients, thereby enhancing patient safety and supporting effective LDL-C management.

This study was based on a retrospective cohort design with an Asian population and single-center data, which may introduce biases in baseline characteristics, as it does not account for variations in patient demographics, treatment adherence, and medical records from other hospitals. Additionally, LDL-C values measured multiple times on the same day were averaged, an approach chosen through discussions with specialists to reflect the characteristics of Asan Medical Center in Seoul; this may limit the generalizability of the findings to other clinical settings.

The performance and Brier scores of the ML models indicate room for further improvement through the integration of advanced algorithms and expanded datasets. Enhancing algorithmic capabilities and incorporating multicenter data will be essential for achieving more robust and generalizable results. Accordingly, we will continue to refine the models and methodologies to improve the quality and applicability of future studies.

Conclusion

In this study, ML-based models were employed to predict the likelihood of achieving LDL-C target levels in CAD patients treated with moderate-dose statins using EMR data from a tertiary hospital. Predictive experiments were conducted using models such as XGBoost, Random Forest, and Logistic Regression, achieving an AUROC of up to 0.709 despite an average feature reduction of 46%. The SHAP results confirmed the potential to improve the interpretability of ML-based clinical prediction models. These findings suggest the possibility of facilitating clinical decision-making and improving patient safety and treatment outcomes.

Data availability

The data that support the findings of this study are available from the corresponding author on reasonable request owing to ethical concerns and confidentiality agreements.

References

Ridker, P. M. LDL cholesterol: controversies and future therapeutic directions. Lancet 384, 607–617 (2014).
Article PubMed Google Scholar
Nissen, S. E. et al. Statin therapy, LDL cholesterol, C-reactive protein, and coronary artery disease. N Engl. J. Med. 352, 29–38 (2005).
Article PubMed MATH Google Scholar
Heart Protection Study Collaborative Group. MRC/BHF Heart Protection study of cholesterol lowering with simvastatin in 20,536 high-risk individuals: a randomised placebo-controlled trial. Lancet 360, 7–22 (2002).
Article Google Scholar
Stone, N. J. et al. 2013 ACC/AHA guideline on the treatment of blood cholesterol to reduce atherosclerotic cardiovascular risk in adults: a report of the American College of Cardiology/American Heart Association Task Force on Practice guidelines. Circulation 129 (Suppl. 2), S1–S45 (2014).
PubMed MATH Google Scholar
Jones, P. H., McKenney, J. M., Karalis, D. G. & Downey, J. NASDAC investigators. Comparison of the efficacy and safety of atorvastatin initiated at different starting doses in patients with dyslipidemia. Am. Heart J. 149, e1–e8 (2005).
Article PubMed Google Scholar
Ko, M. J. et al. Time-and dose‐dependent association of statin use with risk of clinically relevant new‐onset diabetes mellitus in primary prevention: a nationwide observational cohort study. J. Am. Heart Association. 8, e011320 (2019).
Article Google Scholar
Silva, M. et al. Meta-analysis of drug-induced adverse events associated with intensive-dose statin therapy. Clin. Ther. 29, 253–260 (2007).
Article PubMed MATH Google Scholar
Lee, J. H. et al. Effects of ezetimibe/simvastatin 10/20 mg vs. atorvastatin 20 mg on apolipoprotein B/apolipoprotein A1 in Korean patients with type 2 diabetes mellitus: results of a randomized controlled trial. Am. J. Cardiovasc. Drugs. 13, 343–351 (2013).
Article PubMed Google Scholar
Segura-Bedmar, I., Colon-Ruiz, C., Tejedor-Alonso, M. Á. & Moro-Moro, M. Predicting of anaphylaxis in big data EMR by exploring machine learning approaches. J. Biomed. Inform. 87, 50–59 (2018).
Article PubMed Google Scholar
Gultepe, E. et al. From vital signs to clinical outcomes for patients with sepsis: a machine learning basis for a clinical decision support system. J. Am. Med. Inform. Assoc. 21, 315–325 (2014).
Article PubMed MATH Google Scholar
Ridgway, J. P., Lee, A., Devlin, S., Kerman, J. & Mayampurath, A. Machine learning and clinical informatics for improving HIV care continuum outcomes. Curr. HIV/AIDS. Rep. 18, 229–236 (2021).
Article PubMed PubMed Central Google Scholar
Shin, S. et al. Lessons learned from development of de-identification system for biomedical research in a Korean Tertiary Hospital. Healthc. Inf. Res. 19, 102–109 (2013).
Article ADS MATH Google Scholar
Stone, N. J. et al. Treatment of blood cholesterol to reduce atherosclerotic cardiovascular disease risk in adults: synopsis of the 2013 American College of Cardiology/American Heart Association cholesterol guideline. Ann. Intern. Med. 160, 339–343 (2014).
Article PubMed MATH Google Scholar
Lloyd-Jones, D. M. et al. 2022 ACC expert consensus decision pathway on the role of nonstatin therapies for LDL-cholesterol lowering in the management of atherosclerotic cardiovascular disease risk: a report of the American College of Cardiology Solution Set Oversight Committee. J. Am. Coll. Cardiol. 80, 1366–1418 (2022).
Article PubMed Google Scholar
Rhee, E. J. et al. 2018 guidelines for the management of dyslipidemia in Korea. J. Lipid Atherosclerosis. 8, 78 (2019).
Article MATH Google Scholar
Friedewald, W. T., Levy, R. I. & Fredrickson, D. S. Estimation of the concentration of low-density lipoprotein cholesterol in plasma, without use of the preparative ultracentrifuge. Clin. Chem. 18, 499–502 (1972).
Article PubMed MATH Google Scholar
Singh, D. & Singh, B. Investigating the impact of data normalization on classification performance. Appl. Soft Comput. 97, 105524 (2020).
Article MATH Google Scholar
Kohavi, R. A study of cross-validation and bootstrap for accuracy estimation and model selection. Proc. 14th Int. Joint Conf. Artif. Intell. 2, 1137–1145 (1995).
MATH Google Scholar
Chen, T. & Guestrin, C. XGBoost: a scalable tree boosting system. In Proc. of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 785–794, ACM, (2016).
Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
Article MATH Google Scholar
Hosmer, D. W. & Lemeshow, S. Applied Logistic Regression 2nd edn (John Wiley and Sons, Inc., 2000).
Book MATH Google Scholar
Carpenter, J. & Bithell, J. Bootstrap confidence intervals: when, which, what? A practical guide for medical statisticians. Stat. Med. 19, 1141–1164 (2000).
Article PubMed MATH Google Scholar
Guyon, I., Weston, J., Barnhill, S. & Vapnik, V. Gene selection for cancer classification using support vector machines. Mach. Learn. 46, 389–422 (2002).
Article MATH Google Scholar
Barredo Arrieta, A. et al. Explainable Artificial Intelligence (XAI): concepts, taxonomies, opportunities and challenges toward responsible. AI Inf. Fusion. 58, 82–115 (2020).
Article Google Scholar
Lundberg, S. M., Erion, G. G. & Lee, S. I. Consistent individualized feature attribution for tree ensembles. Preprint at (2018). https://arxiv.org/abs/1802.03888v3
Morofuji, Y. et al. Beyond lipid-lowering: effects of statins on cardiovascular and cerebrovascular diseases and cancer. Pharmaceuticals 15, 151 (2022).
Article PubMed PubMed Central Google Scholar
Arnold, M. J., O’Malley, P. G. & Downs, J. R. Key recommendations on managing dyslipidemia for cardiovascular risk reduction: stopping where the evidence does. Am. Family Phys. 103, 455–458 (2021).
Park, M. W. et al. Moderate-intensity versus high-intensity statin therapy in Korean patients with angina undergoing percutaneous coronary intervention with drug-eluting stents: a propensity-score matching analysis. PLoS One. 13, e0207889 (2018).
Yang, Y. S., Yang, B. R., Kim, M. S., Hwang, Y. & Choi, S. H. Low-density lipoprotein cholesterol goal attainment rates in high-risk patients with cardiovascular diseases and diabetes mellitus in Korea: a retrospective cohort study. Lipids Health Dis. 19, 1–13 (2020).
Kwon, O. et al. Cardiovascular event rates in statin-treated Korean patients with cardiovascular disease: estimates from a real-world population using electronic medical record data. Cardiovasc. Drugs Ther. 35, 1–12 (2021).
Rajula, H. S. R., Verlato, G., Manchia, M., Antonucci, N. & Fanos, V. Comparison of conventional statistical methods with machine learning in medicine: diagnosis, drug development, and treatment. Medicina 56, 455 (2020).
Yamashita, T. et al. Machine learning for classification of postoperative patient status using standardized medical data. Comput. Methods Programs Biomed. 214, 106583 (2022).
Koh, K. K., Han, S. H., Oh, P. C., Shin, E. K. & Quon, M. J. Combination therapy for treatment or prevention of atherosclerosis: focus on the lipid-RAAS interaction. Atherosclerosis 209, 307–313 (2010).
Liao, Y. et al. Lipid metabolism patterns and relevant clinical and molecular features of coronary artery disease patients: an integrated bioinformatic analysis. Lipids Health Dis. 21, 87 (2022).
Han, K. et al. A review of approaches for predicting drug–drug interactions based on machine learning. Front. Pharmacol. 12, 814858 (2022).
Subramani, S. et al. Cardiovascular diseases prediction by machine learning incorporation with deep learning. Front. Med. 10, 1150933 (2023).
Belsti, Y. et al. Comparison of machine learning and conventional logistic regression-based prediction models for gestational diabetes in an ethnically diverse population; the Monash GDM Machine learning model. Int. J. Med. Informatics. 179, 105228 (2023).
Churpek, M. M. et al. Multicenter comparison of machine learning methods and conventional regression for predicting clinical deterioration on the wards. Crit. Care Med. 44, 368–374 (2016).
Huang, Y., Li, W., Macheret, F., Gabriel, R. A. & Ohno-Machado, L. A tutorial on calibration measurements and calibration models for clinical prediction models. J. Am. Med. Inf. Assoc. 27, 621–633 (2020).
Licher, S. et al. External validation of four dementia prediction models for use in the general community-dwelling population: a comparative analysis from the Rotterdam Study. Eur. J. Epidemiol. 33, 645–655 (2018).
Krentz, A. J., Haddon-Hill, G., Zou, X., Pankova, N. & Jaun, A. Machine learning applied to cholesterol-lowering pharmacotherapy: proof-of-concept in high-risk patients treated in primary care. Metab. Syndr. Relat. Disord. 21, 453–459 (2023).
Wang, K. et al. Interpretable prediction of 3-year all-cause mortality in patients with heart failure caused by coronary heart disease based on machine learning and SHAP. Comput. Biol. Med. 137, 104813 (2021).
PP, A. et al. Machine learning predictive models of LDL-C in the population of eastern India and its comparison with directly measured and calculated LDL-C. Ann. Clin. Biochem. 59, 76–86 (2022).
Yu, M., Liang, C., Kong, Q., Wang, Y. & Li, M. Efficacy of combination therapy with ezetimibe and statins versus a double dose of statin monotherapy in participants with hypercholesterolemia: a meta-analysis of literature. Lipids Health Dis. 19, 1–7 (2020).
Zhao, P. et al. An explainable machine-learning model to analyze the effects of a PCSK9 inhibitor on thrombolysis in STEMI patients. J. Med. Biol. Eng. 43, 339–349 (2023).
Zhao, P. et al. Using machine learning to predict the in-hospital mortality in women with ST-segment elevation myocardial infarction. Rev. Cardiovasc. Med. 24, 126 (2023).

Acknowledgements

This work was supported by the Korea Medical Device Development Fund grant funded by the Korea government (the Ministry of Science and ICT, the Ministry of Trade, Industry and Energy, the Ministry of Health & Welfare, the Ministry of Food and Drug Safety) (Project Number: 1711195603, RS-2020-KD000097, 50%). Additional support was provided by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number: HR20C0026, 50%).

Author information

Tae Joon Jun PhD and Young-Hak Kim MD, PhD contributed equally to this work.

Authors and Affiliations

Department of Information Medicine, Asan Medical Center, 88, Olympic-ro 43-gil, Songpa-gu, Seoul, 05505, Republic of Korea
Jiye Han, Hee Jun Kang, Jiahn Seo, Heejung Choi, Minkyoung Kim, Seohyun Park, Soyoung Ko & HyoJe Jung
Department of Medical Science, Asan Medical Center, Asan Medical Institute of Convergence Science and Technology, University of Ulsan College of Medicine, Olympic-ro 43-gil, Songpa-gu, Seoul, 05505, Republic of Korea
Yunha Kim, Gaeun Kee & Byeolhee Kim
Big Data Research Center, Asan Institute for Life Sciences, Asan Medical Center, 88, Olympic-ro 43-gil, Songpa-gu, Seoul, 05505, Republic of Korea
Tae Joon Jun
Division of Cardiology, Department of Internal Medicine, Asan Medical Center, University of Ulsan College of Medicine, 88, Olympic-ro 43-gil, Songpa-gu, Seoul, 05505, Republic of Korea
Young-Hak Kim

Contributions

J.H. designed the study, extracted and analyzed the data, and wrote the manuscript. T.J. and Y.-H.K. supervised the study and revised the manuscript. Y.K., H.K., J.S., H.C., M.K., J.H., G.K., S.P., S.K., H.J., and B.K. reviewed the manuscript. All authors read and approved the final version of the manuscript before submission.

Corresponding author

Correspondence to Tae Joon Jun.

Ethics declarations

Competing interests

The authors declare no competing interests.

Abbreviations

RFE, recursive feature elimination; XGBoost, extreme gradient boosting.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary Material 1

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Han, J., Kim, Y., Kang, H.J. et al. Predicting low density lipoprotein cholesterol target attainment using machine learning in patients with coronary artery disease receiving moderate-dose statin therapy. Sci Rep 15, 5346 (2025). https://doi.org/10.1038/s41598-025-88693-y

Received: 04 June 2024
Accepted: 30 January 2025
Published: 13 February 2025
DOI: https://doi.org/10.1038/s41598-025-88693-y
