Introduction

Cardiovascular diseases (CVDs) represent chronic conditions that impose considerable financial and non-financial burdens on both patients and governments1. A principal concern in these patients is the occurrence of further cardiovascular events following acute myocardial infarction (AMI) or stroke; in one study, the rate of these events was as high as 23.3%2. Furthermore, the healthcare costs associated with a secondary event were found to be twice those of the initial event3, underscoring the importance of assessing these patients' risk of a post-MI cardiovascular event. For this purpose, tools have been developed that use clinical features at the time of AMI to identify high-risk patients. The Global Registry of Acute Coronary Events (GRACE) risk score is the most widely accepted of these4.

Obstructive sleep apnea (OSA) is an underdiagnosed form of sleep-disordered breathing. The condition is characterized by partial or complete obstruction of the upper airway, leading to disrupted oxygenation, increased sympathetic activity, and oxidative stress5. Studies have shown that untreated OSA patients are at higher risk for cardiovascular events after MI6,7. Although the gold standard for detecting OSA is polysomnography, the STOP-BANG questionnaire is a validated tool for risk stratification8.

Artificial intelligence (AI) has recently become a significant method in data science and, with its rapid pace of innovation, is increasingly impacting medical research9,10,11. This development is driven in part by substantial increases in computational power and data accessibility. Machine learning (ML) algorithms are transforming data analysis in cardiovascular and respiratory medicine12,13,14. ML can examine, select, and integrate large numbers of interrelated variables while identifying nonlinear dependencies (patterns), improving classification and prediction beyond the limitations of traditional statistics15.

The current study sought to implement and assess the efficacy of ML in integrating the STOP-BANG score with existing clinical variables and derived scores to identify whether the OSA risk score could enhance existing risk prediction tools for post-myocardial infarction cardiovascular events within the same hospital admission.

Methods and materials

Study design

This prospective observational study was conducted on patients presenting with either ST-elevation MI (STEMI) or non-ST-elevation MI (NSTEMI). The study followed the guidelines of the Declaration of Helsinki and was approved by the Ethics Committee of Mashhad University of Medical Sciences (reference code IR.MUMS.IRH.REC.1403.144). The research took place in Ghaem Hospital, Mashhad, Iran, from March 2024 to March 2025. The inclusion criteria were (I) a minimum age of 18 years and (II) a duration of < 48 h between symptom onset and hospitalization; patients under active OSA treatment were excluded. Informed consent was obtained from all subjects and/or their legal guardian(s). Eligible patients were then assessed by the researchers for data collection. The patients received standard treatment for MI (either medical therapy or percutaneous coronary intervention (PCI)) throughout the study period and were followed during hospitalization for their outcome.

Data collection

A data collection sheet was used encompassing fields for demographic (sex and age) and anthropometric (height, weight, and neck circumference) data. The patients were also asked about their medical history of CVD, their medications, and risk factors for cardiovascular events. Furthermore, laboratory reports of electrolytes and hematological indices were retrieved from patients’ records. The angiography (culprit vessel and TIMI flow) and electrocardiogram (MI type, STEMI localization) reports were also recorded on the data collection sheets. Each patient received a score according to the GRACE scoring system; using an online calculator, the GRACE score was computed from age, heart rate, systolic blood pressure, serum creatinine, and Killip class16. The patients were also classified according to the Killip classification. Furthermore, the patients were evaluated with the STOP-BANG questionnaire, which assesses snoring, tiredness, observed apnea, high blood pressure, body mass index, age, neck circumference, and male gender17, assigning one point per item for a total possible score of 0–8. Traditionally, the STOP-BANG score indicates low (0–2), intermediate (3–4), or high (5–8) risk for OSA. However, a later study proposed a modified risk stratification based on STOP-BANG with enhanced specificity18, and the patients in this study were categorized accordingly. Patients labeled as intermediate risk (a STOP-BANG score of 3–4) were thus recategorized as high-risk in the presence of (a) a STOP sub-score of two or higher and (b) a BMI > 35 kg/m², a neck circumference > 40 cm, or male sex.
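For clarity, the modified stratification described above can be expressed as a short decision rule. The following is a minimal sketch; the function and argument names are our own illustrative choices, not the study's code:

```python
def stopbang_risk(total, stop, bmi, neck_cm, male):
    """Modified STOP-BANG risk category ('low', 'intermediate', or 'high').

    total:   overall STOP-BANG score (0-8)
    stop:    sub-score on the four STOP items (0-4)
    bmi:     body mass index in kg/m^2
    neck_cm: neck circumference in cm
    male:    True for male sex
    """
    if total <= 2:
        return "low"
    if total >= 5:
        return "high"
    # Intermediate scores (3-4) are upgraded to high risk when the STOP
    # sub-score is >= 2 together with at least one of: BMI > 35 kg/m^2,
    # neck circumference > 40 cm, or male sex.
    if stop >= 2 and (bmi > 35 or neck_cm > 40 or male):
        return "high"
    return "intermediate"
```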

Outcomes

An outcome in this study was defined as the occurrence of a cardiovascular event during the index hospitalization. These events included all-cause mortality, new-onset atrial fibrillation (AF), re-infarction, worsening heart failure, ventricular arrhythmia, recurrent or refractory angina, cardiogenic shock, and stroke. AF required a diagnosis according to ICD-9 code 427.31 during the hospital stay19. Re-infarction was diagnosed according to the ESC/ACC committee definition20. Criteria for worsening heart failure included new-onset pulmonary edema or worsening signs/symptoms of heart failure necessitating changes in medical treatment or a need for mechanical ventilation. Refractory angina referred to chest pain lasting longer than 48 h after PCI or requiring intensified medication. Cardiogenic shock was defined according to the SHOCK trial21, and stroke was diagnosed in the case of rapid onset of a documented neurologic deficit lasting > 24 h or until death.

Machine learning

All data handling and model development in this study were conducted using Python 3.11 in Jupyter Notebooks22, with SciPy 1.11.4, scikit-learn 1.2.2, Pandas 2.1.4, NumPy 1.26.4, XGBoost 2.1.3, and LightGBM 4.5.0. Figures were plotted using Matplotlib 3.7.5 and Seaborn 0.13.2, and SHAP (SHapley Additive exPlanations) 0.46.0 was used during feature selection.

We split the dataset into 70% training and 30% testing. All preprocessing steps (imputation and scaling) and feature-selection steps (RFE and SHAP) were performed strictly on the training set. Specifically, we computed the median values (for SimpleImputer) and the robust-scaling parameters using only the training data and then applied those same parameters to the test data. During preprocessing, missing values in the continuous features were imputed with SimpleImputer using the "median" strategy, given the non-normal distribution of the data. The details of the missing values for the included features are shown in Table 1. Finally, RobustScaler was fitted on the training set to normalize the feature ranges. The dataset was imbalanced; we therefore addressed this issue through settings native to the chosen algorithms rather than oversampling methods.
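The leakage-free fitting of the imputer and scaler can be sketched as follows; the data here are synthetic and the variable names illustrative, not the study's actual pipeline:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the clinical feature matrix, with ~5% missingness.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[rng.random(X.shape) < 0.05] = np.nan
y = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the median imputer and robust scaler on the training split only...
imputer = SimpleImputer(strategy="median").fit(X_tr)
scaler = RobustScaler().fit(imputer.transform(X_tr))

# ...then apply the same learned parameters to the held-out test split.
X_te_ready = scaler.transform(imputer.transform(X_te))
```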

Table 1 The amount of missing values in the features used in training the models.

Feature selection balanced technical measures against domain knowledge. First, the features already incorporated in GRACE and STOP-BANG were excluded. Then, Recursive Feature Elimination with five-fold cross-validation (using RandomForestClassifier) and Shapley values (using the XGBoost classifier) were applied, and the features found important by both methods were reviewed for relevance to ensure alignment with domain knowledge. The final selected features were SpO2, CRP, hemoglobin, platelets, potassium, glucose, admission-to-PCI time, EF, culprit vessel, STOP-BANG score, and GRACE score. The details of the features selected or removed based on technical measures or domain knowledge are shown in S1.
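The cross-validated RFE step can be illustrated on synthetic data as below (the complementary SHAP ranking is omitted for brevity); this is a sketch of the technique, not the study's code:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Synthetic stand-in: 10 candidate features, 4 of them informative.
X, y = make_classification(n_samples=200, n_features=10, n_informative=4,
                           random_state=42)

# Recursively eliminate features, scoring each candidate subset with
# five-fold cross-validation.
selector = RFECV(RandomForestClassifier(n_estimators=50, random_state=42),
                 cv=5, scoring="f1")
selector.fit(X, y)

kept = selector.support_  # boolean mask over the 10 candidate features
```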

Models were trained on three feature sets to address the research question. The first model was trained on all the selected features, the second on the GRACE score as the sole feature, and the third on the GRACE and STOP-BANG scores. In each scenario, the dataset was split into 70% train and 30% test sets using a fixed random seed so that the three models were evaluated on the same individuals in the test set. The test set was held out until the final step to minimize the risk of over-/underfitting. Seven algorithms were trained on the full feature set: the Extra Trees Classifier, Logistic Regression, Decision Tree, Random Forest, Gradient Boosting Classifier (GBC), XGBoost, and LightGBM. Given the simplicity of the two score-based models, Logistic Regression was selected for the GRACE and GRACE + STOP-BANG scenarios. Hyperparameters were tuned using five-fold cross-validation on the train set, and the training phase was conducted with ten-fold cross-validation. In the full-feature scenario, the best algorithm was selected according to global generalizability and F1 score (the Extra Trees estimator) and named the ML Model. The Receiver Operating Characteristic (ROC) and Precision-Recall (PR) curves, with the respective areas under the curve (AUC), were then plotted to compare the three models. Decision curve analysis (DCA)23 was also applied to compare the clinical usefulness of the models. Finally, the confusion matrix and the feature importance of the ML Model were plotted.
In a DCA plot, the y-axis is the net benefit (a balanced representation of the benefits of true positives and the costs of false positives), and the x-axis is the threshold probability at which a decision is made. For example, a conservative approach would take the intervention at lower threshold probabilities, at the cost of more false positives. DCA plots are therefore a practical way to show the net benefit of a given model or intervention. The decision curve of a model is compared with three reference curves: (a) always act, in which the intervention is made for everyone regardless of the prediction; (b) never act, in which no interventions are made, yielding no harm and no benefit; and (c) the oracle, the perfect scenario with perfect predictions and thus perfect net benefit.
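The net benefit plotted in a DCA follows a standard formula (true positives minus threshold-weighted false positives, per patient), which can be computed directly from a model's predictions; the sketch below is our own minimal implementation, not the study's code:

```python
import numpy as np

def net_benefit(y_true, y_prob, threshold):
    """Net benefit of acting on predictions above a threshold probability:
    TP/N - FP/N * threshold / (1 - threshold)."""
    y_true = np.asarray(y_true)
    act = np.asarray(y_prob) >= threshold
    n = len(y_true)
    tp = np.sum(act & (y_true == 1))
    fp = np.sum(act & (y_true == 0))
    return tp / n - fp / n * threshold / (1 - threshold)

def net_benefit_always_act(y_true, threshold):
    """Reference 'always act' curve: intervene for every patient."""
    prevalence = np.mean(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)
```

The "never act" reference is simply zero at every threshold, and the oracle's net benefit equals the prevalence (all true positives, no false positives).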

Statistical analysis

Due to the non-normal distribution of the data, continuous variables were reported as median with interquartile range (IQR). The study sample was divided based on the occurrence or absence of an event and then classified according to STOP-BANG risk categories. Continuous and categorical features were compared between the groups within each classification using non-parametric tests. The improvement of the models (from GRACE to GRACE + STOP-BANG) was assessed using DeLong's test to compare the AUCs24. Furthermore, Brier scores were calculated for each of the three models.
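The Brier score reported here is the mean squared difference between the predicted probability and the observed binary outcome (lower is better); a minimal, dependency-free sketch:

```python
def brier_score(y_true, y_prob):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)
```

For example, a model predicting 0.8 for an event and 0.2 for a non-event scores (0.2² + 0.2²) / 2 = 0.04.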

Results

The present study enrolled 227 eligible patients; 66 (29.07%) had at least one cardiovascular event after their index MI during hospitalization, while 161 (70.93%) did not. Four patients (1.76%) died during hospitalization. The baseline clinical features of the two groups are compared in Table 2. Notably, patients with an in-hospital event had a significantly higher proportion of males, older age, and longer hospitalization than those without an in-hospital event. Furthermore, these patients had significantly lower hemoglobin and higher RDW, creatinine, and CRP. Ejection fraction was significantly lower among patients with an in-hospital event (p-value < 0.001), while the GRACE score was significantly higher in these patients (p-value < 0.001). Although STOP-BANG scores were significantly higher among patients with an event (p-value < 0.001), categorizing subjects into the three risk classes showed no significant difference between the two groups (p-value = 0.3) (Table 2).

Table 2 The demographic and clinical features of study patients grouped into patients with and without a post-MI cardiovascular event.

The patients were also grouped based on the modified risk stratification of OSA. Accordingly, 57 (25.11%) were categorized as low-risk, 13 (5.72%) as intermediate, and 157 (69.16%) as high-risk patients. The GRACE score significantly differed between the groups and was highest in the group with intermediate risk for OSA (p-value = 0.02). The detailed comparison of these groups in terms of the assessed clinical features is shown in Table 3.

Table 3 The comparison of baseline features and the clinical outcome of study participants grouped based on the STOP-BANG modified risk stratification groups.
Table 4 The performance metrics of the developed models over cross-validation training and the test set.

The Extra Trees Classifier ranked first among the algorithms using the full feature set, with ROC-AUC = 0.82 (95% CI 0.66–0.92), F1 score = 0.66, precision = 0.75, and recall = 0.60. Training Logistic Regression on the GRACE and STOP-BANG scores yielded ROC-AUC = 0.76 (95% CI 0.61–0.88), F1 score = 0.59, precision = 0.71, and recall = 0.50. Training Logistic Regression on the GRACE score alone resulted in ROC-AUC = 0.70 (95% CI 0.58–0.82), F1 score = 0.57, precision = 0.52, and recall = 0.65. Comparing the GRACE + STOP-BANG and GRACE models shows improvements in ROC-AUC by 0.06, F1 score by 0.02, accuracy by 0.08, and precision by 0.19 (Table 4), and in average precision (AP) by 0.17 (Fig. 1); however, DeLong's test showed that the improvement in ROC-AUC was statistically non-significant. Furthermore, the ML Model had the lowest Brier score (0.14), while the GRACE + STOP-BANG model showed a slightly lower Brier score than the GRACE model.

Fig. 1
figure 1

The ROC (left) and PR (right) curves and AUCs for the three models trained in the study: the ML Model is the Extra Trees Classifier using all the features, while the other two models incorporate the GRACE score alone or the GRACE and STOP-BANG scores.

Also, an independent model using all the selected features except the STOP-BANG score was trained and compared with the ML Model to determine the effect of removing STOP-BANG. The new ROC-AUC (0.78, 95% CI 0.63–0.90) showed a non-significant reduction (p-value = 0.46) compared with the ML Model. However, the Brier score dropped by 0.02, the F1 score by 0.10, recall by 0.15, and accuracy by 0.03, while precision did not change.

The decision curve analysis showed higher net benefit for the ML Model than for the other two models. Moreover, incorporating the STOP-BANG risk score improved the new model's net benefit over a moderately wide threshold-probability range (0.4–0.7) (Fig. 2, top). Retraining a model without STOP-BANG resulted in lower net benefit across almost all threshold probabilities (Fig. 2, bottom). These results indicate that implementing STOP-BANG in the risk assessment of acute MI patients improved the net benefit of the models.

Fig. 2
figure 2

The decision curve analysis comparing the net benefit of the three models with the always-act and never-act scenarios (top) and the results of removing STOP-BANG from the ML Model (bottom).

Figure 3 shows the confusion matrices for the first- (Extra Trees Classifier) and second-ranked (Random Forest) algorithms in ML Model training. The model missed 9 (45%) of the events in the test set. On the other hand, the feature importance plot for the Extra Trees Classifier showed that the STOP-BANG score ranked as the most important feature for outcome prediction, while the GRACE score ranked fourth. The Random Forest algorithm showed similar results: STOP-BANG was the most important feature, while the GRACE score was among the least important (Fig. 4). Moreover, the decision tree plot in Fig. 5 shows the importance of the STOP-BANG score, hemoglobin, and GRACE score in categorizing patients. Accordingly, 84% of patients with a STOP-BANG score ≤ 5.5, hemoglobin ≤ 11.85, and GRACE > 136.85 had an in-hospital event.

Fig. 3
figure 3

Confusion matrices of the first and second best algorithms using the whole features (Extra Trees Classifier and Random Forest, respectively).

Fig. 4
figure 4

The feature importance of the first and second best algorithms using the whole features (Extra Trees Classifier and Random Forest, respectively), showing the ranking of predictors by their contribution to predictive performance.

Fig. 5
figure 5

Radial Decision Tree visualization with impurity-based node coloring: nodes represent decision points, colored by impurity (darker blue indicates higher impurity). Edge thickness highlights gain importance (thicker lines for higher gain), and node size corresponds to decision depth. The first and second percentages in each bracket show the portion of patients without and with events, respectively.

Discussion

The present study showed that adding the STOP-BANG score to the conventional risk estimation of MI patients for an in-hospital cardiac event (i.e., the GRACE score) improved the model. The performance of the models was compared from different perspectives. Adding STOP-BANG to the GRACE model led to higher F1, accuracy, and precision, suggesting overall better predictions and a lower false-positive rate. However, the comparison did not show a statistically significant improvement in ROC-AUC, which may not be a reliable indicator of model performance in our study: the dataset was imbalanced, which limits the reliability of ROC-AUC25 for judging performance. On the other hand, evaluating the key determinants of the ML Model's performance showed that the STOP-BANG score was the top contributor across different algorithms. Moreover, the DCAs further demonstrated the importance of STOP-BANG in identifying high-risk patients. DCA plots are particularly useful for practical decision-making when applying a model or intervention and are designed to overcome the limitations of traditional statistics26; this study showed a considerable drop in the net benefit of the ML Model when STOP-BANG was removed. These findings suggest that STOP-BANG holds considerable value in predicting the target variable.

Several reports support the increased risk of post-MI events among patients with OSA27,28. Furthermore, a study showed the prognostic value of OSA severity due to nocturnal hypoxia29. Our results also highlighted the importance of OSA by using a validated risk-estimation tool for this condition. Moreover, a meta-analysis has shown that OSA also increases the risk of cardiovascular events in patients undergoing PCI30. OSA plays a significant role in the prognosis of CVDs through multiple pathophysiological mechanisms: it contributes to oxidative stress, chronic inflammation, and endothelial dysfunction, all of which accelerate the progression of atherosclerosis and increase the risk of MI and stroke31. Additionally, OSA is strongly linked to cardiac arrhythmias, particularly atrial fibrillation (AF), worsening outcomes and complicating rhythm-control strategies. The intermittent hypoxia observed in OSA further exacerbates vascular damage by activating inflammatory pathways and increasing macrophage recruitment, ultimately leading to heightened coronary atherosclerosis and adverse cardiovascular events31,32.

Another study implementing ML models to explore the benefit of integrating the STOP-BANG score into the GRACE score also showed performance improvement14. Unlike the present study, their results showed a significant improvement in the model's ROC-AUC after adding STOP-BANG. Our dataset, however, exhibited greater class imbalance than theirs, and the feature sets used differed between the studies. The distribution of patients across STOP-BANG risk categories (low, moderate, and high) was also considerably different between the two studies. Although both studies demonstrated the value of STOP-BANG for better risk assessment, these differences may explain the non-significant rise in ROC-AUC in the present study. Similar to the present study, they also found that ejection fraction was among the features contributing significantly to the model's prediction, in line with the class I indication of echocardiography in the management of non-STEMI patients according to the 2014 AHA/ACC guidelines33. Of note, the death rate in the present study was 1.8%, comparable to other, larger cohorts34,35.

Notably, smoking status was not significantly associated with events, despite a previous study showing an association between smoking and atrial fibrillation36. While some reports describe a "smoker's paradox", with smokers appearing to have better short-term outcomes after MI, further analysis often reveals that any apparent benefit is "pseudo-protective." Indeed, longer-term data indicate that smoking increases the risk of recurrent MI and overall mortality37, and the acute post-MI setting and our in-hospital timeframe may have precluded detection of such an effect in this cohort.

As part of this study, an external validation of the GRACE tool was conducted on our sample. The results showed that the GRACE tool had moderate discriminative ability in our population, with an AUC of 0.70 (95% CI 0.58–0.82)38. As the tool was not originally developed on a population similar to ours, suboptimal performance is expected. For example, a Pakistani study also showed moderate discriminative ability for GRACE39, and another Asian study showed low discriminative ability for the GRACE model40. On the other hand, a Canadian study reported a C-statistic of 0.86 (95% CI 0.84–0.88)35 and a Dutch study one of 0.86 (95% CI 0.83–0.90), showing considerably higher discrimination of the tool in those populations compared to ours.

The present study’s limitations included the absence of a confirmatory evaluation for patients labeled as high-risk for OSA. However, this approach may be more feasible in clinical practice, given the cost-benefit of a questionnaire versus the gold-standard diagnostic tool. Furthermore, although the samples were selected from a referral hospital, the single-center nature and relatively small sample size may limit the generalizability of our findings. Although we employed domain-guided feature selection and cross-validation techniques to mitigate overfitting, the limited number of outcome events relative to the number of predictors still presents a risk of model overfitting. As such, the findings should be interpreted as hypothesis-generating and require validation in larger, multicenter cohorts.

We also used five-fold and ten-fold cross-validation during model training; however, the limited number of outcome events relative to model complexity presents a risk of overfitting despite our attempts to minimize it. This risk underscores the preliminary nature of our findings, which require validation in larger, preferably multicenter cohorts. Our approach to handling missing data also warrants comment: although multiple imputation is often recommended in predictive modeling studies41, we opted for median imputation due to the low proportion of missingness (all features < 10%) and the non-normal distribution of our variables. Finally, given the frequent presence of imbalanced datasets in medical research, we addressed class imbalance through algorithm-native settings and deliberately avoided synthetic over- or under-sampling techniques to prevent data distortion.

Conclusion

The current study demonstrated that using the STOP-BANG score in conjunction with the GRACE score enhances key performance metrics, including F1 score, accuracy, and precision, in assessing MI patients for post-MI in-hospital cardiovascular events. Moreover, the significant contribution of STOP-BANG to the trained machine learning algorithms and its impact on the DCA reinforce its clinical relevance. These findings suggest that STOP-BANG is a valuable addition to risk assessment models, offering practical benefits for clinical decision-making. Furthermore, this study showed that the GRACE tool has moderate discriminative ability in the study sample.