Introduction

Abdominal aortic aneurysm (AAA) is a dilation of the abdominal aorta to greater than 3 cm, which becomes life-threatening when ruptured, carrying a mortality rate over 80%1,2. Open surgical repair significantly reduces AAA rupture risk; however, the procedure itself carries a high rate of complications3. Menard and colleagues demonstrated that adverse events occur in over 25% of patients undergoing open repair of non-ruptured, infrarenal AAA3. As a result, the Society for Vascular Surgery (SVS) AAA guidelines recommend careful assessment of surgical risk when considering patients for intervention4.

Currently, there are no standardized tools to predict adverse outcomes following open AAA repair. A systematic review of 13 risk prediction models showed significant methodological limitations and variable performance across different populations5. Additionally, tools such as the American College of Surgeons (ACS) National Surgical Quality Improvement Program (NSQIP) surgical risk calculator6 use modelling techniques that require manual input of clinical variables, deterring routine use in busy medical environments7. The ability of clinicians to predict post-operative outcomes using clinical judgment alone is suboptimal, with a systematic review of 27 studies demonstrating area under the receiver operating characteristic curve (AUROC) values ranging from 0.51 to 0.758. Therefore, there is an important need to develop better and more practical risk prediction tools for patients being considered for open AAA repair.

Machine learning (ML) is an evolving technology that enables computers to learn from large datasets and make accurate predictions9. This field has been driven by the explosion of electronic information combined with increasing computational power10. Previously, our group used the Vascular Quality Initiative (VQI) database to develop a ML algorithm that accurately predicts in-hospital major adverse cardiovascular events (MACE) following open AAA repair11. The primary outcome was in-hospital MACE because longer term myocardial infarction and stroke were not well captured in the VQI open AAA database11. Furthermore, VQI primarily comprises data from North American centres12. In contrast, the NSQIP database contains data from patients across ~ 15 countries worldwide and captures 30-day outcomes13. Therefore, a NSQIP algorithm may have advantages over a VQI model, including the ability to capture MACE beyond the initial hospitalization and greater potential to be generalizable across countries13. In this study, we applied ML to the NSQIP database to predict 30-day MACE and other outcomes following elective, non-ruptured open AAA repair using pre-operative data.

Methods

Ethics

ACS NSQIP approved the experimental protocol and provided the blinded dataset. The Unity Health Toronto Research Ethics Board deemed that this study was exempt from review as the data came from a large, deidentified registry. Due to the retrospective nature of the study involving data originating from an anonymized registry, the Unity Health Toronto Research Ethics Board waived the need for obtaining informed consent. All methods were performed in accordance with the relevant guidelines and regulations, including the Declaration of Helsinki14.

Design

We conducted a ML-based prognostic study and the findings were reported based on the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis + Artificial Intelligence (TRIPOD + AI) statement (eTable 1)15. ML methods were based on our previous work11.

Dataset

Created in 2004, the ACS NSQIP database contains demographic, clinical, and 30-day outcomes data on surgical patients across over 700 hospitals in approximately 15 countries worldwide13. The information is prospectively collected from electronic health records by trained and certified clinical reviewers and regularly audited by ACS for accuracy16. Targeted NSQIP registries for vascular operations, which contain additional procedure-specific variables and outcomes, were developed in 2011 by vascular surgeons17.

Cohort

All patients who underwent open AAA repair between 2011 and 2021 in the ACS NSQIP targeted database were included. This information was merged with the main ACS NSQIP database for a complete set of generic and procedure-specific variables and outcomes. Patients treated for ruptured or symptomatic AAA, thoracoabdominal aortic aneurysm, thromboembolic disease, dissection, and graft infection were excluded.

Features

Thirty-five pre-operative variables were used as input features for our ML models. Given the unique advantage of ML in handling many input features, all available variables in the NSQIP database were used to maximize predictive performance. Demographic variables included age, sex, body mass index (BMI), race, ethnicity, and origin status. Comorbidities included hypertension, diabetes, smoking status, congestive heart failure (CHF), chronic obstructive pulmonary disease (COPD), end stage renal disease (ESRD) requiring dialysis, functional status, and American Society of Anesthesiologists (ASA) class. Previous procedures included open or endovascular AAA repair or open abdominal surgery. Pre-operative laboratory investigations included serum sodium, blood urea nitrogen (BUN), serum creatinine, albumin, white blood cell count, hematocrit, platelet count, international normalized ratio (INR), and partial thromboplastin time (PTT). Anatomic characteristics included AAA diameter, surgical approach, and proximal/distal AAA extent. Concomitant procedures included renal, visceral, and lower extremity revascularization. To account for the impact of advancements in surgical/anesthesia techniques over time on outcomes, the year of operation was included as an input variable. A complete list of features and definitions can be found in eTable 2.
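
To illustrate how such a feature matrix might be assembled, a minimal R sketch is shown below. It assumes a merged NSQIP data frame named `cohort`; the variable names are illustrative placeholders rather than the actual NSQIP field names, and the full 35-variable list is given in eTable 2.

```r
# Minimal sketch (R): assembling the pre-operative feature matrix.
# "cohort" and all column names are hypothetical placeholders, not actual NSQIP fields;
# only a subset of the 35 variables is shown here (see eTable 2 for the full list).
library(caret)

feature_names <- c("age", "sex", "bmi", "race", "ethnicity", "transfer_status",
                   "hypertension", "diabetes", "smoker", "chf", "copd", "dialysis",
                   "functional_status", "asa_class", "prior_evar", "prior_open_repair",
                   "creatinine", "bun", "albumin", "hematocrit", "inr",
                   "aaa_diameter", "proximal_extent", "distal_extent",
                   "renal_revasc", "visceral_revasc", "le_revasc", "operative_year")

x_raw   <- cohort[, feature_names]                        # pre-operative covariates only
dummies <- dummyVars(~ ., data = x_raw, fullRank = TRUE)  # one-hot encode categoricals
x       <- predict(dummies, newdata = x_raw)              # numeric design matrix for the models
```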

Outcomes

The primary outcome was 30-day post-procedural MACE, defined as a composite of myocardial infarction (MI), stroke, or death. MI was defined as electrocardiogram changes indicative of acute MI (ST elevation > 1 mm in two or more contiguous leads, new left bundle branch block, or new Q waves in two or more contiguous leads), new elevation in troponin greater than 3 times the upper limit of the reference range in the setting of suspected myocardial ischemia, or physician/advanced provider diagnosis of MI. Stroke was defined as motor, sensory, or cognitive dysfunction that persists for > 24 h in the setting of a suspected stroke. Death was defined as all-cause mortality.

Thirty-day post-procedural secondary outcomes included individual components of the primary outcome, any re-intervention, other morbidity, non-home discharge, and unplanned readmission. Other morbidity was defined as a composite of ischemic colitis, lower extremity ischemia requiring intervention, secondary AAA rupture, surgical site infection (SSI), wound dehiscence, pneumonia, unplanned reintubation, pulmonary embolism (PE), failure to wean from ventilator (cumulative time of ventilator-assisted respirations > 48 h), acute kidney injury (AKI; rise in creatinine of > 2 mg/dL from pre-operative value or requirement of dialysis in a patient who did not require dialysis pre-operatively), urinary tract infection (UTI), cardiac arrest, deep vein thrombosis (DVT) requiring therapy, Clostridium difficile infection, sepsis, or septic shock. Non-home discharge was defined as discharge to rehabilitation, skilled care, or other facility. These outcomes are defined by the ACS NSQIP data dictionary18.
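
As a rough illustration of how these composite labels can be derived from registry fields, the sketch below continues the hypothetical `cohort` data frame used above; all column names and category labels are stand-ins for the corresponding NSQIP/targeted-AAA variables rather than their true names.

```r
# Minimal sketch (R): deriving the 30-day composite outcome labels.
# Column names and discharge-destination labels are hypothetical placeholders.
cohort$mace_30d <- with(cohort,
  ifelse(mi_30d == 1 | stroke_30d == 1 | death_30d == 1, "Yes", "No"))

morbidity_cols <- c("ischemic_colitis", "le_ischemia_reintervention", "secondary_rupture",
                    "ssi", "dehiscence", "pneumonia", "reintubation", "pe",
                    "ventilator_gt_48h", "aki", "uti", "cardiac_arrest", "dvt",
                    "cdiff", "sepsis", "septic_shock")
cohort$other_morbidity_30d <- ifelse(rowSums(cohort[, morbidity_cols] == 1) > 0, "Yes", "No")

cohort$non_home_discharge <- ifelse(cohort$discharge_destination %in%
                                      c("Rehab", "Skilled Care", "Other Facility"),
                                    "Yes", "No")

y <- factor(cohort$mace_30d, levels = c("No", "Yes"))   # primary outcome label
```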

Model development

Six ML models were trained to predict 30-day primary and secondary outcomes: Extreme Gradient Boosting (XGBoost); random forest; Naïve Bayes classifier; radial basis function (RBF) support vector machine (SVM); multilayer perceptron (MLP) artificial neural network (ANN) with a single hidden layer, sigmoid activation function, and cross-entropy loss function; and logistic regression. These models were chosen because they have previously been shown to achieve excellent performance for predicting surgical outcomes19,20,21. The baseline comparator was logistic regression, which is the most common modelling technique used in traditional risk prediction tools22.
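
One plausible caret specification of these six learners is sketched below; the method identifiers are our assumption based on the packages cited in the Statistical analysis section, not an explicit statement of the implementation used.

```r
# Minimal sketch (R): one plausible caret mapping of the six candidate models.
# These method identifiers are assumptions inferred from the cited packages
# (xgboost, ranger, naivebayes, e1071/kernlab, nnet, and base glm).
model_specs <- c(
  xgboost             = "xgbTree",      # Extreme Gradient Boosting
  random_forest       = "ranger",       # random forest
  naive_bayes         = "naive_bayes",  # Naive Bayes classifier
  svm_rbf             = "svmRadial",    # RBF support vector machine (kernlab backend)
  mlp_ann             = "nnet",         # single-hidden-layer perceptron
  logistic_regression = "glm"           # baseline comparator (binomial family)
)
```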

Our data were split into training and test sets in a 70:30 ratio23. We then performed 10-fold cross-validation and grid search on the training set to find optimal hyperparameters for each ML model24,25. Preliminary analysis of our data demonstrated that the primary outcome was uncommon, occurring in 311 of 3,620 patients (8.6%) in our cohort. Class balance was improved using Random Over-Sampling Examples (ROSE), a technique that uses smoothed bootstrapping to draw new samples from the feature space neighbourhood around the minority class and is a commonly used method to support predictive modelling of rare events26. The models were then evaluated on test set data and ranked based on the primary discriminatory metric of AUROC. Our best performing model was XGBoost, with the following hyperparameters optimized for our dataset: number of rounds = 150, maximum tree depth = 3, learning rate = 0.3, gamma = 0, column sample by tree = 0.6, minimum child weight = 1, subsample = 1. eTable 3 details the process for selecting these hyperparameters. Once we identified the best performing ML model for the primary outcome, the algorithm was further trained to predict secondary outcomes.
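
A condensed sketch of this pipeline is shown below, assuming the feature matrix `x` and outcome `y` from the sketches above and a caret/ROSE/xgboost toolchain. For brevity the tuning grid is collapsed to the single reported optimum, whereas the actual search covered the wider grid in eTable 3, and the random seed is purely illustrative.

```r
# Minimal sketch (R): 70:30 split, 10-fold cross-validation with ROSE re-sampling,
# and an XGBoost fit at the hyperparameter values reported above.
library(caret)   # the ROSE and xgboost packages must also be installed
library(pROC)

set.seed(42)                                               # illustrative seed
idx     <- createDataPartition(y, p = 0.7, list = FALSE)   # 70:30 train/test split
train_x <- x[idx, ];  test_x <- x[-idx, ]
train_y <- y[idx];    test_y <- y[-idx]

ctrl <- trainControl(method = "cv", number = 10,           # 10-fold cross-validation
                     sampling = "rose",                    # ROSE class re-balancing
                     classProbs = TRUE, summaryFunction = twoClassSummary)

grid <- expand.grid(nrounds = 150, max_depth = 3, eta = 0.3, gamma = 0,
                    colsample_bytree = 0.6, min_child_weight = 1, subsample = 1)

fit <- train(x = train_x, y = train_y, method = "xgbTree",
             metric = "ROC", trControl = ctrl, tuneGrid = grid)

test_prob <- predict(fit, newdata = test_x, type = "prob")[, "Yes"]
roc_obj   <- roc(response = test_y, predictor = test_prob)  # test-set discrimination
auc(roc_obj); ci.auc(roc_obj)                               # AUROC with 95% CI
```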

Statistical analysis

Baseline demographic and clinical characteristics for patients with and without 30-day MACE were summarized as means (standard deviation) or number (proportion). Differences in characteristics between groups were assessed using independent t-test for continuous variables or chi-square test for categorical variables. Statistical significance was set at two-tailed p < 0.05.
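
For illustration, under the same hypothetical `cohort` data frame used above, these comparisons reduce to standard base-R tests; the variable names remain placeholders.

```r
# Minimal sketch (R): baseline comparisons between patients with and without 30-day MACE.
t.test(age ~ mace_30d, data = cohort)                    # continuous: independent t-test
chisq.test(table(cohort$hypertension, cohort$mace_30d))  # categorical: chi-square test
```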

The primary metric for assessing model performance was AUROC (95% CI), a validated method to assess discriminatory ability that considers both sensitivity and specificity27. Secondary performance metrics were accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). To further assess model performance, we plotted a calibration curve and calculated the Brier score, which measures the agreement between predicted and observed event probabilities28. In the final model, feature importance was determined by ranking the top 10 predictors based on the variable importance score (gain), which is a measure of the relative impact of each covariate in contributing to an overall prediction29. To assess model robustness on various populations, we performed subgroup analysis of predictive performance based on age (under vs. over 70 years), sex (male vs. female), race (White vs. non-White), ethnicity (Hispanic vs. Non-Hispanic), proximal AAA extent (infrarenal vs. juxta/para/suprarenal), presence of prior open or endovascular AAA repair, and need for concomitant renal/visceral/lower extremity revascularization.
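
A sketch of these evaluation steps, continuing from the fitted model above, is shown below. The 0.5 classification threshold and the "Female" subgroup label are assumptions for illustration only; the calibration curve itself is plotted from the same predicted probabilities.

```r
# Minimal sketch (R): secondary metrics, Brier score, gain-based importance, and a
# subgroup AUROC. The 0.5 threshold is an assumption, not the reported operating point.
library(caret)
library(pROC)

test_pred <- factor(ifelse(test_prob >= 0.5, "Yes", "No"), levels = c("No", "Yes"))
confusionMatrix(test_pred, test_y, positive = "Yes")     # accuracy, sens, spec, PPV, NPV

brier <- mean((test_prob - as.numeric(test_y == "Yes"))^2)   # Brier score

imp <- varImp(fit)                                       # gain-based importance for xgbTree
head(imp$importance[order(-imp$importance$Overall), , drop = FALSE], 10)

# Example subgroup discrimination (sex); other subgroups are handled analogously
f <- cohort$sex[-idx] == "Female"
auc(roc(test_y[f],  test_prob[f]))
auc(roc(test_y[!f], test_prob[!f]))
```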

Based on a validated sample size calculator for clinical prediction models, to achieve a minimum AUROC of 0.8 with an outcome rate of ~ 8% and 35 input features, the minimum sample size required is 3,213 patients with 258 events30,31. Our cohort of 3,620 patients with 311 primary events meets this sample size requirement. There was less than 5% missing data for variables of interest; therefore, complete-case analysis was applied whereby only non-missing covariates for each patient were considered32. This has been demonstrated to be a valid analytical method for datasets with small amounts of missing data (< 5%) and reflects predictive modelling of real-world data, which inherently includes missing information33,34. Given the small percentage of missing data, the use of various imputation methods, including multiple imputation by chained equations and mean/mode imputation, did not change model performance. All analyses were performed in R version 4.2.135 with the following packages: caret36, xgboost37, ranger38, naivebayes39, e107140, nnet41, and pROC42.
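
A quick check of the missingness that motivated the complete-case approach might look like the following sketch, using the placeholder feature names from above.

```r
# Minimal sketch (R): confirming that variables of interest have < 5% missing data
# before applying complete-case analysis.
miss_rate <- sort(colMeans(is.na(cohort[, feature_names])), decreasing = TRUE)
round(100 * head(miss_rate), 1)   # percentage missing for the most incomplete variables
```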

Conference presentation

Presented at the Society for Vascular Surgery 2024 Vascular Annual Meeting in Chicago, Illinois, United States (June 19–22, 2024).

Results

Patients and events

From a cohort of 5,784 patients who underwent open AAA repair in the NSQIP targeted database between 2011 and 2021, we excluded 2,164 patients for the following reasons: treatment for ruptured AAA (n = 1,421), symptomatic AAA (n = 401), type IV thoracoabdominal aortic aneurysm (n = 103), thromboembolic disease (n = 124), dissection (n = 90), and graft infection (n = 25). Overall, we included 3,620 patients. The primary outcome of 30-day MACE occurred in 311 (8.6%) patients. The 30-day secondary outcomes occurred in the following distribution: MI (n = 154 [4.3%]), stroke (n = 28 [0.8%]), death (n = 170 [4.7%]), re-intervention (n = 344 [9.5%]), other morbidity (n = 947 [26.2%]; composite of ischemic colitis (n = 120), lower extremity ischemia requiring intervention (n = 83), secondary AAA rupture (n = 11), SSI (n = 118), dehiscence (n = 48), pneumonia (n = 226), unplanned reintubation (n = 223), PE (n = 21), failure to wean from ventilator (n = 274), AKI (n = 209), UTI (n = 76), cardiac arrest (n = 81), DVT (n = 46), sepsis (n = 54), septic shock (n = 88), Clostridium difficile infection (n = 28)), non-home discharge (n = 905 [25.0%]), and unplanned readmission (n = 214 [5.9%]).

Pre-operative demographic and clinical characteristics

Compared to patients without a primary outcome, those who developed 30-day MACE were older and more likely to be transferred from another hospital and reside in nursing homes. They were also more likely to have hypertension, insulin dependent diabetes, CHF, COPD, a previous endovascular AAA repair, and be current smokers and ASA class 4 or higher. Notable differences in laboratory investigations included patients with a primary outcome having higher levels of creatinine and BUN. Anatomically, patients with 30-day MACE had a larger mean AAA diameter, with a greater proportion having suprarenal, pararenal, or juxtarenal aneurysms. They were also more likely to require concomitant renal, visceral, and lower extremity revascularization (Table 1).

Table 1 Pre-operative demographic and clinical characteristics of patients undergoing open abdominal aortic aneurysm repair with and without major adverse cardiovascular events at 30 days.

Model performance

Of the 6 ML models evaluated on test set data for predicting 30-day MACE following open AAA repair, XGBoost had the best performance with an AUROC (95% CI) of 0.90 (0.89–0.91) compared to random forest [0.88 (0.87–0.89)], RBF SVM [0.86 (0.85–0.88)], Naïve Bayes [0.82 (0.81–0.83)], MLP ANN [0.77 (0.75–0.79)], and logistic regression [0.66 (0.64–0.68)]. The other performance metrics of XGBoost were the following: accuracy 0.81 (95% CI 0.80–0.82), sensitivity 0.81, specificity 0.81, PPV 0.82, and NPV 0.80 (Table 2).

Table 2 Model performance on test set data for predicting 30-day major adverse cardiovascular events following open abdominal aortic aneurysm repair using pre-operative features.

For 30-day secondary outcomes, XGBoost achieved the following AUROCs (95% CI): MI [0.87 (0.86–0.88)], stroke [0.88 (0.87–0.89)], death [0.91 (0.90–0.92)], re-intervention [0.81 (0.80–0.83)], other morbidity [0.85 (0.84–0.87)], non-home discharge [0.90 (0.89–0.91)], and unplanned readmission [0.83 (0.81–0.84)] (Table 3).

Table 3 XGBoost performance on test set data for predicting 30-day primary and secondary outcomes following open abdominal aortic aneurysm repair using pre-operative features.

The ROC curve for prediction of 30-day MACE using XGBoost is demonstrated in Fig. 1. Our model achieved good calibration with a Brier score of 0.03, indicating excellent agreement between predicted and observed event probabilities (Fig. 2). The top 10 predictors of 30-day MACE in our XGBoost model were the following: (1) prior endovascular AAA repair, (2) concomitant renal revascularization, (3) CHF, (4) ASA class, (5) concomitant visceral revascularization, (6) age, (7) proximal AAA extent, (8) transfer from another hospital, (9) COPD, and (10) pre-operative creatinine (Fig. 3).

Fig. 1
figure 1

Receiver operating characteristic curve for predicting 30-day major adverse cardiovascular events following open abdominal aortic aneurysm repair using Extreme Gradient Boosting (XGBoost) model. AUROC (area under the receiver operating characteristic curve), CI (confidence interval).

Fig. 2
figure 2

Calibration plot with Brier score for predicting 30-day major adverse cardiovascular events following open abdominal aortic aneurysm repair using Extreme Gradient Boosting (XGBoost) model.

Fig. 3
figure 3

Variable importance scores (gain) for the top 10 predictors of 30-day major adverse cardiovascular events (MACE) following open abdominal aortic aneurysm repair in the Extreme Gradient Boosting (XGBoost) model. Abbreviations: EVAR (endovascular aneurysm repair), CHF (congestive heart failure), AAA (abdominal aortic aneurysm), COPD (chronic obstructive pulmonary disease). Explanation of figure: patients who underwent a prior EVAR, required concomitant renal/visceral revascularization, had a more proximal AAA extent, or were transferred from another hospital likely underwent a more anatomically/technically challenging repair and were therefore more likely to suffer 30-day MACE. Furthermore, patients of older age or with more comorbidities, including CHF, COPD, or chronic kidney disease (as reflected by pre-operative creatinine), which in turn contribute to a higher ASA class, were more medically complex and therefore more likely to suffer 30-day MACE.

Subgroup analysis

Our XGBoost model performance for predicting 30-day MACE remained excellent on subgroup analysis of specific demographic and clinical populations, with the following AUROCs (95% CI): age < 70 [0.90 (0.88–0.92)] and age > 70 [0.88 (0.87–0.90)] (eFigure 1), males [0.89 (0.88–0.91)] and females [0.91 (0.89–0.92)] (eFigure 2), White patients [0.89 (0.88–0.91)] and non-White patients [0.90 (0.88–0.91)] (eFigure 3), Hispanic patients [0.89 (0.87–0.92)] and non-Hispanic patients [0.90 (0.89–0.91)] (eFigure 4), infrarenal AAA [0.90 (0.89–0.92)] and juxta/para/suprarenal AAA [0.89 (0.87–0.91)] (eFigure 5), patients with prior AAA repair [0.90 (0.88–0.91)] and without prior AAA repair [0.90 (0.89–0.91)] (eFigure 6), and patients requiring concomitant renal/visceral/lower extremity revascularization [0.89 (0.88–0.91)] and those who did not [0.89 (0.88–0.91)] (eFigure 7).

Discussion

Summary of findings

In this study, we used data from the ACS NSQIP targeted AAA files between 2011 and 2021 consisting of 3,620 patients who underwent open AAA repair to develop ML models that accurately predict 30-day MACE with an AUROC of 0.90. Our algorithms also predicted 30-day MI, stroke, death, re-intervention, other morbidity, non-home discharge, and unplanned readmission with AUROCs ranging from 0.81 to 0.91. We showed several other key findings. First, patients who develop 30-day MACE following open AAA repair represent a high-risk population with several predictive factors at the pre-operative stage. Notably, they are older with more comorbidities, have more complex aneurysms, and are more likely to require concomitant revascularization procedures. Second, we trained 6 ML models to predict 30-day MACE using pre-operative features and showed that XGBoost achieved the best performance. Our model was well calibrated and achieved a Brier score of 0.03. On subgroup analysis based on age, sex, race, ethnicity, proximal AAA extent, prior AAA repair, and need for concomitant revascularization, our algorithms maintained robust performance. Finally, we identified the top 10 predictors of 30-day MACE in our ML models. These features can be used by clinicians to identify factors that contribute to risk predictions, thereby guiding patient selection and pre-operative optimization. For example, patients with multiple comorbidities could be further assessed and optimized through pre-operative consultations with cardiologists or internal medicine specialists to mitigate adverse events43,44. Overall, we have developed robust ML-based prognostic models with excellent predictive ability for perioperative outcomes following open AAA repair, which may help guide clinical decision-making to improve outcomes and reduce costs from complications, reinterventions, and readmissions.

Comparison to existing literature

A systematic review of 13 outcome prediction models for AAA repair was published by Lijftogt et al. (2017)5. The authors showed that most models had significant methodological and clinical limitations, including lack of calibration measures, heterogeneous datasets, and large numbers of variables requiring manual input5. After assessing multiple widely studied models including the Cambridge POSSUM (Physiological and Operative Severity Score for the enUmeration of Mortality and morbidity), Glasgow Aneurysm Score (GAS), and Vascular Study Group of New England (VSGNE) model, the authors concluded that the best performing algorithms were the British Aneurysm Repair (BAR) score (AUROC 0.83) and Vascular Biochemistry and Hematology Outcome Model (VBHOM; AUROC 0.85)5. Both models were geographically limited to United Kingdom datasets5. VBHOM was trained on patients undergoing either elective or ruptured AAA repair; however, when applied to an elective open AAA repair cohort, performance declined significantly to AUROC 0.685. We applied novel ML techniques to a multi-national NSQIP cohort consisting specifically of patients undergoing open repair for non-ruptured, non-symptomatic AAA and achieved an AUROC of 0.90 for the primary outcome of 30-day MACE. Our ML model also has the advantages of excellent calibration (Brier score 0.03) and automated input of variables. Additionally, most existing models focus on mortality prediction, while our algorithm's ability to predict secondary outcomes including non-home discharge and unplanned readmission allows for potential impact on both patient outcomes and health care costs45. Overall, in comparison to existing tools, our ML algorithms are methodologically robust, perform better, and consider a greater number of clinically relevant outcomes.

Using NSQIP data, Bonde et al. (2021) trained ML algorithms on a cohort of patients undergoing over 2,900 different procedures to predict peri-operative complications, achieving AUROCs of 0.85–0.8846. Given that patients undergoing open AAA repair represent a unique population often with a high number of vascular comorbidities, the applicability of general surgical risk prediction tools may be limited47. Elsewhere, Eslami and colleagues (2017) used logistic regression to develop a mortality risk prediction model for elective endovascular and open AAA repair using NSQIP data, achieving an AUROC of 0.7548. By applying more advanced ML techniques to an updated NSQIP cohort and developing algorithms specific to patients undergoing open AAA repair, we achieved an AUROC of 0.90. Therefore, we demonstrate the value of building procedure-specific ML models, which can improve predictive performance. In comparison to our previously described VQI open AAA repair model11, the NSQIP algorithm predicts longer term MACE (30-day vs. in-hospital) and achieved similar predictive performance (AUROCs ≥ 0.90). Given that NSQIP data captures information from ~ 15 countries13, compared to 3 for VQI12, the NSQIP model may be more generalizable across countries. This demonstrates the value of using NSQIP data to build ML models. This algorithm complements our previously described ML model for predicting 1-year mortality following endovascular AAA repair49.

Explanation of findings

There are several explanations for our findings. First, patients who develop adverse events following open AAA repair represent a high-risk group with multiple cardiovascular and anatomic risk factors, which is corroborated by previous literature50. The SVS AAA guidelines provide several recommendations regarding careful surgical risk assessment, appropriate patient/procedure selection, and pre-operative optimization for patients being considered for AAA repair4. In particular, there is a strong recommendation for smoking cessation at least 2 weeks before aneurysm repair, yet over 50% of patients who developed 30-day MACE in our cohort were current smokers at the time of intervention4. Therefore, there are important opportunities to improve care for patients by understanding their surgical risk and medically optimizing them prior to surgery51. Additionally, we demonstrate the contribution of anatomic complexity to adverse outcomes due to proximal aneurysm extent, prior endovascular AAA repair, and need for concomitant renal/visceral/lower extremity revascularization, which is corroborated by previous literature52. These patients may therefore benefit from multidisciplinary vascular assessment to ensure appropriate patient/procedure selection, including consideration of surveillance and advanced endovascular therapies53. Second, our ML models performed better than logistic regression for several reasons. Compared to logistic regression, advanced ML techniques can better model complex, non-linear relationships between inputs and outputs54,55. This is especially important in health care data, where patient outcomes can be influenced by many factors56. Our best performing algorithm was XGBoost, which has unique advantages including the avoidance of overfitting and faster computing while maintaining precision57,58,59. Furthermore, XGBoost works well with structured data, which may explain why it outperformed more complex algorithms such as MLP ANN on the NSQIP database60. While advanced ML models may outperform logistic regression in predictive performance, it is important to consider the limitations of ML models, including overfitting, reliance on large datasets, and potential lack of mechanistic insight61,62. Traditional statistical techniques are better designed to characterize specific relationships between variables and outcomes, while ML is more focused on developing high-performing prediction models while potentially sacrificing explainability63,64. Third, our model performance remained robust on subgroup analysis of specific demographic and clinical populations. This is an important finding given that algorithm bias is a significant issue in ML studies65. We were likely able to avoid these biases due to the excellent capture of sociodemographic data by ACS NSQIP, a multi-national database that includes diverse patient populations66,67. Fourth, the fact that a prior endovascular AAA repair was the most important predictor of 30-day MACE in our model suggests the importance of following device manufacturers’ instructions for use to reduce the risk of complications requiring open re-intervention such as endoleak, graft migration, and occlusion68.

Implications

Our ML models can be used to guide clinical decision-making in several ways. Pre-operatively, a patient predicted to be at high risk of an adverse outcome should be further assessed in terms of modifiable and non-modifiable factors69. Patients with significant non-modifiable risks may benefit from surveillance alone or consideration of endovascular therapy70. Specifically, those with anatomically complex aneurysms may benefit from multidisciplinary vascular assessment to optimize patient selection and procedure planning53. Those with modifiable risks, such as multiple cardiovascular comorbidities, should be referred to cardiologists or internal medicine specialists for further evaluation43,44. Pre-operative anesthesiology consultation may also be helpful for high-risk patients if not already a part of routine care71. Additionally, patients at high risk of non-home discharge or readmission should receive early support from allied health professionals to optimize safe discharge planning72. These peri-operative decisions guided by our ML models have the potential to improve outcomes and reduce costs by mitigating adverse events.

The programming code used for the development and evaluation of our ML models is publicly available through GitHub. These tools can be used by clinicians involved in the peri-operative management of patients being considered for open AAA repair. On a systems-level, our models can be readily implemented by the > 700 centres globally that currently participate in ACS NSQIP. They also have potential for use at non-NSQIP sites, as the input features are commonly captured variables for the routine care of vascular surgery patients73. Given the challenges of deploying prediction models into clinical practice, consideration of implementation science principles is critical74. A major limitation of existing tools is the need for manual input of variables by clinicians, which is time-consuming and deters routine use7. Our ML models can automatically extract a patient’s prospectively collected NSQIP data to make surgical risk predictions, thereby improving practicality in busy clinical settings6. Given that our models are designed to automatically extract many features from the NSQIP database to make risk predictions and model performance declined significantly with a reduction in the number of features, we opted not to develop a simplified model with fewer input variables. Rather than a web-based application for clinician use, which is limited by the need to manually input variables, our recommended deployment plan involves integration strategies with hospital electronic health record (EHR) systems and a user-friendly decision-support tool. Specifically, significant efforts have been made to integrate clinical registry data with EHR systems75,76,77. For example, clinical notes may be automatically extracted into structured NSQIP variables through natural language processing techniques75,76,77. Through this work, our open-source models can be deployed in hospital EHR systems with the support of institutional data analytics teams to provide automated risk predictions to support clinical decision-making. We advocate for dedicated health care data analytics teams at the institution level, as their significant benefits have been previously demonstrated and model implementation can be facilitated by these experts78. Through this study, we have also provided a framework for the development of robust ML models that predict open AAA repair outcomes, which can be applied by individual centers for their specific patient populations.

Limitations

Our study has several limitations. First, our models were developed using ACS NSQIP data. Future studies should assess whether performance can be generalized to institutions that do not participate in ACS NSQIP or record the pre-operative features used in our models. Validation of the models using an external dataset would strengthen generalizability. However, this was not feasible for this study given differences in the definitions of variables and outcomes between various databases, such as NSQIP and VQI12,18. With the recent creation of the ACS/SVS Vascular Verification Program, there may be an opportunity to perform external validation of our models on a unified NSQIP/VQI dataset in the future79. Furthermore, prospective validation of our models on a real-world cohort to assess predictive performance and impact on patient outcomes in future work would further strengthen the potential clinical utility of the models. Second, the ACS NSQIP database captures 30-day outcomes. Evaluation of ML algorithms on other data sources with longer follow-up would augment our understanding of long-term surgical risk. Third, we evaluated 6 ML models; however, other ML models exist. We chose these 6 ML models because they have been demonstrated to achieve the best performance for predicting surgical outcomes using structured data19. We achieved excellent performance; however, ongoing evaluation of novel ML techniques would be prudent. Fourth, surgeon experience and hospital volumes were not recorded in our ACS NSQIP dataset, which may have reduced the predictive performance of our models. Given the potential explanatory power of these variables on surgical outcomes, it would be prudent to build future ML models on datasets that capture these variables.

Conclusions

In this study, we used the ACS NSQIP database to develop robust ML models that pre-operatively predict 30-day MACE following open AAA repair with excellent performance (AUROC 0.90). Our algorithms also predicted MI, stroke, death, re-intervention, other morbidity, non-home discharge, and readmission with AUROCs of 0.81–0.91. Notably, our models remained robust across demographic/clinical subpopulations and outperformed existing prediction tools and logistic regression, and therefore have potential for important utility in the peri-operative management of patients being considered for open AAA repair to mitigate adverse outcomes. Prospective validation of our ML algorithms is warranted.