Background

Mechanical ventilation through an endotracheal tube is a life-saving, invasive intervention designed to support or replace spontaneous respiration in patients experiencing acute respiratory distress1,2. Critically ill patients often require tracheal intubation3,4,5, and between 38% and 88% of patients admitted to critical care units require endotracheal intubation and mechanical ventilation6,7. Among intubated ICU patients, 10-15% experience hypoxia, 18% experience cardiovascular instability, and 2% experience cardiac arrest8,9. In addition, the first-pass success rate of tracheal intubation in the critically ill is often below 80%, with up to 20% of tracheal intubations requiring more than two attempts, and multiple intubation attempts are associated with a higher risk of adverse outcomes10. The incidence of peri-intubation adverse events is further elevated in patients with anticipated difficult airways11. Moreover, the diminished physiological reserve of ICU patients contributes substantially to the risks of profound peri-intubation hypoxemia, hypotension, arrhythmia, cardiac arrest, and death12.

Despite these high stakes13,14,15, the timing of tracheal intubation in the ICU has historically lacked rigorous research support. The optimal timing of intubation, early versus late16,17, remains controversial, resulting in considerable variability in clinical intubation practice18. There is therefore an urgent need for objective, standardized methods to determine the appropriate timing and patient selection for intubation, in order to reduce long-term complications and mortality among ICU patients19,20,21.

As a subfield of artificial intelligence, machine learning enables computers to identify patterns and classify data by training models on historical or existing data, thereby facilitating outcome prediction22. Machine learning has been applied across healthcare domains, including gastroenterology, pulmonology, neurology, cardiology, emergency medicine, and critical care medicine22,23,24,25,26,27,28,29. As electronic health record (EHR) datasets grow in size and complexity, machine learning-based predictive models can enhance patient safety and improve outcomes while controlling costs30,31,32,33.

While medicine is increasingly leveraging the potential of machine learning, a key challenge in applying ML models to healthcare remains their interpretability30. Although algorithms such as logistic regression and decision trees offer a high degree of transparency, many commonly used machine learning algorithms face significant interpretability challenges, commonly referred to as the ‘black box’ problem34. To address this challenge, Lundberg and Lee proposed SHAP (SHapley Additive exPlanations), a model-agnostic method based on Shapley values from cooperative game theory that offers insights into individual predictions for kernel-based and tree-based models alike35. SHAP’s capacity to attribute each feature’s contribution to a model’s prediction has proven valuable in medical applications, including early-stage prognostic prediction in sepsis23 and mortality prediction for gastrointestinal bleeding in intensive care units22. Despite these developments, to our knowledge SHAP has not yet been explored in the context of intubation prediction, presenting an opportunity for further investigation.

Therefore, the objective of our study was to develop and validate a machine learning model based on clinical features for the early prediction of tracheal intubation in critically ill patients. The most effective model was selected for external validation and its interpretation was performed using the SHAP method.

Methods

Source of data

The data used for model development were obtained from the MIMIC-IV36 database at Beth Israel Deaconess Medical Center, which contains comprehensive, high-quality data on patients admitted to ICUs between 2008 and 2019. MIMIC-IV holds deidentified data, including admissions, discharges, diagnoses, monitoring data, and laboratory results.

The external validation data were obtained from the eICU Collaborative Research Database (eICU-CRD)37, a multicenter database comprising deidentified health data associated with over 200,000 ICU admissions in the United States between 2014 and 2015. Both databases are released under the Health Insurance Portability and Accountability Act (HIPAA) safe harbor provision, which eliminates the need for additional ethical approval. Data extraction was performed by one author (Xiangjie Duan, CITI record number: 42039823), who had obtained access to both databases. This study is reported in accordance with the recommendations of the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis statement (TRIPOD, additional file 2)38.

Selection of participants and outcome definition

We enrolled all patients aged 18 years and above in the MIMIC-IV database who had not received invasive mechanical ventilation prior to ICU admission. For patients with multiple hospital admissions, we included only data from the first ICU stay of the initial hospitalization. Patients were excluded if (1) they stayed in the ICU for less than 24 h, (2) intubation was initiated before hour 24 of ICU admission, or (3) they were extubated before 24 consecutive hours of their first invasive mechanical ventilation39,40. Patients without blood gas measurements were also excluded, since previous studies have demonstrated that pulse oximetry is not a reliable indicator at low oxygen saturation levels (a common sign preceding endotracheal intubation), especially in critically ill patients compared with healthy volunteers41, and that such readings should be interpreted with caution relative to invasive tests such as blood gas analysis42. In the eICU-CRD validation cohort, all eligible patients who did not meet any of the exclusion criteria described above were included.

The outcome of the present study was the requirement for endotracheal intubation within 24 to 96 h after ICU admission. We set the start of the prediction window beyond 24 h for the following reasons. First, physician shifts are often 12–24 h, so a prediction window beyond this period helps the incoming physician gain a more comprehensive understanding of a patient’s potential need for intubation. Second, research has indicated that during the first 24 h of ICU admission, patients often exhibit erratic fluctuations in their physiological and biochemical indicators43,44. Newly admitted ICU patients consequently receive heightened attention from the healthcare team compared with those already admitted, and extending the earliest prediction time beyond 24 h helps optimize the allocation of medical resources, including ventilators and nursing teams. Additionally, certain indicators, such as the APACHE score, can only be accurately measured after a patient has spent at least 24 h in the ICU.

Variable collection

In the MIMIC-IV cohort, data were extracted from hours 4 to 24 after ICU admission, covering five categories of variables: (a) demographic characteristics; (b) other static information (e.g., length of stay before ICU admission, length of stay in the ICU before intubation); (c) vital signs; (d) laboratory data (e.g., blood analysis, blood gas, liver function, renal function, electrolytes); and (e) clinical scores (e.g., SOFA, GCS), which were calculated and included as predictors. Detailed descriptions of all predictive variables are listed in Supplementary Table 1 (additional file 1). These variables were selected to ground the model in vital signs, routine laboratory tests commonly performed in the ICU, and standard interventions, ensuring broad applicability across diverse ICU settings without imposing additional burden.

Model building

The MIMIC-IV dataset was randomly split into three distinct sets at a ratio of 7:1:2. This allocation yielded 8,391 samples for training the prediction models; an internal validation set of 1,199 samples, reserved for feature selection and model simplification; and a hold-out test set of 2,398 samples to independently assess model performance. The dataset obtained from eICU-CRD served as the external validation set, used to assess the generalizability of the prediction model.
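For concreteness, the 7:1:2 split could be reproduced with scikit-learn as follows; this is a minimal sketch assuming a feature matrix X and an outcome vector y (intubation within 24–96 h), with an illustrative random seed, not the authors’ actual code.

```python
# Minimal sketch of the 7:1:2 split, assuming a feature matrix X and an
# outcome vector y (intubation within 24-96 h); the seed is illustrative.
from sklearn.model_selection import train_test_split

# Carve off the 20% hold-out test set first, stratifying on the outcome
# so each partition preserves the intubation rate.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)

# Split the remaining 80% into training (70% overall) and internal
# validation (10% overall): 0.10 / 0.80 = 0.125 of the development set.
X_train, X_val, y_train, y_val = train_test_split(
    X_dev, y_dev, test_size=0.125, stratify=y_dev, random_state=42)
```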

Prior to model training, we conducted a preliminary analysis of the dataset and excluded features with more than 50% missing data. To handle the remaining missing values, we applied K-nearest neighbours imputation within the training, internal validation, and test sets separately to avoid data leakage. To address the class imbalance in the dataset, we employed a range of resampling techniques, including SMOTE, random oversampling, random undersampling, Tomek links, and the combination of SMOTE and edited nearest neighbours (SMOTE-ENN), implemented in the Python imbalanced-learn package. To construct prediction classifiers, we employed seven machine learning algorithms: CatBoost (Categorical Boosting), LightGBM (Light Gradient Boosting Machine), XGBoost (eXtreme Gradient Boosting), RF (random forest), LR (logistic regression), SVM (support vector machine), and NB (Naïve Bayes). Detailed descriptions of each algorithm can be found in Supplementary Table 2 (additional file 1).
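As an illustration, the preprocessing steps above could be realized with scikit-learn and imbalanced-learn roughly as follows; a hedged sketch assuming pandas DataFrames X_train, X_val, X_test and labels y_train from the split above, with an assumed neighbour count for the imputer.

```python
# Hedged sketch of the preprocessing described above, assuming pandas
# DataFrames X_train, X_val, X_test and labels y_train; not the authors'
# actual code, and the imputer's neighbour count is an assumption.
from sklearn.impute import KNNImputer
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler, TomekLinks
from imblearn.combine import SMOTEENN

# Drop features with more than 50% missing values.
keep = X_train.columns[X_train.isna().mean() <= 0.5]
X_train, X_val, X_test = X_train[keep], X_val[keep], X_test[keep]

# K-nearest-neighbours imputation, fitted within each split separately
# so no information crosses the partitions.
X_train = KNNImputer(n_neighbors=5).fit_transform(X_train)
X_val = KNNImputer(n_neighbors=5).fit_transform(X_val)
X_test = KNNImputer(n_neighbors=5).fit_transform(X_test)

# Candidate resampling strategies for the class imbalance; only the
# training set is ever resampled.
samplers = {
    "smote": SMOTE(random_state=42),
    "ros": RandomOverSampler(random_state=42),
    "rus": RandomUnderSampler(random_state=42),
    "tomek": TomekLinks(),
    "smote_enn": SMOTEENN(random_state=42),
}
X_bal, y_bal = samplers["rus"].fit_resample(X_train, y_train)
```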

We employed ten-fold cross-validation within the training dataset for Bayesian optimization in the hyperparameter tuning process. This process involved partitioning the training dataset into 10 folds, where one fold was designated as the inner validation set and the remaining 9 folds were utilized for inner training purposes. The hyperparameter configuration yielding the highest average performance across all cross-validation folds was selected.
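The text does not name a tuning library, so as one hedged possibility the procedure could be expressed with scikit-optimize’s BayesSearchCV; the library choice, search space, and iteration budget below are illustrative assumptions.

```python
# One hedged realization of the tuning loop, using scikit-optimize's
# BayesSearchCV; the search space and iteration budget are assumptions.
from catboost import CatBoostClassifier
from skopt import BayesSearchCV
from skopt.space import Integer, Real

search = BayesSearchCV(
    CatBoostClassifier(verbose=0, random_state=42),
    search_spaces={
        "depth": Integer(4, 10),
        "learning_rate": Real(0.01, 0.3, prior="log-uniform"),
        "iterations": Integer(100, 1000),
        "l2_leaf_reg": Real(1.0, 10.0),
    },
    n_iter=50,          # Bayesian optimization steps
    cv=10,              # ten-fold cross-validation on the training set
    scoring="roc_auc",  # pick the configuration with the best mean AUC
    random_state=42,
)
search.fit(X_bal, y_bal)
best_model = search.best_estimator_
```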

Model performance evaluation

We used six commonly reported evaluation indices to assess overall model performance: accuracy, area under the receiver operating characteristic curve (AUC), sensitivity, specificity, F1 score, and the Matthews correlation coefficient (MCC). For each metric, we drew 1,000 bootstrap resamples to calculate 95% confidence intervals (CIs); each resample was drawn with a sample size equal to that of our test set to ensure the accuracy and representativeness of the intervals. The metrics were defined as follows:

$$Accuracy=\frac{TP+TN}{TP+TN+FP+FN}$$
(1)
$$Sensitivity=\frac{TP}{TP+FN}$$
(2)
$$Specificity=\frac{TN}{TN+FP}$$
(3)
$$Precision=\frac{TP}{TP+FP}$$
(4)
$$F1\ score=2\cdot\frac{Precision\cdot Sensitivity}{Precision+Sensitivity}$$
(5)
$$MCC=\frac{TP\cdot TN-FP\cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$
(6)

In these definitions, TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively. Sensitivity, also referred to as the true positive rate (TPR = TP / (TP + FN)), is the proportion of intubated patients correctly identified; specificity, also known as the true negative rate (TNR = TN / (TN + FP)), is the fraction of non-intubated patients correctly identified.
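The bootstrap procedure for the 95% CIs could be sketched as follows; metric_fn is a placeholder for any of the six metrics above, and the seed is an assumption.

```python
# Minimal sketch of the 1,000-resample bootstrap for 95% CIs; metric_fn
# is a placeholder for any of the six metrics, and the seed is assumed.
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_ci(y_true, y_score, metric_fn=roc_auc_score,
                 n_boot=1000, alpha=0.05, seed=42):
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    stats = []
    for _ in range(n_boot):
        # Each resample is drawn with replacement at the test-set size.
        idx = rng.integers(0, len(y_true), len(y_true))
        if len(np.unique(y_true[idx])) < 2:
            continue  # skip degenerate resamples containing one class
        stats.append(metric_fn(y_true[idx], y_score[idx]))
    return np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
```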

Model interpretation

To interpret our ML model, we employed the SHAP framework, a game-theory-based approach that provides consistent and accurate calculations of each feature’s contribution to any machine learning model’s prediction. The framework assigns an importance value (SHAP value) to each feature to explain individual predictions and summarize their overall contribution. It also captures feature interactions, including cases where a specific value may either increase or decrease risk (e.g., a heart rate of 130 can be considered protective for a child but detrimental for a geriatric patient). Additionally, SHAP values allow for the interpretation of each observation within the dataset.
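In practice, SHAP values for a fitted tree-based model such as CatBoost can be computed with the shap package roughly as follows; a minimal sketch assuming a fitted model best_model and a test DataFrame X_test with named columns.

```python
# Hedged sketch of the SHAP workflow, assuming a fitted CatBoost model
# best_model and a test DataFrame X_test with named columns.
import shap

explainer = shap.TreeExplainer(best_model)  # exact SHAP values for trees
shap_values = explainer.shap_values(X_test)

# Global view: feature importance and direction of effect (cf. Fig. 4).
shap.summary_plot(shap_values, X_test)

# Local view: force plot for a single patient (cf. Fig. 5).
shap.force_plot(explainer.expected_value, shap_values[0], X_test.iloc[0])
```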

Statistical analysis

Baseline characteristics were compared between the intubation and non-intubation groups. Categorical variables are presented as frequencies or proportions and were compared using either the chi-square test or Fisher’s exact test. Continuous variables are reported as mean ± standard deviation (SD) along with the median, 95% confidence interval (CI), and first and third quartiles (Q1, Q3). Student’s t-test was employed for normally distributed continuous variables, while the Mann-Whitney U test was used for skewed distributions. Statistical significance was defined as a two-sided p-value < 0.05.
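These comparisons could be carried out with scipy along the following lines; a minimal sketch assuming a DataFrame df with illustrative column names “intubated”, “sex”, and “respiratory_rate”, which are not taken from the source.

```python
# Sketch of the baseline comparisons, assuming a DataFrame df with
# illustrative column names "intubated", "sex", and "respiratory_rate".
import pandas as pd
from scipy import stats

# Categorical variable: chi-square test on the contingency table, or
# Fisher's exact test for a 2x2 table with small expected counts.
table = pd.crosstab(df["sex"], df["intubated"])
chi2, p_cat, dof, expected = stats.chi2_contingency(table)
if table.shape == (2, 2) and (expected < 5).any():
    _, p_cat = stats.fisher_exact(table)

# Continuous variable: Student's t-test if approximately normal,
# otherwise the Mann-Whitney U test.
a = df.loc[df["intubated"] == 1, "respiratory_rate"].dropna()
b = df.loc[df["intubated"] == 0, "respiratory_rate"].dropna()
_, p_t = stats.ttest_ind(a, b)
_, p_mwu = stats.mannwhitneyu(a, b, alternative="two-sided")
```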

Results

Demographic characteristics

Among the 76,943 patients included in the MIMIC-IV database, a final cohort of 11,988 adult patients was selected for this study. The schematic representation of the patient screening process is illustrated in Fig. 1.

Fig. 1

The schematic representation of the patient screening process.

Baseline characteristics, including demographics, laboratory values, and clinical parameters, of eligible participants with and without intubation are summarized in Supplementary Table 3 (additional file 1). Age, BMI, and sex did not differ significantly between the groups. Patients in the non-intubated group had a lower respiratory rate and a higher GCS score than intubated patients.

Model development and simplification

A total of 51 variables were collected within the first 24 h after ICU admission; 10 variables with more than 50% missing data were excluded. We developed seven machine learning binary classifiers for predicting intubation risk in ICU patients: CatBoost, LightGBM, XGBoost, RF, LR, SVM, and NB. Among these algorithms, Naïve Bayes exhibited the poorest performance and CatBoost the best. As shown in Table 1, the CatBoost model displayed a relatively superior fit, with an area under the curve (AUC) of 0.881 in the test cohort, compared with the other ML models (AUC: LightGBM: 0.873; XGBoost: 0.871; RF: 0.824; SVM: 0.813; LR: 0.813; NB: 0.768). Additional model evaluation indices are also presented in Table 1. The CatBoost model was therefore selected for further prediction analysis in this study. Figure 2a illustrates the receiver operating characteristic (ROC) curves of these models on the held-out test set. The performance of different sampling methods with CatBoost is presented in Supplementary Table 4 (additional file 1), and the confusion matrix of the CatBoost model is shown in Supplementary Fig. 1 (additional file 1). The final hyperparameter settings of the CatBoost model are shown in Supplementary Table 5 (additional file 1). As shown in Fig. 2b, the CatBoost model demonstrated improved calibration after applying isotonic regression, with a Brier score of 0.043, implying that the predicted probabilities align more closely with the actual outcomes.
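The isotonic calibration step could look roughly like the sketch below, reusing the tuned model and splits from the earlier sketches; this is an assumption about the mechanics, not the authors’ actual code.

```python
# Minimal sketch of the isotonic calibration step, assuming the tuned
# model and splits from earlier sketches; not the authors' actual code.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import brier_score_loss

# Fit the isotonic mapping on the internal validation set ("prefit"
# wraps the already-trained model), then score on the hold-out test set.
calibrated = CalibratedClassifierCV(best_model, method="isotonic", cv="prefit")
calibrated.fit(X_val, y_val)
p_test = calibrated.predict_proba(X_test)[:, 1]
print("Brier score:", brier_score_loss(y_test, p_test))
```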

Table 1 Performance of the machine learning models based on the test set.
Fig. 2

ROC curve and calibration curve of the prediction models. (a) The ROC curve of the predictive models on the test set. (b) The calibration plot of the prediction models.

In most cases, an ensemble model incorporating a larger number of variables yields more precise predictions than a model with fewer variables. Nevertheless, attaining comparable performance with only the most prominent features is more economical and efficient, offering clear advantages for clinical practice. To identify how many variables the machine learning model needs to match the predictive capability observed with all 41 variables, we assessed how the area under the curve (AUC) fluctuated as varying numbers of variables were included in models evaluated on the internal validation set. First, the variables were ranked in descending order of importance; the model was then retrained with an increasing number of variables. As depicted in Fig. 3a, the classification error rate gradually decreases as the number of variables increases, but beyond five features the AUC plateaus: adding further features does not increase the AUC, while removing features below this point leads to a decline. The importance scores of the top five variables are illustrated in Fig. 3b. While the CatBoost model using all available variables achieved an AUC of 0.88, the simplified model comprising only the five selected key features obtained a slightly lower AUC of 0.86 (Fig. 3c), suggesting that an ensemble model restricted to these essential features is both feasible and practical. Consequently, we adopted CatBoost with the top five important features as our final model: Glasgow coma scale, body mass index, arterial partial pressure of oxygen, respiratory rate, and length of stay before ICU admission.
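The simplification procedure could be sketched as follows: rank features by importance, retrain with the top-k features, and track AUC on the internal validation set. Variable names carry over from the earlier sketches and are assumptions.

```python
# Sketch of the simplification procedure: rank features by importance,
# retrain with the top-k, and track AUC on the internal validation set.
import numpy as np
from catboost import CatBoostClassifier
from sklearn.metrics import roc_auc_score

order = np.argsort(best_model.get_feature_importance())[::-1]

aucs = []
for k in range(1, len(order) + 1):
    cols = X_train.columns[order[:k]]
    m = CatBoostClassifier(verbose=0, random_state=42)
    m.fit(X_train[cols], y_train)
    aucs.append(roc_auc_score(y_val, m.predict_proba(X_val[cols])[:, 1]))
# Choose the smallest k whose AUC stays within tolerance of the full model.
```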

Fig. 3

Model performance evaluation. Comparison between the full and simplified CatBoost models was conducted. The full model was developed using all available features, while the simplified model was derived based on key features selected by the recursive feature elimination algorithm. (a) AUC of the CatBoost model with varying numbers of variables was evaluated. (b) Feature importance in CatBoost with 5 variables was analyzed. (c) Receiver operating characteristic curves (ROCs) were compared between the full and simplified models. (d) Validation of our model’s performance in the eICU-CRD database involved different cohorts divided by ICU type. eICU represented all types of ICUs in the eICU-CRD database, Med-Surg referred to medicine-surgery ICUs, MICU indicated medicine ICUs, and SICU denoted surgery ICUs.

External validation and explanation of the simplified model

The model was externally validated on the eICU dataset (baseline characteristics of the external validation set are shown in Supplementary Table 6, additional file 1), yielding an AUC of 0.75 and suggesting robust generalization. Additionally, a detailed subgroup analysis was carried out across ICU types within the eICU dataset. Notably, the model performed best in the medical ICU, achieving an AUC of 0.82 in that setting (Fig. 3d).

Subsequently, the SHAP method was employed to obtain a comprehensive explanation for our cohorts. We computed the feature importance using the SHAP value for the simplified CatBoost model, which exhibited excellent discriminatory ability in the validation cohort and demonstrated suitability for clinical practice. Figure 4 provides an overview of how factors positively or negatively affect the simplified CatBoost model.

Fig. 4

SHAP value impact on model output. This figure illustrates the relationship between feature values and their corresponding SHAP values in the CatBoost model with 5 features. Red points represent relatively higher feature values, blue points relatively lower ones, and grey points missing values. The horizontal axis denotes the SHAP value assigned to patients for each feature. A positive SHAP value signifies an increased risk of intubation, whereas a negative value implies a decreased risk. Features are ranked according to the sum of absolute SHAP values across all samples analyzed. Notably, GCS (Glasgow Coma Scale) emerges as the most influential predictor of intubation in our final model, with a lower GCS score indicating a higher risk of intubation.

Furthermore, we utilized the SHAP method to obtain explanations for individual patients. To investigate feature contributions for specific individuals and explore clinical applications of our final model, we randomly selected two patients from the validation cohort. Fig. 5 visually interprets the simplified CatBoost model’s findings; a detailed discussion of their clinical significance is provided below.

Fig. 5

SHAP explanation force plot for 2 patients from the held-out validation set of the ML model. The SHAP force plot visualizes the Shapley value of each feature as a force that pushes the prediction above (positive value) or below (negative value) its baseline. The SHAP values indicate the predictive relevance of an individual patient’s features and their respective contributions to the intubation prediction. The bold-faced number is the model output f(x), the log odds of intubation for that observation; a positive value indicates that intubation is likely required, whereas a negative value suggests it may not be necessary. Red features (on the left) signify an increased risk, while blue features suggest a decreased risk. Arrow length visualizes the magnitude of each feature’s effect on the prediction, with longer arrows indicating larger effects. LOS before ICU refers to length of stay prior to ICU admission; RR denotes respiratory rate; PaO2 stands for arterial partial pressure of oxygen; GCS refers to Glasgow coma scale; BMI denotes body mass index.

Model application

The simplified intubation prediction model, built with the CatBoost algorithm, was implemented as a Python-based online web application for use by clinicians (https://dmj163-app-intubation-app-udioou.streamlit.app/). Figure 6 illustrates an example of using the tool.
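A minimal Streamlit front end in the spirit of the linked app might look like the following; the model file name, input ranges, and defaults are illustrative assumptions, and the deployed app’s actual code is not reproduced here.

```python
# Minimal Streamlit front end in the spirit of the linked app; the model
# file name, input ranges, and defaults are illustrative assumptions.
import pickle
import streamlit as st

model = pickle.load(open("catboost_intubation.pkl", "rb"))

st.title("ICU intubation risk (24-96 h)")
gcs = st.number_input("Glasgow Coma Scale (4-24 h mean)", 3.0, 15.0, 15.0)
bmi = st.number_input("Body mass index (kg/m2)", 10.0, 80.0, 25.0)
pao2 = st.number_input("PaO2 (mmHg, 4-24 h mean)", 20.0, 500.0, 90.0)
rr = st.number_input("Respiratory rate (breaths/min)", 5.0, 60.0, 18.0)
los = st.number_input("LOS before ICU admission (days)", 0.0, 100.0, 0.0)

if st.button("predict"):
    prob = model.predict_proba([[gcs, bmi, pao2, rr, los]])[0, 1]
    st.write(f"Predicted probability of intubation: {prob:.1%}")
```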

Fig. 6

An example of the web-based tool. The user enters the mean value of each predictor measured between hours 4 and 24 of the patient’s ICU stay and clicks the “predict” button. The model’s estimated probability of invasive mechanical ventilation within 24–96 h after ICU admission is then displayed.

Discussion

Currently, ML algorithms have emerged as a potent tool for predicting the need for intubation in hospitalized patients, particularly in ICU settings45,46. Arvind et al.44 developed a random forest model that achieved an AUC of 0.83 for predicting intubation within 72 h, using a 24-hour sampling window, in COVID-19 patients admitted to hospital. Another study, by Shashikumar et al.39, introduced a highly effective deep learning tool adaptable to both COVID-19 and general hospital populations in two large urban academic health centers; all validation models exhibited AUC values exceeding 0.88. Although these models achieved excellent performance, they required substantially more predictive parameters than standard approaches. An excessive number of parameters in prediction models can be costly to collect and inefficient for frontline physicians, especially when many rely on laboratory tests that may not be readily available on admission. Siu et al.47 also developed a random forest model using the MIMIC database, with an AUC of 0.86. However, clinicians may not trust such a model enough to use it, owing to the “black box” phenomenon mentioned previously. To our knowledge, our model is the first to apply the SHAP method to predicting ICU patients’ need for intubation, interpreting the prediction for each observation and enabling clinicians to understand how each prediction is made. Furthermore, our simplified model attained an AUC of 0.860 using only five easily accessible parameters and exhibited robust performance in an external validation cohort.

Meanwhile, our study also demonstrated that the CatBoost model outperformed logistic regression in accurately predicting intubation among ICU patients. We identified several significant variables associated with the need for intubation in ICU patients. The variable importance analysis revealed that the Glasgow Coma Scale (GCS), body mass index (BMI), arterial partial pressure of oxygen (PaO2), respiratory rate (RR) and length of stay (LOS) before ICU were the most influential input parameters contributing to the predicted risk of intubation in ICU patients.

The GCS score is widely recognized as a standard assessment tool for evaluating impaired consciousness and coma in critically ill trauma and non-trauma patients, aiding predictions of neurological outcomes such as complications, impaired recovery, and mortality. Siu et al.39 further confirmed its association with intubation risk. Consistently, our current study found that the GCS score carried the highest weight within the CatBoost model, indicating its utmost significance as a predictor for determining intubation necessity among ICU admitted patients.

Furthermore, our investigation revealed a close relationship between body mass index and intubation likelihood in ICU patients. This finding aligns with previous studies highlighting BMI’s pivotal role as a predictor within machine learning algorithms for intubation prediction44. Additionally, both PaO2 and RR emerged as important factors influencing intubation prediction accuracy in ICU patients—a consistent observation supported by prior research findings34,39,44,47,48.

Despite the success of various algorithms in this domain, a crucial determinant for physicians to adopt ML model predictions for clinical decision making lies in their ability to comprehend the underlying reasoning behind these predictions. Therefore, it is imperative to elucidate the factors contributing to an ML model’s prediction and enable physicians to understand why a specific prediction aligns with optimal patient care. To the best of our knowledge, our study is the first to apply the SHAP algorithm within the CatBoost model to develop an interpretable model for predicting intubation in ICU patients, achieving exceptional predictive performance. Figure 5 presents the SHAP explanation force plot that illuminates how the prediction model informs clinical decision support. Although SHAP does not completely resolve the black box issue, for clinicians (in other words, non-experts in computer science), using the SHAP algorithm can help enhance their trust in the model and improve their understanding of its predictions, thereby assisting them in making better clinical decisions regarding intubation.

The following are two randomly selected examples from the validation dataset, illustrating how the application of the SHAP method aids ICU physicians in interpreting model predictions. The SHAP baseline value is 0.01.

Patient 1 (Fig. 5a) is a 56-year-old female with a Charlson Comorbidity Index of 6, whose other laboratory indicators fall within normal ranges. The model yields an output f(x) of 1.71 (in log odds), corresponding to a predicted intubation probability of 84.8%. The primary factors contributing to this prediction are her overweight status, low blood oxygen saturation, and low GCS score. A review of the patient’s outcomes confirms that she underwent intubation within 24 to 96 h of ICU admission. This accurately predicted outcome highlights the importance of interpretability in machine learning models, fostering trust among healthcare providers.

Patient 2 (Fig. 5b) is a 77-year-old male with a Charlson Comorbidity Index of 11, indicating a substantial burden of comorbidities. His laboratory results reveal a red blood cell count of 3.07 × 10^12/L, suggesting severe anemia and reduced oxygen-carrying capacity. The model yields an output f(x) of -0.05, corresponding to a predicted intubation probability of 49%. This intermediate probability suggests that clinicians should closely monitor the patient’s condition to ensure timely provision of advanced respiratory support if his respiratory status deteriorates.

Our findings are important for several reasons. First, we developed and externally validated an interpretable ML model that predicts the need for tracheal intubation using commonly accessible clinical variables. The model demonstrated robust external validation performance, with an AUC of up to 0.82 in the MICU subgroup of the eICU-CRD database, supporting the reliability of the study. Such a model has the potential to reduce the adverse outcomes and mortality associated with delayed intubation, while minimizing the risk of related complications by preventing unnecessary intubations. It can also improve the utilization of ICU equipment (e.g., ventilators) and medical human resources, ultimately contributing to a reduction in the overall disease burden. Second, the high predictive value showed that modern machine learning models outperform classical statistical analyses (e.g., logistic regression) by leveraging complex interactions in clinical data. Third, we developed a simplified model based on five selected variables and published an online web tool built on it; by entering the values of the five variables, the user obtains the model’s prediction. While preserving accuracy, the tool is as practical as possible, facilitating clinicians’ use in clinical practice.

However, certain limitations were identified in the present study. First, our retrospective modeling approach prevented us from establishing a causal relationship between features and outcomes, and the occurrence of intubation in the dataset may not accurately reflect patients’ true needs. Retrospective, data-driven studies rely on previously collected data, which may come from electronic health records, public databases, clinical trials, medical billing, patient-reported outcomes, or other health-related information; we generally lacked direct control over the data collection process and must, to some extent, accept the data in its existing state. Second, in developing the model, we assumed that the clinical decision to intubate, as recorded in the data, is the best available proxy for the true need for intubation at the time of assessment. This assumption aligns with the intended application of our model, which is to support clinical decision-making rather than to second-guess decisions made retrospectively. Another concern is the low proportion of positive samples, which may lead to high-risk false positives. Although the proportion of positive samples is relatively low, this accurately reflects real-world scenarios, and we employed random undersampling during model training to balance the class distribution. Nevertheless, the moderate F1 score and Matthews correlation coefficient indicate only a moderate correlation between predicted and actual classifications, suggesting that while the model exhibits some predictive accuracy, it is not as robust as we would ideally desire. Likewise, even the highest AUC achieved by our model leaves room for improvement in reliability and predictive power; the model may not consistently perform at the desired level across all scenarios. Lastly, our data were obtained from publicly available databases and lacked some state-of-the-art parameters, such as the ROX index, as well as sufficient samples describing oxygen support before intubation; we may therefore have overlooked potential variables in model building. We are currently conducting a prospective cohort study in the ICU of University-Town Hospital of Chongqing Medical University and the eICU of Chao-yang Hospital of Capital Medical University to improve the model by incorporating more comprehensive variables related to intubation prediction, allowing us to optimize performance by training on a larger and more diverse patient set.

Conclusions

We developed a CatBoost prediction model that has undergone external validation and demonstrates superior performance in predicting the need for intubation in critically ill patients within 24 to 96 h of ICU admission. Notably, this model relies exclusively on easily obtainable parameters that are routinely captured during patient monitoring. Furthermore, by incorporating the SHAP method, our model enhances interpretability and offers clinicians valuable insights into the underlying factors influencing patient outcomes.