Introduction

Over the past decade, visits to emergency departments (ED) have steadily increased globally, resulting in overcrowding that significantly reduces the quality of patient care and satisfaction1,2,3,4,5. ED overcrowding is a critical global health issue, driven by factors such as increased patient volume, acuity, and insufficient inpatient bed capacity, leading to prolonged ED length of stay6. This phenomenon, amplified during public health crises like the COVID-19 pandemic7, degrades the quality of care, delays treatment for time-sensitive conditions, and is associated with increased patient mortality. In our resource-limited setting, these challenges are magnified, making accurate and timely triage not just a matter of efficiency, but a critical determinant of patient outcomes.

Triage, the initial process of identifying life-threatening conditions and prioritizing care, is vital for efficient resource allocation in the ED. Standardized triage systems, such as the Emergency Severity Index (ESI)8 and the Canadian Triage and Acuity Scale (CTAS)9, exhibit variable accuracy, ranging from 59.2% to 82.9%. These systems rely heavily on clinical judgment, which can lead to significant variability and suboptimal outcomes10,11. Inaccurate triage contributes to ED overcrowding, delays in care, and increased mortality risks12,13,14. Over-triage strains resources, while under-triage impedes timely critical care2. Furthermore, nurse-based triage is influenced by cognitive biases and sociodemographic factors, raising concerns about fairness15,16,17,18,19.

To address these limitations, data-driven predictive models have been developed to improve triage accuracy. These models incorporate predictors readily available at triage, including vital signs, coded chief complaint, and patient history20,21,22,23,24,25,26. Electronic health records (EHRs) are rich sources of data, with unstructured clinical notes representing a significant portion of patient information27,28. Natural language processing (NLP) allows machine learning models to leverage this unstructured data, such as free-text chief complaints, by transforming them into various numerical features21,22,23,26,29. However, few studies have adopted this method, representing a considerable gap in research.

This study aims to develop and internally validate a machine learning model that integrates structured data with free-text chief complaints to predict the need for critical care. Our goal is to create a clinically applicable model that can reduce variability, improve predictive accuracy, and support real-time decision-making in the ED.

Methods

Study design and participants

This retrospective cohort study utilized data from the EHRs of Maharaj Nakhon Chiang Mai Hospital, a tertiary university hospital in northern Thailand with approximately 60,000 annual ED visits. Data were collected from January 1, 2018 to December 31, 2022. This study received approval from the Research Ethics Committee, Faculty of Medicine, Chiang Mai University—Panel 5 (Institutional Review Board), including a waiver of informed consent (Research ID: 0068/Study code: EME-2566-0068). All methods were performed in accordance with relevant guidelines and regulations, including the Declaration of Helsinki and institutional ethical standards. All patient identifiers were removed before analysis. This study was conducted and reported in accordance with the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis with Artificial Intelligence (TRIPOD + AI) guideline. A completed TRIPOD + AI checklist is provided in the supplementary materials.

We included consecutive adult patient visits (≥ 18 years). We focused on adult patients because pediatric physiology and triage criteria differ considerably and would require a dedicated model. A qualified emergency nurse manually assigned a CTAS triage level for each patient based on clinical guidelines9. We excluded visits that were duplicated, had missing triage labels, were dead on arrival, left without being seen, were transferred to another hospital, or had missing ED disposition data.

Data and outcomes

Patient demographic and clinical data available at the time of triage were included. Demographic variables included patient age and sex. Arrival characteristics included the mode of arrival (walk-in, emergency medical services, or referral) and case type (trauma or non-trauma). Clinical data included vital signs (heart rate, respiratory rate, blood pressure, oxygen saturation, and temperature), level of consciousness measured by the Glasgow Coma Scale (GCS) and the chief complaint. The chief complaint was documented by the triage nurse as unstructured free text in the local language (Thai), often with English medical abbreviations. An analysis of the most frequent chief complaints is provided in Supplementary Table S1.

The primary outcome was intensive care unit (ICU) admission directly from the ED. The prediction horizon is the point of ED disposition; the model is designed to predict this outcome using only data available at the time of triage to assist in early decision-making.

Model development

Sample size calculations were conducted using the pmsampsize module30. Based on a prior research study’s ICU admission prevalence of 0.8%21, a c-statistic of 0.85, a shrinkage factor of 0.9, and 15 predictive parameters, the required sample size was determined to be 15,713, resulting in 4.19 events per predictor variable. Our final dataset of 163,452 visits far exceeded this minimum requirement. The included visits were randomly split, stratified by outcome, into a training set (80%) and a test set (20%) to preserve a balanced distribution of outcomes.
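The stratified 80/20 split described above can be sketched in Python as follows; the labels here are randomly generated to mimic an imbalanced outcome and are not the study cohort:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels mirroring an imbalanced outcome (~8.2% positives);
# illustrative only, not the study's data.
rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.082).astype(int)
X = rng.normal(size=(10_000, 5))

# Stratifying on the outcome preserves its prevalence in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(round(y_train.mean(), 3), round(y_test.mean(), 3))
```

Without `stratify=y`, a rare outcome can end up over- or under-represented in the test set, which would distort AUPRC estimates in particular.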

Data preparation began with addressing missing values. The number and percentage of missing values for each predictor are reported in Table 1. We used a dual strategy for imputation: continuous numerical predictors were imputed using the K-nearest neighbors (KNN) algorithm with k = 5, while categorical predictors were imputed using the most frequent value. To assess the robustness of this imputation, a sensitivity analysis was performed by training the final XGBoost model on a complete-case dataset (n = 123,541). The minimal deviation in performance (AUROC 0.915 vs. 0.917) suggested that our imputation method did not introduce significant bias. A comparison of data distributions before and after imputation for continuous variables is provided in Supplementary Fig. S1, confirming their similarity.
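The dual imputation strategy can be illustrated with scikit-learn; the toy table and column names below are hypothetical, not the study's actual schema, and k is reduced to 2 only because of the tiny example:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical triage records with missing vitals and mode of arrival.
df = pd.DataFrame({
    "heart_rate": [88, 110, np.nan, 72, 95],
    "resp_rate":  [18, np.nan, 22, 16, 20],
    "arrival":    ["walk-in", "EMS", np.nan, "walk-in", "walk-in"],
})

# Continuous predictors: K-nearest neighbors imputation (the study used k = 5).
knn = KNNImputer(n_neighbors=2)
df[["heart_rate", "resp_rate"]] = knn.fit_transform(df[["heart_rate", "resp_rate"]])

# Categorical predictors: most-frequent-value imputation.
mode = SimpleImputer(strategy="most_frequent")
df[["arrival"]] = mode.fit_transform(df[["arrival"]])

print(df.isna().sum().sum())  # prints 0
```

KNN imputation uses similar complete cases to fill continuous gaps, while the mode is a simple, stable choice for categorical gaps; the sensitivity analysis above suggests this combination introduced little bias.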

For feature engineering, we processed both unstructured and structured data. Unstructured free-text chief complaints were converted into 512-dimension semantic vector representations using the pre-trained Multilingual Universal Sentence Encoder. To manage the high dimensionality of these embeddings, Principal Component Analysis (PCA) was used to reduce them to 50 principal components, which retained over 95% of the variance. Other structured input features were handled using one-hot encoding for categorical variables and standardization for continuous variables.
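The dimensionality-reduction step can be sketched as below. Because the pre-trained Multilingual Universal Sentence Encoder is a heavyweight external model, random 512-dimension vectors stand in for its output here; with random isotropic vectors 50 components will not retain 95% of the variance, as they did with the study's correlated embeddings:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Stand-in for encoder output: in the study, each free-text chief complaint is
# mapped to a 512-dimension embedding by the Multilingual Universal Sentence
# Encoder. Random vectors are used here only to demonstrate the PCA step.
embeddings = rng.normal(size=(1_000, 512))

# Reduce the 512-dimension embeddings to 50 principal components.
pca = PCA(n_components=50, random_state=0)
components = pca.fit_transform(embeddings)

print(components.shape)  # (1000, 50)
```

The 50 components then enter the model alongside the one-hot-encoded categorical and standardized continuous features.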

We developed three machine learning models of increasing complexity: logistic regression with a lasso penalty, random forest, and XGBoost (eXtreme Gradient Boosting). These models were compared to a reference model based on the CTAS triage level. We tuned hyperparameters for each model using a random search with a 5-fold cross-validation strategy on the training set. The final hyperparameters for each model are listed in Supplementary Table S2.

Model evaluation

Model performance was evaluated in the test set using the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC). While AUROC assesses discriminative performance, AUPRC offers a better measure for imbalanced datasets by emphasizing positive predictive value31. Performance metrics were reported as the mean and 95% confidence intervals generated from 1,000 bootstrapped samples. We used 500 bootstrapped samples from the test set for the prediction instability plot, mean absolute prediction error, and calibration plot to confirm the stability and reliability of predictions across datasets32,33. SHapley Additive exPlanations (SHAP) were used to interpret the final XGBoost model and identify the most important predictive features. Analyses were performed using Python version 3.10.15.
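The bootstrap procedure for the confidence intervals can be sketched as below, using synthetic predictions rather than the study's test set:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Synthetic test-set labels and predicted probabilities; illustrative only.
rng = np.random.default_rng(0)
y_true = (rng.random(5_000) < 0.08).astype(int)
y_prob = np.clip(0.08 + 0.5 * y_true + rng.normal(scale=0.2, size=5_000), 0, 1)

def bootstrap_ci(metric, y_true, y_prob, n_boot=1_000, seed=0):
    """Mean and percentile 95% CI of a metric over bootstrapped resamples."""
    rng = np.random.default_rng(seed)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if y_true[idx].sum() == 0:  # skip resamples with no positive cases
            continue
        stats.append(metric(y_true[idx], y_prob[idx]))
    return np.mean(stats), np.percentile(stats, [2.5, 97.5])

auroc_mean, auroc_ci = bootstrap_ci(roc_auc_score, y_true, y_prob)
auprc_mean, auprc_ci = bootstrap_ci(average_precision_score, y_true, y_prob)
print(f"AUROC {auroc_mean:.3f} [{auroc_ci[0]:.3f} to {auroc_ci[1]:.3f}]")
```

Resampling the test set with replacement and recomputing each metric yields an empirical distribution whose 2.5th and 97.5th percentiles form the reported interval.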

Results

Participants

During the study period, 172,791 patient visits were reviewed. After applying the exclusion criteria, the final cohort consisted of 163,452 visits (Fig. 1). Table 1 describes the characteristics of patients in this study. Overall, 13,406 visits (8.2%) resulted in ICU admission. A total of 2,016 visits (1.2%) were triaged as CTAS level 3–5 yet were eventually admitted to the ICU, representing potential under-triage cases.

Fig. 1
figure 1

Inclusion and exclusion flowchart.

Table 1 Demographic data of patients across different triage levels.

Model performance

The model performance metrics are shown in Table 2; Fig. 2. The XGBoost model demonstrated the highest discrimination (AUROC: 0.917 [95% CI 0.911–0.922]) and precision-recall (AUPRC: 0.629 [95% CI 0.608–0.649]). Both the random forest and XGBoost models achieved higher AUROC and AUPRC values than the logistic regression and the baseline CTAS system.

Table 2 Comparison of predictive performance between CTAS triage and predicting models.
Fig. 2
figure 2

AUROC and AUPRC comparison between CTAS triage and the models.

SHAP analysis of the final XGBoost model revealed that the top predictors contributing to ICU admission were mode of arrival, age, vital signs, and chief complaint (Figs. 3 and 4). The waterfall plot in Fig. 4 shows how individual features influence the likelihood of ICU admission for a single sampled patient. The model’s average output, or base value (E[f(x)]), adjusts incrementally based on the contributions of these features. The cumulative contributions from all features lead to a final log-odds of 2.812, which corresponds to a predicted probability of approximately 94.3% of ICU admission for this patient.
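The conversion from the waterfall plot's final log-odds to the quoted probability is the standard logistic (sigmoid) transformation, shown here with the value reported for the sampled patient:

```python
import math

# Convert the SHAP waterfall's final log-odds to a predicted probability.
log_odds = 2.812
probability = 1 / (1 + math.exp(-log_odds))
print(f"{probability:.1%}")  # prints 94.3%
```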

Fig. 3
figure 3

SHAP feature importance summary plot. y-axis: predictors, x-axis: mean of absolute SHAP value.

Fig. 4
figure 4

Waterfall plot illustrating how individual features affect the prediction for a single patient. y-axis: predictors, x-axis: log-odds of ICU admission.

Discussion

This study demonstrates the potential of machine learning models to predict ICU admission from the ED more accurately than the conventional CTAS triage system. Our top-performing model, XGBoost, showed superior discrimination, a finding consistent with other studies that highlight the ability of tree-based models to capture complex, non-linear interactions among predictors20,23. The improvement in AUROC (0.917 vs. 0.882) is noteworthy, but the substantial improvement in AUPRC (0.629 vs. 0.333) is particularly compelling. AUPRC is more informative than AUROC in settings with class imbalance, such as ICU prediction, and this large gain suggests our model is markedly better at ensuring that patients flagged as high-risk are truly likely to require ICU admission, thereby improving the positive predictive value of the triage assessment. The step-like pattern observed in the CTAS ROC curve (Fig. 2) reflects its nature as a categorical scale with five discrete levels, which inherently limits its ability to provide nuanced, continuous risk stratification compared to the machine learning models.

Previous studies have shown that machine learning can improve triage outcomes using both structured and unstructured data, primarily in developed countries20,21,23,29,34. This study builds on those findings and demonstrates that such improvements are also achievable in resource-limited settings. One difference in our approach is the exclusion of pain scores, which are subjective and unreliable for determining patient acuity35. Furthermore, the integration of NLP to analyze free-text chief complaints enabled the model to interpret textual data. Conventional triage systems, including ESI and CTAS, rely on chief complaints for categorization. The NLP approach can capture subtle variations in clinical presentations, allowing for a broader categorization of chief complaints. Additionally, the use of multilingual embeddings effectively manages the linguistic diversity of clinical documentation in the local context, allowing the model to interpret text written in Thai with occasional English medical terms. However, reliance on free-text chief complaints introduces variability that could affect model prediction reliability.

Our findings also align with other advanced triage systems. For instance, the TriAge-Go system, a sophisticated software as a medical device (SaMD), also showed improved prediction over standard triage in a recent prospective evaluation36. While TriAge-Go represents a highly advanced implementation, our model demonstrates that significant improvements can also be achieved in resource-limited settings using readily available data.

This study shows that data-driven tools can make ED decisions more effective. By providing a real-time risk score, the model can flag high-risk patients for triage nurses, helping to mitigate under-triage and focus attention where it is most needed. It can also aid in resource management by providing more accurate forecasts for ICU bed demand.

Limitations

Several limitations should be acknowledged. First, a significant limitation is our definition of the ground truth. The primary outcome of direct ICU admission from the ED does not account for disposition errors, such as unplanned ICU transfers (UIT) from a general ward within 24 h of admission. These cases often represent patients who were under-triaged, and their exclusion may mean our model was trained on more clearly identifiable cases of critical illness, potentially overestimating its performance. Future prospective studies should incorporate UIT to create a more robust and clinically accurate composite outcome.

Second, this study is a retrospective, single-center analysis conducted on a predominantly Asian population, which inherently limits its generalizability to other settings and demographic groups. A critical challenge for all predictive models is performance degradation upon real-world, prospective implementation. Studies on models like the Rothman Index and the EPIC Sepsis Model have shown that even models validated on large retrospective datasets can experience a significant drop in performance when deployed, often due to data drift or overfitting37,38. Therefore, our model must be considered an early-stage development, and its clinical utility can only be confirmed through rigorous external and prospective validation.

Third, the reliance on free-text chief complaints, while powerful, introduces variability. The quality and detail of documentation can differ between nurses, which could affect model reliability. Future work should prioritize improving the quality of free-text inputs and exploring standardization using systems like SNOMED CT to enhance data consistency39.

Fourth, the outcome was limited to ICU admissions. Certain conditions, such as anaphylaxis or reactive airway disease, require immediate attention but may not result in ICU admission. In contrast, conditions associated with high mortality, such as unconsciousness, may lead to death in the ED rather than admission to the ICU. Outcomes such as emergency procedures, early mortality, or ED resource utilization could provide a more comprehensive evaluation of patient acuity.

Finally, the absence of detailed patient history as a predictor may have constrained the model’s performance. Incorporating prior medical information could significantly enhance prediction accuracy and help address potential biases.

Conclusion

This study demonstrates that a machine learning model leveraging structured and unstructured EHR data can effectively predict the need for ICU admission with strong performance. The incorporation of free-text chief complaints and multilingual embeddings significantly enhanced prediction accuracy. While further validation is required, this work highlights a promising pathway toward more accurate, efficient, and equitable triage in the emergency department.