Introduction

Effectively screening patients at risk of clinical deterioration is crucial in the emergency department (ED) to ensure timely intervention and improve patient outcomes1. Limited resources and space, combined with an uncontrolled influx of patients, present significant challenges that require rigorous management. Thus, prioritizing patients needing immediate care, efficiently allocating resources, and implementing appropriate interventions are essential to providing optimal patient care2,3,4.

Early warning systems (EWS) such as the Modified Early Warning Score (MEWS) and National Early Warning Score (NEWS) have been widely used in EDs to identify patients at risk of clinical deterioration5,6. These tools are easy to use and facilitate rapid risk stratification, but their accuracy and generalizability are limited by fixed scoring rules and reliance on a small set of variables. To help overcome these limitations and enhance the screening of high-priority patients, various artificial intelligence (AI)-based clinical decision support systems (CDSS) have been introduced to support clinical decision-making and improve patient triage7,8,9,10,11. These systems assist real-time clinical decision-making primarily by predicting patient outcomes, such as mortality or the probability of intensive care unit (ICU) admission. However, a significant gap remains in AI-based CDSS: the ability to adaptively provide updated results as patient status changes over time and to specify necessary interventions. This is particularly important because the conditions of patients in the ED are dynamic and can change rapidly7,8,10,12,13,14.

The transformer architecture is a relatively recent approach that offers enhanced data-processing capabilities, efficiently managing the large and complex datasets commonly encountered in medical settings15. Transformers are well suited to CDSS in EDs because of their ability to handle irregular time series data, adapt continuously, manage missing data, and offer interpretability16,17,18. Using time embeddings and attention mechanisms, transformers capture important patterns across uneven time intervals, which are common in ED data19,20. Transformers also support continual re-training, allowing a CDSS to stay current with emerging clinical knowledge21. With attention-based weighting, transformers can address missing data effectively by focusing on the most informative inputs22. Furthermore, the model’s attention weights offer interpretability, allowing clinicians to visualize the factors that influence a decision, enhancing transparency and trust23.

This study aims to develop and validate a transformer model-based early warning score (TEWS) that can reflect real-time changes in patient status and provide specific intervention recommendations. Furthermore, we sought to develop a system that integrates the TEWS into the electronic health record (EHR) system to provide clinical decision support and actionable information to healthcare providers.

Methods

Study setting and population

The study was retrospective and observational, using data from ED visits at a tertiary referral hospital in the Republic of Korea with approximately 1,980 inpatient beds and approximately 60,000 ED visits per year. Adult patients (aged 19 years or older) who visited the ED of the study site between 2015 and 2022 were included. We excluded patients who had signed “Do not attempt resuscitation” orders and those whose vital signs were not measured in the ED.

For external validation, the Medical Information Mart for Intensive Care (MIMIC)-IV-ED database was used; patients aged 19 years or older with data between 2011 and 2019 were included24. This study was conducted in accordance with the TRIPOD-AI (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis-Artificial Intelligence) guidelines25.

Prediction outcome

The purpose of the predictive model was to estimate the likelihood of five adverse events (AEs), comprising critical interventions and outcomes, occurring within 24 h: (1) use of vasopressors (norepinephrine, epinephrine, dopamine, or vasopressin infusion), (2) advanced respiratory support (high-flow nasal cannula, noninvasive positive pressure ventilation, or invasive mechanical ventilation with intubation), (3) ICU admission, (4) progression to septic shock (according to the Sepsis-3 definition)26, and (5) in-hospital cardiac arrest.

Data preprocessing

Data were extracted from the hospital’s clinical data warehouse, including demographic characteristics, vital signs, laboratory test results, procedural events, drug administration, and outcomes. Preprocessing proceeded sequentially through the following steps: (1) outlier detection and removal, (2) normalization, (3) resampling and windowing, and (4) data handling. Outliers were identified and excluded based on clinically acceptable ranges for vital signs and reportable ranges for laboratory data (Table S1). All variables were normalized to a 0–1 range using min-max scaling. The normalized time-varying variables were combined into multi-dimensional vectors according to the number of variables, and age and gender were encoded as a two-dimensional static vector.

To reflect temporal changes in the variables, the data were structured as time series composed of 15-minute intervals over 48 h. The 48-hour observation window was implemented as a sliding window aligned with each prediction time point, incorporating data from the most recent 48 h. Most patients stayed in the ED for less than 48 h after arrival; for these patients, only the available data were used as model input. If multiple data points existed within a 15-minute interval, the last data point was used, and missing data points were replaced with ‘0.’ This resulted in an input array of shape (192, N), where N is the number of variables. Labeling was designed to capture the timing of AEs: Label 1 (acute deterioration) was assigned if an AE occurred within 24 h of the last timestamp, and Label 0 (normal) was assigned otherwise. The dataset was split into training, validation, and test sets in a 60%, 20%, and 20% ratio based on patient admission numbers.
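To make the windowing and labeling steps concrete, the following is a minimal sketch under the assumptions stated above (15-minute bins, a 48-hour sliding window, last value per bin, zero-filling for missing values); the function and variable names are illustrative rather than taken from the study code.

```python
# A minimal sketch of the windowing and labeling steps. `obs` is a time-indexed
# DataFrame of already scaled variables and `ae_times` a list of adverse-event
# times; all names are illustrative.
import numpy as np
import pandas as pd

N_STEPS = 192  # 48 h at 15-minute resolution


def build_window(obs: pd.DataFrame, prediction_time: pd.Timestamp) -> np.ndarray:
    """Return the (192, N) model input for one prediction time point."""
    window = obs.loc[prediction_time - pd.Timedelta(hours=48): prediction_time].copy()
    # Keep the last value observed in each 15-minute bin
    window.index = window.index.floor("15min")
    binned = window.groupby(level=0).last()
    # Align to a fixed 192-step grid ending at the prediction time; zero-fill gaps
    grid_index = pd.date_range(end=prediction_time.floor("15min"),
                               periods=N_STEPS, freq="15min")
    return binned.reindex(grid_index).fillna(0.0).to_numpy()


def label_window(prediction_time: pd.Timestamp, ae_times: list) -> int:
    """Label 1 (acute deterioration) if any adverse event occurs within 24 h."""
    horizon = prediction_time + pd.Timedelta(hours=24)
    return int(any(prediction_time < t <= horizon for t in ae_times))
```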

The MIMIC-IV database is accessible online (https://mimic.physionet.org) to researchers who have finalized a data use agreement and completed the “Protection of Human Subjects” training. Researchers who had completed the Collaborative Institutional Training Initiative (CITI) program obtained authorization and accessed the MIMIC-IV database on 27 October 2021, and extracted the data using PostgreSQL version 15.4 (PostgreSQL Global Development Group, Berkeley, CA, USA) and Python version 3.7.5. The data codes used for the study analyses were from the MIMIC Code Repository27.

Model and training

We developed and tested separate transformer models for five adverse outcomes (Fig. 1).

Fig. 1
figure 1

Overview of the Transformer-Based Early Warning Score model.

The input to the TEWS model consists of two components: a multivariate time-series sequence (after preprocessing) and a set of static features. For each patient \(i\), the time-series input is denoted as:

$$X_{i}=\left\{\mathbf{x}_{i}^{(t)}\right\}_{t=1}^{T},\qquad \mathbf{x}_{i}^{(t)}\in\mathbb{R}^{d}$$

where \(\mathbf{x}_{i}^{(t)}\) represents the observed features (e.g., vital signs and laboratory results) at time step \(t\), sampled every 15 min over a 48-hour window, resulting in \(T=192\) steps. In addition, static patient-level information such as age, sex, and mode of arrival is encoded as:

$$\mathbf{s}_{i}\in\mathbb{R}^{k}$$

The final model input is the combination of both modalities:

$$x_{i}=\left(X_{i},\ \mathbf{s}_{i}\right)$$

The time-series component \(X_{i}\) is processed by a 4-layer Transformer encoder with single-head self-attention and a hidden dimension of 512. To preserve temporal ordering, standard sinusoidal positional encoding is added to the input sequence prior to encoding.

The static vector \(\mathbf{s}_{i}\) is passed through a fully connected (FC) layer. The resulting representations are concatenated and passed through a sigmoid activation to produce the predicted probability of an adverse event:

$$\widehat{y}_{i}=\sigma\left(\mathbf{W}\cdot\left[\mathrm{Transformer}\left(X_{i}\right)\,\Vert\,\mathrm{FC}\left(\mathbf{s}_{i}\right)\right]+b\right)$$

where \(\Vert\) denotes vector concatenation, \(\sigma\) is the sigmoid activation function, and \(\mathbf{W}, b\) are the learnable weight matrix and bias term, respectively.
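As a concrete illustration, a minimal Keras sketch of this architecture is given below (4 encoder layers, single-head self-attention, hidden dimension 512, sinusoidal positional encoding). The input projection, the pooling of the encoded sequence, the static-branch width, and the feed-forward width inside each encoder layer are not specified in the text and are assumptions of this sketch.

```python
# Minimal Keras sketch of the TEWS architecture described above.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

T, D_IN, D_STATIC, D_MODEL = 192, 13, 2, 512  # time steps, variables, static features, hidden dim


def sinusoidal_encoding(length: int, d_model: int) -> tf.Tensor:
    """Standard sinusoidal positional encoding, shape (1, length, d_model)."""
    pos = np.arange(length)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    enc = np.where(i % 2 == 0, np.sin(angles), np.cos(angles))
    return tf.constant(enc[None, ...], dtype=tf.float32)


def encoder_layer(x):
    """One transformer encoder layer (single-head self-attention + feed-forward)."""
    attn = layers.MultiHeadAttention(num_heads=1, key_dim=D_MODEL)(x, x)
    x = layers.LayerNormalization()(x + attn)
    ff = layers.Dense(D_MODEL, activation="relu")(x)
    ff = layers.Dense(D_MODEL)(ff)
    return layers.LayerNormalization()(x + ff)


ts_in = layers.Input(shape=(T, D_IN), name="time_series")   # X_i
static_in = layers.Input(shape=(D_STATIC,), name="static")  # s_i

x = layers.Dense(D_MODEL)(ts_in)            # project inputs to the hidden dimension (assumption)
x = x + sinusoidal_encoding(T, D_MODEL)     # add positional information
for _ in range(4):                          # 4-layer transformer encoder
    x = encoder_layer(x)
x = layers.GlobalAveragePooling1D()(x)      # pool the encoded sequence (assumption)

s = layers.Dense(D_MODEL, activation="relu")(static_in)   # FC(s_i)
h = layers.Concatenate()([x, s])                           # [Transformer(X_i) || FC(s_i)]
out = layers.Dense(1, activation="sigmoid")(h)             # sigma(W . h + b)

tews_model = tf.keras.Model([ts_in, static_in], out, name="tews")
```

The functional model returns the predicted probability for a single adverse event, matching the one-model-per-outcome design described above.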

During training, the binary cross-entropy loss function was used. To address the issue of data imbalance, higher weights were assigned to cases of acute deterioration. The loss function was defined as follows:

$$L_{BCE}=-\frac{1}{N}\sum_{i=1}^{N}\left[\omega_{pos}\cdot y_{i}\log\left(\widehat{y}_{i}\right)+\omega_{neg}\cdot\left(1-y_{i}\right)\log\left(1-\widehat{y}_{i}\right)\right]$$
$$\left(\omega_{pos}=5.495,\ \omega_{neg}=0.505\right)$$

where \(N\) represents the number of samples, \(y_{i}\in\{0,1\}\) is the actual label of the \(i\)th sample, \(\widehat{y}_{i}\) is the model’s predicted probability for the \(i\)th sample, \(\omega_{pos}\) is the weight for the positive class (Label 1), and \(\omega_{neg}\) is the weight for the negative class (Label 0). The weights were calculated using scikit-learn’s compute_class_weight function. Adjusting the weights of the loss function enhanced the model’s sensitivity to acute deterioration events.
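A sketch of this weighting step and the corresponding weighted binary cross-entropy follows; the 'balanced' heuristic of compute_class_weight is an assumption (the exact arguments used in the study are not reported), and `y_train` is an illustrative name for the training labels.

```python
# Class weighting with scikit-learn and a weighted BCE loss in TensorFlow.
import numpy as np
import tensorflow as tf
from sklearn.utils.class_weight import compute_class_weight

w_neg, w_pos = compute_class_weight(class_weight="balanced",
                                    classes=np.array([0, 1]), y=y_train)


def weighted_bce(y_true, y_pred):
    """Binary cross-entropy with a higher weight on acute-deterioration cases."""
    y_true = tf.cast(y_true, tf.float32)
    eps = tf.keras.backend.epsilon()
    y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
    loss = -(w_pos * y_true * tf.math.log(y_pred)
             + w_neg * (1.0 - y_true) * tf.math.log(1.0 - y_pred))
    return tf.reduce_mean(loss)
```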

We optimized the hyperparameters using a grid search, training the TEWS model with all possible combinations. Each configuration was evaluated based on the validation loss, and the combination that yielded the lowest loss was selected as the best.

The learning rate was set to 1e-4, with AdamW as the optimizer and a batch size of 1,000. The model was trained for a total of 200 epochs. All code was written in Python v3.7.5, and the model was implemented using TensorFlow v2.5.0. Training was performed on an NVIDIA TITAN V GPU with CUDA version 11.2 and cuDNN version 8.9.1.
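A minimal training sketch with the stated configuration follows; AdamW is taken from the tensorflow_addons package here because TensorFlow 2.5 does not ship it in the core Keras API, and the weight-decay value, metric, and array names are assumptions.

```python
import tensorflow as tf
import tensorflow_addons as tfa  # AdamW is not in the core Keras API of TF 2.5

# `tews_model` and `weighted_bce` refer to the sketches above; the training and
# validation arrays are illustrative placeholders.
tews_model.compile(
    optimizer=tfa.optimizers.AdamW(learning_rate=1e-4, weight_decay=1e-5),
    loss=weighted_bce,
    metrics=[tf.keras.metrics.AUC(name="auroc")],
)
history = tews_model.fit(
    [X_train, S_train], y_train,
    validation_data=([X_val, S_val], y_val),
    batch_size=1000, epochs=200,
)
```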

Model explainability

To ensure that the TEWS model is interpretable and actionable in clinical settings, we incorporated model explainability using gradient-weighted class activation mapping (Grad-CAM)28. Grad-CAM, originally designed for convolutional neural networks, was adapted to analyze feature importance in our transformer model. By visualizing the model’s attention to different features, Grad-CAM allowed us to identify the variables contributing most strongly to the model’s predictions at specific time intervals for individual cases.
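The adaptation of Grad-CAM to the transformer is not described in detail; the following is a hedged sketch of one gradient-based attribution in that spirit, multiplying gradients of the predicted risk by the time-series inputs to obtain a per-variable, per-time-step importance map. The input-level choice and the function name are assumptions of this sketch, not the study's exact implementation.

```python
# Gradient-weighted input attribution for the transformer model.
import numpy as np
import tensorflow as tf


def feature_importance_map(model, x_ts, x_static):
    """Return a (time steps, variables) importance map for one patient window."""
    x = tf.convert_to_tensor(x_ts[None, ...], dtype=tf.float32)
    s = tf.convert_to_tensor(x_static[None, ...], dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        prob = model([x, s])              # predicted probability of the adverse event
    grads = tape.gradient(prob, x)        # d(risk) / d(input)
    cam = tf.nn.relu(grads * x)[0]        # gradient-weighted inputs, negative contributions clipped
    return cam.numpy()


# Top three contributing variables (summed over time), as surfaced in the EHR view:
# importance = feature_importance_map(tews_model, x_window, s_static)
# top3 = np.argsort(importance.sum(axis=0))[::-1][:3]
```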

Through this analysis, we observed that the top contributing features varied across individual patients, reflecting the ability of the model to dynamically adapt to unique clinical presentations. This case-specific explainability was integrated into the EHR system, allowing clinicians to view the most relevant features influencing the predictions of the model for each patient in real time9.

Feature selection

We tested 44 variables and conducted multiple rounds of model testing with feature reduction to select the optimal features for our models (Table S1). The key consideration in the feature selection process was to ensure acceptable prognostic performance29. Features were selected based on their high contribution to model performance and their consistent appearance across models. The final feature selection also considered practical aspects such as data processing efficiency, number of measurements, computational cost, ease of EHR integration, and feasibility of external validation. We tested a full TEWS model with the selected features and a TEWS model using vital signs only.

Model performance measure and validation

We measured and compared model performance using the area under the receiver operating characteristic curve (AUROC) for predicting each outcome. Additionally, we calculated the area under the precision-recall curve (AUPRC), sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) at the optimal cut-off determined by the Youden index. The TEWS model output was displayed as a score ranging from 0 to 1. Each score was categorized into three risk groups (low, intermediate, and high risk) using cut-off values selected to achieve a sensitivity of 90% and a specificity of 95%. These initial thresholds were further refined for each outcome by considering false alarm counts, PPV, and NPV to optimize clinical implementation.
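A sketch of these metrics and the Youden-index cut-off using scikit-learn is shown below; `y_true` (binary labels) and `y_score` (TEWS outputs) are illustrative names, and AUPRC is computed here as average precision.

```python
# Discrimination metrics and Youden-index cut-off with scikit-learn.
import numpy as np
from sklearn.metrics import (average_precision_score, confusion_matrix,
                             roc_auc_score, roc_curve)

auroc = roc_auc_score(y_true, y_score)
auprc = average_precision_score(y_true, y_score)

# Optimal cut-off by the Youden index (maximizes sensitivity + specificity - 1)
fpr, tpr, thresholds = roc_curve(y_true, y_score)
youden_cutoff = thresholds[np.argmax(tpr - fpr)]

y_pred = (y_score >= youden_cutoff).astype(int)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)
npv = tn / (tn + fn)
```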

We compared the performance of our TEWS model with the MEWS30, calculated at the same time intervals. Setting the MEWS threshold at ≥ 5 points as the reference, we evaluated the performance of TEWS at a comparable sensitivity level and calculated the number of false alarms per 1,000 patients31.

For external validation, we evaluated the predictive performance of the TEWS models for each outcome using the MIMIC-IV-ED database. The cardiac arrest prediction model was not evaluated as cardiac arrest occurrence data are not available in the MIMIC-IV-ED database.

To improve model performance during external validation, we used transfer learning, fine-tuning the model with 1% (approximately 1,600 patients) and 5% of the MIMIC-IV-ED database32. This enabled the model to adapt to the characteristics of the external dataset. Performance evaluation was conducted using an independent 20% subset of the MIMIC-IV-ED database. In addition, we tested the performance of alternative models, including logistic regression and XGBoost models, using the final variables.
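A minimal sketch of this fine-tuning step, continuing training of the pretrained model on a small external subset, is shown below; the reduced learning rate, epoch count, batch size, and array names are assumptions rather than reported settings.

```python
import tensorflow as tf

# `tews_model` holds the pretrained weights; X_mimic_sub / S_mimic_sub / y_mimic_sub
# denote the 1% (or 5%) MIMIC-IV-ED fine-tuning subset (illustrative names).
tews_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
                   loss=weighted_bce)
tews_model.fit(
    [X_mimic_sub, S_mimic_sub], y_mimic_sub,
    validation_data=([X_mimic_val, S_mimic_val], y_mimic_val),
    batch_size=256, epochs=20,
)
```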

Statistical analysis

Categorical variables are reported as frequencies and percentages, and continuous variables as means (standard deviations, SD). Differences in continuous variables between groups were assessed using Student’s t-test, and differences in categorical variables using chi-square tests. Confidence intervals were calculated using bootstrapping. To assess the statistical significance of differences in diagnostic performance, including AUROC, between TEWS and MEWS, we performed a bootstrap-based t-test with 1,000 resamples. A two-tailed p value < 0.05 was considered statistically significant. All analyses were performed using R version 3.6.3 (R Foundation for Statistical Computing, Vienna, Austria) and Python version 3.7.5.
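A sketch of the bootstrap comparison of AUROC between TEWS and MEWS follows; the paired resampling scheme and the one-sample t-test on the bootstrap differences are assumptions about how the bootstrap-based t-test was implemented, and the score arrays are illustrative.

```python
# Bootstrap comparison of AUROC between TEWS and MEWS with 1,000 resamples.
import numpy as np
from scipy import stats
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
diffs = []
for _ in range(1000):
    idx = rng.integers(0, len(y_true), len(y_true))   # resample patients with replacement
    if len(np.unique(y_true[idx])) < 2:
        continue                                      # AUROC needs both classes present
    diffs.append(roc_auc_score(y_true[idx], tews_score[idx])
                 - roc_auc_score(y_true[idx], mews_score[idx]))
diffs = np.array(diffs)

ci_lower, ci_upper = np.percentile(diffs, [2.5, 97.5])   # bootstrap 95% CI for the AUROC difference
p_value = stats.ttest_1samp(diffs, 0.0).pvalue           # test whether the mean difference is zero
```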

Results

Demographics

A total of 414,748 subjects were analyzed, among whom 15,486 (3.7%) experienced AEs (Fig. 2). Baseline characteristics are shown in Table 1. Vital signs and laboratory variables were significantly worse in the group that experienced AEs than in the group without AEs. AEs included vasopressor use (n = 6,304, 40.7%), respiratory support (n = 5,017, 32.4%), ICU admission (n = 8,492, 54.9%), septic shock (n = 4,124, 26.6%), and cardiac arrest (n = 548, 3.5%). For external validation, 410,880 patients (27,838 with AEs, 6.7%) were analyzed from the MIMIC-IV-ED database (Figure S1 and Table S2).

Fig. 2
figure 2

Study population.

Table 1 Basic characteristics of the study population.

Model development and performance

The full TEWS model incorporated 13 key variables selected through iterative feature reduction from the initial 44 variables (Fig. S2). The model included vital signs (systolic blood pressure, diastolic blood pressure, heart rate, respiratory rate, body temperature, and peripheral oxygen saturation) and laboratory values (hemoglobin, blood urea nitrogen, sodium, potassium, lactate, arterial pH, and bicarbonate). The vital sign-only TEWS included the six vital signs mentioned above.

The TEWS (full model) demonstrated superior prognostic performance compared to MEWS across all adverse outcomes (AUROC: vasopressor use 0.934, 95% CI 0.932–0.936; respiratory support 0.909, 95% CI 0.905–0.912; ICU admission 0.855, 95% CI 0.853–0.856; septic shock 0.936, 95% CI 0.933–0.938; and cardiac arrest 0.833, 95% CI 0.820–0.848) (Table 2). Similarly, the vital sign-only TEWS model also exhibited better predictive performance than MEWS. Further results on the performance metrics and the other predictive models (transformer models with 23 and 44 variables, logistic regression, XGBoost) are shown in Tables S3, S4, and S5.

Table 2 Comparative prognostic performance of TEWS and MEWS models in predicting adverse outcomes.

When comparing TEWS and MEWS at similar sensitivity thresholds (MEWS ≥ 5), TEWS demonstrated superior performance with significantly higher specificity (99.0–99.5% vs. 96.0–96.8%), PPV (25.3–56.6% vs. 9.5–18.6%), and NPV (99.4–100.0% vs. 98.4–99.5%) across outcomes (Table 3). Additionally, TEWS showed substantially lower false alarm counts per 1,000 patients compared to MEWS.

Table 3 Comparison of accuracy and false alarm counts per 1,000 patients between TEWS and MEWS with similar sensitivity.

External validation

In external validation using MIMIC-IV-ED data, TEWS demonstrated superior performance compared to MEWS across all outcomes (Table 4). Initially, the vital sign-only TEWS showed better performance than the full model, with AUROC values ranging from 0.815 to 0.905 compared to 0.759–0.872 for the full model. After applying transfer learning, both models improved significantly. With transfer learning on 1% of the data, the performance of the full model improved (AUROC 0.851–0.903), and with 5% of the data it improved further (AUROC 0.863–0.901). The full model generally outperformed the vital sign-only model after transfer learning.

Table 4 Area under the receiver operating characteristic of MIMIC-IV-ED data validation for predicting the adverse outcomes.

Model integration for EHR systems

The TEWS system has been successfully integrated into the EHR system, providing real-time risk assessment for ED patients (Fig. 3). The interface displays a patient list with TEWS results that are continuously updated. When clinicians click on the TEWS alarm icon, they can access detailed information including specific high-risk outcomes and the top three contributing features identified by the AI model. When clinicians click on a patient from the ED patient list view, the right panel displays the most recent TEWS information alongside the vital signs and nursing records.

Fig. 3
figure 3

Electronic health record system integration view. This is a modified figure based on the original electronic health record system screen.

Discussion

We developed and validated a novel early warning system for predicting adverse outcomes in ED patients, using transformer models to process time series information, including vital signs and laboratory results, from a patient’s initial visit to discharge. The TEWS system demonstrated superior prognostic performance compared to the MEWS. The TEWS includes multiple models that predict diverse outcomes, providing comprehensive information about patient deterioration from various perspectives. The TEWS predicts both procedural needs (e.g., respiratory support and vasopressor use) and patient status (e.g., cardiac arrest, ICU admission, and septic shock). This may allow TEWS to provide predictions not only about patient conditions, but also about the procedures that may be required, ultimately delivering targeted information to improve patient outcomes13.

While implementing AI-based systems in healthcare is complex and faces barriers such as alert fatigue and workflow integration, real-time early warning systems like TEWS can support physicians and nurses by helping prioritize patients and enabling earlier identification of those at risk for deterioration. Rather than replacing clinical judgment, TEWS may enhance situational awareness by continuously analyzing patient data and providing interpretable, outcome-specific risk predictions at the point of care. At our institution, TEWS has been incorporated into quality improvement initiatives to reduce time to blood pressure stabilization in critically ill patients and to expedite antibiotic administration in septic shock, demonstrating its potential to facilitate timely interventions. Nonetheless, sustained interdisciplinary collaboration and ongoing refinement are essential to ensure clinical value and successful adoption.

To achieve practical applicability, it is crucial to demonstrate the effectiveness of the model across datasets. AI-based models often perform well in their development environment, but performance may decline when they are applied externally33. TEWS showed acceptable predictability for most outcomes in the MIMIC-IV-ED dataset, which differs significantly from the original study site. However, we observed initial performance variations, particularly in the full TEWS model, with some outcomes showing decreased AUROC. This decline in performance is often observed in external validation and can be attributed to various factors, including potential overfitting of the initial model, differences in variable distributions and measurement frequencies between institutions, variations in clinical practice patterns, and differences in patient populations34. Notably, the vital sign-only TEWS, with its limited set of variables, demonstrated more robust performance in external validation, suggesting that models with fewer, standardized variables may be more generalizable across healthcare settings. Furthermore, transfer learning with just 1–5% of external data significantly improved the performance of the full model, indicating that this approach could address institutional differences while requiring minimal additional data for model adaptation.

While the transformer model did not demonstrate overwhelmingly superior performance compared to XGBoost, we selected the transformer approach for its greater extensibility and practical advantages in clinical deployment. Transformer models are specifically designed to process sequential time-series data and support transfer learning, as shown by notable improvements in external validation after fine-tuning with only 1–5% of external data, a capability not feasible with tree-based models like XGBoost. Furthermore, transformer architectures allow for future integration of multimodal clinical data, such as imaging and clinical narratives, enhancing adaptability and interpretability for dynamic, real-time risk prediction in the ED. Thus, despite only modest gains in AUROC, we believe the transformer model provides a more robust and flexible platform for ongoing clinical application. The strength of the transformer model lies in its ability to accurately capture the state of variables over time, improving the accuracy of acute deterioration prediction by effectively learning temporal dependencies17,20,35. In the ED, the patient condition can change within a relatively short time frame; the TEWS can capture these characteristics and provide timely predictions.

We observed a relatively lower performance in predicting ICU admission and cardiac arrest compared to other outcomes. The lower predictive accuracy for ICU admission probably reflects the complex nature of ICU admission decisions in different healthcare settings. Factors beyond clinical severity, such as ED crowding and low ICU capacity, may influence admission patterns36,37. For example, a previous study has shown that even patients with septic shock were managed in the EDs of Korea without ICU admission38. Additionally, there might be differences in ICU admission criteria due to the varying characteristics among centers. This should be considered when applying TEWS externally or when developing similar models39,40. For cardiac arrest prediction, the extremely low incidence of in-hospital cardiac arrest in the ED presented a significant challenge for model training, and some cardiac arrests could not be predicted by vital signs and laboratory tests.

For real-world application, TEWS aims to extract essential input variables from the numerous data points available in the ED. As explained earlier, this study conducted several rounds of sensitivity analysis and feature selection to identify the most critical variables for predicting outcomes. Through this process, vital signs and laboratory variables were selected, resulting in a practical model that can be implemented in a real ED setting, reducing computing time and effort during operation.

This study demonstrates that development is only the beginning; the ultimate goal is to integrate the developed model into clinical workflows to deliver real-time risk assessment and actionable information to healthcare providers. Through these efforts, TEWS has been integrated into the EHR system at the study site to allow monitoring and user feedback.

Further studies are needed to continue the development and enhancement of TEWS, as well as to evaluate its practical utility in clinical settings.

Limitations

There are some limitations to our study. First, its retrospective nature may have introduced selection bias, potentially affecting the generalizability of our findings. In addition, there may be unmeasured variables or outcomes that could affect the performance or generalizability of the model. Second, the incidence of the five outcomes predicted by TEWS was imbalanced. As a result, all predicted outcomes had a low AUPRC and low PPV. In clinical practice, adverse outcomes are rare, so most patients have a low likelihood of experiencing them. Given that this study was based on real-world data, this imbalance is a potential limitation; similar challenges have been reported in other machine learning-based prediction studies in healthcare13,41,42. Third, although we used the MIMIC-IV-ED database for external validation, this dataset may not fully represent the diversity of EDs globally, and the performance of TEWS may vary by healthcare system and patient population. Fourth, the performance improvement after transfer learning was only moderate. We believe these moderate gains reflect a strong baseline performance, and even a 2–3% improvement in AUROC can be meaningful given the limited amount of fine-tuning data used. Fifth, we used separate models for each adverse event, which may limit the benefits of shared learning across related outcomes. A multi-task model that predicts multiple outcomes simultaneously could improve performance, especially for rare events, and will be considered in future work. Finally, the TEWS model may be challenging to implement in resource-limited settings without comprehensive EHR systems. Our model relies on time series data, which may not be consistently available or accurately recorded in all clinical settings, potentially limiting its applicability. These limitations are common challenges in developing and implementing AI-based CDSS in healthcare. Future research should focus on addressing them through prospective studies, more diverse external validation, and continued refinement of the model to balance complexity with clinical applicability.

Conclusion

This study developed and validated the TEWS system for predicting multiple adverse outcomes in ED patients. The TEWS models demonstrated superior prognostic performance compared to the MEWS across various outcomes in internal and external validation. The successful integration of the AI solution into the EHR system demonstrates its potential as a clinical decision support tool for providing real-time risk assessment and actionable information.