Abstract
Emergency department (ED) crowding necessitates rapid, accurate risk stratification to optimize care and resource allocation. Traditional clinical prediction tools like NEWS, APACHE II, and SOFA score have limited generalizability or rely on extensive inputs, including vitals. We developed RISKINDEX, a machine-learning tool predicting 31-day mortality using routine laboratory values, age, and sex. To evaluate the clinical impact of a machine learning–based risk score in routine care, we conducted an investigator-initiated, open-label, randomized, non-inferiority trial (MARS-ED) at the Maastricht University Medical Center+ ED. Adult patients ( ≥ 18 years) assessed by an internal medicine specialist and with ≥4 laboratory tests were eligible. Patients were randomized 1:1 using computer-generated permuted blocks to standard care (n = 659) or standard care plus access to the RISKINDEX (n = 644). No blinding was possible because physicians needed to view the RISKINDEX. The primary outcomes for this study were the prognostic accuracy for 31-day mortality and clinical impact of the RISKINDEX. In total, 1303 participants were analyzed. RISKINDEX’s prognostic accuracy matched or outperformed clinical intuition (AUROC 0.84 vs. 0.73–0.76) and was statistically higher than NEWS, APACHE II, and SOFA prediction tools (AUROC 0.65–0.75). RISKINDEX predictions aligned with clinicians’ expectations in only about half of cases, with highest discordance among less experienced physicians. Despite its prognostic accuracy, the RISKINDEX did not alter treatment plans (1/644 changes; 0.16%) or clinical outcomes, and clinicians perceived low added value. No adverse events related to the intervention occurred, and recruitment was completed as planned. These findings show that prognostic accuracy alone is insufficient to achieve clinical impact in the ED and that user-centered, actionable model design is needed to ensure relevance, trust, and responsiveness. 
ClinicalTrials.gov registration: NCT05497830.
Introduction
Emergency department (ED) visits are increasing globally. Crowding in the ED, the associated prolonged waiting times, and the demographic shift towards older age increase the complexity of decision-making and the risk of adverse outcomes1,2,3,4,5. Therefore, rapid risk stratification is key to improving patient outcomes and optimizing resource allocation.
Risk stratification in the ED relies on a combination of clinical intuition and objective measures such as vital signs and laboratory results. To standardize this process and enhance decision-making amidst multitasking in the ED, various clinical prediction tools have been developed; commonly used examples are the National Early Warning Score (NEWS), the Acute Physiology and Chronic Health Evaluation II (APACHE II), and the Sepsis-related Organ Failure Assessment (SOFA) score6,7,8,9,10,11,12. These scores complement clinical intuition in ED care. Nevertheless, they have important limitations. They are often tailored to specific patient populations, which limits their generalizability and precision13. Additionally, they require substantial input (i.e., vital signs and the results of non-routine laboratory tests), making accurate calculation in the ED challenging. Furthermore, relatively little progress has been made in the implementation and integration of prediction tools into clinical practice14.
To address the need for a more accessible and universally applicable prediction tool, machine learning (ML)-based prediction tools have been introduced15,16,17,18. Laboratory test results have previously been shown to yield promising prognostic models; notably, the pioneering work of Horne and colleagues leveraged routine tests like complete blood count and basic metabolic profile to develop a highly predictive risk score for mortality, laying a critical foundation for modern AI-driven clinical decision support tools12. In line with these findings, we have recently developed the RISKINDEX, an ML-based risk score to assess patient risk in the ED19,20,21,22,23. The RISKINDEX predicts 31-day mortality based on routine laboratory tests, age, and sex. Its value (ranging from 0 to 100) reflects the probability of 31-day mortality for an individual. The RISKINDEX becomes automatically available alongside lab results for ED patients with at least four laboratory tests ordered, helping to streamline and standardize decision-making. Although the RISKINDEX has been externally validated in three other medical centers, demonstrating consistently excellent prognostic accuracy for 31-day mortality21, it is unknown whether the RISKINDEX outperforms clinical intuition and traditional clinical prediction tools when integrated in the routine ED workflow.
In the Machine Learning for Risk Stratification in the Emergency Department (MARS-ED) study, we employed a randomized controlled trial to evaluate the prognostic accuracy and clinical impact of the RISKINDEX. This design was chosen for its dual advantages: it allowed us to isolate the predictive performance of the RISKINDEX by minimizing confounding from clinical actions triggered by its use, while also enabling a direct assessment of its influence on clinical outcomes. In this work, we show that the RISKINDEX statistically matches or outperforms both clinical intuition and traditional prediction tools, yet its integration in routine care does not alter clinical decisions or outcomes. These findings underscore that prognostic accuracy alone is insufficient to achieve clinical impact.
Results
Study sample
We enrolled a convenience sample of 1303 adult ED patients who were randomly assigned to the intervention group (644 participants) or the control group (659 participants) (Fig. 1 and Table 1). Patients in the control group received usual ED care. For patients in the intervention group, the attending physician was additionally shown the RISKINDEX result and could take it into account, on top of routine care, when deciding on treatment. The median age was 69 years (IQR: 58–79). In total, 814 patients (62.5%) were admitted to the hospital. The most common reason for the ED visit and admission to the hospital was infectious disease (n = 376, 27.8%). Detailed information on the requested laboratory tests is provided in Supplementary Table 1. In total, 90 patients (6.9%) died within 31 days after the ED visit, 45 patients in each group.
When entering the ED, the patient was immediately assessed for eligibility and randomized as soon as informed consent was obtained. Patients were randomized to either the intervention group (standard care plus RISKINDEX) or the control group (standard care).
There were no significant differences in age, hospital admission rate and 31-day mortality between enrolled and non-enrolled patients (Supplementary Table 2).
Primary analysis: prognostic accuracy and clinical impact of RISKINDEX
To evaluate the RISKINDEX in the routine ED workflow, we compared its prognostic accuracy for 31-day mortality against physicians’ clinical intuition. The RISKINDEX demonstrated high prognostic accuracy, achieving an area under the receiver operating characteristics curve (AUROC) of 0.84 (95% CI: 0.78–0.90). This statistically matched or outperformed that of the ED physician’s clinical intuition (Fig. 2), which was assessed through three distinct questions designed to capture the clinician’s level of concern, perceived severity of illness, and degree of surprise (see Methods). The “concern” question showed an AUROC of 0.74 (95% CI 0.68–0.81, p = 0.017), the “severity” question an AUROC of 0.76 (95% CI 0.68–0.83, p = 0.05), and the “surprise” question an AUROC of 0.73 (95% CI 0.66–0.80, p = 0.007), all below the accuracy of the RISKINDEX. These accuracies corresponded with an area under the precision-recall curve (AUPRC) of 0.33 [0.23–0.43] for the RISKINDEX and between 0.20 [0.14–0.28] and 0.22 [0.16–0.31] for the physicians’ clinical intuition questions (Supplementary Fig. 1). The analysis of 7-day mortality yielded similar results for the RISKINDEX and the physicians’ clinical intuition (Supplementary Fig. 2A). A sensitivity analysis performed in the control group demonstrated that the RISKINDEX statistically matched one of the clinical intuition questions and outperformed the other two (Supplementary Table 3). Notably, the prognostic accuracy for 31-day mortality of the ED physician’s clinical intuition varied by the clinician’s level of experience: it was higher in experienced internal medicine specialists (>6 years, AUROC: 0.77–0.83) compared with residents (<6 years, AUROC: 0.72–0.77) (see Supplementary Table 4).
The RISKINDEX demonstrated high prognostic accuracy, achieving an area under the receiver operating characteristics curve (AUROC) of 0.84 (95% CI: 0.78–0.90). This statistically matched or outperformed that of the ED physician’s clinical intuition, assessed through three distinct questions: the “concern” question showed an AUROC of 0.74 (95% CI 0.68–0.81, p = 0.017), the “severity” question an AUROC of 0.76 (95% CI 0.68–0.83, p = 0.05), and the “surprise” question an AUROC of 0.73 (95% CI 0.66–0.80, p = 0.007). The AUROCs were compared by using the method of DeLong.
Despite overall high prognostic accuracy, the RISKINDEX did not lead to policy changes. In only 1 patient (out of 644 in the intervention group; 0.16%), the ED physician indicated that the patient was reassessed during the ED visit as a result of a reported RISKINDEX being higher than that physician expected. Interestingly, the RISKINDEX aligned with the ED physician’s expectations in only half of the cases (n = 289, 46.8%). In the remaining cases, its predictions were either higher (n = 164, 26.6%) or lower (n = 152, 24.6%) than the physician expected.
Secondary analysis: comparison to traditional clinical prediction tools, clinical outcomes and feasibility
The RISKINDEX demonstrated higher prognostic accuracy for 31-day mortality compared to traditional clinical prediction tools (Fig. 3). It showed a statistically higher AUROC than the NEWS score (AUROC 0.65 (95% CI 0.56–0.74), p < 0.001), APACHE II score (AUROC 0.65 (95% CI 0.51–0.78), p = 0.008), and SOFA score (AUROC 0.75 (95% CI 0.68–0.83), p = 0.017). These accuracies corresponded with an area under the precision-recall curve (AUPRC) of 0.33 [0.24–0.43] for the RISKINDEX and between 0.19 [0.12–0.28] and 0.13 [0.09–0.18] for the clinical prediction tools (Supplementary Fig. 3). The distribution of the RISKINDEX and the traditional prediction tools is shown in Supplementary Fig. 4.
The RISKINDEX demonstrated higher prognostic accuracy for 31-day mortality (AUROC 0.84 (95% CI: 0.78–0.90)) compared to traditional clinical prediction tools. It showed a statistically higher AUROC than the NEWS score (AUROC 0.65 (95% CI 0.56–0.74), p < 0.001), APACHE II score (AUROC 0.65 (95% CI 0.51–0.78), p = 0.008), and SOFA score (AUROC 0.75 (95% CI 0.68–0.83), p = 0.017). The AUROCs were compared by using the method of DeLong.
Implementation of the RISKINDEX had no effect on clinical outcomes, as reflected by unchanged treatment plans, hospital admissions rates, length of stay, and ICU admissions (Table 1). This is consistent with a subanalysis showing that the perceived added value of the RISKINDEX (scored using a Likert scale ranging from 1 to 10) was low, with a median score of 2 (IQR: 1–4) (Supplementary Fig. 5). Notably, the perceived value of the RISKINDEX increased in patients that were considered moderate- to high-risk (p < 0.001 for trend, Supplementary Fig. 6).
Discussion
Rapid and accurate risk stratification in the ED is essential to optimize patient care. In this prospective randomized controlled study, we implemented the ML-based RISKINDEX in the ED workflow and showed that the prognostic performance of the RISKINDEX statistically matched or outperformed clinical intuition and well-known clinical prediction tools. Despite this statistically high performance, the RISKINDEX did not alter treatment plans or clinical outcomes.
We report five major findings. First, the RISKINDEX demonstrated robust prognostic accuracy for 31-day mortality, achieving an AUROC of 0.84 and performing statistically equally well or better than the clinical intuition questions, which showed an AUROC of 0.73–0.76 in the intervention group and 0.73–0.83 in the control group. Second, the RISKINDEX statistically outperformed three well-known clinical prediction tools: NEWS (AUROC 0.65), APACHE II (AUROC 0.65) and SOFA score (AUROC 0.75), while these latter AUROCs were similar to those reported in previous studies24,25,26,27,28. Third, the potential benefit of the RISKINDEX varied based on the physician’s level of experience. A stratified post hoc analysis revealed that less experienced physicians are likely to benefit most from ML-guided decision support. This is particularly relevant as junior physicians often serve as primary caregivers in the ED. Fourth, the RISKINDEX prediction aligned with the ED physicians’ expectations in only about half of cases. Despite this substantial discordance, clinicians perceived low added value. Fifth, despite statistically higher or equal performance of the RISKINDEX in comparison with clinical intuition questions and traditional risk scores, its clinical impact was absent and integration of the RISKINDEX in the ED workflow led to no adjustments in treatment plans. Although the trial was not powered for most clinical endpoints, its results emphasize that prognostic accuracy alone is insufficient to drive clinical outcomes.
Various reasons may underlie the gap between prognostic accuracy and clinical impact: RISKINDEX was developed to predict 31-day mortality, enabling a direct comparison with several commonly used prediction models in this study. However, this comes with a tradeoff: many ED decisions may be more strongly associated with short term mortality measures. Hence, the limited impact on clinical outcomes may reflect suboptimal alignment between the 31-day mortality endpoint of RISKINDEX, and the ED’s primary focus on stabilization and patient disposition. In a post-hoc analysis, however, we found similar results for prognostic accuracy for 7-day mortality.
The gap also underscores key challenges in the interaction between clinicians and (both AI and non-AI) prediction tools. Prediction tools such as RISKINDEX do not operate in isolation; they rely on and complement physicians’ intuition and expertise. Despite our efforts to minimize implementation barriers, the substantial discordance between RISKINDEX predictions and physicians’ expectations confirms a low response to AI-based recommendations. Notably, perceived value increased in patients who were considered moderate- to high-risk. These observations emphasize the importance of tailored and user-centered implementation strategies that enhance clinician responsiveness to AI-driven insights.
Accordingly, our findings strongly argue for user-centered design across both model development and implementation. First, future studies should retarget the prediction to actionable decisions in the ED (e.g., 7-day adverse outcomes, admission vs discharge, or need for early escalation), co-specified with ED physicians. Second, the conversion of a 0–100 score into more directive, threshold-linked recommendations (e.g., site-calibrated ‘rule-out’ and ‘high-risk’ cut-points tied to explicit actions and acceptable miss rates) may be explored. For example, a hospital may define an acceptable rate of ‘low-risk’ misclassification (e.g., 1%) based on physicians’ risk tolerance for adverse events, resulting in a negative predictive value (NPV) of 99%. This NPV can be used to derive the corresponding RISKINDEX rule-out threshold. Third, patient-level explanations and reliability information alongside recommendations can be considered to address counter-intuition and support trust. Last, the RISKINDEX could be integrated earlier in the workflow (e.g. a dynamic electronic health record-embedded display when laboratory results return) to maximize actionability under time pressure. Future studies should undertake these steps in co-design with clinical end-users, such as frontline ED physicians, to maximize potential clinical value.
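The derivation of a rule-out threshold from a target NPV, as outlined above, can be illustrated with a minimal Python sketch. It is not part of the study code: the function name `rule_out_threshold`, the toy cohort, and the choice to take the largest cut-off that still meets the target are all assumptions made for illustration. In practice the threshold would be estimated on a site-specific calibration cohort and validated on separate data.

```python
def rule_out_threshold(scores, died, target_npv=0.99):
    """Largest cut-off t such that patients with score <= t have an
    observed NPV (fraction surviving) of at least target_npv.

    Returns None when no cut-off reaches the target.
    Hypothetical helper, not taken from the MARS-ED repository.
    """
    pairs = sorted(zip(scores, died))
    best = None
    n_low_risk = 0
    n_survivors = 0
    i = 0
    while i < len(pairs):
        # Add every patient sharing the current score, so ties are
        # always classified together.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            n_low_risk += 1
            n_survivors += 0 if pairs[j][1] else 1
            j += 1
        if n_survivors / n_low_risk >= target_npv:
            best = pairs[i][0]
        i = j
    return best


# Toy cohort (invented numbers): 400 patients with three score levels.
scores = [1] * 200 + [5] * 100 + [20] * 100
died = ([True] + [False] * 199          # 1 death among scores of 1
        + [True] + [False] * 99         # 1 death among scores of 5
        + [True] * 10 + [False] * 90)   # 10 deaths among scores of 20

threshold = rule_out_threshold(scores, died, target_npv=0.99)
print(threshold)
```

In this toy cohort, classifying scores up to 5 as low-risk misses 2 of 300 patients (NPV about 99.3%), whereas including scores of 20 would lower the NPV to 97%, so the sketch selects 5 as the rule-out threshold.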
There are limitations to this study. First, the study was conducted in a single medical center, potentially limiting generalizability. However, previous multicenter validation of the RISKINDEX suggests that its performance is robust after adaptation to each medical center’s population, despite local differences in patient demographics21. Second, the study took place in a crowded ED, and represented a convenience sample. To assess potential inclusion bias, we compared age, hospital admission rate and 31-day mortality of our study sample to all internal medicine patients in our ED during the period in which the study was performed, and found no relevant differences. Third, despite extensive physician briefing and direct, in-person presentation of RISKINDEX predictions by the research team, the clinical impact of the RISKINDEX was absent. As such, our implementation strategy serves as a valuable negative example—highlighting that even models with strong prognostic accuracy can fail to influence clinical practice when not fully aligned with clinical workflows and decision priorities—while offering important lessons for future research on implementation of AI models. Fourth, although our comparators (NEWS, SOFA and APACHE II) were not originally designed for use in the ED, they are among the few scores validated in ED populations and therefore constitute a potential benchmark29,30,31,32.
In conclusion, this randomized trial demonstrates that the machine learning–based RISKINDEX achieved statistically higher prognostic accuracy than traditional scores and matched or exceeded clinician intuition questions, yet had no impact on clinical decisions or outcomes. This gap between statistical performance and clinical impact underscores that AI tools must be actionable, timely, and trusted to influence care. Future efforts should couple accurate prediction with user-centered and decision-linked interventions, and test these in trials measuring not only predictive gains but also changes in clinician behavior, patient outcomes, and cost-effectiveness.
Methods
Study design and setting
The study was designed as an investigator-initiated, open-label, randomized, non-inferiority, clinical trial. The protocol of the Machine Learning for Risk Stratification in the Emergency Department (MARS-ED) study has been published previously22. In short, the study was performed at the ED of the Maastricht University Medical Center+ (MUMC+), which is a secondary/tertiary care medical center in the Netherlands, with 6800 patients visiting the ED for assessment by an internal medicine specialist each year.
This study was approved by the medical ethical committee (METC) of the MUMC+ (METC 21-068) and was registered at clinicaltrials.gov (NCT05497830). The study was conducted according to the principles of the Declaration of Helsinki and reported in line with the Consolidated Standards of Reporting Trials - Artificial Intelligence (CONSORT-AI) guidelines (Supplementary Table 5)33. The first participant was enrolled on 16 September 2022 and the last on 17 July 2024.
Participants
Adult patients ( ≥ 18 years old) presenting to the ED who were assessed and treated by an internal medicine specialist with at least four laboratory results were eligible for inclusion. Participants provided written informed consent. Patients were excluded when they revisited the ED within one month after the index ED presentation, since their revisit was included in the follow-up period of that index visit.
Randomization
We performed a randomized clinical trial to assess the prognostic accuracy regarding 31-day mortality of the RISKINDEX, its clinical impact (policy changes) and its effect on secondary clinical outcomes (hospital admission, length of hospital stay, and admission to ICU). When a patient entered the ED for assessment and treatment by an internal medicine specialist, the patient was immediately assessed for eligibility and randomized as soon as informed consent was obtained. Patients were randomized to either the intervention group (standard care plus RISKINDEX) or the control group (standard care), using computer-generated permuted block randomization with a 1:1 allocation ratio. This study was not blinded, since the physician needed to be informed of the RISKINDEX in order to assess the magnitude of the clinical impact of the RISKINDEX.
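For readers unfamiliar with permuted block randomization, the following Python sketch shows the general idea of a 1:1 allocation sequence. It is illustrative only and does not reproduce the trial's randomization system: the block size of 4, the seed, and the function name `permuted_block_sequence` are assumptions, since the block sizes used in the trial are not stated here.

```python
import random


def permuted_block_sequence(n_patients, block_size=4, seed=2022):
    """Return a 1:1 allocation list built from shuffled, balanced blocks.

    block_size and seed are hypothetical values chosen for illustration.
    """
    if block_size % 2 != 0:
        raise ValueError("block_size must be even for 1:1 allocation")
    rng = random.Random(seed)  # dedicated generator for reproducibility
    sequence = []
    while len(sequence) < n_patients:
        # Each block holds equal numbers of both arms in random order.
        block = ["intervention", "control"] * (block_size // 2)
        rng.shuffle(block)
        sequence.extend(block)
    return sequence[:n_patients]


allocation = permuted_block_sequence(12)
print(allocation)
```

Because every complete block contains both arms equally often, group sizes stay balanced throughout enrollment, at the cost of some predictability near the end of each block.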
Study preparation
In the months preceding trial launch, we prepared ED physicians for RISKINDEX use through briefings delivered during regular teaching and scientific meetings (n = 4). These briefings extensively covered model inputs (age, sex, routine laboratory tests), the interpretation of the 0–100 probability, key limitations and appropriate use (decision support rather than directive), and previously obtained results including comparison to physicians, multicenter predictive accuracy and explainability. Furthermore, we conducted two pilot projects: first, we surveyed ED physicians (n = 17) to compare alternative display formats on preference and clarity (Likert scales), including a calibrated probability display, a color-banded categorical display, and a gauge/decile format (Supplementary Fig. 7). Based on these ratings, we implemented a calibrated probability display that reports a single 0–100 value corresponding to the estimated probability of 31-day mortality. Second, physicians practiced with virtual cases to rehearse interpretation of the RISKINDEX.
Study intervention and procedure
An overview of the patients’ timeline is shown in Supplementary Fig. 8. After complete assessment of the patient, the ED physicians were asked questions regarding their clinical intuition (Table 2). In the intervention group only, the RISKINDEX was personally presented by a member of the research team to the attending physician, after the patient had been completely assessed and a preliminary treatment plan had been made. A member of the research team informed the ED physician about the study, explained the RISKINDEX variables, the meaning of the calculated probability, and the high predictive accuracy for 31-day mortality found in the previous multicenter study. To ensure that presenting the RISKINDEX would not influence the ED physicians’ clinical intuition, the electronic case record form was designed in such a way that the RISKINDEX could not be calculated until after the clinical intuition questions had been answered. Subsequently, these physicians were questioned about the alignment of the RISKINDEX with their initial clinical intuition and any resulting changes in the treatment plan (Table 2). Finally, in a subgroup of 121 patients we asked the ED physicians about their perceived added value of the RISKINDEX (using a Likert scale ranging from 1 to 10).
RISKINDEX
The RISKINDEX is a machine learning (ML)-derived risk score that predicts all-cause 31-day mortality using routine laboratory tests ordered by the attending physician, along with basic patient characteristics (age and sex)21. The computed RISKINDEX (value 0–100) corresponds to an individual’s probability of 31-day mortality.
Clinical prediction tools
To compare the prognostic accuracy for 31-day mortality of the RISKINDEX with that of clinical prediction tools, we selected commonly used prediction tools based on their prevalence and global use. We selected the National Early Warning Score (NEWS), the Acute Physiology and Chronic Health Evaluation II (APACHE II), and the Sepsis-related Organ Failure Assessment (SOFA) score6,7,8. Although originally derived for inpatient and intensive-care cohorts, these three scores are among the few that have been validated, and clinically used, in ED populations and were therefore used as comparators for our study29,30,31,32.
Data collection
In short, we collected data on patient characteristics, comorbidity, triage category (based on the Manchester triage system (MTS)34), reason for ED visit, vital signs, laboratory tests, and clinical endpoints (e.g., hospital admission, admission to intensive care unit (ICU) within 31 days, 31-day mortality). The answers to all questionnaires were immediately recorded in an electronic case record form. The retrieval of laboratory test results and RISKINDEX results was automated. To ensure data quality, all outcome data and a sample of all data were double-checked by another member of the research team and/or the study monitor, and discrepancies were resolved through discussion with a second reviewer. Data monitoring was performed by the Clinical Trial Center Maastricht (CTCM).
Outcomes
The primary outcomes for this study were the prognostic accuracy for 31-day mortality and clinical impact of the RISKINDEX. Secondary outcomes included the prognostic accuracy for 31-day mortality compared to clinical prediction tools, differences in clinical outcomes including hospital admission, length of hospital stay, admission to ICU, and feasibility of the RISKINDEX.
Statistical analysis
Assuming a 31-day mortality of 8%, we calculated a required sample size of 1250 patients and anticipated including 1300 patients during the study period, based on ED patient volumes at MUMC+. With regard to policy changes, a post hoc power calculation revealed that 388 patients in each arm were needed to detect 2% policy changes (80% power; 5% significance). In the current study, a total of 1303 patients were included, which would allow us to detect a 1.2% difference in policy changes (80% power). Baseline characteristics were analyzed using descriptive statistics. Categorical variables were reported as frequency counts with percentages and continuous variables as medians with interquartile range (IQR) or means with standard deviation (SD), depending on their distribution.
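The post hoc power calculation for policy changes can be approximated with a standard two-proportion sample size formula. The Python sketch below is an illustration under stated assumptions, not the authors' calculation: it assumes a two-sided test, a 0% policy-change rate in the control arm, and the normal approximation with a pooled variance under the null hypothesis. The function name `n_per_arm` is hypothetical. Under these assumptions the formula gives 388 patients per arm for a 2% difference, in line with the figure reported above.

```python
import math
from statistics import NormalDist


def n_per_arm(p_control, p_intervention, alpha=0.05, power=0.80):
    """Patients per arm for a two-sided comparison of two independent
    proportions (normal approximation, pooled variance under the null)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_control + p_intervention) / 2
    # Standard deviation terms under the null and under the alternative.
    sd_null = math.sqrt(2 * p_bar * (1 - p_bar))
    sd_alt = math.sqrt(p_control * (1 - p_control)
                       + p_intervention * (1 - p_intervention))
    delta = abs(p_intervention - p_control)
    return math.ceil(((z_alpha * sd_null + z_beta * sd_alt) / delta) ** 2)


# Assumed scenario: no policy changes in the control arm, 2% in the
# intervention arm.
print(n_per_arm(0.0, 0.02))
# Same assumptions with a 1.2% difference, for comparison with the
# enrolled arm sizes (644 and 659).
print(n_per_arm(0.0, 0.012))
```

Other formulas (for example, without pooling or with a continuity correction) give somewhat different numbers, so the agreement should be read as a plausibility check rather than a reconstruction of the original computation.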
Our primary analysis assessed the prognostic accuracy and clinical impact of the RISKINDEX. The prognostic accuracy for 31-day mortality was compared against physicians’ clinical intuition by calculating the areas under the receiver operating characteristics curve (AUROC) in the intervention group. The AUROCs were compared by using the method of DeLong35. The precision-recall curve (with area under this precision-recall curve (AUPRC)) was employed to evaluate the balance between sensitivity and positive predictive value36. As a post-hoc analysis, the prognostic accuracy of the RISKINDEX was also compared against the physician’s clinical intuition to predict 7-day mortality. As a post-hoc analysis, clinical intuition was further analyzed by stratifying physicians’ experience into two categories: 0–6 years (residents in specialist training) and >6 years (medical specialists). Intrinsic to the evaluation of the prognostic accuracy of a risk score, the clinical actions guided by that risk score could have influenced the primary outcome of 31-day mortality, thereby obscuring its true performance. Therefore, a sensitivity analysis was conducted within the control group to ensure an unbiased evaluation unaffected by clinical actions guided by the results of the RISKINDEX. The clinical impact was evaluated by assessing the number and type of policy changes in the ED after presentation of the RISKINDEX. Although not powered for clinical endpoints, the randomized controlled design allowed a direct exploratory evaluation of the effect of the RISKINDEX on secondary clinical endpoints. Feasibility of the RISKINDEX was assessed by physicians’ alignment and its perceived added value. The alignment between the RISKINDEX and ED physicians’ expectations was evaluated by asking physicians whether the RISKINDEX result matched, exceeded, or fell below their expectations. 
The perceived added value of the RISKINDEX was evaluated by asking ED physicians to rate its usefulness on a Likert scale from 1 to 10. This assessment was conducted in a subgroup of 114 patients.
All analyses were performed in R, version 4.1.3 (The R Foundation for Statistical Computing). Source code for the data analysis and in-house developed interface is available in a public repository (https://github.com/wptmdoorn/marsedstudy).
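To make the AUROC comparison described in this section concrete, the sketch below implements the AUROC as a rank statistic and the DeLong test for two correlated AUROCs in plain Python. It is an illustration only: the study analyses were performed in R, the function names (`delong_test` and its helpers) are hypothetical, and the ten-patient data set is invented. The implementation uses the direct pairwise definition, which is easy to read but slow for large cohorts.

```python
import math
from statistics import NormalDist, mean


def _placements(scores, died):
    """Per-patient components of the AUROC (DeLong's structural components)."""
    pos = [s for s, d in zip(scores, died) if d]
    neg = [s for s, d in zip(scores, died) if not d]

    def psi(x, y):
        # 1 if the non-survivor scores higher, 0.5 for ties, else 0.
        return 1.0 if x > y else 0.5 if x == y else 0.0

    v10 = [mean([psi(x, y) for y in neg]) for x in pos]
    v01 = [mean([psi(x, y) for x in pos]) for y in neg]
    return v10, v01


def _cov(a, b):
    """Sample covariance of two equally long lists."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


def delong_test(scores_a, scores_b, died):
    """Return (AUROC_a, AUROC_b, two-sided p) for two scores measured on
    the same patients. Fails if the variance of the difference is zero,
    for example when both scores rank patients identically."""
    v10_a, v01_a = _placements(scores_a, died)
    v10_b, v01_b = _placements(scores_b, died)
    auc_a, auc_b = mean(v10_a), mean(v10_b)
    var = ((_cov(v10_a, v10_a) + _cov(v10_b, v10_b)
            - 2 * _cov(v10_a, v10_b)) / len(v10_a)
           + (_cov(v01_a, v01_a) + _cov(v01_b, v01_b)
              - 2 * _cov(v01_a, v01_b)) / len(v01_a))
    z = (auc_a - auc_b) / math.sqrt(var)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return auc_a, auc_b, p_value


# Invented example: 10 patients, 4 deaths, two competing risk scores.
died = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
score_a = [1, 2, 3, 4, 5, 7, 6, 8, 9, 10]
score_b = [3, 1, 6, 2, 8, 5, 4, 7, 9, 10]

auc_a, auc_b, p_value = delong_test(score_a, score_b, died)
print(round(auc_a, 3), round(auc_b, 3), round(p_value, 3))
```

Because both scores are evaluated on the same patients, the test accounts for their correlation through the covariance terms; treating the two AUROCs as independent would misstate the uncertainty of their difference.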
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
Data associated with this study are provided in the manuscript and/or Supplementary Information. Individual-level data involving human participants are not publicly available because they contain potentially identifying or sensitive clinical information. De-identified participant-level data and supporting clinical documents will be made available to external researchers upon request, after review and approval by the sponsor and the medical ethics committee, and following the signing of a data access and/or materials transfer agreement. Requests should be directed to the corresponding author.
Code availability
The source code for the data analysis and in-house developed interface is available in a public repository (https://github.com/wptmdoorn/marsedstudy).
References
van der Linden, N. et al. Effects of emergency department crowding on the delivery of timely care in an inner-city hospital in the Netherlands. Eur. J. Emerg. Med. 23, 337–343 (2016).
Ter Avest, E., Onnes, B. T., van der Vaart, T. & Land, M. J. Hurry up, it’s quiet in the emergency department. Neth. J. Med. 76, 32–35 (2018).
Guttmann, A., Schull, M. J., Vermeulen, M. J. & Stukel, T. A. Association between waiting times and short term mortality and hospital admission after departure from emergency department: population based cohort study from Ontario, Canada. BMJ 342, d2983 (2011).
Liew, D., Liew, D. & Kennedy, M. P. Emergency department length of stay independently predicts excess inpatient length of stay. Med J. Aust. 179, 524–526 (2003).
Sun, B. C. et al. Effect of emergency department crowding on outcomes of admitted patients. Ann. Emerg. Med 61, 605–611 e606 (2013).
Royal College of Physicians. National Early Warning Score (NEWS): Standardising the assessment of acute illness severity in the NHS Report of a working party. London: RCP, (2012).
Vincent, J. L. et al. The SOFA (Sepsis-related Organ Failure Assessment) score to describe organ dysfunction/failure. On behalf of the Working Group on Sepsis-Related Problems of the European Society of Intensive Care Medicine. Intensive Care Med 22, 707–710 (1996).
Knaus, W. A., Draper, E. A., Wagner, D. P. & Zimmerman, J. E. APACHE II: a severity of disease classification system. Crit. Care Med 13, 818–829 (1985).
van Geffen, M., van der Waaij, K. M. & Stassen, P. M. Number, nature & impact of incoming telephone calls on residents and their work during evening shifts. Acute Med 21, 5–11 (2022).
Laxmisan, A. et al. The multitasking clinician: decision-making and cognitive demand during and after team handoffs in emergency care. Int J. Med Inf. 76, 801–811 (2007).
Chisholm, C. D., Collison, E. K., Nelson, D. R. & Cordell, W. H. Emergency department workplace interruptions: are emergency physicians “interrupt-driven” and “multitasking”? Acad. Emerg. Med 7, 1239–1243 (2000).
Horne, B. D. et al. Exceptional mortality prediction by risk scores from common laboratory tests. Am. J. Med 122, 550–558 (2009).
Challen, K. & Goodacre, S. W. Predictive scoring in non-trauma emergency patients: a scoping review. Emerg. Med J. 28, 827–837 (2011).
Chan, S. L. et al. Implementation of Prediction Models in the Emergency Department from an Implementation Science Perspective-Determinants, Outcomes, and Real-World Impact: A Scoping Review. Ann. Emerg. Med 82, 22–36 (2023).
Xie, F. et al. Development and Assessment of an Interpretable Machine Learning Triage Tool for Estimating Mortality After Emergency Admissions. JAMA Netw. Open 4, e2118467 (2021).
Yu, J. Y. et al. Inter hospital external validation of interpretable machine learning based triage score for the emergency department using common data model. Sci. Rep. 14, 6666 (2024).
Sanchez-Salmeron, R. et al. Machine learning methods applied to triage in emergency services: A systematic review. Int Emerg. Nurs. 60, 101109 (2022).
Naemi, A. et al. Machine learning techniques for mortality prediction in emergency departments: a systematic review. BMJ Open 11, e052663 (2021).
Taylor, R. A. et al. Prediction of In-hospital Mortality in Emergency Department Patients With Sepsis: A Local Big Data-Driven, Machine Learning Approach. Acad. Emerg. Med 23, 269–278 (2016).
Shafaf, N. & Malek, H. Applications of Machine Learning Approaches in Emergency Medicine; a Review Article. Arch. Acad. Emerg. Med 7, 34 (2019).
van Doorn, W. P. T. M. et al. Explainable Machine Learning Models for Rapid Risk Stratification in the Emergency Department: A Multicenter Study. J. Appl Lab Med 9, 212–222 (2024).
van Dam, P. et al. Machine learning for risk stratification in the emergency department (MARS-ED) study protocol for a randomized controlled pilot trial on the implementation of a prediction model based on machine learning technology predicting 31-day mortality in the emergency department. Scand. J. Trauma Resusc. Emerg. Med 32, 5 (2024).
van Doorn, W. et al. A comparison of machine learning models versus clinical evaluation for mortality prediction in patients with sepsis. PLoS One 16, e0245157 (2021).
Zhou, Q., Chen, Z. H., Cao, Y. H. & Peng, S. Clinical impact and quality of randomized controlled trials involving interventions evaluating artificial intelligence prediction tools: a systematic review. NPJ Digit Med 4, 154 (2021).
Benedetto, U. et al. Machine learning improves mortality risk prediction after cardiac surgery: Systematic review and meta-analysis. J. Thorac. Cardiovasc Surg. 163, 2075–2087 e2079 (2022).
Shin, S. et al. Machine learning vs. conventional statistical models for predicting heart failure readmission and mortality. ESC Heart Fail 8, 106–115 (2021).
Shung, D. L. et al. Validation of a Machine Learning Model That Outperforms Clinical Risk Scoring Systems for Upper Gastrointestinal Bleeding. Gastroenterology 158, 160–167 (2020).
Xu, X. et al. Radiomic analysis of contrast-enhanced CT predicts microvascular invasion and outcome in hepatocellular carcinoma. J. Hepatol. 70, 1133–1144 (2019).
Guan, G., Lee, C. M. Y., Begg, S., Crombie, A. & Mnatzaganian, G. The use of early warning system scores in prehospital and emergency department settings to predict clinical deterioration: A systematic review and meta-analysis. PLoS One 17, e0265559 (2022).
Ruangsomboon, O. et al. The utility of the Rapid Emergency Medicine Score (REMS) compared with three other early warning scores in predicting in-hospital mortality among COVID-19 patients in the emergency department: a multicenter validation study. BMC Emerg. Med 23, 45 (2023).
van Dam, P. et al. Head-to-head comparison of 19 prediction models for short-term outcome in medical patients in the emergency department: a retrospective study. Ann. Med 55, 2290211 (2023).
Xie, F. et al. Benchmarking emergency department prediction models with machine learning and public electronic health records. Sci. Data 9, 658 (2022).
Liu, X. et al. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Nat. Med 26, 1364–1374 (2020).
Cooke, M. W. & Jinks, S. Does the Manchester triage system detect the critically ill? J. Accid. Emerg. Med 16, 179–181 (1999).
DeLong, E. R., DeLong, D. M. & Clarke-Pearson, D. L. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 44, 837–845 (1988).
Romero-Brufau, S., Huddleston, J. M., Escobar, G. J. & Liebow, M. Why the C-statistic is not informative to evaluate early warning scores and what metrics to use. Crit. Care 19, 285 (2015).
Downar, J., Goldman, R., Pinto, R., Englesakis, M. & Adhikari, N. K. The “surprise question” for predicting death in seriously ill patients: a systematic review and meta-analysis. CMAJ 189, E484–E493 (2017).
Ouchi, K. et al. Association of Emergency Clinicians’ Assessment of Mortality Risk With Actual 1-Month Mortality Among Older Adults Admitted to the Hospital. JAMA Netw. Open 2, e1911139 (2019).
Zelis, N. et al. Short-term mortality in older medical emergency patients can be predicted using clinical intuition: A prospective study. PLoS One 14, e0208741 (2019).
Barais, M. et al. Gut Feelings Questionnaire in daily practice: a feasibility study using a mixed-methods approach in three European countries. BMJ Open 8, e023488 (2018).
Acknowledgements
We thank all of the participants who were screened for and participated in this clinical trial. We thank S. van Kuijk for his help in designing this clinical trial. We acknowledge M. Nobel and S. Puts for providing the server for the online study interface, and R. Verlinden and J. Bleus for their assistance in linking the laboratory information system with the online study interface. The authors are also grateful to the following medical students for their large role prior to and during this clinical trial: A. van de Koolwijk, B. van Beek, B. Bugter, C. Hovens, E. Vangelder, F. van Gils, F. Dijkstra, I. Welie, J. Weijers, J. van Welij, L. Vondenhoff, M. van Rijswijk, M. van Roosendaal, M. Ronner, P. Ummels, P. Smeets, R.L. Pena, S. Cimino, S. Lievens, W. Luimes and F. Dijkstra.
Author information
Contributions
P.V.D. contributed to concept and design of the study; acquisition, checking, analysis and interpretation of data; and drafting of the manuscript. W.V.D. contributed to concept and design of the study; creating the online study interface; analysis and interpretation of data; and drafting the manuscript. L.S. contributed to drafting the study protocol; the acquisition and checking of data; and critical revision of the manuscript. L.L. contributed to drafting the study protocol; the acquisition and checking of data; and critical revision of the manuscript. J.C. contributed to concept and design of the study; interpretation of data; and critical revision of the manuscript. O.B. contributed to drafting the study protocol; and critical revision of the manuscript. P.S. contributed to concept and design of the study; interpretation of data; supervision; and drafting of the manuscript. S.M. contributed to concept and design of the study; interpretation of data; supervision; and drafting of the manuscript. All authors vouched for data accuracy and fidelity to the study protocol; and read and approved the submitted version of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
van Dam, P.M.E.L., van Doorn, W.P.T.M., Sevenich, L. et al. Machine learning for risk stratification in the emergency department (MARS-ED): a randomized controlled trial. Nat Commun 17, 242 (2026). https://doi.org/10.1038/s41467-025-66947-7
DOI: https://doi.org/10.1038/s41467-025-66947-7