Machine learning for risk stratification in the emergency department (MARS-ED): a randomized controlled trial

van Dam, Paul M. E. L.; van Doorn, William P. T. M.; Sevenich, Lotte; Lambriks, Lars; Cals, Jochen W. L.; Bekers, Otto; Stassen, Patricia M.; Meex, Steven J. R.

doi:10.1038/s41467-025-66947-7

Article
Open access
Published: 01 December 2025

Nature Communications volume 17, Article number: 242 (2026)

Abstract

Emergency department (ED) crowding necessitates rapid, accurate risk stratification to optimize care and resource allocation. Traditional clinical prediction tools such as the NEWS, APACHE II, and SOFA scores have limited generalizability or rely on extensive inputs, including vital signs. We developed RISK^INDEX, a machine-learning tool predicting 31-day mortality using routine laboratory values, age, and sex. To evaluate the clinical impact of a machine learning–based risk score in routine care, we conducted an investigator-initiated, open-label, randomized, non-inferiority trial (MARS-ED) at the Maastricht University Medical Center+ ED. Adult patients (≥18 years) assessed by an internal medicine specialist and with ≥4 laboratory tests were eligible. Patients were randomized 1:1 using computer-generated permuted blocks to standard care (n = 659) or standard care plus access to the RISK^INDEX (n = 644). No blinding was possible because physicians needed to view the RISK^INDEX. The primary outcomes for this study were the prognostic accuracy for 31-day mortality and clinical impact of the RISK^INDEX. In total, 1303 participants were analyzed. RISK^INDEX’s prognostic accuracy matched or outperformed clinical intuition (AUROC 0.84 vs. 0.73–0.76) and was statistically higher than that of the NEWS, APACHE II, and SOFA prediction tools (AUROC 0.65–0.75). RISK^INDEX predictions aligned with clinicians’ expectations in only about half of cases, with the highest discordance among less experienced physicians. Despite its prognostic accuracy, the RISK^INDEX did not alter treatment plans (1/644 changes; 0.16%) or clinical outcomes, and clinicians perceived low added value. No adverse events related to the intervention occurred, and recruitment was completed as planned. These findings show that prognostic accuracy alone is insufficient to achieve clinical impact in the ED and that user-centered, actionable model design is needed to ensure relevance, trust, and responsiveness.
ClinicalTrials.gov registration: NCT05497830.

Introduction

Emergency department (ED) visits are increasing globally. Crowding in the ED, the associated prolonged waiting times, and the demographic shift towards older age increase the complexity of decision-making and the risk of adverse outcomes^1,2,3,4,5. Therefore, rapid risk stratification is key to improving patient outcomes and optimizing resource allocation.

Risk stratification in the ED relies on a combination of clinical intuition and objective measures such as vital signs and laboratory results. To standardize this process and enhance decision-making amidst multitasking in the ED, various clinical prediction tools have been developed; commonly used examples are the National Early Warning Score (NEWS), the Acute Physiology and Chronic Health Evaluation II (APACHE II), and the Sepsis-related Organ Failure Assessment (SOFA) score^{6,7,8,9,10,11,12}. These scores complement clinical intuition in the ED. Nevertheless, they have important limitations. First, they are often tailored to specific patient populations, which limits their generalizability and precision¹³. Second, they require substantial input (i.e., vital signs and the results of non-routine laboratory tests), making accurate calculation in the ED challenging. Furthermore, relatively little progress has been made in the implementation and integration of prediction tools into clinical practice¹⁴.

To address the need for a more accessible and universally applicable prediction tool, machine learning (ML)-based prediction tools have been introduced^15,16,17,18. Laboratory test results have previously been shown to yield promising prognostic models; notably, the pioneering work of Horne and colleagues leveraged routine tests such as the complete blood count and basic metabolic profile to develop a highly predictive risk score for mortality, laying a critical foundation for modern AI-driven clinical decision support tools¹². In line with these findings, we have recently developed the RISK^INDEX, an ML-based risk score to assess patient risk in the ED^{19,20,21,22,23}. The RISK^INDEX predicts 31-day mortality based on routine laboratory tests, age, and sex. Its value (ranging from 0 to 100) reflects the probability of 31-day mortality for an individual. The RISK^INDEX becomes automatically available alongside lab results for ED patients with at least four laboratory tests ordered, helping to streamline and standardize decision-making. Although the RISK^INDEX has been externally validated in three other medical centers, demonstrating consistently excellent prognostic accuracy for 31-day mortality²¹, it is unknown whether the RISK^INDEX outperforms clinical intuition and traditional clinical prediction tools when integrated in the routine ED workflow.

In the Machine Learning for Risk Stratification in the Emergency Department (MARS-ED) study, we employed a randomized controlled trial to evaluate the prognostic accuracy and clinical impact of the RISK^INDEX. This design was chosen for its dual advantages: it allowed us to isolate the predictive performance of the RISK^INDEX by minimizing confounding from clinical actions triggered by its use, while also enabling a direct assessment of its influence on clinical outcomes. In this work, we show that the RISK^INDEX statistically matches or outperforms both clinical intuition and traditional prediction tools, yet its integration in routine care does not alter clinical decisions or outcomes. These findings underscore that prognostic accuracy alone is insufficient to achieve clinical impact.

Results

Study sample

We enrolled a convenience sample of 1303 adult ED patients who were randomly assigned to the intervention group (644 participants) or the control group (659 participants) (Fig. 1 and Table 1). Patients in the control group received ED care as usual. For patients in the intervention group, the attending physician was shown the RISK^INDEX result and could therefore take it into account, on top of routine care, when deciding on treatment. The median age was 69 years (IQR: 58–79). In total, 814 patients (62.5%) were admitted to the hospital. The most common reason for the ED visit and admission to the hospital was infectious disease (n = 376, 27.8%). Detailed information on the requested laboratory tests is provided in Supplementary Table 1. In total, 90 patients (6.9%) died within 31 days after the ED visit, 45 patients in each group.

Table 1 Characteristics of the study sample

There were no significant differences in age, hospital admission rate and 31-day mortality between enrolled and non-enrolled patients (Supplementary Table 2).

Primary analysis: prognostic accuracy and clinical impact of RISK^INDEX

To evaluate the RISK^INDEX in the routine ED workflow, we compared its prognostic accuracy for 31-day mortality against physicians’ clinical intuition. The RISK^INDEX demonstrated high prognostic accuracy, achieving an area under the receiver operating characteristic curve (AUROC) of 0.84 (95% CI: 0.78–0.90). This statistically matched or outperformed that of the ED physician’s clinical intuition (Fig. 2), which was assessed through three distinct questions designed to capture the clinician’s level of concern, perceived severity of illness, and degree of surprise (see Methods). The “concern” question showed an AUROC of 0.74 (95% CI 0.68–0.81, p = 0.017), the “severity” question an AUROC of 0.76 (95% CI 0.68–0.83, p = 0.05), and the “surprise” question an AUROC of 0.73 (95% CI 0.66–0.80, p = 0.007), all below the accuracy of the RISK^INDEX. These accuracies corresponded with an area under the precision-recall curve (AUPRC) of 0.33 [0.23–0.43] for the RISK^INDEX and between 0.20 [0.14–0.28] and 0.22 [0.16–0.31] for the physicians’ clinical intuition questions (Supplementary Fig. 1). For 7-day mortality, the prognostic accuracy of the RISK^INDEX and of the physicians’ clinical intuition showed a similar pattern (Supplementary Fig. 2A). A sensitivity analysis performed in the control group demonstrated that the RISK^INDEX statistically matched one and outperformed the other two clinical intuition questions (Supplementary Table 3). Notably, the prognostic accuracy for 31-day mortality of the ED physician’s clinical intuition varied by the clinician’s level of experience: it was higher in experienced internal medicine specialists (>6 years, AUROC: 0.77–0.83) than in residents (<6 years, AUROC: 0.72–0.77) (see Supplementary Table 4).

**Fig. 2: Prognostic accuracy for 31-day mortality of the RISK^INDEX compared with the ED physician’s clinical intuition.**

Despite overall high prognostic accuracy, the RISK^INDEX did not lead to policy changes. In only 1 patient (out of 644 in the intervention group; 0.16%), the ED physician indicated that the patient was reassessed during the ED visit because the reported RISK^INDEX was higher than that physician expected. Interestingly, the RISK^INDEX aligned with the ED physician’s expectations in just under half of the cases (n = 289, 46.8%). In the remaining cases, its predictions were either higher (n = 164, 26.6%) or lower (n = 152, 24.6%) than the physician expected.

Secondary analysis: comparison to traditional clinical prediction tools, clinical outcomes and feasibility

The RISK^INDEX demonstrated higher prognostic accuracy for 31-day mortality than traditional clinical prediction tools (Fig. 3). It showed a statistically higher AUROC than the NEWS score (AUROC 0.65 (95% CI 0.56–0.74), p < 0.001), APACHE II score (AUROC 0.65 (95% CI 0.51–0.78), p = 0.008), and SOFA score (AUROC 0.75 (95% CI 0.68–0.83), p = 0.017). These accuracies corresponded with an area under the precision-recall curve (AUPRC) of 0.33 [0.24–0.43] for the RISK^INDEX and between 0.13 [0.09–0.18] and 0.19 [0.12–0.28] for the clinical prediction tools (Supplementary Fig. 3). The distribution of the RISK^INDEX and the traditional prediction tools is shown in Supplementary Fig. 4.

**Fig. 3: Prognostic accuracy for 31-day mortality of the RISK^INDEX compared with traditional clinical prediction tools.**

Implementation of the RISK^INDEX had no effect on clinical outcomes, as reflected by unchanged treatment plans, hospital admission rates, length of stay, and ICU admissions (Table 1). This is consistent with a subanalysis showing that the perceived added value of the RISK^INDEX (scored using a Likert scale ranging from 1 to 10) was low, with a median score of 2 (IQR: 1–4) (Supplementary Fig. 5). Notably, the perceived value of the RISK^INDEX increased in patients who were considered moderate- to high-risk (p < 0.001 for trend, Supplementary Fig. 6).

Discussion

Rapid and accurate risk stratification in the ED is essential to optimize patient care. In this prospective randomized controlled study, we implemented the ML-based RISK^INDEX in the ED workflow and showed that the prognostic performance of the RISK^INDEX statistically matched or outperformed clinical intuition and well-known clinical prediction tools. Despite this statistically high performance, the RISK^INDEX did not alter treatment plans or clinical outcomes.

We report five major findings. First, the RISK^INDEX demonstrated robust prognostic accuracy for 31-day mortality, achieving an AUROC of 0.84 and performing statistically equally well or better than the clinical intuition questions, which showed an AUROC of 0.73–0.76 in the intervention group and 0.73–0.83 in the control group. Second, the RISK^INDEX statistically outperformed three well-known clinical prediction tools: NEWS (AUROC 0.65), APACHE II (AUROC 0.65) and SOFA score (AUROC 0.75); these latter AUROCs were similar to those reported in previous studies^{24,25,26,27,28}. Third, the potential benefit of the RISK^INDEX varied with the physician’s level of experience. A stratified post hoc analysis suggested that less experienced physicians are likely to benefit most from ML-guided decision support. This is particularly relevant as junior physicians often serve as primary caregivers in the ED. Fourth, the RISK^INDEX prediction aligned with the ED physicians’ expectations in only about half of cases. Despite this substantial discordance, clinicians perceived low added value. Fifth, despite the statistically higher or equal performance of the RISK^INDEX in comparison with clinical intuition questions and traditional risk scores, its clinical impact was absent, and integration of the RISK^INDEX in the ED workflow led to no adjustments in treatment plans. Although the trial was not powered for most clinical endpoints, its results emphasize that prognostic accuracy alone is insufficient to drive clinical outcomes.

Various reasons may underlie the gap between prognostic accuracy and clinical impact. The RISK^INDEX was developed to predict 31-day mortality, enabling a direct comparison with several commonly used prediction models in this study. However, this comes with a tradeoff: many ED decisions may be more strongly associated with short-term mortality measures. Hence, the limited impact on clinical outcomes may reflect suboptimal alignment between the 31-day mortality endpoint of the RISK^INDEX and the ED’s primary focus on stabilization and patient disposition. In a post-hoc analysis, however, we found similar results for prognostic accuracy for 7-day mortality.

The gap also underscores key challenges in the interaction between clinicians and (both AI and non-AI) prediction tools. Prediction tools such as the RISK^INDEX do not operate in isolation; they rely on and complement physicians’ intuition and expertise. Despite our efforts to minimize implementation barriers, the combination of substantial discordance between RISK^INDEX predictions and physicians’ expectations with near-absent policy changes confirms a low response to AI-based recommendations. Notably, perceived value increased in patients who were considered moderate- to high-risk. These observations emphasize the importance of tailored and user-centered implementation strategies that enhance clinician responsiveness to AI-driven insights.

Accordingly, our findings strongly argue for user-centered design across both model development and implementation. First, future studies should retarget the prediction to actionable decisions in the ED (e.g., 7-day adverse outcomes, admission vs discharge, or need for early escalation), co-specified with ED physicians. Second, the conversion of a 0–100 score into more directive, threshold-linked recommendations (e.g., site-calibrated ‘rule-out’ and ‘high-risk’ cut-points tied to explicit actions and acceptable miss rates) may be explored. For example, a hospital may define an acceptable rate of ‘low-risk’ misclassification (e.g., 1%) based on physicians’ risk tolerance for adverse events, resulting in a negative predictive value (NPV) of 99%. This NPV can be used to derive the corresponding RISK^INDEX rule-out threshold. Third, patient-level explanations and reliability information alongside recommendations can be considered to address counter-intuition and support trust. Last, the RISK^INDEX could be integrated earlier in the workflow (e.g. a dynamic electronic health record-embedded display when laboratory results return) to maximize actionability under time pressure. Future studies should undertake these steps in co-design with clinical end-users, such as frontline ED physicians, to maximize potential clinical value.
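As a minimal sketch of how such a rule-out threshold could be derived (this is not a procedure that was used in the trial), the Python snippet below scans candidate cut-points in a local calibration cohort and returns the highest RISK^INDEX value at or below which the observed NPV still meets the target. The function name `rule_out_threshold`, the toy cohort, and the choice to take the highest qualifying cut-point are illustrative assumptions.

```python
def rule_out_threshold(scores, died, target_npv=0.99):
    """Return the highest cut-point t such that patients with score <= t
    have an observed NPV (share of survivors) of at least target_npv.

    scores: risk scores on the 0-100 scale (hypothetical calibration cohort)
    died:   1 if the patient died within 31 days, else 0
    Returns None if no cut-point reaches the target.
    """
    best = None
    for t in sorted(set(scores)):
        # Outcomes of everyone who would be labelled 'low-risk' at cut-point t
        below = [d for s, d in zip(scores, died) if s <= t]
        npv = 1 - sum(below) / len(below)
        if npv >= target_npv:
            best = t  # keep scanning: a higher cut-point may also qualify
    return best


# Toy cohort (made-up numbers, for illustration only)
toy_scores = [1, 2, 3, 4, 5, 50, 60]
toy_died = [0, 0, 0, 0, 0, 1, 1]
print(rule_out_threshold(toy_scores, toy_died, target_npv=0.99))
```

In practice the observed NPV depends on local mortality prevalence, so a threshold derived this way would need to be recalibrated per site and checked in a separate validation sample.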

There are limitations to this study. First, the study was conducted in a single medical center, potentially limiting generalizability. However, previous multicenter validation of the RISK^INDEX suggests that its performance is robust after adaptation to each medical center’s population, despite local differences in patient demographics²¹. Second, the study took place in a crowded ED, and represented a convenience sample. To assess potential inclusion bias, we compared age, hospital admission rate and 31-day mortality of our study sample to all internal medicine patients in our ED during the period in which the study was performed, and found no relevant differences. Third, despite extensive physician briefing and direct, in-person presentation of RISK^INDEX predictions by the research team, the clinical impact of the RISK^INDEX was absent. As such, our implementation strategy serves as a valuable negative example—highlighting that even models with strong prognostic accuracy can fail to influence clinical practice when not fully aligned with clinical workflows and decision priorities—while offering important lessons for future research on implementation of AI models. Fourth, although our comparators (NEWS, SOFA and APACHE II) were not originally designed for use in the ED, they are among the few scores validated in ED populations and therefore constitute a potential benchmark^29,30,31,32.

In conclusion, this randomized trial demonstrates that the machine learning–based RISK^INDEX achieved statistically higher prognostic accuracy than traditional scores and matched or exceeded clinician intuition questions, yet had no impact on clinical decisions or outcomes. This gap between statistical performance and clinical impact underscores that AI tools must be actionable, timely, and trusted to influence care. Future efforts should couple accurate prediction with user-centered and decision-linked interventions, and test these in trials measuring not only predictive gains but also changes in clinician behavior, patient outcomes, and cost-effectiveness.

Methods

Study design and setting

The study was designed as an investigator-initiated, open-label, randomized, non-inferiority, clinical trial. The protocol of the Machine Learning for Risk Stratification in the Emergency Department (MARS-ED) study has been published previously²². In short, the study was performed at the ED of the Maastricht University Medical Center+ (MUMC+), which is a secondary/tertiary care medical center in the Netherlands, with 6800 patients visiting the ED for assessment by an internal medicine specialist each year.

This study was approved by the medical ethical committee (METC) of the MUMC+ (METC 21-068) and was registered at clinicaltrials.gov (NCT05497830). The study was conducted according to the principles of the Declaration of Helsinki and reported in line with the Consolidated Standards of Reporting Trials - Artificial Intelligence (CONSORT-AI) guidelines (Supplementary Table 5)³³. The first participant was enrolled on 16 September 2022, and the last participant on 17 July 2024.

Participants

Adult patients (≥18 years old) presenting to the ED who were assessed and treated by an internal medicine specialist and had at least four laboratory results were eligible for inclusion. Participants provided written informed consent. Patients were excluded when they revisited the ED within one month after the index ED presentation, since their revisit was included in the follow-up period of that index visit.

Randomization

We performed a randomized clinical trial to assess the prognostic accuracy regarding 31-day mortality of the RISK^INDEX, its clinical impact (policy changes) and its effect on secondary clinical outcomes (hospital admission, length of hospital stay, and admission to ICU). When a patient entered the ED for assessment and treatment by an internal medicine specialist, the patient was immediately assessed for eligibility and randomized as soon as informed consent was obtained. Patients were randomized to either the intervention or control group (standard care), using computer-generated permuted block randomization with a 1:1 allocation ratio. This study was not blinded, since the physician needed to be informed of the RISK^INDEX in order to assess the magnitude of the clinical impact of the RISK^INDEX.
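For readers unfamiliar with the technique, the following Python sketch illustrates block randomization with a 1:1 allocation ratio. It is not the trial's allocation software: the function name `permuted_block_sequence`, the block size of 4, and the seed are assumptions chosen for illustration, and the block size actually used is not stated here.

```python
import random


def permuted_block_sequence(n_patients, block_size=4, seed=None):
    """Generate a 1:1 allocation list using randomly permuted blocks.

    Each block contains block_size/2 'intervention' and block_size/2
    'control' assignments in random order, so the arms stay balanced
    after every completed block.
    """
    if block_size % 2 != 0:
        raise ValueError("block_size must be even for 1:1 allocation")
    rng = random.Random(seed)
    sequence = []
    while len(sequence) < n_patients:
        block = ["intervention"] * (block_size // 2) + ["control"] * (block_size // 2)
        rng.shuffle(block)  # random order within the block
        sequence.extend(block)
    return sequence[:n_patients]  # the final block may be cut short


allocation = permuted_block_sequence(12, block_size=4, seed=2022)
print(allocation)
```

Balance is guaranteed only at block boundaries; a truncated final block can leave the two arms slightly unequal in size.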

Study preparation

In the months preceding trial launch, we prepared ED physicians for RISK^INDEX use through briefings delivered during regular teaching and scientific meetings (n = 4). These briefings extensively covered model inputs (age, sex, routine laboratory tests), the interpretation of the 0–100 probability, key limitations and appropriate use (decision support rather than directive), and previously obtained results including comparison to physicians, multicenter predictive accuracy and explainability. Furthermore, we conducted two pilot projects: first, we surveyed ED physicians (n = 17) to compare alternative display formats on preference and clarity (Likert scales), including a calibrated probability display, a color-banded categorical display, and a gauge/decile format (Supplementary Fig. 7). Based on these ratings, we implemented a calibrated probability display that reports a single 0–100 value corresponding to the estimated probability of 31-day mortality. Second, physicians practiced with virtual cases to rehearse interpretation of the RISK^INDEX.

Study intervention and procedure

An overview of the patients’ timeline is shown in Supplementary Fig. 8. After complete assessment of the patient, the ED physicians were asked questions regarding their clinical intuition (Table 2). In the intervention group only, a member of the research team personally presented the RISK^INDEX to the attending physician, after the patient had been completely assessed and a preliminary treatment plan had been made. This team member informed the ED physician about the study, explained the RISK^INDEX variables, the meaning of the calculated probability, and the high predictive accuracy for 31-day mortality found in the previous multicenter study. To ensure that presenting the RISK^INDEX would not influence the ED physicians’ clinical intuition, the electronic case record form was designed in such a way that the RISK^INDEX could not be calculated until after the clinical intuition questions had been answered. Subsequently, these physicians were questioned about the alignment of the RISK^INDEX with their initial clinical intuition and any resulting changes in the treatment plan (Table 2). Finally, in a subgroup of 121 patients we asked the ED physicians about their perceived added value of the RISK^INDEX (using a Likert scale ranging from 1 to 10).

Table 2 Questionnaires regarding clinical intuition and medical treatment changes

RISK^INDEX

The RISK^INDEX is a machine learning (ML)-derived risk score that predicts all-cause 31-day mortality using routine laboratory tests ordered by the attending physician, along with basic patient characteristics (age and sex)²¹. The computed RISK^INDEX (value 0–100) corresponds to an individual’s probability of 31-day mortality.

Clinical prediction tools

To compare the prognostic accuracy for 31-day mortality of the RISK^INDEX with that of clinical prediction tools, we selected commonly used prediction tools based on their prevalence and global use. We selected the National Early Warning Score (NEWS), the Acute Physiology and Chronic Health Evaluation II (APACHE II), and the Sepsis-related Organ Failure Assessment (SOFA) score^6,7,8. Although originally derived for inpatient and intensive-care cohorts, these three scores are among the few that have been validated, and clinically used, in ED populations and were therefore used as comparators for our study^29,30,31,32.

Data collection

In short, we collected data on patient characteristics, comorbidity, triage category (based on the Manchester triage system (MTS)³⁴), reason for ED visit, vital signs, laboratory tests, and clinical endpoints (e.g., hospital admission, admission to intensive care unit (ICU) within 31 days, 31-day mortality). The answers to the questions of all questionnaires were immediately recorded in an electronic case record form. The retrieval of data regarding the results of laboratory tests and the results of the RISK^INDEX was automated. To ensure data quality, all data on outcomes, and a sample of all data was double checked by another member of the research team and/or the study monitor, and discrepancies were resolved through discussion with a second reviewer. Data monitoring was performed by the Clinical Trial Center Maastricht (CTCM).

Outcomes

The primary outcomes for this study were the prognostic accuracy for 31-day mortality and clinical impact of the RISK^INDEX. Secondary outcomes included the prognostic accuracy for 31-day mortality compared to clinical prediction tools, differences in clinical outcomes including hospital admission, length of hospital stay, admission to ICU, and feasibility of the RISK^INDEX.

Statistical analysis

Assuming a 31-day mortality of 8%, we calculated a required sample size of 1250 patients and anticipated including 1300 patients during the study period, based on ED patient volumes at MUMC+. With regard to policy changes, a post hoc power calculation showed that 388 patients in each arm were needed to detect a 2% rate of policy changes (80% power; 5% significance). In the current study, a total of 1303 patients were included, which would allow us to detect a 1.2% difference in policy changes (80% power). Baseline characteristics were analyzed using descriptive statistics. Categorical variables were reported as frequency counts with percentages and continuous variables as medians with interquartile range (IQR) or means with standard deviation (SD), depending on their distribution.
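The arithmetic behind such a power calculation can be illustrated with the standard normal-approximation formula for comparing two proportions. This is a sketch under stated assumptions, not the authors' documented method: it assumes a two-sided test, a pooled variance under the null hypothesis, and a policy-change rate of 0% in the control arm, none of which are specified in the text. The function name `n_per_arm` is hypothetical. Under these assumptions the formula returns 388 patients per arm for a 2% difference, which agrees with the figure reported above.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_arm(p_control, p_intervention, alpha=0.05, power=0.80):
    """Sample size per arm for a two-sided comparison of two proportions
    (normal approximation, pooled variance under the null hypothesis)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_control + p_intervention) / 2
    term_null = z_alpha * sqrt(2 * p_bar * (1 - p_bar))
    term_alt = z_beta * sqrt(
        p_control * (1 - p_control) + p_intervention * (1 - p_intervention)
    )
    return ceil((term_null + term_alt) ** 2 / (p_control - p_intervention) ** 2)


# Assumed scenario: no policy changes in the control arm, 2% with the score
print(n_per_arm(0.0, 0.02))
```

Applying the same function to a 1.2% difference gives roughly 650 patients per arm, close to the 644 and 659 patients actually randomized.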

Our primary analysis assessed the prognostic accuracy and clinical impact of the RISK^INDEX. The prognostic accuracy for 31-day mortality was compared against physicians’ clinical intuition by calculating the areas under the receiver operating characteristic curve (AUROC) in the intervention group. The AUROCs were compared by using the method of DeLong³⁵. The precision-recall curve (with area under this precision-recall curve (AUPRC)) was employed to evaluate the balance between sensitivity and positive predictive value³⁶. As a post-hoc analysis, the prognostic accuracy of the RISK^INDEX was also compared against the physician’s clinical intuition to predict 7-day mortality. In a further post-hoc analysis, clinical intuition was analyzed by stratifying physicians’ experience into two categories: 0–6 years (residents in specialist training) and >6 years (medical specialists). Intrinsic to the evaluation of the prognostic accuracy of a risk score, the clinical actions guided by that risk score could have influenced the primary outcome of 31-day mortality, thereby obscuring its true performance. Therefore, a sensitivity analysis was conducted within the control group to ensure an unbiased evaluation unaffected by clinical actions guided by the results of the RISK^INDEX. The clinical impact was evaluated by assessing the number and type of policy changes in the ED after presentation of the RISK^INDEX. Although not powered for clinical endpoints, the randomized controlled design allowed a direct exploratory evaluation of the effect of the RISK^INDEX on secondary clinical endpoints. Feasibility of the RISK^INDEX was assessed by physicians’ alignment and its perceived added value. The alignment between the RISK^INDEX and ED physicians’ expectations was evaluated by asking physicians whether the RISK^INDEX result matched, exceeded, or fell below their expectations. The perceived added value of the RISK^INDEX was evaluated by asking ED physicians to rate its usefulness on a Likert scale from 1 to 10. This assessment was conducted in a subgroup of 114 patients.
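To make the two accuracy metrics concrete, the sketch below computes the AUROC (as the probability that a randomly chosen non-survivor receives a higher score than a randomly chosen survivor) and the AUPRC (as step-wise average precision) in plain Python. It also compares two predictors on the same patients with a paired percentile bootstrap; this is a simpler stand-in and is explicitly not the DeLong test used in the study. The analyses themselves were run in R; all function names and the toy data here are hypothetical.

```python
import random


def auroc(scores, labels):
    """AUROC as P(score of a death > score of a survivor); ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(scores, labels):
    """AUPRC as average precision: sum of (recall gain x precision) over
    thresholds, treating tied scores as a single threshold."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    n_pos = sum(labels)
    tp = fp = 0
    ap = 0.0
    i = 0
    while i < len(ranked):
        j, tp_here = i, 0
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp_here += ranked[j][1]
            j += 1
        tp += tp_here
        fp += (j - i) - tp_here
        ap += (tp_here / n_pos) * (tp / (tp + fp))
        i = j
    return ap


def bootstrap_auroc_difference(scores_a, scores_b, labels, n_boot=1000, seed=0):
    """95% percentile interval for AUROC(a) - AUROC(b), resampling patients
    with replacement (paired bootstrap, NOT the DeLong method)."""
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[k] for k in idx]
        if sum(y) in (0, n):
            continue  # AUROC is undefined without both outcome classes
        a = auroc([scores_a[k] for k in idx], y)
        b = auroc([scores_b[k] for k in idx], y)
        diffs.append(a - b)
    diffs.sort()
    lo = diffs[int(0.025 * len(diffs))]
    hi = diffs[min(len(diffs) - 1, int(0.975 * len(diffs)))]
    return lo, hi


# Toy data (made up): 1 = died within 31 days
died = [1, 0, 1, 0, 0, 1, 0, 0]
model = [0.90, 0.80, 0.70, 0.60, 0.20, 0.85, 0.10, 0.30]
intuition = [0.60, 0.70, 0.50, 0.40, 0.30, 0.80, 0.20, 0.65]
print(auroc(model, died), average_precision(model, died))
print(bootstrap_auroc_difference(model, intuition, died))
```

Because the AUPRC depends on outcome prevalence (about 7% mortality in this trial), its values are not comparable across cohorts in the way AUROC values are.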

All analyses were performed in R, version 4.1.3 (The R Foundation for Statistical Computing). Source code for the data analysis and in-house developed interface is available in a public repository (https://github.com/wptmdoorn/marsedstudy).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

Data associated with this study are provided in the manuscript and/or Supplementary Information. Individual-level data involving human participants are not publicly available because they contain potentially identifying or sensitive clinical information. De-identified participant-level data and supporting clinical documents will be made available to external researchers upon request, after review and approval by the sponsor and the medical ethics committee, and following the signing of a data access and/or materials transfer agreement. Requests should be directed to the corresponding author.

Code availability

The source code for the data analysis and in-house developed interface is available in a public repository (https://github.com/wptmdoorn/marsedstudy).

References

van der Linden, N. et al. Effects of emergency department crowding on the delivery of timely care in an inner-city hospital in the Netherlands. Eur. J. Emerg. Med. 23, 337–343 (2016).
Article ADS PubMed Google Scholar
Ter Avest, E., Onnes, B. T., van der Vaart, T. & Land, M. J. Hurry up, it’s quiet in the emergency department. Neth. J. Med. 76, 32–35 (2018).
PubMed Google Scholar
Guttmann, A., Schull, M. J., Vermeulen, M. J. & Stukel, T. A. Association between waiting times and short term mortality and hospital admission after departure from emergency department: population based cohort study from Ontario, Canada. BMJ 342, d2983 (2011).
Article PubMed PubMed Central Google Scholar
Liew, D., Liew, D. & Kennedy, M. P. Emergency department length of stay independently predicts excess inpatient length of stay. Med J. Aust. 179, 524–526 (2003).
Article PubMed Google Scholar
Sun, B. C. et al. Effect of emergency department crowding on outcomes of admitted patients. Ann. Emerg. Med 61, 605–611 e606 (2013).
Article PubMed Google Scholar
Royal College of Physicians. National Early Warning Score (NEWS): Standardising the assessment of acute illness severity in the NHS Report of a working party. London: RCP, (2012).
Vincent, J. L. et al. The SOFA (Sepsis-related Organ Failure Assessment) score to describe organ dysfunction/failure. On behalf of the Working Group on Sepsis-Related Problems of the European Society of Intensive Care Medicine. Intensive Care Med 22, 707–710 (1996).
Article PubMed Google Scholar
Knaus, W. A., Draper, E. A., Wagner, D. P. & Zimmerman, J. E. APACHE II: a severity of disease classification system. Crit. Care Med 13, 818–829 (1985).
Article PubMed Google Scholar
van Geffen, M., van der Waaij, K. M. & Stassen, P. M. Number, nature & impact of incoming telephone calls on residents and their work during evening shifts. Acute Med 21, 5–11 (2022).
Article PubMed Google Scholar
Laxmisan, A. et al. The multitasking clinician: decision-making and cognitive demand during and after team handoffs in emergency care. Int J. Med Inf. 76, 801–811 (2007).
Chisholm, C. D., Collison, E. K., Nelson, D. R. & Cordell, W. H. Emergency department workplace interruptions: are emergency physicians “interrupt-driven” and “multitasking”? Acad. Emerg. Med 7, 1239–1243 (2000).
Horne, B. D. et al. Exceptional mortality prediction by risk scores from common laboratory tests. Am. J. Med 122, 550–558 (2009).
Challen, K. & Goodacre, S. W. Predictive scoring in non-trauma emergency patients: a scoping review. Emerg. Med J. 28, 827–837 (2011).
Chan, S. L. et al. Implementation of Prediction Models in the Emergency Department from an Implementation Science Perspective-Determinants, Outcomes, and Real-World Impact: A Scoping Review. Ann. Emerg. Med 82, 22–36 (2023).
Xie, F. et al. Development and Assessment of an Interpretable Machine Learning Triage Tool for Estimating Mortality After Emergency Admissions. JAMA Netw. Open 4, e2118467 (2021).
Yu, J. Y. et al. Inter hospital external validation of interpretable machine learning based triage score for the emergency department using common data model. Sci. Rep. 14, 6666 (2024).
Sanchez-Salmeron, R. et al. Machine learning methods applied to triage in emergency services: A systematic review. Int Emerg. Nurs. 60, 101109 (2022).
Naemi, A. et al. Machine learning techniques for mortality prediction in emergency departments: a systematic review. BMJ Open 11, e052663 (2021).
Taylor, R. A. et al. Prediction of In-hospital Mortality in Emergency Department Patients With Sepsis: A Local Big Data-Driven, Machine Learning Approach. Acad. Emerg. Med 23, 269–278 (2016).
Shafaf, N. & Malek, H. Applications of Machine Learning Approaches in Emergency Medicine; a Review Article. Arch. Acad. Emerg. Med 7, 34 (2019).
van Doorn, W. P. T. M. et al. Explainable Machine Learning Models for Rapid Risk Stratification in the Emergency Department: A Multicenter Study. J. Appl Lab Med 9, 212–222 (2024).
van Dam, P. et al. Machine learning for risk stratification in the emergency department (MARS-ED) study protocol for a randomized controlled pilot trial on the implementation of a prediction model based on machine learning technology predicting 31-day mortality in the emergency department. Scand. J. Trauma Resusc. Emerg. Med 32, 5 (2024).
van Doorn, W. et al. A comparison of machine learning models versus clinical evaluation for mortality prediction in patients with sepsis. PLoS One 16, e0245157 (2021).
Zhou, Q., Chen, Z. H., Cao, Y. H. & Peng, S. Clinical impact and quality of randomized controlled trials involving interventions evaluating artificial intelligence prediction tools: a systematic review. NPJ Digit Med 4, 154 (2021).
Benedetto, U. et al. Machine learning improves mortality risk prediction after cardiac surgery: Systematic review and meta-analysis. J. Thorac. Cardiovasc Surg. 163, 2075–2087 e2079 (2022).
Shin, S. et al. Machine learning vs. conventional statistical models for predicting heart failure readmission and mortality. ESC Heart Fail 8, 106–115 (2021).
Shung, D. L. et al. Validation of a Machine Learning Model That Outperforms Clinical Risk Scoring Systems for Upper Gastrointestinal Bleeding. Gastroenterology 158, 160–167 (2020).
Xu, X. et al. Radiomic analysis of contrast-enhanced CT predicts microvascular invasion and outcome in hepatocellular carcinoma. J. Hepatol. 70, 1133–1144 (2019).
Guan, G., Lee, C. M. Y., Begg, S., Crombie, A. & Mnatzaganian, G. The use of early warning system scores in prehospital and emergency department settings to predict clinical deterioration: A systematic review and meta-analysis. PLoS One 17, e0265559 (2022).
Ruangsomboon, O. et al. The utility of the Rapid Emergency Medicine Score (REMS) compared with three other early warning scores in predicting in-hospital mortality among COVID-19 patients in the emergency department: a multicenter validation study. BMC Emerg. Med 23, 45 (2023).
van Dam, P. et al. Head-to-head comparison of 19 prediction models for short-term outcome in medical patients in the emergency department: a retrospective study. Ann. Med 55, 2290211 (2023).
Xie, F. et al. Benchmarking emergency department prediction models with machine learning and public electronic health records. Sci. Data 9, 658 (2022).
Liu, X. et al. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Nat. Med 26, 1364–1374 (2020).
Cooke, M. W. & Jinks, S. Does the Manchester triage system detect the critically ill? J. Accid. Emerg. Med 16, 179–181 (1999).
DeLong, E. R., DeLong, D. M. & Clarke-Pearson, D. L. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 44, 837–845 (1988).
Romero-Brufau, S., Huddleston, J. M., Escobar, G. J. & Liebow, M. Why the C-statistic is not informative to evaluate early warning scores and what metrics to use. Crit. Care 19, 285 (2015).
Downar, J., Goldman, R., Pinto, R., Englesakis, M. & Adhikari, N. K. The “surprise question” for predicting death in seriously ill patients: a systematic review and meta-analysis. CMAJ 189, E484–E493 (2017).
Ouchi, K. et al. Association of Emergency Clinicians’ Assessment of Mortality Risk With Actual 1-Month Mortality Among Older Adults Admitted to the Hospital. JAMA Netw. Open 2, e1911139 (2019).
Zelis, N. et al. Short-term mortality in older medical emergency patients can be predicted using clinical intuition: A prospective study. PLoS One 14, e0208741 (2019).
Barais, M. et al. Gut Feelings Questionnaire in daily practice: a feasibility study using a mixed-methods approach in three European countries. BMJ Open 8, e023488 (2018).


Acknowledgements

We thank all of the participants who were screened for and participated in this clinical trial. We thank S. van Kuijk for his help in designing this clinical trial. We acknowledge M. Nobel and S. Puts for providing the server for the online study interface, and R. Verlinden and J. Bleus for their assistance in linking the laboratory information system with the online study interface. The authors are also grateful to the following medical students for their large role prior to and during this clinical trial: A. van de Koolwijk, B. van Beek, B. Bugter, C. Hovens, E. Vangelder, F. van Gils, F. Dijkstra, I. Welie, J. Weijers, J. van Welij, L. Vondenhoff, M. van Rijswijk, M. van Roosendaal, M. Ronner, P. Ummels, P. Smeets, R.L. Pena, S. Cimino, S. Lievens, W. Luimes and F. Dijkstra.

Author information

These authors contributed equally: Paul M. E. L. van Dam, William P. T. M. van Doorn.
These authors jointly supervised this work: Patricia M. Stassen, Steven J. R. Meex.

Authors and Affiliations

Department of Internal Medicine, Division of General Internal Medicine, Section Acute Medicine, Maastricht University Medical Center +, Maastricht, Netherlands
Paul M. E. L. van Dam, Lotte Sevenich, Lars Lambriks & Patricia M. Stassen
Department of Clinical Chemistry, Central Diagnostic Laboratory, Maastricht University Medical Center +, Maastricht, Netherlands
William P. T. M. van Doorn, Otto Bekers & Steven J. R. Meex
Department of Family Medicine, Care and Public Health Research Institute (CAPHRI), Maastricht University, Maastricht, Netherlands
Jochen W. L. Cals & Patricia M. Stassen
Cardiovascular Research Institute Maastricht (CARIM), Maastricht University, Maastricht, Netherlands
Otto Bekers & Steven J. R. Meex

Authors

Paul M. E. L. van Dam
William P. T. M. van Doorn
Lotte Sevenich
Lars Lambriks
Jochen W. L. Cals
Otto Bekers
Patricia M. Stassen
Steven J. R. Meex

Contributions

P.V.D. contributed to concept and design of the study; acquisition, checking, analysis and interpretation of data; and drafting of the manuscript. W.V.D. contributed to concept and design of the study; creating the online study interface; analysis and interpretation of data; and drafting the manuscript. L.S. contributed to drafting the study protocol; the acquisition and checking of data; and critical revision of the manuscript. L.L. contributed to drafting the study protocol; the acquisition and checking of data; and critical revision of the manuscript. J.C. contributed to concept and design of the study; interpretation of data; and critical revision of the manuscript. O.B. contributed to drafting the study protocol; and critical revision of the manuscript. P.S. contributed to concept and design of the study; interpretation of data; supervision; and drafting of the manuscript. S.M. contributed to concept and design of the study; interpretation of data; supervision; and drafting of the manuscript. All authors vouched for data accuracy and fidelity to the study protocol; and read and approved the submitted version of the manuscript.

Corresponding author

Correspondence to Paul M. E. L. van Dam.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Reporting Summary

Transparent Peer Review file

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


About this article

Cite this article

van Dam, P.M.E.L., van Doorn, W.P.T.M., Sevenich, L. et al. Machine learning for risk stratification in the emergency department (MARS-ED): a randomized controlled trial. Nat Commun 17, 242 (2026). https://doi.org/10.1038/s41467-025-66947-7


Received: 12 May 2025
Accepted: 19 November 2025
Published: 01 December 2025
Version of record: 08 January 2026
DOI: https://doi.org/10.1038/s41467-025-66947-7