Introduction

Emergency department (ED) visits are increasing globally. Crowding in the ED, the associated prolonged waiting times, and the demographic shift towards older age increase the complexity of decision-making and the risk of adverse outcomes1,2,3,4,5. Therefore, rapid risk stratification is key to improving patient outcomes and optimizing resource allocation.

Risk stratification in the ED relies on a combination of clinical intuition and objective measures such as vital signs and laboratory results. To standardize this process and enhance decision-making amidst multitasking in the ED, various clinical prediction tools have been developed. Commonly used examples are the National Early Warning Score (NEWS), the Acute Physiology and Chronic Health Evaluation II (APACHE II), and the Sepsis-related Organ Failure Assessment (SOFA) score6,7,8,9,10,11,12. These scores complement clinical intuition in the provision of ED care. Nevertheless, they have some important limitations. They are often tailored to specific patient populations, which limits their generalizability and precision13. Additionally, they require substantial input (i.e., vital signs and the results of non-routine laboratory tests), making accurate calculation in the ED challenging. Furthermore, relatively little progress has been made in the implementation and integration of prediction tools into clinical practice14.

To address the need for a more accessible and universally applicable prediction tool, machine learning (ML)-based prediction tools have been introduced15,16,17,18. Laboratory test results have previously been shown to yield promising prognostic models; notably, the pioneering work of Horne and colleagues leveraged routine tests like complete blood count and basic metabolic profile to develop a highly predictive risk score for mortality, laying a critical foundation for modern AI-driven clinical decision support tools12. In line with these findings, we have recently developed the RISKINDEX, an ML-based risk score to assess patient risk in the ED19,20,21,22,23. The RISKINDEX predicts 31-day mortality based on routine laboratory tests, age, and sex. Its value (ranging from 0 to 100) reflects the probability of 31-day mortality for an individual. The RISKINDEX becomes automatically available alongside lab results for ED patients with at least four laboratory tests ordered, helping to streamline and standardize decision-making. Although the RISKINDEX has been externally validated in three other medical centers, demonstrating consistently excellent prognostic accuracy for 31-day mortality21, it is unknown whether the RISKINDEX outperforms clinical intuition and traditional clinical prediction tools when integrated in the routine ED workflow.

In the Machine Learning for Risk Stratification in the Emergency Department (MARS-ED) study, we employed a randomized controlled trial to evaluate the prognostic accuracy and clinical impact of the RISKINDEX. This design was chosen for its dual advantages: it allowed us to isolate the predictive performance of the RISKINDEX by minimizing confounding from clinical actions triggered by its use, while also enabling a direct assessment of its influence on clinical outcomes. In this work, we show that the RISKINDEX statistically matches or outperforms both clinical intuition and traditional prediction tools, yet its integration in routine care does not alter clinical decisions or outcomes. These findings underscore that prognostic accuracy alone is insufficient to achieve clinical impact.

Results

Study sample

We enrolled a convenience sample of 1303 adult ED patients who were randomly assigned to the intervention group (644 participants) or the control group (659 participants) (Fig. 1 and Table 1). Patients in the control group received usual ED care. For patients in the intervention group, the attending physician was shown the RISKINDEX result and could therefore base medical treatment on the RISKINDEX in addition to routine care. The median age was 69 years (IQR: 58–79). In total, 814 patients (62.5%) were admitted to the hospital. The most common reason for the ED visit and admission to the hospital was infectious disease (n = 376, 27.8%). Details of the requested laboratory tests are provided in Supplementary Table 1. In total, 90 patients (6.9%) died within 31 days after the ED visit, 45 patients in each group.

Fig. 1: Flowchart of the study sample.

When entering the ED, the patient was immediately assessed for eligibility and randomized as soon as informed consent was obtained. Patients were randomized to either the intervention group (standard care plus RISKINDEX) or the control group (standard care).

Table 1 Characteristics of the study sample

There were no significant differences in age, hospital admission rate and 31-day mortality between enrolled and non-enrolled patients (Supplementary Table 2).

Primary analysis: prognostic accuracy and clinical impact of RISKINDEX

To evaluate the RISKINDEX in the routine ED workflow, we compared its prognostic accuracy for 31-day mortality against physicians’ clinical intuition. The RISKINDEX demonstrated high prognostic accuracy, achieving an area under the receiver operating characteristics curve (AUROC) of 0.84 (95% CI: 0.78–0.90). This statistically matched or outperformed that of the ED physician’s clinical intuition (Fig. 2), which was assessed through three distinct questions designed to capture the clinician’s level of concern, perceived severity of illness, and degree of surprise (see Methods). The “concern” question showed an AUROC of 0.74 (95% CI 0.68–0.81, p = 0.017), the “severity” question an AUROC of 0.76 (95% CI 0.68–0.83, p = 0.05), and the “surprise” question an AUROC of 0.73 (95% CI 0.66–0.80, p = 0.007), all below the accuracy of the RISKINDEX. These accuracies corresponded with an area under the precision-recall curve (AUPRC) of 0.33 [0.23–0.43] for the RISKINDEX and between 0.20 [0.14–0.28] and 0.22 [0.16–0.31] for the physicians’ clinical intuition questions (Supplementary Fig. 1). The analysis for 7-day mortality yielded similar results for the RISKINDEX and the physicians’ clinical intuition (Supplementary Fig. 2A). A sensitivity analysis performed in the control group demonstrated that the RISKINDEX statistically matched one and outperformed the other two clinical intuition questions (Supplementary Table 3). Notably, the prognostic accuracy for 31-day mortality of the ED physician’s clinical intuition varied by the clinician’s level of experience: it was higher in experienced internal medicine specialists (>6 years, AUROC: 0.77–0.83) than in residents (<6 years, AUROC: 0.72–0.77) (see Supplementary Table 4).

Fig. 2: Prognostic accuracy for 31-day mortality of the RISKINDEX compared with the ED physician’s clinical intuition.

The RISKINDEX demonstrated high prognostic accuracy, achieving an area under the receiver operating characteristics curve (AUROC) of 0.84 (95% CI: 0.78–0.90). This statistically matched or outperformed that of the ED physician’s clinical intuition, assessed through three distinct questions: the “concern” question showed an AUROC of 0.74 (95% CI 0.68–0.81, p = 0.017), the “severity” question an AUROC of 0.76 (95% CI 0.68–0.83, p = 0.05), and the “surprise” question an AUROC of 0.73 (95% CI 0.66–0.80, p = 0.007). The AUROCs were compared by using the method of DeLong.

Despite overall high prognostic accuracy, the RISKINDEX did not lead to policy changes. In only 1 patient (out of 644 in the intervention group; 0.16%), the ED physician indicated that the patient was reassessed during the ED visit as a result of a reported RISKINDEX being higher than that physician expected. Interestingly, the RISKINDEX aligned with the ED physician’s expectations in only half of the cases (n = 289, 46.8%). In the remaining cases, its predictions were either higher (n = 164, 26.6%) or lower (n = 152, 24.6%) than the physician expected.

Secondary analysis: comparison to traditional clinical prediction tools, clinical outcomes and feasibility

The RISKINDEX demonstrated higher prognostic accuracy for 31-day mortality than traditional clinical prediction tools (Fig. 3). It showed a statistically higher AUROC than the NEWS score (AUROC 0.65 (95% CI 0.56–0.74), p < 0.001), APACHE II score (AUROC 0.65 (95% CI 0.51–0.78), p = 0.008), and SOFA score (AUROC 0.75 (95% CI 0.68–0.83), p = 0.017). These accuracies corresponded with an area under the precision-recall curve (AUPRC) of 0.33 [0.24–0.43] for the RISKINDEX and between 0.13 [0.09–0.18] and 0.19 [0.12–0.28] for the clinical prediction tools (Supplementary Fig. 3). The distribution of the RISKINDEX and the traditional prediction tools is shown in Supplementary Fig. 4.

Fig. 3: Prognostic accuracy for 31-day mortality of the RISKINDEX compared with traditional clinical prediction tools.

The RISKINDEX demonstrated higher prognostic accuracy for 31-day mortality (AUROC 0.84 (95% CI: 0.78–0.90)) compared to traditional clinical prediction tools. It showed a statistically higher AUROC than the NEWS score (AUROC 0.65 (95% CI 0.56–0.74), p < 0.001), APACHE II score (AUROC 0.65 (95% CI 0.51–0.78), p = 0.008), and SOFA score (AUROC 0.75 (95% CI 0.68–0.83), p = 0.017). The AUROCs were compared by using the method of DeLong.

Implementation of the RISKINDEX had no effect on clinical outcomes, as reflected by unchanged treatment plans, hospital admission rates, length of stay, and ICU admissions (Table 1). This is consistent with a subanalysis showing that the perceived added value of the RISKINDEX (scored using a Likert scale ranging from 1 to 10) was low, with a median score of 2 (IQR: 1–4) (Supplementary Fig. 5). Notably, the perceived value of the RISKINDEX increased in patients who were considered moderate- to high-risk (p < 0.001 for trend, Supplementary Fig. 6).

Discussion

Rapid and accurate risk stratification in the ED is essential to optimize patient care. In this prospective randomized controlled study, we implemented the ML-based RISKINDEX in the ED workflow and showed that the prognostic performance of the RISKINDEX statistically matched or outperformed clinical intuition and well-known clinical prediction tools. Despite this high statistical performance, the RISKINDEX did not alter treatment plans or clinical outcomes.

We report five major findings: first, the RISKINDEX demonstrated robust prognostic accuracy for 31-day mortality, achieving an AUROC of 0.84, performing statistically equally well or better than clinical intuition questions, which showed an AUROC of 0.73–0.76 in the intervention group and 0.73–0.83 in the control group. Second, the RISKINDEX statistically outperformed three well-known clinical prediction tools: NEWS (AUROC 0.65), APACHE II (AUROC 0.70) and SOFA score (AUROC 0.75), while these latter AUROCs were similar to those reported in previous studies24,25,26,27,28. Third, the potential benefit of the RISKINDEX varied based on the physician’s level of experience. A stratified post hoc analysis revealed that less experienced physicians are likely to benefit most from ML guided decision support. This is particularly relevant as junior physicians often serve as primary caregivers in the ED. Fourth, the RISKINDEX prediction aligned with the ED physicians’ expectations in only about half of cases. Despite this substantial discordance, clinicians perceived low added value. Fifth, despite statistically higher or equal performance of the RISKINDEX in comparison with clinical intuition questions and traditional risk scores, its clinical impact was absent and integration of the RISKINDEX in the ED workflow led to no adjustments in treatment plans. Although the trial was not powered for most clinical endpoints, its results emphasize that prognostic accuracy alone is insufficient to drive clinical outcomes.

Various reasons may underlie the gap between prognostic accuracy and clinical impact. The RISKINDEX was developed to predict 31-day mortality, enabling a direct comparison with several commonly used prediction models in this study. However, this comes with a tradeoff: many ED decisions may be more strongly associated with short-term mortality measures. Hence, the limited impact on clinical outcomes may reflect suboptimal alignment between the 31-day mortality endpoint of the RISKINDEX and the ED’s primary focus on stabilization and patient disposition. In a post hoc analysis, however, we found similar results for prognostic accuracy for 7-day mortality.

The gap also underscores key challenges in the interaction between clinicians and (both AI and non-AI) prediction tools. Prediction tools such as the RISKINDEX do not operate in isolation; they rely on and complement physicians’ intuition and expertise. Despite our efforts to minimize implementation barriers, the substantial discordance between RISKINDEX predictions and physicians’ expectations, together with the near absence of resulting policy changes, confirms a low response to AI-based recommendations. Notably, perceived value increased in patients who were considered moderate- to high-risk. These observations emphasize the importance of tailored and user-centered implementation strategies that enhance clinician responsiveness to AI-driven insights.

Accordingly, our findings strongly argue for user-centered design across both model development and implementation. First, future studies should retarget the prediction to actionable decisions in the ED (e.g., 7-day adverse outcomes, admission vs discharge, or need for early escalation), co-specified with ED physicians. Second, the conversion of a 0–100 score into more directive, threshold-linked recommendations (e.g., site-calibrated ‘rule-out’ and ‘high-risk’ cut-points tied to explicit actions and acceptable miss rates) may be explored. For example, a hospital may define an acceptable rate of ‘low-risk’ misclassification (e.g., 1%) based on physicians’ risk tolerance for adverse events, which corresponds to a target negative predictive value (NPV) of 99%. This NPV can be used to derive the corresponding RISKINDEX rule-out threshold. Third, patient-level explanations and reliability information alongside recommendations can be considered to address counter-intuitive predictions and support trust. Last, the RISKINDEX could be integrated earlier in the workflow (e.g., a dynamic electronic health record-embedded display when laboratory results return) to maximize actionability under time pressure. Future studies should undertake these steps in co-design with clinical end-users, such as frontline ED physicians, to maximize potential clinical value.
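
The derivation of a rule-out threshold from a target NPV, as described above, can be illustrated with a minimal sketch. It is not part of the study analysis: the function name `rule_out_threshold`, the rule "score at or below the cut-point counts as low-risk", and the small calibration cohort are all hypothetical and chosen only for illustration. Because the observed NPV need not change monotonically with the cut-point, the sketch simply reports the largest cut-point that still meets the target.

```python
def rule_out_threshold(scores, died, target_npv=0.99):
    """Return the largest cut-point t for which patients scoring <= t
    reach the target NPV, or None if no cut-point qualifies."""
    best = None
    for t in sorted(set(scores)):
        outcomes_below = [d for s, d in zip(scores, died) if s <= t]
        survivors = len(outcomes_below) - sum(outcomes_below)
        # NPV = survivors among all patients labelled low-risk at this cut-point
        if survivors / len(outcomes_below) >= target_npv:
            best = t
    return best


# Hypothetical calibration cohort (made-up numbers, NOT trial data):
# (score, number of patients, deaths within 31 days)
cohort = [(2, 150, 1), (10, 50, 0), (30, 40, 4), (80, 10, 6)]
scores, died = [], []
for score, n_patients, deaths in cohort:
    scores += [score] * n_patients
    died += [1] * deaths + [0] * (n_patients - deaths)

print(rule_out_threshold(scores, died, target_npv=0.99))  # prints 10
```

In practice such a threshold would have to be derived on a site-specific calibration cohort and validated prospectively before being tied to any clinical action.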

There are limitations to this study. First, the study was conducted in a single medical center, potentially limiting generalizability. However, previous multicenter validation of the RISKINDEX suggests that its performance is robust after adaptation to each medical center’s population, despite local differences in patient demographics21. Second, the study took place in a crowded ED and enrolled a convenience sample. To assess potential inclusion bias, we compared age, hospital admission rate and 31-day mortality of our study sample to all internal medicine patients in our ED during the period in which the study was performed, and found no relevant differences. Third, despite extensive physician briefing and direct, in-person presentation of RISKINDEX predictions by the research team, the clinical impact of the RISKINDEX was absent. As such, our implementation strategy serves as a valuable negative example: it highlights that even models with strong prognostic accuracy can fail to influence clinical practice when not fully aligned with clinical workflows and decision priorities, and it offers important lessons for future research on implementation of AI models. Fourth, although our comparators (NEWS, SOFA and APACHE II) were not originally designed for use in the ED, they are among the few scores validated in ED populations and therefore constitute a potential benchmark29,30,31,32.

In conclusion, this randomized trial demonstrates that the machine learning–based RISKINDEX achieved statistically higher prognostic accuracy than traditional scores and matched or exceeded clinician intuition questions, yet had no impact on clinical decisions or outcomes. This gap between statistical performance and clinical impact underscores that AI tools must be actionable, timely, and trusted to influence care. Future efforts should couple accurate prediction with user-centered and decision-linked interventions, and test these in trials measuring not only predictive gains but also changes in clinician behavior, patient outcomes, and cost-effectiveness.

Methods

Study design and setting

The study was designed as an investigator-initiated, open-label, randomized, non-inferiority clinical trial. The protocol of the Machine Learning for Risk Stratification in the Emergency Department (MARS-ED) study has been published previously22. In short, the study was performed at the ED of the Maastricht University Medical Center+ (MUMC+), which is a secondary/tertiary care medical center in the Netherlands, with 6800 patients visiting the ED for assessment by an internal medicine specialist each year.

This study was approved by the medical ethical committee (METC) of the MUMC+ (METC 21-068) and was registered at clinicaltrials.gov (NCT05497830). The study was conducted according to the principles of the Declaration of Helsinki and reported in line with the Consolidated Standards of Reporting Trials - Artificial Intelligence (CONSORT-AI) guidelines (Supplementary Table 5)33. The first participant was enrolled on 16 September 2022 and the last on 17 July 2024.

Participants

Adult patients (≥18 years old) presenting to the ED who were assessed and treated by an internal medicine specialist with at least four laboratory results were eligible for inclusion. Participants provided written informed consent. Patients were excluded when they revisited the ED within one month after the index ED presentation, since their revisit was included in the follow-up period of that index visit.

Randomization

We performed a randomized clinical trial to assess the prognostic accuracy regarding 31-day mortality of the RISKINDEX, its clinical impact (policy changes) and its effect on secondary clinical outcomes (hospital admission, length of hospital stay, and admission to ICU). When a patient entered the ED for assessment and treatment by an internal medicine specialist, the patient was immediately assessed for eligibility and randomized as soon as informed consent was obtained. Patients were randomized to either the intervention group (standard care plus RISKINDEX) or the control group (standard care), using a computer-generated permuted block randomization with a 1:1 allocation ratio. This study was not blinded, since the physician needed to be informed of the RISKINDEX in order to assess the magnitude of the clinical impact of the RISKINDEX.
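
For readers unfamiliar with block randomization with a 1:1 allocation ratio, the following is a minimal illustrative sketch. It is not the trial's randomization software: the block size of 4, the seed, and the function name `permuted_block_schedule` are assumptions made for the example, as the block size used in the study is not stated here.

```python
import random


def permuted_block_schedule(n_participants, block_size=4, seed=2022):
    """Build a 1:1 allocation list from independently shuffled blocks.

    block_size and seed are illustrative assumptions, not trial settings.
    """
    if block_size % 2 != 0:
        raise ValueError("block_size must be even for a 1:1 allocation ratio")
    rng = random.Random(seed)  # fixed seed makes the schedule reproducible
    schedule = []
    while len(schedule) < n_participants:
        half = block_size // 2
        block = ["intervention"] * half + ["control"] * half
        rng.shuffle(block)  # random order within each balanced block
        schedule.extend(block)
    return schedule[:n_participants]


schedule = permuted_block_schedule(1303)
print(schedule.count("intervention"), schedule.count("control"))
```

Within every complete block the two arms are exactly balanced, so the arm sizes of the generated schedule can differ only through the final, incomplete block.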

Study preparation

In the months preceding trial launch, we prepared ED physicians for RISKINDEX use through briefings delivered during regular teaching and scientific meetings (n = 4). These briefings extensively covered model inputs (age, sex, routine laboratory tests), the interpretation of the 0–100 probability, key limitations and appropriate use (decision support rather than directive), and previously obtained results including comparison to physicians, multicenter predictive accuracy and explainability. Furthermore, we conducted two pilot projects: first, we surveyed ED physicians (n = 17) to compare alternative display formats on preference and clarity (Likert scales), including a calibrated probability display, a color-banded categorical display, and a gauge/decile format (Supplementary Fig. 7). Based on these ratings, we implemented a calibrated probability display that reports a single 0–100 value corresponding to the estimated probability of 31-day mortality. Second, physicians practiced with virtual cases to rehearse interpretation of the RISKINDEX.

Study intervention and procedure

An overview of the patients’ timeline is shown in Supplementary Fig. 8. After complete assessment of the patient, the ED physicians were asked questions regarding their clinical intuition (Table 2). In the intervention group only, the RISKINDEX was presented in person by a member of the research team to the attending physician, after complete assessment of the patient and after a preliminary treatment plan had been made. A member of the research team informed the ED physician about the study, explained the RISKINDEX variables, the meaning of the calculated probability, and the high predictive accuracy for 31-day mortality found in the previous multicenter study. To ensure that presenting the RISKINDEX would not influence the ED physicians’ clinical intuition, the electronic case record form was designed in such a way that the RISKINDEX could not be calculated until after the clinical intuition questions had been answered. Subsequently, these physicians were questioned about the alignment of the RISKINDEX with their initial clinical intuition and any resulting changes in the treatment plan (Table 2). Finally, in a subgroup of 121 patients we asked the ED physicians about their perceived added value of the RISKINDEX (using a Likert scale ranging from 1 to 10).

Table 2 Questionnaires regarding clinical intuition and medical treatment changes

RISKINDEX

The RISKINDEX is a machine learning (ML)-derived risk score that predicts all-cause 31-day mortality using routine laboratory tests ordered by the attending physician, along with basic patient characteristics (age and sex)21. The computed RISKINDEX (value 0–100) corresponds to an individual’s probability of 31-day mortality.

Clinical prediction tools

To compare the prognostic accuracy for 31-day mortality of the RISKINDEX with that of clinical prediction tools, we selected commonly used prediction tools based on their prevalence and global use. We selected the National Early Warning Score (NEWS), the Acute Physiology and Chronic Health Evaluation II (APACHE II), and the Sepsis-related Organ Failure Assessment (SOFA) score6,7,8. Although originally derived for inpatient and intensive-care cohorts, these three scores are among the few that have been validated, and clinically used, in ED populations and were therefore used as comparators for our study29,30,31,32.

Data collection

In short, we collected data on patient characteristics, comorbidity, triage category (based on the Manchester triage system (MTS)34), reason for ED visit, vital signs, laboratory tests, and clinical endpoints (e.g., hospital admission, admission to intensive care unit (ICU) within 31 days, 31-day mortality). The answers to all questionnaires were immediately recorded in an electronic case record form. The retrieval of data regarding the results of laboratory tests and the results of the RISKINDEX was automated. To ensure data quality, all data on outcomes and a sample of all data were double-checked by another member of the research team and/or the study monitor, and discrepancies were resolved through discussion with a second reviewer. Data monitoring was performed by the Clinical Trial Center Maastricht (CTCM).

Outcomes

The primary outcomes for this study were the prognostic accuracy for 31-day mortality and clinical impact of the RISKINDEX. Secondary outcomes included the prognostic accuracy for 31-day mortality compared to clinical prediction tools, differences in clinical outcomes including hospital admission, length of hospital stay, admission to ICU, and feasibility of the RISKINDEX.

Statistical analysis

Assuming a 31-day mortality of 8%, we calculated a required sample size of 1250 patients and anticipated including 1300 patients during the study period, based on ED patient volumes at MUMC+. With regard to policy changes, a post hoc power calculation revealed that 388 patients in each arm were needed to detect 2% policy changes (80% power; 5% significance). In the current study, a total of 1303 patients were included, which would allow us to detect a 1.2% difference in policy changes (80% power). Baseline characteristics were analyzed using descriptive statistics. Categorical variables were reported as frequency counts with percentages and continuous variables as medians with interquartile range (IQR) or means with standard deviation (SD), depending on their distribution.
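
The order of magnitude of such a power calculation can be checked with a short sketch. The exact method used for the post hoc calculation is not specified in this text, so the following rests on two assumptions: a two-sided z-test for two independent proportions with a pooled variance under the null hypothesis, and a policy-change rate of 0% in the control arm. The function name `n_per_arm` is hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Sample size per arm for a two-sided two-proportion z-test
    (pooled variance under H0, unpooled variance under H1)."""
    if p1 == p2:
        raise ValueError("p1 and p2 must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)


# Assumed scenario: 0% policy changes in the control arm vs 2% with the RISKINDEX
print(n_per_arm(0.0, 0.02))  # prints 388
```

Under these assumptions the sketch returns 388 patients per arm for a 2% difference; other formulas (for example, with a continuity correction) would give somewhat different numbers.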

Our primary analysis assessed the prognostic accuracy and clinical impact of the RISKINDEX. The prognostic accuracy for 31-day mortality was compared against physicians’ clinical intuition by calculating the areas under the receiver operating characteristics curve (AUROC) in the intervention group. The AUROCs were compared by using the method of DeLong35. The precision-recall curve (with area under this precision-recall curve (AUPRC)) was employed to evaluate the balance between sensitivity and positive predictive value36. As a post-hoc analysis, the prognostic accuracy of the RISKINDEX was also compared against the physician’s clinical intuition to predict 7-day mortality. As a post-hoc analysis, clinical intuition was further analyzed by stratifying physicians’ experience into two categories: 0–6 years (residents in specialist training) and >6 years (medical specialists). Intrinsic to the evaluation of the prognostic accuracy of a risk score, the clinical actions guided by that risk score could have influenced the primary outcome of 31-day mortality, thereby obscuring its true performance. Therefore, a sensitivity analysis was conducted within the control group to ensure an unbiased evaluation unaffected by clinical actions guided by the results of the RISKINDEX. The clinical impact was evaluated by assessing the number and type of policy changes in the ED after presentation of the RISKINDEX. Although not powered for clinical endpoints, the randomized controlled design allowed a direct exploratory evaluation of the effect of the RISKINDEX on secondary clinical endpoints. Feasibility of the RISKINDEX was assessed by physicians’ alignment and its perceived added value. The alignment between the RISKINDEX and ED physicians’ expectations was evaluated by asking physicians whether the RISKINDEX result matched, exceeded, or fell below their expectations. 
The perceived added value of the RISKINDEX was evaluated by asking ED physicians to rate its usefulness on a Likert scale from 1 to 10. This assessment was conducted in a subgroup of 114 patients.
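
To make the comparison of two AUROCs measured in the same patients more concrete, the following is a minimal sketch of the DeLong approach for paired ROC curves. The study analyses were performed in R; this standard-library Python version is only an illustration, and the function names and the toy data are hypothetical. The AUROC is computed as the Mann-Whitney statistic, and the variance of the AUROC difference is estimated from the per-patient placement values.

```python
from math import sqrt
from statistics import NormalDist


def _placements(scores, labels):
    """Placement values per case (V10) and per control (V01)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]

    def psi(x, y):  # Mann-Whitney kernel; ties count as one half
        return 1.0 if x > y else 0.5 if x == y else 0.0

    v10 = [sum(psi(x, y) for y in neg) / len(neg) for x in pos]
    v01 = [sum(psi(x, y) for x in pos) / len(pos) for y in neg]
    return v10, v01


def _cov(a, b):
    """Sample covariance with an (n - 1) denominator."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


def delong_test(scores_a, scores_b, labels):
    """Return (AUROC A, AUROC B, z, two-sided p) for paired predictions."""
    v10a, v01a = _placements(scores_a, labels)
    v10b, v01b = _placements(scores_b, labels)
    m, n = len(v10a), len(v01a)  # number of cases and controls
    auc_a, auc_b = sum(v10a) / m, sum(v10b) / m
    var = ((_cov(v10a, v10a) + _cov(v10b, v10b) - 2 * _cov(v10a, v10b)) / m
           + (_cov(v01a, v01a) + _cov(v01b, v01b) - 2 * _cov(v01a, v01b)) / n)
    if var <= 0:  # identical placement values: no detectable difference
        return auc_a, auc_b, 0.0, 1.0
    z = (auc_a - auc_b) / sqrt(var)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return auc_a, auc_b, z, p


# Toy data (made up, NOT trial data): 1 = died within 31 days
died = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
model_a = [0.9, 0.8, 0.35, 0.6, 0.4, 0.3, 0.2, 0.5, 0.1, 0.15]
model_b = [0.7, 0.3, 0.4, 0.5, 0.6, 0.2, 0.45, 0.1, 0.35, 0.8]
auc_a, auc_b, z, p = delong_test(model_a, model_b, died)
print(round(auc_a, 3), round(auc_b, 3))  # prints 0.917 0.583
```

The pairwise loops make this sketch quadratic in the number of patients, which is acceptable for a cohort of this size; dedicated packages use faster rank-based implementations.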

All analyses were performed in R, version 4.1.3 (The R Foundation for Statistical Computing). Source code for the data analysis and in-house developed interface is available in a public repository (https://github.com/wptmdoorn/marsedstudy).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.