Introduction

Unanticipated clinical deterioration of hospital patients remains a pernicious problem that contributes to in-hospital morbidity and mortality1,2,3,4,5. Often, following the clinical deterioration event, clinicians who review these cases can recognize trends in vital signs, laboratory tests, and other electronic health record elements that, in retrospect, might have raised clinical suspicion of illness6. For example, the coincidence of rising heart rate and falling blood pressure should suggest incipient shock, but often the changes in physiological parameters are subtle and insidious. Automated detection of such signatures of illness might inform clinicians of patients who could benefit from earlier assessment and proactive clinical action1,7.

Since its inception more than 20 years ago in the neonatal intensive care unit (ICU)8,9, predictive analytics monitoring has been increasingly recognized for its potential to improve patient outcomes by early detection of deterioration. Artificial intelligence (AI) and machine learning methods support synthesizing unprocessed cardiorespiratory monitoring data, laboratory results and clinician-entered vital signs into computational models that can visually represent the estimation of the risk of clinical deterioration10. Since AI’s new and sometimes opaque methodologies lie at its core, predictive analytics monitoring has been held to high standards for transparency. For example, there are numerous guidelines and opinion pieces to inform the developer and the user11,12,13,14.

Incontrovertible proof of the benefit of predictive analytics monitoring in the form of large randomized clinical trials has been lacking. While a meta-analysis of artificial intelligence-based sepsis alerts showed an average mortality reduction of 30%, it included only six randomized trials with 2938 patients15. To this list, we can add the original trial in this field, that of heart rate characteristics monitoring for neonatal sepsis in 3003 premature infants, which showed a 22% relative mortality reduction when risk estimates were displayed to clinicians16. Here, we extend the work in neonates to evaluate the impact of predictive analytics monitoring on events of clinical deterioration in a heterogeneous group of adults in acute care cardiac medical-surgical wards. We test the use of an AI-based risk analytic, Continuous Monitoring of Event Trajectories (CoMET, Nihon Kohden Digital Health Solutions, Irvine, CA), that visually displays risk estimates of multiple events of clinical deterioration every 15 min. We hypothesized that the passive display of a visual risk analytic would facilitate early identification of patients at risk for impending adverse outcomes and increase the hours free from clinical deterioration.

Methods

Study design

We conducted a cluster-randomized controlled trial (RCT) among patients admitted to 85-bed acute care cardiology medical-surgical wards. Eleven clusters were defined by room number, and patients were randomized to receive either the intervention (passive display of CoMET) or usual care (standard of care). Randomization began on January 4, 2021, and ended on October 4, 2022. Because of the pragmatic design, which aimed to replicate real-world patient contexts, all patients admitted to these floors during the study period were enrolled in the study. Clusters were re-randomized every 2 months following a Latin square design within each stratum to ensure a balanced treatment assignment. The full pre-registered study protocol and additional randomization details can be found in ref. 17, and the clinical trial was registered (NCT04359641). The study was approved by the University of Virginia Health System Institutional Review Board, and the study was granted a waiver of informed consent (IRB#22196) in accordance with the Declaration of Helsinki. A Data Safety and Monitoring Board reviewed interim results. The results are reported in accordance with the CONSORT-AI guidelines12. The trial was stopped once the pre-specified sample size had been enrolled.

Patient care units and populations

The Medical-Cardiology unit predominantly comprised acute cardiology patients admitted with decompensated congestive heart failure, acute coronary syndrome, arrhythmias such as atrial fibrillation, and other cardiac diagnoses. The Medical-Surgical unit was historically the same as Medical-Cardiology. Immediately preceding the study, though, 11 of the 20 beds were repurposed for patients with solid organ transplants under the care of a dedicated transplant team. These transplant patients, and the practices of their care team, differed from those of the cardiology and cardiac surgery beds. The baseline distributions of vital signs and laboratory test results, for example, differed, with a bias toward findings of chronic kidney and liver failure in the transplant patients. Because of the pragmatic nature of this study design in real-world contexts, the transplant patients remained in the study population. The Cardio-Thoracic Surgery unit predominantly comprised patients awaiting discharge. During the study period, no other early warning systems were used in the study patient care units.

Risk displays

We used the CoMET system to display two risk trajectories, calculated from previously published machine-learning logistic regression analyses relating continuous cardiorespiratory monitoring parameters, nurse-charted vital signs, and laboratory test results to outcome events on these wards from 2013 to 2015 (ref. 18). These models were developed using 2,194,077 control observations (8,105 admissions) and 23,881 case observations (586 admissions). The predictors included 41 physiological variables that were at least 70% available/non-missing, including data derived from continuous cardiorespiratory monitoring, laboratory assessments, and vital sign assessments18. The cohort in which the models were originally developed and validated mirrored the study population (acute care medical-surgical units), with an event rate of 1.1% (during a pre-COVID-19 time frame), and the imbalanced data were used for training to ensure the predicted risk was well-calibrated to the observed risk18.
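As a quick arithmetic check, the reported 1.1% event rate is consistent with the observation counts above if it is read as the share of case observations among all observations. That reading is an assumption made here for illustration; the variable names are hypothetical.

```python
# Observation counts as reported for model development.
control_observations = 2_194_077
case_observations = 23_881

# Assumption: the event rate is the share of case observations
# among all (case + control) observations.
total_observations = control_observations + case_observations
event_rate = case_observations / total_observations

# Express as a percentage rounded to one decimal place.
event_rate_percent = round(event_rate * 100, 1)
print(f"Observation-level event rate: {event_rate_percent}%")
```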

The unit of measurement of the risk estimate is the fold-increase in the probability of the event in the next 24 h compared to the average probability. Thus, a patient with a score of 1 has the average risk, a patient with a score of 2 has twice the average risk, and so on. The plot shows a comet-like icon for each patient with the bed number in the head; the tail is 3 h long. Additional details on the model inputs and model performance, including excellent calibration, are reported elsewhere19. An example of what the clinician sees for a single patient is shown in Fig. 1.
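The fold-increase scale can be illustrated with a minimal sketch. The function name and the example average probability of 0.01 are hypothetical and chosen only for illustration; they are not values taken from the CoMET system.

```python
def fold_increase(predicted_probability: float, average_probability: float) -> float:
    """Return the predicted 24 h event probability as a multiple of the average probability."""
    if not 0.0 < average_probability <= 1.0:
        raise ValueError("average_probability must be in (0, 1]")
    if not 0.0 <= predicted_probability <= 1.0:
        raise ValueError("predicted_probability must be in [0, 1]")
    return predicted_probability / average_probability


# Hypothetical average probability, for illustration only.
EXAMPLE_AVERAGE_PROBABILITY = 0.01

# A patient at the average probability scores 1; twice the average scores 2.
average_patient_score = fold_increase(0.01, EXAMPLE_AVERAGE_PROBABILITY)
elevated_patient_score = fold_increase(0.02, EXAMPLE_AVERAGE_PROBABILITY)
print(average_patient_score, elevated_patient_score)
```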

Fig. 1

CoMET display of a hypothetical patient in Bed 49B. The head-to-tail of the CoMET represents a 3-hour trajectory. This patient has a greater than 5-fold risk of both a cardiorespiratory and a cardiovascular clinical deterioration event in the next 24 h. In the right upper quadrant, the user will find a leaderboard showing how this patient’s risk compares with that of the other patients. In the right lower quadrant, the user can look at the trajectory of the 2 individual risk scores.

Intervention implementation details

We followed principles derived from the prior implementation of predictive analytics monitoring in ICUs and non-ICUs at the UVA Hospital and a regional hospital20. Each of the three units identified an interprofessional team of clinical champions to serve as super users, oversee implementation, and ensure staff education. Our team of educators (PIs, CoMET educators, clinical research coordinators (CRCs), and other study investigators) provided sessions with the clinical champions and other clinician groups in each unit before randomization and during the first quarter. We encouraged the integration of the CoMET scores into routine nurse vital sign assessments, rounding discussions, hand-off reports, and escalation protocols. Our recommended protocol for reporting and responding to changes in the CoMET scores included nurse notification of physicians or advanced practice providers of an increase in score by 2 units over a three-hour period and physician evaluation of the patient as appropriate. A CRC attended morning rounds on each unit on a rotating basis to show the display and answer questions about interpreting score trajectories. The CoMET score was available on the unit at all times to those randomized to the intervention group on large scale monitors and could be accessed via a web-based platform if not on site. The CoMET scores flowed through to the electronic medical record every hour and were recorded within the vital sign flowsheet17.
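The recommended notification rule (a rise of 2 units over three hours) can be sketched as follows. This is an illustrative reading only, not the CoMET implementation: it assumes one score every 15 min and measures the rise against the lowest score in the preceding three hours. The function and parameter names are hypothetical.

```python
from typing import List, Sequence


def rise_flags(
    scores: Sequence[float],
    window_hours: int = 3,
    threshold: float = 2.0,
    samples_per_hour: int = 4,  # assumption: one risk estimate every 15 min
) -> List[bool]:
    """Flag each time point whose score is at least `threshold` above the
    lowest score observed in the preceding `window_hours` (inclusive)."""
    window = window_hours * samples_per_hour
    flags = []
    for i, current in enumerate(scores):
        start = max(0, i - window)
        lowest = min(scores[start:i + 1])
        flags.append(current - lowest >= threshold)
    return flags


# Three stable hours at the average risk, then a rise to 3 within one hour.
example_scores = [1.0] * 12 + [1.5, 2.0, 2.5, 3.0]
example_flags = rise_flags(example_scores)
print(example_flags.index(True))  # index of the first flagged time point
```

Under this reading, a slow drift that never gains 2 units within any three-hour window is not flagged, which matches the intent of a trajectory-based rule rather than a fixed threshold.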

Study outcomes

The primary outcome was the number of hours free of clinical deterioration (adverse clinical events including an emergent ICU transfer, emergent intubation, cardiac arrest, emergent surgery, and death) within 21 days of admission. With this outcome, patients were awarded the hours that they remained stable without a clinical deterioration event. The maximum score was 21 event-free days (504 h), and patients who were successfully discharged from the hospital prior to the 21 days without an event were awarded all 504 h. Patients who died during admission were counted as having 0 event-free hours. Patients with non-emergent ICU transfers, non-emergent/planned surgeries, or changes in bed assignments that caused a change in arm assignment were censored. Dedicated clinician CRCs adjudicated each case of clinical deterioration/study endpoint event. For any events that were ambiguous, the PIs of the study (JKM and JMB) adjudicated outcomes in a blinded fashion at a weekly meeting. This report presents only the primary outcome; additional analyses focused on the time to clinical action, post-ICU trajectories, and costs to the healthcare system will be reported separately. A Data Safety Monitoring Board met at pre-specified intervals to assess for any unintended harms.
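The scoring rules for the primary outcome can be sketched as a small function. This is a simplified illustration under stated assumptions: time is measured in hours from admission to the first adjudicated event, and censoring is not handled. The function and argument names are hypothetical.

```python
from typing import Optional

MAX_EVENT_FREE_HOURS = 21 * 24  # 21 days = 504 h


def event_free_hours(
    hours_to_first_event: Optional[float],
    died_during_admission: bool,
) -> float:
    """Score one admission: 0 h for in-hospital death, 504 h if no event
    occurred within 21 days, otherwise the hours until the first event."""
    if died_during_admission:
        return 0.0
    if hours_to_first_event is None:
        # No event (including discharge without an event before day 21).
        return float(MAX_EVENT_FREE_HOURS)
    # Events after day 21 fall outside the window, so the score is capped.
    return float(min(max(hours_to_first_event, 0.0), MAX_EVENT_FREE_HOURS))


print(event_free_hours(None, False))   # discharged without an event
print(event_free_hours(36.5, False))   # first event 36.5 h after admission
print(event_free_hours(100.0, True))   # died during admission
```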

Statistical analysis

We used a generalized estimating equation with a zero-inflated negative binomial distribution to analyze event-free hours and a Cox proportional hazards model to analyze time to the first critical event. The model handled the correlation of multiple hospital admissions from the same patient, within-unit clusters within the study design, and differences in the amount of time at risk for censored patients. An intent-to-treat analytic strategy was used, in which all randomized patients were included in the primary endpoint comparison between the intervention and usual care. As a post-hoc analysis, we also analyzed the primary outcome in an ‘at-risk’ cohort, defined as those patients who experienced a rise of 2 units on either axis of the CoMET display during their hospitalization. The analyses in the at-risk cohort were exploratory, as they were not predefined. The post-hoc analysis was conducted to uncover more details surrounding censoring events and patient movement.

Results

Over the 22 months of randomization, which coincided with the COVID-19 pandemic, there were 10,422 patient visits within our study cohort (Fig. 2). Patients were majority male, White, and non-Hispanic, were on average 65 years of age, and had relatively short hospital lengths of stay (Table 1). Model performance is reported separately19, but overall the predictive models performed well and outperformed a common early warning score, qSOFA.

Fig. 2

CONSORT randomization information.

Table 1 Admission characteristics of the cohort and at-risk subgroup by arm.

Only 5.3% of patients had a clinical deterioration event, and the vast majority of participants had the maximum number of event-free hours (Table 2). There was no statistically significant difference in the primary outcome by study arm in either the full cohort or those at risk for clinical deterioration. Among those who had a clinical event, there were more event-free hours in the intervention/display-on group than in the standard-of-care/display-off group, but this did not reach statistical significance. Censoring events and components meeting the endpoint of clinical deterioration are found in Table 2. The estimated marginal means from the negative binomial generalized estimating equation are found in Fig. 3.

Table 2 Study outcomes for the cohort and at-risk subjects.

n(%) or median (inter-quartile range). IRR/ORs for display-on versus display-off using negative binomial or logistic regression adjusted for study design.

Fig. 3

Estimated marginal mean (and 95% CI) event-free hours by arm and population.

Bed movement was common within this RCT (n = 382 went from usual care to CoMET display, and n = 419 went from CoMET display to usual care). In a post-hoc analysis, we found a higher relative risk of clinical deterioration among those transferred from a non-display to a display bed (Fig. 4). This finding indicates that clinicians made decisions that undermined the cluster RCT design by preferentially placing sicker patients into the CoMET display beds, presumably to obtain additional information about their underlying physiological instability.

Fig. 4

Time zero is the transfer from one arm to the other, and the unit of analysis is days. The line is the median and the shaded region is the IQR. Blue denotes patients transferred from a display-off to a display-on bed; their overall relative risk of clinical deterioration is higher, representing a sicker population.

Discussion

We conducted a large randomized controlled trial to test the impact of predictive analytics monitoring on cardiology and cardiac surgery patients in in-patient acute care wards. Our major finding is that a display of the trajectories of the risk of imminent deterioration, without alerts, was not associated with an increase in hours free of clinical deterioration. We also found that sicker patients were preferentially relocated from usual-care beds to CoMET display-on beds, thereby distorting the RCT design. We note that this study was undertaken during the early stages of the COVID-19 pandemic in our area, which disrupted staffing and resource utilization10 and was associated with a lower event rate than our pre-COVID baseline. Clinician-driven transfers of sicker patients into display beds may suggest that the intervention was perceived as useful at the bedside, even if it did not improve the prespecified primary outcome; these behavior patterns should be studied in greater detail alongside other implementation outcomes. Further, treating bed movement as a censoring event may have biased the overall study results toward the null.

We justify using continuous cardiorespiratory monitoring in the predictive models because of the causality leakage of clinician-ordered tests. For example, the presence of a serum lactate measurement in the chart bespeaks clinical suspicion of sepsis. It excels as a predictor, but not because it detects changes before clinicians. Likewise, the use of pressors, fluid administration, and blood cultures as predictors rather than endpoints allows causality leakage, artificially inflating model performance. Kamran and coworkers recently showed that the Epic Sepsis Model had no predictive capability at all when assessed prior to any clinician suspicion or action, that is, before fluids, antibiotics, and lactate tests21.

We justify the randomized clinical trial approach to the study of AI-based risk analytics. We find irreducible uncertainty in other trial designs and results, even our own demonstration of a 50% reduction in septic shock in a surgical ICU after the introduction of CoMET trained for urgent unplanned intubation for acute respiratory failure and hemorrhage requiring large transfusion22. While a control medical ICU without CoMET monitoring had no consistent benefit over the same time period, the absence of randomization has allowed skepticism of the before-and-after study results. We especially endorse the simple and pragmatic nature of the current trial: we enrolled everyone to mimic complex real-world acute care environments and used endpoints that are important to clinicians and patients.

Other studies have used non-randomized designs. For example, Winslow, Edelson, Churpek, and coworkers found reduced mortality throughout the hospital after initiating their early warning system23. The magnitude was substantial but was the same even for patients with low risks who were never alerted. Adams, Henry, Saria, and coworkers used a machine learning-based sepsis early warning system that included mechanical ventilation, lactate, and vasopressors as sepsis predictors and placed passive alerts in the EHR, but only after “verifiable symptoms” were present24. They found reduced mortality in the 1.2% of patients who met retrospective sepsis criteria and whose alerts were acknowledged within three hours but did not present outcomes of the other patients24,25. Groups led by Shimabukuro26 and Watkinson27 performed small randomized trials with divergent results. Escobar, Liu and coworkers used a non-randomized stepped wedge approach to show a convincing mortality reduction with an EHR-based early warning score in 19 hospitals28. They employed an informed intermediary to receive alerts and to review the clinical record before alerting bedside clinicians.

Cluster randomization was chosen because clinicians needed to easily recognize who was in the intervention group. We did not want to add to the complexity of the acute care environment, nor did we want a clinician who did not see a rising score to be falsely reassured by assuming that the patient was in the intervention group. Although we have previously tested the positive predictive value of a threshold-based alert derived from CoMET (ref. 1), we did not want to add to clinician burden or contribute to alarm fatigue, and so decided to offer suggestions for response without an active alert strategy. The lack of mandated protocolized responses, however, makes it challenging to attribute the neutral results only to the accuracy of the analytic.

We approached education and implementation of novel analytics in real-world environments within the context of the COVID-19 pandemic and wanted to minimize any contribution to clinician burnout29. Implementation considerations focused on promoting trust, transparency, integration with workflow, prolonged stakeholder engagement, tailored education embedded within hospital norms, supporting actionability through integration within clinical care, and sustainability through ongoing supportive interactions29. However, variability in clinician patterns may limit generalizability if there are underlying practices that differ substantially between hospitals.

AI clinical trials call for an understanding of new and separate design elements not previously contemplated. There are several elements of this type of intervention, such as the use of passive displays, that warrant further study and have RCT design implications when studying AI-based predictive analytics: the proactive nature of the clinical action response, the rarity of the event of interest, and the complexity of implementation within evolving acute care environments. While the pragmatic design allowed for implementation in real-world environments, this also meant there was heterogeneity in patient populations, unit work environments, and clinician responses. The real-world context of clinical trials such as this implies that the implementation of AI-based risk analytics must also be adaptive to ensure long-term integration and practice flexibility across different work environments.

One aspect of the study design we would like to improve in the future is the integration of in-depth qualitative methodology and implementation outcomes alongside patient outcomes to better understand how clinicians process, communicate, and integrate risk scores into their care practices. Understanding the cognitive interpretation of scores in real-time, the response through diverse care processes, and communication patterns involving AI scores preceding clinical events is an area in particular need of further study. In hindsight, hours free of events of clinical deterioration may not have been the most sensitive primary outcome. Outcomes specific to the at-risk population and subsequent intensive care unit trajectories may have been more appropriate, given the overall rarity of our event.