Fig. 3: Model performance in different patient cohorts.
From: Early prediction of circulatory failure in the intensive care unit using machine learning

a–e, Analyses use the circEWS model with a threshold set at 90% recall (obtained on patients in the test set), which corresponds to an overall precision of 30%, and with silencing of new alarms for 30 min. a, Recall and precision for patients in different APACHE diagnostic groups. Boxes in the box plot show the IQR, and the diamonds are outliers with values that lie outside the [minimum, maximum] range of the whiskers, where minimum = Q1 – 1.5 × IQR and maximum = Q3 + 1.5 × IQR (Q1, Q3 and IQR represent the first quartile, the third quartile and the interquartile range, respectively). b, Recall and precision for patients, stratified by APACHE-III score. The notation (a/d) under each group name signifies that a of the d patients in the group had events. c, Recall and precision as a function of patient age. d, Recall as a function of time since admission. Events (episodes of circulatory failure) are stratified on the basis of time lag after ICU admission. Top, the cumulative performance of the model; for example, at 8 h after admission the overall recall of the model is approximately 96%. Bottom, the recall for each indicated time interval. e, AUPRC (top) and precision at a fixed threshold (baseline prevalence shown in red) (bottom) as a function of the year on which the model was trained. Eight models were trained, each using one year of data from 2008 to 2015, and were tested on a dataset from 2016, for which we observe stationarity (P = 5 × 10⁻⁵, n = 8 years, Dickey–Fuller test). Box plots in a were derived from n = 6 independent experiments in the temporal splits; in panels b–e, solid curves were derived from the held-out split, and variation estimates were derived from n = 5 independent experiments in the development splits.
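The whisker rule stated for panel a can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' plotting code: the function name `whisker_bounds`, the sample values and the quartile interpolation convention (the caption does not specify one) are assumptions.

```python
from statistics import quantiles


def whisker_bounds(values):
    """Return (minimum, maximum, outliers) under the 1.5 x IQR rule.

    Assumption: quartiles use linear interpolation over the sample
    ("inclusive" method); the caption does not state which convention
    was used for the figure.
    """
    q1, _median, q3 = quantiles(sorted(values), n=4, method="inclusive")
    iqr = q3 - q1
    minimum = q1 - 1.5 * iqr  # lower whisker limit
    maximum = q3 + 1.5 * iqr  # upper whisker limit
    # Points outside [minimum, maximum] would be drawn as diamonds.
    outliers = [v for v in values if v < minimum or v > maximum]
    return minimum, maximum, outliers


# Hypothetical recall values from six experiments (not data from the paper).
recalls = [0.80, 0.82, 0.84, 0.86, 0.88, 0.40]
low, high, flagged = whisker_bounds(recalls)
print(low, high, flagged)
```

With only six experiments per box, the choice of quartile convention can shift the whisker limits noticeably, so the exact points flagged as outliers may differ between plotting libraries.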
P values for panels a and b (dependent 2-sample t test, Benjamini–Hochberg corrected): P = 0.038 for decreased event recall in patients with neurological conditions, P = 0.0006 for decreased precision in neurosurgical patients, P = 0.0004 for lower precision in patients with APACHE scores (0–15), P = 0.039 for lower recall in emergency admissions, P = 0.039 for higher recall in surgical admissions. n = 6 independent experiments in the temporal splits were used.
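For readers unfamiliar with the Benjamini–Hochberg correction named above, the following is a minimal sketch of the standard step-up adjustment. It is not the authors' analysis code; the function name `benjamini_hochberg` and the input values are hypothetical, and the P values reported in the caption are already corrected, so they should not be passed through it again.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical uncorrected P values (not taken from the paper).
raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw))
```

Each raw P value is scaled by m / rank, and the running minimum ensures that adjusted values never decrease as the raw values decrease.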