Table 2 Performance of the outcome classification task on the held-out test set, and on the subset of the test set used in the reader study (n represents the number of images).

From: An artificial intelligence system for predicting the deterioration of COVID-19 patients in the emergency department

Test set (n = 832)

| Model | AUC, 24 h | AUC, 48 h | AUC, 72 h | AUC, 96 h | PR AUC, 24 h | PR AUC, 48 h | PR AUC, 72 h | PR AUC, 96 h |
|---|---|---|---|---|---|---|---|---|
| COVID-GBM | 0.747 (0.698, 0.802) | 0.739 (0.69, 0.795) | 0.750 (0.703, 0.799) | 0.770 (0.727, 0.813) | 0.230 (0.139, 0.296) | 0.325 (0.229, 0.396) | 0.408 (0.317, 0.479) | 0.523 (0.433, 0.6) |
| COVID-GMIC | 0.695 (0.636, 0.763) | 0.716 (0.666, 0.771) | 0.717 (0.668, 0.773) | 0.738 (0.695, 0.785) | 0.200 (0.119, 0.260) | 0.302 (0.209, 0.379) | 0.374 (0.283, 0.452) | 0.439 (0.346, 0.515) |
| COVID-GBM + COVID-GMIC | 0.765 (0.712, 0.817) | 0.749 (0.700, 0.798) | 0.769 (0.724, 0.818) | 0.786 (0.745, 0.830) | 0.243 (0.150, 0.299) | 0.332 (0.237, 0.41) | 0.439 (0.345, 0.527) | 0.517 (0.429, 0.600) |

Reader study dataset (n = 200)

| Reader or model | AUC, 24 h | AUC, 48 h | AUC, 72 h | AUC, 96 h | PR AUC, 24 h | PR AUC, 48 h | PR AUC, 72 h | PR AUC, 96 h |
|---|---|---|---|---|---|---|---|---|
| Radiologist A | 0.613 (0.519, 0.705) | 0.645 (0.571, 0.731) | 0.691 (0.618, 0.77) | 0.740 (0.674, 0.814) | 0.346 (0.217, 0.441) | 0.490 (0.367, 0.599) | 0.640 (0.536, 0.745) | 0.742 (0.657, 0.834) |
| Radiologist B | 0.637 (0.547, 0.73) | 0.636 (0.552, 0.716) | 0.658 (0.588, 0.738) | 0.713 (0.649, 0.786) | 0.365 (0.229, 0.462) | 0.460 (0.335, 0.56) | 0.590 (0.492, 0.701) | 0.704 (0.616, 0.805) |
| Radiologist A + Radiologist B | 0.642 (0.555, 0.729) | 0.663 (0.589, 0.746) | 0.692 (0.621, 0.766) | 0.741 (0.678, 0.809) | 0.403 (0.272, 0.52) | 0.499 (0.380, 0.613) | 0.609 (0.492, 0.711) | 0.740 (0.650, 0.831) |
| COVID-GMIC | 0.642 (0.554, 0.734) | 0.701 (0.627, 0.781) | 0.751 (0.685, 0.821) | 0.808 (0.75, 0.87) | 0.381 (0.235, 0.480) | 0.546 (0.421, 0.657) | 0.676 (0.564, 0.780) | 0.789 (0.699, 0.880) |
| COVID-GBM | 0.704 (0.632, 0.784) | 0.719 (0.648, 0.794) | 0.750 (0.684, 0.821) | 0.787 (0.727, 0.850) | 0.411 (0.259, 0.518) | 0.537 (0.394, 0.64) | 0.668 (0.558, 0.77) | 0.804 (0.738, 0.884) |
| COVID-GBM + COVID-GMIC | 0.708 (0.637, 0.799) | 0.702 (0.633, 0.775) | 0.778 (0.719, 0.851) | 0.819 (0.763, 0.885) | 0.411 (0.279, 0.517) | 0.500 (0.364, 0.601) | 0.705 (0.599, 0.806) | 0.808 (0.735, 0.898) |

We include 95% confidence intervals, shown in parentheses, estimated by 1000 iterations of the bootstrap method (ref. 30). The optimal weights assigned to the COVID-GMIC prediction in the ensemble of COVID-GMIC and COVID-GBM were derived by optimizing the AUC on the validation set, as described in Supplementary Fig. 3b. On the test set, this ensemble, denoted as ‘COVID-GBM + COVID-GMIC’, achieves the best performance across all time windows in terms of both AUC and PR AUC, except for the PR AUC in the 96 h task. In the reader study, our main finding is that COVID-GMIC outperforms radiologists A and B, who have 3 and 17 years of experience, respectively, across time windows longer than 24 h. Note that the radiologists did not have access to clinical variables, so their performance is not directly comparable to that of the COVID-GBM model; we include it only for reference. The area under the precision-recall curve is sensitive to class distribution, which explains the large differences between the scores on the test set and on the reader study subset. Best performance per metric is shown in bold.
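To make the confidence intervals above concrete, the following is a minimal sketch of a percentile bootstrap for the AUC with 1000 resamples. It is an illustration under stated assumptions, not the authors' implementation: the function names `roc_auc` and `bootstrap_ci`, the percentile variant of the bootstrap, the rule of skipping resamples that contain only one class, and the synthetic labels and scores are all hypothetical choices made for this example.

```python
import random


def roc_auc(labels, scores):
    """AUC as the probability that a random positive case scores higher
    than a random negative case (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, metric=roc_auc, n_iter=1000, alpha=0.05, seed=0):
    """Point estimate plus a percentile bootstrap interval for `metric`."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_iter:
        # Resample cases with replacement, keeping label and score paired.
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # assumption of this sketch: skip one-class resamples
        stats.append(metric(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_iter)]
    hi = stats[min(n_iter - 1, int((1 - alpha / 2) * n_iter))]
    return metric(labels, scores), (lo, hi)


# Synthetic data only: about 30% positives, with higher scores on average.
rng = random.Random(42)
labels = [1 if rng.random() < 0.3 else 0 for _ in range(120)]
scores = [rng.gauss(1.0 if y else 0.0, 1.0) for y in labels]

auc, (lo, hi) = bootstrap_ci(labels, scores)
print(f"AUC {auc:.3f} ({lo:.3f}, {hi:.3f})")
```

Any other metric that takes labels and scores, such as a PR AUC function, could be passed as `metric` in the same way. Because PR AUC depends on the fraction of positive cases, intervals computed on subsets with different class balance are not directly comparable, which is consistent with the note above.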