npj Digital Medicine

Table 2 Performance of the outcome classification task on the held-out test set, and on the subset of the test set used in the reader study (n represents the number of images).

From: An artificial intelligence system for predicting the deterioration of COVID-19 patients in the emergency department

Test set (n = 832)
	AUC				PR AUC
	24 h	48 h	72 h	96 h	24 h	48 h	72 h	96 h
COVID-GBM	0.747	0.739	0.750	0.770	0.230	0.325	0.408	0.523
	(0.698, 0.802)	(0.69, 0.795)	(0.703, 0.799)	(0.727, 0.813)	(0.139, 0.296)	(0.229, 0.396)	(0.317, 0.479)	(0.433, 0.6)
COVID-GMIC	0.695	0.716	0.717	0.738	0.200	0.302	0.374	0.439
	(0.636, 0.763)	(0.666, 0.771)	(0.668, 0.773)	(0.695, 0.785)	(0.119, 0.260)	(0.209, 0.379)	(0.283, 0.452)	(0.346, 0.515)
COVID-GBM + COVID-GMIC	0.765	0.749	0.769	0.786	0.243	0.332	0.439	0.517
	(0.712, 0.817)	(0.700, 0.798)	(0.724, 0.818)	(0.745, 0.830)	(0.150, 0.299)	(0.237, 0.41)	(0.345, 0.527)	(0.429, 0.600)

Reader study dataset (n = 200)
	AUC				PR AUC
	24 h	48 h	72 h	96 h	24 h	48 h	72 h	96 h
Radiologist A	0.613	0.645	0.691	0.740	0.346	0.490	0.640	0.742
	(0.519, 0.705)	(0.571, 0.731)	(0.618, 0.77)	(0.674, 0.814)	(0.217, 0.441)	(0.367, 0.599)	(0.536, 0.745)	(0.657, 0.834)
Radiologist B	0.637	0.636	0.658	0.713	0.365	0.460	0.590	0.704
	(0.547, 0.73)	(0.552, 0.716)	(0.588, 0.738)	(0.649, 0.786)	(0.229, 0.462)	(0.335, 0.56)	(0.492, 0.701)	(0.616, 0.805)
Radiologist A + Radiologist B	0.642	0.663	0.692	0.741	0.403	0.499	0.609	0.740
	(0.555, 0.729)	(0.589, 0.746)	(0.621, 0.766)	(0.678, 0.809)	(0.272, 0.52)	(0.380, 0.613)	(0.492, 0.711)	(0.650, 0.831)
COVID-GMIC	0.642	0.701	0.751	0.808	0.381	0.546	0.676	0.789
	(0.554, 0.734)	(0.627, 0.781)	(0.685, 0.821)	(0.75, 0.87)	(0.235, 0.480)	(0.421, 0.657)	(0.564, 0.780)	(0.699, 0.880)
COVID-GBM	0.704	0.719	0.750	0.787	0.411	0.537	0.668	0.804
	(0.632, 0.784)	(0.648, 0.794)	(0.684, 0.821)	(0.727, 0.850)	(0.259, 0.518)	(0.394, 0.64)	(0.558, 0.77)	(0.738, 0.884)
COVID-GBM + COVID-GMIC	0.708	0.702	0.778	0.819	0.411	0.500	0.705	0.808
	(0.637, 0.799)	(0.633, 0.775)	(0.719, 0.851)	(0.763, 0.885)	(0.279, 0.517)	(0.364, 0.601)	(0.599, 0.806)	(0.735, 0.898)

We include 95% confidence intervals estimated by 1000 iterations of the bootstrap method³⁰. The optimal weights assigned to the COVID-GMIC prediction in the COVID-GMIC and COVID-GBM ensemble were derived by optimizing the AUC on the validation set, as described in Supplementary Fig. 3b. On the test set, the ensemble of COVID-GMIC and COVID-GBM, denoted as ‘COVID-GBM + COVID-GMIC’, achieves the best performance across all time windows in terms of AUC and PR AUC, except for the PR AUC in the 96 h task. In the reader study, our main finding is that COVID-GMIC outperforms radiologists A and B (with 3 and 17 years of experience, respectively) across time windows longer than 24 h. Note that the radiologists did not have access to clinical variables, so their performance is not directly comparable to that of the COVID-GBM model; we include it only for reference. The area under the precision-recall curve is sensitive to class distribution, which explains the large differences between the scores on the test set and the reader study subset. Best performance per metric is shown in bold.
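
The two procedures mentioned in this note, bootstrap confidence intervals and validation-based ensemble weighting, can be illustrated with a minimal sketch. It is not the authors' code: the percentile form of the bootstrap, the grid search over weights, and the function names `roc_auc`, `bootstrap_ci`, and `best_ensemble_weight` are all assumptions made for illustration, and the actual weighting procedure is the one described in Supplementary Fig. 3b.

```python
import random


def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, metric=roc_auc, n_iter=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI (assumed variant): resample cases with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_iter:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # resample has a single class, so AUC is undefined; draw again
        stats.append(metric(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[max(0, round(alpha / 2 * n_iter))]
    hi = stats[min(n_iter - 1, round((1 - alpha / 2) * n_iter) - 1)]
    return lo, hi


def best_ensemble_weight(labels, p_gmic, p_gbm, steps=10):
    """Grid-search (assumed) the weight w on the image model that maximizes validation AUC."""
    best_w, best_auc = 0.0, -1.0
    for i in range(steps + 1):
        w = i / steps
        blended = [w * a + (1 - w) * b for a, b in zip(p_gmic, p_gbm)]
        auc = roc_auc(labels, blended)
        if auc > best_auc:
            best_w, best_auc = w, auc
    return best_w, best_auc
```

Calling `bootstrap_ci(labels, scores)` with the default `n_iter=1000` mirrors the number of iterations stated above. The pairwise AUC is quadratic in the number of cases, which is acceptable for a few hundred images but would be replaced by a rank-based computation for larger sets.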
