Table 2. Comparison of Source Data Model Performance, Estimated External Validation Performance, and Observed External Validation Performance on 13 Datasets and 5 Modalities

Source Dataset	P_Source	P_DABIS	P_Est	Ext. Dataset	P_Ext	Δ(P_Source, P_Ext)	Δ(P_Est, P_Ext)	Shuffled Ext.
CXR	0.85 [0.85-0.85]	0.63	0.73 [0.72-0.73]	CXP	0.73 [0.73-0.74]	0.12	0.00	0.50
CXR	0.85 [0.85-0.85]	0.63	0.73 [0.72-0.73]	NIH	0.76 [0.76-0.77]	0.09	-0.03	0.51
CXP	0.79 [0.79-0.79]	0.57	0.72 [0.72-0.73]	CXR	0.77 [0.77-0.77]	0.02	-0.05	0.45
CXP	0.79 [0.79-0.79]	0.57	0.72 [0.72-0.73]	NIH	0.76 [0.76-0.77]	0.03	-0.04	0.50
CXR+CXP	0.82 [0.82-0.82]	0.61	0.72 [0.71-0.72]	NIH	0.77 [0.77-0.78]	0.05	-0.05	0.51
CXR+NIH	0.85 [0.84-0.85]	0.68	0.67 [0.66-0.67]	CXP	0.69 [0.69-0.69]	0.16	-0.02	0.50
COVID-Ext	0.99 [0.98-0.99]	0.80	0.68 [0.67-0.69]	COVID-Int	0.64 [0.63-0.65]	0.36	0.04	0.53
ILD-Diag	0.95 [0.93-0.96]	0.85	0.60 [0.58-0.62]	ILD-Plan	0.66 [0.59-0.73]	0.29	-0.06	0.52
PTB-XL ECG	0.90 [0.89-0.91]	0.88	0.52 [0.52-0.53]	LUDB	0.70 [0.62-0.79]	0.21	-0.18	0.60
ICBHI	0.97 [0.90-1.00]	0.91	0.57 [0.50-0.67]	JUST	0.60 [0.38-0.81]	0.37	-0.03	0.51
MIMIC-III	0.72 [0.68-0.76]	0.63	0.59 [0.54-0.63]	EHR-Int	0.58 [0.56-0.60]	0.14	0.01	0.51
Average:	0.87	0.73	0.64		0.68	0.20	-0.04	0.52

AUROC model performance and bias estimates on validation and external datasets, with 95% confidence intervals given as [lower-upper]. P_Source, P_DABIS, P_Est, and P_Ext are the source, DABIS, calibrated external estimate, and external AUROCs, respectively. Where a dataset appears in multiple rows, averages are calculated first across instances of that dataset, then across all datasets. Δ refers to the difference in AUROC between two estimates. Bolded values in the Δ(P_Est, P_Ext) column highlight instances where our DABIS estimate outperforms the corresponding value in the Δ(P_Source, P_Ext) column. MIMIC-CXR is shortened to CXR in this table. Est. and Ext. refer to estimated and external, respectively. Shuffled Ext. refers to results obtained by validating models trained on shuffled source datasets on shuffled external datasets. On average, reporting AI model performance without external validation overestimates AUROC by 0.20, whereas our method underestimates it by 0.04.
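The two-stage averaging described in the caption can be illustrated with a minimal sketch. It assumes that, for the external-dataset columns, an "instance" is identified by the external dataset name; the caption does not state the grouping key, so this is an assumption. The helper name `two_stage_mean` is hypothetical. The values are copied from the P_Ext and Δ(P_Source, P_Ext) columns of the table.

```python
from collections import defaultdict
from statistics import mean

# (external dataset, P_Ext, delta(P_Source, P_Ext)) copied from the table rows.
ROWS = [
    ("CXP", 0.73, 0.12),
    ("NIH", 0.76, 0.09),
    ("CXR", 0.77, 0.02),
    ("NIH", 0.76, 0.03),
    ("NIH", 0.77, 0.05),
    ("CXP", 0.69, 0.16),
    ("COVID-Int", 0.64, 0.36),
    ("ILD-Plan", 0.66, 0.29),
    ("LUDB", 0.70, 0.21),
    ("JUST", 0.60, 0.37),
    ("EHR-Int", 0.58, 0.14),
]


def two_stage_mean(pairs):
    """Average within each dataset first, then across datasets.

    Assumption: rows are grouped by the dataset name given in `pairs`.
    """
    groups = defaultdict(list)
    for name, value in pairs:
        groups[name].append(value)
    return mean(mean(values) for values in groups.values())


avg_p_ext = two_stage_mean((name, p) for name, p, _ in ROWS)
avg_delta = two_stage_mean((name, d) for name, _, d in ROWS)
print(f"{avg_p_ext:.2f} {avg_delta:.2f}")  # prints "0.68 0.20"
```

Under this grouping, the sketch reproduces the 0.68 and 0.20 entries of the Average row. A plain mean over the 11 rows would instead give about 0.17 for Δ(P_Source, P_Ext), because the chest X-ray datasets appear in several rows and would be weighted more heavily.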
