Table 2. Comparison of Source Data Model Performance, Estimated External Validation Performance, and Observed External Validation Performance on 13 Datasets and 5 Modalities

From: Shortcut learning in medical AI hinders generalization: method for estimating AI model generalization without external data

| Source Dataset | PSource | PDABIS | PEst | Ext. Dataset | PExt | Δ(PSource, PExt) | Δ(PEst, PExt) | Shuffled Ext. |
|---|---|---|---|---|---|---|---|---|
| CXR | 0.85 [0.85-0.85] | 0.63 | 0.73 [0.72-0.73] | CXP | 0.73 [0.73-0.74] | 0.12 | 0.00 | 0.50 |
| CXR | 0.85 [0.85-0.85] | 0.63 | 0.73 [0.72-0.73] | NIH | 0.76 [0.76-0.77] | 0.09 | -0.03 | 0.51 |
| CXP | 0.79 [0.79-0.79] | 0.57 | 0.72 [0.72-0.73] | CXR | 0.77 [0.77-0.77] | 0.02 | -0.05 | 0.45 |
| CXP | 0.79 [0.79-0.79] | 0.57 | 0.72 [0.72-0.73] | NIH | 0.76 [0.76-0.77] | 0.03 | -0.04 | 0.50 |
| CXR+CXP | 0.82 [0.82-0.82] | 0.61 | 0.72 [0.71-0.72] | NIH | 0.77 [0.77-0.78] | 0.05 | -0.05 | 0.51 |
| CXR+NIH | 0.85 [0.84-0.85] | 0.68 | 0.67 [0.66-0.67] | CXP | 0.69 [0.69-0.69] | 0.16 | -0.02 | 0.50 |
| COVID-Ext | 0.99 [0.98-0.99] | 0.80 | 0.68 [0.67-0.69] | COVID-Int | 0.64 [0.63-0.65] | 0.36 | 0.04 | 0.53 |
| ILD-Diag | 0.95 [0.93-0.96] | 0.85 | 0.60 [0.58-0.62] | ILD-Plan | 0.66 [0.59-0.73] | 0.29 | -0.06 | 0.52 |
| PTB-XL ECG | 0.90 [0.89-0.91] | 0.88 | 0.52 [0.52-0.53] | LUDB | 0.70 [0.62-0.79] | 0.21 | -0.18 | 0.60 |
| ICHBHI | 0.97 [0.90-1.00] | 0.91 | 0.57 [0.50-0.67] | JUST | 0.60 [0.38-0.81] | 0.37 | -0.03 | 0.51 |
| MIMIC-III | 0.72 [0.68-0.76] | 0.63 | 0.59 [0.54-0.63] | EHR-Int | 0.58 [0.56-0.60] | 0.14 | 0.01 | 0.51 |
| Average: | 0.87 | 0.73 | 0.64 |  | 0.68 | 0.20 | -0.04 | 0.52 |

  1. AUROC model performance and bias estimates on validation and external datasets, with 95% confidence intervals given as [A-B]. PSource, PDABIS, PEst, and PExt are the source, DABIS, calibrated external estimate, and observed external AUROCs, respectively. Where a dataset appears in multiple rows, averages are calculated first across instances of that dataset, then across all datasets. Δ(A, B) is the difference in AUROC between two estimates, A minus B. Bolded values in the Δ(PEst, PExt) column mark instances where our DABIS-based estimate is closer to the observed external performance than the source performance is, that is, where it outperforms the Δ(PSource, PExt) column. MIMIC-CXR is shortened to CXR in this table. Est. and Ext. stand for estimated and external, respectively. Shuffled Ext. refers to results obtained by validating models trained on shuffled source datasets on shuffled external datasets. On average, reporting AI model performance without external validation overestimates external AUROC by 0.20, whereas our method underestimates it by 0.04.
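The two-stage averaging described in the footnote can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that PSource is grouped by source dataset name and PExt by external dataset name, and it uses the rounded AUROCs printed in the table, so small rounding differences from the published averages are possible. The helper name two_stage_mean is hypothetical.

```python
from collections import defaultdict

# (dataset name, AUROC) pairs copied from the table rows, one per row.
# Assumption: PSource is keyed by source dataset, PExt by external dataset.
P_SOURCE = [
    ("CXR", 0.85), ("CXR", 0.85), ("CXP", 0.79), ("CXP", 0.79),
    ("CXR+CXP", 0.82), ("CXR+NIH", 0.85), ("COVID-Ext", 0.99),
    ("ILD-Diag", 0.95), ("PTB-XL ECG", 0.90), ("ICHBHI", 0.97),
    ("MIMIC-III", 0.72),
]
P_EXT = [
    ("CXP", 0.73), ("NIH", 0.76), ("CXR", 0.77), ("NIH", 0.76),
    ("NIH", 0.77), ("CXP", 0.69), ("COVID-Int", 0.64),
    ("ILD-Plan", 0.66), ("LUDB", 0.70), ("JUST", 0.60),
    ("EHR-Int", 0.58),
]


def two_stage_mean(rows):
    """Average within each dataset first, then across datasets."""
    groups = defaultdict(list)
    for name, value in rows:
        groups[name].append(value)
    # Stage 1: one mean per dataset, so repeated datasets are not over-weighted.
    per_dataset = [sum(values) / len(values) for values in groups.values()]
    # Stage 2: plain mean of the per-dataset means.
    return sum(per_dataset) / len(per_dataset)


p_source_avg = two_stage_mean(P_SOURCE)
p_ext_avg = two_stage_mean(P_EXT)
print(f"{p_source_avg:.2f} {p_ext_avg:.2f}")  # prints "0.87 0.68"
```

Under these assumptions the sketch reproduces the PSource and PExt entries of the Average row; a plain mean over all eleven rows would weight CXR, CXP, and NIH more heavily than the other datasets.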