Table 1 Dataset Characteristics

From: Shortcut learning in medical AI hinders generalization: method for estimating AI model generalization without external data

| Database | #Patients | #Samples | Modality | #Classes | Classes (Evaluated classes are in bold) |
|---|---|---|---|---|---|
| MIMIC-CXR | 65379 | 377095 | X-Ray | 14 | No Finding, Enlarged Cardiomediastinum, Cardiomegaly, Lung Opacity, Lung Lesion, Edema, Consolidation, Pneumonia, Atelectasis, Pneumothorax, Pleural Effusion, Pleural Other, Fracture, Support Devices |
| CXP | 64540 | 223414 | X-Ray | | |
| MIMIC-CXR+CXP | 98964 | 300000 | X-Ray | | |
| MIMIC-CXR+NIH | 62545 | 172562 | X-Ray | | |
| NIH | 30763 | 111788 | X-Ray | 15 | Atelectasis, Cardiomegaly, Consolidation, Edema, Pleural Effusion, Emphysema, Fibrosis, Hernia, Infiltration, Mass, No Finding, Nodule, Pleural Thickening, Pneumonia, Pneumothorax |
| COVID-Kaggle | 10192 | 21165 | X-Ray | 4 | COVID Positive, Normal, Lung Opacity, Viral Pneumonia |
| COVID-Internal | 2962 | 10731 | X-Ray | 2 | COVID Positive, Normal |
| ILD-Diag | 3182 | 4394 | CT | 2 | ILD, No ILD |
| ILD-Plan | 503 | 503 | CT | 2 | ILD, No ILD |
| PTB-XL ECG | 18885 | 21837 | ECG | 5 | Conduction Disturbance, Hypertrophy, Myocardial Infarction, ST/T Change, Normal ECG |
| LUDB | 157 | 157 | ECG | 2 | Conduction Disturbance, Hypertrophy |
| ICBHI | 97 | 866 | Auscultation | 2 | Normal, Abnormal |
| JUST | 243 | 243 | Auscultation | 2 | Normal, Abnormal |
| MIMIC-III | 4292 | 26244 | EHR | 2 | Readmission, no readmission |
| EHR-Int | 6292 | 7263 | EHR | 2 | Readmission, no readmission |
| Total unique: | 207487 | 805700 | | | |

1. Details of dataset sizes and label classes used in the experiments. All datasets were split 80/20, with the exception of the datasets used only for external validation: COVID-Internal, LUDB, and JUST. All dataset splits occurred at the patient level. Public datasets: MIMIC-CXR (Medical Information Mart for Intensive Care-Chest X-ray), MIMIC-III, Stanford CheXpert Chest X-rays (CXP), COVID-Kaggle, National Institutes of Health (NIH) X-ray dataset, Physikalisch-Technische Bundesanstalt-XL ECG (PTB-XL), Lobachevsky University Electrocardiography Database (LUDB), International Conference on Biomedical and Health Informatics (ICBHI 2017), Jordan University of Science and Technology Faculty of Computer and Information Technology & King Abdullah University Hospital (JUST). Internal datasets: Interstitial Lung Disease CT (ILD-Diag and ILD-Plan), COVID-Internal (COVID-Int), Electronic Health Record (EHR-Int) clinical discharge summaries.
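
The footnote's patient-level 80/20 split can be illustrated with a minimal Python sketch. This is not the authors' code: the function name `patient_level_split`, the seed, and the toy patient IDs are hypothetical, and the sketch assumes the 80/20 ratio is applied to patients rather than to samples, which the table does not specify.

```python
import random


def patient_level_split(sample_patient_ids, train_fraction=0.8, seed=0):
    """Split sample indices so that no patient appears in both partitions.

    sample_patient_ids: one patient ID per sample, in sample order.
    Returns (train_indices, test_indices).
    """
    # Shuffle the unique patients, not the samples, so that all samples
    # from one patient land on the same side of the split.
    patients = sorted(set(sample_patient_ids))
    random.Random(seed).shuffle(patients)

    # Assumption: the 80/20 ratio is taken over patients.
    n_train = round(train_fraction * len(patients))
    train_patients = set(patients[:n_train])

    train_idx = [i for i, p in enumerate(sample_patient_ids) if p in train_patients]
    test_idx = [i for i, p in enumerate(sample_patient_ids) if p not in train_patients]
    return train_idx, test_idx


# Toy data: 10 hypothetical patients with 1 to 3 samples each.
patient_ids = [f"p{n}" for n in range(10) for _ in range(1 + n % 3)]
train_idx, test_idx = patient_level_split(patient_ids)
```

Because patients contribute different numbers of samples, the resulting sample counts are only approximately 80/20 under this assumption. The point of splitting by patient is that no individual's data can appear in both the training and the held-out partition.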