Table 2 Summary of datasets and data splits. We summarize the sizes of the training, validation, and test sets in terms of the number of images used in our experiments. We used MIMIC-CXR for self-supervised pretraining and downstream classification, and CheXpert only to obtain an external test set. For NIH-14, we used its training set during downstream classification because its labels differ from those of MIMIC-CXR.

From: Multimodal masked siamese network improves chest X-ray representation learning

Dataset     Purpose               Training   Validation   Test
MIMIC-CXR   Internal validation   325,188    15,282       36,625
CheXpert    External validation   -          -            688
NIH-14      External validation   32,457     3,567        15,735