Table 3. Linear evaluation results for our proposed method using a ViT-S backbone on the MIMIC-CXR dataset. We report AUROC and AUPRC on the test set for the self-supervised methods, together with 95% confidence intervals (CI). The best results are shown in bold. \(\uparrow\) and \(\downarrow\) indicate that a model performs 0.5-1.5% better or worse, respectively, than the reference model (i.e., vanilla MSN). \(\uparrow \uparrow\) and \(\downarrow \downarrow\) indicate that it performs 1.5% or more better or worse, respectively, than the reference model.

From: Multimodal masked siamese network improves chest X-ray representation learning

| Pretraining | AUROC (95% CI) | AUPRC (95% CI) |
| --- | --- | --- |
| ImageNet | 0.703 (0.699, 0.706) \(\downarrow \downarrow\) | 0.269 (0.266, 0.272) \(\downarrow \downarrow\) |
| DINO | 0.714 (0.711, 0.718) \(\downarrow \downarrow\) | 0.278 (0.276, 0.281) \(\downarrow\) |
| MAE | 0.649 (0.645, 0.653) \(\downarrow \downarrow\) | 0.223 (0.221, 0.225) \(\downarrow \downarrow\) |
| MSN | 0.731 (0.727, 0.734) | 0.291 (0.289, 0.294) |
| MSN\(+ x_{sex}\) | 0.751\(^*\) (0.748, 0.754) \(\uparrow \uparrow\) | 0.311\(^*\) (0.309, 0.314) \(\uparrow \uparrow\) |
| MSN\(+ x_{age}\) | 0.746\(^*\) (0.743, 0.749) \(\uparrow \uparrow\) | 0.307\(^*\) (0.305, 0.310) \(\uparrow \uparrow\) |
| MSN\(+ x_{view}\) | 0.747\(^*\) (0.744, 0.750) \(\uparrow \uparrow\) | 0.307\(^*\) (0.305, 0.310) \(\uparrow \uparrow\) |
| MSN\(+ x_{pos}\) | 0.748\(^*\) (0.745, 0.752) \(\uparrow \uparrow\) | 0.306\(^*\) (0.303, 0.309) \(\uparrow \uparrow\) |
| MSN\(+ x_{mort}\) | 0.748\(^*\) (0.744, 0.751) \(\uparrow \uparrow\) | 0.308\(^*\) (0.306, 0.312) \(\uparrow \uparrow\) |
| MSN\(+ x_{icu}\) | 0.746\(^*\) (0.742, 0.749) \(\uparrow \uparrow\) | 0.305\(^*\) (0.303, 0.308) \(\uparrow\) |
| MSN\(+ x_{SD}\) | 0.751\(^*\) (0.748, 0.754) \(\uparrow \uparrow\) | 0.310\(^*\) (0.308, 0.313) \(\uparrow \uparrow\) |
| MSN\(+ x_{SM}\) | 0.744\(^*\) (0.741, 0.747) \(\uparrow\) | 0.306\(^*\) (0.303, 0.309) \(\uparrow \uparrow\) |
| MSN\(+ x_{IS}\) | 0.749\(^*\) (0.746, 0.752) \(\uparrow \uparrow\) | 0.308\(^*\) (0.305, 0.311) \(\uparrow \uparrow\) |
| MSN\(+ x_{SD+SM}\) | 0.742\(^*\) (0.739, 0.746) \(\uparrow\) | 0.302\(^*\) (0.300, 0.305) \(\uparrow\) |
| MSN\(+ x_{SD+SI}\) | 0.744\(^*\) (0.740, 0.747) \(\uparrow\) | 0.306\(^*\) (0.304, 0.309) \(\uparrow \uparrow\) |
| MSN\(+ x_{SM+SI}\) | 0.748\(^*\) (0.744, 0.751) \(\uparrow \uparrow\) | 0.307\(^*\) (0.304, 0.310) \(\uparrow \uparrow\) |
| MSN\(+ x_{SD+SM+SI}\) | 0.739\(^{\dagger }\) (0.736, 0.743) \(\uparrow\) | 0.301\(^*\) (0.299, 0.305) \(\uparrow\) |

  1. \(^*\) Statistically significant difference with respect to vanilla MSN (\(p < 0.001\)).
  2. \(^{\dagger }\) Statistically significant difference with respect to vanilla MSN (\(p < 0.01\)).
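
The arrow annotation described in the caption can be sketched as a small helper. This is an illustration only, not the authors' code: the function name `arrow` and its parameters are hypothetical, and it assumes that the percentages are absolute differences in the metric (e.g., 0.751 vs. 0.731 is a 2.0% gain) and that a difference of exactly 1.5% receives a double arrow. Both assumptions are inferred from the tabulated values rather than stated in the source.

```python
def arrow(score, reference, low=0.005, high=0.015):
    """Return an arrow marker for `score` relative to `reference`.

    Assumption (not stated in the source): differences are absolute
    differences in the metric, and the 1.5% threshold is inclusive.
    """
    # Round to the table's three decimals to avoid floating-point
    # noise right at the thresholds.
    diff = round(score - reference, 3)
    magnitude = abs(diff)
    if magnitude < low:
        return ""  # difference too small to annotate
    symbol = "↑" if diff > 0 else "↓"
    # Double arrow at or above the upper threshold, single arrow otherwise.
    return symbol * 2 if magnitude >= high else symbol


# AUROC of vanilla MSN, the reference model in the table.
msn_auroc = 0.731
for name, auroc in [("ImageNet", 0.703), ("MSN + x_sex", 0.751), ("MSN + x_SM", 0.744)]:
    print(name, auroc, arrow(auroc, msn_auroc))
```

Under these assumptions the helper reproduces the markers shown in the table, for example a single down arrow for the DINO AUPRC (0.278 against 0.291).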