Table 3 Results of the comparison between local and FFL-based training for 5 different datasets.

From: Collaborative training of medical artificial intelligence models with non-uniform labels

| Dataset name | Training set size | Included labels | Training setup | AUROC | P-value |
| --- | --- | --- | --- | --- | --- |
| VinDr-CXR | n = 15,000 | No finding, aortic enlargement, pleural thickening, cardiomegaly, pleural effusion | Local | 0.867 ± 0.045 | 0.001 |
| | | | FFL | 0.885 ± 0.049 | |
| ChestX-ray14 | n = 83,525 | Cardiomegaly, lung opacity, lung lesion, pneumonia, edema | Local | 0.744 ± 0.076 | 0.363 |
| | | | FFL | 0.744 ± 0.080 | |
| CheXpert | n = 126,141 | Cardiomegaly, lung opacity, lung lesion, pneumonia, edema | Local | 0.796 ± 0.064 | 0.243 |
| | | | FFL | 0.797 ± 0.061 | |
| MIMIC-CXR-JPG-v2.0 | n = 237,972 | Enlarged cardiomediastinum, consolidation, pleural effusion, pneumothorax, atelectasis | Local | 0.772 ± 0.072 | 0.004 |
| | | | FFL | 0.786 ± 0.066 | |
| UKA-CXR | n = 122,297 | Pleural effusion left, pleural effusion right, cardiomegaly, pneumonic infiltrates left, pneumonic infiltrates right | Local | 0.916 ± 0.031 | 0.001 |
| | | | FFL | 0.918 ± 0.031 | |

  1. AUROC values are averaged over all included labels of each dataset and were evaluated on the test benchmark of the corresponding dataset. The FFL process for each dataset was performed in combination with the other 4 datasets, with 5 different labels included for each dataset. Each P-value compares the local and FFL setups for the same dataset.
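
The label-averaged AUROC described in the footnote can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the toy labels and scores, and the reading of "±" as the sample standard deviation across labels are all assumptions made for illustration.

```python
from statistics import mean, stdev


def auroc(y_true, y_score):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win
    (equivalent to the Mann-Whitney U statistic divided by n_pos * n_neg).
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_auroc_over_labels(y_true_by_label, y_score_by_label):
    """Return (mean, spread, per-label AUROC) over all labels.

    Assumption: the spread is the sample standard deviation across labels.
    """
    per_label = {
        label: auroc(y_true_by_label[label], y_score_by_label[label])
        for label in y_true_by_label
    }
    values = list(per_label.values())
    spread = stdev(values) if len(values) > 1 else 0.0
    return mean(values), spread, per_label


# Hypothetical toy data: two labels, four test images each.
y_true = {
    "cardiomegaly": [1, 0, 1, 0],
    "pleural effusion": [0, 1, 1, 0],
}
y_score = {
    "cardiomegaly": [0.9, 0.2, 0.6, 0.7],
    "pleural effusion": [0.1, 0.8, 0.4, 0.3],
}

avg, spread, per_label = mean_auroc_over_labels(y_true, y_score)
print(f"AUROC = {avg:.3f} ± {spread:.3f}")
```

The pairwise formulation is quadratic in the number of test images, which is fine for illustration; for test sets of the sizes listed in the table, a rank-based implementation would be the more practical choice.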