Table 1 Model performance for each experiment

From: Self-supervised identification and elimination of harmful datasets in distributed machine learning for medical image analysis

| Experiment | Revisit cycle | Error | Accuracy | Sensitivity | Specificity | F1-score | AUROC | Best cycle^a |
|---|---|---|---|---|---|---|---|---|
| Pre-trained model | n/a | n/a | 67% | 72% | 63% | 67% | 73% | 10 |
| Clean baseline | | | 75% (71–75) | 75% (71–75) | 75% (70–75) | 73% (69–73) | 82% (79–82) | 26 (23–29) |
| Dirty baseline 1st scenario | | | 54% | 0% | 100% | 0% | 49% | 11 |
| Dirty baseline 2nd scenario | | | 54% | 0% | 100% | 0% | 45% | 11 |
| 1st scenario | 2 | 2% | 73% | 70% | 75% | 70% | 80% | 25 |
| | 2 | 3% | 70% | 75% | 65% | 70% | 78% | 20 |
| | 2 | 4% | 71% | 64% | 77% | 67% | 77% | 22 |
| | 2 | 5% | 71% | 60% | 80% | 66% | 78% | 20 |
| | 5 | 2% | 69% | 65% | 72% | 66% | 73% | 20 |
| | 5 | 3% | 71% | 63% | 78% | 67% | 78% | 20 |
| | 5 | 4% | 71% | 63% | 77% | 67% | 77% | 22 |
| | 5 | 5% | 71% | 64% | 77% | 67% | 77% | 22 |
| 2nd scenario | 2 | 2% | 72% | 73% | 70% | 70% | 78% | 21 |
| | 2 | 3% | 72% | 72% | 72% | 70% | 78% | 26 |
| | 2 | 4% | 55% | 9% | 91% | 16% | 48% | 12 |
| | 2 | 5% | 54% | 0% | 100% | 0% | 47% | 11 |
| | 5 | 2% | 72% | 78% | 67% | 72% | 78% | 26 |
| | 5 | 3% | 70% | 71% | 70% | 69% | 78% | 26 |
| | 5 | 4% | 55% | 9% | 91% | 16% | 48% | 12 |
| | 5 | 5% | 54% | 0% | 100% | 0% | 47% | 11 |
| Real 83 centers | 2 | 2% | 61% | 30% | 87% | 41% | 66% | 15 |
| | 2 | 3% | 72% | 71% | 73% | 71% | 78% | 22 |

^a Best cycle for the ablation studies and the Parkinson's disease classifier is counted on top of the pre-trained model, which was trained for 10 cycles. Thus, best cycle 11 indicates that the best model needed only one additional training cycle, while best cycle 26 indicates that 16 additional cycles were needed beyond the pre-trained model.
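The metrics reported in the table follow the standard binary-classification definitions. As a minimal sketch (not the authors' evaluation code), they can be computed from confusion-matrix counts; the example counts below are hypothetical, chosen only to illustrate the degenerate 0% sensitivity / 100% specificity pattern seen in the dirty baselines:

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # true-positive rate (recall)
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # true-negative rate
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if (precision + sensitivity) else 0.0)
    return accuracy, sensitivity, specificity, f1

# A classifier that labels every case negative (illustrative counts, not from
# the paper) yields 0% sensitivity, 100% specificity, and 0% F1 — the pattern
# the dirty baselines exhibit:
acc, sens, spec, f1 = classification_metrics(tp=0, fp=0, tn=54, fn=46)
```

A collapse to the majority class therefore still reports moderate accuracy while sensitivity and F1 drop to zero, which is why the table lists all four metrics alongside AUROC.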