Table 2. Benchmark results (3 repeated runs; mean ± std. dev.) on Fitzpatrick17k-C for all the experiments originally reported by Groh et al. [20].

From: Investigating the Quality of DermaMNIST and Fitzpatrick17k Dermatological Image Datasets

| Holdout Set | Verified | Random (Stratified) | Source A (Atl. Derm.) | Source B (DermaAmin) | FST 3–6 | FST 1–2 & 5–6 | FST 1–4 |
|---|---|---|---|---|---|---|---|
| # Train Images | 10,060 | 7,975 | 6,129 | 2,175 | 5,340 | 3,312 | 935 |
| # Validation Images | 1,119 | 1,139 | 703 | 220 | 599 | 337 | 122 |
| # Test Images | 215 | 2,280 | 2,395 | 6,832 | 5,029 | 6,685 | 6,903 |
| Best Hyperparameters (n_epochs, optim, lr) | (200, Adam, 1e-3) | (200, Adam, 1e-4) | (100, Adam, 1e-4) | (200, SGD, 1e-2) | (100, Adam, 1e-4) | (200, Adam, 1e-4) | (200, Adam, 1e-4) |
| Overall | 4.65% ± 0.00% | 24.05% ± 0.34% | 16.62% ± 0.86% | 5.16% ± 0.20% | 14.42% ± 0.13% | 17.27% ± 0.38% | 12.07% ± 0.28% |
| Type 1 | 3.51% ± 1.24% | 21.71% ± 0.47% | 22.55% ± 0.80% | 4.80% ± 0.38% | – | 15.05% ± 0.35% | 7.85% ± 0.36% |
| Type 2 | 6.92% ± 0.89% | 21.06% ± 0.50% | 18.57% ± 0.51% | 4.11% ± 0.38% | – | 16.43% ± 0.40% | 9.81% ± 0.27% |
| Type 3 | 0.90% ± 1.27% | 23.72% ± 0.68% | 14.39% ± 1.34% | 5.21% ± 0.42% | 17.18% ± 0.15% | – | 13.23% ± 0.54% |
| Type 4 | 8.33% ± 1.18% | 28.56% ± 0.93% | 15.32% ± 0.61% | 6.62% ± 0.52% | 13.16% ± 0.45% | – | 20.22% ± 0.30% |
| Type 5 | 1.45% ± 2.05% | 32.09% ± 0.77% | 19.43% ± 1.53% | 8.12% ± 0.90% | 12.23% ± 0.47% | 25.70% ± 0.50% | – |
| Type 6 | 4.55% ± 0.00% | 25.97% ± 1.06% | 15.32% ± 2.37% | 8.00% ± 1.08% | 9.29% ± 0.22% | 18.11% ± 0.45% | – |
| Groh et al. [20]: Overall | 26.7% | 20.2% | 27.4% | 11.4% | 13.8% | 13.4% | 7.7% |

The reported metrics are the overall accuracy ("Overall") and the FST-specific accuracy ("Type x" is the accuracy on images of Fitzpatrick skin tone x). Verified: the models were tested on a set of 215 images that a board-certified dermatologist verified to be diagnostic of the disease label. Random (Stratified): the test partition contains 20% of the dataset, randomly sampled with stratification on the disease labels. Source {A, B}: the models were tested on all the images from Atlas Dermatologico and DermaAmin, respectively. FST xx–yy: the models were tested on images with FST labels xx, …, yy. In every experiment, the training and validation partitions were drawn from the remaining images of Fitzpatrick17k-C. The results in the last row are reported verbatim from Groh et al. [20], whose training and evaluation partitions differ significantly from ours, so these metrics are not directly comparable.
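The FST holdout scheme and the reported metrics described in the note above can be illustrated with a minimal Python sketch. It is not the authors' code: the record layout (dicts with hypothetical `label` and `fst` keys), the function names, and the use of the population standard deviation over the repeated runs are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def fst_holdout(records, test_types):
    """Split records into (train_val, test) by FST label.

    Records whose FST label is in `test_types` form the test set; all
    remaining records are left for training and validation.
    """
    test = [r for r in records if r["fst"] in test_types]
    train_val = [r for r in records if r["fst"] not in test_types]
    return train_val, test


def accuracy_by_fst(records, predictions):
    """Return overall and per-FST accuracy as fractions in [0, 1]."""
    hits = defaultdict(list)
    for rec, pred in zip(records, predictions):
        correct = pred == rec["label"]
        hits["Overall"].append(correct)
        hits[f"Type {rec['fst']}"].append(correct)
    return {name: sum(flags) / len(flags) for name, flags in hits.items()}


def summarize_runs(run_accuracies):
    """Mean and std. dev. of one metric over repeated runs.

    Assumption: population std. dev. (divide by N); the source does not
    state which estimator was used.
    """
    return mean(run_accuracies), pstdev(run_accuracies)
```

For an "FST 3–6" column, one would call `fst_holdout(records, {3, 4, 5, 6})`, evaluate a trained model on the returned test set with `accuracy_by_fst`, and pass the three per-run values of each metric to `summarize_runs`. Because the test set then contains no Type 1 or Type 2 images, those metrics are simply absent from the result.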