Table 2. Benchmark results on Fitzpatrick17k-C (3 repeated runs; mean ± std. dev.) for all the experiments originally reported by Groh et al.²⁰.

Holdout Set	Verified	Random (Stratified)	Source A (Atl. Derm.)	Source B (DermaAmin)	FST 3–6	FST 1–2 & 5–6	FST 1–4
# Train Images	10,060	7,975	6,129	2,175	5,340	3,312	935
# Validation Images	1,119	1,139	703	220	599	337	122
# Test Images	215	2,280	2,395	6,832	5,029	6,685	6,903
Best Hyperparameters (n_epochs, optim, lr)	(200, Adam, 1e-3)	(200, Adam, 1e-4)	(100, Adam, 1e-4)	(200, SGD, 1e-2)	(100, Adam, 1e-4)	(200, Adam, 1e-4)	(200, Adam, 1e-4)
Overall	4.65% ± 0.00%	24.05% ± 0.34%	16.62% ± 0.86%	5.16% ± 0.20%	14.42% ± 0.13%	17.27% ± 0.38%	12.07% ± 0.28%
Type 1	3.51% ± 1.24%	21.71% ± 0.47%	22.55% ± 0.80%	4.80% ± 0.38%	—	15.05% ± 0.35%	7.85% ± 0.36%
Type 2	6.92% ± 0.89%	21.06% ± 0.50%	18.57% ± 0.51%	4.11% ± 0.38%	—	16.43% ± 0.40%	9.81% ± 0.27%
Type 3	0.90% ± 1.27%	23.72% ± 0.68%	14.39% ± 1.34%	5.21% ± 0.42%	17.18% ± 0.15%	—	13.23% ± 0.54%
Type 4	8.33% ± 1.18%	28.56% ± 0.93%	15.32% ± 0.61%	6.62% ± 0.52%	13.16% ± 0.45%	—	20.22% ± 0.30%
Type 5	1.45% ± 2.05%	32.09% ± 0.77%	19.43% ± 1.53%	8.12% ± 0.90%	12.23% ± 0.47%	25.70% ± 0.50%	—
Type 6	4.55% ± 0.00%	25.97% ± 1.06%	15.32% ± 2.37%	8.00% ± 1.08%	9.29% ± 0.22%	18.11% ± 0.45%	—
Groh et al.²⁰: Overall	26.7%	20.2%	27.4%	11.4%	13.8%	13.4%	7.7%

The reported metrics are the overall accuracy ("Overall") and the FST-specific accuracy ("Type x" corresponds to Fitzpatrick skin tone x). Verified: the models were tested on a set of 215 images that a board-certified dermatologist verified to be diagnostic of the disease label. Random (Stratified): the test partition contains 20% of the dataset, randomly sampled with stratification on the disease labels. Source {A, B}: the models were tested on all the images from Atlas Dermatologico and DermaAmin, respectively. FST xx − yy: the models were tested on images with FST labels xx, …, yy. For all experiments, the training and validation partitions were drawn from the remaining images in Fitzpatrick17k-C. The results in the last row are reported verbatim from Groh et al.²⁰, whose training and evaluation partitions differ significantly from ours, so these metrics are not directly comparable.
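As an illustration of how entries such as "Overall" and "Type x" could be derived, the following minimal sketch computes per-run overall and FST-specific accuracy and then aggregates repeated runs into a mean and standard deviation. It is not the authors' evaluation code: the function names `accuracy_by_fst` and `summarize_runs`, the toy labels, and the use of the population standard deviation are all assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean, pstdev


def accuracy_by_fst(y_true, y_pred, fst_labels):
    """Return accuracies for one run as fractions.

    Keys are "Overall" plus one "Type k" entry per FST label seen in the
    test set. (Hypothetical helper, not taken from the paper.)
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, fst in zip(y_true, y_pred, fst_labels):
        # Every test image counts toward "Overall" and toward its own FST group.
        for key in ("Overall", f"Type {fst}"):
            total[key] += 1
            correct[key] += int(truth == pred)
    return {key: correct[key] / total[key] for key in total}


def summarize_runs(per_run_metrics):
    """Aggregate a list of per-run metric dicts into {metric: (mean, std)}.

    Assumption: population standard deviation (ddof = 0); the source does
    not state which estimator was used.
    """
    summary = {}
    for key in per_run_metrics[0]:
        values = [run[key] for run in per_run_metrics]
        summary[key] = (mean(values), pstdev(values))
    return summary


# Toy data: four test images, two disease labels, two skin types, two runs.
y_true = ["a", "b", "a", "b"]
fst = [1, 1, 2, 2]
run_predictions = [
    ["a", "b", "a", "a"],
    ["a", "a", "a", "a"],
]

runs = [accuracy_by_fst(y_true, preds, fst) for preds in run_predictions]
summary = summarize_runs(runs)
for name, (avg, std) in summary.items():
    print(f"{name}: {100 * avg:.2f}% ± {100 * std:.2f}%")
```

A group whose FST label is absent from the test partition produces no entry at all, which matches the empty cells in the FST holdout columns of the table.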
