Introduction

Skin diseases are the most common reason for clinical consultations in studied populations1, affecting almost a third of the global population2,3. The 2013 Global Burden of Disease study found skin diseases to be the fourth leading cause of nonfatal disability globally, accounting for 41.6 million Disability-Adjusted Life Years and 39.0 million Years Lost due to Disability4. In the USA alone, the healthcare cost of skin diseases was estimated at $75 billion in 20165. Among skin diseases, skin cancer is particularly concerning and merits special attention due to its potential seriousness. With the increased incidence rates of skin cancer over the past decades6, coupled with the projected decline in the ratio of dermatologists to population5, automated systems for dermatological diagnosis can be immensely valuable.

Advances in deep learning (DL)-based methods for dermatological tasks have produced models that are approaching the diagnostic accuracy of experts, with some even mimicking the clinical approaches of hierarchical7,8,9 and differential10 diagnoses. The data-driven nature of these DL methods implies that large and diverse datasets are needed to train accurate, robust, and generalizable models. However, unlike natural computer vision datasets, medical image datasets are relatively small, primarily because of the large costs associated with image acquisition and annotation and because of legal, ethical, and privacy concerns11, and they are more cost prohibitive to expand12. This is also true for skin cancer image datasets13,14, where the surge in skin image analysis research over the past decade can be attributed in part to recent publicly available datasets, most notably the datasets and challenges of the International Skin Imaging Collaboration (ISIC) and the associated HAM1000015 and BCN2000016 datasets, which are primarily dermoscopic image datasets, as well as other clinical image datasets such as SD-19817, SD-26018, derm7pt19, and Fitzpatrick17k20.

Although large datasets are important for the development of reliable models, the quality of the data therein and its correct use are equally important21,22,23,24: low-quality data may result in inefficient training and in inaccurate models that exhibit biases, poor generalizability, and low robustness, and may negatively affect the interpretability of such models. Data quality can be affected by several factors: mislabeled images, data leakage across training and evaluation partitions, the absence of a held-out test partition, etc. The issue of data leakage, in particular, is quite widespread: a recent survey25 of 17 fields spanning 294 articles, on topics ranging from medicine and bioinformatics to information technology operations and computer security, showed that ML adoption in all these fields suffers from data leakage. In an analysis of 10 popular natural computer vision, natural language, and audio datasets, Northcutt et al.26 estimated an average label error rate of at least 3.3%. In medical image analysis, too, investigations into the use of machine learning best practices have found several instances of incorrect data partitioning and feature leakage between training and evaluation partitions. Oner et al.27 showed that a peer-reviewed article published in Nature Medicine using DL for histopathology image analysis suffered from data leakage by using slide-level stratification for data partitioning instead of patient-level stratification. In a large-scale study, Bussola et al.28 showed how DL models can exhibit considerably inflated performance measures when evaluated on datasets where histopathology image patches from the same subject are present in both training and validation partitions. In mammography analysis, Samala et al.29,30 showed the risks of feature leakage between training and validation partitions and how this could lead to an overly optimistic performance on the validation partition compared to a completely held-out test partition. Similar investigations have been conducted on the adverse effects of incorrect data partitioning on test performance in optical coherence tomography (OCT) image classification31, brain magnetic resonance imaging (MRI) classification32, and longitudinal brain MRI analysis33.

Specific to skin image analysis, our previous work34 showed that the popular ISIC Skin Lesion Segmentation Challenge datasets from 2016 through 2018 have considerable overlap among their training partitions, with 706 images present in all three datasets’ training splits, a surprising discovery since ISIC 2016 has only 900 training images. Cassidy et al.35 analyzed the ISIC Skin Lesion Diagnosis Challenge datasets from 2016 to 2020, found overlap and duplicates across these datasets, and used a duplicate removal strategy to curate new clean training, validation, and testing sets. Vega et al.36 found that a popular monkeypox skin image dataset, used in several peer-reviewed publications, contained “medically irrelevant images” and that models trained on these images did not necessarily rely on features underlying the diseases. Very recently, Groger et al.37 carried out an analysis of six dermatology skin datasets (MED-NODE, PH2, DDI, derm7pt, PAD-UFES-20, and SD-128), detecting and removing near duplicates and “irrelevant samples” from them. However, all the datasets in their study were small, with 3 of the 6 (MED-NODE, PH2, DDI) consisting of fewer than 700 images and the largest (SD-128) containing 5,619 images.

In this paper, we examine three popular and large skin image analysis datasets: the DermaMNIST dataset38,39, its source HAM10000 dataset15 (10,015 images), and the Fitzpatrick17k dataset20 (16,577 images). We perform systematic analyses of these datasets and report instances of data duplication, data leakage across training and evaluation partitions, and data mislabeling, fixing them where possible. Our corrected datasets and detailed analysis results are available online on Zenodo40.

Results

DermaMNIST

Released as a biomedical imaging counterpart to the MNIST dataset of handwritten digits, MedMNIST consists of images from standardized biomedical imaging datasets resized to an MNIST-like 28 × 28 resolution. Despite being a fairly new dataset, it has been quite popular (1,009 citations as of November 2024: 633 for the more recent 2023 paper39 and 376 for the older 2021 version38). The dermatological subset of MedMNIST, DermaMNIST, contains resized images from the popular “Human Against Machine with 10000 training images” (HAM10000) dataset15. HAM10000 contains 10,015 dermoscopic images of pigmented skin lesions collected from patients at two study sites in Australia and Austria, with their diagnoses confirmed by histopathology, confocal microscopy, clinical follow-up visits, or expert consensus. The 7 disease labels in the dataset cover 95% of the lesions encountered in clinical practice15. These properties make HAM10000 a good candidate dataset for dermatological analysis, as intended with DermaMNIST. While the “lightweight” nature of DermaMNIST due to its “small size” is appealing for its adoption in machine learning for biomedical imaging39, the low spatial resolution (28 × 28) does not capture sufficient morphological structure of skin lesions compared to the source HAM10000 images. Despite this, DermaMNIST has been used for a wide variety of applications in peer-reviewed publications: semi- and self-supervised learning41,42,43, federated learning44,45,46, privacy-preserving learning47,48, neural architecture search49,50, adversarially robust learning51,52, data augmentation53,54, generative modeling55, model interpretability56, AutoML57, active learning58, quantum vision transformers59, and biomedical vision-language foundation models60,61,62, as well as derivative benchmark datasets63,64. However, as we investigate below, the resulting DermaMNIST and its benchmarks suffer from serious flaws.

Data Leakage

A caveat of HAM10000, despite its rather large size, is that it contains multiple images of the same lesion captured either from different viewing angles or at different magnification levels (Fig. 1(a)), i.e., the number of lesions with unique lesion IDs (HAM_xxx) is smaller than the number of images with unique image IDs (ISIC_xxx). We visualize the frequency counts of lesions and how many images of the same lesion are present in HAM10000 in Fig. 1(b) and observe that the 10,015 images are in fact derived from only 7,470 unique lesions, and that 1,956 of these lesion IDs (~26.18%) contain 2 or more images: 1,423 lesions have 2 images, 490 lesions have 3 images, 34 lesions have 4 images, 5 lesions have 5 images, and 4 lesions have 6 images each. Unfortunately, this was not accounted for when preparing the train-valid-test splits for DermaMNIST. Because DermaMNIST is released as pre-processed NumPy arrays without image filenames, we contacted the authors to obtain the exact training-validation-testing split filenames and confirmed that there is considerable data leakage of images of the same lesion across partitions. Fig. 1(a) shows 2 examples where images of the same lesion are present in the training, validation, and testing partitions. This issue is quite pervasive across DermaMNIST, and our analysis found the following overlaps across partitions: train-test: 886 images (641 lesions); train-valid: 440 images (332 lesions); valid-test: 128 images (113 lesions); train-valid-test: 51 images (40 lesions) (Fig. 1(c)). Such data leakage naturally raises concerns about the reliability of the DermaMNIST benchmarks and related studies.
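As a minimal illustration of how this lesion-level multiplicity can be quantified, the following sketch counts images per lesion using pandas; it assumes the publicly available HAM10000 metadata CSV with lesion_id and image_id columns (the filename is an assumption).

```python
import pandas as pd

# Sketch: count how many images each lesion has in HAM10000 (cf. Fig. 1(b)).
# Assumes the metadata filename and its lesion_id/image_id columns.
meta = pd.read_csv("HAM10000_metadata.csv")
images_per_lesion = meta.groupby("lesion_id")["image_id"].nunique()

print(images_per_lesion.value_counts().sort_index())  # lesions with 1, 2, ... images
print((images_per_lesion >= 2).sum(), "lesions have 2 or more images")
```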

Fig. 1
figure 1

DermaMNIST analyses: (a,b) show instances of and reasons for the data leakage, and (c) visualizes how the three datasets: DermaMNIST, DermaMNIST-C, and DermaMNIST-E differ in their partition composition, yet have similarly proportionate diagnosis distributions. Images from DermaMNIST are licensed under CC BY-NC 4.038,39. Best viewed online.

We correct this data leakage by simply moving all images of a lesion ID present in the train partition from valid and test partitions back to the train partition. We choose to do this (i.e., moving the images of a lesion to the training set) instead of removing the images completely, to ensure that images of the same lesion are in one partition while also not discarding any images. Although this has the undesirable side effect of increasing the training partition size at the cost of reduced validation and testing partition sizes, it fixes the data leakage issue by ensuring there is no overlap across partitions.

Fig. 2
figure 2

Visualizing how DermaMNIST’s incorrect resizing operation leads to loss of information. DermaMNIST’s approach (top row) to generating 224 × 224 images results in visibly pixelated images. Our approach (bottom row), used for both DermaMNIST-C and DermaMNIST-E, retains much more detailed information. Images from DermaMNIST are licensed under CC BY-NC 4.038,39. Best viewed online.

Next, we examine the accuracy of HAM10000’s metadata. While the images in HAM10000 and their labels have been collected from clinical sites and the dataset itself has been widely adopted for online challenges65,66 and human-in-the-loop evaluations67,68, we found that some images with different lesion IDs are in fact duplicates and should have been assigned the same lesion ID. Therefore, for a systematic analysis, we use fastdup69, an open-source Python library for analyzing visual datasets at scale, and calculate inter-image embedding similarity scores \({\mathscr{S}}({x}_{i},{x}_{j})\) for all \(\left(\begin{array}{c}10,015\\ 2\end{array}\right)\) pairs of images \(({x}_{i},{x}_{j})\). Figure 3 visualizes, as a confusion matrix, the following four scenarios that are possible when comparing duplication checks based on the metadata with duplicates detected using fastdup followed by visual human confirmation:

  • “Confirmed duplicates”: image pairs where both images share the same lesion IDs in the metadata and are indeed images of the same lesion.

  • “True non-duplicates”: image pairs where both images differ in their lesion IDs and are images of different lesions.

  • “Missed duplicates”: image pairs where both images differ in their lesion IDs but actually belong to the same lesion.

  • “False duplicates”: image pairs where both images share the same lesion ID but are actually images of different lesions.

Fig. 3
figure 3

Visualizing the four scenarios that a pair of images from HAM10000 can be assigned to in duplicate detection, based on the metadata and the fastdup-based duplicate detection followed by manual review. “Confirmed duplicates”, as the name suggests, are pairs that are images of the same lesion, indicated by the same lesion IDs in the metadata. Similarly, “True non-duplicates” are pairs of images that belong to different lesions. “Missed duplicates” refer to image pairs that have differing lesion IDs according to the metadata, but their high visual similarity (measured by cosine similarity of their image embeddings) followed by manual review confirms that these are indeed images of the same lesion, and were therefore ‘missed’ by the metadata. Finally, “False duplicates” refer to pairs where images share the same lesion IDs but do not belong to the same lesion. In our analysis, we did not find any instances of “False duplicates” in HAM10000. For all these sample images, the image IDs and the lesion IDs are along the horizontal and the vertical axis, respectively. Images from HAM10000 are licensed under CC BY-NC 4.015.

The image pairs that lie along this confusion matrix’s diagonal, i.e., “confirmed duplicates” and “true non-duplicates”, are those where the metadata agrees with our analysis, and the possible errors in metadata arise out of the other two scenarios.

For detecting errors of the first kind, i.e., “missed duplicates”, we analyze the top 1,000 most similar image pairs, measured by the similarity of their image embeddings. We examine these 1,000 pairs in intervals of 100 pairs, ordered by decreasing inter-image similarity. For the 100 image pairs in each interval, we look up the HAM10000 metadata for both images in a pair to exclude “confirmed duplicates”, since these duplicates are already accounted for in the metadata. We manually review the remaining pairs to determine which, if any, are “missed duplicates” and which are “true non-duplicates”. We visualize these counts in Fig. 4. Of the 1,000 most similar image pairs in HAM10000, we discover 18 “missed duplicates” image pairs that were not accounted for in the metadata, visualized in Fig. 5. Moreover, the fraction of “true non-duplicates” increases monotonically as we move through the intervals of 100 image pairs, going from 0% (0 of 2 candidate pairs) in the top 100 most similar pairs to 100% (all 64 candidate pairs) in the 501st to 600th most similar pairs. This may be explained by the narrow field of view of dermoscopic images (HAM10000 contains dermoscopic images), which provides fewer visual cues for the accurate detection of duplicates. Additionally, the lack of any “missed duplicates” after the top 500 most similar pairs (Fig. 4) suggests it is highly unlikely that, apart from the 18 duplicate pairs discovered (Fig. 5), there are other undetected duplicate image pairs in HAM10000.
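A minimal sketch of this interval-based review, assuming pairs is a DataFrame of the top 1,000 most similar pairs returned by fastdup (columns image_a, image_b, similarity; the names are assumptions) and meta is the HAM10000 metadata:

```python
import pandas as pd

# Map each image ID to its lesion ID using the HAM10000 metadata.
image_to_lesion = meta.set_index("image_id")["lesion_id"]

top = pairs.sort_values("similarity", ascending=False).head(1000).reset_index(drop=True)
top["same_lesion"] = (
    top["image_a"].map(image_to_lesion).values == top["image_b"].map(image_to_lesion).values
)

# In each interval of 100 pairs, pairs not already explained by the metadata
# are candidates for manual review ("Missed duplicates" vs. "True non-duplicates").
for start in range(0, 1000, 100):
    chunk = top.iloc[start:start + 100]
    n_candidates = (~chunk["same_lesion"]).sum()
    print(f"pairs {start + 1}-{start + 100}: {n_candidates} pairs sent for manual review")
```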

Fig. 4
figure 4

Analysis of the top 1,000 most similar image pairs in HAM10000 detected by fastdup: in intervals of 100 pairs, we calculate how many of these purported duplicate image pairs are not already accounted for in the HAM10000 metadata, and manually review those to determine which are “Missed duplicates” (i.e., pairs where the two images have different lesion IDs but are actually images of the same lesion; Fig. 3) and which are “True non-duplicates” (i.e., pairs where the two images have different lesion IDs and are indeed images of different lesions; Fig. 3). For example, looking at the 301–400 range, we find that among the 301st to the 400th most similar image pairs detected by fastdup, 44 pairs contained images that did not belong to the same lesion ID according to the HAM10000 metadata. Of these 44 pairs, manual inspection revealed 3 pairs to be newly discovered “Missed duplicates”, whereas the remaining 41 pairs were images of different lesions and were therefore “True non-duplicates”. In total, 18 “Missed duplicates” image pairs were detected in HAM10000; they are visualized in Fig. 5.

Fig. 5
figure 5

Visualizing the 18 “Missed duplicates” (Fig. 3) in HAM10000 obtained through the analysis of the top 1,000 most similar image pairs (Fig. 4). These 18 pairs of images (image IDs along the horizontal axis) should belong to different lesions (lesion IDs along the vertical axis) according to the metadata, but manual review shows that both images in each pair belong to the same lesion and are thus duplicate image pairs. Images from HAM10000 are licensed under CC BY-NC 4.015.

Figure 6 shows an interesting discovery made during the manual review of the highly similar image pairs. We found two instances of image pairs with high similarity scores, but manual inspection confirmed that the lesions had minor morphological differences. HAM10000 contains images acquired during follow-up clinical visits15, and it is possible that these highly similar image pairs are images of the same lesion acquired at different times.

Fig. 6
figure 6

Two image pairs (image IDs along the horizontal axis) that have a high visual similarity but belong to different lesion IDs (vertical axis) according to the HAM10000 metadata. Upon closer inspection, the images are near duplicates but exhibit inconspicuous differences, and are possibly images of the same lesion acquired at different times. Images from HAM10000 are licensed under CC BY-NC 4.015.

Next, we check for errors of the second kind, i.e., “false duplicates”. For all lesion IDs with more than 1 image per lesion, we measure the image similarity between images that belong to the same lesion and manually review the 5 least similar image pairs, as visualized in Fig. 7. We find no errors: all image pairs, despite their low similarities, indeed belong to the same lesion, and the visual dissimilarities can be attributed to one or more of: zoom and crop levels, rotation, flipping, and artifacts such as gel bubbles and rulers.

Fig. 7
figure 7

Visualizing the least similar image pairs that belong to the same lesion in HAM10000 to check for “False duplicates” (Fig. 3). For each lesion ID with {2, 3, 4, 5, 6} images per lesion (Fig. 1(b)), we look at pairs of images (image IDs along the horizontal axis) that belong to the same lesion (lesion ID along the vertical axis) but have the lowest similarity scores calculated using fastdup. We do this to detect images with mislabeled lesion IDs, since images of two different lesions that have been assigned the same lesion ID would be dissimilar and therefore have a low similarity score. In each row, we visualize the 5 least similar image pairs that share a lesion ID, for lesions that have {2, 3, 4, 5, 6} images (Fig. 1(b)). We observe that all the image pairs indeed belong to the same lesion, and the low similarity scores can be easily explained by different zoom levels and/or geometric transformations (e.g., rotation and flipping). Images from HAM10000 are licensed under CC BY-NC 4.015.

Of these 18 newly discovered duplicate image pairs in HAM10000, 7 pairs leak across partitions: train-test: 5 images (5 lesions) and train-valid: 2 images (2 lesions). We correct this data leakage by moving both images in each of these 7 pairs to the train partition.

We name this corrected dataset version DermaMNIST-C. The relative diagnosis-wise distribution of DermaMNIST-C across partitions is quite similar to that of the original DermaMNIST (Fig. 1(c)). We benchmark DermaMNIST-C using DermaMNIST’s publicly available disease classification codebase, repeating all experiments 3 times for robustness; the results are presented in Table 1.

Table 1 Benchmark results (3 repeated runs; mean  ± std. dev.) of DermaMNIST and the 2 proposed versions: DermaMNIST-C and DermaMNIST-E.

Results on 224 × 224 resolution

DermaMNIST is created by resizing images from HAM10000’s original 600 × 450 spatial resolution to an MNIST-like 28 × 28 resolution using (bi)cubic spline interpolation. However, for their classification benchmark experiments at the 224 × 224 resolution, instead of resizing the original images to 224 × 224, the authors38,39 upsample the low-resolution 28 × 28 images to 224 × 224 using nearest-neighbor interpolation (“224 (resized from 28)”; verifiable through their source code70). Unsurprisingly, this produces visibly pixelated and blurry images, since the information lost when downsampling from the original 600 × 450 to 28 × 28 is unrecoverable, leading to a significant loss of detail (e.g., dermoscopic structures and artifacts) in the images used to train models at 224 × 224 (Fig. 2). Our approach of directly downsampling from the original high-resolution images to 224 × 224 to create DermaMNIST-C and DermaMNIST-E, used when reporting the results in Table 1, yields conspicuously more detailed images (Fig. 2).

Extending DermaMNIST

Finally, although DermaMNIST-C is a good lightweight dataset for evaluating machine learning models on dermatological tasks and for educational purposes, as MedMNIST was intended to be, the quantitative results on DermaMNIST-C (and on DermaMNIST, for that matter) perhaps paint a deceptively optimistic picture of the state of automated dermatological diagnosis models. We therefore propose a more challenging extension of DermaMNIST named DermaMNIST-E. The original DermaMNIST and its corrected version DermaMNIST-C are based on HAM10000, which was used as the training partition for the ISIC 2018 Challenge. However, apart from the 10,015 training images from HAM10000, the ISIC 2018 Challenge had separate validation and testing partitions containing 193 and 1,512 images, respectively. Therefore, we create DermaMNIST-E with all of DermaMNIST as the training set and the ISIC 2018 validation and testing partitions as the validation and testing sets, respectively. Although the official testing partition of ISIC 2018 contained 1,512 images, we remove one image known as the “easter egg” (ISIC_0035068)68, resulting in a total of 1,511 images. While the resulting diagnosis distribution across partitions for DermaMNIST-E is similar to that of DermaMNIST-C and DermaMNIST (Fig. 1(c)), our benchmark results on DermaMNIST-E (Table 1) show that it is indeed a more challenging dataset, and it is guaranteed to be devoid of any data leakage. It should be noted that the DermaMNIST-E dataset is almost the same as the official partitions of the ISIC 2018 Challenge data, with 2 distinctions: the images in DermaMNIST-E are resized (28 × 28 or 224 × 224) and the “easter egg” image has been removed from the testing partition.

A summary of the three datasets: DermaMNIST, DermaMNIST-C, and DermaMNIST-E is presented in Table 4, listing a brief description and statistics of the datasets.

Fitzpatrick17k

Released in 2021, Fitzpatrick17k is one of the largest publicly available datasets of clinical skin disease images. The large number of skin diseases covered (114), the in-the-wild nature of the images, and the availability of associated and diverse Fitzpatrick skin tone (FST) labels71 make it an immensely valuable dataset for skin image analysis research. However, unlike DermaMNIST, which was collected during clinical visits and whose labels were confirmed, Fitzpatrick17k was curated from 2 publicly available online dermatological atlases: DermaAmin72 (12,672 images) and Atlas Dermatologico73 (3,905 images). As such, the diagnosis labels of these images are not confirmed, through histopathology or otherwise. The authors conducted a small-scale study on only 3.04% of the entire dataset (504 of 16,577 images), in which 2 board-certified dermatologists assessed the diagnoses of the images; the consensus was that only 69.0% of the images were clearly diagnostic of the disease label and, more importantly, that 3.4% of the images were mislabeled. This is problematic since Fitzpatrick17k has been used to train models for high-stakes applications such as model explainability74,75, trustworthiness76, skin tone detection77, model calibration78, and fairness79,80,81,82,83,84. Fitzpatrick17k has also been used for training and evaluating large vision-language models85,86,87,88,89, visual question answering90,91, clinical decision support for differential diagnosis92, generative modeling93,94,95, federated learning96, and for creating a derivative dataset: SkinCon97. Pakzad et al.80 previously highlighted the existence of erroneous and wrongly labeled images in Fitzpatrick17k, and for these reasons, we investigate the extent of labeling inaccuracy in this dataset.

Data duplication and leakage

To investigate the presence of duplicates in Fitzpatrick17k, we use fastdup to calculate inter-image embedding similarity scores \({\mathscr{S}}({x}_{i},{x}_{j})\) for all \(\left(\begin{array}{c}16,577\\ 2\end{array}\right)\) pairs of images \(({x}_{i},{x}_{j})\). These are shown as a 16,577 × 16,577 similarity matrix in Fig. 8, where several pairs with high similarities, denoted by darker shades, are spread throughout the dataset. For subsequent analyses, we restrict ourselves to pairs with a high similarity by setting thresholds \({\mathscr{S}}\ge \tau ;\,\tau \in \{0.90,0.95\}\). The distributions of image embedding pairs at these similarity thresholds are shown in Fig. 9(a,b), respectively: there are 6,622 and 1,425 image pairs with similarity scores greater than 0.90 and 0.95, respectively. Manual verification by a human reviewer of the 1,425 image pairs whose embeddings had similarities greater than 0.95 revealed that 98.39% of these pairs (1,402 pairs) were indeed duplicates, with 16 pairs (1.12%) being false positives and 7 pairs (0.49%) being ambiguous. A second reviewer agreed with 1,419 of the first reviewer’s labels (99.58% match), exhibiting a near-perfect agreement98,99 (Cohen’s kappa κ = 0.87). Due to the publisher’s human data policy, we are unable to visualize these samples in the published manuscript; instead, we direct the readers to our arXiv pre-print100, where we visualize some of these pairs, categorized according to the traits they exhibit. Since the filenames of the images in Fitzpatrick17k are of the format {MD5hash}.jpg, for each image we also display its diagnosis abbreviation, its FST label (set to ‘N/A’ when the FST label is missing), and a truncated MD5 hash to uniquely identify the image. Notice that duplicate image pairs exist because of:

  • different crop/zoom levels,

  • different illumination setups,

  • different image resolutions, and

  • simple geometrical transformations (e.g., mirroring).

Fig. 8
figure 8

Inter-image similarity matrix (16,577 × 16,577) computed and visualized for all pairs of images in the Fitzpatrick17k dataset, where image pairs with higher similarity are represented by darker colors. Note how there are several regions of dark-colored pairs, indicating the presence of potential duplicates in the dataset.

Fig. 9
figure 9

Visualizing the distributions of duplicates in Fitzpatrick17k, filtered by different combinations of criteria. Total counts are inset in each plot. \({{\mathscr{S}}}_{0.90}\) and \({{\mathscr{S}}}_{0.95}\) denote pairs with similarity scores of at least 0.90 and 0.95, respectively. \(\widehat{{\mathscr{D}}}\) denotes pairs that differ in their diagnosis labels. \({\widehat{{\mathscr{F}}}}^{\ge 1}\) and \({\widehat{{\mathscr{F}}}}^{ > 1}\) denote pairs whose FST labels differ by at least 1 and by more than 1, respectively. Best viewed online.

Worryingly, duplicate image pairs containing multiple disjoint objects of interest or multiple people also exist, making it difficult to determine to which of these the diagnosis and the FST labels apply. Finally, several duplicate pairs with more than one of these issues were also detected.

For a more detailed duplicate detection, we employ another Python library, cleanvision101, to further assess the dataset. Aside from the duplicate image pairs found by fastdup based on our similarity threshold of \({\mathscr{S}}\ge 0.90\), the cleanvision analysis found 19 more duplicate pairs that have slightly lower inter-image similarity (\(0.85\le {\mathscr{S}} < 0.90\)), primarily because of the large difference in spatial resolutions between duplicate pairs.

Unfortunately, data duplication in Fitzpatrick17k is not limited to image pairs. We use fastdup to cluster images whose intra-cluster image similarity is greater than 0.90, i.e., clusters of images where the mean similarity over all image pairs in the cluster is greater than 0.90. We consider clusters of at least 3 images, since those with 2 images (i.e., duplicate image pairs) are already covered in our analysis of duplicate pairs. Mathematically, we find all image clusters \(\{{x}_{1},{x}_{2},\ldots ,{x}_{N}\}\); N ≥ 3 such that \(\frac{1}{\left(\begin{array}{c}N\\ 2\end{array}\right)}{\sum }_{i=1}^{N}{\sum }_{j=i+1}^{N}{\mathscr{S}}({x}_{i},{x}_{j}) > 0.90\). Manual verification of the clustering outputs yielded 139 image clusters, with 3.71 ± 1.11 images per cluster on average. Visual inspection of these image clusters (we direct the readers to our arXiv pre-print100 for these visualizations, since they could not be included in the published manuscript) shows that they exhibit the same traits as the duplicate pairs, i.e., the images in each cluster are one or more of: exact matches, zoomed-in or cropped-out duplicates, duplicates with different illumination setups (captured with and without camera flash), or acquired at slightly different viewing angles. Finally, we merge the results of duplicate pairs with duplicate clusters, forming larger clusters as they are discovered. This results in some large duplicate image clusters with as many as 10 images in a cluster, and a total of 2,297 clusters with 2.18 ± 0.66 images per cluster on average.
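The merging step can be implemented as a connected-components computation; a minimal union-find sketch, assuming duplicate_pairs and duplicate_clusters hold the manually verified fastdup outputs (both names are assumptions), is shown below.

```python
from collections import defaultdict

# Sketch: merge verified duplicate pairs and clusters into connected components.
# duplicate_pairs: list of (id_a, id_b); duplicate_clusters: list of lists of image IDs.
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

def union(a, b):
    ra, rb = find(a), find(b)
    if ra != rb:
        parent[ra] = rb

for a, b in duplicate_pairs:
    union(a, b)
for cluster in duplicate_clusters:
    for member in cluster[1:]:
        union(cluster[0], member)

# Group images by their root to obtain the final, merged duplicate clusters.
merged = defaultdict(set)
for image_id in list(parent):
    merged[find(image_id)].add(image_id)
final_clusters = [c for c in merged.values() if len(c) >= 2]
```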

Mislabeled diagnosis and FST labels

In addition to the presence of duplicates, Fitzpatrick17k contains images with mislabeled diagnoses and FSTs. For a more concrete estimation of the extent of mislabeling, we use similarity thresholds of 0.90 (denoted by \({{\mathscr{S}}}_{0.90}\)) and 0.95 (denoted by \({{\mathscr{S}}}_{0.95}\)) and report the number of image pairs that exceed these thresholds but differ in their labels. Further, given the subjectivity of FST labels, Groh et al.20 evaluated the accuracy of human annotations (HA) against the gold standard (GT) subset using two metrics: accuracy and “off-by-one” accuracy, where the latter considers an annotation to be correct if \(| {{\mathscr{F}}}_{{\rm{HA}}}-{{\mathscr{F}}}_{{\rm{GT}}}| \le 1\). Similar to previous works that accounted for this “off-by-one” margin77,80, we count similar image pairs whose FST labels differ by at least 1 (\({\widehat{{\mathscr{F}}}}^{\ge 1}\)) and those whose FST labels differ by strictly more than 1 (\({\widehat{{\mathscr{F}}}}^{ > 1}\)). Visualizations in our arXiv pre-print100 show sample pairs with very high inter-image similarity (\({\mathscr{S}} > 0.95\)) that differ in their diagnoses and in their FST labels by 1 and by more than 1, respectively. Fig. 9 shows the distributions of duplicate image pairs filtered by one or more of: their similarity scores, whether their diagnoses differ, and whether and by how much their FST labels differ. For image pair similarity thresholds of [0.90; 0.95], there are [2498; 93] image pairs that differ in their diagnoses (\(\widehat{{\mathscr{D}}}\)). [4030; 803] image pairs differ in their FST labels by at least 1 (\({\widehat{{\mathscr{F}}}}^{\ge 1}\)), while [1236; 199] pairs differ by more than 1 (\({\widehat{{\mathscr{F}}}}^{ > 1}\)). [4947; 841] image pairs differ in either their diagnosis or their FST label (\(\{\widehat{{\mathscr{D}}}\cup {\widehat{{\mathscr{F}}}}^{\ge 1}\}\)), and [3172; 277] differ in their diagnosis or differ in their FST label by more than 1 (\(\{\widehat{{\mathscr{D}}}\cup {\widehat{{\mathscr{F}}}}^{ > 1}\}\)). Finally, there are [1581; 55] and [562; 15] image pairs for the \(\{\widehat{{\mathscr{D}}}\cap {\widehat{{\mathscr{F}}}}^{\ge 1}\}\) and \(\{\widehat{{\mathscr{D}}}\cap {\widehat{{\mathscr{F}}}}^{ > 1}\}\) categories, respectively.
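A minimal sketch of how these counts can be reproduced, assuming pairs holds one row per high-similarity pair with its similarity score and the diagnosis (dx_a, dx_b) and FST (fst_a, fst_b) labels looked up from the metadata (the column names are assumptions):

```python
# Count label-conflicting pairs at the two similarity thresholds (cf. Fig. 9).
for tau in (0.90, 0.95):
    p = pairs[pairs["similarity"] >= tau]
    diff_dx = p["dx_a"] != p["dx_b"]                         # differing diagnoses
    fst_gap = (p["fst_a"] - p["fst_b"]).abs()
    diff_fst_ge1, diff_fst_gt1 = fst_gap >= 1, fst_gap > 1   # FST differs by >=1 / >1
    print(f"tau={tau}: dx={diff_dx.sum()}, "
          f"dx|fst>=1={(diff_dx | diff_fst_ge1).sum()}, "
          f"dx&fst>1={(diff_dx & diff_fst_gt1).sum()}")
```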

Erroneous images

In a recent study, Pakzad et al.80 reported the presence of erroneous or outlier non-skin images in Fitzpatrick17k. Using an outlier detection approach based on distance to the nearest neighbors in the embedding space, we rank images in the dataset based on their probability of being an outlier. Sample outliers include non-dermatological imaging modalities (e.g., histopathology, radiology, microscopy, fundus), images of plants (leaves, trees) and animals (e.g., rodents, bugs, poultry), etc. Worryingly, Fitzpatrick17k does not contain information regarding which images are non-dermatological, which consequently impacts the training and evaluation of models.
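A minimal sketch of one such outlier ranking, assuming the fastdup embeddings are available as a NumPy array named embeddings aligned with a list image_ids (both names are assumptions); the choice of 5 neighbors is illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Images whose embeddings are far from their nearest neighbors are more likely
# to be erroneous/non-dermatological outliers.
nn = NearestNeighbors(n_neighbors=6, metric="cosine").fit(embeddings)
distances, _ = nn.kneighbors(embeddings)       # column 0 is the image itself (distance 0)
outlier_score = distances[:, 1:].mean(axis=1)  # mean distance to the 5 nearest neighbors
ranking = np.argsort(outlier_score)[::-1]      # most outlier-like images first
print([image_ids[i] for i in ranking[:10]])
```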

Non-standardized data partitioning

The Fitzpatrick17k benchmarks by Groh et al.20, as well as several works that followed75,79,82,84, also suffer from another major problem: the lack of a strictly held-out test partition. For all their skin condition prediction experiments, the authors only used a training and a validation set, and used the terms “validation” and “testing” interchangeably in the paper. This can also be verified in their accompanying code implementation, where the data partitions used to select the best epoch during training102 (“the epoch with the lowest loss on the validation set”) and to report the final results103 are the same. This violates the fundamental rules of machine learning model training and evaluation, where the validation and the testing partitions must be separate disjoint sets, and the former is used for choosing the best performing model during training and hyperparameter selection, while the latter is reserved only for the final model evaluation and is never used during training.

Correcting Fitzpatrick17k

In light of the numerous aforementioned issues with Fitzpatrick17k, namely data duplication, conflicting labels, the presence of erroneous images, and the absence of a well-defined test partition, we attempt to clean up Fitzpatrick17k and present a smaller, yet more reliable, dataset. Specifically, we remove clusters of duplicates (including duplicate pairs), keeping one image from each cluster if there are no conflicting diagnosis or FST labels within the cluster (i.e., a “homogeneous cluster”). Next, we remove the erroneous images from the dataset and refer to this “cleaned” version of Fitzpatrick17k as Fitzpatrick17k-C.

In the absence of standardized dataset partitions, researchers who used Fitzpatrick17k for their models had to resort to generating their own splits74,79,80, making it very hard to compare models across papers. To resolve this, we present standardized training, validation, and testing partitions of Fitzpatrick17k-C for the skin image analysis community, obtained by splitting Fitzpatrick17k-C in the ratio of 70:10:20, stratified on the diagnosis labels. Table 4 summarizes the two datasets, Fitzpatrick17k and Fitzpatrick17k-C, listing the number of images in their respective partitions.
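A minimal sketch of the split, assuming fitz17k_c is the cleaned metadata DataFrame with a diagnosis column named label (an assumption); the fixed random seed is illustrative.

```python
from sklearn.model_selection import train_test_split

# 70% train, then split the remaining 30% into 10% validation and 20% test,
# stratifying on the diagnosis label at each step.
train_df, rest_df = train_test_split(
    fitz17k_c, test_size=0.30, stratify=fitz17k_c["label"], random_state=42
)
valid_df, test_df = train_test_split(
    rest_df, test_size=2 / 3, stratify=rest_df["label"], random_state=42
)
```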

Finally, we also provide benchmarks for Fitzpatrick17k-C using all the different experimental settings proposed by Groh et al.20 in Table 2. We perform a hyperparameter search for each experimental setting over the space of optimizers ({Adam, SGD}), learning rate ({1e − 2, 1e − 3, 1e − 4}), and number of training epochs ({20, 50, 100, 200}), and list the number of images in the training, validation, and testing partitions. For added robustness, we repeat each experiment using 3 random seeds. We also observed that using one setting’s optimal hyperparameter choices to evaluate another setting’s test partition does not considerably degrade the classification performance (Table 3).
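The search itself is a plain grid over the stated choices; a minimal sketch follows, in which the train_and_evaluate() helper is hypothetical.

```python
import itertools

# Grid over optimizer, learning rate, and number of epochs; each configuration
# is repeated with 3 random seeds for robustness.
grid = itertools.product(["adam", "sgd"], [1e-2, 1e-3, 1e-4], [20, 50, 100, 200])
results = []
for optimizer_name, lr, epochs in grid:
    for seed in (0, 1, 2):
        results.append(train_and_evaluate(optimizer_name, lr, epochs, seed=seed))
```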

Table 2 Benchmark results (3 repeated runs; mean  ± std. dev.) of Fitzpatrick17k-C for all the experiments originally reported by Groh et al.20.
Table 3 Understanding how Fitzpatrick17k-C classification performance varies with change in hyperparameters.

Discussion

In this paper, we examine the data quality of three popular and large skin image analysis datasets: DermaMNIST from the MedMNIST collection, its source HAM10000 dataset (10,015 dermoscopic images of skin lesions), and Fitzpatrick17k (16,577 clinical images of skin diseases). For DermaMNIST, we investigate the extent of data leakage across its training, validation, and testing partitions, and propose corrected (DermaMNIST-C) and extended (DermaMNIST-E) versions. We conduct benchmark evaluations using multiple methods and compare the results to those of DermaMNIST across all datasets. For Fitzpatrick17k, we perform a systematic analysis encompassing data duplication, mislabeling of diagnosis and Fitzpatrick skin tone labels, and identification of erroneous images, and we highlight the use of non-standard data partitions. Finally, we propose a cleaned version of the dataset with standardized partitions, called Fitzpatrick17k-C, and release the corresponding updated benchmarks.

Table 4 Summary statistics for the two datasets analyzed in this paper and their corresponding corrected versions proposed.

DermaMNIST

Data leakage and benchmarks

The extent of data leakage in DermaMNIST emanating from improper data partitioning is quite severe, with 1,006 of the 7,470 unique lesions (~13.47%) in the dataset being present in more than 1 partition (Fig. 1(c)). Patient-level stratification, or in this case, lesion-level stratification, is crucial to ensure that the model does not “see” the lesions in the held-out test set while training. Consequently, the lack of such a stratification implies that the corresponding benchmark classification results are not truly reflective of the models’ generalization capability and, therefore, we strongly suspect that the models’ performance could be inflated. On the other hand, since most works use the entire HAM10000 dataset for training instead of splitting it like in the case of DermaMNIST, undetected duplicates in HAM10000 are arguably a less severe issue.

However, a word of caution on comparing the benchmark results of DermaMNIST and DermaMNIST-C (Table 1): the two datasets do not share the same test partitions. Specifically, DermaMNIST-C’s test set was obtained from DermaMNIST’s test set by removing all images whose lesion IDs were also present in the training partition, and therefore, the former is a subset of the latter (i.e., DermaMNIST-C test set ⊂ DermaMNIST test set). Similarly, as shown in Fig. 1(c), the training set of DermaMNIST-C is larger than that of DermaMNIST. For these reasons, contrary to the assumption that fixing data leakage in DermaMNIST should decrease performance benchmarks, we emphasize that performance benchmarks for DermaMNIST and DermaMNIST-C should not be compared, since these models have been trained and evaluated on dissimilar partitions.

A more challenging dataset

While HAM10000 is indeed a valuable dataset for skin image analysis research, models trained on HAM10000 do not necessarily perform well when evaluated on other skin lesion image datasets104. This was also observed by the organizers of the ISIC 2018 Challenge65 where they relied on external testing data for ranking the submissions ("multipartition test sets containing data not reflected in training dataset are an effective way to differentiate the ability of algorithms to generalize”). Therefore, following the ISIC 2018 Challenge, we use the images from the Challenge’s validation and testing partitions, resized to 28 × 28 and 224 × 224, to create the validation and testing partitions, respectively, of the newly proposed DermaMNIST-E dataset, thus allowing for a more robust assessment of the skin lesion diagnosis models trained on this dataset.

Incorrect scaling of images

Another issue with the DermaMNIST benchmarks was the use of upsampled 28 × 28 images to report results at the 224 × 224 resolution. As we show in Fig. 2, the information lost when downsampling from 600 × 450 to 28 × 28 is quite significant, and it is impossible to recover it when upsampling from 28 × 28 to 224 × 224. This can also be observed in the quantitative results (Table 1): intuitively, provided the models do not overfit, we would expect a model with larger capacity (ResNet-50) trained on higher-resolution images (224 × 224) to perform better than a lower-capacity model (ResNet-18) trained on lower-resolution images (28 × 28). However, this is not the case with the DermaMNIST results, where the ResNet-18/28 × 28 models perform better than the ResNet-50/224 × 224 models ([AUC; ACC]: [0.917; 0.735] versus [0.912; 0.731]). On the other hand, with DermaMNIST-C and DermaMNIST-E, the ResNet-50/224 × 224 models do perform better than their ResNet-18/28 × 28 counterparts.

Fitzpatrick17k

Duplicate detection and guarantees

Because of the large scale of the Fitzpatrick17k dataset (16,577 images), manual review of all images to verify duplicates is virtually impossible due to the huge combinatorial space: there are \(\left(\begin{array}{c}16,577\\ 2\end{array}\right)\approx 137\) million pairs of images. Even worse, if we wanted to verify triplicates (i.e., 3 images that are copies of one another), we would have to review \(\left(\begin{array}{c}16,577\\ 3\end{array}\right)\approx 759\) billion clusters of 3 images, and the number keeps growing as the size of the clusters being reviewed increases. Therefore, we rely on automated methods for duplicate detection, followed by a manual review of duplicates above a reasonable similarity threshold, and our manual review confirmed a near-perfect agreement with the algorithm’s results. Our experiments showed that a second duplicate detection method (cleanvision) was able to discover an additional 19 pairs of duplicates. These pairs were also detected by fastdup, but their similarity scores fell just short of the chosen thresholds of 0.90 and 0.95, and they were therefore absent from our manual review of fastdup’s results. Our dataset cleaning pipeline, publicly available on GitHub105, is highly modular and configurable, allowing users to: adjust the similarity thresholds used to exclude duplicates, use multiple duplicate lists (fastdup and cleanvision) for cleaning, decide whether to remove duplicate clusters altogether or retain one representative image from each cluster, decide whether to remove images with unknown FST labels, and decide which outliers to exclude based on a similarity threshold. While we can claim with a high degree of certainty that the new cleaned dataset Fitzpatrick17k-C is devoid of duplicates, the large scale of the manual review required makes it nearly impossible to guarantee.

Dealing with conflicting diagnosis and FST labels

Several duplicates in Fitzpatrick17k, despite being near identical copies, do not share the same diagnosis labels. This affects the training and evaluation of models, since a model can be incorrectly penalized for its prediction because of the conflicting labels. However, correcting these labels, so that copies of the same image have the same diagnosis label, requires a domain expert (i.e., a dermatologist) to go through all the images and confirm and correct their labels. Unfortunately, even if such an endeavor were to be undertaken, these diagnoses will not be histopathology-confirmed and the accuracy of diagnoses confirmed through images alone (a scenario similar to “store-and-forward” teledermatology) is expected to be lower than those confirmed through in-person patient visits106,107,108.

Another approach to resolving the diagnosis label conflicts could be to map the diagnoses in Fitzpatrick17k to the World Health Organization (WHO) International Classification of Diseases, Eleventh Revision (ICD-11)109; if multiple diagnoses belong to the same “parent”, their label conflict can be resolved by assigning the “parent”’s label to both. However, when mapping the diagnoses in Fitzpatrick17k to ICD-11 using the ICD-11 Browser110, we ran into the following issues:

  • Several diagnosis labels did not yield any matches (e.g., “acquired autoimmune bullous diseaseherpes gestationis”, “nematode infection”, “neurotic excoriations”, “pediculosis lids”).

  • There were diagnosis labels where one of the labels has an entry in ICD-11, but other seemingly related labels do not (e.g., “basal cell carcinoma” exists in ICD-11, but possibly related “basal cell carcinoma morpheiform” and “solid cystic basal cell carcinoma” do not).

  • There were labels for which we found near but not exact matches (e.g., the label “erythema annulare centrifigum” in Fitzpatrick17k does not have an exact match, but ICD-11 contains “Erythema annulare”; other partially matching [Fitzpatrick17k; ICD-11] labels include [“hidradenitis”; “Hidradenitis suppurativa”], [“fixed eruptions”; “Fixed drug eruption”], [“lupus subacute”; “Subacute cutaneous lupus erythematosus”], [“porokeratosis actinic”; “Disseminated superficial porokeratosis actinic”], etc.).

Overall, 8 diagnosis labels had no matches in ICD-11. An additional 29 diagnoses yielded partial matches to entries in ICD-11, and therefore could not be reliably mapped to a single entry. For these diagnoses, i.e., those that had partial or no matches in ICD-11, we expanded our search to the ICD-11 Coding Tool Mortality and Morbidity Statistics (MMS)111 and the ICD-11 Classification of Dermatological Diseases112,113, but this did not resolve any issues.

Additionally, we also carried out a similar diagnosis lookup on the Systematized Nomenclature of Medicine - Clinical Terms (SNOMED-CT), a comprehensive global clinical healthcare terminology that may represent medical vocabulary better than ICD-11114. Although we were able to find more matches to SNOMED-CT Identifiers115 (SCTIDs) than to entries in ICD-11 and its derivatives, the mapping was still not perfect: we ended up with 17 and 3 diagnosis labels with partial and no matches, respectively. We make the results of our Fitzpatrick17k diagnosis-to-{ICD-11 classification code, SCTID} mapping publicly available online.

As a recommendation for the future datasets containing clinical images of skin diseases, we believe that adding either ICD classification codes or SNOMED-CT Identifiers (SCTIDs) for the disease labels would be a helpful addition to the metadata, and would greatly enhance the usability of such datasets for hierarchical diagnosis methods7,8,9.

Similar conflicts were observed for the FST labels, where images with very high similarity varied in their FST labels, sometimes by as much as 4 tones (Fig. 9). A potential solution would be to re-assess the images with conflicting FST labels. This would involve either obtaining manually annotated and verified healthy skin segmentation masks or using an automated healthy skin segmentation method, followed by mapping the skin tone estimated from these healthy skin pixels to an FST label. However, as noted by Groh et al.20, collecting manual segmentation masks is expensive, and automated skin segmentation approaches suffer from their own set of challenges116,117, including but not limited to susceptibility to non-standardized illumination, the presence of multiple objects, and/or low-quality images (we direct the readers to our arXiv pre-print100, which contains several more visualizations that could not be included in the publisher’s version of the paper). The presence of diffuse pathologies such as “acne”, “acne vulgaris”, and “disseminated actinic porokeratosis”, where the diseased regions of the skin are spread over a wide area and the boundaries between diseased and healthy regions are not well defined, further compounds the task for any automated healthy skin detection approach.

Because of these intractable issues, we decided to remove duplicate images that have conflicting diagnosis or FST labels. While we realize that relabeling might resolve these label conflicts, we also acknowledge the enormous time, effort, and expense such an endeavor would entail.

Updated benchmarks

As discussed above, Groh et al.20 used non-standardized dataset partitions for training and evaluating their models for benchmarking, the most notable issue being the absence of a well-defined, held-out test set, thus violating one of the most basic rules of machine learning. Using the same dataset partition for validation (i.e., choosing the best-performing models) and for testing ensures that the reported performance is the best possible, but it is a poor reflection of the models’ generalization capabilities, since the models have been overfit to the testing data. This issue, coupled with the presence of duplicates in Fitzpatrick17k, which would inevitably lead to data leakage across partitions, is the reason for the difference between the Fitzpatrick17k model benchmarks by Groh et al. and our Fitzpatrick17k-C benchmarks (Table 2). For the “Verified” experiments (i.e., “testing on the subset of images labeled by a board-certified dermatologist as diagnostic of the labeled condition and training on the rest of the data”20), the number of training images decreased from 16,229 in Fitzpatrick17k to 10,060 in Fitzpatrick17k-C, meaning the models were trained on considerably fewer images. On the other hand, the number of testing images decreased by ~38% (from 348 in Fitzpatrick17k to 215 in Fitzpatrick17k-C) after dataset cleaning, implying that 133 images in Fitzpatrick17k’s testing partition were duplicates, which explains why the original benchmark results were inflated. Similar trends can be observed across all the benchmark experiments: the number of training images for the “Source A”, “Source B”, and “Fitz 1-2 & 5-6” experiments nearly halved after dataset cleaning, and the number of both training and testing images for the “Fitz 1-4” experiments decreased by more than 50%.

We believe that the newly proposed Fitzpatrick17k-C along with the well-defined and disjoint training, validation, and testing partitions, the benchmarks, and the publicly available training and evaluation code will help other researchers better utilize the dataset and make comparison across methods easier and standardized.

Comparison to other works

One of the first works analyzing skin image analysis datasets, by Abhishek34, focused on the ISIC Challenge datasets for the skin lesion segmentation task, specifically the challenges from 2016, 2017, and 2018. They found these datasets to have considerable overlap across their training partitions, which can be attributed to all of them being subsets of the ISIC Archive118. Their analysis, however, was limited to overlap detection based simply on filenames, since all filenames in the ISIC Archive follow the template ISIC_{ISIC identifier}.jpg. Specific to Fitzpatrick17k, the work by Pakzad et al.80 was the first to mention the presence of “erroneous and wrongly labeled images” and the non-standard illumination and camera perspectives of the images therein, highlighting the need for cleaning Fitzpatrick17k. Groger et al.37 conducted a data quality analysis of the presence of “irrelevant samples”, near duplicates, and label errors in 6 datasets. They used a self-supervised DL-based method to generate rankings of images potentially containing data quality issues, following which 3 experts, including a board-certified dermatologist, manually reviewed the images and answered a questionnaire. The authors also noted that non-expert annotations may be sufficient for confirming duplicate images. Vega et al.36 analyzed a popular and well-cited monkeypox skin image dataset and found it to contain “medically irrelevant images”. They discovered that the images were extracted from online repositories through web scraping and lacked medical validation. Finally, their experiments showed that the claims made by the dataset’s authors about the utility of the images might not have been true, since a model trained by Vega et al. on “blinded” images (i.e., images where the regions of interest related to the disease were covered by black rectangles) was still able to accurately classify the diseases.

The closest dataset analysis to our work, both in the scale of the analyses and the sizes of the datasets analyzed, is the work by Cassidy et al.35 on the ISIC Challenge datasets from 2016 through 2020, where the authors employed a multi-step duplicate removal strategy. First, similar to our previous work34, they removed duplicates across dataset partitions based on filenames. Next, they used the following tools and methods on “a random selection of training images for 72 hours”: (a) a Python library called ImageHash to measure similarity based on several hashing methods (average, perceptual, difference, and wavelet), (b) mean-squared error (MSE)-based image similarity detection, (c) structural similarity index measure (SSIM)-based image similarity detection, and (d) cosine similarity-based detection. Finally, they used a Python library called FSlint to detect duplicates based on MD5 and SHA-1 file checksum signatures.

The analyses of the three datasets in this paper (DermaMNIST, HAM10000, and Fitzpatrick17k) arguably go beyond those presented in these previous works. For DermaMNIST, since we were able to request the exact training-validation-testing splits’ filenames, we could simply cross-reference them against the publicly available HAM10000 metadata to detect and correct data leakage. Our analysis of Fitzpatrick17k is much more in-depth and involved: we start by finding similar images and erroneous images in the embeddings’ latent space, follow with a manual non-expert review to confirm the duplicates, and then coalesce duplicates to form larger clusters wherever applicable. The filenames in Fitzpatrick17k follow the template <MD5_hash>.jpg, meaning that all the images have unique MD5 checksums, which in turn implies that file checksum-based duplicate detection is not possible. The Python library cleanvision that we use for duplicate detection already relies upon ImageHash for hashing-based similarity measurement. Measuring similarity scores in the embedding space is arguably superior to doing so in the input space (e.g., the MSE- and SSIM-based duplication checks by Cassidy et al.35), since a latent representation goes beyond pixel-level information and better captures semantic information about the image as a whole. Finally, because of the large scale of Fitzpatrick17k (16,577 images), manual review by experts becomes cost prohibitive.

Data leakage and reproducibility crisis

Kapoor et al.25 presented an in-depth review of instances of data leakage in machine learning (ML) applications across 17 scientific domains, and how such leakage can lead to a reproducibility crisis in ML-based science. They also proposed a hierarchical taxonomy of leakage types, which has 8 types of leakage across 2 levels. Our analysis of DermaMNIST shows that it suffers from “[L1.4]: Duplicates in datasets” and “[L3.2]: Non-independence between train and test samples”, where the latter is a direct consequence of the former. Concerningly, Fitzpatrick17k exhibited three types of data leakage: “[L1.1]: No test set”, since it did not have a disjoint held-out test set and had duplicate images, along with the aforementioned [L1.4] and [L3.2].

In conclusion, in this paper, we examined the data quality of three popular skin image analysis datasets: DermaMNIST from the MedMNIST collection, its source HAM10000, and Fitzpatrick17k. For DermaMNIST, we investigated the extent of data leakage across its partitions and proposed two new and improved versions: a corrected dataset without any leakage (DermaMNIST-C) and an extended and arguably more challenging dataset (DermaMNIST-E), which is almost the same as the ISIC 2018 Challenge dataset except that the images are resized to 28 × 28 and 224 × 224 and the “easter egg” image is removed from the test partition. We also investigated the presence of duplicates in HAM10000, the source dataset for DermaMNIST, and discovered 18 new duplicate image pairs that were unaccounted for in the metadata. For Fitzpatrick17k, we conducted a systematic analysis encompassing data duplication, mislabeling of diagnosis and skin tone labels, and the identification of outlier images, followed by cleaning the dataset to propose a cleaned version, Fitzpatrick17k-C, with standardized training, validation, and testing partitions. We also showed how Fitzpatrick17k contains diagnosis labels that do not fully align with internationally recognized standards such as ICD-11 and SNOMED-CT, as we recommend they should, limiting its utility for hierarchical and differential diagnosis-based approaches. For all the datasets, we conducted benchmark evaluations using multiple methods, repeated 3 times for robustness. The primary objective of this paper is to raise awareness about potential data quality issues that may arise in large datasets, and how these issues can go unnoticed even in popular datasets, casting doubts on the conclusions drawn about the robustness and the generalizability of the models trained on them. We hope this can serve as a call to action for more stringent data quality assessments. To facilitate this, we plan to make our evaluation method and code publicly available so that they can be used and extended for evaluating more existing and new datasets.

Methods

DermaMNIST

Detecting and correcting data leakage

We used HAM10000’s publicly available metadata, which contains image ID to lesion ID mappings, and the DermaMNIST partitions’ filenames, which contain image IDs, to perform an inner join using the Python library pandas. By doing so, we obtained the lists of images belonging to the same lesion that were spread across the train-valid, train-test, valid-test, and train-valid-test partition combinations.
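A minimal sketch of this inner join is shown below; the metadata file name (HAM10000_metadata.csv) and the split_files mapping that stands in for the DermaMNIST partition filenames are illustrative assumptions rather than the exact script used.

```python
# Sketch: detect lesions whose images span multiple DermaMNIST partitions.
import pandas as pd

# HAM10000 metadata: one row per image, with its parent lesion ID
# (columns include "lesion_id" and "image_id").
meta = pd.read_csv("HAM10000_metadata.csv")  # assumed file name

# Hypothetical per-partition image IDs recovered from the DermaMNIST split filenames;
# the example IDs correspond to lesion HAM_0002364 discussed below.
split_files = {
    "train": ["ISIC_0024712", "ISIC_0025446", "ISIC_0030348"],
    "valid": ["ISIC_0032042"],
    "test":  ["ISIC_0029838"],
}
splits = pd.DataFrame(
    [(img, part) for part, imgs in split_files.items() for img in imgs],
    columns=["image_id", "partition"],
)

# Inner-join image IDs to lesion IDs, then flag lesions spanning more than one partition.
joined = splits.merge(meta[["image_id", "lesion_id"]], on="image_id", how="inner")
spanning = joined.groupby("lesion_id")["partition"].nunique()
leaked_lesions = spanning[spanning > 1].index
print(joined[joined["lesion_id"].isin(leaked_lesions)].sort_values("lesion_id"))
```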

To correct the leakage, we follow this approach: if any image of a lesion ID exists in the training partition, we move all images belonging to that lesion ID from the validation and testing partitions to the training partition. For instance, for the images visualized in Fig. 1(a), since images belonging to lesion ID HAM_0002364, i.e., ISIC_0024712, ISIC_0025446, and ISIC_0030348, are present in the training partition, we move the other images of the same lesion, i.e., ISIC_0029838 and ISIC_0032042, from the testing and validation partitions, respectively, to the training partition.
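The sketch below illustrates this correction rule on the joined table from the previous sketch; it covers only the train-centric rule described above and is not the exact code used.

```python
# Sketch: if any image of a lesion is already in the training partition,
# reassign every image of that lesion to training.
def correct_leakage(joined):
    corrected = joined.copy()
    # Lesions that already have at least one image in the training partition.
    train_lesions = set(corrected.loc[corrected["partition"] == "train", "lesion_id"])
    # Move all validation/test images of those lesions into the training partition.
    corrected.loc[corrected["lesion_id"].isin(train_lesions), "partition"] = "train"
    return corrected

corrected = correct_leakage(joined)
```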

For detecting duplicates in HAM10000 based on image embedding similarity, we used fastdup to extract a 960-dimensional embedding for all the 10,015 images. The similarities between all \(\binom{10,015}{2}\) pairs of images, denoted by \({\mathscr{S}}\), were calculated using the cosine similarity, where \(0\le {\mathscr{S}}\le 1\) and higher scores correspond to greater similarity. It should be noted that using cosine similarity in a lower-dimensional embedding space yields more accurate matches compared to using it in the high-dimensional image space, where it results in “only false positive results”, as noted by Cassidy et al.35.
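A minimal sketch of this pairwise similarity computation is shown below; it assumes the 960-dimensional fastdup embeddings have already been exported to a NumPy array (the export step and the file name are illustrative), and the threshold value is an example rather than the one used in the paper.

```python
# Sketch: pairwise cosine similarity between precomputed image embeddings.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

emb = np.load("ham10000_fastdup_embeddings.npy")  # hypothetical export, shape (10015, 960)
S = cosine_similarity(emb)                        # (10015, 10015) similarity matrix

# Keep each unordered pair once (upper triangle, excluding the diagonal)
# and flag candidate duplicate pairs above an illustrative threshold.
i, j = np.triu_indices(len(emb), k=1)
threshold = 0.99                                  # example value only
mask = S[i, j] > threshold
candidates = list(zip(i[mask], j[mask], S[i, j][mask]))
```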

Image resizing

The 28 × 28 images in DermaMNIST-C and DermaMNIST-E were obtained by resizing the original high-resolution skin lesion images from HAM10000 and other sources using bicubic interpolation. For the 224 × 224 resolution, unlike DermaMNIST, which first resized the images from the original resolution to 28 × 28 using bicubic interpolation and then resized them to 224 × 224 using nearest-neighbor interpolation, the images in DermaMNIST-C and DermaMNIST-E were obtained by directly resizing the original-resolution images to 224 × 224 using bicubic interpolation. All image resizing operations were performed using the Python library PIL.
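A minimal sketch of the direct bicubic resizing is shown below; the file names are illustrative and not part of the released code.

```python
# Sketch: resize the original high-resolution image directly to each target size
# using bicubic interpolation (no intermediate 28x28 step).
from PIL import Image

img = Image.open("ISIC_0024712.jpg")                            # illustrative file name
img_28 = img.resize((28, 28), resample=Image.Resampling.BICUBIC)
img_224 = img.resize((224, 224), resample=Image.Resampling.BICUBIC)
img_28.save("ISIC_0024712_28.png")
img_224.save("ISIC_0024712_224.png")
```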

Model training and evaluation

We used the official MedMNIST training and evaluation code119 by Yang et al.38,39. Both model architectures, ResNet-18 and ResNet-50, were trained for 100 epochs with the cross-entropy loss and the Adam120 optimizer with a batch size of 128. An initial learning rate of 0.001 was used, and a learning rate scheduler reduced it by a factor of 10 after the 50th and the 75th epochs. Over the training epochs, the model with the best area under the ROC curve (AUC) on the validation partition was used for testing. The reported evaluation metrics were the AUC and the overall classification accuracy (ACC). All models were trained and evaluated using PyTorch121.
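A minimal sketch of the optimizer and scheduler configuration described above is shown below; it uses torchvision’s ResNet-18 for illustration, elides the training and validation loops, and is not the official MedMNIST code.

```python
# Sketch: Adam with initial LR 1e-3, decayed by 10x after epochs 50 and 75.
import torch
import torch.nn as nn
from torchvision.models import resnet18

model = resnet18(num_classes=7)          # 7 diagnostic classes, as in DermaMNIST/HAM10000
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[50, 75], gamma=0.1)

for epoch in range(100):
    # ... one pass over the training loader (forward, criterion, backward, optimizer.step()),
    # followed by validation to track the best AUC ...
    scheduler.step()
```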

Fitzpatrick17k

Duplicate detection and manual verification

The images in Fitzpatrick17k were downloaded using the image URLs in the dataset’s metadata. For all the 16,577 images in Fitzpatrick17k, a 960-dimensional embedding was extracted using fastdup. All subsequent analyses (duplicate pair detection, duplicate cluster detection, outlier detection) used these embeddings. The similarities between all \(\binom{16,577}{2}\) possible pairs of images, denoted by \({\mathscr{S}}\), were calculated using the cosine similarity.

The original filenames in Fitzpatrick17k follow the format {MD5_hash}.jpg, which is arguably not helpful in interpreting either the diagnosis or the FST label without looking up the metadata. For an intuitive understanding of the images, we renamed them from their original filenames to a more interpretable format {diag. abbrv.}_f{FST label}_{image index}_{truncated MD5 hash}.jpg. Images with missing FST labels are assigned FST 0. For example, a file originally called 0a94359e7eaacd7178e06b2823777789.jpg is renamed to ps_f1_0_0a94359e.jpg, which can be interpreted as: this is the first image (index 0) of “psoriasis” (ps), it has FST 1 (f1), and its truncated MD5 hash is 0a94359e. We chose to include a truncated MD5 hash of length 8 to make it easier to map new filenames to the original ones while still avoiding hash collisions. A version of Fitzpatrick17k with renamed files, along with the metadata containing all the old and new filenames and the diagnosis labels’ abbreviations, is available on Zenodo40.
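A minimal sketch of this renaming scheme is shown below; the metadata column names (md5hash, fitzpatrick, label), the handling of missing FST values, and the abbreviation excerpt are assumptions for illustration, with the full mapping available on Zenodo40.

```python
# Sketch: build interpretable filenames of the form {diag}_f{FST}_{index}_{hash8}.jpg.
import pandas as pd

meta = pd.read_csv("fitzpatrick17k.csv")                     # assumed metadata file
abbrev = {"psoriasis": "ps", "rhinophyma": "rh", "rosacea": "ro"}  # illustrative subset

# Missing/unknown FST labels (NaN or -1 in the metadata, an assumption here) map to 0.
meta["fst"] = meta["fitzpatrick"].fillna(0).clip(lower=0).astype(int)
meta["idx"] = meta.groupby("label").cumcount()               # per-diagnosis image index
meta["new_name"] = (
    meta["label"].map(abbrev).fillna(meta["label"])
    + "_f" + meta["fst"].astype(str)
    + "_" + meta["idx"].astype(str)
    + "_" + meta["md5hash"].str[:8]
    + ".jpg"
)
# e.g., 0a94359e7eaacd7178e06b2823777789.jpg -> ps_f1_0_0a94359e.jpg
```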

To detect duplicates, all image pairs’ similarity scores were filtered based on the threshold(s), resulting in 6,622 unique image pairs with cosine similarity over 0.90 (i.e., \(|\{{\mathscr{S}}_{0.90}\}|\)) and 1,425 unique image pairs with cosine similarity over 0.95 (i.e., \(|\{{\mathscr{S}}_{0.95}\}|\)). For the manual review of duplicates, a GUI was created using the Python library tkinter that displayed the candidate duplicate image pairs detected by fastdup, along with three clickable buttons: “Duplicate”, “Unclear”, and “Different”. Two annotators were tasked with independently reviewing the 1,425 duplicate image pairs each, and their responses were recorded in a CSV file.
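A minimal sketch of such a review tool is shown below; the candidate pair list and output file name are illustrative, and this is not the exact GUI used.

```python
# Sketch: show each candidate pair side by side and record one of three answers to a CSV.
import csv
import tkinter as tk
from PIL import Image, ImageTk

pairs = [("a.jpg", "b.jpg"), ("c.jpg", "d.jpg")]      # hypothetical candidate duplicate pairs
results, current = [], [0]

root = tk.Tk()
panel_left, panel_right = tk.Label(root), tk.Label(root)
panel_left.grid(row=0, column=0)
panel_right.grid(row=0, column=1)

def show_pair():
    left, right = pairs[current[0]]
    for panel, path in ((panel_left, left), (panel_right, right)):
        photo = ImageTk.PhotoImage(Image.open(path).resize((256, 256)))
        panel.configure(image=photo)
        panel.image = photo                            # keep a reference alive

def record(answer):
    results.append((*pairs[current[0]], answer))
    current[0] += 1
    if current[0] < len(pairs):
        show_pair()
    else:
        with open("review.csv", "w", newline="") as f:
            csv.writer(f).writerows(results)
        root.destroy()

for col, answer in enumerate(("Duplicate", "Unclear", "Different")):
    tk.Button(root, text=answer, command=lambda a=answer: record(a)).grid(row=1, column=col)

show_pair()
root.mainloop()
```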

For using cleanvision to detect duplicates, we filtered the image quality report for “near_duplicates” and “exact_duplicates”, which yielded 100 and 10 unique image pairs, respectively. After comparing these pairs against those detected and verified using fastdup, we found 19 unique duplicate pairs that fastdup had not detected.
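A minimal sketch of this cleanvision pass is shown below; it assumes that the Imagelab interface exposes duplicate groups via imagelab.info[...]["sets"], and the exact attribute names may differ across cleanvision versions.

```python
# Sketch: run cleanvision and flatten its duplicate groups into unordered pairs
# so they can be compared against the fastdup-detected (and verified) pairs.
from itertools import combinations
from cleanvision import Imagelab

imagelab = Imagelab(data_path="fitzpatrick17k/images")   # hypothetical image folder
imagelab.find_issues()

near_sets = imagelab.info["near_duplicates"]["sets"]     # groups of near-duplicate files (assumed key)
exact_sets = imagelab.info["exact_duplicates"]["sets"]   # groups of byte-identical files (assumed key)

near_pairs = {frozenset(p) for s in near_sets for p in combinations(s, 2)}
exact_pairs = {frozenset(p) for s in exact_sets for p in combinations(s, 2)}
```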

Interpreting fastdup duplicates visualization

Next, we describe how to interpret the table of duplicates in Fitzpatrick17k detected using fastdup, which is available on Zenodo40. Consider the 21st row of the table (please see the arXiv pre-print100 version of our paper for this visualization and several others that have been omitted from this publisher’s version):

  • /rh_91/rh_f4_32_b6349594.jpg (Image 1): Image of “rhinophyma” (rh) with FST 4. This is image #33 of the 91 rhinophyma images.

  • /ro_102/ro_f2_90_0d586f26.jpg (Image 2): Image of “rosacea” (ro) with FST 2. This is image #91 of the 102 rosacea images.

  • “Blended” Image: Output of α-blending of the two images with α = 0.5.

  • Similarity: The cosine similarity between the embeddings of the two images is 0.991658.

The “distance = 0.99” watermark on Image 1 is a misnomer: despite being labeled “distance” (a fastdup default), it is in fact reporting the cosine similarity value, as expected.

Erroneous image detection

Using the image embeddings calculated for all the 16,577 images, we adopt the following approach for detecting erroneous images: for each image \(x_i\), we calculate its N nearest neighbors \(\{x_{i1},\ldots ,x_{iN}\}\) in the embedding space and their corresponding similarities \(\{{\mathscr{S}}_{i,i1},\ldots ,{\mathscr{S}}_{i,iN}\}\). An outlier would be dissimilar to the other skin images in the dataset, and would therefore have low similarity scores with its nearest neighbors. To list all possible outliers in Fitzpatrick17k, we choose N = 5 and prepare tuples \(\left(x_i,\min \{{\mathscr{S}}_{i,i1},\ldots ,{\mathscr{S}}_{i,i5}\}\right)\), where \(\min \{{\mathscr{S}}_{i,i1},\ldots ,{\mathscr{S}}_{i,i5}\}\), called the outlier score, is inversely related to the likelihood of an image being erroneous. These tuples are then sorted in ascending order of their outlier score, i.e., the image with the lowest score, and therefore the most likely to be erroneous, is listed first; the sorted list is displayed on Zenodo40.
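A minimal sketch of this outlier scoring is shown below, assuming the Fitzpatrick17k fastdup embeddings are available as a NumPy array (the file name is illustrative).

```python
# Sketch: outlier score = minimum cosine similarity to the N = 5 nearest neighbours
# (lower score = more likely to be an erroneous image).
import numpy as np
from sklearn.neighbors import NearestNeighbors

emb = np.load("fitzpatrick17k_fastdup_embeddings.npy")   # hypothetical export, shape (16577, 960)
N = 5

nn = NearestNeighbors(n_neighbors=N + 1, metric="cosine").fit(emb)
distances, _ = nn.kneighbors(emb)              # cosine distance = 1 - cosine similarity
similarities = 1.0 - distances[:, 1:]          # drop the self-match in column 0
outlier_score = similarities.min(axis=1)       # min similarity over the 5 neighbours

ranking = np.argsort(outlier_score)            # most likely erroneous images listed first
```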

Correcting Fitzpatrick17k

The Fitzpatrick17k cleaning pipeline consists of the following steps:

  1. Similarity score-based filtering: We specify an image similarity threshold of 0.99, meaning that any image whose maximum similarity score to any other image in the dataset exceeds this threshold is removed. This removes the near-exact duplicates.

  2. Processing duplicates: We then process the duplicate pairs and clusters detected by fastdup and cleanvision and merge them into larger clusters, where applicable, using the union-find algorithm122 (see the sketch following this list). For the final list of clusters, we check whether each cluster is “homogeneous”, i.e., whether all the duplicate images in the cluster share the same diagnosis and FST labels. If a cluster is not homogeneous, we remove all of its images from Fitzpatrick17k; if it is, we retain only the image with the largest spatial resolution and remove the rest.

  3. Removing erroneous images: Finally, we remove the erroneous images detected in Fitzpatrick17k.
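A minimal sketch of merging duplicate pairs into clusters with union-find is shown below; the pair list is illustrative, standing in for the fastdup and cleanvision outputs.

```python
# Sketch: union-find over duplicate pairs to obtain duplicate clusters.
def find(parent, x):
    # Path-halving find: follow parents up to the root representative.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(parent, a, b):
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[rb] = ra

pairs = [("a.jpg", "b.jpg"), ("b.jpg", "c.jpg"), ("d.jpg", "e.jpg")]   # illustrative pairs
parent = {}
for a, b in pairs:
    parent.setdefault(a, a)
    parent.setdefault(b, b)
    union(parent, a, b)

clusters = {}
for img in parent:
    clusters.setdefault(find(parent, img), set()).add(img)
print(list(clusters.values()))   # [{'a.jpg', 'b.jpg', 'c.jpg'}, {'d.jpg', 'e.jpg'}]
```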

While we do not do this in Fitzpatrick17k-C, the data cleaning code also allows the users to remove images from Fitzpatrick17k that have missing FST labels.

Model training and evaluation

We use the official Fitzpatrick17k training and evaluation code123 by Groh et al.20 with some modifications. First, we use separate partitions for validation (i.e., for picking the best-performing model across training epochs based on the highest validation accuracy) and for testing (i.e., for reporting the final Fitzpatrick17k-C benchmarks in Table 2), and the code is modified accordingly. Next, since the datasets for the seven experimental settings (i.e., “Verified”, “Random”, “Source A”, “Source B”, “FST 3–6”, “FST 1–2 & 5–6”, and “FST 1–4”) proposed by Groh et al.20 vary considerably in the number of images across the training-validation-testing partitions (Table 2), we conduct a hyperparameter search for each setting. We vary the optimizer: {Adam, SGD}, the learning rate: {1e − 2, 1e − 3, 1e − 4}, and the number of training epochs: {20, 50, 100, 200}, and for each of the 7 experimental settings, we train 3 models with different seed values for each hyperparameter combination, effectively training 2 × 3 × 4 × 7 × 3 = 504 models. For each experimental setting, the hyperparameter combination with the highest accuracy on the validation partition was used for the final testing, and the results are reported in Table 2. Finally, we used mixed-precision training through Hugging Face Accelerate124 to speed up training. For all the experiments, we use the same image transformations as those in Groh et al.’s work. The reported evaluation metrics were the overall and the FST-wise classification accuracies. All models were trained and evaluated using PyTorch121 and Accelerate124.
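A minimal sketch of this hyperparameter grid is shown below; the setting names are shorthand and train_and_validate is a hypothetical placeholder for the modified training pipeline.

```python
# Sketch: the 2 x 3 x 4 x 7 x 3 = 504-run hyperparameter grid described above.
from itertools import product

optimizers = ["adam", "sgd"]
learning_rates = [1e-2, 1e-3, 1e-4]
epoch_budgets = [20, 50, 100, 200]
settings = ["verified", "random", "source_a", "source_b",
            "fst_3_6", "fst_1_2_5_6", "fst_1_4"]          # shorthand for the 7 settings
seeds = [0, 1, 2]

runs = list(product(settings, optimizers, learning_rates, epoch_budgets, seeds))
assert len(runs) == 7 * 2 * 3 * 4 * 3                     # 504 runs in total

# for setting, opt, lr, epochs, seed in runs:
#     val_acc = train_and_validate(setting, opt, lr, epochs, seed)   # hypothetical
```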

To understand how sensitive these classification models are to the hyperparameter choices, we also evaluated models optimized on one experiment’s best hyperparameters on another experiment’s test set, and these results are presented in Table 3. The columns represent the optimal hyperparameters for each setting, and the rows represent the overall test set accuracies for all the settings when evaluated using those particular hyperparameters. We observe that varying the hyperparameters does not considerably affect the test accuracies. Additionally, the entries along the diagonal of Table 3 are the same as the overall accuracies in Table 2, since these are the test accuracies of models trained and evaluated on each particular setting’s optimal hyperparameters.

Hardware and software environments

All experiments were carried out on a workstation running Ubuntu 20.04 with an AMD Ryzen 9 5950X 16-core CPU, 32 GB RAM, and an NVIDIA RTX 3090 24 GB GPU. The following versions of the software packages were used: Python 3.10, torch 1.11.0, torchvision 0.12.0, PIL 10.0.1, fastdup 1.71, cleanvision 0.3.4, and accelerate 0.9.0.

Source datasets’ licenses

All the source datasets are associated with the Creative Commons (CC) Licenses: DermaMNIST (CC BY-NC 4.0)125, Fitzpatrick17k (CC BY-NC-SA 3.0)123, and ISIC 2018 Challenge Datasets (CC BY-NC 4.0)126.