Fig. 3
From: Investigating the Quality of DermaMNIST and Fitzpatrick17k Dermatological Image Datasets

The four scenarios to which a pair of HAM10000 images can be assigned during duplicate detection, based on the metadata and on fastdup-based duplicate detection followed by manual review. “Confirmed duplicates”, as the name suggests, are pairs of images of the same lesion, as indicated by identical lesion IDs in the metadata. Similarly, “True non-duplicates” are pairs of images that belong to different lesions. “Missed duplicates” are image pairs that have differing lesion IDs in the metadata, but whose high visual similarity (measured by the cosine similarity of their image embeddings), followed by manual review, confirms that they are indeed images of the same lesion and were therefore ‘missed’ by the metadata. Finally, “False duplicates” are pairs in which the images share the same lesion ID but do not belong to the same lesion. In our analysis, we did not find any instances of “False duplicates” in HAM10000. For all sample images, the image IDs and the lesion IDs are shown along the horizontal and the vertical axes, respectively. Images from HAM10000 are licensed under CC BY-NC 4.0 [15].
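The four scenarios form a two-by-two grid: whether the metadata lesion IDs match, and whether review concludes that the two images show the same lesion. The following is a minimal Python sketch of that grid, not the authors' implementation. The function names `cosine_similarity` and `classify_pair`, the `reviewed_same_lesion` flag, and the lesion ID strings are hypothetical and chosen only for illustration. The caption does not state a similarity threshold, so none is assumed here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_pair(lesion_id_a, lesion_id_b, reviewed_same_lesion):
    """Assign an image pair to one of the four scenarios in the caption.

    lesion_id_a, lesion_id_b: lesion IDs taken from the metadata.
    reviewed_same_lesion: hypothetical flag, True if the similarity check
        followed by manual review concludes both images show the same lesion.
    """
    same_id = lesion_id_a == lesion_id_b
    if same_id and reviewed_same_lesion:
        return "Confirmed duplicate"
    if not same_id and not reviewed_same_lesion:
        return "True non-duplicate"
    if not same_id and reviewed_same_lesion:
        # Metadata says different lesions, review says same lesion.
        return "Missed duplicate"
    # Metadata says same lesion, review says different lesions.
    return "False duplicate"
```

For example, `classify_pair("lesion_A", "lesion_B", True)` yields a "Missed duplicate", the case in which the metadata alone would fail to flag the pair.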