Table 3 General characteristics of the included studies, the datasets used, and performance when evaluating both dermoscopic and clinical images (illustrative sketches of the metrics and model patterns summarized here follow the table)
| Author | Database | Dataset | Test | I/E | Design | HP | CD | Participants | AI model | Classification | Clinicians vs AI | AI performance (%, 95% CI) | Clinicians' performance (%, 95% CI) | Augmented performance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Esteva et al.55 | Public: ISIC, Edinburgh Dermofit Library, Stanford Hospital | Dataset: 129,450; training/validation: 127,463 | Test: 1,942 (376 for comparison) | I | R | Y | N | 21 dermatologists | GoogleNet Inception v3 | Malignant vs benign vs non-neoplastic; multiclass: 9-class | Comparable | Overall Acc 72.1% ± 0.9; 9-class Acc 55.4% ± 1.7 | Acc 65.78%; 9-class Acc 54.15% | |
| Tschandl et al.11 | Institutional database from C.R., Australia (training and test); Medical University of Vienna image database from C.R. and a convenience sample of rare diagnoses (test) | Training: 7,895 dermoscopy, 5,829 close-up; validation: 340 dermoscopy, 635 close-up | Test: 2,072 (multiple sources) | I | R | Y | N | 95 participants: beginners (<3 y), intermediate (3–10 y), experts (>10 y) | CNN (combined model from the outputs of 2 CNNs): InceptionV3 architecture30 and ResNet50 network31 | Dichotomous: benign vs malignant | Comparable | Sn 80.5% (79.0–82.1); Sp 53.5% (51.7–55.3) | Sn 77.6% (74.7–80.5); Sp 51.3% (48.4–54.3); mean AUC 0.695 (0.676–0.713): beginners 0.655 (0.626–0.684), intermediate 0.690 (0.657–0.722), experts 0.741 (0.719–0.763) | |
| Haenssle et al.57 (I) | Public: ISIC 2016; institutional: Department of Dermatology, University of Heidelberg, Germany | Training/validation: not specified (ISIC) | Test: 100 | E | R | N | Y | 58 dermatologists: 17 beginners (<2 y), 11 skilled (2–5 y), 30 experts (>5 y) | Google's Inception v4 CNN architecture | Dichotomous: melanoma vs nevus; management decision (excision, short-term follow-up, no action) | CNN had higher specificity (82.5% vs 71.3%, p < 0.01) and ROC AUC (0.86 vs 0.79, p < 0.01) | Level I (dermoscopic images): Sn 86.6%, Sp 82.5%; Level II (dermoscopy and clinical information): Sn 88.9%, Sp 82.5% | Level I, all: Sn 86.6% (±9.3), Sp 71.3% (±11.2), ROC 0.79; experts: Sn 89.0%, Sp 74.5%; skilled: Sn 85.9%, Sp 68.5%; beginners: Sn 82.9%, Sp 67.6%. Level II, all: Sn 88.9% (±9.6), Sp 75.7% (±11.7, p < 0.05), ROC 0.82; experts: Sn 89.5%, Sp 77.7%; skilled: Sn 90.0%, Sp 77.2%; beginners: Sn 86.6%, Sp 71.2% | |
| Brinker et al.58 | Public: ISIC 2017, HAM10000, MED-NODE database (training); institutional (clinical images, test) | Dataset: 20,735; training/validation: 12,378/1,359 dermoscopic images | Test: 100 clinical images | E | R | B | N | 145 dermatologists: 88 junior physicians, 16 attendings, 35 senior physicians, 3 chief physicians | ResNet50 | Dichotomous | Comparable | Sn 89.4% (55–100); Sp 68.2% (47.5–86.25) | All participants: Sn 89.4% (55–100), Sp 64.4% (22.5–92.5); junior: Sn 88.9%, Sp 64.7%, ROC 0.768; attendings: Sn 92.8%, Sp 57.7%, ROC 0.753; senior: Sn 89.1%, Sp 66.3%, ROC 0.777; chief: Sn 91.7%, Sp 58.8%, ROC 0.753 | |
| Li et al.59 | Training: Chinese Skin Image Database (CSID), Youzhi AI software; test: institutional, China-Japan Friendship Hospital | Dataset: 1,438 patients; training: >200,000 dermoscopic images | Test: 212 clinical, 106 dermoscopic | E | R | Y | N | 11 participants: 4 primary level, 4 intermediate, 3 dermoscopy experts | Youzhi AI software (system version 2.2.5); GoogLeNet Inception v4 CNN architecture | Dichotomous: benign vs malignant | Comparable | Overall: Sn 74.84% ± 0.0149, Sp 92.96% ± 0.0052, Acc 85.85%; clinical images: Sn 71.1% ± 0.0169, Sp 90.6% ± 0.0107, Acc 83.02%; dermoscopic images: Sn 78.64% ± 0.0273, Sp 95.32% ± 0.0107, Acc 88.68% | AUC 0.63 (0.55–0.71); Acc (matched clinical and dermoscopy) 86.02%; Acc (random) 83.32%; clinical images: Acc 79.5% ± 0.0753; dermoscopic images: Acc 89.62% | |
| Haenssle et al.21 | Moleanalyzer Pro® (training); public: MSK-1 dataset, ISIC-2018 (algorithm test only); institutional (test) | MSK-1 (1,100 images) and ISIC-2018 (1,511 images) for algorithm testing | Test: 100 (convenience sample collected between 2014 and 2019); MSK-1 (1,100) and ISIC-2018 (1,511) for algorithm testing only | E | R | B | Y | 96 dermatologists: 17 beginners (<2 y), 29 skilled (2–5 y), 40 experts (>5 y) | Moleanalyzer Pro (FotoFinder Systems GmbH, Bad Birnbach, Germany); CNN architecture based on Google's Inception_v415 | Dichotomous: malignant/premalignant vs benign; management decision (treatment/excision, no action, follow-up) | CNN and most dermatologists showed comparable performance | Sn 95% (83.5–98.6); Sp 76.7% (64.6–85.6) | Level I (dermoscopy): Sn 83.8%, Sp 77.6%; Acc: beginners 79.9% (77.7–82.1), skilled 83.3% (80.1–85.6), experts 86.9% (85.5–88.3). Level II (dermoscopy + close-up + information): Sn 90.6%, Sp 82.4%; Acc: beginners 82.0% (79.3–84.7), skilled 85.4% (83.0–87.8), experts 88.5% (87.0–90.0) | |
| Willingham et al.60 | Institutional: Hawaii Pathologists' Laboratory (training and test); public: ISIC, MED-NODE, PH2, DermNet, Asan and Hallym datasets (training) | Training: 14,522 (ISIC), 539 (Hawaii-based dermatologist image dataset) | Test: 50 (25 public, 25 institutional) | I | R | B | N | 3 dermatologists | Google's InceptionV3 network | Benign vs malignant; melanoma vs nonmelanoma | Comparable | AUC 0.948; Acc 68% | Acc 64.7% | |
| Huang et al.61 | Institutional: Xiangya-Derm (Chinese database from 15 hospitals consisting of over 150,000 images) | Dataset: approximately 3,000 images (six subtypes of skin disease); training: 2,400 | Test: 600 | IΔ | R | B | N | 31 dermatologists: professors, senior attending doctors, young attending doctors, and medical students | Xy-SkinNet (ResNet-101 and ResNet-152 models) | Multiclass: 6 common disease categories | AI classification accuracy exceeded the average accuracy of dermatologists | Top-3 Acc 84.77% | Acc 78.15% | |
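
Most rows above report sensitivity (Sn), specificity (Sp), accuracy (Acc), and ROC AUC, usually with 95% confidence intervals. As a point of reference, the following is a minimal sketch of how such figures are derived from binary predictions; the function names and the choice of a Wilson score interval are our assumptions, not the protocol of any included study.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score interval for a proportion (one common CI choice;
    individual studies may have used other interval methods)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * np.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def binary_metrics(y_true, y_score, threshold=0.5):
    """Sn, Sp, Acc (with Wilson 95% CIs) and ROC AUC for a
    malignant-vs-benign task. y_true: 1 = malignant, 0 = benign;
    y_score: model-predicted probability of malignancy."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return {
        "Sn": (tp / (tp + fn), wilson_ci(tp, tp + fn)),
        "Sp": (tn / (tn + fp), wilson_ci(tn, tn + fp)),
        "Acc": ((tp + tn) / len(y_true), wilson_ci(tp + tn, len(y_true))),
        "AUC": roc_auc_score(y_true, y_score),  # threshold-independent
    }
```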
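
Nearly every model in the table (Inception v3/v4, ResNet50/101/152) is an ImageNet-pretrained backbone fine-tuned for lesion classification. Below is a minimal PyTorch sketch of that common transfer-learning pattern, assuming torchvision is available; it is illustrative only, and the freezing schedule and learning rate are our assumptions rather than any study's actual pipeline.

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from an ImageNet-pretrained ResNet50 (the backbone used by
# Brinker et al. and, combined with InceptionV3, by Tschandl et al.).
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

# Replace the 1000-class ImageNet head with a 2-class head
# (dichotomous benign-vs-malignant, as in most studies above).
model.fc = nn.Linear(model.fc.in_features, 2)

# Optionally freeze the backbone and train only the new head first,
# a common fine-tuning schedule (our assumption, not study-specific).
for name, param in model.named_parameters():
    if not name.startswith("fc"):
        param.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """One gradient step on a batch of (N, 3, 224, 224) lesion images."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```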
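
Tschandl et al.11 report a combined model built from the outputs of two CNNs (InceptionV3 and ResNet50). The table does not specify the fusion rule, so the sketch below shows one plausible choice, averaging the per-class softmax probabilities of the two networks; it should not be read as their exact method.

```python
import torch

@torch.no_grad()
def ensemble_probs(models, images):
    """Average per-class softmax probabilities across models
    (simple late fusion; illustrative, since the exact combination
    rule used by Tschandl et al. is not given in the table)."""
    probs = [torch.softmax(m(images), dim=1) for m in models]
    return torch.stack(probs).mean(dim=0)

# Usage sketch: preds = ensemble_probs([inception_v3, resnet50], batch).argmax(1)
```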