Table 3 Included studies general characteristics, dataset used, and performance evaluating both dermoscopic and clinical images

From: A systematic review and meta-analysis of artificial intelligence versus clinicians for skin cancer diagnosis

Columns: Author; Database; Dataset; Test set (I/E); Design; HP; CD; Participants; AI model; Classification; Clinicians vs AI; AI performance (%, 95% CI); Clinicians' performance (%, 95% CI); Augmented performance.

Esteva et al.55
Database: Public: ISIC, Edinburgh Dermofit Library, Stanford Hospital
Dataset: 129,450 images; training/validation: 127,463; test: 1,942 (376 for comparison)
Test set (I/E): I
Design: R
HP: Y
CD: N
Participants: 21 dermatologists
AI model: GoogLeNet Inception v3
Classification: malignant vs benign vs non-neoplastic; multiclass (9-class)
Clinicians vs AI: comparable
AI performance: overall Acc 72.1% ± 0.9; 9-class Acc 55.4% ± 1.7
Clinicians' performance: Acc 65.78%; 9-class Acc 54.15%
Augmented performance: not reported

Tschandl et al.11
Database: Institutional database from C.R., Australia (training and test); Medical University of Vienna, image database from C.R., and a convenience sample of rare diagnoses (test)
Dataset: training: 7,895 dermoscopy, 5,829 close-up; validation: 340 dermoscopy, 635 close-up; test: 2,072 from multiple sources
Test set (I/E): I
Design: R
HP: Y
CD: N
Participants: 95 participants: beginners (<3 y), intermediate (3–10 y), experts (>10 y)
AI model: CNN (combined model using the outputs of two CNNs, InceptionV3 and ResNet50)
Classification: dichotomous: benign vs malignant
Clinicians vs AI: comparable
AI performance: Sn 80.5% (79.0–82.1); Sp 53.5% (51.7–55.3)
Clinicians' performance: Sn 77.6% (74.7–80.5); Sp 51.3% (48.4–54.3); mean AUC 0.695 (0.676–0.713): beginners 0.655 (0.626–0.684), intermediate 0.690 (0.657–0.722), experts 0.741 (0.719–0.763)
Augmented performance: not reported

Haenssle et al.57 (I)
Database: Public: ISIC 2016; institutional: Department of Dermatology, University of Heidelberg, Germany
Dataset: training/validation: not specified (ISIC); test: 100
Test set (I/E): E
Design: R
HP: N
CD: Y
Participants: 58 dermatologists: 17 beginners (<2 y), 11 skilled (2–5 y), 30 experts (>5 y)
AI model: Google's Inception v4 CNN architecture
Classification: dichotomous: melanoma vs nevus; management decision (excision, short-term follow-up, no action)
Clinicians vs AI: the CNN's specificity (82.5% vs 71.3%, p < 0.01) and ROC AUC (0.86 vs 0.79, p < 0.01) were higher
AI performance: Level I (dermoscopic images): Sn 86.6%, Sp 82.5%; Level II (dermoscopy and clinical information): Sn 88.9%, Sp 82.5%
Clinicians' performance: Level I: all: Sn 86.6% (±9.3), Sp 71.3% (±11.2), ROC AUC 0.79; experts: Sn 89.0%, Sp 74.5%; skilled: Sn 85.9%, Sp 68.5%; beginners: Sn 82.9%, Sp 67.6%. Level II: all: Sn 88.9% (±9.6), Sp 75.7% (±11.7, p < 0.05), ROC AUC 0.82; experts: Sn 89.5%, Sp 77.7%; skilled: Sn 90.0%, Sp 77.2%; beginners: Sn 86.6%, Sp 71.2%
Augmented performance: not reported

Brinker et al.58
Database: Public: ISIC 2017, HAM10000, MED-NODE (training); institutional (clinical images, test)
Dataset: 20,735 images; training/validation: 12,378/1,359 dermoscopic images; test: 100 clinical images
Test set (I/E): E
Design: R
HP: B
CD: N
Participants: 145 dermatologists: 88 junior physicians, 16 attendings, 35 senior physicians, 3 chief physicians
AI model: ResNet50
Classification: dichotomous
Clinicians vs AI: comparable
AI performance: Sn 89.4% (55–100); Sp 68.2% (47.5–86.25)
Clinicians' performance: all participants: Sn 89.4% (55–100), Sp 64.4% (22.5–92.5); junior: Sn 88.9%, Sp 64.7%, ROC AUC 0.768; attendings: Sn 92.8%, Sp 57.7%, ROC AUC 0.753; senior: Sn 89.1%, Sp 66.3%, ROC AUC 0.777; chief: Sn 91.7%, Sp 58.8%, ROC AUC 0.753
Augmented performance: not reported

Li et al.59
Database: Training: Chinese Skin Image Database (CSID), Youzhi AI software; test: institutional, China-Japan Friendship Hospital
Dataset: 1,438 patients; training: >200,000 dermoscopic images; test: 212 clinical, 106 dermoscopic
Test set (I/E): E
Design: R
HP: Y
CD: N
Participants: 11 participants: 4 primary level, 4 intermediate, 3 dermoscopy experts
AI model: Youzhi AI software (system version 2.2.5), GoogLeNet Inception v4 CNN architecture
Classification: dichotomous: benign vs malignant
Clinicians vs AI: comparable
AI performance: overall: Sn 74.84% ± 0.0149, Sp 92.96% ± 0.0052, Acc 85.85%; clinical images: Sn 71.1% ± 0.0169, Sp 90.6% ± 0.0107, Acc 83.02%; dermoscopic images: Sn 78.64% ± 0.0273, Sp 95.32% ± 0.0107, Acc 88.68%; AUC 0.63 (0.55–0.71); Acc (Δ, matched clinical and dermoscopy) 86.02%; Acc (Δ, random) 83.32%
Clinicians' performance: clinical images: Acc 79.5% ± 0.0753; dermoscopic images: Acc 89.62%
Augmented performance: not reported

Haenssle et al.21
Database: Moleanalyzer Pro® (training); public: MSK-1, ISIC-2018 (algorithm test only); institutional (test)
Dataset: MSK-1 (1,100 images) and ISIC-2018 (1,511 images), used only to test the algorithm; test: 100 images (convenience sample collected between 2014 and 2019)
Test set (I/E): E
Design: R
HP: B
CD: Y
Participants: 96 dermatologists: 17 beginners (<2 y), 29 skilled (2–5 y), 40 experts (>5 y)
AI model: Moleanalyzer Pro (FotoFinder Systems GmbH, Bad Birnbach, Germany); CNN architecture based on Google's Inception v4
Classification: dichotomous: malignant/premalignant vs benign; management decision (treatment/excision, no action, follow-up)
Clinicians vs AI: the CNN and most dermatologists performed comparably
AI performance: Sn 95% (83.5–98.6); Sp 76.7% (64.6–85.6)
Clinicians' performance: Level I (dermoscopy): Sn 83.8%, Sp 77.6%; Acc: beginners 79.9% (77.7–82.1), skilled 83.3% (80.1–85.6), experts 86.9% (85.5–88.3). Level II (dermoscopy + close-up + clinical information): Sn 90.6%, Sp 82.4%; Acc: beginners 82.0% (79.3–84.7), skilled 85.4% (83.0–87.8), experts 88.5% (87.0–90.0)
Augmented performance: not reported

Willingham et al.60
Database: Institutional: Hawaii Pathologists' Laboratory (training and test); public: ISIC, MED-NODE, PH, DermNet, Asan and Hallym datasets (training)
Dataset: training: 14,522 ISIC images plus 539 from a Hawaii-based dermatologist image dataset; test: 50 (25 public, 25 institutional)
Test set (I/E): I
Design: R
HP: B
CD: N
Participants: 3 dermatologists
AI model: Google's InceptionV3 network
Classification: benign vs malignant; melanoma vs nonmelanoma
Clinicians vs AI: comparable
AI performance: AUC 0.948; Acc 68%
Clinicians' performance: Acc 64.7%
Augmented performance: not reported

Huang et al.61
Database: Institutional: Xiangya-Derm (Chinese database from 15 hospitals, comprising over 150,000 images)
Dataset: approximately 3,000 images (six subtypes of skin diseases); training: 2,400; test: 600
Test set (I/E): not reported
Design: R
HP: B
CD: N
Participants: 31 dermatologists: professors, senior attending doctors, young attending doctors, and medical students
AI model: Xy-SkinNet (ResNet-101 and ResNet-152 models)
Classification: 6-category (common disease types)
Clinicians vs AI: the AI model's classification accuracy exceeded the dermatologists' average accuracy
AI performance: top-3 Acc 84.77%
Clinicians' performance: Acc 78.15%
Augmented performance: not reported
  1. HP histopathology confirmation, I/E internal/external test set, P prospective, R retrospective, B both (a subset of lesions were biopsy proven and a subset based on clinical/consensus diagnosis), CD clinical data (metadata) available, CNN convolutional neural network, DCNN deep convolutional neural network, AK actinic keratosis, BCC basal cell carcinoma, BKL benign keratosis, SK seborrheic keratosis, DF dermatofibroma, MEL melanoma, NT not trained, SCC squamous cell carcinoma, VASC vascular lesion, Sn sensitivity, Sp specificity, Acc accuracy, NPV negative predictive value, PPV positive predictive value, ROC receiver operating characteristic curve, AI artificial intelligence. Δ hold-out dataset.
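The performance columns report the metrics defined in the footnote (Sn, Sp, Acc, PPV, NPV), each derived from a 2×2 confusion matrix, usually with a 95% CI. As a quick reference, here is a minimal Python sketch of both calculations; the counts are hypothetical (not drawn from any included study), and the Wilson score interval is only one common CI method — the included papers do not all state which they used.

```python
import math

def diagnostic_metrics(tp, fp, fn, tn):
    """Metrics abbreviated in the table footnote, from a 2x2 confusion matrix."""
    return {
        "Sn": tp / (tp + fn),                    # sensitivity (true positive rate)
        "Sp": tn / (tn + fp),                    # specificity (true negative rate)
        "Acc": (tp + tn) / (tp + fp + fn + tn),  # overall accuracy
        "PPV": tp / (tp + fp),                   # positive predictive value
        "NPV": tn / (tn + fn),                   # negative predictive value
    }

def wilson_ci(successes, n, z=1.96):
    """Wilson score 95% CI for a proportion such as Sn or Sp."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# Hypothetical example: 50 malignant and 50 benign lesions in a test set.
m = diagnostic_metrics(tp=45, fp=11, fn=5, tn=39)
lo, hi = wilson_ci(45, 50)  # CI around the sensitivity of 45/50 = 90%
```

Note that the Wilson interval is asymmetric around the point estimate, which matches the asymmetric CIs reported in several rows above (e.g. Sn 95% (83.5–98.6) in Haenssle et al.21).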