Table 2 General characteristics, datasets used, and performance of the included studies evaluating clinical images

From: A systematic review and meta-analysis of artificial intelligence versus clinicians for skin cancer diagnosis

Author

Database

Training set

Test set

I/E

Design

HP

CD

Participants

AI

Classification

Clinicians’ vs AI

AI performance

Clinicians’ performance

Augmented performance

Chang et al.47

Institutional:

Kaohsiung Medical University

Dataset: 24,178

Training/validation: not specified

Test: 769

I

R

Y

N

25 dermatologists

CADx system

3-class: Malignant or benign or indeterminate

Comparable

Sn 85.63%

Sp 87.65%

Acc 90.64%

ROC 0.949

Sn 83.33%

Sp 85.88%

Acc 85.31%

 

Han et al.42

Public:

Training: Asan dataset, MED-NODE dataset, and atlas site images

Dataset: 598,854

Training: 19,398

Validation: portion of the Asan, Hallym, and Edinburgh datasets.

Test: 480 images (260 images Asan test, 220 images Edinburgh)

I

R

B

N

16 dermatologists:

-10 professors

-6 clinicians

Microsoft ResNet-152 model

Dichotomous: Benign vs malignant

Comparable

Asan dataset:

Sn 86.4% ± 3.5%

Sp 85.5% ± 3.2%

AUC 0.91 ± 0.01

Edinburgh:

Sn 85.1% ± 2.2%

Sp 81.3% ± 2.9%

AUC 0.89 ± 0.01

  

Fujisawa et al.43

Institutional:

University of Tsukuba Hospital from 2003 to 2016 (training and test)

Dataset: 6,009

Training/validation: 4,867

Test: 1,142

R

B

N

22 dermatologists:

-13 board-certified

-9 trainees

GoogLeNet DCNN model

Dichotomous: Benign vs malignant

DCNN achieved greater accuracy (P < .0001).

Sn 96.3%

Sp 89.5%

Acc 76.5%

Acc board-certified 85.3% ± 3.7%

Acc trainees 74.4% ± 6.8%

 

Han et al.44

Public: MED-NODE data set, Seven-Point Checklist Dermatology data set (training)

Institutional: Asan Medical Center Department of Dermatology, Hallym National University Department of Plastic Surgery, Chonnam University Department of Plastic Surgery (training and test set)

Dataset:

Training/validation: 1,106,886/2,844

Test: 325

I

R

Y

N

119 clinicians:

-13 board-certified dermatologists

-34 dermatology residents

-20 non-dermatologic physicians

-52 general public with no medical background

Blob detector training using faster-RCNN20, a fine image selector and the disease classifier training using CNNs (SENet, SE-ResNeXt-50, and SE-ResNet-50).

Dichotomous: Benign vs malignant

Comparable

AUC: 91.9

Sn 98.2%

Sp 77.9%

Dermatologists ROC: 0.90

Non-dermatologist physicians ROC: 0.725

(Sn and Sp for each one not specified)

Overall: Sn 95.0%

Sp 72.1%

 

Zhao et al.48

Institutional: XiangyaDerm, which was collected from Xiangya Hospital

Dataset: 150,223

Training/validation: 4,500

Test: 60

I

R

Y

N

20 dermatologists

Xception architecture

3 risk classification: low risk, high risk, and dangerous

Classifier outperforms dermatologists

Acc 82.7%

Benign: Sn 93%, Sp 88%

Low degree: Sn 85%, Sp 85%

High degree: Sn 86%, Sp 91%

AUC:

- Low-risk: 0.959

-High-risk: 0.919

- Dangerous: 0.947

Sn:

- Low-risk: 61%

- High-risk: 49.5%

- Dangerous: 64%

Sp

- Low-risk: 4.9%

-High-risk: 29%

- Dangerous: 29%

 

Han et al.52

Public: Asan Medical Center and images from websites (training)

Institutional: Department of Dermatology, Severance Hospital, Seoul, Korea (test set)

Dataset:

- Dataset A (Dichotomous): 40,331

- Dataset B (Multiclass): 39,721

Training: 1,106,886 images

Test: 1,320

E

R

Y

N

65 attending physicians (dichotomous)

44 dermatologists 5.7 ± 5.2 years of experience (multiclass)

Disease classifier (SENet and SE-ResNeXt-50) was trained with the help of a region-based CNN (faster RCNN)

Dichotomous: benign or malignant

Multiclass: diagnosis

The first clinical impression of physicians was superior to that of the algorithm.

Multiclass classification was comparable.

Dichotomous:

AUC 0.863 (0.852–0.875)

Sn 62.7% (59.9–65.1)

Sp 90.0% (89.4–90.6)

PPV 45.4% (43.7–47.3)

NPV 94.8% (94.4–95.2)

Multiclass:

Sn 66.9% (57.7–76.0)

Sp 87.4% (82.5–92.2)

Dichotomous:

Sn 70.2%

Sp 95.6%

PPV 68.1%

NPV 96.0%

Multiclass:

Sn 65.8% (55.7–75.9)

Sp 85.7% (82.4–88.9)

 

Huang et al.45

Institutional: Xiangya Hospital, Central South University

Dataset: 3,299

Training: 2,474

Test: 825

Additional test set: 116

R

Y

N

21 participants:

-8 expert dermatologists

-13 general dermatologists

4 CNN networks: InceptionV3, Inception-ResNetV2, DenseNet121, and ResNet50

Dichotomous: BCC vs SK

InceptionResNetV2 model outperformed general dermatologists and was comparable to expert dermatologists.

PPV 89.7%

NPV 10.3%

AUC 0.937

PPV 73.2%

NPV 21.5%

 

Han et al.53 (I)

Public: ASAN, Web, MED-NODE, images from websites (training).

Edinburgh dataset (validation)

Institutional: SNU datasets (validation and test) SNU dataset consisted of data from three university hospitals (Seoul National University Bundang Hospital, Inje University Sanggye Paik Hospital, and Hallym University Dongtan Hospital)

Dataset: 224,181

Training: 220,680, 174 disease classes

Validation:

SNU dataset: 2,201 images of 134 disorders

Edinburgh dataset: 1,300 images of 10 tumorous skin diseases.

Test: 240 images from SNU dataset

E

R

B

N

70 participants:

- 21 dermatologists

- 26 dermatology residents

- 23 non-medical professionals

Not specified

Dichotomous: melanoma vs nevus and suggesting treatment option

Multi-class classification of 134 skin disorders

Dichotomous:

The algorithm performed similarly to dermatology residents but slightly worse than dermatologists

SNU AUC 0.937 ± 0.004

Edinburgh AUC 0.928 ± 0.002

Multiclass:

mean top 1, 3, and 5 accuracies: 44.8 ± 1.2%, 69.0 ± 0.9%, and 78.1 ± 0.3%

Dermatologists

Sn 77.4% ± 10.7

Sp 92.9% ± 2.4

AUC 0.66 ± 0.08

Non-medical professionals

Sn 47.6 ± 33.1%

Sn and Sp of clinicians were improved by 12.1% (p < 0.0001) and 1.1% (p < 0.0001), respectively.

Non-medical professionals improved Sn from 47.6 ± 33.1% to 87.5 ± 17.2% (p < 0.0001) without loss in Sp.

Jinnai et al.54

Institutional: Department of Dermatologic Oncology in the National Cancer Center Hospital (training and test)

Dataset: 5,846

Training/validation: 4,732 images.

Test: 200 images

R

B

N

20 dermatologists:

-10 board-certified dermatologists (BCDs)

- 10 dermatologic trainees (TRNs)

Faster, region-based CNN (FRCNN)

-Dichotomous: benign vs malignant

-Multiclass: Six-class classification

Accuracy of FRCNN was significantly better than that of the dermatologists (p < 0.00001)

Dichotomous:

-Acc: 91.5%

-Sn: 83.3%

-Sp: 94.5%

Multiclass:

-Acc: 86.2%

-NPV 5.5%

-PPV 84.7%

Dichotomous:

BCDs: Acc 86.6%, Sn 86.3%, Sp 86.6%

TRNs: Acc 85.3%, Sn 83.5%, Sp 85.9%

Multiclass:

Acc: BCDs 79.5%; TRNs 75.1%

 

Polesie et al.46

Institutional: Department of Dermatology at Sahlgrenska University Hospital

Dataset: 1,551.

819 Melanoma in situ and 732 invasive melanomas.

Training/validation: 1,051/200

Test: 300 images

R

Y

N

7 dermatologists:

-1 resident physician

-6 board-certified dermatologists

De novo CNN

Dichotomous: in situ vs invasive melanoma

CNN was outperformed by dermatologists.

AUC 0.72 (95% CI 0.66–0.78)

AUC: 0.81 (95% CI 0.76–0.86)

 

Pangti et al.49

Public: public archives (http://www.hellenicdermatlas.com/en and http://www.danderm.dk/atlas) and dermatologists across India (training)

Institutional

Training/validation: 17,784 images, 40 skin diseases.

Test: 100 images, 58 biopsy-proven BCC, 42 facial non-BCC lesions.

E

R

B

N

50 participants:

- 36 dermatologists

- 14 non-dermatologists: 5 surgeons and 9 general physicians

DenseNet-161 Tensorflow

Multiclass

Sn and Acc of the app were significantly higher than both dermatologists (P < 0.0001) and non-dermatologists (P < 0.0001). The Sp was comparable (P = 0.07).

AUC 0.933

Sn 80.24 ± 3.11%

Sp 91.57 ± 2.66%

Acc 84.97 ± 2.45%

BCC diagnosis

-Dermatologists:

Sn 45.98% ± 21.21

Sp 96.03% ± 6.52

Acc 65% ± 11.7

-Non-dermatologists:

Sn 10.71% ±10.53

Sp 98.47% ±3.19

Acc 47.57% ± 6.32

 

Agarwala et al.50

Public: Triage tool www.triage.com

free online system composed of four CNN models (training)

Institutional (test)

Training: > 200,000 images, > 500 skin conditions

Test: 353 images

E

R

B

Y

21 US board-certified dermatologists

Triage algorithm

Multiclass

The dermatologists’ accuracy was better than the AI’s accuracy

Acc: 63.3% (95% CI 58.0–68.4)

Acc: 69.1% (95% CI 63.7–74.1)

 

Kim et al.51

Public: pre-trained algorithm

Institutional:

Department of Dermatology, Asan Medical Center, Seoul National University, Bundang Hospital (Test)

Training: 721,749 images, 178 disease classes

Test: 285 images

E

P

B

N

-10 attending physicians (11.4 ± 8.8 years’ experience after board certification)

-11 dermatology trainees

-7 intern doctors

Model Dermatology; https://modelderm.com

Multiclass

There was no direct comparison between AI and clinicians

Top-1 of the algorithm

Sn 52.2%

Sp 93.4%

Acc 53.5%

Top-2 of the algorithm

Sn 69.6%

Sp 78.5%

Acc 66.0%

Top-3 of the algorithm

Sn 78.3%

Sp 66.1%

Acc 70.8%

Top-1 Dermatologist

Sn 79.3%

Sp 90.2%

Acc 61.8%

Trainees

Sn 65.5%

Sp 81.3%

Acc 46.5%

Top-2 Dermatologist

Sn 86.2%

Sp 82.1%

Acc 69.4%

Trainees

Sn 93.1%

Sp 51.8%

Acc 54.2%

Top-3 Dermatologist

Sn 86.2%

Sp 79.5%

Acc 71.5%

Trainees

Sn 93.1%

Sp 49.1%

Acc 54.9%

Top-1/Top-2/Top-3 accuracies after assistance were significantly higher than those before assistance

AI augmented the diagnostic accuracy of trainee doctors

Ba et al.41

Institutional:

Chinese PLA General Hospital & Medical School

Dataset: 29,280

Training/validation: 25,773.

10 categories of cutaneous tumors

Test: 400 images from a 2,107-image dataset.

I Δ

R

Y

N

18 board-certified dermatologists, with different levels of experience

EfficientNet-B3

Dichotomous: malignant vs benign

CNN had higher Acc than unassisted dermatologists.

CNN-assisted dermatologists achieved a higher Acc and kappa (p < 0.001) than unassisted dermatologists. Dermatologists with less experience benefited more from CNN assistance.

Multiclass

Acc 78.45%

Dichotomous

Sn 83.21%

Sp 91.3% (85.5-97.1)

Multiclass

Acc 62.78%

Dichotomous

Sn 83.21%

Sp 80.92%

Multiclass Acc:

76.60% vs. 62.78%, p < 0.001; kappa 0.74 vs. 0.59, p < 0.001

Dichotomous

Sn 89.56% vs. 83.21%, p < 0.001

Sp 87.90% vs. 80.92%, p < 0.001

  1. HP histopathology confirmation, I/E internal/external test set, P prospective, R retrospective, B both (a subset of lesions were biopsy proven and a subset based on clinical/consensus diagnosis), CD clinical data (metadata) available, CNN convolutional neural network, DCNN deep convolutional neural network, AK actinic keratosis, BCC basal cell carcinoma, BKL benign keratosis, SK seborrheic keratosis, DF dermatofibroma, MEL melanoma, NT not trained, SCC squamous cell carcinoma, VASC vascular lesion, Sn sensitivity, Sp specificity, Acc accuracy, NPV negative predictive value, PPV positive predictive value, ROC receiver operating characteristic curve, AI artificial intelligence. Δ hold-out dataset.
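
As a reading aid for the Sn, Sp, Acc, PPV, and NPV columns above, the following minimal Python sketch shows how these five measures are conventionally derived from a 2×2 confusion matrix for a dichotomous (benign vs malignant) task. The counts are hypothetical and are not taken from any included study; the names `ConfusionCounts` and `diagnostic_metrics` are illustrative only, and the sketch assumes every denominator is non-zero.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts for a dichotomous task, with 'malignant' as the positive class."""
    tp: int  # malignant lesions classified as malignant
    fp: int  # benign lesions classified as malignant
    tn: int  # benign lesions classified as benign
    fn: int  # malignant lesions classified as benign


def diagnostic_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Return Sn, Sp, Acc, PPV, and NPV as fractions between 0 and 1.

    Assumes each denominator is non-zero (at least one malignant lesion,
    one benign lesion, one positive call, and one negative call).
    """
    total = c.tp + c.fp + c.tn + c.fn
    return {
        "Sn": c.tp / (c.tp + c.fn),    # sensitivity
        "Sp": c.tn / (c.tn + c.fp),    # specificity
        "Acc": (c.tp + c.tn) / total,  # accuracy
        "PPV": c.tp / (c.tp + c.fp),   # positive predictive value
        "NPV": c.tn / (c.tn + c.fn),   # negative predictive value
    }


# Hypothetical test set: 100 malignant and 100 benign lesions.
example = ConfusionCounts(tp=80, fp=10, tn=90, fn=20)
metrics = diagnostic_metrics(example)
for name, value in metrics.items():
    print(f"{name} {value:.1%}")
```

Because PPV and NPV depend on the proportion of malignant lesions in the test set, while Sn and Sp do not, predictive values from test sets with different class balances are not directly comparable.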