Table 2 General characteristics, datasets used, and performance of the included studies evaluating clinical images
Author | Database | Training set | Test set | I/E | Design | HP | CD | Participants | AI | Classification | Clinicians vs AI | AI performance | Clinicians’ performance | Augmented performance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Chang et al.47 | Institutional: Kaohsiung Medical University | Dataset: 24,178 Training/validation: not specified | Test: 769 | I | R | Y | N | 25 dermatologists | CADx system | 3-class: Malignant or benign or indeterminate | Comparable | Sn 85.63% Sp 87.65% Acc 90.64% ROC 0.949 | Sn 83.33% Sp 85.88% Acc 85.31% | |
Han et al.42 | Public: Training: Asan dataset, MED-NODE dataset, and atlas site images | Dataset: 598,854 Training: 19,398 Validation: portion of the Asan, Hallym and Edinburgh datasets. | Test: 480 images (260 images Asan test, 220 images Edinburgh) | I | R | B | N | 16 dermatologists: -10 professors -6 clinicians | Microsoft ResNet-152 model | Dichotomous: Benign vs malignant | Comparable | Asan dataset: Sn 86.4% ± 3.5% Sp 85.5% ± 3.2% AUC 0.91 ± 0.01 Edinburgh: Sn 85.1% ± 2.2% Sp 81.3% ± 2.9% AUC 0.89 ± 0.01 | | |
Fujisawa et al.43 | Institutional: University of Tsukuba Hospital from 2003 to 2016 (training and test) | Dataset: 6,009 Training/validation: 4,867 | Test: 1,142 | IΔ | R | B | N | 22 dermatologists: -13 board-certified -9 trainees | GoogLeNet DCNN model | Dichotomous: Benign vs malignant | DCNN achieved greater accuracy (P < .0001). | Sn 96.3% Sp 89.5% Acc 76.5% | Acc board-certified 85.3% ± 3.7% Acc trainees 74.4% ± 6.8% | |
Han et al.44 | Public: MED-NODE data set, Seven-Point Checklist Dermatology data set (training) Institutional: Asan Medical Center Department of Dermatology, Hallym National University Department of Plastic Surgery, Chonnam University Department of Plastic Surgery (training and test set) | Dataset: Training/validation: 1,106,886/2,844 | Test: 325 | I | R | Y | N | 119 clinicians: -13 board-certified dermatologists -34 dermatology residents -20 non-dermatologic physicians -52 general public with no medical background | Blob detector training using faster-RCNN20, a fine image selector and the disease classifier training using CNNs (SENet, SE-ResNeXt-50, and SE-ResNet-50). | Dichotomous: Benign vs malignant | Comparable | AUC: 91.9 Sn 98.2% Sp 77.9% | Dermatologists ROC: 0.90 Non-dermatologist physicians ROC: 0.725 (Sn and Sp for each one not specified) Overall: Sn 95.0% Sp 72.1% | |
Zhao et al.48 | Institutional: XiangyaDerm, which was collected from Xiangya Hospital | Dataset: 150,223 Training/validation: 4,500 | Test: 60 | I | R | Y | N | 20 dermatologists | Xception architecture | 3-class risk classification: low risk, high risk, and dangerous | Classifier outperformed dermatologists | Acc 82.7% Benign: Sn 93%, Sp 88% Low degree: Sn 85%, Sp 85% High degree: Sn 86%, Sp 91% AUC: - Low-risk: 0.959 - High-risk: 0.919 - Dangerous: 0.947 | Sn: - Low-risk: 61% - High-risk: 49.5% - Dangerous: 64% Sp: - Low-risk: 4.9% - High-risk: 29% - Dangerous: 29% | |
Han et al.52 | Public: Asan Medical Center and images from websites (training) Institutional: Department of Dermatology, Severance Hospital, Seoul, Korea (test set) | Dataset: - Dataset A (Dichotomous): 40,331 - Dataset B (Multiclass): 39,721 Training: 1,106,886 images | Test: 1,320 | E | R | Y | N | 65 attending physicians (dichotomous) 44 dermatologists 5.7 ± 5.2 years of experience (multiclass) | Disease classifier (SENet and SE-ResNeXt-50) was trained with the help of a region-based CNN (faster RCNN) | Dichotomous: benign or malignant Multiclass: diagnosis | Dichotomous: first clinical impression of physicians was superior to that of the algorithm. Multiclass: classification was comparable. | Dichotomous: AUC 0.863 (0.852–0.875) Sn 62.7% (59.9–65.1) Sp 90.0% (89.4–90.6) PPV 45.4% (43.7–47.3) NPV 94.8% (94.4–95.2) Multiclass: Sn 66.9% (57.7–76.0) Sp 87.4% (82.5–92.2) | Dichotomous: Sn 70.2% Sp 95.6% PPV 68.1% NPV 96.0% Multiclass: Sn 65.8% (55.7–75.9) Sp 85.7% (82.4–88.9) | |
Huang et al.45 | Institutional: Xiangya Hospital, Central South University | Dataset: 3,299 Training: 2,474 | Test: 825 Additional test set: 116 | IΔ | R | Y | N | 21 participants: -8 expert dermatologists -13 general dermatologists | 4 CNN networks: InceptionV3, Inception-ResNetV2, DenseNet121, and ResNet50 | Dichotomous: BCC vs SK | InceptionResNetV2 model outperformed general dermatologists and was comparable to expert dermatologists. | PPV 89.7% NPV 10.3% AUC 0.937 | PPV 73.2% NPV 21.5% | |
Han et al.53 (I) | Public: ASAN, Web, MED-NODE, images from websites (training). Edinburgh dataset (validation) Institutional: SNU datasets (validation and test) SNU dataset consisted of data from three university hospitals (Seoul National University Bundang Hospital, Inje University Sanggye Paik Hospital, and Hallym University Dongtan Hospital) | Dataset: 224,181 Training: 220,680, 174 disease classes Validation: SNU dataset: 2,201 images of 134 disorders Edinburgh dataset: 1,300 images of 10 tumorous skin diseases. | Test: 240 images from SNU dataset | E | R | B | N | 70 participants: - 21 dermatologists - 26 dermatology residents - 23 non-medical professionals | Not specified | Dichotomous: melanoma vs nevus and suggesting treatment option Multi-class classification of 134 skin disorders | Dichotomous: algorithm performance was similar to that of dermatology residents but slightly lower than that of dermatologists | SNU AUC 0.937 ± 0.004 Edinburgh AUC 0.928 ± 0.002 Multiclass: mean top 1, 3, and 5 accuracies: 44.8 ± 1.2%, 69.0 ± 0.9%, and 78.1 ± 0.3% | Dermatologists Sn 77.4% ± 10.7 Sp 92.9% ± 2.4 AUC 0.66 ± 0.08 Non-medical professionals Sn 47.6 ± 33.1% | Sn and Sp of clinicians were improved by 12.1% (p < 0.0001) and 1.1% (p < 0.0001), respectively. Non-medical professionals improved Sn from 47.6 ± 33.1% to 87.5 ± 17.2% (p < 0.0001) without loss in Sp. |
Jinnai et al.54 | Institutional: Department of Dermatologic Oncology in the National Cancer Center Hospital (training and test) | Dataset: 5,846 Training/validation: 4,732 images. | Test: 200 images | IΔ | R | B | N | 20 dermatologists: -10 board-certified dermatologists (BCDs) -10 dermatologic trainees (TRNs) | Faster, region-based CNN (FRCNN) | -Dichotomous: benign vs malignant -Multiclass: Six-class classification | Accuracy of FRCNN was significantly better than that of the dermatologists (p < 0.00001) | Dichotomous: -Acc: 91.5% -Sn: 83.3% -Sp: 94.5% Multiclass: -Acc: 86.2% -NPV 5.5% -PPV 84.7% | Dichotomous: BCDs: Acc 86.6%, Sn 86.3%, Sp 86.6%; TRNs: Acc 85.3%, Sn 83.5%, Sp 85.9% Multiclass: Acc: BCDs 79.5%; TRNs 75.1% | |
Polesie et al.46 | Institutional: Department of Dermatology at Sahlgrenska University Hospital | Dataset: 1,551. 819 Melanoma in situ and 732 invasive melanomas. Training/validation: 1,051/200 | Test: 300 images | IΔ | R | Y | N | 7 dermatologists: -1 resident physician -6 board-certified dermatologists | De novo CNN | Dichotomous: in situ vs invasive melanoma | CNN was outperformed by dermatologists. | AUC 0.72 (95% CI 0.66–0.78) | AUC 0.81 (95% CI 0.76–0.86) | |
Pangti et al.49 | Public: public archives (http://www.hellenicdermatlas.com/en and http://www.danderm.dk/atlas, dermatologists across India.) (training) Institutional | Training/validation: 17,784 images, 40 skin diseases. | Test: 100 images, 58 biopsy-proven BCC, 42 facial non-BCC lesions. | E | R | B | N | 50 participants: - 36 dermatologists - 14 non-dermatologists: 5 surgeons and 9 general physicians | DenseNet-161 TensorFlow | Multiclass | Sn and Acc of the app were significantly higher than those of both dermatologists (P < 0.0001) and non-dermatologists (P < 0.0001). The Sp was comparable (P = 0.07). | AUC 0.933 Sn 80.24 ± 3.11% Sp 91.57 ± 2.66% Acc 84.97 ± 2.45% | BCC diagnosis -Dermatologists: Sn 45.98% ± 21.21 Sp 96.03% ± 6.52 Acc 65% ± 11.7 -Non-dermatologists: Sn 10.71% ± 10.53 Sp 98.47% ± 3.19 Acc 47.57% ± 6.32 | |
Agarwala et al.50 | Public: Triage tool www.triage.com free online system composed of four CNN models (training) Institutional (test) | Training: > 200,000 images, > 500 skin conditions | Test: 353 images | E | R | B | Y | 21 US board-certified dermatologists | Triage algorithm | Multiclass | Accuracy of the dermatologists was better than that of the AI | Acc 63.3% (95% CI 58.0–68.4%) | Acc 69.1% (95% CI 63.7–74.1) | |
Kim et al.51 | Public Pre-trained algorithm Institutional: Department of Dermatology, Asan Medical Center, Seoul National University, Bundang Hospital (Test) | Training: 721,749 images, 178 disease classes | Test: 285 images | E | P | B | N | -10 attending physicians (11.4 ± 8.8 years’ experience after board certification) -11 dermatology trainees -7 intern doctors | Model Dermatology; https://modelderm.com | Multiclass | There was no direct comparison between AI and clinicians | Top-1 of the algorithm Sn 52.2% Sp 93.4% Acc 53.5% Top-2 of the algorithm Sn 69.6% Sp 78.5% Acc 66.0% Top-3 of the algorithm Sn 78.3% Sp 66.1% Acc 70.8% | Top-1 Dermatologist Sn 79.3% Sp 90.2% Acc 61.8% Trainees Sn 65.5% Sp 81.3% Acc 46.5% Top-2 Dermatologist Sn 86.2% Sp 82.1% Acc 69.4% Trainees Sn 93.1% Sp 51.8% Acc 54.2% Top-3 Dermatologist Sn 86.2% Sp 79.5% Acc 71.5% Trainees Sn 93.1% Sp 49.1% Acc 54.9% | Top-1/Top-2/Top-3 accuracies after assistance were significantly higher than those before assistance. AI augmented the diagnostic accuracy of trainee doctors. |
Ba et al.41 | Institutional: Chinese PLA General Hospital & Medical School | Dataset: 29,280 Training/validation: 25,773. 10 categories of cutaneous tumors | Test: 400 from 2,107 images dataset. | IΔ | R | Y | N | 18 board-certified dermatologists, with different levels of experience | EfficientNet-B3 | Dichotomous: malignant vs benign | CNN had higher Acc than unassisted dermatologists. CNN-assisted dermatologists achieved a higher Acc and kappa (p < 0.001) than unassisted dermatologists. Dermatologists with less experience benefited more from CNN assistance. | Multiclass Acc 78.45% Dichotomous Sn 83.21% Sp 91.3% (85.5–97.1) | Multiclass Acc 62.78% Dichotomous Sn 83.21% Sp 80.92% | Multiclass Acc: 76.60% vs. 62.78%, p < 0.001; kappa 0.74 vs. 0.59, p < 0.001 Dichotomous Sn 89.56% vs. 83.21%, p < 0.001 Sp 87.90% vs. 80.92%, p < 0.001 |
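
The performance columns above report sensitivity (Sn), specificity (Sp), positive and negative predictive values (PPV, NPV), and accuracy (Acc). As a reading aid, the following minimal sketch shows how these quantities are conventionally derived from a binary confusion matrix. It assumes the standard textbook definitions, which the individual studies may or may not have applied in exactly this form. The names `ConfusionCounts` and `diagnostic_metrics` are hypothetical, and the counts are illustrative only; they are not taken from any included study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Binary confusion-matrix counts (malignant = positive class)."""
    tp: int  # malignant lesions called malignant
    fp: int  # benign lesions called malignant
    tn: int  # benign lesions called benign
    fn: int  # malignant lesions called benign


def diagnostic_metrics(c: ConfusionCounts) -> dict:
    """Return Sn, Sp, PPV, NPV and Acc as fractions in [0, 1]."""
    total = c.tp + c.fp + c.tn + c.fn
    return {
        "Sn": c.tp / (c.tp + c.fn),    # sensitivity (recall)
        "Sp": c.tn / (c.tn + c.fp),    # specificity
        "PPV": c.tp / (c.tp + c.fp),   # positive predictive value
        "NPV": c.tn / (c.tn + c.fn),   # negative predictive value
        "Acc": (c.tp + c.tn) / total,  # overall accuracy
    }


# Illustrative counts only (assumption, not study data).
example = ConfusionCounts(tp=80, fp=10, tn=90, fn=20)
metrics = diagnostic_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Unlike Sn and Sp, PPV and NPV depend on the proportion of malignant lesions in the test set, so predictive values from studies with different case mixes are not directly comparable.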