Introduction

The incidence of most skin diseases has risen significantly over the past decades1. Compared with disorders of other systems, the diagnosis of skin disease depends much more on lesion presentation. With more than 1500 different dermatological diagnoses, the diagnostic accuracy of general practitioners in dermatological disease has been estimated at 48 to 77%2. Clinicians therefore face the challenge of increasing diagnostic accuracy and further improving therapeutic efficiency.

Much research has focused on improving diagnostic techniques, especially through artificial intelligence3. Binder et al.4 used computerized image analysis and an artificial neural network to automatically diagnose pigmented skin lesions. The sensitivity and specificity of the computerized system were 90% and 74%, respectively.

Verma et al.5 classified erythemato-squamous diseases using an ensemble of five different data mining techniques, and the results showed that the proposed ensemble method made more efficient use of the dataset and gave higher accuracy than the individual data mining techniques.

Sharma et al.6 compared a Support Vector Machine and an Artificial Neural Network, along with an ensemble of the two techniques, for classification of erythemato-squamous diseases, and found that the ensemble model achieved the highest accuracy.

Moradi and Mahdavi-Amiri7 proposed a kernel sparse representation based method for segmentation and classification of melanoma images, and their evaluation results showed the approach to be competitive with the available state-of-the-art methods.

Yap et al.8 developed a multimodal classifier that outperformed a baseline classifier using only a single macroscopic image in both binary melanoma detection and multiclass classification.

Chang and Chen9 combined decision trees with neural network classification methods to construct the best predictive model for six major skin diseases, and found that the neural network model had the highest prediction accuracy.

The main work of these investigations is listed in Table 1. However, all of these investigations focused on improving diagnostic performance with the assistance of artificial intelligence techniques, and few studies have addressed the imperfections of the current classification system of dermatology and venereology. The International Classification of Diseases, Tenth Revision (ICD-10) is now used worldwide to keep disease diagnosis consistent, yet the literature on its shortcomings is scant. Recent studies have found deficiencies in the classification of allergic conditions by ICD-10 codes10,11, and a new revision, ICD-11, is currently being developed with the aim of solving these problems12.

Table 1 Investigations focusing on skin disease classification using artificial intelligence techniques.

With in-depth research on the pathogenesis of skin disease, knowledge of dermatology has improved, and the initial classifications of multiple diseases have proved inaccurate. For example, pyogenic granuloma sounds like an infectious disease but is actually a kind of hemangioma, the classification and nomenclature of vascular malformations have also changed13, and sebopsoriasis lacks a specific code14. Modern dermatology therefore urgently needs a more scientific classification. Esteva et al.15 developed a dermatologist-level system for skin cancer classification. Although the aim of that study was to test an artificial intelligence capable of classifying skin cancer, it provides a direction for re-classifying skin disease from different aspects.

Based on the above considerations, we conducted this study to develop a new taxonomy based on cytology and pathology, to test the diagnostic performance of the new taxonomy using the Deep Residual Learning method, and to compare it with the ICD-10 chapter on diseases of the skin and subcutaneous tissue, in order to find a new classification that benefits prediction and has potential application in clinical practice in dermatology and venereology.

Materials and methods

Figure 1 shows the overall structure of the methodology used in this research. The approach used in this paper is completely data driven.

Figure 1

Methodological approach for skin diseases.

Taxonomy

Taxonomy 1

ICD-10 Version: 2016—World Health Organization (http://apps.who.int/classifications/icd10/browse/2016/en).

Taxonomy 2

Taxonomy 2 represents 1,000 individual diseases arranged in a tree structure with three root nodes representing: (1) Keratinogenic diseases (KCs), (2) Melanogenic diseases (MCs), and (3) Diseases related to non-keratinocytes and non-melanocytes (Non-KC and non-MC). Taxonomy 2 was derived by dermatologists using a bottom-up procedure. Within the tree structure, individual diseases, initialized as leaf nodes, were merged based on organic or cellular similarity until the entire structure was connected. Taxonomy 2 contains 6 levels, and levels 1–3 are presented in Fig. 2. For each type of disease, a number indicates a different disease, and so on up to level 6.

Figure 2

The first three levels of Taxonomy 2.

The taxonomy is used in generating training classes that are both well-suited for machine learning classifiers and medically relevant. The root nodes are used in the first validation strategy and represent the source cell/organization of disease. The children of the root nodes (for example, malignant melanocytic lesions) are used in the second validation strategy, and represent disease classes that have similar clinical treatment plans.
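The tree described above can be illustrated with a minimal Python sketch. It assumes a simple parent/child node class with root nodes at level 1. The names `TaxonomyNode` and `ancestor_at_level`, the placeholder root and level-2 nodes, and the assumption that "Parasite" sits under "Infectious diseases" are hypothetical and are not taken from the authors' implementation.

```python
from typing import List, Optional


class TaxonomyNode:
    """One node of the disease tree; leaf nodes are individual diseases."""

    def __init__(self, name: str, parent: Optional["TaxonomyNode"] = None) -> None:
        self.name = name
        self.parent = parent
        self.children: List["TaxonomyNode"] = []
        if parent is not None:
            parent.children.append(self)

    @property
    def level(self) -> int:
        """Root nodes sit at level 1; each step down the tree adds one."""
        return 1 if self.parent is None else self.parent.level + 1


def ancestor_at_level(node: TaxonomyNode, level: int) -> TaxonomyNode:
    """Walk up from a node to its ancestor at the requested level."""
    if not 1 <= level <= node.level:
        raise ValueError("level must lie between 1 and the node's own level")
    while node.level > level:
        node = node.parent
    return node


# Hypothetical path: the root and level-2 names are placeholders, and the
# nesting of the level 3-5 names is an assumption made for illustration.
root = TaxonomyNode("root node (placeholder)")
group = TaxonomyNode("level-2 group (placeholder)", parent=root)
infectious = TaxonomyNode("Infectious diseases", parent=group)
parasite = TaxonomyNode("Parasite", parent=infectious)
leaf = TaxonomyNode("pediculosis pubis", parent=parasite)

# The same image can be labelled at a coarser level by walking up the tree.
level3_label = ancestor_at_level(leaf, 3).name
level4_label = ancestor_at_level(leaf, 4).name
```

Storing only the parent link and deriving the level on demand keeps the sketch short; a production taxonomy of 1,000 diseases would more likely cache the level of each node.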

Project settings

All images came from the following public databases: Atlas (http://www.atlasdermatologico.com.br/), Dermatoweb (http://www.dermatoweb.net/), dermnet (http://www.dermnet.com/), Dermnetnz (https://www.dermnetnz.org/), Emedicinehealth (https://www.emedicinehealth.com/), Globalskinatlas (http://www.globalskinatlas.com/), Meddean (http://www.meddean.luc.edu/), and Uiowa (https://medicine.uiowa.edu/). A total of 56,571 images were collected. The acquisition program generated a list of images with classification tags for each website, downloaded the corresponding images, and produced a picture library annotated with the classification tags.

Taxonomy 1 was defined as Project 1. Based on the resources of the image library, which needed to be balanced across the two taxonomies, 11 classes were finally selected for Project 1, including pemphigus, lichen planus, congenital ichthyosis, other dermatitis, pediculosis, scabies, herpes viral infections, unspecified viral infection, gonococcal infection, other sexually transmitted diseases, other congenital malformations of skin, and not elsewhere classified.

Level 3 from Taxonomy 2 is defined as Project 2, and contains a total of 2 classes: Inflammatory diseases; Infectious diseases. Level 4 from Taxonomy 2 is defined as Project 3, and contains a total of 4 classes: Virus, Parasite, Bacteria, Dermatitis. Level 5 from Taxonomy 2 is defined as Project 4, and contains a total of 11 categories: porokeratosis; herpes, simple genital; lichen planus; condilomas acuminados; ichthyosis; viral exanthems; pediculosis pubis; pemphigus; gonorrhea; eczema; sarna noruega.

Data processing instructions

According to Taxonomy 2, 1,847 images were initially extracted. The images were then screened to ensure that the two taxonomies contained the same images, and a total of 1,160 images were finally obtained.
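The screening step can be pictured as keeping only the images that carry a label in both taxonomies. The sketch below assumes that each taxonomy's annotations are available as a mapping from image ID to class name. The function name, image IDs, and labels are hypothetical, and the text does not describe the authors' actual screening criteria in more detail.

```python
from typing import Dict, Set


def shared_images(labels_a: Dict[str, str], labels_b: Dict[str, str]) -> Set[str]:
    """Return the IDs of images that are labelled in both taxonomies."""
    return set(labels_a) & set(labels_b)


# Toy annotations (hypothetical IDs and labels).
taxonomy1_labels = {
    "img_001": "scabies",
    "img_002": "pemphigus",
    "img_003": "lichen planus",
}
taxonomy2_labels = {
    "img_002": "pemphigus",
    "img_003": "lichen planus",
    "img_004": "eczema",
}

# Only images present under both taxonomies are kept for the comparison.
kept_ids = shared_images(taxonomy1_labels, taxonomy2_labels)
```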

Predictive model evaluation by residual neural network

After annotation of the images, our predictions on the two taxonomies were based on Deep Residual Learning for Image Recognition (ResNet), a deep learning method belonging to the family of convolutional neural networks (CNNs). For fair comparison, we adopted ResNet-50 pre-trained on ImageNet as the feature extraction network. Specifically, an SGD optimizer with momentum 0.9 and weight decay 5e-4 was adopted, and the initial learning rate was set to 1e-4. The batch size was set to 64 and the drop-out rate was 0.5.
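To make the role of the optimizer settings concrete, the following is a minimal, framework-free sketch of one SGD step with momentum and weight decay, using the hyperparameter values quoted above. It assumes one common formulation in which weight decay is added to the gradient and the velocity accumulates gradients before being scaled by the learning rate. The paper does not state which variant or framework was used, and the function `sgd_step` is hypothetical.

```python
from typing import List, Tuple

LEARNING_RATE = 1e-4  # initial learning rate from the text
MOMENTUM = 0.9        # momentum from the text
WEIGHT_DECAY = 5e-4   # weight decay from the text


def sgd_step(
    weights: List[float],
    grads: List[float],
    velocity: List[float],
    lr: float = LEARNING_RATE,
    momentum: float = MOMENTUM,
    weight_decay: float = WEIGHT_DECAY,
) -> Tuple[List[float], List[float]]:
    """Apply one SGD update and return the new weights and velocity."""
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        g = g + weight_decay * w  # L2 penalty folded into the gradient
        v = momentum * v + g      # running, momentum-weighted gradient
        new_velocity.append(v)
        new_weights.append(w - lr * v)
    return new_weights, new_velocity


# One step on a single toy parameter, starting from zero velocity.
weights, velocity = sgd_step([1.0], [2.0], [0.0])
```

With these values a single step moves a weight by only about lr times the gradient, which is in line with the small learning rates typically used when fine-tuning a pre-trained network.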

Identifying the images according to Taxonomy 1: Project_1 holds the specific information of each picture marked using the Taxonomy 1 classification system. entity_id is the unique ID of the picture. Code_1 gives the number of images in each category among the images marked with the Taxonomy 1 classification system. code_id is the unique ID of the category.

Identifying the images according to Taxonomy 2 (levels 3–5): Project 2, Project 3, and Project 4 hold the specific information of each picture marked at levels 3, 4, and 5 of the Taxonomy 2 system, respectively. entity_id is the unique ID of the picture. code_2 gives the number of images in each category among the images marked at the respective level of the Taxonomy 2 system. code_id is the unique ID of the category.
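The annotation tables described above can be sketched as a list of records plus a per-category count. The field names `entity_id` and `code_id` follow the text; the `Annotation` class, the `category_counts` helper, and all IDs and labels are hypothetical and serve only as an illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Annotation:
    entity_id: str  # unique ID of the picture
    code_id: str    # unique ID of the category assigned to the picture


def category_counts(annotations: Iterable[Annotation]) -> Counter:
    """Count the images in each category (the role played by Code_1 / code_2)."""
    return Counter(a.code_id for a in annotations)


# Toy project table with hypothetical image IDs.
project_4 = [
    Annotation("img_001", "eczema"),
    Annotation("img_002", "eczema"),
    Annotation("img_003", "pemphigus"),
]
counts = category_counts(project_4)
```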

For each project, 2/3 of the images were assigned to the training group and the remaining 1/3 to the test group, with the category (class) as the stratification variable.
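A stratified 2/3 to 1/3 split of this kind can be sketched as follows. The sketch assumes that the split is drawn at random within each class and that class sizes are rounded to the nearest image. The function `stratified_split`, the seed, and the toy labels are hypothetical, since the text does not describe the authors' splitting code.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def stratified_split(
    labels: Dict[str, str], train_fraction: float = 2 / 3, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Split image IDs into train and test lists, class by class."""
    by_class = defaultdict(list)
    for entity_id, cls in labels.items():
        by_class[cls].append(entity_id)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train_ids, test_ids = [], []
    for cls in sorted(by_class):
        ids = sorted(by_class[cls])
        rng.shuffle(ids)
        cut = round(len(ids) * train_fraction)  # images of this class for training
        train_ids.extend(ids[:cut])
        test_ids.extend(ids[cut:])
    return train_ids, test_ids


# Toy data: six images of one class and three of another.
labels = {f"img_{i:02d}": "scabies" for i in range(1, 7)}
labels.update({f"img_{i:02d}": "pemphigus" for i in range(7, 10)})
train_ids, test_ids = stratified_split(labels)
```

Splitting within each class keeps the class proportions the same in both groups, which matters when some diseases have far fewer images than others.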

The accuracy, Kappa coefficient, Precision, Recall, and F1-score were calculated and compared between the two taxonomies.

Formulas:

$$ \begin{aligned} & \text{Precision} = TP/\left( {TP + FP} \right) \\ & \text{Recall} = TP/\left( {TP + FN} \right) \\ & \text{F1-score} = 2 \times \text{Precision} \times \text{Recall}/\left( {\text{Precision} + \text{Recall}} \right) \\ \end{aligned} $$

For a given category, TP is the number of images that truly belong to the category and are correctly predicted as it, FP is the number of images from other categories that are falsely predicted as the category, and FN is the number of images that truly belong to the category but are not predicted as it.
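The metrics above can be computed directly from paired lists of true and predicted labels. The sketch below follows the formulas given for precision, recall, and F1-score. For the Kappa coefficient it assumes the standard Cohen's kappa, (p_o - p_e)/(1 - p_e), because the text does not give the formula. All function names and the toy labels are hypothetical.

```python
from collections import Counter
from typing import Sequence, Tuple


def per_class_metrics(
    y_true: Sequence[str], y_pred: Sequence[str], cls: str
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1-score) for one category."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of images whose predicted category matches the true one."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Agreement beyond chance (assumed here to be Cohen's kappa)."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)  # observed agreement
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    # Agreement expected by chance from the marginal class frequencies.
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / n**2
    if p_e == 1.0:
        return float("nan")  # kappa is undefined in this degenerate case
    return (p_o - p_e) / (1 - p_e)


# Toy labels for six images.
y_true = ["scabies", "scabies", "scabies", "pediculosis", "pediculosis", "pemphigus"]
y_pred = ["scabies", "scabies", "pediculosis", "pediculosis", "pediculosis", "pemphigus"]

precision, recall, f1 = per_class_metrics(y_true, y_pred, "scabies")
overall_accuracy = accuracy(y_true, y_pred)
kappa = cohen_kappa(y_true, y_pred)
```

In this toy example one scabies image is mistaken for pediculosis, so scabies keeps perfect precision but loses recall, and the F1-score falls between the two.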

Results

The overall comparison on predicted results between projects

Table 2 shows the comparison of the predicted results of the projects by different categories. Only Project 4 has a higher accuracy in the prediction of skin disease.

Table 2 Comparison of the identified results of projects by different categories.

Except for the test group in Project 3, all of the train and test groups in Projects 2, 3, and 4 from Taxonomy 2 have a higher precision in the prediction of skin disease than the corresponding groups in Project 1 from Taxonomy 1, although none of the differences is significant. For the recall rate, both the train and test groups in Projects 2, 3, and 4 from Taxonomy 2 are better than the corresponding groups in Project 1 from Taxonomy 1, although only the test group in Project 4 has a statistically significantly higher recall rate than the test group in Project 1 (P = 0.016).

For the F1-score, both train and test groups in the Projects (2, 3, and 4) from Taxonomy 2 are better than the corresponding groups in Project 1 from Taxonomy 1, and both the train and test groups in Project 4 have a statistically significantly higher F1-score than the corresponding groups in Project 1 (P = 0.025 and 0.005, respectively).

All of the train and test groups in the Projects (2, 3, and 4) from Taxonomy 2 have a higher Kappa value on prediction of skin disease than the corresponding groups in the Project 1 from Taxonomy 1.

Comparisons among classes in Projects

The results showed that all of the parameters, including sensitivity (recall), specificity, positive predictive value (PPV, precision), negative predictive value (NPV), and F1, for the 11 diseases are better in the train group than in the test group in Project 1 (Table 3). The F1 for some diseases, especially gonococcal infection and herpes viral infections, is much lower in the test group than in the train group.

Table 3 Effects of AI prediction between train and test groups among various diseases in Project 1 (%).

In contrast, all of the parameters, including sensitivity (recall), specificity, PPV (precision), NPV, and F1, for the diseases in the train groups are similar to those in the test groups at the different classification levels in Projects 2–4 of Taxonomy 2 (Project 2/Level 3, Table 4; Project 3/Level 4, Table 5; Project 4/Level 5, Table 6).

Table 4 Effects of AI prediction between train and test groups among various diseases in Project 2 (%).
Table 5 Effects of AI prediction between train and test groups among various diseases in Project 3 (%).
Table 6 Effects of AI prediction between train and test groups among various diseases in Project 4 (%).
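The per-class parameters reported in Tables 3–6 can be obtained by treating each disease in turn as the positive class and all other diseases as negative. The following is a minimal sketch under that one-vs-rest assumption; the function names and toy labels are hypothetical, and the paper does not state how specificity and NPV were computed.

```python
from typing import Dict, Sequence


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is 0."""
    return numerator / denominator if denominator else 0.0


def one_vs_rest_rates(
    y_true: Sequence[str], y_pred: Sequence[str], cls: str
) -> Dict[str, float]:
    """Sensitivity, specificity, PPV, and NPV for one disease versus the rest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    tn = len(pairs) - tp - fp - fn  # everything else is a true negative
    return {
        "sensitivity": _ratio(tp, tp + fn),  # same quantity as recall
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),          # same quantity as precision
        "npv": _ratio(tn, tn + fn),
    }


# Toy labels for six images.
true_labels = ["scabies", "scabies", "scabies", "pediculosis", "pediculosis", "pemphigus"]
pred_labels = ["scabies", "scabies", "pediculosis", "pediculosis", "pediculosis", "pemphigus"]
rates = one_vs_rest_rates(true_labels, pred_labels, "pediculosis")
```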

Discussion

Descriptive dermatology of the morphological phenomena of skin has been developed for more than two thousand years16. Briefly, our ancestors separated skin disorders according to their location, their appearance or, more interestingly, their suspected cause. Consequently, the textbooks that have shaped our education have sometimes adopted very different ways of presenting and classifying skin diseases17. Classification by similarities became more and more difficult as the complexity of disease was realized18. A new classification that may help diagnosis, disease management, and discipline development is urgently needed.

This study developed a new taxonomy (Taxonomy 2) of most skin diseases, containing 6 levels (of which levels 3–5 form Projects 2–4) and based on cytology and pathology. This is completely new work in dermatology and venereology compared with previous work, which focused on classification of one type or several types of skin disease by AI techniques4,5,6,7,8,9.

In order to investigate the predictive effect of the new taxonomy on skin disease, we further compared the accuracy, precision, recall, F1, and Kappa of the new taxonomy with those of the ICD-10 using the Deep Residual Learning method. Precision, recall, and F1-score are commonly used to evaluate the predictive effect of models/projects in multi-class prediction. Precision is the number of samples correctly predicted as a category divided by the number of all samples predicted as that category; it measures the proportion of correct discriminations among all predictions of the category and corresponds to the positive predictive value. Recall measures the proportion of samples truly belonging to a category that are correctly identified and corresponds to sensitivity. The two often trade off against each other, and the F1-score, their harmonic mean, is used to balance them. Deep CNNs have potentially wide application in the diagnosis of skin diseases, with higher accuracy reported compared with human dermatologists19,20, which is why we applied them to predict diseases based on the different taxonomies and, at the same time, to avoid the variability of human judgment.

Our results indicated that the new taxonomy performed better overall, and the final level of classification had a significantly higher F1-score than the ICD-10 taxonomy. This suggests that it may generalize better to unseen data and may provide a better taxonomy system for skin disease prediction with the assistance of AI techniques in the future.

The literature on the shortcomings of the ICD-10 is sparse. A compatible version of the ICD-10 specifically adapted to dermatology was produced in Spain in 1999 to overcome these shortcomings. González-López et al.21 confirmed that the ICD-10 system does have some minor shortcomings when it comes to coding certain diseases, particularly newly discovered and emerging diseases. A classification of hypersensitivity/allergic diseases was constructed and validated for ICD-11 by crowdsourcing the allergist community11, because of the well-known misclassification and/or under-notification of these diseases in the ICD, which has a direct and hugely detrimental impact on hypersensitivity/allergic disease data22. However, a reclassification of the whole disciplinary system of dermatology has not yet been attempted, so we attempted to construct a new taxonomy in this study. The results of the current study indicated that Taxonomy 2 has an advantage over ICD-10 in skin disease prediction, which may have potential application value in future clinical practice in dermatology and venereology.

The current study has the following limitations: (1) AI is the only detection technology used for comparison, but it is not the gold standard for prediction, so it carries systematic error that may affect the comparison result. (2) The dermatological data did not include histopathological images, which may affect classification accuracy. (3) The train and test groups of Project 1 differ on all three parameters, and those of Project 3 and Project 4 differ on precision and F1-score, respectively. Our purpose in dividing the images into 2 groups was to guard against model overfitting, in which a model performs well on the training group but may predict very poorly when applied to other data. We used 2/3 of the data to build the model and adjust the parameters; however, the difference between the train and test groups indicates low credibility of the results. The images of the different types of diseases are also not balanced, which may result from the insufficient quality of images of skin diseases, especially for some types.

Conclusion and future work

In conclusion, this study is an attempt at a precise and effective classification in dermatology for the development of the discipline. The new taxonomy we developed, based on cytology and pathology, is an innovation and a challenge to the current ICD-10 classification of dermatology, and has been shown to have an overall better predictive performance, including sensitivity (recall), specificity, PPV (precision), NPV, and F1, compared with ICD-10. The new taxonomy has potential application value in clinical practice using AI techniques for skin disease prediction. However, a more comprehensive system covering more skin diseases and including different data, such as dermoscopic and histopathological images, is necessary to further confirm the stability of the taxonomy.