Background & Summary

More than 6% of the world’s population is affected by rare genetic diseases1, and the latest Orphanet2 and OMIM3 databases show that there are currently at least 7000 rare genetic diseases. Among these, many rare genetic diseases have recognizable facial phenotypic features, and facial phenotypes are often used as a basis for diagnosis4,5,6,7. In recent studies, the diagnosis of rare genetic diseases through computer vision techniques has reached the level of clinical experts8,9,10,11,12,13,14.

DeepGestalt9, trained on a private dataset of over 17,000 images, can identify the correct disease in a test set of 502 images with an impressive top-10 accuracy of 91%. However, like many AI models, DeepGestalt cannot explicitly explain its predictions or indicate which specific facial features contributed to a diagnosis. AI models rely heavily on data, and the GestaltMatcher Database (GMDB)15 is the only publicly available dataset in the field, containing 10,189 frontal images of 7,695 patients with 683 diseases. Despite its utility, models trained on GMDB share the same interpretability shortcomings as DeepGestalt. In the medical field, where explainability is paramount16,17,18, the lack of transparent and explainable datasets poses a significant barrier.

Compared to the GMDB image format, the tabular format is naturally more explainable. In a tabular dataset with clear meanings and units, the relationships between data fields are more direct and easier for humans to understand19. For instance, when we see “age” and “disease” in a table, we can intuitively understand the correlation between them. Moreover, a tabular dataset consists of numerical values and categories that humans can read directly, whereas GMDB image data consists of pixel values without direct semantic information. Furthermore, logical rules can be found in tabular data, such as “c.6726_6730del; p.Leu2243Serfs*8 in exon 20 causes Coffin-Siris syndrome 1”, whereas for image data, what a model learns tends to be high-level feature combinations.

Therefore, we propose a new tabular dataset, FGDD, which contains 1147 data records, 197 associated genes, 437 associated phenotypes, and 211 associated diseases; 689 of the records have disease labels. FGDD was constructed by retrieving publications with search terms generated from the Human Phenotype Ontology (HPO20) and then identifying facial phenotype-gene-disease associations in those publications.

The FGDD is primarily used for facial phenotype analysis of rare genetic diseases. It serves multiple purposes, including training explainable diagnostic models, conducting in-depth analysis of the complex relationships between genes, diseases, and facial phenotypes, and uncovering additional potential associations and patterns.

Our contributions can be summarized as follows:

  • We propose a new dataset, FGDD, for facial phenotype analysis of rare genetic diseases, which can be used not only for training explainable diagnostic models but also for in-depth analysis of the complex gene-disease-facial phenotype relationships and for mining more potential associations and patterns.

  • We have conducted extensive benchmarking on our dataset, and commonly used algorithms can achieve up to 80.19% accuracy and provide clinical support.

  • We performed feature importance evaluation from both coarse- and fine-grained perspectives to explain relative contributions across feature categories and individual features within each category.

An overview of this study is shown in Fig. 1. The FGDD dataset is available at figshare21. All code is publicly available at https://github.com/zhelishisongjie/FGDD and can be audited, copied, and reused.

Fig. 1

The overall process of this study.

Methods

Data retrieval

We used a systematic search strategy to build the publications list as follows: first, we built base search terms from the concept of facial phenotype and its synonyms in the Human Phenotype Ontology (HPO20); we then logically combined them with terms related to genetic variation (including “genetic”, “variation”, and “deletion”), generating 595 composite search terms. The search cutoff date was 1 April 2023.

An initial search of the PubMed database yielded 26,814 publications, whose details were obtained through the Entrez programming utilities22; 11,304 publications were retained after de-duplication. A stratified screening process followed: reviews, commentaries, book chapters, and animal experiments were excluded (n = 6,045), and the remaining 5,259 publications entered full-text assessment, where those with insufficient relevance or unclear conclusions were excluded (n = 4,750), leaving 509 high-quality publications for data extraction. The complete search strategy (including the search-term construction code, the search terms list, and the final included publications list) has been made publicly available in our figshare21 and GitHub repositories to ensure the reproducibility of the research.
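As an illustration, the composite-term construction described above can be sketched as follows; the term lists here are small hypothetical subsets of the actual HPO synonym and variation-term lists released on figshare/GitHub, and the query format is only one plausible PubMed boolean style:

```python
from itertools import product

# Hypothetical subsets of the HPO facial-phenotype synonyms and the
# genetic-variation terms; the full lists used in the study are in the
# figshare/GitHub repositories.
facial_terms = ["facial dysmorphism", "hypertelorism", "short philtrum"]
variant_terms = ["genetic", "variation", "deletion"]

def build_queries(facial, variants):
    """Combine each facial-phenotype term with each variation term
    into a PubMed-style boolean query string."""
    return [f'("{f}") AND ("{v}")' for f, v in product(facial, variants)]

queries = build_queries(facial_terms, variant_terms)
print(len(queries))  # 3 x 3 = 9 composite terms in this toy example
```

With the full HPO synonym list and variation terms, the same cross-product yields the 595 composite terms used in the actual search.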

Data extraction

In the data extraction phase, four types of information were extracted from each publication: demographic information, phenotype information, variation information, and disease information. We constructed a hybrid automated-manual workflow: first, basic information extraction was performed with PhenoTagger23 (a human phenotype entity recognition tool) and PubTator24 (a biomedical entity recognition tool), where PhenoTagger extracts standardized phenotype information and PubTator captures variation and disease information. However, existing automated tools have significant limitations: 1) although PhenoTagger can recognize phenotypic entities in text, it cannot determine which patient a phenotype belongs to; 2) PubTator lacks standardized support for variants described in HGVS nomenclature (e.g., c.898C>T; p.Arg300Cys); 3) demographic information (e.g., age, gender, ethnicity) requires manual parsing because of its highly heterogeneous representation. Complete extraction of these three types of information therefore relies on the researchers’ in-depth reading of the text.
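To illustrate the kind of parsing HGVS-style variants require, the following simplified regular expression matches common coding- and protein-level descriptions. This is a hedged sketch, not part of the released pipeline, and it covers only a small fraction of the real HGVS grammar:

```python
import re

# Simplified pattern for c. (coding) and p. (protein) HGVS-style
# descriptions; the full HGVS nomenclature is far richer than this.
HGVS_PATTERN = re.compile(
    r"\b(?:c\.[0-9_+\-*]+(?:[ACGT]>[ACGT]|del|dup|ins[ACGT]*)?"
    r"|p\.[A-Z][a-z]{2}\d+(?:[A-Z][a-z]{2}|fs\*?\d*|\*)?)"
)

def find_variants(text):
    """Return HGVS-like variant strings found in free text."""
    return HGVS_PATTERN.findall(text)

sentence = "The patient carried c.898C>T; p.Arg300Cys in exon 9."
print(find_variants(sentence))  # ['c.898C>T', 'p.Arg300Cys']
```

Complex descriptions such as frameshifts with extensions still need manual review, which is why the study relies on reading the full text rather than pattern matching alone.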

Finally, we manually integrated the raw data extracted from each publication into the current FGDD format. To ensure data quality, the publications were read again for data checking to confirm that there were no correspondence mismatches or data-entry errors (after M.H. completed the raw data collection, data checking was performed by S.J.). The overall process of data collection is shown in Fig. 2.

Fig. 2

The process of data collection.

Data Records

FGDD, available for download on figshare21, provides tabular data in .csv format and a knowledge graph in .dump format. The dataset is released under a Creative Commons Attribution 4.0 International (CC BY 4.0) license. The figshare repository is organized as shown in Fig. 3. The “Search terms” directory stores the data generated during the search and screening process to ensure the reproducibility of the study. The “Raw data” directory stores data extracted from the research publications, covering facial phenotypes, genes, disease information, and their associations with patients; only data with clear patient associations are used for subsequent integration (some genes and facial phenotypes are only mentioned in the publications and are not directly associated with patients). The “FGDD” directory stores the standardized dataset, and its knowledge graph format, produced by the final systematic integration of the raw data. Table 1 provides an overview of the files stored in figshare21.
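The .csv files can be loaded with standard tools. The miniature CSV below uses illustrative column names (the real schema is documented in the figshare files), and shows how the labeled subset (689 of the 1147 records in the full dataset) can be separated:

```python
import io

import pandas as pd

# Hypothetical miniature of the FGDD .csv layout; column names are
# illustrative, not the released schema.
csv_text = """patient_id,age,gender,gene,variant,facial_phenotype,disease
P1,12,Male,ARID1B,c.6726_6730del,Hypertelorism,Coffin-Siris syndrome 1
P2,8,Female,ARID1A,c.898C>T,Short philtrum,
"""

df = pd.read_csv(io.StringIO(csv_text))
# Keep only records that carry a disease label (the missing disease
# field of P2 is read as NaN).
labeled = df[df["disease"].notna()]
print(len(df), len(labeled))  # 2 1
```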

Fig. 3

Organization of the dataset.

Table 1 Overview of data files.

Technical Validation

Overview of facial phenotype-gene-disease data in FGDD

The data distribution is illustrated in Fig. 4, providing a clear visual representation of the dataset’s key characteristics. Table 2 presents the data completeness analysis for each feature dimension. Based on their attributes, the variables in this study were classified into four categories: demographic features, facial phenotype features, variation features, and disease features. Notably, the disease-associated feature group had the highest missing rate (39.93%). This is mainly because most genetics articles focus on the functional association of facial dysmorphisms with specific genetic variants, emphasizing molecular mechanisms rather than associations with clinical diseases.

Fig. 4

(a) Patients’ regional distribution on a world map; Asia has the highest number of patients. (b) Patients’ regional distribution bar chart, quantifying differences in patient numbers by region. (c) Gene distribution; Chr is an abbreviation for chromosome. (d) Age and gender distribution: age distribution (upper panel), with the majority of patients aged 10 years or older; gender distribution (lower panel), where Male&Female indicates that a group contains two or more individuals of different genders. (e) Disease distribution, a bar chart showing the distribution of diseases. (f) Facial phenotype distribution; the most common facial phenotype is hypertelorism.

Table 2 FGDD feature missing rates.

To better visualize the facial phenotype-gene-disease relationships, we also present FGDD as a knowledge graph; the schema is shown in Fig. 5(a), and a visualization example is shown in Fig. 5(b). A Python script and tutorial for transforming the FGDD table into the knowledge graph are available at https://github.com/zhelishisongjie/FGDD.
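The released script targets Neo4j (hence the .dump format); as a minimal library-agnostic sketch of the same four-node-type schema, the following uses networkx with hypothetical relationship names (the real relationship labels are defined in the repository):

```python
import networkx as nx

# One toy record; in practice each FGDD row contributes its own nodes
# and edges. Node and relationship names here are illustrative.
rows = [
    {"patient": "P1", "gene": "ARID1B", "phenotype": "Hypertelorism",
     "disease": "Coffin-Siris syndrome 1"},
]

g = nx.MultiDiGraph()
for r in rows:
    # Four node types: Patient, Gene, Phenotype, Disease.
    g.add_node(r["patient"], type="Patient")
    g.add_node(r["gene"], type="Gene")
    g.add_node(r["phenotype"], type="Phenotype")
    g.add_node(r["disease"], type="Disease")
    # Four relationship types (hypothetical labels).
    g.add_edge(r["patient"], r["gene"], rel="HAS_VARIANT_IN")
    g.add_edge(r["patient"], r["phenotype"], rel="HAS_PHENOTYPE")
    g.add_edge(r["patient"], r["disease"], rel="DIAGNOSED_WITH")
    g.add_edge(r["gene"], r["disease"], rel="ASSOCIATED_WITH")

print(g.number_of_nodes(), g.number_of_edges())  # 4 4
```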

Fig. 5

(a) Knowledge graph schema, which includes four types of nodes and four types of relationships. (b) A knowledge graph visualization example, displayed through the yWorks Neo4j explorer.

Common algorithms performance validation

We evaluated the performance of common classification algorithms, splitting the data into training and test sets at a 7:3 ratio. All algorithms use their standard configurations and can be easily reproduced with our code. The results are shown in Table 3. TabNet, NODE, TabTransformer, and FTTransformer are popular deep learning algorithms for tabular data; NODE achieves the highest top-1 accuracy (80.19%), and FTTransformer achieves the highest Macro-F1 score (0.59).

Table 3 Performance of different classification algorithms on FGDD.
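A minimal baseline in this spirit can be sketched as follows. This is not one of the benchmarked models; it trains a random forest on synthetic stand-in data with the paper’s 7:3 split (with real FGDD data, X would hold the encoded demographic, variation, and phenotype columns and y the disease label of the 689 labeled records):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded FGDD features.
X, y = make_classification(n_samples=689, n_features=30, n_informative=10,
                           n_classes=5, n_clusters_per_class=1,
                           random_state=0)

# The paper's 7:3 train/test split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(X_te))
print(round(acc, 4))
```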

Macro-F1 is a commonly used evaluation metric in multi-class classification tasks that reflects a model’s balanced performance across all categories. Macro-F1 is defined in equation (1), where TPi, FPi, and FNi denote the true positives, false positives, and false negatives for disease i, and N is the number of diseases.

$$\text{Macro-F1}=\frac{1}{N}\sum_{i=1}^{N}\frac{2\cdot TP_{i}}{2\cdot TP_{i}+FP_{i}+FN_{i}}$$
(1)
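Equation (1) can be checked against scikit-learn’s implementation on a toy example:

```python
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

def macro_f1(y_true, y_pred, n_classes):
    """Macro-F1 as in equation (1): mean of per-class F1 scores."""
    scores = []
    for i in range(n_classes):
        tp = np.sum((y_pred == i) & (y_true == i))
        fp = np.sum((y_pred == i) & (y_true != i))
        fn = np.sum((y_pred != i) & (y_true == i))
        scores.append(2 * tp / (2 * tp + fp + fn))
    return float(np.mean(scores))

print(np.isclose(macro_f1(y_true, y_pred, 3),
                 f1_score(y_true, y_pred, average="macro")))  # True
```

Note that the per-class loop assumes each class appears in `y_true` or `y_pred`; a class with no occurrences at all would make the denominator zero.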

Explainability validation

Explainability validation is concerned with understanding the overall logic of the model, attempting to explain what the model has learned. Here we focus on feature importance analysis from both coarse-grained and fine-grained perspectives.

Coarse-grained features: We classify features into three main categories: patient, variation, and phenotype. The coarse-grained analysis considers the overall importance of these three categories, aiming to understand each category’s contribution to the model’s decision-making at a macro level. Notably, patient metadata demonstrates limited influence, as shown in Fig. 6.

Fig. 6

Coarse-grained feature importance. This suggests that disease pathogenesis is primarily driven by genetic rather than environmental or lifestyle factors. Genetic variants and facial phenotypes are central to diagnosis; integrating genomic analysis with phenotypic evaluation (e.g., distinct facial features) enhances diagnostic precision.
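Aggregating per-feature importances into the three coarse categories can be sketched as follows; the feature names, importance values, and category mapping here are hypothetical:

```python
# Hypothetical per-feature importances (e.g., from a tree ensemble) and
# a feature-to-category map mirroring the patient/variation/phenotype
# split used in the coarse-grained analysis.
importances = {"age": 0.03, "region": 0.02,
               "gene": 0.30, "exon": 0.25,
               "hypertelorism": 0.25, "short_philtrum": 0.15}
category = {"age": "patient", "region": "patient",
            "gene": "variation", "exon": "variation",
            "hypertelorism": "phenotype", "short_philtrum": "phenotype"}

# Sum fine-grained importances within each coarse category.
coarse = {}
for feat, imp in importances.items():
    coarse[category[feat]] = coarse.get(category[feat], 0.0) + imp

# In this toy example the variation category dominates and patient
# metadata contributes little, mirroring the pattern in Fig. 6.
print(coarse)
```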

Fine-grained features: Unlike the coarse-grained analysis, the fine-grained analysis examines the patient, variation, and phenotype categories individually, aiming to reveal the importance of specific features within each category at the micro level. Genetic variation features emerge as the predominant diagnostic determinants, as shown in Fig. 7.

Fig. 7

Fine-grained feature importance. Key genomic features (e.g., exon count, chromosomal location) and distinctive facial phenotypes are critical diagnostic parameters. This underscores the necessity of integrating genomic profiling with phenotypic evaluation for precise diagnosis. Furthermore, patient-specific factors including ethnicity and geographic origin must be considered, as certain diseases exhibit population-stratified prevalence patterns requiring individualized diagnostic frameworks.

Limitations and future work

Mechanistic interpretability

The current dataset also lacks data on the biological mechanisms that underlie diagnostic decisions at a deeper level. Future datasets should include proteins, complexes, pathways, and biological processes to deepen biological insight. For example, in Coffin-Siris syndrome, the most common genetic cause is a mutation in the ARID1B gene25,26.

ARID1B encodes a subunit of the Brg1/Brm-associated factor (BAF) complex (a core component of the SWI/SNF chromatin remodeling complex), regulating gene expression in cell differentiation, neural development, and DNA repair27. Pathogenic mutations block the ARID1A-to-ARID1B subunit switch in BAF, causing sustained activation of pluripotency genes (NANOG/SOX2)28. This disrupts cranial neural crest cell (CNCC) differentiation/migration and neuroectodermal maturation, leading to craniofacial anomalies (short philtrum, thick eyebrows, abnormal lips) and intellectual disability. Pathogenic mutations in SWI/SNF subunits (e.g., ARID1B) are thus central to Coffin-Siris syndrome29,30,31. Systematic incorporation of such molecular mechanisms will bridge the gap between genetic variants, diseases, and phenotypes.

Generalizability to different populations

There is a notable regional bias in the collected data, with certain racial or ethnic groups overrepresented and others underrepresented. This imbalance poses significant challenges to the model’s generalizability across diverse populations. Several strategies can address this issue: first, increasing the size of the dataset while ensuring a balanced distribution of patients across populations; second, using data augmentation/generation techniques32,33 to generate diverse data and developing weighted learning algorithms34 to reweight the imbalanced populations. In addition, multimodal learning can be used in conjunction with other image datasets to combine demographic, phenotypic, and genetic variation information with visual information to enhance generalization. For instance, a multimodal knowledge graph35 can link image features to diseases, phenotypes, and variations, enabling effective fusion through graph neural networks for tasks such as diagnosis and phenotype prediction.
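The reweighting idea can be sketched with scikit-learn’s balanced class weights; the toy labels below carry a 9:1 imbalance standing in for an overrepresented and an underrepresented population:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels with a 9:1 imbalance. "balanced" weights are
# n_samples / (n_classes * class_count), so the rare class is
# upweighted and the common class downweighted.
y = np.array([0] * 90 + [1] * 10)
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))  # rare class gets weight 5.0
```

These weights can then be passed to most classifiers (e.g., via a `class_weight` parameter or per-sample weights) so that loss contributions from underrepresented populations are amplified during training.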