Background & Summary

More than 6% of the world’s population is affected by rare genetic diseases1, and the latest Orphanet2 and OMIM3 databases show that there are currently at least 7000 rare genetic diseases. Among these, many rare genetic diseases have recognizable facial phenotypic features, and facial phenotypes are often used as a basis for diagnosis4,5,6,7. In recent studies, the diagnosis of rare genetic diseases through computer vision techniques has reached the level of clinical experts8,9,10,11,12,13,14.

DeepGestalt9, trained on a private dataset of over 17,000 images, can identify the correct disease in a test set of 502 images with an impressive top-10 accuracy of 91%. However, like many AI models, DeepGestalt cannot explicitly explain its predictions or indicate which specific facial features contributed to a diagnosis. AI models rely heavily on data, and the GestaltMatcher Database (GMDB)15 is the only publicly available dataset in the field, containing 10,189 frontal images of 7,695 patients with 683 diseases. Despite its utility, models trained on GMDB share the same interpretability shortcomings as DeepGestalt. In the medical field, where explainability is paramount16,17,18, the lack of transparent and explainable datasets poses a significant barrier.

Compared to the GMDB image format, the tabular format is naturally more explainable. In a tabular dataset with clear meanings and units, the relationships between data fields are more direct and easier for humans to understand19. For instance, when we see “age” and “disease” in a table, we can intuitively understand the correlation between them. Moreover, a tabular dataset consists of numerical values and categories that humans can read directly, whereas GMDB image data consists of pixel values without direct semantic information. Furthermore, logical rules can be found in tabular data, such as “c.6726_6730del; p.Leu2243Serfs*8 in exon 20 causes Coffin-Siris syndrome 1”, whereas for image data, what a model learns tends to be high-level feature combinations.

Therefore, we propose a new tabular dataset, FGDD, which contains 1147 data records, 197 associated genes, 437 associated phenotypes, and 211 associated diseases; 689 of the records have disease labels. FGDD was constructed by retrieving publications with search terms generated from the Human Phenotype Ontology (HPO20) and then identifying facial phenotype-gene-disease associations in those publications.

The FGDD is primarily used for facial phenotype analysis of rare genetic diseases. It serves multiple purposes, including training explainable diagnostic models, conducting in-depth analysis of the complex relationships between genes, diseases, and facial phenotypes, and uncovering additional potential associations and patterns.

Our contributions can be summarized as follows:

  • We propose a new dataset, FGDD, for facial phenotype analysis of rare genetic diseases, which can be used not only for training explainable diagnostic models but also for in-depth analysis of the complex gene-disease-facial phenotype relationships and for mining more potential associations and patterns.

  • We have conducted extensive benchmarking on our dataset, and commonly used algorithms can achieve up to 80.19% accuracy and provide clinical support.

  • We performed feature importance evaluation from both coarse- and fine-grained perspectives to explain relative contributions across feature categories and individual features within each category.

An overview of this study is shown in Fig. 1. The FGDD dataset is available at figshare21. All code is publicly available at https://github.com/zhelishisongjie/FGDD and can be audited, copied, and reused.

Fig. 1

The overall process of this study.

Methods

Data retrieval

We used a systematic search strategy to build the publications list as follows: first, we built base search terms from the concept of facial phenotype and its synonyms in the Human Phenotype Ontology (HPO20); we then logically combined them with terms related to genetic variation (including “genetic”, “variation”, and “deletion”), generating 595 composite search terms. The search cutoff date was 1 April 2023.

An initial search of the PubMed database yielded 26,814 publications, whose details were obtained through the Entrez programming utilities22; 11,304 publications were retained after de-duplication. A stratified screening process followed: reviews, commentaries, book chapters, and animal experiments were excluded (n = 6,045), and the remaining 5,259 publications entered full-text assessment, where those with insufficient relevance or unclear conclusions were excluded (n = 4,750), leaving 509 high-quality publications for data extraction. The complete search strategy (including the search-term construction code, the search terms list, and the final included publications list) has been made publicly available in our figshare21 and GitHub repositories to ensure the reproducibility of the research.
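As an illustration, the composite-term construction described above can be sketched as follows; the term lists here are small hypothetical subsets of the actual HPO synonym and variation-term lists released on figshare/GitHub, and the query format is only one plausible PubMed boolean style:

```python
from itertools import product

# Hypothetical subsets of the HPO facial-phenotype synonyms and the
# genetic-variation terms; the full lists used in the study are in the
# figshare/GitHub repositories.
facial_terms = ["facial dysmorphism", "hypertelorism", "short philtrum"]
variant_terms = ["genetic", "variation", "deletion"]

def build_queries(facial, variants):
    """Combine each facial-phenotype term with each variation term
    into a PubMed-style boolean query string."""
    return [f'("{f}") AND ("{v}")' for f, v in product(facial, variants)]

queries = build_queries(facial_terms, variant_terms)
print(len(queries))  # 3 x 3 = 9 composite terms in this toy example
```

With the full HPO synonym list and variation terms, the same cross-product yields the 595 composite terms used in the actual search.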

Data extraction

In the data extraction phase, four types of information were extracted from each publication: demographic information, phenotype information, variation information, and disease information. We constructed a hybrid automated-manual workflow: first, basic information extraction was performed with PhenoTagger23 (a human phenotype entity recognition tool) and PubTator24 (a biomedical entity recognition tool), where PhenoTagger extracts standardized phenotype information and PubTator captures variation and disease information. However, existing automated tools have significant limitations: 1) although PhenoTagger can recognize phenotypic entities in text, it cannot determine which patient a phenotype belongs to; 2) PubTator lacks standardized support for variants described in HGVS nomenclature (e.g., c.898C>T; p.Arg300Cys); 3) demographic information (e.g., age, gender, ethnicity) requires manual parsing because of its highly heterogeneous representation. Complete extraction of these three types of information therefore relies on the researchers’ in-depth reading of the text.
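To illustrate the kind of parsing HGVS-style variants require, the following simplified regular expression matches common coding- and protein-level descriptions. This is a hedged sketch, not part of the released pipeline, and it covers only a small fraction of the real HGVS grammar:

```python
import re

# Simplified pattern for c. (coding) and p. (protein) HGVS-style
# descriptions; the full HGVS nomenclature is far richer than this.
HGVS_PATTERN = re.compile(
    r"\b(?:c\.[0-9_+\-*]+(?:[ACGT]>[ACGT]|del|dup|ins[ACGT]*)?"
    r"|p\.[A-Z][a-z]{2}\d+(?:[A-Z][a-z]{2}|fs\*?\d*|\*)?)"
)

def find_variants(text):
    """Return HGVS-like variant strings found in free text."""
    return HGVS_PATTERN.findall(text)

sentence = "The patient carried c.898C>T; p.Arg300Cys in exon 9."
print(find_variants(sentence))  # ['c.898C>T', 'p.Arg300Cys']
```

Complex descriptions such as frameshifts with extensions still need manual review, which is why the study relies on reading the full text rather than pattern matching alone.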

Finally, we manually integrated the raw data extracted from each publication into the current FGDD format. To ensure data quality, the publications were read again for data checking to confirm that there were no correspondence mismatches or data-entry errors (after M.H. completed the raw data collection, data checking was performed by S.J.). The overall process of data collection is shown in Fig. 2.

Fig. 2

The process of data collection.

Data Records

FGDD, available for download on figshare21, provides tabular data in .csv format and a knowledge graph in .dump format. The dataset is released under a Creative Commons Attribution 4.0 International (CC BY 4.0) license. The figshare repository is organized as shown in Fig. 3. The “Search terms” directory stores the data generated during the search and screening process to ensure the reproducibility of the study. The “Raw data” directory stores data extracted from the research publications, covering facial phenotypes, genes, disease information, and their associations with patients; only data with clear patient associations are used for subsequent integration (some genes and facial phenotypes are only mentioned in the publications and are not directly associated with patients). The “FGDD” directory stores the standardized dataset, and its knowledge graph format, produced by the final systematic integration of the raw data. Table 1 provides an overview of the files stored in figshare21.
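The .csv files can be loaded with standard tools. The miniature CSV below uses illustrative column names (the real schema is documented in the figshare files), and shows how the labeled subset (689 of the 1147 records in the full dataset) can be separated:

```python
import io

import pandas as pd

# Hypothetical miniature of the FGDD .csv layout; column names are
# illustrative, not the released schema.
csv_text = """patient_id,age,gender,gene,variant,facial_phenotype,disease
P1,12,Male,ARID1B,c.6726_6730del,Hypertelorism,Coffin-Siris syndrome 1
P2,8,Female,ARID1A,c.898C>T,Short philtrum,
"""

df = pd.read_csv(io.StringIO(csv_text))
# Keep only records that carry a disease label (the missing disease
# field of P2 is read as NaN).
labeled = df[df["disease"].notna()]
print(len(df), len(labeled))  # 2 1
```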

Fig. 3

Organization of the dataset.

Table 1 Overview of data files.

Technical Validation

Overview of facial phenotype-gene-disease data in FGDD

The data distribution is illustrated in Fig. 4, providing a clear visual representation of the dataset’s key characteristics. Table 2 presents the data completeness analysis for each feature dimension. Based on their attributes, the variables in this study were classified into four categories: demographic features, facial phenotype features, variation features, and disease features. Notably, the disease-associated feature group had the highest missing rate (39.93%). This is mainly because most genetics articles focus on the functional association of facial dysmorphisms with specific genetic variants, emphasizing molecular mechanisms rather than associations with clinical diseases.

Fig. 4

(a) Patients’ regional distribution on a world map; Asia has the highest number of patients. (b) Patients’ regional distribution bar chart, quantifying differences in patient numbers by region. (c) Gene distribution; Chr is an abbreviation for chromosome. (d) Age and gender distribution: age distribution (upper panel), with the majority of patients aged 10 years or older; gender distribution (lower panel), where Male&Female indicates that a group contains two or more individuals of different genders. (e) Disease distribution, a bar chart showing the distribution of diseases. (f) Facial phenotype distribution; the most common facial phenotype is hypertelorism.

Table 2 FGDD feature missing rates.

To better visualize the facial phenotype-gene-disease relationships, we also present FGDD as a knowledge graph; the schema is shown in Fig. 5(a), and a visualization example is shown in Fig. 5(b). A Python script and tutorial for transforming the FGDD table into the knowledge graph are available at https://github.com/zhelishisongjie/FGDD.
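The released script targets Neo4j (hence the .dump format); as a minimal library-agnostic sketch of the same four-node-type schema, the following uses networkx with hypothetical relationship names (the real relationship labels are defined in the repository):

```python
import networkx as nx

# One toy record; in practice each FGDD row contributes its own nodes
# and edges. Node and relationship names here are illustrative.
rows = [
    {"patient": "P1", "gene": "ARID1B", "phenotype": "Hypertelorism",
     "disease": "Coffin-Siris syndrome 1"},
]

g = nx.MultiDiGraph()
for r in rows:
    # Four node types: Patient, Gene, Phenotype, Disease.
    g.add_node(r["patient"], type="Patient")
    g.add_node(r["gene"], type="Gene")
    g.add_node(r["phenotype"], type="Phenotype")
    g.add_node(r["disease"], type="Disease")
    # Four relationship types (hypothetical labels).
    g.add_edge(r["patient"], r["gene"], rel="HAS_VARIANT_IN")
    g.add_edge(r["patient"], r["phenotype"], rel="HAS_PHENOTYPE")
    g.add_edge(r["patient"], r["disease"], rel="DIAGNOSED_WITH")
    g.add_edge(r["gene"], r["disease"], rel="ASSOCIATED_WITH")

print(g.number_of_nodes(), g.number_of_edges())  # 4 4
```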

Fig. 5

(a) Knowledge graph schema, which includes four types of nodes and four types of relationships. (b) A knowledge graph visualization example, displayed through the yWorks Neo4j explorer.

Common algorithms performance validation

We evaluated the performance of common classification algorithms, splitting the data into training and test sets at a 7:3 ratio. All algorithms use their standard configurations and can be easily reproduced with our code. The results are shown in Table 3. TabNet, NODE, TabTransformer, and FTTransformer are popular deep learning algorithms for tabular data; NODE achieves the highest top-1 accuracy (80.19%), and FTTransformer achieves the highest Macro-F1 score (0.59).

Table 3 Performance of different classification algorithms on FGDD.
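A minimal baseline in this spirit can be sketched as follows. This is not one of the benchmarked models; it trains a random forest on synthetic stand-in data with the paper’s 7:3 split (with real FGDD data, X would hold the encoded demographic, variation, and phenotype columns and y the disease label of the 689 labeled records):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded FGDD features.
X, y = make_classification(n_samples=689, n_features=30, n_informative=10,
                           n_classes=5, n_clusters_per_class=1,
                           random_state=0)

# The paper's 7:3 train/test split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(X_te))
print(round(acc, 4))
```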

Macro-F1 is a commonly used evaluation metric in multi-class classification tasks that reflects a model’s balanced performance across all categories. Macro-F1 is defined in equation (1), where TPi, FPi, and FNi denote the true positives, false positives, and false negatives for disease i, and N is the number of diseases.

$$\text{Macro-F1}=\frac{1}{N}\sum_{i=1}^{N}\frac{2\cdot TP_{i}}{2\cdot TP_{i}+FP_{i}+FN_{i}}$$
(1)
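Equation (1) can be checked against scikit-learn’s implementation on a toy example:

```python
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

def macro_f1(y_true, y_pred, n_classes):
    """Macro-F1 as in equation (1): mean of per-class F1 scores."""
    scores = []
    for i in range(n_classes):
        tp = np.sum((y_pred == i) & (y_true == i))
        fp = np.sum((y_pred == i) & (y_true != i))
        fn = np.sum((y_pred != i) & (y_true == i))
        scores.append(2 * tp / (2 * tp + fp + fn))
    return float(np.mean(scores))

print(np.isclose(macro_f1(y_true, y_pred, 3),
                 f1_score(y_true, y_pred, average="macro")))  # True
```

Note that the per-class loop assumes each class appears in `y_true` or `y_pred`; a class with no occurrences at all would make the denominator zero.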

Explainability validation

Explainability validation is concerned with understanding the overall logic of the model, attempting to explain what the model has learned. Here we focus on feature importance analysis from both coarse-grained and fine-grained perspectives.

Coarse-grained features: We classify features into three main categories: patient, variation, and phenotype. The coarse-grained analysis considers the overall importance of these three categories, aiming to understand each category’s contribution to the model’s decision-making at a macro level. Notably, patient metadata demonstrates limited influence, as shown in Fig. 6.

Fig. 6

Coarse-grained feature importance. This suggests that disease pathogenesis is primarily driven by genetic rather than environmental or lifestyle factors. Genetic variants and facial phenotypes are central to diagnosis; integrating genomic analysis with phenotypic evaluation (e.g., distinct facial features) enhances diagnostic precision.
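Aggregating per-feature importances into the three coarse categories can be sketched as follows; the feature names, importance values, and category mapping here are hypothetical:

```python
# Hypothetical per-feature importances (e.g., from a tree ensemble) and
# a feature-to-category map mirroring the patient/variation/phenotype
# split used in the coarse-grained analysis.
importances = {"age": 0.03, "region": 0.02,
               "gene": 0.30, "exon": 0.25,
               "hypertelorism": 0.25, "short_philtrum": 0.15}
category = {"age": "patient", "region": "patient",
            "gene": "variation", "exon": "variation",
            "hypertelorism": "phenotype", "short_philtrum": "phenotype"}

# Sum fine-grained importances within each coarse category.
coarse = {}
for feat, imp in importances.items():
    coarse[category[feat]] = coarse.get(category[feat], 0.0) + imp

# In this toy example the variation category dominates and patient
# metadata contributes little, mirroring the pattern in Fig. 6.
print(coarse)
```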

Fine-grained features: Unlike the coarse-grained analysis, the fine-grained analysis examines the patient, variation, and phenotype categories individually, aiming to reveal the importance of specific features within each category at the micro level. Genetic variation features emerge as the predominant diagnostic determinants, as shown in Fig. 7.

Fig. 7

Fine-grained feature importance. Key genomic features (e.g., exon count, chromosomal location) and distinctive facial phenotypes are critical diagnostic parameters. This underscores the necessity of integrating genomic profiling with phenotypic evaluation for precise diagnosis. Furthermore, patient-specific factors including ethnicity and geographic origin must be considered, as certain diseases exhibit population-stratified prevalence patterns requiring individualized diagnostic frameworks.

Limitations and future work

Mechanistic interpretability

The current dataset also lacks data on the biological mechanisms that underlie diagnostic decisions at a deeper level. Future datasets should include proteins, complexes, pathways, and biological processes to deepen biological insight. For example, in Coffin-Siris syndrome, the most common genetic cause is a mutation in the ARID1B gene25,26.

ARID1B encodes a subunit of the Brg1/Brm-associated factor (BAF) complex (a core component of the SWI/SNF chromatin remodeling complex), regulating gene expression in cell differentiation, neural development, and DNA repair27. Pathogenic mutations block the ARID1A-to-ARID1B subunit switch in BAF, causing sustained activation of pluripotency genes (NANOG/SOX2)28. This disrupts cranial neural crest cell (CNCC) differentiation/migration and neuroectodermal maturation, leading to craniofacial anomalies (short philtrum, thick eyebrows, abnormal lips) and intellectual disability. Pathogenic mutations in SWI/SNF subunits (e.g., ARID1B) are thus central to Coffin-Siris syndrome29,30,31. Systematic incorporation of such molecular mechanisms will bridge the gap between genetic variants, diseases, and phenotypes.

Generalizability to different populations

There is a notable regional bias in the collected data, with certain racial or ethnic groups overrepresented and others underrepresented. This imbalance poses significant challenges to the model’s generalizability across diverse populations. Several strategies can address this issue: first, increasing the size of the dataset while ensuring a balanced distribution of patients across populations; second, using data augmentation/generation techniques32,33 to generate diverse data and developing weighted learning algorithms34 to reweight the imbalanced populations. In addition, multimodal learning can be used in conjunction with other image datasets to combine demographic, phenotypic, and genetic variation information with visual information to enhance generalization. For instance, a multimodal knowledge graph35 can link image features to diseases, phenotypes, and variations, enabling effective fusion through graph neural networks for tasks such as diagnosis and phenotype prediction.
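The reweighting idea can be sketched with scikit-learn’s balanced class weights; the toy labels below carry a 9:1 imbalance standing in for an overrepresented and an underrepresented population:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels with a 9:1 imbalance. "balanced" weights are
# n_samples / (n_classes * class_count), so the rare class is
# upweighted and the common class downweighted.
y = np.array([0] * 90 + [1] * 10)
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))  # rare class gets weight 5.0
```

These weights can then be passed to most classifiers (e.g., via a `class_weight` parameter or per-sample weights) so that loss contributions from underrepresented populations are amplified during training.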