Abstract
Missense variants (MVs) influence clinical phenotypes, but our understanding of their phenotypic consequences remains limited. Existing computational approaches to interpreting MVs predominantly assess pathogenicity and do not account for phenotypic heterogeneity. We present PheMART, a machine-learning method that predicts the clinical phenotypic consequences of MVs. PheMART characterizes variants and phenotypes by combining multiple resources: protein language models, protein–protein interactions, protein domains, medical knowledge graphs and electronic health records. Using contrastive learning, PheMART links MVs to 4,179 phenotypes by jointly projecting them into a shared low-dimensional metric space in which proximity signifies relevance. In addition to substantially outperforming existing models, PheMART aids the diagnosis of individuals with rare diseases by pinpointing clinical diagnoses and causative MVs. As a resource to the community, we provide a database of phenotypic predictions for 5.1 million putatively pathogenic amino acid alterations.
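To make the idea of a shared metric space concrete, the following is a minimal sketch of contrastive scoring between variant and phenotype embeddings. It is not the PheMART implementation: the abstract does not specify the loss, so a generic InfoNCE objective with cosine similarity is assumed here, and the function names (`cosine`, `info_nce`, `rank_phenotypes`), the temperature value, the two-dimensional toy vectors and the phenotype labels are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(variants, phenotypes, temperature=0.1):
    """Mean InfoNCE loss, assuming variants[i] is paired with phenotypes[i].

    All other phenotypes in the batch act as negatives for variant i.
    (Assumed objective for illustration, not taken from the paper.)
    """
    total = 0.0
    for i, v in enumerate(variants):
        logits = [cosine(v, p) / temperature for p in phenotypes]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - logits[i]
    return total / len(variants)


def rank_phenotypes(variant, phenotype_table):
    """Return phenotype names ordered from most to least similar."""
    scored = [(cosine(variant, vec), name) for name, vec in phenotype_table.items()]
    return [name for _, name in sorted(scored, reverse=True)]


# Toy embeddings (hypothetical): variant i should sit near phenotype i.
variants = [[1.0, 0.1], [0.1, 1.0]]
phenotypes = [[0.9, 0.2], [0.2, 0.9]]

aligned_loss = info_nce(variants, phenotypes)
shuffled_loss = info_nce(variants, phenotypes[::-1])
ranking = rank_phenotypes(
    variants[0], {"phenotype_A": phenotypes[0], "phenotype_B": phenotypes[1]}
)
```

In this toy setting the loss is lower when each variant is paired with its nearby phenotype than when the pairs are swapped, and ranking phenotypes by similarity to a variant embedding is one plausible way to read "proximity signifies relevance" at prediction time.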
Data availability
Processed datasets are provided in Zenodo at https://zenodo.org/records/17402388 (ref. 91). The high-confidence phenotypic predictions by PheMART are available in Zenodo at https://zenodo.org/records/17402574 (ref. 92) and are also visualized by genes, phenotypes and protein domains on https://shiny.parse-health.org/PheMART/. In accordance with VA policy, the VA-derived EHR embedding vectors used in this study are available upon request. For access to UDN data, please refer to the UDN official data availability guidelines at https://undiagnosed.hms.harvard.edu/research/data-availability/. The Human Gene Mutation Database used for external validation cannot be publicly shared due to licensing restrictions, as it was obtained under a paid institutional license from QIAGEN, which prohibits redistribution. Source data are provided with this paper.
Code availability
The source code is available on GitHub at https://github.com/celehs/PheMART (ref. 93).
References
Lappalainen, T. & MacArthur, D. G. From variant to function in human disease genetics. Science 373, 1464–1468 (2021).
Chong, J. X. et al. The genetic basis of Mendelian phenotypes: discoveries, challenges, and opportunities. Am. J. Hum. Genet. 97, 199–215 (2015).
Venugopal, A. et al. Monogenic diseases in India. Mutat. Res. Rev. Mutat. Res. 776, 23–31 (2018).
Konishi, C. T. & Long, C. Progress and challenges in CRISPR-mediated therapeutic genome editing for monogenic diseases. J. Biomed. Res. 35, 148–162 (2021).
Johannesen, K. M. et al. Genotype-phenotype correlations in SCN8A-related disorders reveal prognostic and therapeutic implications. Brain 145, 2991–3009 (2022).
Halldorsson, B. V. et al. The sequences of 150,119 genomes in the UK Biobank. Nature 607, 732–740 (2022).
Landrum, M. J. et al. ClinVar: improvements to accessing data. Nucleic Acids Res. 48, D835–D844 (2020).
Uffelmann, E. et al. Genome-wide association studies. Nat. Rev. Methods Primers 1, 59 (2021).
Starita, L. M. et al. Variant interpretation: functional assays to the rescue. Am. J. Hum. Genet. 101, 315–325 (2017).
Cheng, J. et al. Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science 381, eadg7492 (2023).
Wu, Y. et al. Improved pathogenicity prediction for rare human missense variants. Am. J. Hum. Genet. 108, 1891–1906 (2021); erratum 108, 2389 (2021).
Slatkin, M. Linkage disequilibrium—understanding the evolutionary past and mapping the medical future. Nat. Rev. Genet. 9, 477–485 (2008).
Hopf, T. A. et al. Mutation effects predicted from sequence co-variation. Nat. Biotechnol. 35, 128–135 (2017).
Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 46, 310–315 (2014).
Pejaver, V. et al. Inferring the molecular and phenotypic impact of amino acid variants with MutPred2. Nat. Commun. 11, 5918 (2020).
Frazer, J. et al. Disease variant prediction with deep generative models of evolutionary data. Nature 599, 91–95 (2021).
Zhang, H., Xu, M. S., Fan, X., Chung, W. K. & Shen, Y. Predicting functional effect of missense variants using graph attention neural networks. Nat. Mach. Intell. 4, 1017–1028 (2022).
Brandes, N., Goldman, G., Wang, C. H., Ye, C. J. & Ntranos, V. Genome-wide prediction of disease variant effects with a deep protein language model. Nat. Genet. 55, 1512–1522 (2023).
Chesmore, K., Bartlett, J. & Williams, S. M. The ubiquity of pleiotropy in human disease. Hum. Genet. 137, 39–44 (2018).
Bodenreider, O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res. 32, D267–D270 (2004).
Meyer, M. J. et al. Interactome INSIDER: a structural interactome browser for genomic studies. Nat. Methods 15, 107–114 (2018).
Das, J. & Yu, H. HINT: high-quality protein interactomes and their applications in understanding human disease. BMC Syst. Biol. 6, 92 (2012).
Luck, K. et al. A reference map of the human binary protein interactome. Nature 580, 402–408 (2020).
The UniProt Consortium. UniProt: the universal protein knowledgebase in 2023. Nucleic Acids Res. 53, D609–D617 (2025).
Yuan, Z. et al. CODER: knowledge-infused cross-lingual medical term embedding for term normalization. J. Biomed. Inform. 126, 103983 (2022).
Zhou, D., Cai, T. & Lu, J. Multi-source learning via completion of block-wise overlapping noisy matrices. J. Mach. Learn. Res. 24, 1–43 (2023).
Hong, C. et al. Clinical knowledge extraction via sparse embedding regression (KESER) with multi-center large scale electronic health record data. npj Digit. Med. 4, 151 (2021).
Wen, J. et al. Multimodal representation learning for predicting molecule–disease relations. Bioinformatics 39, btad085 (2023).
Ramoni, R. B. et al. The Undiagnosed Diseases Network: accelerating discovery about health and disease. Am. J. Hum. Genet. 100, 185–192 (2017).
Ashburner, M. et al. Gene ontology: tool for the unification of biology. Nat. Genet. 25, 25–29 (2000).
Stenson, P. D. et al. The Human Gene Mutation Database: towards a comprehensive repository of inherited mutation data for medical research, genetic diagnosis and next-generation sequencing studies. Hum. Genet. 136, 665–677 (2017).
Ryu, J. Y., Kim, H. U. & Lee, S. Y. Deep learning improves prediction of drug–drug and drug–food interactions. Proc. Natl Acad. Sci. USA 115, E4304–E4311 (2018).
Kingma, D. P., Rezende, D. J., Mohamed, S. & Welling, M. Semi-supervised learning with deep generative models. Adv. Neural Inf. Process. Syst. 27, 3581–3589 (2014).
Dehghan, A., Abbasi, K., Razzaghi, P., Banadkuki, H. & Gharaghani, S. CCL-DTI: contributing the contrastive loss in drug–target interaction prediction. BMC Bioinformatics 25, 48 (2024).
Radford, A. et al. Learning transferable visual models from natural language supervision. In Proc. 38th International Conference on Machine Learning (PMLR 2021) (eds Meila, M. & Zhang, T.) 8748–8763 (PMLR, 2021).
Miller, D. T. et al. ACMG SF v3.1 list for reporting of secondary findings in clinical exome and genome sequencing: a policy statement of the American College of Medical Genetics and Genomics (ACMG). Genet. Med. 24, 1407–1414 (2022).
Marchant, R. G. et al. Genome and RNA sequencing boost neuromuscular diagnoses to 62% from 34% with exome sequencing alone. Ann. Clin. Transl. Neurol. 11, 1250–1266 (2024).
Koutsofti, C. et al. Massive parallel DNA sequencing of patients with inherited cardiomyopathies in Cyprus and suggestion of digenic or oligogenic inheritance. Genes 15, 319 (2024).
Meier, J. et al. Language models enable zero-shot prediction of the effects of mutations on protein function. Adv. Neural Inf. Process. Syst. 34, 29287–29303 (2021).
Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).
Hadsell, R., Chopra, S. & LeCun, Y. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06) (eds Fitzgibbon, A. et al.) 1735–1742 (IEEE, 2006).
Guo, C., Pleiss, G., Sun, Y. & Weinberger, K. Q. On calibration of modern neural networks. In Proc. 34th International Conference on Machine Learning (PMLR 2017) (eds Precup, D. & Teh, Y. W.) 1321–1330 (PMLR, 2017).
George, A. L. Jr. Inherited disorders of voltage-gated sodium channels. J. Clin. Invest. 115, 1990–1999 (2005).
Kullmann, D. M. & Waxman, S. G. Neurological channelopathies: new insights into disease mechanisms and ion channel function. J. Physiol. 588, 1823–1827 (2010).
Bers, D. M. Cardiac excitation–contraction coupling. Nature 415, 198–205 (2002).
Opalińska, M. & Jańska, H. AAA proteases: guardians of mitochondrial function and homeostasis. Cells 7, 163 (2018).
Cloonan, S. M. & Choi, A. M. K. Mitochondria in lung disease. J. Clin. Invest. 126, 809–820 (2016).
Klaassen, S. et al. Mutations in sarcomere protein genes in left ventricular noncompaction. Circulation 117, 2893–2901 (2008).
Holm, H. et al. A rare variant in MYH6 is associated with high risk of sick sinus syndrome. Nat. Genet. 43, 316–320 (2011).
D’Cruz, A. A., Babon, J. J., Norton, R. S., Nicola, N. A. & Nicholson, S. E. Structure and function of the SPRY/B30.2 domain proteins involved in innate immunity. Protein Sci. 22, 1–10 (2013).
Van Bergen, N. J. et al. CDKL5 deficiency disorder: molecular insights and mechanisms of pathogenicity to fast-track therapeutic development. Biochem. Soc. Trans. 50, 1207–1224 (2022).
Yan, G.-X. & Antzelevitch, C. Cellular basis for the Brugada syndrome and other mechanisms of arrhythmogenesis associated with ST-segment elevation. Circulation 100, 1660–1666 (1999).
Jones, S. An overview of the basic helix-loop-helix proteins. Genome Biol. 5, 226 (2004).
Duverger, O. & Morasso, M. I. Role of homeobox genes in the patterning, specification, and differentiation of ectodermal appendages in mammals. J. Cell. Physiol. 216, 337–346 (2008).
Lewis, D. L. et al. Ectopic gene expression and homeotic transformations in arthropods using recombinant sindbis viruses. Curr. Biol. 9, 1279–1287 (1999).
Hwang, J. L. et al. FOXP3 mutations causing early-onset insulin-requiring diabetes but without other features of immune dysregulation, polyendocrinopathy, enteropathy, x-linked syndrome. Pediatr. Diabetes 19, 388–392 (2018).
Brock, S. et al. Tubulinopathies continued: refining the phenotypic spectrum associated with variants in TUBG1. Eur. J. Hum. Genet. 26, 1132–1142 (2018).
Cai, S., Li, J., Wu, Y. & Jiang, Y. De novo mutations of TUBB2A cause infantile-onset epilepsy and developmental delay. J. Hum. Genet. 65, 601–608 (2020).
Watanabe, K. et al. Identification of two novel de novo TUBB variants in cases with brain malformations: case reports and literature review. J. Hum. Genet. 66, 1193–1197 (2021).
Splinter, K. et al. Effect of genetic diagnosis on patients with previously undiagnosed disease. N. Engl. J. Med. 379, 2131–2139 (2018).
Kobren, S. N. et al. Commonalities across computational workflows for uncovering explanatory variants in undiagnosed cases. Genet. Med. 23, 1075–1085 (2021).
Schlüter, A. et al. ClinPrior: an algorithm for diagnosis and novel gene discovery by network-based prioritization. Genome Med. 15, 68 (2023).
Marwaha, S., Knowles, J. W. & Ashley, E. A. A guide for the diagnosis of rare and undiagnosed disease: beyond the exome. Genome Med. 14, 23 (2022).
Gahl, W. A. et al. The National Institutes of Health Undiagnosed Diseases Program: insights into rare diseases. Genet. Med. 14, 51–59 (2012).
Zhu, X. et al. Whole-exome sequencing in undiagnosed genetic diseases: interpreting 119 trios. Genet. Med. 17, 774–781 (2015).
Weinreich, S. S., Mangon, R., Sikkens, J. J., En Teeuw, M. E. & Cornel, M. C. Orphanet: a European database for rare diseases [in Dutch with English abstract]. Ned. Tijdschr. Geneeskd. 152, 518–519 (2008).
Hamosh, A., Scott, A. F., Amberger, J. S., Bocchini, C. A. & McKusick, V. A. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 33, D514–D517 (2005).
Robinson, P. N. et al. The Human Phenotype Ontology: a tool for annotating and analyzing human hereditary disease. Am. J. Hum. Genet. 83, 610–615 (2008).
Wang, J. Z., Du, Z., Payattakool, R., Yu, P. S. & Chen, C.-F. A new method to measure the semantic similarity of GO terms. Bioinformatics 23, 1274–1281 (2007).
Jagadeesh, K. A. et al. Phrank measures phenotype sets similarity to greatly improve Mendelian diagnostic disease prioritization. Genet. Med. 21, 464–470 (2019).
Köhler, S. et al. Clinical diagnostics in human genetics with semantic similarity searches in ontologies. Am. J. Hum. Genet. 85, 457–464 (2009).
Li, Y. et al. Functional correlation of ATP1A2 mutations with phenotypic spectrum: from pure hemiplegic migraine to its variant forms. J. Headache Pain 22, 92 (2021).
Wojciechowska, K., Pikulicka, A., Drgas, O., Żarnowska, I. & Brudkowska, Z. Heterozygous de novo mutation in the ATP1A2 gene in a patient with alternating hemiplegia of childhood. Pediatr. Pol. 98, 258–263 (2023).
Ng, B. G. et al. Biallelic mutations in CAD, impair de novo pyrimidine biosynthesis and decrease glycosylation precursors. Hum. Mol. Genet. 24, 3050–3057 (2015).
Yang, H. et al. Noncoding genetic variation in GATA3 increases acute lymphoblastic leukemia risk through local and global changes in chromatin conformation. Nat. Genet. 54, 170–179 (2022).
Abunimye, D. A., Okafor, I. M., Okorowo, H. & Obeagu, E. I. The role of GATA family transcriptional factors in haematological malignancies: a review. Medicine 103, e37487 (2024); retraction 103, e38232 (2024).
Backman, J. D. et al. Exome sequencing and analysis of 454,787 UK Biobank participants. Nature 599, 628–634 (2021).
Stefl, S., Nishi, H., Petukh, M., Panchenko, A. R. & Alexov, E. Molecular mechanisms of disease-causing missense mutations. J. Mol. Biol. 425, 3919–3936 (2013).
Backwell, L. & Marsh, J. A. Diverse molecular mechanisms underlying pathogenic protein mutations: beyond the loss-of-function paradigm. Annu. Rev. Genom. Hum. Genet. 23, 475–498 (2022).
Frésard, L. et al. Identification of rare-disease genes using blood transcriptome sequencing and large control cohorts. Nat. Med. 25, 911–919 (2019).
Lee, H. et al. Diagnostic utility of transcriptome sequencing for rare Mendelian diseases. Genet. Med. 22, 490–499 (2020).
Postel, M. D., Culver, J. O., Ricker, C. & Craig, D. W. Transcriptome analysis provides critical answers to the “variants of uncertain significance” conundrum. Hum. Mutat. 43, 1590–1608 (2022).
Chowdhury, R. et al. Single-sequence protein structure prediction using a language model and deep learning. Nat. Biotechnol. 40, 1617–1623 (2022).
Zeng, S., Yuan, Z. & Yu, S. Automatic biomedical term clustering by learning fine-grained term representations. In Proc. 21st Workshop on Biomedical Language Processing (eds Demner-Fushman, D. et al.) 91–96 (Association for Computational Linguistics, 2022).
Putkowski, S. The National Organization for Rare Disorders (NORD): providing advocacy for people with rare disorders. NASN Sch. Nurse 25, 38–41 (2010).
Sahni, N. et al. Widespread macromolecular interaction perturbations in human genetic disorders. Cell 161, 647–660 (2015).
Ng, P. K.-S. et al. Systematic functional annotation of somatic mutations in cancer. Cancer Cell 33, 450–462 (2018).
Cheng, F. et al. Comprehensive characterization of protein–protein interactions perturbed by disease mutations. Nat. Genet. 53, 342–353 (2021).
Chandak, P., Huang, K. & Zitnik, M. Building a knowledge graph to enable precision medicine. Sci. Data 10, 67 (2023).
Wang, X. et al. Three-dimensional reconstruction of protein networks provides insight into human genetic disease. Nat. Biotechnol. 30, 159–164 (2012).
Jun, W. et al. Phenotypic prediction of missense variants (high-confidence predictions). Zenodo https://doi.org/10.5281/zenodo.17402573 (2025).
Jun, W. et al. Phenotypic prediction of missense variants (dataset). Zenodo https://doi.org/10.5281/zenodo.17402387 (2025).
Jun, W. et al. PheMART. GitHub https://github.com/celehs/PheMART (2025).
The UniProt Consortium. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res. 49, D480–D489 (2021).
McInnes, L., Healy, J. & Melville, J. UMAP: uniform manifold approximation and projection for dimension reduction. J. Open Source Softw. 3, 861 (2018).
Acknowledgements
Part of this research is based on data from the Million Veteran Program, Office of Research and Development, Veterans Health Administration, and was supported by award no. MVP000 (K.C.). Research reported in this manuscript was supported by the US National Institutes of Health (NIH) under award nos. U01HG007530 (S.N.K. and I.S.K.), P30 AR072577 (K.P.L.), R01 CA297832 (H.W.) and R01 GM152814-01 (J.S.L.). This work was also supported by the US National Science Foundation (NSF) under award no. IIS-2127918 (H.W.) and NSF CAREER Award IIS-2340125 (H.W.), and by an Amazon Faculty Research Award and a Microsoft AI and Society Fellowship (H.W.). The content is solely the authors’ responsibility and does not necessarily represent the official views of the NIH.
Author information
Contributions
J.W., T.C., S.Z., J.D., J.S.L., A.C.P., M. Zitnik, I.S.K., H.W., M. Zhu, S.C. and F.L. contributed to the conceptualization of the study. Writing was carried out by J.W., T.C., S.Z., A.C.P., Y.C., S.N.K., M. Zitnik, C.-L.B. and H.G.Z. Data curation was performed by J.W., S.Z., J.D., S.N.K., Y.C., S.C. and I.S.K. Funding was provided by T.C., I.S.K., K.P.L. and K.C.
Ethics declarations
Competing interests
K.P.L. was a past one-time consultant for the University of California, Berkeley. The other authors declare no competing interests.
Peer review
Peer review information
Nature Biomedical Engineering thanks Xiaoming Liu, Jianyi Yang and Matthew Jensen for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information (PDF)
Supplementary Figs. 1–11; results, analysis and resource details (S1.1–S1.12); and References.
Supplementary Data 1 (XLS)
Calibrated PheMART predictions across phenotype groups.
Supplementary Data 2 (XLS)
Ablation study to investigate the contributions of the proposed components of PheMART.
Supplementary Data 3 (XLS)
Evaluating PheMART’s generalization to variants in unseen protein domains.
Source data
Source Data Fig. 1 (CSV)
Statistical source data.
Source Data Fig. 2 (XLS)
Statistical source data.
Source Data Fig. 3 (CSV)
Statistical source data.
Source Data Fig. 4 (XLS)
Statistical source data.
Source Data Fig. 5 (XLS)
Statistical source data.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wen, J., Zeng, S., Bonzel, CL. et al. Phenotypic prediction of missense variants via deep contrastive learning. Nat. Biomed. Eng (2026). https://doi.org/10.1038/s41551-026-01636-4
DOI: https://doi.org/10.1038/s41551-026-01636-4