Phenotypic prediction of missense variants via deep contrastive learning

Abstract

Missense variants (MVs) influence clinical phenotypes, but our understanding of their phenotypic consequences remains constrained. Existing computational approaches to interpret MVs predominantly assess their pathogenicity, without considering phenotypic heterogeneity. We present a machine-learning-based method, PheMART, to predict the clinical phenotypic consequences of MVs. PheMART integrates comprehensive variant and phenotype characterizations by leveraging a robust combination of multiple resources involving protein language models, protein–protein interactions, protein domains, medical knowledge graphs and electronic health records. Exploiting contrastive learning, PheMART establishes connections between MVs and 4,179 phenotypes by jointly projecting them into a cohesive low-dimensional metric space where proximity signifies relevance. Besides substantially outperforming existing models, PheMART aids in diagnosing individuals with rare diseases by effectively pinpointing clinical diagnoses and causative MVs. As a resource to the community, we provide a database of phenotypic predictions for 5.1 million putative pathogenic amino acid alterations.
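The joint embedding idea described in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the PheMART implementation (the authors' code is in their GitHub repository). It assumes a CLIP-style symmetric InfoNCE objective, cosine similarity as the proximity measure, a temperature of 0.1 and toy two-dimensional embeddings. All function and variable names are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def log_softmax(scores):
    """Numerically stable log-softmax of a list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def symmetric_contrastive_loss(variant_embs, phenotype_embs, temperature=0.1):
    """Symmetric InfoNCE loss; entry i of each list is assumed to be a matched pair."""
    v = [normalize(x) for x in variant_embs]
    p = [normalize(x) for x in phenotype_embs]
    n = len(v)
    logits = [[dot(vi, pj) / temperature for pj in p] for vi in v]
    # Variant-to-phenotype direction: softmax over each row.
    loss_v2p = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Phenotype-to-variant direction: softmax over each column.
    columns = [list(col) for col in zip(*logits)]
    loss_p2v = -sum(log_softmax(columns[i])[i] for i in range(n)) / n
    return 0.5 * (loss_v2p + loss_p2v)


def rank_phenotypes(variant_emb, phenotype_embs, names):
    """Order phenotype names by cosine similarity to one variant, closest first."""
    v = normalize(variant_emb)
    scored = [(dot(v, normalize(p)), name) for p, name in zip(phenotype_embs, names)]
    return [name for _, name in sorted(scored, key=lambda t: t[0], reverse=True)]


if __name__ == "__main__":
    # Toy embeddings (assumed, not real data): three phenotypes, three variants.
    phenotypes = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    names = ["phenotype_A", "phenotype_B", "phenotype_C"]
    aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
    print(symmetric_contrastive_loss(aligned, phenotypes))
    print(rank_phenotypes(aligned[0], phenotypes, names))
```

Minimizing such a loss pulls each variant towards its matched phenotype and pushes it away from the other phenotypes in the batch, so that ranking phenotypes by similarity to a new variant becomes a nearest-neighbour lookup in the shared space.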

Fig. 1: Overview of PheMART.
Fig. 2: Comprehensive evaluation of PheMART’s phenotypic predictions.
Fig. 3: PheMART’s phenotypic predictions resonate with established biological knowledge.
Fig. 4: PheMART aids in diagnosing patients with rare genetic diseases.
Fig. 5: PheMART’s phenotypic predictions as a community resource.

Data availability

Processed datasets are provided in Zenodo at https://zenodo.org/records/17402388 (ref. 92). The high-confidence phenotypic predictions by PheMART are available in Zenodo at https://zenodo.org/records/17402574 (ref. 91) and are also visualized by genes, phenotypes and protein domains on https://shiny.parse-health.org/PheMART/. In accordance with VA policy, the VA-derived EHR embedding vectors used in this study are available upon request. For access to UDN data, please refer to the UDN official data availability guidelines at https://undiagnosed.hms.harvard.edu/research/data-availability/. The Human Gene Mutation Database used for external validation cannot be publicly shared due to licensing restrictions, as it was obtained under a paid institutional license from QIAGEN, which prohibits redistribution. Source data are provided with this paper.

Code availability

The source code is available on GitHub at https://github.com/celehs/PheMART (ref. 93).

References

  1. Lappalainen, T. & MacArthur, D. G. From variant to function in human disease genetics. Science 373, 1464–1468 (2021).

  2. Chong, J. X. et al. The genetic basis of Mendelian phenotypes: discoveries, challenges, and opportunities. Am. J. Hum. Genet. 97, 199–215 (2015).

  3. Venugopal, A. et al. Monogenic diseases in India. Mutat. Res. Rev. Mutat. Res. 776, 23–31 (2018).

  4. Konishi, C. T. & Long, C. Progress and challenges in CRISPR-mediated therapeutic genome editing for monogenic diseases. J. Biomed. Res. 35, 148–162 (2021).

  5. Johannesen, K. M. et al. Genotype-phenotype correlations in SCN8A-related disorders reveal prognostic and therapeutic implications. Brain 145, 2991–3009 (2022).

  6. Halldorsson, B. V. et al. The sequences of 150,119 genomes in the UK Biobank. Nature 607, 732–740 (2022).

  7. Landrum, M. J. et al. ClinVar: improvements to accessing data. Nucleic Acids Res. 48, D835–D844 (2020).

  8. Uffelmann, E. et al. Genome-wide association studies. Nat. Rev. Methods Primers 1, 59 (2021).

  9. Starita, L. M. et al. Variant interpretation: functional assays to the rescue. Am. J. Hum. Genet. 101, 315–325 (2017).

  10. Cheng, J. et al. Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science 381, eadg7492 (2023).

  11. Wu, Y. et al. Improved pathogenicity prediction for rare human missense variants. Am. J. Hum. Genet. 108, 1891–1906 (2021); erratum 108, 2389 (2021).

  12. Slatkin, M. Linkage disequilibrium—understanding the evolutionary past and mapping the medical future. Nat. Rev. Genet. 9, 477–485 (2008).

  13. Hopf, T. A. et al. Mutation effects predicted from sequence co-variation. Nat. Biotechnol. 35, 128–135 (2017).

  14. Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 46, 310–315 (2014).

  15. Pejaver, V. et al. Inferring the molecular and phenotypic impact of amino acid variants with MutPred2. Nat. Commun. 11, 5918 (2020).

  16. Frazer, J. et al. Disease variant prediction with deep generative models of evolutionary data. Nature 599, 91–95 (2021).

  17. Zhang, H., Xu, M. S., Fan, X., Chung, W. K. & Shen, Y. Predicting functional effect of missense variants using graph attention neural networks. Nat. Mach. Intell. 4, 1017–1028 (2022).

  18. Brandes, N., Goldman, G., Wang, C. H., Ye, C. J. & Ntranos, V. Genome-wide prediction of disease variant effects with a deep protein language model. Nat. Genet. 55, 1512–1522 (2023).

  19. Chesmore, K., Bartlett, J. & Williams, S. M. The ubiquity of pleiotropy in human disease. Hum. Genet. 137, 39–44 (2018).

  20. Bodenreider, O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res. 32, D267–D270 (2004).

  21. Meyer, M. J. et al. Interactome INSIDER: a structural interactome browser for genomic studies. Nat. Methods 15, 107–114 (2018).

  22. Das, J. & Yu, H. HINT: high-quality protein interactomes and their applications in understanding human disease. BMC Syst. Biol. 6, 92 (2012).

  23. Luck, K. et al. A reference map of the human binary protein interactome. Nature 580, 402–408 (2020).

  24. The UniProt Consortium. UniProt: the universal protein knowledgebase in 2023. Nucleic Acids Res. 53, D609–D617 (2025).

  25. Yuan, Z. et al. CODER: knowledge-infused cross-lingual medical term embedding for term normalization. J. Biomed. Inform. 126, 103983 (2022).

  26. Zhou, D., Cai, T. & Lu, J. Multi-source learning via completion of block-wise overlapping noisy matrices. J. Mach. Learn. Res. 24, 1–43 (2023).

  27. Hong, C. et al. Clinical knowledge extraction via sparse embedding regression (KESER) with multi-center large scale electronic health record data. npj Digit. Med. 4, 151 (2021).

  28. Wen, J. et al. Multimodal representation learning for predicting molecule–disease relations. Bioinformatics 39, btad085 (2023).

  29. Ramoni, R. B. et al. The Undiagnosed Diseases Network: accelerating discovery about health and disease. Am. J. Hum. Genet. 100, 185–192 (2017).

  30. Ashburner, M. et al. Gene ontology: tool for the unification of biology. Nat. Genet. 25, 25–29 (2000).

  31. Stenson, P. D. et al. The Human Gene Mutation Database: towards a comprehensive repository of inherited mutation data for medical research, genetic diagnosis and next-generation sequencing studies. Hum. Genet. 136, 665–677 (2017).

  32. Ryu, J. Y., Kim, H. U. & Lee, S. Y. Deep learning improves prediction of drug–drug and drug–food interactions. Proc. Natl Acad. Sci. USA 115, E4304–E4311 (2018).

  33. Kingma, D. P., Rezende, D. J., Mohamed, S. & Welling, M. Semi-supervised learning with deep generative models. Adv. Neural Inf. Process. Syst. 27, 3581–3589 (2014).

  34. Dehghan, A., Abbasi, K., Razzaghi, P., Banadkuki, H. & Gharaghani, S. CCL-DTI: contributing the contrastive loss in drug–target interaction prediction. BMC Bioinformatics 25, 48 (2024).

  35. Radford, A. et al. Learning transferable visual models from natural language supervision. In Proc. 38th International Conference on Machine Learning (PMLR 2021) (eds Meila, M. & Zhang, T.) 8748–8763 (PMLR, 2021).

  36. Miller, D. T. et al. ACMG SF v3.1 list for reporting of secondary findings in clinical exome and genome sequencing: a policy statement of the American College of Medical Genetics and Genomics (ACMG). Genet. Med. 24, 1407–1414 (2022).

  37. Marchant, R. G. et al. Genome and RNA sequencing boost neuromuscular diagnoses to 62% from 34% with exome sequencing alone. Ann. Clin. Transl. Neurol. 11, 1250–1266 (2024).

  38. Koutsofti, C. et al. Massive parallel DNA sequencing of patients with inherited cardiomyopathies in Cyprus and suggestion of digenic or oligogenic inheritance. Genes 15, 319 (2024).

  39. Meier, J. et al. Language models enable zero-shot prediction of the effects of mutations on protein function. Adv. Neural Inf. Process. Syst. 34, 29287–29303 (2021).

  40. Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).

  41. Hadsell, R., Chopra, S. & LeCun, Y. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06) (eds Fitzgibbon, A. et al.) 1735–1742 (IEEE, 2006).

  42. Guo, C., Pleiss, G., Sun, Y. & Weinberger, K. Q. On calibration of modern neural networks. In Proc. 34th International Conference on Machine Learning (PMLR 2017) (eds Precup, D. & Teh, Y. W.) 1321–1330 (PMLR, 2017).

  43. George, A. L. Jr. Inherited disorders of voltage-gated sodium channels. J. Clin. Invest. 115, 1990–1999 (2005).

  44. Kullmann, D. M. & Waxman, S. G. Neurological channelopathies: new insights into disease mechanisms and ion channel function. J. Physiol. 588, 1823–1827 (2010).

  45. Bers, D. M. Cardiac excitation–contraction coupling. Nature 415, 198–205 (2002).

  46. Opalińska, M. & Jańska, H. AAA proteases: guardians of mitochondrial function and homeostasis. Cells 7, 163 (2018).

  47. Cloonan, S. M. & Choi, A. M. K. Mitochondria in lung disease. J. Clin. Invest. 126, 809–820 (2016).

  48. Klaassen, S. et al. Mutations in sarcomere protein genes in left ventricular noncompaction. Circulation 117, 2893–2901 (2008).

  49. Holm, H. et al. A rare variant in MYH6 is associated with high risk of sick sinus syndrome. Nat. Genet. 43, 316–320 (2011).

  50. D’Cruz, A. A., Babon, J. J., Norton, R. S., Nicola, N. A. & Nicholson, S. E. Structure and function of the SPRY/B30.2 domain proteins involved in innate immunity. Protein Sci. 22, 1–10 (2013).

  51. Van Bergen, N. J. et al. CDKL5 deficiency disorder: molecular insights and mechanisms of pathogenicity to fast-track therapeutic development. Biochem. Soc. Trans. 50, 1207–1224 (2022).

  52. Yan, G.-X. & Antzelevitch, C. Cellular basis for the Brugada syndrome and other mechanisms of arrhythmogenesis associated with ST-segment elevation. Circulation 100, 1660–1666 (1999).

  53. Jones, S. An overview of the basic helix-loop-helix proteins. Genome Biol. 5, 226 (2004).

  54. Duverger, O. & Morasso, M. I. Role of homeobox genes in the patterning, specification, and differentiation of ectodermal appendages in mammals. J. Cell. Physiol. 216, 337–346 (2008).

  55. Lewis, D. L. et al. Ectopic gene expression and homeotic transformations in arthropods using recombinant sindbis viruses. Curr. Biol. 9, 1279–1287 (1999).

  56. Hwang, J. L. et al. FOXP3 mutations causing early-onset insulin-requiring diabetes but without other features of immune dysregulation, polyendocrinopathy, enteropathy, x-linked syndrome. Pediatr. Diabetes 19, 388–392 (2018).

  57. Brock, S. et al. Tubulinopathies continued: refining the phenotypic spectrum associated with variants in TUBG1. Eur. J. Hum. Genet. 26, 1132–1142 (2018).

  58. Cai, S., Li, J., Wu, Y. & Jiang, Y. De novo mutations of TUBB2A cause infantile-onset epilepsy and developmental delay. J. Hum. Genet. 65, 601–608 (2020).

  59. Watanabe, K. et al. Identification of two novel de novo TUBB variants in cases with brain malformations: case reports and literature review. J. Hum. Genet. 66, 1193–1197 (2021).

  60. Splinter, K. et al. Effect of genetic diagnosis on patients with previously undiagnosed disease. N. Engl. J. Med. 379, 2131–2139 (2018).

  61. Kobren, S. N. et al. Commonalities across computational workflows for uncovering explanatory variants in undiagnosed cases. Genet. Med. 23, 1075–1085 (2021).

  62. Schlüter, A. et al. ClinPrior: an algorithm for diagnosis and novel gene discovery by network-based prioritization. Genome Med. 15, 68 (2023).

  63. Marwaha, S., Knowles, J. W. & Ashley, E. A. A guide for the diagnosis of rare and undiagnosed disease: beyond the exome. Genome Med. 14, 23 (2022).

  64. Gahl, W. A. et al. The National Institutes of Health Undiagnosed Diseases Program: insights into rare diseases. Genet. Med. 14, 51–59 (2012).

  65. Zhu, X. et al. Whole-exome sequencing in undiagnosed genetic diseases: interpreting 119 trios. Genet. Med. 17, 774–781 (2015).

  66. Weinreich, S. S., Mangon, R., Sikkens, J. J., En Teeuw, M. E. & Cornel, M. C. Orphanet: a European database for rare diseases [in Dutch with English abstract]. Ned. Tijdschr. Geneeskd. 152, 518–519 (2008).

  67. Hamosh, A., Scott, A. F., Amberger, J. S., Bocchini, C. A. & McKusick, V. A. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 33, D514–D517 (2005).

  68. Robinson, P. N. et al. The Human Phenotype Ontology: a tool for annotating and analyzing human hereditary disease. Am. J. Hum. Genet. 83, 610–615 (2008).

  69. Wang, J. Z., Du, Z., Payattakool, R., Yu, P. S. & Chen, C.-F. A new method to measure the semantic similarity of GO terms. Bioinformatics 23, 1274–1281 (2007).

  70. Jagadeesh, K. A. et al. Phrank measures phenotype sets similarity to greatly improve Mendelian diagnostic disease prioritization. Genet. Med. 21, 464–470 (2019).

  71. Köhler, S. et al. Clinical diagnostics in human genetics with semantic similarity searches in ontologies. Am. J. Hum. Genet. 85, 457–464 (2009).

  72. Li, Y. et al. Functional correlation of ATP1A2 mutations with phenotypic spectrum: from pure hemiplegic migraine to its variant forms. J. Headache Pain 22, 92 (2021).

  73. Wojciechowska, K., Pikulicka, A., Drgas, O., Żarnowska, I. & Brudkowska, Z. Heterozygous de novo mutation in the ATP1A2 gene in a patient with alternating hemiplegia of childhood. Pediatr. Pol. 98, 258–263 (2023).

  74. Ng, B. G. et al. Biallelic mutations in CAD, impair de novo pyrimidine biosynthesis and decrease glycosylation precursors. Hum. Mol. Genet. 24, 3050–3057 (2015).

  75. Yang, H. et al. Noncoding genetic variation in GATA3 increases acute lymphoblastic leukemia risk through local and global changes in chromatin conformation. Nat. Genet. 54, 170–179 (2022).

  76. Abunimye, D. A., Okafor, I. M., Okorowo, H. & Obeagu, E. I. The role of GATA family transcriptional factors in haematological malignancies: a review. Medicine 103, e37487 (2024); retraction 103, e38232 (2024).

  77. Backman, J. D. et al. Exome sequencing and analysis of 454,787 UK Biobank participants. Nature 599, 628–634 (2021).

  78. Stefl, S., Nishi, H., Petukh, M., Panchenko, A. R. & Alexov, E. Molecular mechanisms of disease-causing missense mutations. J. Mol. Biol. 425, 3919–3936 (2013).

  79. Backwell, L. & Marsh, J. A. Diverse molecular mechanisms underlying pathogenic protein mutations: beyond the loss-of-function paradigm. Annu. Rev. Genom. Hum. Genet. 23, 475–498 (2022).

  80. Frésard, L. et al. Identification of rare-disease genes using blood transcriptome sequencing and large control cohorts. Nat. Med. 25, 911–919 (2019).

  81. Lee, H. et al. Diagnostic utility of transcriptome sequencing for rare Mendelian diseases. Genet. Med. 22, 490–499 (2020).

  82. Postel, M. D., Culver, J. O., Ricker, C. & Craig, D. W. Transcriptome analysis provides critical answers to the “variants of uncertain significance” conundrum. Hum. Mutat. 43, 1590–1608 (2022).

  83. Chowdhury, R. et al. Single-sequence protein structure prediction using a language model and deep learning. Nat. Biotechnol. 40, 1617–1623 (2022).

  84. Zeng, S., Yuan, Z. & Yu, S. Automatic biomedical term clustering by learning fine-grained term representations. In Proc. 21st Workshop on Biomedical Language Processing (eds Demner-Fushman, D. et al.) 91–96 (Association for Computational Linguistics, 2022).

  85. Putkowski, S. The National Organization for Rare Disorders (NORD): providing advocacy for people with rare disorders. NASN Sch. Nurse 25, 38–41 (2010).

  86. Sahni, N. et al. Widespread macromolecular interaction perturbations in human genetic disorders. Cell 161, 647–660 (2015).

  87. Ng, P. K.-S. et al. Systematic functional annotation of somatic mutations in cancer. Cancer Cell 33, 450–462 (2018).

  88. Cheng, F. et al. Comprehensive characterization of protein–protein interactions perturbed by disease mutations. Nat. Genet. 53, 342–353 (2021).

  89. Chandak, P., Huang, K. & Zitnik, M. Building a knowledge graph to enable precision medicine. Sci. Data 10, 67 (2023).

  90. Wang, X. et al. Three-dimensional reconstruction of protein networks provides insight into human genetic disease. Nat. Biotechnol. 30, 159–164 (2012).

  91. Jun, W. et al. Phenotypic prediction of missense variants (high-confidence predictions). Zenodo https://doi.org/10.5281/zenodo.17402573 (2025).

  92. Jun, W. et al. Phenotypic prediction of missense variants (dataset). Zenodo https://doi.org/10.5281/zenodo.17402387 (2025).

  93. Jun, W. et al. PheMART. GitHub https://github.com/celehs/PheMART (2025).

  94. The UniProt Consortium. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res. 49, D480–D489 (2021).

  95. McInnes, L., Healy, J. & Melville, J. UMAP: uniform manifold approximation and projection for dimension reduction. J. Open Source Softw. 3, 861 (2018).

Acknowledgements

Part of this research is based on data from the Million Veteran Program, Office of Research and Development, Veterans Health Administration, and was supported by award no. MVP000 (K.C.). Research reported in this manuscript was supported by the US National Institutes of Health (NIH) under award nos. U01HG007530 (S.N.K. and I.S.K.), P30 AR072577 (K.P.L.), R01 CA297832 (H.W.) and R01 GM152814-01 (J.S.L.). This work was also supported by the US National Science Foundation (NSF) under award no. IIS-2127918 (H.W.) and NSF CAREER Award IIS-2340125 (H.W.), and by an Amazon Faculty Research Award, and Microsoft AI and Society Fellowship (H.W.). The content is solely the authors’ responsibility and does not necessarily represent the official views of the NIH.

Author information

Contributions

J.W., T.C., S.Z., J.D., J.S.L., A.C.P., M. Zitnik, I.S.K., H.W., M. Zhu, S.C. and F.L. contributed to the conceptualization of the study. Writing was carried out by J.W., T.C., S.Z., A.C.P., Y.C., S.N.K., M. Zitnik, C.-L.B. and H.G.Z. Data curation was performed by J.W., S.Z., J.D., S.N.K., Y.C., S.C. and I.S.K. Funding was provided by T.C., I.S.K., K.P.L. and K.C.

Corresponding author

Correspondence to Tianxi Cai.

Ethics declarations

Competing interests

K.P.L. was a past one-time consultant for the University of California, Berkeley. The other authors declare no competing interests.

Peer review

Peer review information

Nature Biomedical Engineering thanks Xiaoming Liu, Jianyi Yang and Matthew Jensen for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information (PDF)

Supplementary Figs. 1–11; results, analysis and resource details (S1.1–S1.12); and References.

Reporting Summary (PDF)

Supplementary Data 1 (XLS)

Calibrated PheMART predictions across phenotype groups.

Supplementary Data 2 (XLS)

Ablation study to investigate the contributions of the proposed components of PheMART.

Supplementary Data 3 (XLS)

Evaluating PheMART’s generalization to variants in unseen protein domains.

Source data

Source Data Fig. 1 (CSV)

Statistical source data.

Source Data Fig. 2 (XLS)

Statistical source data.

Source Data Fig. 3 (CSV)

Statistical source data.

Source Data Fig. 4 (XLS)

Statistical source data.

Source Data Fig. 5 (XLS)

Statistical source data.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wen, J., Zeng, S., Bonzel, CL. et al. Phenotypic prediction of missense variants via deep contrastive learning. Nat. Biomed. Eng (2026). https://doi.org/10.1038/s41551-026-01636-4

  • DOI: https://doi.org/10.1038/s41551-026-01636-4
