Abstract
Coronary artery disease (CAD) exists on a spectrum of disease represented by a combination of risk factors and pathogenic processes. An in silico score for CAD built using machine learning and clinical data in electronic health records captures disease progression, severity and underdiagnosis on this spectrum and could enhance genetic discovery efforts for CAD. Here we tested associations of rare and ultrarare coding variants with the in silico score for CAD in the UK Biobank, All of Us Research Program and BioMe Biobank. We identified associations in 17 genes; of these, 14 show at least moderate levels of prior genetic, biological and/or clinical support for CAD. We also observed an excess of ultrarare coding variants in 321 aggregated CAD genes, suggesting more ultrarare variant associations await discovery. These results expand our understanding of the genetic etiology of CAD and illustrate how digital markers can enhance genetic association investigations for complex diseases.
Data availability
Genetic association summary statistics are available on the GWAS Catalog (study accessions GCST90370243 and GCST90370244, both available at https://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST90370001-GCST90371000/) and Zenodo (https://zenodo.org/records/11086022; ref. 55).
Code availability
All analysis code is available on Zenodo (https://zenodo.org/records/11086022; ref. 55).
References
Roth, G. A. et al. Global burden of cardiovascular diseases and risk factors, 1990–2019. J. Am. Coll. Cardiol. 76, 2982–3021 (2020).
Khera, A. V. & Kathiresan, S. Genetics of coronary artery disease: discovery, biology and clinical translation. Nat. Rev. Genet. 18, 331–344 (2017).
Chen, Z. & Schunkert, H. Genetics of coronary artery disease in the post-GWAS era. J. Intern. Med. 290, 980–992 (2021).
Aragam, K. G. et al. Discovery and systematic characterization of risk variants and genes for coronary artery disease in over a million participants. Nat. Genet. 54, 1803–1815 (2022).
Tcheandjieu, C. et al. Large-scale genome-wide association study of coronary artery disease in genetically diverse populations. Nat. Med. 28, 1679–1692 (2022).
Plenge, R. M., Scolnick, E. M. & Altshuler, D. Validating therapeutic targets through human genetics. Nat. Rev. Drug Discov. 12, 581–594 (2013).
Plenge, R. M. Disciplined approach to drug discovery and early development. Sci. Transl. Med. 8, 349ps15 (2016).
Szustakowski, J. D. et al. Advancing human genetics research and drug discovery through exome sequencing of the UK Biobank. Nat. Genet. 53, 942–948 (2021).
Do, R. et al. Exome sequencing identifies rare LDLR and APOA5 alleles conferring risk for myocardial infarction. Nature 518, 102–106 (2015).
Yao, K. et al. Exome sequencing identifies rare mutations of LDLR and QTRT1 conferring risk for early-onset coronary artery disease in Chinese. Natl Sci. Rev. 9, nwac102 (2022).
Khera, A. V. et al. Gene sequencing identifies perturbation in nitric oxide signaling as a nonlipid molecular subtype of coronary artery disease. Circ. Genom. Precis. Med. 15, e003598 (2022).
Martin, S. S. et al. 2024 heart disease and stroke statistics: a report of US and global data from the American Heart Association. Circulation 149, e347–e913 (2024).
Maddox, T. M. et al. Nonobstructive coronary artery disease and risk of myocardial infarction. JAMA 312, 1754–1763 (2014).
Park, D. W. et al. Extent, location, and clinical significance of non-infarct-related coronary artery disease among patients with ST-elevation myocardial infarction. JAMA 312, 2019–2027 (2014).
Forrest, I. S. et al. Machine learning-based marker for coronary artery disease: derivation and validation in two longitudinal cohorts. Lancet 401, 215–225 (2023).
Petrazzini, B. O. et al. Coronary risk estimation based on clinical data in electronic health records. J. Am. Coll. Cardiol. 79, 1155–1166 (2022).
Mbatchou, J. et al. Computationally efficient whole-genome regression for quantitative and binary traits. Nat. Genet. 53, 1097–1103 (2021).
Willer, C. J., Li, Y. & Abecasis, G. R. METAL: fast and efficient meta-analysis of genomewide association scans. Bioinformatics 26, 2190–2191 (2010).
Sveinbjornsson, G. et al. Weighting sequence variants based on their annotation increases power of whole-genome association studies. Nat. Genet. 48, 314–317 (2016).
Zhou, W. et al. Efficiently controlling for case–control imbalance and sample relatedness in large-scale genetic association studies. Nat. Genet. 50, 1335–1341 (2018).
Loh, P. R., Kichaev, G., Gazal, S., Schoech, A. P. & Price, A. L. Mixed-model association for biobank-scale datasets. Nat. Genet. 50, 906–908 (2018).
Nikpay, M. et al. A comprehensive 1,000 genomes–based genome-wide association meta-analysis of coronary artery disease. Nat. Genet. 47, 1121–1130 (2015).
Tarugi, P. et al. Molecular diagnosis of hypobetalipoproteinemia: an ENID review. Atherosclerosis 195, e19–e27 (2007).
Ference, B. A. et al. Variation in PCSK9 and HMGCR and risk of cardiovascular disease and diabetes. N. Engl. J. Med. 375, 2144–2153 (2016).
Schmidt, A. F. et al. PCSK9 genetic variants and risk of type 2 diabetes: a mendelian randomisation study. Lancet Diabetes Endocrinol. 5, 97–105 (2017).
Lotta, L. A. et al. Association between low-density lipoprotein cholesterol–lowering genetic variants and risk of type 2 diabetes: a meta-analysis. JAMA 316, 1383–1391 (2016).
Benn, M., Nordestgaard, B. G., Grande, P., Schnohr, P. & Tybjærg-Hansen, A. PCSK9 R46L, low-density lipoprotein cholesterol levels, and risk of ischemic heart disease: 3 independent studies and meta-analyses. J. Am. Coll. Cardiol. 55, 2833–2842 (2010).
Ghoussaini, M. et al. Open Targets Genetics: systematic identification of trait-associated genes using large-scale genetics and functional genomics. Nucleic Acids Res. 49, D1311–D1320 (2021).
Thomas, D. G., Wei, Y. & Tall, A. R. Lipid and metabolic syndrome traits in coronary artery disease: a Mendelian randomization study. J. Lipid Res. 62, 100044 (2021).
Liberzon, A. et al. The Molecular Signatures Database (MSigDB) hallmark gene set collection. Cell Syst. 1, 417–425 (2015).
Schrodi, S. J. The impact of diagnostic code misclassification on optimizing the experimental design of genetic association studies. J. Healthc. Eng. 2017, 7653071 (2017).
Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
Klarin, D. et al. Genetic analysis in UK Biobank links insulin resistance and transendothelial migration pathways to coronary artery disease. Nat. Genet. 49, 1392–1397 (2017).
Honigberg, M. C. et al. Premature menopause, clonal hematopoiesis, and coronary artery disease in postmenopausal women. Circulation 143, 410–423 (2021).
Khera, A. V. et al. Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nat. Genet. 50, 1219–1224 (2018).
Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience 4, 7 (2015).
Kursa, M. B. & Rudnicki, W. R. Feature selection with the Boruta package. J. Stat. Softw. 36, 1–13 (2010).
Rajkomar, A., Dean, J. & Kohane, I. Machine learning in medicine. N. Engl. J. Med. 380, 1347–1358 (2019).
Liaw, A. & Wiener, M. Classification and regression by randomForest. R News 2, 18–22 (2002).
Kuhn, M. Building predictive models in R using the caret package. J. Stat. Softw. 28, 1–26 (2008).
Grün, B., Kosmidis, I. & Zeileis, A. Extended beta regression in R: shaken, stirred, mixed, and partitioned. J. Stat. Softw. 48, 1–25 (2012).
McCaw, Z. R., Lane, J. M., Saxena, R., Redline, S. & Lin, X. Operating characteristics of the rank-based inverse normal transformation for quantitative trait analysis in genome-wide association studies. Biometrics 76, 1262–1272 (2020).
Wojcik, G. L. et al. Genetic analyses of diverse populations improves discovery for complex traits. Nature 570, 514–518 (2019).
Adzhubei, I. A. et al. A method and server for predicting damaging missense mutations. Nat. Methods 7, 248–249 (2010).
Ng, P. C. & Henikoff, S. SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Res. 31, 3812–3814 (2003).
Chun, S. & Fay, J. C. Identification of deleterious mutations within three human genomes. Genome Res. 19, 1553–1561 (2009).
Schwarz, J. M., Cooper, D. N., Schuelke, M. & Seelow, D. MutationTaster2: mutation prediction for the deep-sequencing age. Nat. Methods 11, 361–362 (2014).
Wu, M. C. et al. Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Hum. Genet. 89, 82–93 (2011).
Liu, Y. et al. ACAT: a fast and powerful P value combination method for rare-variant analysis in sequencing studies. Am. J. Hum. Genet. 104, 410–421 (2019).
McLaren, W. et al. The Ensembl Variant Effect Predictor. Genome Biol. 17, 122 (2016).
Mendez, D. et al. ChEMBL: towards direct deposition of bioassay data. Nucleic Acids Res. 47, D930–D940 (2019).
Online Mendelian Inheritance in Man, OMIM®. McKusick-Nathans Institute of Genetic Medicine. (Johns Hopkins University, 2022); https://omim.org/
R Core Team. R: a language and environment for statistical computing. (R Foundation for Statistical Computing, 2019); https://www.r-project.org/
Robin, X. et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinform. 12, 77 (2011).
Petrazzini, B. O. et al. Exome sequence analysis identifies rare coding variants associated with a machine learning-based marker for coronary artery disease. Zenodo https://doi.org/10.5281/zenodo.11086022 (2024).
Acknowledgements
S.N.G. is supported by VA MERIT grant 1I01CX002560. R.S.R. is supported by the National Institute on Aging of the National Institutes of Health R01 AG061186-0 and the National Heart, Lung, and Blood Institute of the National Institutes of Health R01HL157439-01. R.D. is supported by the National Institute of General Medical Sciences of the NIH (R35-GM124836) and the National Heart, Lung and Blood Institute of the NIH (R01-HL139865 and R01-HL155915). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.
Author information
Contributions
B.O.P., I.S.F. and R.D. conceived and designed the study. B.O.P. performed statistical analyses. B.O.P., I.S.F., G.R., H.M.T.V., C.M.-L., A.D., R.C., J.K.P., K.G., S.N.G., W.A.M., R.S.R., D.M.J. and R.D. provided administrative, technical and material support. B.O.P. and R.D. drafted the manuscript. R.D. supervised the study. B.O.P. and R.D. had access to all of the data in the study and take responsibility for the integrity of the data and accuracy of the analysis.
Ethics declarations
Competing interests
R.D. reports being a scientific cofounder, consultant and equity holder for Pensieve Health (pending) and being a consultant for Variant Bio, all not related to this study. R.S.R. reports research funding to his institution from Amgen, Arrowhead, Eli Lilly, Merck, NIH, Novartis, Novo Nordisk, Regeneron and 89Bio; consulting fees from Amgen, Avilar, CRISPR Therapeutics, Editas, Eli Lilly, Lipigon, New Amsterdam, Novartis, Precision Biosciences, Regeneron, UltraGenyx and Verve Therapeutics; nonpromotional honoraria from Meda Pharma; royalties from Wolters Kluwer (UpToDate); and stock holding in MediMergent, LLC. He reports patent applications on methods and systems for biocellular marker detection and diagnosis using a microfluidic profiling device (EFS ID: 32278349; application no. PCT/US2019/026364; provisional); all unrelated to this study. The other authors declare no competing interests.
Peer review
Peer review information
Nature Genetics thanks Matthias Heinig, Samuli Ripatti and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Area under the receiver operating characteristic curves on the testing sets used to evaluate the in silico score for coronary artery disease (ISCAD).
We trained and tested 100 models with independent random sampling. Receiver operating characteristic curves are shown for the current ISCAD model trained on the UK Biobank (a), the All of Us biobank (b) and the BioMe biobank (c). AUC, area under the receiver operating characteristic curve.
Extended Data Fig. 2 Distribution of the in silico score for coronary artery disease (ISCAD) in cases and controls.
We trained and tested 100 models with independent random sampling. Distributions are shown separately for CAD cases and controls for the current ISCAD model trained on the UK Biobank (a), the All of Us biobank (b) and the BioMe biobank (c). Vertical dotted lines represent the median value of each distribution. ISCAD, in silico score for coronary artery disease.
Extended Data Fig. 3 Manhattan plot of rare coding variant association meta-analysis.
We tested 2,738,849 rare missense and protein-truncating variants from 604,915 individuals in the UK Biobank, the All of Us Research Program and the BioMe Biobank. The dotted horizontal line represents an exome-wide significance threshold of P = 4.3 × 10⁻⁷. Two-sided P values, shown on a base-10 logarithmic scale, were obtained from a fixed-effect inverse-variance-weighted meta-analysis. Italicized text indicates gene names.
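As an illustration of the fixed-effect inverse-variance-weighted meta-analysis named in this caption, the following minimal Python sketch combines per-cohort effect estimates for a single variant. It is not the authors' analysis code (which is deposited on Zenodo); the function name `ivw_fixed_effect` and the example effect sizes and standard errors are hypothetical, and the two-sided P value assumes a normal approximation for the test statistic.

```python
import math


def ivw_fixed_effect(betas, ses):
    """Fixed-effect inverse-variance-weighted meta-analysis for one variant.

    betas: per-cohort effect estimates
    ses:   per-cohort standard errors (same order as betas)
    Returns (pooled_beta, pooled_se, z, two_sided_p).
    """
    if len(betas) != len(ses) or len(betas) == 0:
        raise ValueError("betas and ses must be non-empty and of equal length")
    if any(se <= 0 for se in ses):
        raise ValueError("standard errors must be positive")

    # Each cohort is weighted by the inverse of its sampling variance.
    weights = [1.0 / (se ** 2) for se in ses]
    total_weight = sum(weights)

    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)

    z = pooled_beta / pooled_se
    # Two-sided P value under a standard normal: 2 * (1 - Phi(|z|)).
    two_sided_p = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled_beta, pooled_se, z, two_sided_p


# Hypothetical per-cohort estimates for one variant (not taken from the study).
example_betas = [0.10, 0.12, 0.08]
example_ses = [0.02, 0.05, 0.10]

beta, se, z, p = ivw_fixed_effect(example_betas, example_ses)
EXOME_WIDE_THRESHOLD = 4.3e-7  # threshold stated in the caption
print(f"beta={beta:.4f} se={se:.4f} z={z:.2f} p={p:.2e} "
      f"significant={p < EXOME_WIDE_THRESHOLD}")
```

The cohort with the smallest standard error dominates the pooled estimate, which is why the largest cohort typically drives a fixed-effect result.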
Supplementary information
Supplementary Information
Supplementary Results, Methods, Note, Figs. 1–14 and Tables 2–4, 9–10, 13–17, 21–22 and 26–27.
Supplementary Tables
Supplementary Tables 1, 5–8, 11–12, 18–20, 23–25 and 28.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Petrazzini, B.O., Forrest, I.S., Rocheleau, G. et al. Exome sequence analysis identifies rare coding variants associated with a machine learning-based marker for coronary artery disease. Nat Genet 56, 1412–1419 (2024). https://doi.org/10.1038/s41588-024-01791-x
DOI: https://doi.org/10.1038/s41588-024-01791-x
This article is cited by
- Trans-ancestral rare variant association study with machine learning-based phenotyping for metabolic dysfunction-associated steatotic liver disease. Genome Biology (2025)
- A digital marker for coronary artery disease. Nature Reviews Genetics (2024)
- Expanding drug targets for 112 chronic diseases using a machine learning-assisted genetic priority score. Nature Communications (2024)
- Rare variant contribution to the heritability of coronary artery disease. Nature Communications (2024)