Abstract
Background
Polygenic risk scores (PRSs) are increasingly being used to predict disease risk from genetic data. While promising in research, their clinical utility—especially when combined with non-genetic (NG) data such as lab results, physical measurements, and diagnostic history—remains uncertain. Myocardial infarction (MI), a leading cause of morbidity and mortality, is a key use case for assessing the incremental value of PRSs in risk models.
Methods
Using UK Biobank data, we evaluated the added value of PRSs for 10-year MI risk prediction. We trained models with NG data alone and in combination with PRSs, varying model complexity and the NG feature space. Two modeling frameworks were used: logistic regression and a neural network. NG data was defined using two feature sets: NG1, which included established MI risk factors from structured fields; and NG2, a high-dimensional dataset derived from millions of diagnostic codes across five linked UK Biobank electronic health records (EHR) datasets combined with NG1 features. NG2 was generated using a deep representation learning approach that produced low-dimensional embeddings capturing latent medical concepts and disease co-occurrence patterns. Each model was trained with and without PRSs and evaluated using metrics such as the area under the ROC curve (AUC).
Results
PRSs add minimal predictive value when used alone. In contrast, diagnostic data from EHRs significantly improve performance. The best results are achieved using a multimodal neural network combining NG1, NG2, and PRSs.
Conclusions
PRSs provide limited standalone utility for MI prediction compared to detailed diagnostic data. Their clinical value likely lies in integration with EHR-based models. Future work should focus on multi-modal approaches that contextualize PRS information within broader clinical data.
Plain Language Summary
Polygenic risk scores are a measure of how likely it is a person will get a particular disease based on the genes they inherited from their parents. We studied whether polygenic risk scores (PRSs) help predict heart attack occurrence when combined with clinical data such as information from blood tests and medical history. We tested two different computational models using two types of clinical data: one with common risk factors for heart attacks and another with broader data. We found that PRSs added little benefit by themselves; instead, detailed clinical information was much more useful. The best predictions came from combining all data types. This information and our models could be helpful to better identify people who will have heart attacks.
Introduction
Polygenic risk scores (PRSs) have recently emerged as a promising scoring tool for heart disease risk prediction using genotype data. A PRS for a given individual is calculated as a linear combination of an individual’s allelic dosage for a set of single-nucleotide polymorphisms (SNPs) associated with risk. While PRSs are not yet routinely used in clinical practice, lately there has been a rapidly growing interest in their clinical translation1,2,3. Introducing PRSs into clinical practice will require logistical hurdles to be overcome2,4, and thus it is critical to establish a strong rationale for their clinical utility, especially when considered in the context of non-genetic data (NG) that are more readily available for use in disease risk prediction models, such as physical measurements, lab results, lifestyle factors, and diagnostic data2,5,6,7.
When considered independently as a risk factor combined with just basic demographic data, PRSs achieve excellent risk stratification8,9,10. For instance, it was recently established that a PRS for coronary artery disease (CAD), in combination with age, sex, and ancestry-related data, can identify an ~8% swathe of the population with an odds ratio of 3-fold or greater for developing CAD compared to the rest of the population11,12. However, other studies that have explored the disease prediction capacity of PRSs in the context of routinely available clinical data, such as physical measurements, lifestyle questionnaires, or lab results that measure known risk factors13,14,15,16,17,18,19,20,21,22, produced results that paint a somewhat different picture. For example, it was found that when a PRS for CAD is combined in a linear time-to-event model with clinical risk scores already used in practice, namely the Framingham Risk Score23 (FRS) and the American College of Cardiology/American Heart Association (ACC/AHA) pooled cohort equations24,25 score, respectively, the improvement in C-index (i.e., a measure of model performance) attributed to the PRSs (∆C-indexPRS) was less than 0.02 (ref. 26). This was further validated by Elliott et al.14, Mars et al.13, and Isgut et al.16. Thus, while PRSs might be useful risk factors when considered independently or with just demographic data, more research is needed to further elucidate their clinical utility (i.e., value-add) in the context of clinical risk data already available.
An important limitation of existing research on the disease prediction value-add of PRSs is that most studies to date have used linear models with small-scale clinical data. A rapidly growing body of cutting-edge engineering research is leveraging large-scale electronic health records (EHRs) in deep representation learning models for predictive tasks27,28, and these approaches might soon be translated into the clinical setting for disease risk prediction. Several of these models adapt natural language processing-related methods, such as Word2Vec29,30, recurrent neural networks (RNN)31,32,33,34,35,36, long short-term memory (LSTM) networks36, or Bi-directional encoder representations from transformers (BERT)37,38,39, to sequential EHR data, to learn machine-interpretable latent representations of the data that can be used in downstream learning tasks. Others use graph neural networks28,40,41, and related methods42,43 to represent EHR data. These complex big data models may have the potential to someday become the state-of-the-art in clinical risk prediction and could, in some ways, be considered as a proxy for a best-case scenario of clinical data utilization for risk prediction. When considering the disease prediction value-add of PRSs, it will be important to expand our understanding beyond using small-scale risk factors (i.e., blood pressure, select lab test results, lifestyle questionnaires) in linear models. Exploring this question in the context of using large-scale EHR data in nonlinear models that can account for nonlinear interactions between PRSs and individual clinical features (i.e., gene by environment interactions) could potentially improve the estimated value-add of PRSs and provide stronger justification for their translation into the clinical setting44.
To fill these gaps, in this analysis, we used the UK Biobank dataset of ~502,000 individuals45 to systematically evaluate the role of model complexity and clinical feature space on the disease prediction value-add of PRSs for 10-year first-time incident myocardial infarction prediction (also denoted as CAD; coronary artery disease). To avoid potential ancestry-related confounders when analyzing the PRSs, our analysis centered on self-reported White British individuals only, identified using UK Biobank data field 21000. The PRSs used contained 57, 202, ~1.7 million, and ~6 million variants, respectively, and for the purposes of this study were named CAD_57 (ref. 46), CAD_202 (ref. 26), CAD_1.7M (ref. 26), and CAD_6M (ref. 11). We used two categories of clinical (non-genetic; NG) feature space: (a) a small set of 15 established risk factors such as blood pressure, smoking status, and cholesterol that are often measured as part of existing clinical risk scores such as the FRS or ACC/AHA scores (non-genetic feature set 1; NG1), and (b) a large-scale set of 574 diagnostic features derived from EHRs using a deep representation learning algorithm combined with the small-scale NG1 features (non-genetic feature set 2; NG2). The two categories of model complexity were logistic regression (model complexity category 1; LR) and a deep neural network (model complexity category 2; NN). The baseline analysis comprised using LR to train both NG1 and multi-modal NG1 + PRS models and then calculating the improvement in area under the curve (AUC) attributed to including PRSs, ∆AUCPRS = AUCNG+PRS - AUCNG. This was then repeated for each combination of the two variables. We found that PRSs provided minimal value-add for risk prediction independently, while large-scale diagnostic data from EHRs provided much greater value-add for prediction. The multi-modality neural network with large-scale diagnostic data, established risk factors, and PRSs achieved the best overall performance for myocardial infarction risk prediction.
Methods
Experimental design and model training
We used a combinatorial framework to systematically evaluate the myocardial infarction prediction value-add of including PRSs in predictive models compared to using just routine clinical data. Specifically, we define the improvement in disease prediction performance attributed to polygenic risk scores (prediction value-add of a PRS) as ∆AUCPRS, where AUC is the area under the receiver operating characteristic (ROC) curve, and where ∆AUCPRS = AUCNG+PRS – AUCNG. Here, AUCNG represents the disease prediction performance of models that use routinely available non-genetic (NG) input data (e.g., lab test results, imaging, patient health records, physical measurements, and lifestyle factors), and AUCNG+PRS represents the performance of models that use both polygenic risk scores and non-genetic input data for prediction.
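As an illustration of the definition above, the following minimal sketch computes the area under the ROC curve with the rank-sum (Mann-Whitney) identity and then takes the difference between two models scored on the same test set. The function names `roc_auc` and `delta_auc_prs` are hypothetical and are not taken from the study's code; the study itself does not state which software was used to compute AUC.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    labels: sequence of 0/1 outcomes (1 = case); scores: predicted risks.
    Tied scores receive the average of the ranks they span.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j to the end of the current group of tied scores.
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(label for _, label in pairs)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one case and one control")
    rank_sum_pos = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def delta_auc_prs(labels, scores_ng, scores_ng_prs):
    """PRS value-add: AUC of the NG + PRS model minus AUC of the NG model."""
    return roc_auc(labels, scores_ng_prs) - roc_auc(labels, scores_ng)
```

Both score vectors must refer to the same test individuals in the same order, so that the difference reflects only the change in input features.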
We designed four experiments depicted in the grid shown in Fig. 1b. Each experiment comprised a different combination of one of two categories of model (M) complexity (logistic regression; LR vs. feedforward neural network; NN) and one of two categories of non-genetic (NG) feature space (small-scale disease-specific risk factors; NG1 vs. large-scale medical history embedding combined with disease-specific risk factors; NG2). For each experiment, two models were trained: (1) a model using just non-genetic features, and (2) a model using both non-genetic features and PRSs (NG + PRS). Both models used the same combination of categories of model complexity and feature space. This resulted in a total of eight models being trained, two for each combination of model complexity and feature space. The experimental design framework is visualized in Fig. 1 (see Supplementary Table 1 for additional details on the hyperparameter ranges for each model), and the representation learning algorithm used to derive the large-scale diagnostic features is described in the Supplementary Methods and in Supplementary Figs. 1, 2.
LR logistic regression, NN feedforward neural network, NG1 non-genetic dataset 1 (known disease-specific risk factors), NG2 non-genetic dataset 2 (large-scale medical history embedding + known disease-specific risk factors), AUC area under the curve, PRS polygenic risk score. a The established risk factors dataset (NG1) comprised in-clinic measurements, lab test results, and self-reported questionnaire data on lifestyle and demographic factors associated with myocardial infarction risk. For the larger-scale clinical dataset, diagnostic data were extracted from five UK Biobank datasets, including electronic health records (EHRs), and these data were then preprocessed and used collectively in an autoencoder-based representation learning algorithm (see Methods for details) to generate feature embeddings containing data on each individual’s medical history. These diagnostic features were then combined with the NG1 dataset to form the NG2 non-genetic dataset. The two model complexity categories explored comprised logistic regression and neural network, whereby the optimal neural network architecture was selected via hyperparameter tuning. b The four combinations of model complexity and clinical feature space categories are shown. For each combination, a version of the model with just non-genetic data were compared to the respective multi-modal model with non-genetic data combined with PRSs and used to calculate the PRS value-add (∆AUCPRS = AUCNG+PRS – AUCNG). After training, the mean and standard deviation of the performance metrics were computed for each model across ten trials, for each test set prevalence at each percentile, to get a 95% confidence interval using the test set. The bottom left quadrant comparison (NG1 + PRS; LR vs. NG1; LR) uses established risk factors and PRSs in a logistic regression model and is considered the baseline. For the multi-modal NG + PRS models, the four PRSs were added together as features in each model. 
c The heart disease PRSs used are shown. The heatmap (left) displays the Pearson correlations between the PRSs, which ranged from 0.12 to 0.68, and the bar chart (right) shows the AUC for each PRS and the combination thereof for first-time 10-year incident myocardial infarction risk prediction using logistic regression. See the Supplementary Results for details.
Prior to starting any experiments, the UK Biobank samples were randomly split into train, validation, and test sets ten times (ratio of 70/20/10), and each random split was used to train all models for the disease to prevent per-sample variability from having a confounding impact on the results. Random search-based hyperparameter tuning was done during training for each trial. For neural network models, hyperparameters included the Adam optimizer learning rate, input L1 regularization, hidden layer L1 regularization, input dropout, hidden layer dropout, number of hidden layers (from 1 to 3), and number of nodes per hidden layer. For logistic regression models, the hyperparameters were the learning rate and L1 regularization. Because of the challenges in training neural networks due to the larger hyperparameter space, we manually identified optimal hyperparameter ranges for these models prior to running the random search-based tuning. To account for class imbalance between cases and controls, we used a weighted loss function with the weight proportional to the case/control ratio for each of the trials.
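The repeated splitting and class weighting described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the study's implementation: the helper names `repeated_splits` and `positive_class_weight` are hypothetical, the seed is arbitrary, and the weight shown (controls divided by cases, applied to the case class) is one common convention for a weighted loss; the exact form used in the study is not given here.

```python
import random


def repeated_splits(sample_ids, n_trials=10, ratios=(0.7, 0.2, 0.1), seed=0):
    """Yield (train, validation, test) ID lists, one triple per random trial.

    The same triple is meant to be reused for every model in a trial, so that
    differences between models are not driven by different samples.
    """
    rng = random.Random(seed)  # assumed seed, for reproducibility only
    ids = list(sample_ids)
    n_train = round(ratios[0] * len(ids))
    n_val = round(ratios[1] * len(ids))
    for _ in range(n_trials):
        shuffled = ids[:]
        rng.shuffle(shuffled)
        yield (
            shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:],
        )


def positive_class_weight(labels):
    """Loss weight for cases so that cases and controls contribute equally."""
    n_cases = sum(labels)
    n_controls = len(labels) - n_cases
    if n_cases == 0:
        raise ValueError("no cases in this split")
    return n_controls / n_cases
```

Each trial's triple would then be passed to all eight models, which is what allows the per-trial differences in AUC to be compared directly.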
For each experiment, each model was trained for 10 trials, each for a maximum of 50 epochs. The model checkpoint at the best-performing epoch in the validation set was used as the selected model for each trial (more details in the Supplementary Methods). After training, the mean and standard deviation of the performance metrics were computed for each model to get a confidence interval using the test set.
Dataset and study population
We utilized the UK Biobank dataset45 comprising questionnaire results, physical measurements, diagnoses, and other clinical information for ~502,000 individuals. Data in the UK Biobank were collected during assessment visits in which a patient would attend a research site in person, and through online questionnaires or continuously updated national databases (i.e., death or cancer registry, or health records). All participants attended their first assessment visit, which occurred between 2006 and 2010, and all non-genetic features were collected during each participant’s first assessment visit. Participants with no medical history prior to their first assessment visit, with no available genotype data, or who opted to be removed from the UK Biobank were excluded from the analysis. To avoid potential ancestry-related confounders when analyzing the PRSs, our analysis centered on self-reported White British individuals only. The overall process for filtering the study population can be seen in Supplementary Fig. 3. Data from the UK Biobank was used under the approved application number 17984. UK Biobank has obtained ethics approval from the North West Multi-center Research Ethics Committee (REC reference: 11/NW/0382), and all participants provided written informed consent at the time of enrollment. All analyses were conducted in accordance with the relevant ethical guidelines and regulations. No additional institutional review board (IRB) approval was required for this study, as it involved only secondary analysis of de-identified data provided by the UK Biobank under an approved application and governed by their established ethical oversight framework.
Cohort definitions
Individuals were classified based on whether they had a given ICD-10 billing code in their medical record. Five diagnostic datasets available in the UK Biobank were used to identify cases: primary care records, hospital inpatient records, cancer registry data, death registry data, and self-reported questionnaire data (i.e., patients being asked about their medical history). These were encoded using various systems, including ICD-10, ICD-9, a UK Biobank-designed coding system, and Read version 2 and 3 codes (Supplementary Fig. 4). We adapted mappings from each coding system to ICD-10 (Supplementary Table 2) and then selected ICD-10 codes that would reasonably identify individuals as being cases for myocardial infarction: I21 (I21.0, I21.1, I21.2, I21.3, I21.4, I21.9, I21.X), I22 (I22.0, I22.1, I22.8, I22.9), I23 (I23.0, I23.1, I23.2, I23.3, I23.4, I23.5, I23.6, I23.8), I24.1, and I25.2 (Supplementary Table 3). Any individual diagnosed with at least one code in the list at any point in time in their medical history would be considered a prevalent case. To establish cohorts for incident disease risk prediction, we used the year of each participant’s first UK Biobank assessment visit as the baseline time point for that participant and excluded all participants with a preexisting diagnosis prior to their assessment visit. The 10-year incident case and control counts were 7515 and 315,939, respectively. A total of 179,187 individuals were excluded from the analyses based on the exclusion criteria outlined in Supplementary Table 4.
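The cohort logic above can be illustrated with a short sketch. It assumes diagnoses have already been mapped to ICD-10 and reduced to (code, year) pairs; the names `MI_CODE_PREFIXES`, `is_mi_code`, and `classify` are hypothetical. Matching I21, I22, and I23 by prefix is a simplification of the explicit subcode list, and the handling of diagnoses recorded in the baseline year itself is an assumption, since the text specifies only that diagnoses prior to the visit lead to exclusion. The other exclusion criteria (Supplementary Table 4) are not modeled.

```python
# Prefixes (dots removed) standing in for the listed MI codes. Assumption:
# prefix matching on I21/I22/I23 approximates the explicit subcode list.
MI_CODE_PREFIXES = ("I21", "I22", "I23", "I241", "I252")


def is_mi_code(code):
    """True if an ICD-10 code (with or without a dot) is in the MI code set."""
    return code.replace(".", "").upper().startswith(MI_CODE_PREFIXES)


def classify(diagnoses, baseline_year, horizon=10):
    """Label one participant as 'excluded', 'case', or 'control'.

    diagnoses: iterable of (icd10_code, year) pairs.
    baseline_year: year of the first assessment visit.
    """
    mi_years = [year for code, year in diagnoses if is_mi_code(code)]
    if any(year < baseline_year for year in mi_years):
        return "excluded"  # preexisting MI diagnosis before the visit
    if any(baseline_year <= year <= baseline_year + horizon for year in mi_years):
        return "case"  # first MI within the 10-year window (assumed inclusive)
    return "control"
```

For example, a participant assessed in 2008 with an I25.2 record from 2012 would be labeled an incident case, whereas one with an I21.0 record from 2005 would be excluded.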
Preparation of the PRS features set
We selected PRSs that have previously been published for coronary artery disease (CAD) risk prediction. In concordance with conventional approaches, we used the Plink 2.0 software47 to calculate each PRS for each individual as a linear weighted sum of allele dosage by genome-wide association study (GWAS) weight for each variant. This was done using the UK Biobank imputed genotypes dataset of 96 million variants in chromosomes 1–22. As mentioned in the Introduction, the PRSs used contained 57, 202, ~1.7 million, and ~6 million variants, respectively, and for the purposes of this study were named CAD_57 (ref. 46), CAD_202 (ref. 26), CAD_1.7M (ref. 26), and CAD_6M (ref. 11) (see details in Supplementary Table 5). We downloaded the variant IDs, effect alleles, and weights (betas) from the Polygenic Score (PGS) Catalog48 for CAD_6M (publication PGP00006, PRS ID PGS000013) and CAD_57 (publication PGP000042, PRS ID PGS000057) and utilized online supplemental data for the other scores. Specifically, for CAD_1.7M, we used PGS00018 from study PGP000007, and for CAD_202, the SNPs and their weights (log odds) were derived from the CARDIoGRAMplusC4D 1000 Genomes meta-analysis49 by identifying independent SNPs at false discovery rate (FDR) <0.05. Each score was checked to ensure the SNPs were in GRCh37 (hg19), to map to the UK Biobank dataset. The rationale for using more than one PRS was to maximize the scope of information available in PRSs and avoid limiting the analysis to one PRS, thus enabling a more robust analysis50. For example, while the selected PRSs for CAD are correlated (Fig. 1c), collectively they might capture slightly different independent components of genetic risk. Further details on the selection and calculation of the PRSs are provided in the Supplementary Methods.
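The weighted-sum calculation itself is simple enough to show directly. The sketch below is only a toy illustration of the arithmetic, with a hypothetical function name and made-up variant IDs and weights; the scores in this study were computed with Plink 2.0, whose handling of missing genotypes, allele matching, and scaling is not reproduced here. Skipping variants that are absent from the genotype data is an assumption of this sketch.

```python
def polygenic_score(dosages, weights):
    """Weighted sum of effect-allele dosages.

    dosages: {variant_id: dosage of the effect allele, between 0 and 2}
    weights: {variant_id: GWAS weight (beta or log odds) for that allele}
    Variants with a weight but no dosage are skipped (sketch assumption).
    """
    return sum(beta * dosages[v] for v, beta in weights.items() if v in dosages)


# Toy example with made-up variants and weights.
example_weights = {"rs_a": 0.5, "rs_b": -0.2, "rs_c": 0.1}
example_dosages = {"rs_a": 2.0, "rs_b": 1.0}
example_score = polygenic_score(example_dosages, example_weights)
```

In the toy example, the score is 0.5 × 2.0 + (−0.2) × 1.0 = 0.8, with the third variant contributing nothing because no dosage is available for it.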
Derivation of the NG1 feature set (small-scale disease-specific risk factors)
To set up the NG1 feature set, we identified available UK Biobank data fields with large-enough sample sizes that corresponded to MI-specific risk factor data in the scientific literature13,14,16,22,23 and preprocessed the data. Baseline demographic features used were age at recruitment and male sex. Physical measurements used were systolic blood pressure (SBP) and body mass index (BMI). Cholesterol, triglycerides, and lipoprotein A measurements were used as lab test results. Smoking status was the only lifestyle factor included, and a family history of having a mother, father, or sibling diagnosed with heart disease completed NG1 (Supplementary Table 6). Imaging data, time series and wearables data, and non-genetic omics data (i.e., transcriptomics, proteomics, etc.) were excluded and were not considered as PRS or NG data for the purposes of this analysis. Note that amongst the features, two values of systolic blood pressure are included, both taken moments apart. Binary features are one-hot-encoded as one or zero for each to facilitate training.
Derivation of NG2 feature set (large-scale medical history embedding + disease-specific risk factors)
The large-scale diagnostic features dataset was derived using a representation learning algorithm (i.e., a variational autoencoder) that leveraged large-scale data from millions of diagnostic records in five UK Biobank EHR datasets to learn low-dimensional feature embeddings that represent medical concepts relating diagnoses and their comorbidities, combined with the small-scale NG1 features. We used the pre-trained model to derive 574-dimensional medical history representations for participants in the UK Biobank using input data on each individual’s entire available medical history before and through the year of their initial recruitment into the UK Biobank. The input medical history data was aggregated across available diagnostic datasets and then codified using the ICD-10 terminology to indicate, for each of the 19,154 codes, which ones a patient had ever been diagnosed with at any time prior to their recruitment into the UK Biobank study. Out of the 574 features, for select features of interest, we used the integrated gradients framework introduced in Sundararajan et al.51 to interpret each feature to find the most important contributing diagnostic codes (Supplementary Methods and Supplementary Fig. 5). Additional analyses and examples for diagnostic feature interpretation are in the Supplementary Results and Supplementary Figs. 6–9. The performance of the 574 diagnostic features alone in logistic regression as a baseline to the other data modalities is described in Supplementary Table 7.
Calculation and analysis of ∆AUCPRS
We then ran basic statistical analyses on the NG and NG + PRS model performance. We calculated various performance metrics, including AUC, precision, and recall, across each of the ten trials for each model. Then we calculated the differences in AUC between the NG and NG + PRS models to obtain a metric of value-add from including PRSs (∆AUCPRS = AUCNG+PRS - AUCNG) for each trial. For each disease, the ∆AUCPRS results were then compared across models with varying model complexities (LR vs. NN) and clinical feature spaces (NG1 vs. NG2) using the unpaired Student’s t-test and Cohen’s d to explore whether either variable had a significant impact on the disease prediction value-add of PRSs.
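A minimal sketch of the comparison described above follows, using only the Python standard library. It computes Cohen's d with a pooled standard deviation and the equal-variance (Student) t statistic for two sets of per-trial ∆AUCPRS values. The function names are hypothetical, and converting the t statistic to a p value would additionally require the t distribution (for example from a statistics library), which is left out here.

```python
from math import sqrt
from statistics import mean, variance


def pooled_sd(a, b):
    """Pooled sample standard deviation of two independent groups."""
    na, nb = len(a), len(b)
    return sqrt(((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2))


def cohens_d(a, b):
    """Standardized mean difference between groups a and b."""
    return (mean(a) - mean(b)) / pooled_sd(a, b)


def student_t(a, b):
    """Unpaired equal-variance t statistic and its degrees of freedom."""
    na, nb = len(a), len(b)
    t = (mean(a) - mean(b)) / (pooled_sd(a, b) * sqrt(1 / na + 1 / nb))
    return t, na + nb - 2
```

Here `a` and `b` would each hold the ten per-trial ∆AUCPRS values for one configuration, for example NG2; LR and the NG1; LR baseline.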
Feature importance and interactions analysis
To explore whether there were any interactions between features in the neural network (NN) models (NG1 + PRS; NN or NG2 + PRS; NN), we used a tool developed and validated by Tsang et al. (Neural Interaction Detection; NID)52. To further interpret and compare models with large-scale clinical feature sets to those with small-scale risk factor features, we calculated input feature importance weights for each model using the integrated gradients framework51. See the Supplementary Methods for more details.
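To make the integrated gradients idea concrete, the sketch below applies it to a toy logistic model, where the gradient has a closed form. This is not the study's pipeline, which applied the framework to trained neural networks; the weights, the all-zeros baseline, and the number of integration steps are illustrative assumptions. Each attribution is the input's displacement from the baseline multiplied by the average gradient along the straight path between the two, approximated here with a midpoint Riemann sum.

```python
from math import exp


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def predict(weights, bias, x):
    """Predicted risk from a logistic model."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def integrated_gradients(weights, bias, x, baseline, steps=1000):
    """Approximate integrated gradients for each input of a logistic model."""
    grad_sums = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        p = predict(weights, bias, point)
        for i, w in enumerate(weights):
            grad_sums[i] += w * p * (1.0 - p)  # d(prediction)/d(x_i) at point
    return [(xi - b) * g / steps for xi, b, g in zip(x, baseline, grad_sums)]
```

A useful sanity check is the completeness property: the attributions should sum, up to numerical error, to the difference between the prediction at the input and the prediction at the baseline.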
Results
Neither greater model complexity nor larger clinical feature space significantly improves the value-add of PRSs over baseline risk factor data for myocardial infarction risk prediction
We compared the value-add of including PRSs in risk prediction models while independently varying model complexity subcategories (logistic regression, LR vs. neural network, NN) and clinical feature space subcategories (small-scale established risk factors, NG1 vs. large-scale diagnostic data from EHRs, NG2). For each combination of variable subcategories, we compared 10-year first-time incident myocardial infarction risk prediction performance using just non-genetic features (NG) to performance using non-genetic features in combination with PRSs (NG + PRS) (see Supplementary Fig. 10 for the ROC curves of the individual models). For each comparison between NG and multi-modal NG + PRS models, the model complexity and clinical feature space subcategory variables were held constant to assess the independent effect of the PRSs, as described in Fig. 1b. As shown in Fig. 2 and Table 1, we found that for all comparative analyses, adding PRSs resulted in a small but significant improvement in risk prediction performance, with the ∆AUCPRS ranging from 0.007 to 0.011 (more details in Supplementary Data). The baseline comparison using small-scale clinical risk factors in a logistic regression model (NG1; LR vs. NG1 + PRS; LR) had a significant ∆AUCPRS of just 0.010 ± 0.004, which was concordant with results from previous studies on the value-add of PRSs for coronary artery disease-related prediction.
LR logistic regression, NN feedforward neural network, NG1 non-genetic dataset 1 (known disease-specific risk factors), NG2 non-genetic dataset 2 (large-scale medical history embedding + known disease-specific risk factors), AUC area under the curve, PRS polygenic risk score. a The performance, as measured by AUC, is shown for each of the eight models trained. Specifically, for each model complexity and feature space category, the respective NG and NG + PRS model performance is shown. The inclusion of PRSs provides a small but significant improvement in model performance regardless of the variables modified. The greatest magnitude improvement in performance over the baseline (NG1; LR) is achieved when large-scale diagnostic data are added to the model. b The value-add of the PRSs, as calculated with ΔAUCPRS ranging from 0.007 to 0.011 (ΔAUCPRS = AUCNG+PRS – AUCNG), is shown. The quadrants show the comparisons between NG and NG + PRS models for each category combination illustrated in Fig. 1b. While there is no significant improvement in PRS value-add over baseline regardless of the variables modified, the value-add is slightly lower when large-scale clinical features are used in logistic regression (top right quadrant). c The incidence at each percentile of risk is shown for all the logistic regression models (see Supplementary Table 1 for more details). As shown, the NG2 + PRS; LR models and the NG2; LR models have the greatest risk discrimination across percentiles of the risk score, whereby the incidence is greatest (up to ~15%) at the top percentiles and lowest at the lower percentiles of risk. The PRSs alone achieve the worst risk discrimination performance. d, e The recall and precision are shown for all four multi-modal NG + PRS models based on the various combinations of clinical feature space and model complexity for class decision thresholds ranging from 0.5 to 0.97. More details on these results are in Supplementary Results. 
f A violin plot is shown with the distribution of risk scores for cases and controls for each multi-modal NG + PRS model. The NG2 + PRS; NN model achieves the best case/control raw score stratification, whereby case scores cluster near 1 and control scores cluster near zero.
While there was no significant improvement in ∆AUCPRS over baseline attributed to independently increasing the model complexity (i.e., using a nonlinear neural network and small-scale clinical features; NG1; NN vs. NG1 + PRS; NN), independently increasing the clinical feature space by adding the large-scale diagnostic features (NG2; LR vs. NG2 + PRS; LR) resulted in a small reduction in the ∆AUCPRS compared to baseline (∆AUCPRS = 0.007 ± 0.003, p value = 0.051). This reduction in value-add was then compensated for when both large-scale clinical features and a neural network were used in the analysis (NG2; NN vs. NG2 + PRS; NN), but the degree of compensation was minimal, nearly returning the value-add estimate to baseline (∆AUCPRS = 0.009 ± 0.003) rather than dramatically increasing the value-add.
Inclusion of large-scale diagnostic data provides greater value-add for myocardial infarction risk prediction than PRSs
We next explored the independent effect of increasing the clinical feature space on risk prediction performance. When the large-scale non-genetic feature space (NG2) was used in a logistic regression model for 10-year incident myocardial infarction prediction (LR; NG2), the result was a dramatic increase in prediction performance compared to the model utilizing just hand-selected risk factors (LR; NG1). The AUC increased significantly from 0.717 ± 0.007 to 0.765 ± 0.006 (∆AUC = 0.048), which is a nearly 7% improvement in performance (Fig. 2 and Table 1). The original baseline clinical risk factors model alone already achieved relatively strong risk stratification performance, with those in the top percentile of risk having odds ratios of ~3.8 and ~42.0 of being diagnosed with myocardial infarction within 10 years compared to the general population incidence rate and compared to the individuals at the bottom percentile of risk, respectively. The odds ratios for the top percentile of the score increased to ~6.4 and ~536.0, respectively, when the large-scale diagnostic features were added to the model. Furthermore, the top 9% swathe of the population with the highest risk in the NG2 model has an odds ratio greater than or equal to ~3 compared to the population incidence rate, in contrast to just the top 5% for the NG1 model, suggesting that substantially more clinically at-risk individuals can be identified using large-scale non-genetic data (Fig. 2 and Supplementary Table 8). The NG2 logistic regression model also had a significantly greater precision and recall at various class decision thresholds (Fig. 2 and Table 2). The same trends are observed when comparing logistic regression models using multi-modal NG2 + PRS versus NG1 + PRS for risk prediction, where adding diagnostic features resulted in an ∆AUC of 0.044 over the model with baseline non-genetic risk factors alone (Fig. 2 and Table 1).
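The percentile-based odds ratios reported above can be illustrated with a small sketch. The formula used here, the odds of disease in a risk group divided by the odds in a reference group, is the standard definition and is assumed rather than quoted from the study's methods; the function names are hypothetical.

```python
def odds(p):
    """Convert a probability (incidence) into odds."""
    return p / (1.0 - p)


def odds_ratio(incidence_group, incidence_reference):
    """Odds of disease in a risk group relative to a reference group."""
    return odds(incidence_group) / odds(incidence_reference)


def top_fraction_incidence(labels, scores, fraction=0.01):
    """Observed incidence among the highest-scoring fraction of individuals."""
    ranked = sorted(zip(scores, labels), reverse=True)
    n_top = max(1, round(fraction * len(ranked)))
    return sum(label for _, label in ranked[:n_top]) / n_top


def overall_incidence(labels):
    """Observed incidence in the whole test set."""
    return sum(labels) / len(labels)
```

For the comparison against the general population, the reference incidence would be `overall_incidence(labels)`; for the comparison against the bottom percentile, it would be the incidence among the lowest-scoring individuals instead.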
We next did a feature interpretation analysis to gain further insights into the most important factors contributing to the improved performance of the NG2 feature set. We used the learned logistic regression feature weights as proxies to describe feature importance and found that in the NG2 + PRS; LR and the NG2; LR models, the 574 diagnostic features tend to have weights generally closer to zero, with outliers mostly skewed to the right (i.e., positive feature weights), whereas the 15 NG1 risk factor feature weights were significantly higher and more positive on average (Fig. 3a). Using the feature importance weight mean and standard error (across ten trials of logistic regression) for each of the 589 non-genetic features in the NG2; LR and NG2 + PRS; LR models, we ran a two-tailed one-sample t-test to find the features with importance weights significantly less than or greater than zero and found a total of 37–38 non-genetic features (6% of all non-genetic features) to be significant in both models, of which 12–13 were baseline risk factor (NG1) features and 24 were diagnostic features (Fig. 3b, c). The feature importance weights of the NG2 features in the NG2 + PRS; LR and NG2; LR models were highly concordant (Pearson r2 > 0.99), as were the feature importance weights of the NG1 features in the NG1; LR and NG2; LR models (Pearson r2 = 0.98) (Fig. 3d), confirming that the value-add from adding large-scale diagnostic features mostly derives from the independent effects of the added features themselves rather than through changes to the feature importance weights of the small-scale NG1 features.
PRS polygenic risk score features, Abs absolute value, LR logistic regression, NN feedforward neural network, NG1 non-genetic dataset 1 (known disease-specific risk factors), NG2 non-genetic dataset 2 (large-scale medical history embedding + known disease-specific risk factors), SBP systolic blood pressure, BMI body mass index, Hx history, Sig. significant. Diagnoses refers to diagnostic features in the NG2 dataset, excluding NG1 features. a Boxplots with the distribution of feature importance weights for the different data modalities in NG2 + PRS; LR model. Medical history (i.e., diagnostic) features have significantly lower weights on average compared to established risk factors and PRSs. b A scatterplot comparing the absolute value of the mean logistic regression feature importance weight for each of the 589 features in the NG2 + PRS; LR model to the p value of each feature. The p values were calculated using a one-sample two-tailed t-test for each feature weight across ten trials against a null hypothesis of zero weight. Significant features were selected after correcting for multiple comparisons based on the total number of features in each model using the Bonferroni correction (alpha = 0.01). Diagnostic feature 235 has the lowest p value and the highest absolute value mean feature weight. c The number of significant features for each data modality for the NG2 + PRS; LR model. The greatest number of features with importance weights significantly different from zero is from the medical history (i.e., diagnostic) feature set, but these 24 significant features made up <5% of all diagnostic features. In contrast, ~80% of the general datafield (i.e., established risk factor) features were significant. d (Left) The Pearson correlation in mean NG2 feature importance weights between NG2 + PRS; LR and NG2; LR models (i.e., with vs. without adding PRSs). (Right) The Pearson correlation in mean NG1 and PRS feature importance weights between NG1 + PRS; LR and NG2 + PRS; LR models. e Boxplots of positive and negative significant features in the NG2 + PRS; LR model, sorted by median importance weight across ten trials for each feature. Note: SBP listed as (1) and (2) represent the first and second of two consecutive measurements.
The feature with the most significant weight positively associated with 10-year incident first-time myocardial infarction risk for both models was diagnostic feature 235 (i.e., p value = 6.4 × 10−15 for NG2 + PRS; LR), followed by current smoker status, body mass index (BMI), systolic blood pressure, diagnostic feature 365, age, male sex, and having a father with heart disease (Fig. 3e). The most significant feature negatively associated with incident risk was diagnostic feature 545 (p value = 3.10 × 10−12), followed by diagnostic feature 471, which also had the highest magnitude negative mean weight. The only NG1 feature with a significant negative weight was “never smoker” status. We also analyzed correlations among diagnostic features and PRSs and found they were negligible (Supplementary Fig. 11).
Given that each diagnostic feature was originally extracted using the learned latent embedding of a variational autoencoder (see the Methods), each diagnostic feature comprised a composite of information derived from any combination of the 19,154 ICD-10 codes used to train the autoencoder and thus required further interpretation. As described in the Methods section, we used a neural network interpretation algorithm called integrated gradients to gain insights into the meaning of the most important features contributing to the strong 10-year first-time myocardial infarction risk prediction performance of NG2 logistic regression models. We first explored the meaning of diagnostic feature 235, which appears to have the greatest independent impact on improved model performance attributed to using large-scale diagnostic data. As shown in Fig. 4, the top five most important ICD-10 codes positively associated with this feature comprised (in order of importance): (a) E78 (disorders of lipoprotein metabolism and other lipidaemia), (b) J45 (asthma), (c) M19 (arthrosis), (d) K21 (gastroesophageal reflux disease), and (e) I10 (essential hypertension). Further details on the most important ICD-10 diagnostic codes associated with feature 235 are described in Fig. 4, Supplementary Table 9, and the Supplementary Results. The most important ICD-10 codes associated with some of the other diagnostic features are discussed in the Supplementary Results.
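To illustrate how integrated gradients assigns importance to inputs, the sketch below applies the method to a toy logistic function rather than to the autoencoder used in the study. The weights, inputs, and all-zero baseline are hypothetical. Each attribution is the input's difference from the baseline multiplied by the average gradient along the straight path from baseline to input, approximated here with a midpoint Riemann sum.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def integrated_gradients(weights, bias, x, baseline, steps=200):
    """Approximate integrated gradients for f(x) = sigmoid(w . x + b).

    The toy model is a stand-in for a trained network. For this model,
    df/dx_i = w_i * sigmoid'(z), so only the mean slope along the path
    from the baseline to x needs to be accumulated.
    """
    deltas = [xi - bi for xi, bi in zip(x, baseline)]
    mean_slope = 0.0
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval
        point = [bi + alpha * d for bi, d in zip(baseline, deltas)]
        p = sigmoid(sum(w * v for w, v in zip(weights, point)) + bias)
        mean_slope += p * (1.0 - p) / steps  # sigmoid'(z) at this point
    return [d * w * mean_slope for d, w in zip(deltas, weights)]


# Hypothetical model with three binary inputs (e.g., presence of a code).
w, b = [1.5, -0.8, 0.3], -0.2
x, x0 = [1.0, 1.0, 0.0], [0.0, 0.0, 0.0]
attributions = integrated_gradients(w, b, x, x0)
```

A useful sanity check is the completeness property: the attributions should sum to approximately f(x) minus f(baseline). An input that does not differ from the baseline (the third one here) receives an attribution of zero.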
ICD-10 International Classification of Diseases, Tenth Revision, Pos. positive, Neg. negative, Char character. a Distribution of feature importance weights for 19,154 ICD-10 codes, split by hierarchy level in the coding scheme, from least granular (chapter-level) to most granular (5-character). The most relevant chapter-level and block-level codes have the largest absolute value importance weights, whereas the more granular codes individually contribute less towards prediction. b The 18,869 3- to 5-char codes (granular codes), which are more interpretable, are shown. The top 50 positive and negative codes have significantly larger magnitude feature importance than all others. c The top five positive and negative granular codes by mean feature importance for node 235. d Feature importance weights for 18,869 granular codes were aggregated. The top positive and negative code weights were then compared to other codes as a percentage of the total. The top 1000 positive codes and top 1000 negative codes combined (just ~11% of granular codes) comprise 91% of the feature importance weightage. The top positive code (E78) alone makes up 2% of the feature weightage. e Sunburst diagram displays the top 50 positive granular codes (~22% of aggregate feature weightage), including block and chapter-level parent codes. Notes: Extensive details on the derivation and interpretation of the diagnostic features are in the Supplementary Methods and Results. The feature importance weights shown are the mean across ten trials. Diagnostic feature 235 is referred to as a node here. ICD-10 Codes: Chapter IV: E78 = disorders of lipoprotein metabolism and other lipidemia (E780 = pure hypercholesterolemia), E66 = obesity (E669 = unspecified), E14 = diabetes mellitus, E11 = non-insulin-dependent diabetes mellitus. Chapter XVIII: R07 = pain in throat and chest (R074 = chest pain), R06 = abnormalities of breathing. Chapter IX: I10 = essential hypertension, I20 = angina pectoris. 
Chapter X: J45 = asthma (J459 = unspecified), J30 = vasomotor and allergic rhinitis, J34 = other disorders of nose and nasal sinuses, J32 = chronic sinusitis, J06 = acute upper respiratory infections of multiple unspecified sites (J069 = acute upper respiratory infection, unspecified), J02 = acute pharyngitis, J18 = pneumonia, J22 = unspecified lower respiratory infection. Chapter XIII: M19 = other arthrosis (M199 = unspecified), M17 = arthrosis of knee (M179 = unspecified), M25 = other joint disorders (M255 = pain in joint), M10 = gout, M13 = other arthritis, M79 = other soft tissue disorders, M75 = shoulder lesions, M54 = back pain, M47 = spondylosis. Chapter XI: K21 = gastroesophageal reflux disease (K210 = with esophagitis, K219 = without esophagitis), K44 = diaphragmatic hernia, K40 = inguinal hernia, K80 = cholelithiasis, K57 = diverticular disease of intestine. Other Chapters: F32 = depressive episode, Z86 = personal history of other diseases, N40 = prostate hyperplasia, G47 = sleep disorders, H26 = cataracts, A16 = respiratory tuberculosis.
In summary, we found that adding large-scale diagnostic data to a 10-year myocardial infarction risk prediction model improves risk prediction performance over baseline risk factors substantially more than adding PRSs does, and that this value-add is mainly driven by the independent effects of composite diagnostic features comprising aggregate information derived from large numbers of ICD-10 codes. Although the value-add attributed to PRSs is minimal in comparison, it is notable that these features are relatively important compared to most of the non-genetic features used in both the multi-modal NG1 + PRS and NG2 + PRS logistic regression models. For example, in the NG2 + PRS; LR model comprising a total of 593 features, all four PRSs were among the top 35 features with the largest positive mean weights. In both NG1 + PRS; LR and NG2 + PRS; LR models, the CAD_6M PRS, which is calculated using data from ~6 million genomic variants, had the fourth-highest mean weight amongst all features and was the PRS most significantly associated with risk. Accordingly, although the performance gain from the PRSs was only incremental, the multi-modal NG2 + PRS; LR model was nonetheless the best-performing logistic regression model for 10-year incident first-time myocardial infarction risk prediction (AUC = 0.773 ± 0.006), with an overall improvement in AUC of 0.055 over the baseline risk factors model (NG1; LR), suggesting that inclusion of both data modalities provides the most value.
There are interactions between PRSs and clinical features, but most of the value-add of PRSs is through their independent additive effect
While we found that independently increasing the model complexity had no significant impact on the PRS value-add as measured by AUC, the multi-modal NG1 + PRS; NN model achieved significantly higher recall than the NG1; NN model at class decision thresholds 0.6, 0.7, and 0.8, a result that was not found in the respective logistic regression model comparison, suggesting that inclusion of PRSs in a neural network may improve model sensitivity for identifying at-risk individuals (Fig. 2 and Table 2).
We next sought to gain deeper insights into the potential feature interactions occurring between clinical features (NG–NG interactions), between clinical features and PRSs (NG–PRS interactions), and between PRSs (PRS–PRS interactions) in the multi-modal neural network model (NG1 + PRS; NN). Using the neural interaction detection (NID) approach52 pioneered by Tsang et al., we calculated 171 pairwise interaction weights across the 19 input features (15 NG risk factors and four PRSs), averaged across ten trials (Fig. 5a, b). On average, NG–NG pairwise feature interactions had significantly higher weights than NG–PRS interactions, which were significantly higher on average than PRS–PRS interaction weights (Fig. 5c). The baseline risk factor features with the greatest level of pairwise interaction with other features were age, body mass index (BMI), male sex, and systolic blood pressure. Some of the top NG–NG pairwise interactions included (no specific order): (a) age and systolic blood pressure, (b) age and male sex, (c) age and BMI, (d) male sex and systolic blood pressure, and (e) male sex and current smoking status. The CAD_6M PRS had the greatest number and magnitude of pairwise feature interactions compared to the other PRSs. The top NG–PRS pairwise interactions included: (a) CAD_6M and age, (b) CAD_6M and male sex, and (c) CAD_6M and systolic blood pressure (Fig. 5d). The top interacting features were generally those with the highest feature importance weights.
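The idea behind weight-based pairwise interaction scoring can be sketched as follows. This is a simplified illustration in the spirit of NID, restricted to a network with a single hidden layer (an assumption made here for brevity). For each hidden unit, the smaller of the two absolute input weights is scaled by the absolute outgoing weight of that unit, and the result is summed over hidden units. The weights shown are hypothetical, and the study's actual architecture and implementation may differ.

```python
import math
from itertools import combinations


def nid_pairwise_strengths(first_layer, output_weights):
    """Pairwise interaction strengths for a one-hidden-layer network.

    first_layer[j][i] is the weight from input i to hidden unit j;
    output_weights[j] is the weight from hidden unit j to the output.
    Returns {(a, b): strength} for every unordered pair of inputs.
    """
    n_inputs = len(first_layer[0])
    strengths = {}
    for a, b in combinations(range(n_inputs), 2):
        strengths[(a, b)] = sum(
            abs(w_out) * min(abs(row[a]), abs(row[b]))
            for row, w_out in zip(first_layer, output_weights)
        )
    return strengths


# Hypothetical weights: two hidden units, three inputs.
W1 = [[1.0, -2.0, 0.0],
      [0.5, 0.5, 3.0]]
w_out = [2.0, -1.0]
strengths = nid_pairwise_strengths(W1, w_out)

# Number of unordered feature pairs for 19 and for 593 input features.
print(math.comb(19, 2), math.comb(593, 2))  # 171 175528
```

The pair counts follow from n(n − 1)/2: 19 input features give the 171 pairwise weights discussed above, and a 593-feature model gives 175,528 possible pairs.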
BMI body mass index, SBP systolic blood pressure, BP blood pressure, Hx history, NG non-genetic features, PRS polygenic risk scores, NG1 non-genetic feature set 1, NN model complexity subcategory 2 (neural network). a A heatmap of all 171 pairwise interaction weights between the 19 features in the NG1 + PRS; NN neural network model. The features with the greatest number and magnitude of interactions included age, BMI, and male sex. The PRSs were amongst the least-interacting features, but CAD_6M was the score most enriched for feature interactions. The mean interaction weight across all ten trials is shown. b A histogram describing the distribution of mean pairwise interaction weights (across ten trials) for each of the NG1 + PRS; NN models and its counterpart neural network model with just non-genetic data (NG1; NN). The inclusion of PRSs skews the distribution towards having smaller numbers of mean interaction weights in general. Interactions between PRSs and other features contribute to ~39% of all 171 pairwise interactions on the NG1 + PRS model. c A bar chart with the mean pairwise interaction weight and 95% confidence interval for each subcategory of feature interactions—NG–NG (two non-genetic features), NG–PRS (a non-genetic feature and a PRS), and PRS–PRS (two PRSs) for the NG1 + PRS; NN model. The NG–PRS and PRS–PRS pairs have significantly lower interaction weights than NG–NG pairs. d The pairwise interactions for the NG1 + PRS; NN model with the highest mean weights are shown. The top NG–PRS interactions include CAD_6M. e Scatterplot of NG–NG mean pairwise interaction weights between the NG1; NN and NG1 + PRS; NN models. Adding PRSs has a minimal effect on the distribution of NG–NG interaction weights.
Finally, we confirmed that there was no significant difference in mean pairwise NG–NG interaction weights between the NG1; NN model and the multi-modal NG1 + PRS; NN model, and that the feature importance weights of the NG features were concordant between models (Fig. 5e). Thus, any improvement in performance attributed to the PRSs relates to either NG–PRS interactions, PRS–PRS interactions, or the independent additive effect of the PRSs. Given that there was no significant improvement in the ∆AUCPRS between neural network (NN) and logistic regression (LR) models, the latter of which do not inherently capture feature interactions, the primary value of PRSs seems to be driven by their independent additive effect. While NG–PRS and PRS–PRS interactions exist, their effect on PRS value-add is minuscule and is mostly skewed towards improving model recall (i.e., correctly identifying more incident cases) rather than precision.
Multi-modality neural network models with large-scale diagnostic data, established risk factors, and PRSs achieve the best overall performance for myocardial infarction risk prediction
We then modified both clinical feature space and model complexity variables together by training neural networks with established risk factors combined with large-scale diagnostic data, with and without also including PRSs (NG2; NN vs. NG2 + PRS; NN). While the two neural network models utilizing large-scale diagnostic data had significantly higher recall (at class decision thresholds ranging from 0.5 to 0.99) compared to all other models (Fig. 2d, e), there was no significant difference in recall between the NG2 + PRS; NN and NG2; NN models, with recall at class decision threshold 0.5 as high as 85% (±1%) for both models. Even as the class decision threshold is increased to 0.8, the recall remains relatively high at 63–65% (±3%). This contrasts with the stark drop in recall at higher decision thresholds for all other models, with recall ranging from ~68% (±2%) at class decision threshold 0.5 to as low as 7–8% at class decision threshold 0.8 for the baseline NG1; LR and NG1 + PRS; LR models (Table 2). This can be further visualized in the violin plots in Fig. 2f showing the distribution of output scores ranging from zero to one for cases and controls, whereby cases tend to heavily cluster around having higher risk scores for the NG2 + PRS; NN model.
This improvement in recall seems to be at the expense of precision, wherein the NG2 + PRS; NN and NG2; NN models achieve slightly but significantly lower precision at class decision thresholds from 0.5 to 0.9 compared to all other models. As shown in Fig. 2f, there is a cluster of controls with very low scores close to zero, but there is a smaller cluster of controls with scores that cluster closer to one, with few in between. The NG2; NN model case/control distribution is similar. Thus, these models appear to better differentiate between 10-year incident myocardial infarction cases and controls but result in a cluster of apparent false positives with high estimated risk.
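The trade-off between recall and precision at a given class decision threshold can be made concrete with a small sketch. The scores and labels below are hypothetical, and treating a score greater than or equal to the threshold as a positive call is an assumed convention.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when scores >= threshold are called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall


# Hypothetical scores for 4 cases (label 1) and 6 controls (label 0).
scores = [0.95, 0.85, 0.70, 0.30, 0.90, 0.60, 0.20, 0.10, 0.05, 0.02]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
for threshold in (0.5, 0.8):
    print(threshold, precision_recall_at(scores, labels, threshold))
```

In this toy example, raising the threshold from 0.5 to 0.8 lowers recall from 3/4 to 2/4 while precision changes from 3/5 to 2/3, because both true and false positives are removed.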
To gain further insights into the performance of these neural network models, we first investigated their feature importance weights and compared these to the counterpart logistic regression models that used large-scale diagnostic data. There was strong concordance in the feature importance between the NG2 + PRS; NN and NG2 + PRS; LR models, as well as between the NG2; NN and NG2; LR models (Pearson r² = 0.96 for both comparisons; Fig. 6a). However, there was a ~50% increase in the number of diagnostic features with importance weights significantly greater than or less than zero, while the number of significant established risk factor features and PRSs remained relatively consistent (Fig. 6b, c). This suggests that the use of a neural network does not change the trend of which features are most important, but rather leverages certain diagnostic features differently for prediction. This is further supported by our analysis of feature interactions in the NG2 + PRS; NN model, where we found that the 175 highest-weight interacting feature pairs (top 0.1%) out of all 175,528 possible unique pairs were heavily enriched for interactions between two diagnostic features (Diag–Diag; ~39%), a diagnostic feature and an established risk factor feature (Diag–NG1; ~33%), and a diagnostic feature and a PRS feature (Diag–PRS; ~6%), with the remaining ~22% of top feature pairs comprising NG1–NG1 and NG1–PRS interactions combined (Fig. 6d–f).
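The concordance measure used above, the squared Pearson correlation between two sets of mean feature importance weights, can be computed as in the following minimal sketch. The weight vectors are hypothetical.

```python
def pearson_r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


# Hypothetical mean weights for the same five features in two models.
weights_lr = [0.40, 0.10, -0.20, 0.05, 0.30]
weights_nn = [0.80, 0.20, -0.40, 0.10, 0.60]  # exactly twice the LR weights
concordance = pearson_r_squared(weights_lr, weights_nn)
```

Because the measure is invariant to rescaling, two models whose weights differ only by a constant factor, as in this toy example, have a value of 1 up to floating-point error. High concordance therefore indicates agreement in the relative ranking and spacing of feature weights rather than in their absolute magnitudes.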
LR logistic regression, NN neural network, NG1 non-genetic dataset 1 (known disease-specific risk factors), NG2 non-genetic dataset 2 (large-scale medical history embedding + known disease-specific risk factors), PRS polygenic risk score, SBP systolic blood pressure, BMI body mass index, Hx history, Diag diagnostic features (NG2 feature set excluding the NG1 features). a (Left) Scatterplot of the mean feature importance weights across ten trials of the 589 NG2 features for the NG2; NN (neural network) model and its counterpart NG2; LR (logistic regression) model. The features are highly correlated (Pearson r2 = 0.96). (Right) The same is shown for the multi-modal NG2 + PRS; NN (neural network) model and its counterpart logistic regression model. b Bar chart of the number of significant features for the NG2; LR model and its counterpart neural network model (NG2; NN). Feature significance was calculated using a one-sample two-tailed t-test comparing the mean and standard error of each feature importance weight across ten trials to zero, followed by the Bonferroni correction (alpha = 0.01) based on the 589 total features. Features with weights significantly greater than or less than zero were “significant”. c A boxplot is shown with the distribution of feature importance weight for each of the positive significant features across the trials for the NG2 + PRS; NN model, color-coded by the feature data modality. d Boxplot comparing each type of pairwise interaction between features in the NG2 + PRS; NN model. e Bar chart showing mean interaction weight for the top pairwise interactions in the NG2 + PRS; NN model, color-coded by type of pairwise interaction, with 95% confidence intervals and per-trial weights for each feature. f The top 0.1% of the 175,528 possible pairwise groupings of features in the NG2 + PRS; NN model with the highest feature interaction weights are shown in a stacked bar chart, split based on the proportion of each type of pairwise grouping. 
The top 0.1% interacting pairs of features are enriched for Diag–NG1, Diag–Diag, and NG1–NG1 interactions. The bottom 0.1% are also shown and are most heavily enriched for Diag–NG1 and Diag–Diag interactions.
Discussion
In this study, we sought to systematically evaluate the value-add of integrating polygenic risk scores (PRSs) in 10-year incident first-time myocardial infarction risk prediction models in combination with established data on non-genetic risk factors that are highly validated and already used routinely for prediction, often as part of clinical risk scores such as the Framingham Risk Score53 (FRS). Our baseline results were consistent with previous studies, with AUC improving by just 0.01 when PRSs were added as features to a model with established clinical risk factor features (e.g., body mass index, smoking status, and cholesterol).
Given that previous studies used linear models and typically just a handful of established risk factor features, we hypothesized that including PRSs along with larger quantities of clinical features in a nonlinear neural network model that can leverage feature interactions to improve performance would result in a substantially greater estimated value-add for prediction, resulting from potential interactions between the PRSs and the many clinical features. To explore this, we systematically analyzed the value-add of PRSs in the context of modifying two variables—clinical feature space (i.e., the total breadth of clinical data used for prediction) and model complexity (i.e., use of a linear model for prediction versus a nonlinear neural network). For the large-scale clinical feature set, we utilized diagnostic features extracted from EHRs containing data on all recorded diagnoses for each individual, using a representation learning algorithm. We modified each variable independently and then both at once, but we found that regardless of the variables modified, the value-add of PRSs did not increase.
In contrast, we found that including large-scale diagnostic data from EHRs improved prediction performance to a greater extent, with an improvement in AUC of ~0.05 over baseline established clinical risk factors alone and with about 4% more of the population identified as having three-fold or higher odds of being diagnosed with myocardial infarction within 10 years from baseline compared to the population incidence rate. The improvement in the ability of these models to capture true cases (i.e., the recall) further increased significantly when these features were used in a neural network, possibly resulting partially from nonlinear interactions between features.
When we further analyzed the relative importance of the features contributing to the performance of the large-scale clinical data models, we found that the most significant feature was one of the extracted diagnostic features (diagnostic feature 235). This feature alone achieved similar importance as top-ranking established myocardial infarction risk factors such as age and sex. When assessed alone in a linear classifier for 10-year incident myocardial infarction risk prediction, this diagnostic feature achieved higher performance than the four PRSs combined (AUC = 0.61, Supplementary Results). All diagnostic features combined achieved an AUC of 0.70 ± 0.01, which approaches that of the established risk factor features alone (AUC = 0.72 ± 0.01). When we explored the ICD-10 diagnostic codes most important towards the meaning of diagnostic feature 235, we found that these codes were heavily enriched for conditions shown in the literature to be associated with increased myocardial infarction risk. For instance, hyperlipidemia (elevated blood lipid levels) is one of the most well-established risk factors for heart disease and plays a significant role in its development54,55. Other prominent codes included those for obesity and type 2 diabetes, both of which frequently occur together and are strongly linked to increased heart disease risk56,57. While not as well-established as risk factors, gastroesophageal reflux and gastritis—also relevant to the classification of node 235—have recently been identified as independent risk factors for heart disease58,59. These are discussed further in the Supplementary Results and Supplementary Discussion.
Our results are the first to our knowledge to systematically compare the value-add of PRSs in the context of using large-scale clinical data and complex nonlinear multi-modal models for prediction and may provide useful insights for researchers and decision-makers considering the implementation of PRSs into clinical settings. EHRs are readily available to clinicians, and their usage is widespread across medical establishments. Thus, logistical barriers towards the implementation of clinical myocardial infarction prediction risk models that utilize an individual’s medical history for prediction are relatively low. In contrast, most patients do not currently have genomic data. The integration of PRSs into clinical settings, therefore, involves more logistical barriers, and they provide a lower value-add for risk prediction.
Although we found that the value-add of PRSs is lower for myocardial infarction risk prediction than that of EHRs, there is nonetheless significant potential value in translating PRSs into clinical settings. For example, for other diseases such as certain types of cancer or autoimmune disease, PRSs have been shown in the literature to provide greater value-add (e.g., ΔAUCPRS ≈ 0.04 for breast cancer13). Furthermore, the question of the value-add of PRSs for myocardial infarction risk prediction over clinical data alone is only relevant in contexts in which clinical data are available for a given individual. For patients with no available medical history, whether because they have not had significant healthcare utilization or because they are new to a given healthcare system, for example, it may not be feasible or effective to use clinical data for prediction. In these cases, PRSs combined with basic demographic information might be the best available predictors, providing myocardial infarction risk prediction AUCs of ~0.6 alone and even higher when combined with demographic data. These advantages of PRSs further extend to applications related to predicting lifetime myocardial infarction risk from an early age, such as adolescence or young adulthood, a time when medical history might be relatively sparse for many. PRSs can be used to identify genetically at-risk youth who might benefit from primary prevention programs. In these cases, which are outside the context of this study, there is likely significant value attributed to the use of PRSs for myocardial infarction risk prediction.
Additionally, despite our finding that the performance improvements attributed to PRSs were minimal, these improvements were significant, and the multi-modal models utilizing conventional risk factors, PRSs, and large-scale diagnostic data were the best-performing models overall. Thus, if the goal in translating PRSs into the clinic is to maximize risk stratification performance regardless of the value-add of an individual data modality, PRSs should be integrated into risk prediction scores. In our study, the multi-modal models that included PRSs were able to identify an additional 1% of the population at over three-fold greater risk of developing myocardial infarction in comparison to models with clinical data alone, regardless of whether the clinical models comprised small-scale clinical features or large-scale diagnostic data. In aggregate, these improvements can impact a small but non-negligible minority of patients.
An interesting finding in our analysis was that the nonlinear neural network models with large-scale diagnostic data significantly improved model sensitivity for identifying cases and strongly differentiated individuals into high or low risk score categories, regardless of whether these models integrated PRSs. However, this was at the expense of slightly reduced precision and did not significantly change the AUC compared to the logistic regression model with large-scale diagnostic data. These findings suggest that neural networks may be capturing nonlinear trends between input diagnostic features. Some of the false positives may potentially be due to individuals who experience their first myocardial infarction diagnosis beyond 10 years, who might have similar clinical risk profiles at baseline. Thus, our results support the potential value of neural networks with large-scale diagnostic data for myocardial infarction risk prediction and warrant further investigation.
There are several limitations of our approach. First, we focused our analysis on individuals of European ancestry only, given that this was the most highly represented ancestry group in the UK Biobank dataset. Furthermore, the UK Biobank is mostly a self-selected population and is limited to individuals in the UK, which has cultural, ancestry-related, and other factors that differentiate it from other parts of the world. As EHR-based representation learning algorithms extrapolate patterns in the diagnostic data, these depend on the general frequencies of diagnoses and comorbidities between the diagnoses that occur within the UK Biobank, which may not translate to broader populations. The same can be said about our 10-year incident myocardial infarction risk prediction models. Future work should adapt these frameworks to other ancestry groups and populations using datasets such as the Million Veteran Program (MVP)60, All of Us61, FinnGen62, and others, to explore these findings in broader contexts.
Our study also focused on predicting first-time 10-year incident myocardial infarction and excluded individuals who had already experienced a documented myocardial infarction prior to the start of the incidence window. Including these individuals, i.e., by evaluating 10-year incident myocardial infarction risk independent of prior diagnosis with the condition, might have modified the results of our study in several ways. First, having a high genetic risk for myocardial infarction increases the chances that an individual will have their first event early, and as such, including these individuals in the study might skew the study population to include participants with relatively higher genetic risk, thus increasing the overall performance of the PRSs. Second, including participants who already had their first event prior to the start of the study would likely result in much stronger performance of the EHR-based feature set in our study, given that having had a previous myocardial infarction event is one of the strongest predictors of a future event, independent of genetic risk63. Thus, had we explored that question, it likely would not have affected our key results, or might even have further diluted the predicted value-add of PRSs. Nonetheless, future work is warranted to evaluate this empirically.
Other limitations involved the input data used for the analysis. For our baseline clinical feature set (i.e., the established risk factors), we used common risk factor features present in several of the major coronary artery disease clinical risk scores14,53,64 rather than the clinical risk scores themselves (i.e., FRS) as individual features. Integration of these scores themselves in this analysis framework might be an avenue for future work. In addition to risk factors such as cholesterol and body mass index that are used in most scores, we could also have used a broader scope of known risk factors as part of the baseline feature set, such as the diagnostic features used in the QRISK3 risk score64. This would likely not further improve the independent performance of the clinical features alone but may provide more opportunity for interaction effects between clinical and PRS features to potentially synergistically improve the value-add of PRSs. Another limitation relates to the PRSs that were used. Rather than just one, we selected several PRSs to leverage genetic information that might be independently contributed by each score. However, we could have selected a broader swathe of PRSs available in the literature, including scores designed specifically for myocardial infarction65,66,67, and this is warranted for future work. The granularity of case phenotype definitions used in designing a PRS can have an impact on score results, so a future study leveraging MI-specific PRSs may be warranted68. Furthermore, for the large-scale diagnostic features, there could have been alternative approaches used for extracting features other than the variational autoencoder method. There are many EHR representation learning algorithms in the literature that could be considered27. 
While our approach was sufficient for the purposes of this study to demonstrate the substantially greater value-add of large-scale diagnostic data for myocardial infarction prediction over established risk factors, more sophisticated algorithms might better capture potential interactions between PRSs and clinical features and result in a greater value attributed to integrating both data modalities. Finally, as this study broadly uses diagnostic data from various sources available from the UK Biobank, future work might further explore this question in specific clinical settings, for example by comparing ambulatory and in-patient care, to determine to what degree the results of the study vary across these contexts.
Overall, through our systematic analysis of the value-add of PRSs in the context of varied model complexity and clinical feature sets, our results suggest that using large-scale EHR data in neural networks is a promising avenue towards improving myocardial infarction risk prediction and provides greater value-add for prediction than using polygenic risk scores. Although genomic data only minimally improved 10-year risk prediction performance over EHR data in our study, it is worth noting that there may still be untapped value in the multi-modality integration of large-scale EHR-based and genomic data in advanced deep learning models. Gene-by-environment interactions are well documented in the literature, so it follows that there may be the potential for greater improvements in performance from the multi-modality approach. For example, as deep learning models become more sophisticated in the future, and as larger population-level datasets become available that include both genomic and EHR data, it is plausible that interaction effects, if they exist, might be better captured, strengthening the case for including PRSs in 10-year myocardial infarction prediction models. Approaches that might better enable this include using different genomic features (e.g., genotypes rather than PRSs as univariate scores) and using disease prediction models more advanced than those in this study, for example with attention or residual components that might better capture patterns representing these interactions. Future work is warranted to build on this analysis to explore the question in broader contexts.
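The notion of value-add used throughout can be made concrete as the difference in AUC between a model with and without PRS features. The sketch below is illustrative only: the function names are hypothetical, the AUC is computed with a simple pairwise definition rather than the pipeline used in the study, and the toy labels and scores are invented and are not study results.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random case outscores a random control.

    Ties count as half a win. This pairwise form is quadratic in cohort size,
    so it suits small illustrations rather than biobank-scale data.
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


def delta_auc_prs(labels: Sequence[int],
                  scores_ng: Sequence[float],
                  scores_ng_prs: Sequence[float]) -> float:
    """AUC of the NG+PRS model minus AUC of the NG-only model, on the same people."""
    return auc(labels, scores_ng_prs) - auc(labels, scores_ng)


# Invented toy data: 1 = incident MI, 0 = control.
y = [0, 0, 1, 1, 0, 1]
risk_ng = [0.10, 0.40, 0.35, 0.80, 0.20, 0.30]      # hypothetical NG-only predictions
risk_ng_prs = [0.10, 0.30, 0.35, 0.80, 0.20, 0.40]  # hypothetical NG+PRS predictions

delta = delta_auc_prs(y, risk_ng, risk_ng_prs)
```

A positive `delta` means the PRS-augmented model ranks cases above controls more often than the NG-only model on the same held-out participants; a value near zero corresponds to the minimal value-add reported here.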
Altogether, our study represents, to our knowledge, one of the most comprehensive explorations to date of the value-add of PRSs over clinical data alone for myocardial infarction risk prediction. On a practical level, the results of this study can provide value for practitioners and healthcare centers deciding how and in what contexts to integrate PRSs into clinical practice for risk assessments. Furthermore, they can provide a baseline for future studies and discussions that explore this question in different populations and clinical settings.
Data availability
The original datasets used for this study are not publicly available. They were obtained by the authors under approved license 17984 from the UK Biobank Resource, https://www.ukbiobank.ac.uk/. The associated files contain all data necessary to replicate the findings of this research.
Supplementary Data Files 1–9 provide additional details supporting the findings of this study. Supplementary Data File 1 contains detailed results for ΔAUCPRS comparisons across subcategories of model complexity and feature space. Supplementary Data File 2 includes feature importance weights and p values across ten trials for models using the NG2 feature set (logistic regression and neural networks), both with and without PRSs, as well as the underlying data for Fig. 3. Supplementary Data Files 3 and 4 contain interactive sunburst plots illustrating the semantic structure of diagnostic feature 235, showing the top 500 positively and top 500 negatively associated ICD-10 codes, respectively. Supplementary Data File 5 lists the top 50 positive and top 50 negative codes most strongly associated with diagnostic feature 235, the data used for Fig. 4. Supplementary Data File 6 contains the data used for Fig. 1. Supplementary Data File 7 contains the data used for Fig. 2. Supplementary Data File 8 contains the data used for Fig. 5. Supplementary Data File 9 contains the data used for Fig. 6.
Code availability
The underlying code for this study, and the training and validation datasets, are not publicly available but may be made available to qualified researchers upon reasonable request from the corresponding author.
References
Wand, H. et al. Improving reporting standards for polygenic scores in risk prediction studies. Nature 591, 211–219 (2021).
Torkamani, A., Wineinger, N. E. & Topol, E. J. The personal and clinical utility of polygenic risk scores. Nat. Rev. Genet. https://www.nature.com/articles/s41576-018-0018-x (2018).
Lewis, A. C. F. & Green, R. C. Polygenic risk scores in the clinic: new perspectives needed on familiar ethical issues. Genome Med. 13, 14 (2021).
Hao, L. et al. Development of a clinical polygenic risk score assay and reporting workflow. Nat. Med. 28, 1006–1013 (2022).
Lewis, C. M. & Vassos, E. Polygenic risk scores: from research tools to clinical instruments. Genome Med. 12, 44 (2020).
Kumuthini, J. et al. The clinical utility of polygenic risk scores in genomic medicine practices: a systematic review. Hum. Genet. https://doi.org/10.1007/s00439-022-02452-x (2022).
Adeyemo, A. et al. Responsible use of polygenic risk scores in the clinic: potential benefits, risks and gaps. Nat. Med. 27, 1876–1884 (2021).
Khan, A. et al. Genome-wide polygenic score to predict chronic kidney disease across ancestries. Nat. Med. 28, 1412–1420 (2022).
Paul, K. C., Schulz, J., Bronstein, J. M., Lill, C. M. & Ritz, B. R. Association of polygenic risk score with cognitive decline and motor progression in Parkinson disease. JAMA Neurol. 75, 360–366 (2018).
Shams, H. et al. Polygenic risk score association with multiple sclerosis susceptibility and phenotype in Europeans. Brain https://doi.org/10.1093/brain/awac092 (2022).
Khera, A. V. et al. Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nat. Genet. 50, 1219–1224 (2018).
Schork, A. J., Schork, M. A. & Schork, N. J. Genetic risks and clinical rewards. Nat. Genet. 50, 1210–1211 (2018).
Mars, N. et al. Polygenic and clinical risk scores and their impact on age at onset and prediction of cardiometabolic diseases and common cancers. Nat. Med. 26, 549–557 (2020).
Elliott, J. et al. Predictive accuracy of a polygenic risk score–enhanced prediction model vs a clinical risk score for coronary artery disease. JAMA 323, 636–645 (2020).
Kachuri, L. et al. Pan-cancer analysis demonstrates that integrating polygenic risk scores with modifiable risk factors improves risk prediction. Nat. Commun. 11, 6084 (2020).
Isgut, M., Sun, J., Quyyumi, A. A. & Gibson, G. Highly elevated polygenic risk scores are better predictors of myocardial infarction risk early in life than later. Genome Med. 13, 13 (2021).
Khan, S. S., Cooper, R. & Greenland, P. Do polygenic risk scores improve patient selection for prevention of coronary artery disease? JAMA 323, 614–615 (2020).
Sun, L. et al. Polygenic risk scores in cardiovascular risk prediction: A cohort study and modelling analyses. PLoS Med. 18, e1003498 (2021).
Abraham, G. et al. Genomic prediction of coronary heart disease. Eur. Heart J. 37, 3267–3278 (2016).
Riveros-Mckay, F. et al. Integrated polygenic tool substantially enhances coronary artery disease prediction. Circ. Genom. Precis. Med. 14, e003304 (2021).
Hindy, G. et al. Genome-wide polygenic score, clinical risk factors, and long-term trajectories of coronary artery disease. Arterioscler. Thromb. Vasc. Biol. 40, 2738–2746 (2020).
Mosley, J. D. et al. Predictive accuracy of a polygenic risk score compared with a clinical risk score for incident coronary heart disease. JAMA 323, 627–635 (2020).
Lloyd-Jones, D. M. et al. Framingham risk score and prediction of lifetime risk for coronary heart disease. Am. J. Cardiol. 94, 20–24 (2004).
Damen, J. A. et al. Performance of the Framingham risk models and pooled cohort equations for predicting 10-year risk of cardiovascular disease: a systematic review and meta-analysis. BMC Med. 17, 109 (2019).
Goff, D. C. Jr. et al. 2013 ACC/AHA guideline on the assessment of cardiovascular risk. Circulation https://www.ahajournals.org/doi/10.1161/01.cir.0000437741.48606.98 (2013).
Inouye, M. et al. Genomic risk prediction of coronary artery disease in 480,000 adults: implications for primary prevention. J. Am. Coll. Cardiol. 72, 1883–1893 (2018).
Si, Y. et al. Deep representation learning of patient data from Electronic Health Records (EHR): a systematic review. J. Biomed. Inform. 115, 103671 (2021).
Miotto, R., Wang, F., Wang, S., Jiang, X. & Dudley, J. T. Deep learning for healthcare: review, opportunities and challenges. Brief. Bioinform. 19, 1236–1246 (2018).
Choi, E., Bahadori, M. T., Searles, E., Coffey, C., Thompson, M., Bost, J., Tejedor-Sojo, J. & Sun, J. Multi-layer representation learning for medical concepts. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16), 1495–1504 (2016).
Rong, X. word2vec parameter learning explained. Preprint at https://doi.org/10.48550/arXiv.1411.2738 (2016).
Choi, E., Bahadori, M. T., Schuetz, A., Stewart, W. F. & Sun, J. Doctor AI: Predicting clinical events via recurrent neural networks. JMLR Workshop and Conference Proceedings. 56, 301–318 (2016).
Raket, L. L. et al. Dynamic ElecTronic hEalth reCord deTection (DETECT) of individuals at risk of a first episode of psychosis: a case-control development and validation study. Lancet Digit. Health 2, e229–e239 (2020).
Choi, E., Bahadori, M. T., Song, L., Stewart, W. F. & Sun, J. GRAM: Graph-based attention model for healthcare representation learning. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '17), 787–795 (2017).
Ma, F. et al. Dipole: diagnosis prediction in healthcare via attention-based bidirectional recurrent neural networks. In Proc. 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 1903–1911 (Association for Computing Machinery, 2017).
Ma, F. et al. KAME: knowledge-based attention model for diagnosis prediction in healthcare. In Proc. 27th ACM International Conference on Information and Knowledge Management 743–752 (Association for Computing Machinery).
Choi, E., Bahadori, M. T., Kulas, J. A., Schuetz, A., Stewart, W. F. & Sun, J. RETAIN: An interpretable predictive model for healthcare using reverse time attention mechanism. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS '16), 3512–3520 (2016).
Rasmy, L., Xiang, Y., Xie, Z., Tao, C. & Zhi, D. Med-BERT: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction. npj Digit. Med. 4, 1–13 (2021).
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT '19), 4171–4186 (2019).
Li, Y. et al. BEHRT: transformer for electronic health records. Sci. Rep. 10, 7155 (2020).
Wu, T., Wang, Y., Wang, Y., Zhao, E. & Yuan, Y. Leveraging graph-based hierarchical medical entity embedding for healthcare applications. Sci. Rep. 11, 5858 (2021).
Landi, I. et al. Deep representation learning of electronic health records to unlock patient stratification at scale. npj Digit. Med. 3, 1–11 (2020).
Nguyen, P., Tran, T., Wickramasinghe, N. & Venkatesh, S. Deepr: a convolutional net for medical records. IEEE J. Biomed. Health Inform. 21, 22–30 (2017).
Li, Y. et al. Inferring multimodal latent topics from electronic health records. Nat. Commun. 11, 2536 (2020).
McCaw, Z. R. et al. DeepNull models non-linear covariate effects to improve phenotypic prediction and association power. Nat. Commun. 13, 241 (2022).
Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
Natarajan, P. et al. Polygenic risk score identifies subgroup with higher burden of atherosclerosis and greater relative benefit from statin therapy in the primary prevention setting. Circulation 135, 2091–2101 (2017).
Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4, 7 (2015).
Lambert, S. A. et al. The polygenic score catalog as an open database for reproducibility and systematic evaluation. Nat. Genet. 53, 420–425 (2021).
The CARDIoGRAMplusC4D Consortium. A comprehensive 1000 genomes–based genome-wide association meta-analysis of coronary artery disease. Nat. Genet. 47, 1121–1130 (2015).
Clifton, L., Collister, J. A., Liu, X., Littlejohns, T. J. & Hunter, D. J. Assessing agreement between different polygenic risk scores in the UK Biobank. Sci. Rep. 12, 12812 (2022).
Sundararajan, M., Taly, A. & Yan, Q. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning (ICML '17), 3319–3328 (2017).
Tsang, M., Cheng, D. & Liu, Y. Detecting statistical interactions from neural network weights. Preprint at https://doi.org/10.48550/arXiv.1705.04977 (2018).
Hemann, B. A., Bimson, W. F. & Taylor, A. J. The Framingham risk score: an appraisal of its benefits and limitations. Am. Heart Hosp. J. 5, 91–96 (2007).
Yao, Y. S., Li, T. D. & Zeng, Z. H. Mechanisms underlying direct actions of hyperlipidemia on myocardium: an updated review. Lipids Health Dis. 19, 23 (2020).
Nelson, R. H. Hyperlipidemia as a risk factor for cardiovascular disease. Prim. Care 40, 195–211 (2013).
Powell-Wiley, T. M. et al. Obesity and cardiovascular disease: a scientific statement from the American Heart Association. Circulation 143, e984–e1010 (2021).
Martín-Timón, I., Sevillano-Collantes, C., Segura-Galindo, A. & del Cañizo-Gómez, F. J. Type 2 diabetes and cardiovascular disease: Have all risk factors the same strength? World J. Diabetes 5, 444–470 (2014).
Chen, C.-H., Lin, C.-L. & Kao, C.-H. Association between gastroesophageal reflux disease and coronary heart disease: a nationwide population-based analysis. Medicine 95, e4089 (2016).
Senmaru, T. et al. Atrophic gastritis is associated with coronary artery disease. J. Clin. Biochem. Nutr. 51, 39–41 (2012).
U.S. Department of Veterans Affairs. Million Veteran Program.
National Institutes of Health (NIH). All of Us Research Program.
FinnGen. FinnGen: an expedition into genomics and medicine.
Jernberg, T. et al. Cardiovascular risk in post-myocardial infarction patients: nationwide real world data demonstrate the importance of a long-term perspective. Eur. Heart J. 36, 1163–1170 (2015).
Hippisley-Cox, J., Coupland, C. & Brindle, P. Development and validation of QRISK3 risk prediction algorithms to estimate future risk of cardiovascular disease: prospective cohort study. BMJ 357, j2099 (2017).
Sinnott-Armstrong, N. et al. Genetics of 35 blood and urine biomarkers in the UK Biobank. Nat. Genet. 53, 185–194 (2021).
Tanigawa, Y. et al. Significant sparse polygenic risk scores across 813 traits in UK Biobank. PLoS Genet. 18, e1010105 (2022).
Jung, H. et al. Integration of risk factor polygenic risk score with disease polygenic risk score for disease prediction. Commun. Biol. 7, 180 (2024).
Isgut, M., Song, K., Ehm, M. G., Wang, M. D. & Davitte, J. Effect of case and control definitions on genome-wide association study (GWAS) findings. Genet. Epidemiol. 47, 394–406.
Acknowledgements
This research has been supported by funding from Enduring Heart Foundation, Georgia Tech Retention, a Wallace H. Coulter Distinguished Faculty Fellowship, a Petit Institute Faculty Fellowship, Amazon, and Microsoft Research to Professor May D. Wang.
Author information
Contributions
M.I., as the primary author, conceptualized the ideas in this work, designed the experimental framework, processed the data, analyzed the results, and wrote the manuscript. H.B., Y.S., A.H., and K.C. supported data collection by running experiments. B.J.A., S.R.D., and M.D.W. provided conceptual support for the completion of the work and for structuring the writing.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Communications Medicine thanks the anonymous reviewers for their contribution to the peer review of this work. Peer reviewer reports are available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Isgut, M., Hornback, A., Bao, H. et al. Greater value add from electronic health records than polygenic risk scores for predicting myocardial infarction in machine learning. Commun Med 5, 450 (2025). https://doi.org/10.1038/s43856-025-01138-5
DOI: https://doi.org/10.1038/s43856-025-01138-5