Introduction

Polygenic scores (PRSs) are derived by summing genome-wide genetic variants associated with a given trait or disease. As aggregates of genetic variants’ effects, they are used to predict disease risk. In recent years, PRSs have been studied for their utility in predicting future diseases and identifying individuals at higher risk. For their use in disease prediction, it is premised that the phenotypic variation can be sufficiently explained by genetic variants used to construct the PRS. PRSs have demonstrated the ability to stratify genetic disease susceptibility, but there remains substantial variability in predictions that PRSs do not explain1. One potential explanation for this variability is the contribution of rare variants (RVs), which are generally not accounted for in standard PRS models2,3. Existing methods to develop PRSs include only common variants (CVs) (minor allele frequency (MAF) ≥ 1%), whereas the contribution of RVs (MAF < 1%) to disease prediction remains largely unexplored3.

Including RVs in polygenic risk scores may enhance the accuracy of predicting an individual’s risk of developing a disease. However, aggregating RVs has been challenging due to their abundance in the genome, poor annotations, and sample size constraints. The accuracy of the PRS depends strongly on the power of the discovery GWAS summary statistics4, yet the statistical power of classical single-variant association tests for RVs is low unless sample sizes or effect sizes are very large5. The availability of extensive whole exome sequencing (WES) and imputed genotype (IMP) data from the UK Biobank (UKB), with half a million participants, presents a unique opportunity to address these power issues, allowing for a more rigorous investigation into the role of RVs in disease prediction.

Although WES and IMP data from UKB facilitate the inclusion of RVs in disease association studies, standardized protocols for incorporating these variants into PRS construction are still lacking6. Previous efforts have tended toward gene-based aggregation (or gene-burden) tests for RVs rather than single-variant approaches, especially when sample sizes were limited5,7. However, each approach captures RV effects differently. Gene-burden tests aggregate the effect sizes of multiple variants within a gene, while single-variant tests can offer finer resolution by estimating individual SNP effects. As a result, the choice of association method can influence the protocol for PRS construction. For example, gene-burden associations use aggregated effect sizes per gene to construct PRS, often employing the score function in PLINK to summarize gene effects. These aggregated effect sizes per gene can be influenced by the MAF and the types of RV annotations3,8,9. By contrast, single-variant approaches can leverage PRS software tools such as PRSice-2, LDpred, PRS-CS, and lassosum, which apply different methods for selecting or weighting SNPs based on association strength and linkage disequilibrium (LD)10,11,12,13. While PRSice-2 focuses on LD clumping and P-value thresholding to retain the most significant SNPs, other tools (e.g., LDpred, PRS-CS, and lassosum) apply shrinkage on effect size estimates using external LD references3,9,10.
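To make the contrast between the two scoring routes concrete, the following minimal sketch computes a single-variant score and a gene-burden score for one individual. All variant IDs, gene names, dosages, and weights are hypothetical, and defining the burden as the summed dosage of qualifying variants per gene is an assumption made for illustration; the tools named above may define and weight burdens differently.

```python
# Illustrative only: variant IDs, gene names, dosages, and weights are made up.

def single_variant_score(dosages, variant_weights):
    """Sum of allele dosage (0-2) times the per-variant effect size."""
    return sum(variant_weights[v] * d
               for v, d in dosages.items() if v in variant_weights)


def gene_burden_score(dosages, variant_to_gene, gene_weights):
    """Collapse variants into a per-gene burden, then weight each gene.

    Assumption: burden = summed dosage of the variants mapped to the gene.
    """
    burden = {}
    for variant, dosage in dosages.items():
        gene = variant_to_gene.get(variant)
        if gene is not None:
            burden[gene] = burden.get(gene, 0.0) + dosage
    return sum(gene_weights[g] * b
               for g, b in burden.items() if g in gene_weights)


# One hypothetical individual carrying three rare variants.
dosages = {"rv1": 1, "rv2": 0, "rv3": 2}
variant_weights = {"rv1": 0.30, "rv2": 0.10, "rv3": -0.05}
variant_to_gene = {"rv1": "GENE_A", "rv2": "GENE_A", "rv3": "GENE_B"}
gene_weights = {"GENE_A": 0.25, "GENE_B": 0.10}

svs = single_variant_score(dosages, variant_weights)
gbs = gene_burden_score(dosages, variant_to_gene, gene_weights)
```

The single-variant score keeps a separate weight per SNP, whereas the burden score assigns one weight to every qualifying variant in a gene, which is why annotation masks and MAF cut-offs matter more for the latter.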

Recent studies have demonstrated that integrating RVs into polygenic risk prediction can improve disease prediction, but a universally accepted protocol has yet to emerge. For instance, Rali et al. (2021) described RV-EXCALIBER, which tests associations between gene-level burdened RVs and disease, then calibrates RV burdens using external references such as gnomAD14. This method effectively addresses systematic gene biases but is limited to binary traits14. Other approaches have utilized RVs from eQTL or high-effect SNPs to construct PRS models for complex traits like body mass index (BMI) and type 2 diabetes, with some success in improving prediction3,15,16. However, these methods have not yet provided a comprehensive solution across traits with diverse characteristics, including differences in trait type (binary or quantitative) and in genetic architecture (polygenic or oligogenic). To address this gap, establishing a protocol that can be flexibly applied to multiple types of data and traits is needed to determine the broader applicability of rvPRS.

In this study, we systematically evaluated gene-burden and single-SNP associations to assess RV effects across 18 traits (13 binary and 5 quantitative) using UKB WES and IMP data. All associations were analyzed with SAIGE-GENE+. Gene-burden association summaries were converted to rvPRS with PLINK, and single-SNP association summaries were used with PRSice-2 to optimize rvPRS models. By comparing rvPRS from both WES and IMP data and validating performance in an independent UKB set, we aim to identify effective strategies for incorporating RVs into PRS.

Results

Study overview

The study scheme is illustrated in Fig. 1. Utilizing UKB data, we derived both cvPRS and rvPRS for 13 binary and 5 quantitative traits. The dataset was divided into three independent subsets: the discovery set for GWAS; the training set for PRS construction and optimization; and the validation set for evaluating final model performance. For cvPRS, we used FinnGen GWAS summary statistics for binary traits and performed GWAS using SAIGE on UKB IMP genotypes for quantitative traits. SNP effect sizes were re-weighted using PRS-CS and applied in PLINK to calculate cvPRS.

Fig. 1: Study overview in the UKB.

For rvPRS, we analyzed both WES and IMP data. In WES, we performed both gene-burden and single-SNP associations using SAIGE-GENE+ on RVs. This yielded 10 WES-based models: 9 rvPRSGENE models from gene-burden associations (nine variant masks defined by functional annotation and MAF thresholds) and 1 rvPRSWES model from single-SNP associations. In IMP, we conducted single-SNP associations using SAIGE and generated 12 rvPRSIMP models using variant sets stratified by MAF ranges, imputation quality (INFO score), and exonic location (details in the Methods and Materials). rvPRS was calculated using PLINK for gene-burden-based scores and PRSice-2 for single-SNP-based scores.

The predictive performance of PRS models was evaluated via logistic or linear regression. Additionally, we used net reclassification improvement (NRI) and integrated discrimination improvement (IDI) to assess clinical utility and model improvement.

We use the following trait abbreviations throughout: AF, atrial fibrillation; AST, asthma; CA, cardiac dysrhythmias; CER, cerebrovascular disease; CHO, cholelithiasis and cholecystitis; COL, colorectal cancer; DM2, type 2 diabetes; HER, abdominal hernia; HF, congestive heart failure; HT, hypertension; IH, ischemic heart disease; BRE, breast cancer; PRO, prostate cancer; BMI, body mass index; SBP, systolic blood pressure; DBP, diastolic blood pressure; RHR, resting heart rate; FEV, predicted forced expiratory volume in one second percentage.

In addition, throughout, we use polygenic score (PRS) as a general term for both binary and quantitative traits; we use polygenic risk score only when referring specifically to binary disease risk.

Estimating the predictive performance of PRS

To investigate the association between PRS and traits, we conducted logistic regression analysis for 13 binary traits and linear regression analysis for 5 quantitative traits in the training set, adjusted for age, sex, genotyping array, and principal components (PCs) 1–20.

From WES data, we derived 9 rvPRSGENE models from gene-burden associations (mask1-9) and 1 rvPRSWES model from single-SNP associations. From IMP data, we constructed 12 rvPRSIMP models using variant sets defined by combinations of MAF ranges, imputation quality (INFO scores), and genomic regions (exonic or not).

The first four rvPRSIMP models (INFO > 0.3) used variants with INFO > 0.3 and four MAF ranges (Maf1 – Maf4). To enable comparison with WES-based models, the next four models (Exon only) used exonic variants with INFO > 0.3 and the same MAF ranges. The final four models (INFO > 0.8) used variants with higher imputation quality (INFO > 0.8), again stratified by the same MAF ranges. The Bonferroni-corrected threshold for significance was set at P < 1.26E-04 (0.05/18/22). To identify the best-performing rvPRS among 22 different models, we utilized Nagelkerke’s pseudo-R2 for binary traits and adjusted R2 (the phenotypic variance explained) for quantitative traits as selection criteria. The best-performing rvPRS was defined as the one with the highest Nagelkerke’s pseudo-R2 or adjusted R2, as shown in Fig. 2A, B.
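The threshold arithmetic and the model-selection rule described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the example R2 values are invented, and the Nagelkerke formula is the standard one computed from the log-likelihoods of a reference model and a fitted model (which reference model the study used is not stated here).

```python
import math


def bonferroni_threshold(alpha, *n_tests):
    """Divide alpha by each test count in turn, e.g. 0.05 / 18 / 22."""
    threshold = alpha
    for n in n_tests:
        threshold /= n
    return threshold


def nagelkerke_r2(loglik_ref, loglik_full, n):
    """Nagelkerke pseudo-R2 from two log-likelihoods and the sample size n.

    Cox-Snell R2 is rescaled by its maximum so the result can reach 1.
    """
    cox_snell = 1.0 - math.exp(2.0 * (loglik_ref - loglik_full) / n)
    max_cox_snell = 1.0 - math.exp(2.0 * loglik_ref / n)
    return cox_snell / max_cox_snell


def best_model(r2_by_model):
    """Return the model name with the highest R2."""
    return max(r2_by_model, key=r2_by_model.get)


threshold = bonferroni_threshold(0.05, 18, 22)  # 18 traits x 22 models

# Hypothetical R2 values for three of the 22 candidate models of one trait.
candidate_r2 = {"WES": 0.0010, "IMP_INFO0.3_Maf1": 0.0031, "IMP_INFO0.8_Maf1": 0.0029}
selected = best_model(candidate_r2)
```

Formatting `threshold` to three significant figures reproduces the 1.26E-04 cut-off quoted in the text.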

Fig. 2: Predictive performance of rvPRS models in the training dataset.

Only the best-performing PRS from each category is presented. A The left solid bar plot represents Nagelkerke’s R2, and the right forest plot shows the perSD OR in logistic regression models for binary traits. B The left solid bar plot displays the adjusted R2, and the right forest plot shows the perSD Beta in linear regression models for quantitative traits. * denotes association P-values that meet the Bonferroni-corrected threshold for significance (P < 1.26E-04 = 0.05/18/22). C This forest plot represents the NRI and IDI values comparing the “Full model (rvPRS + null)” with the “Null model (formula: trait ~ age + sex + genotype array + PC1-20)” and the “rvPRS+cvPRS model” with the “cvPRS model”. * denotes P-values for NRI and IDI that meet the Bonferroni-corrected threshold for significance (P < 1.92E-03 = 0.05/13/2). The error bars represent the 95% confidence intervals (CIs). AF atrial fibrillation, AST asthma, CA cardiac dysrhythmias, CER cerebrovascular disease, CHO cholelithiasis and cholecystitis, COL colorectal cancer, DM2 type 2 diabetes, HER abdominal hernia, HF congestive heart failure, HT hypertension, IH ischemic heart disease, BRE breast cancer, PRO prostate cancer, BMI body mass index, SBP systolic blood pressure, DBP diastolic blood pressure, RHR resting heart rate, FEV predicted FEV1%.

Across the 9 rvPRSGENE models derived from WES data, significant associations were observed for only 6 traits. For these, the best model’s R2 ranged from 0.0616% for HT to 0.0923% for BMI (Supplementary Data 3). In contrast, 15 of the 18 traits showed significant associations with rvPRSWES (single-SNP-based), with R2 values ranging from 0.0242% (CA) to 0.4718% (BRE) (Supplementary Data 4).

For the first set of rvPRSIMP models (INFO > 0.3), 16 traits showed significant associations, with R2 ranging from 0.0701% (AF) to 1.3847% (BMI) (Supplementary Data 5). In the exon-only IMP models, 13 traits met the significance threshold, with R2 ranging from 0.0346% (HER) to 0.6438% (PRO) (Supplementary Data 6). For IMP variants with higher imputation quality (INFO > 0.8), 15 traits met the threshold, with R2 values from 0.0744% (AF) to 1.3693% (BMI) (Supplementary Data 7).

On average, the best rvPRSGENE model showed 74.4% lower R2 than rvPRSWES, while rvPRSWES models had 66.8% and 65.8% lower R2 than rvPRSIMP models with INFO > 0.3 and INFO > 0.8, respectively, highlighting the stronger performance of genome-wide IMP data (Supplementary Data 8). Notably, rvPRSWES underperformed exon-only rvPRSIMP by 8.0% in R2 but reached the significance threshold for two more traits (15 vs. 13), suggesting comparable utility between the two variant sets. In comparisons between different imputation qualities, rvPRSIMP (INFO > 0.8) outperformed INFO > 0.3 for AF, CA, and SBP, supporting the benefit of higher quality imputation (Supplementary Data 20).

Overall, rvPRSIMP provided the best-performing model for 17 out of 18 traits, except BRE, where rvPRSWES performed best (Fig. 2, Supplementary Data 8). For five quantitative traits (BMI, DBP, FEV, RHR, and SBP), per-standard deviation (perSD) Beta values for rvPRS showed highly significant associations (max P = 1.21E-149; significance threshold P < 1.26E-04 = 0.05/18/22). Among 13 binary traits, 11 rvPRSs (AF, AST, BRE, CA, CHO, DM2, HER, HF, HT, IH, and PRO) passed Bonferroni correction. The remaining two traits (CER and COL) were marginal (CER, perSD OR = 1.06 [95% CI: 1.03–1.10], P = 6.93E-04; and COL, perSD OR = 1.09 [1.04–1.14], P = 6.04E-04) (Supplementary Data 8).

The best rvPRS models for traits like breast cancer, cholelithiasis, prostate cancer, and resting heart rate were derived from either WES or IMP data, indicating their oligogenic or monogenic inheritance. For breast cancer, the rvPRSWES model used variants on BRCA2, PALB2, and CHEK2, while the rvPRSIMP model focused on a variant near ZNRF3, a gene involved in Wnt signaling17,18. All four genes are known to influence breast cancer risk, but the BRCA2, PALB2, and CHEK2 variants have especially strong effect sizes, likely explaining the superior performance of the WES-based model. These results suggest that for traits with a concentrated genetic architecture involving rare, high-impact coding variants, sequencing-based data (WES) may outperform imputed genotype data, even when imputation includes variants in nearby noncoding regions. In contrast, for most other traits in our study, RVs from IMP, which cover both coding and noncoding regions, provided greater predictive power than WES alone. Further details on the remaining oligogenic patterns (cholelithiasis, prostate cancer, and resting heart rate) are provided in the Supplementary Note.

We constructed cvPRS models using PRS-CS re-weighted SNP data from IMP, achieving association R2 values from 0.16% (CER) to 10.09% (BMI) (Supplementary Data 9). On average, cvPRS models showed stronger associations than rvPRS models; however, the R2 value for rvPRS increased proportionally to that of cvPRS, averaging approximately 12% of the R2 observed in cvPRS.

Additionally, we constructed cvPRS using PRSice-2 and compared its performance to cvPRS from PRS-CS (Supplementary Data 10). PRSice-2 retained far fewer SNPs than PRS-CS, and across all 18 traits PRS-CS consistently outperformed PRSice-2 (Supplementary Data 11), likely because PRS-CS better optimizes SNP weights even when effect size estimates have low power.

Our primary focus was to identify the most effective method for deriving rvPRS. GWAS for rvPRS used the UKB discovery set; for cvPRS, we used FinnGen summaries for binary traits, whereas we performed GWASs on UKB CVs for quantitative traits. We selected PRS-CS for cvPRS due to its strong performance and cross-trait compatibility via the HapMap3 LD reference panel, which helps harmonize cohort differences. Accordingly, we used PRS-CS-derived cvPRS as the baseline for comparing rvPRS across 18 traits and included only PRS-CS cvPRS in subsequent analyses.

Estimating the improvement in risk prediction models for binary traits

For binary traits, we assessed the improvement in risk prediction using two analyses in the training dataset: (1) net reclassification improvement (NRI) and (2) integrated discrimination improvement (IDI). We estimated the NRI and IDI values for two sets of models: “null model” versus “full model”, and “cvPRS model” versus “rvPRS+cvPRS model”. The null model includes the formula: trait ~ age + sex + genotype array + PC1-20, while the full model adds rvPRS to the null model. Among the 13 binary traits, the NRI for 8 traits showed significantly positive values (P < 1.92E-03 = 0.05/13/2), suggesting that both the “full model” and the “rvPRS+cvPRS model” improved reclassification compared to the “null model” and the “cvPRS model”, respectively (Fig. 2C, Supplementary Data 12). For all 8 of these traits (AF, AST, BRE, CA, DM2, HER, HT, and IH), the IDI values were also significantly positive (P < 1.92E-03 = 0.05/13/2), indicating that both the “full model” and the “rvPRS+cvPRS model” improved discriminatory ability compared to the “null model” and the “cvPRS model”, respectively. Thus, we found that 8 binary traits exhibited significantly improved prediction performance in the training dataset.
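For readers unfamiliar with the two metrics, the sketch below computes NRI and IDI from the predicted probabilities of a baseline model and an extended model. It assumes the category-free (continuous) form of NRI, which may differ from the variant used in the study, and all names and numbers are hypothetical.

```python
def nri_idi(outcomes, p_old, p_new):
    """Category-free NRI and IDI for an old versus a new risk model.

    outcomes: 1 for cases, 0 for controls.
    p_old, p_new: predicted probabilities from the two models.
    """
    cases = [i for i, y in enumerate(outcomes) if y == 1]
    controls = [i for i, y in enumerate(outcomes) if y == 0]

    def net_upward(idx):
        # Proportion moved up minus proportion moved down by the new model.
        up = sum(p_new[i] > p_old[i] for i in idx)
        down = sum(p_new[i] < p_old[i] for i in idx)
        return (up - down) / len(idx)

    def mean_change(idx):
        return sum(p_new[i] - p_old[i] for i in idx) / len(idx)

    # Cases should move up and controls down for a better model.
    nri = net_upward(cases) - net_upward(controls)
    idi = mean_change(cases) - mean_change(controls)
    return nri, idi


# Toy example: two cases followed by two controls.
outcomes = [1, 1, 0, 0]
p_null = [0.5, 0.6, 0.5, 0.4]   # e.g. covariates only
p_full = [0.7, 0.5, 0.3, 0.4]   # e.g. covariates plus rvPRS
nri, idi = nri_idi(outcomes, p_null, p_full)
```

Positive values of both metrics indicate that adding the score shifts predicted risk in the right direction; significance testing (as reported in the text) would additionally require standard errors, which this sketch omits.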

Validation of the best rvPRS model

To test whether the prediction performance of rvPRS is replicated in a separate dataset, we performed association analyses for 8 binary traits using logistic regression and 5 quantitative traits using linear regression in the validation dataset. The effect sizes for PRS were extracted from the association results between PRS and traits in the training dataset (Supplementary Data 13). For all 13 traits, three types of PRS (cvPRS, rvPRS, and tPRS) were calculated in the validation dataset using these effect sizes. The tPRS was calculated by combining cvPRS and rvPRS in a model with the following formula: tPRS = βcvPRS × cvPRS + βrvPRS × rvPRS. Associations between each of the 3 PRS and their respective traits were tested using two models: model (1) trait ~ PRS and model (2) trait ~ PRS + age + sex + genotyping array + PC1-20. The results indicated that for all 13 traits, the perSD OR or Beta showed significant associations with the respective traits in the validation dataset, based on Bonferroni multiple correction (P < 6.41E-04 = 0.05/13/6) (Fig. 3A, Supplementary Data 14). We found that the mean proportion of association R2 of rvPRS compared to that of tPRS was 10.17% for binary traits and 11.07% for quantitative traits, suggesting that rvPRS explains ~10–11% of the phenotypic variance explained by tPRS.
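The tPRS formula above amounts to a weighted sum of the two component scores. The following sketch illustrates it, together with the standardization that underlies per-SD effect estimates; the weights and scores are invented, and standardizing with the population standard deviation is an assumption.

```python
import statistics


def total_prs(cv_prs, rv_prs, beta_cv, beta_rv):
    """tPRS = beta_cvPRS * cvPRS + beta_rvPRS * rvPRS, per individual."""
    return [beta_cv * c + beta_rv * r for c, r in zip(cv_prs, rv_prs)]


def standardize(values):
    """Scale to mean 0 and SD 1 so regression effects are per SD."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


# Hypothetical scores for three validation-set individuals; the betas stand
# in for effect sizes estimated in the training set.
cv_prs = [0.0, 1.0, 2.0]
rv_prs = [1.0, 0.0, 1.0]
tprs = total_prs(cv_prs, rv_prs, beta_cv=0.5, beta_rv=0.2)
tprs_z = standardize(tprs)
```

Because the betas come from the training set and are applied unchanged in the validation set, the combined score is not re-fitted on the data used to evaluate it.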

Fig. 3: Validation of rvPRS models.

Only rvPRS models that showed significant associations and risk reclassification in the validation dataset are displayed. A The left solid bar plot presents Nagelkerke’s R2 or adjusted R2 while the right forest plot displays the perSD OR or Beta depending on the trait class. B This forest plot represents the NRI and IDI values comparing the “Full model” with the “Null model” and the “rvPRS+cvPRS model” with the “cvPRS model”. The error bars represent the 95% confidence intervals (CIs). For 12 of these 13 traits (all except BRE), the final selected rvPRSs were derived from single-SNP association summaries in the IMP data. Final selections: AF, IMP (INFO > 0.8) Maf4; AST, IMP (INFO > 0.3) Maf1; CA, IMP (INFO > 0.8) Maf1; DM2, IMP (INFO > 0.3) Maf4; HER, IMP (INFO > 0.3) Maf3; HT, IMP (INFO > 0.3) Maf2; IH, IMP (INFO > 0.3) Maf4; BRE, WES; BMI, IMP (INFO > 0.3) Maf1; SBP, IMP (INFO > 0.8) Maf1; DBP, IMP (INFO > 0.3) Maf1; RHR, IMP (INFO > 0.3) Maf1; FEV, IMP (INFO > 0.3) Maf1.

To assess whether the improvement of tPRS (cvPRS + rvPRS) over cvPRS alone is statistically significant in terms of predictive performance, we applied the r2redux package in R, which provides test statistics for comparing R2 values between models, including variance estimates, P-values and 95% confidence intervals (CI)19. Using this tool, we evaluated whether the R2 values of the cvPRS and tPRS models were significantly different. As shown in Supplementary Data 14, for all 13 traits, the differences in R2 between the cvPRS and tPRS models were statistically significant after multiple testing correction (Benjamini-Hochberg adjusted P-value < 0.05), indicating that incorporating rvPRS significantly improves prediction accuracy across traits.

For the 8 binary traits, we further assessed the improvement in risk prediction of rvPRS using NRI and IDI values in the validation dataset. We evaluated the NRI and IDI values for two sets of models: “null model” versus “full model” and “cvPRS model” versus “rvPRS+cvPRS model” (Fig. 3B, Supplementary Data 15). The NRI and IDI values for 7 traits (AST, BRE, CA, DM2, HER, HT, and IH) showed significantly positive values (P < 3.13E-03 = 0.05/8/2), suggesting that both the “full model (rvPRS+cov)” and the “rvPRS+cvPRS model” improved classification and discriminatory ability compared to the “null model (cov only)” and the “cvPRS model”, respectively. However, the rvPRS for AF did not meet the significance threshold in the NRI and IDI comparison between “full model” and “null model” (P = 2.53E-02 for NRI; P = 2.79E-03 for IDI).

Overall, these validation analyses indicate that rvPRS performance was significantly replicated for 12 traits in the independent dataset. For 11 of these 12 traits (all except BRE), the final selected rvPRSs were derived from single-SNP association summaries in the IMP data, suggesting that, despite limitations in imputation, combining information from both the exome and noncoding genome provides stronger predictive power than using exome-restricted data alone. In contrast, for BRE, the final selected rvPRS originated from WES data, implying that traits with monogenic or oligogenic features may benefit more from high-quality variants obtained via sequencing-based platforms than from array-based approaches.

RV heritability and predictive performance

Genetic heritability of traits is known to correlate with the association strength between traits and genotype or PRS. Using genome-wide complex trait analysis (GCTA), we estimated SNP-based genetic heritability (h2RV) from RVs in IMP and WES for the 12 traits (Supplementary Data 16)20. Overall, h2IMP-RV was lower for binary traits than for quantitative traits, whereas h2WES-RV showed the opposite pattern. Consistent with these trends, in the validation dataset h2IMP-RV met the Bonferroni-corrected significance threshold (P < 6.94E-04 = 0.05/12/6) for 8 traits (3 binary: AST, DM2, and HT; 5 quantitative: BMI, DBP, FEV, RHR, and SBP), whereas h2WES-RV met the threshold for only 5 traits (4 binary: AST, DM2, HT, and IH; 1 quantitative: BMI).

We compared our RV heritability estimates to previously reported values21,22. In our validation dataset, h2WES-RV was 7.92% (95% CI: 5.25–10.59) for BMI, 3.08% (95% CI: 0.47–5.69) for DBP, and 3.05% (95% CI: 0.96–5.14) for SBP. Meanwhile, h2IMP-RV was higher: 14.24% (95% CI: 12.30–16.18) for BMI, 8.25% (95% CI: 6.36–10.14) for DBP, and 8.79% (95% CI: 6.97–10.62) for SBP. Our h2WES-RV estimates were 20 to 55% lower for BMI, SBP, and DBP compared to those reported by Pathan et al. (2024), whereas our h2IMP-RV was 44% and 29% higher for BMI and SBP, respectively, and 3.8% lower for DBP than Pathan’s (2024) h2WES-RV (Supplementary Data 16). Although Wainschtein et al. (2022) did not provide specific RV heritability values, we estimated the RV heritability from IMP to be ~20% for height and 10% for BMI, based on scaling from figures in their manuscript. Our h2IMP-RV for BMI is 42% higher than Wainschtein’s reported value, suggesting that RV heritability from IMP data is high across multiple traits compared to previous reports.

We further assessed the correlation between genetic heritability (h2RV) and rvPRS model R2 (Nagelkerke’s pseudo-R2 for binary traits; adjusted R2 for quantitative traits) across 12 traits using Pearson’s correlation (Fig. 4, Supplementary Data 16). For IMP data, h2IMP-RV correlated strongly with rvPRSIMP R2 in the training dataset (female: r = 0.95, P = 1.25E-04; male: r = 0.93, P = 3.35E-05), with a similar but non-significant trend in the validation dataset (r = 0.58, P = 5.67E-02). For WES data, h2WES-RV did not show significant correlations with rvPRSWES R2 (training female: r = 0.14, P = 6.54E-01; training male: r = 0.59, P = 5.86E-02; and validation: r = -0.19, P = 5.45E-01). Notably, when stratified by trait type, correlations between h2RV and R2 were significant for quantitative traits in both training and validation datasets for both genotype platforms. Overall, these patterns suggest that genome-wide IMP data capture trait-relevant RV signal more effectively than exome-restricted WES in training (with attenuation in validation), and that quantitative traits exhibit stronger, more reproducible relationships between RV heritability and rvPRS performance.
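The trait-level comparison described above pairs one heritability estimate with one R2 value per trait and correlates the two lists. A minimal sketch of Pearson's r and its t statistic follows; the input values are invented for illustration and are not taken from Supplementary Data 16.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_t(r, n):
    """t statistic for H0: r = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Hypothetical per-trait values: heritability (%) and rvPRS R2 (%).
h2_rv = [1.0, 2.0, 3.0, 4.0]
r2_rvprs = [1.0, 3.0, 2.0, 4.0]
r = pearson_r(h2_rv, r2_rvprs)
t = correlation_t(r, len(h2_rv))
```

With only about a dozen traits per comparison, the degrees of freedom are small, so a moderate r (such as the validation-set value) can fail to reach significance even when the trend matches the training set.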

Fig. 4: Correlation between R2 and genetic heritability (h2RV) of IMP RVs.

The correlation between genetic heritability (h2RV) and Nagelkerke’s pseudo-R2 or adjusted R2 values of rvPRS is plotted.

Improved risk stratification using tPRS

PRSs are known for their clinical utility in the early identification of individuals at higher risk for disease. To assess the ability of the tPRS to improve risk stratification, we calculated the odds ratio (OR) or Beta for high-risk groups (top 5%, 10%, 15%, 20%, 25%, or 30% of the PRS distribution) compared with a normal group (middle 41–60%) across 7 binary and 5 quantitative traits. These traits had previously shown statistically validated performance in the validation dataset.

We then evaluated whether risk stratification was significantly improved by tPRS compared with cvPRS. Using a one-sample t-test, we compared the differences in ORs or Beta (β) values between high-risk and normal groups across the two PRS models. As shown in Supplementary Data 17, for several binary traits (BRE, CA, DM2, and IH), the mean OR differences between the high-risk and normal groups were significantly greater for tPRS than for cvPRS (Fig. 5). For two quantitative traits (BMI and SBP), the mean β differences between these groups were also significantly larger for tPRS than for cvPRS (Fig. 5 and Supplementary Data 18). These results suggest that incorporating rvPRS into the tPRS significantly enhances both predictive accuracy and risk stratification compared with using cvPRS alone.

Fig. 5: Improved stratification of high-risk versus normal groups by tPRS in the validation dataset.

High-risk groups were the top 5%, 10%, 15%, 20%, 25%, and 30% of the PRS distribution; the normal group was the middle 41–60%. For each top portion, we estimated log(OR) (binary traits) or β (quantitative traits) for tPRS and cvPRS, and plotted their difference [tPRS–cvPRS]. Boxplots summarize the distribution of differences across top portions (points = individual top portions). Traits shown are those with a significantly positive mean difference. tPRS = rvPRS + cvPRS.
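The high-risk versus normal-group comparison can be sketched as below. The estimator is not spelled out in this section, so the sketch assumes an unadjusted 2×2 odds ratio and a plain one-sample t statistic on the [tPRS–cvPRS] differences; function names and data are hypothetical.

```python
import math


def top_vs_middle_or(scores, cases, top_percent):
    """Unadjusted odds ratio: top `top_percent`% of scores vs the middle 41-60%."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])  # ascending by score
    middle = order[n * 40 // 100: n * 60 // 100]
    top = order[n - max(1, n * top_percent // 100):]

    def odds(idx):
        n_cases = sum(cases[i] for i in idx)
        return n_cases / (len(idx) - n_cases)  # raises if a group has no controls

    return odds(top) / odds(middle)


def one_sample_t(diffs):
    """t statistic for H0: mean difference = 0 (df = len(diffs) - 1)."""
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)


# Toy data: 20 individuals with scores 0..19 and four cases.
scores = list(range(20))
cases = [0] * 20
for i in (9, 17, 18, 19):
    cases[i] = 1

or_top20 = top_vs_middle_or(scores, cases, 20)

# Hypothetical log(OR) differences [tPRS - cvPRS] across three top portions.
t_stat = one_sample_t([1.0, 2.0, 3.0])
```

In practice the six top portions are nested (the top 5% is contained in the top 10%, and so on), so the differences are not independent; the t statistic here only mirrors the summary described in the text.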

Discussion

In this study, we developed rvPRSs for 13 binary and 5 quantitative traits using both WES and IMP data, establishing a versatile protocol for constructing high-performing rvPRSs. This protocol outlines the sequential analytical steps we applied, making it applicable across different genetic data sources (WES and IMP) and phenotype types (binary and quantitative traits), although no executable code is provided.

Gene-burden association is commonly used for RVs rather than single-SNP association5,7. However, classifying RVs remains challenging, as current algorithms do not always accurately predict their pathogenic characteristics. Ideally, only damaging alleles would be aggregated, excluding benign alleles; however, distinguishing between the two is not perfect. To enrich for potentially harmful alleles, RV association studies typically focus on (1) nonsynonymous variants, such as loss-of-function or missense mutations in protein-coding regions, and (2) variants below a specified frequency threshold. Even with these criteria, the variants identified may still include a mix of damaging and benign alleles22,23. Additionally, the impact of RVs in noncoding regions can complicate their inclusion in PRS models. Although noncoding RVs may be individually less harmful, their high number per individual could still significantly impact trait variation, similar to the effect of CVs in aggregate genetic scores.

Recent studies using whole exome or genome sequencing data have consistently shown that RVs contribute to the “missing” heritability of traits. For example, total heritability (CVs + RVs) for height and BMI has been estimated at 68% and 30%, respectively, using whole genome sequencing (WGS) data, and at 50–56% (height) and 16–21% (BMI) using IMP data21, while WES data yielded estimates of 87.8% (height) and 39.5% (BMI)22. The higher WES heritability estimates reported by Pathan et al. (2024)22, relative to WGS in Wainschtein et al. (2022)21, likely reflect differences in methodology, sample size, and variant selection rather than an inherent superiority of WES over WGS. These prior works therefore suggest that WGS or WES can capture RV contributions more effectively than IMP genotypes, particularly for very rare variants (MAF < 0.001) in low LD that are often missed or imperfectly imputed21.

In contrast, our results showed that SNP-based heritability from IMP RVs (h2IMP-RV) was higher than that from WES RVs (h2WES-RV) for several traits. This is not necessarily inconsistent with previous reports and likely reflects differences in dataset characteristics, variant selection, and heritability estimation. For instance, our WES heritability used ~6–1,587,000 variants (vs. ~1.5 million in Pathan et al.), while our h2IMP-RV was based on ~279,000–4,406,000 variants (vs. ~20 million in Wainschtein et al.). Despite these smaller variant sets, our h2IMP-RV estimates for traits such as BMI and SBP were higher than those reported from WES or WGS datasets.

Both Pathan (2024) and Wainschtein (2022) analyzed independent RV sets obtained by LD pruning. Similarly, we selected variants for h2IMP-RV using a combination of filtering and model-based selection: (1) MAF stratification, (2) LD clumping for PRS derivation, and (3) selecting the rvPRS model with the best predictive performance. Despite imperfect imputation, these steps may have contributed to the higher heritability estimates observed in our study relative to Wainschtein (2022), underscoring that variant quality, filtering strategy, and model design can influence heritability estimates more than raw variant counts. Additionally, for traits such as CA, DM2, BMI, DBP, FEV, RHR, and SBP, lower WES-based heritability may reflect limited coverage or platform-specific bias in our data.

Interestingly, although h2IMP-RV was relatively high, the predictive performance of rvPRSIMP was lower than expected. For example, whereas Pathan et al. (2024) reported that RV heritability from WES accounted for ~22% of total genetic heritability across traits, our rvPRSIMP models accounted for only ~10% of the variance explained by tPRS. This gap may reflect the imperfect tagging of ultra-rare variants (MAF < 0.001) in the IMP data, leaving a substantial portion of “missing” heritability unmodeled.

Our findings suggest that the relative performance of WES- versus imputation-based rvPRS models within coding (exonic) regions may vary by trait, likely reflecting differences in imputation quality, variant coverage, and trait-specific genetic architecture. For traits with a stronger contribution from rare coding variants and/or better coverage in WES data, WES-based rvPRS may offer higher predictive value. In contrast, for other traits, high-quality imputation of coding variants may be sufficient to yield comparable or even superior performance.

Furthermore, the relative performance of rvPRS based on more- versus less-accurate imputation may also vary by trait, again reflecting differences in imputation quality and variant coverage. These observations underscore the importance of optimizing rvPRS construction strategies based on both trait-specific genetic architectures and the quality of the underlying data sources.

Overall, our findings suggest that carefully selected imputed RVs can yield high heritability estimates and contribute meaningfully to PRS models. However, technical limitations of imputation, particularly for ultra-rare variants, still constrain their predictive power. As WGS data become more widely available, future studies will be able to assess whether broader and more accurate RV coverage improves rvPRS construction and its clinical utility.

To derive the rvPRS, we used the PRSice-2 program. Unlike methods that employ shrinkage of SNP effect sizes, PRSice-2 involves clumping SNPs by internal LD and P-value, then prioritizes SNPs at each locus based on the smallest GWAS P-value through thresholding10,24. Because it is challenging to obtain suitable LD references for RVs, given their sparse LD structure, PRSice-2 is an optimal choice for rvPRS derivation as it does not require an external LD reference. Moreover, clumping by P-value is statistically robust, eliminating the need for possibly inaccurate SNP annotation. All PRS construction approaches involve parameter optimization and are therefore susceptible to overfitting24. Several strategies have been proposed to mitigate overfitting9, and we applied them here. First, we used out-of-sample evaluation by separating the training dataset (used for PRS optimization) from the validation dataset (used solely for replication). Second, we leveraged the empirical P-value generation implemented in PRSice-2. As described by Choi et al. (2019), PRSice-2 calculates the empirical P-values by permuting phenotype labels (default = 10,000 permutations) and repeating the PRS association across P-value thresholds, identifying the “best-fit PRS” as the score with the strongest association under permutation. Thus, PRSice-2 offers a flexible approach that can be applied to both IMP and WES data without assumptions about the trait’s genetic architecture. However, in this study, we selected PRS-CS for cvPRS construction based on its superior performance and its compatibility across multiple traits, owing to its use of a standardized LD reference panel, which helps harmonize differences between genotyping cohorts.
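The clumping-and-thresholding idea and the permutation-based empirical P-value discussed above can be sketched in a few lines. This is a simplified illustration and not PRSice-2's implementation: it ignores genomic windows, takes pairwise r2 values as a ready-made lookup, and uses a common permutation formula; all names, parameters, and numbers are hypothetical.

```python
def clump_and_threshold(pvalues, ld_r2, r2_max=0.1, p_max=1.0):
    """Greedy clumping followed by P-value thresholding.

    pvalues: {variant: GWAS P-value}
    ld_r2:   {frozenset({a, b}): r2}; missing pairs are treated as r2 = 0.
    Walks variants from smallest to largest P-value and keeps a variant only
    if it is not in LD (r2 > r2_max) with an already kept index variant.
    """
    kept = []
    for variant in sorted(pvalues, key=pvalues.get):
        if pvalues[variant] > p_max:
            break  # all remaining variants also exceed the threshold
        in_ld = any(ld_r2.get(frozenset((variant, index)), 0.0) > r2_max
                    for index in kept)
        if not in_ld:
            kept.append(variant)
    return kept


def empirical_p(observed_p, null_ps):
    """Share of permuted-phenotype P-values at least as small as the observed one.

    The +1 terms keep the estimate away from exactly zero.
    """
    hits = sum(p <= observed_p for p in null_ps)
    return (hits + 1) / (len(null_ps) + 1)


# Hypothetical summary statistics and LD for four variants.
pvalues = {"a": 1e-8, "b": 1e-6, "c": 1e-3, "d": 0.2}
ld_r2 = {frozenset(("a", "b")): 0.8, frozenset(("b", "c")): 0.5}
selected = clump_and_threshold(pvalues, ld_r2, r2_max=0.1, p_max=0.05)

# Four permutations only, to keep the toy example readable.
emp_p = empirical_p(0.01, [0.5, 0.2, 0.005, 0.9])
```

Variant "b" is dropped because it is in LD with the stronger index variant "a", while "c" is retained because its only LD partner was already removed; no external LD reference or functional annotation is needed for this selection step.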

As outlined above, we analyzed several traits with monogenic or oligogenic characteristics. For BRE, CHO, and PRO, only a few variants (< 7) were selected from both WES and IMP data through PRSice-2. In contrast, for RHR, 6 variants were selected from WES, but 312,595 were selected from IMP, reflecting a polygenic rather than oligogenic score. Selecting independent SNPs before producing rvPRS is essential, given the absence of a shrinkage strategy for SNP effect estimates. Independent variants are typically selected through pruning, which, however, may eliminate truly harmful variants due to LD. Conversely, including all pruned variants may reduce PRS performance, especially for oligogenic traits. Our clumping approach based on internal LD and P-value ensures selection of the most effective independent SNPs, irrespective of trait genetic complexity. We propose that our SAIGE-to-PRSice-2 protocol could be extended to WGS with appropriate quality controls, enabling analysis of additional variants not examined here.

This study has several limitations. First, while we used both WES and IMP data, each has inherent constraints. WES data excludes non-coding RVs, potentially underestimating genetic associations. In IMP data, limited imputation of very rare RVs (MAF < 0.001) may hinder accurate association estimates between variants and traits. Heritability studies indicate that the improved heritability in WGS over WES could lead to more precise association estimates21. Identifying more high-quality RVs would support the construction of more trait-correlated PRS, improving disease prediction.

Next, although we initially examined 18 traits, the rvPRSs of 6 binary traits did not meet our criteria, including association R2, perSD OR/Beta, NRI, and IDI. We attribute these limitations primarily to the imperfect imputation of RVs from IMP data and the exclusion of RVs in non-coding regions from WES. Therefore, we expect that WGS, rather than WES or IMP data, could enhance rvPRS performance.

Because our primary aim was to optimize rvPRS, we treated the PRS-CS-derived cvPRS as a fixed baseline. Although not exhaustively tuned, this cvPRS performed robustly across traits and provided a consistent comparator for evaluating rvPRS. Joint optimization of cvPRS and rvPRS is left for future work.

Additionally, we acknowledge the ongoing debate about the clinical utility of polygenic scores. In our results, for most binary traits (except AST), cvPRS-only models (cvPRS R2) did not outperform covariate-only models (Null R2), whereas for most quantitative traits (except SBP) cvPRS models outperformed covariate-only models (Supplementary Data 9). These findings align with prior reports that, depending on the trait, classical risk factors can equal or exceed PRS in predictive performance.

Nevertheless, we expect PRS utility to grow as both genetic and clinical risk models improve. Precision medicine will require integrating genetic scores with clinical and environmental factors. A practical advantage of PRS is cost-effectiveness: genetic data are collected once and can inform risk for multiple conditions, while clinical and environmental measures often need repeated updates. Thus, despite current limitations, continued refinement, particularly the incorporation of RVs, remains warranted.

Beyond these limitations, we address when rvPRS may contribute more than cvPRS and how to advance it. RVs tend to have larger per-allele effects than CVs and are enriched in protein-coding or loss-of-function changes, providing a more direct link to mechanism. For traits with monogenic/oligogenic components or severe/early-onset forms, a small number of damaging RVs can disproportionately influence phenotype. Thus, rvPRS may be valuable for identifying individuals at phenotypic extremes or those under-captured by cvPRS, while improving biological interpretability. As sequencing biobanks grow, RV associations will become more robust and rvPRS utility should increase. Incorporating up-to-date functional annotations—particularly AI-based predictors of variant impact—can prioritize informative RVs and improve signal-to-noise. Stratifying by disease subtypes or age-of-onset may reveal contexts where RV effects dominate. Finally, joint models that integrate cvPRS and rvPRS, coupled with broader noncoding RV coverage via WGS, are likely to yield additional gains in risk stratification and clinical utility.

A broader issue extends beyond this study to future research on incorporating RVs into GWAS and PRS. For cvPRS, open resources (PGS Catalog and GWAS Catalog) provide extensive summary statistics (e.g., >4800 PRSs across >650 traits in the PGS Catalog), so researchers can skip de novo GWAS and scoring. In contrast, resources explicitly tagged for “rare variants” are sparse (PGS Catalog: 25 PRSs across 18 traits; GWAS Catalog: 972 publications across 48 traits), reflecting the large sample sizes RV analyses require. Consequently, building rvPRS commonly starts from GWAS, which requires more computation and time than cvPRS construction. One can speed analyses by pruning to an LD-reduced SNP set before GWAS, but this risks discarding high-impact variants in LD with retained markers. Careful trade-offs between efficiency and sensitivity are therefore essential.

In summary, our findings indicate that (1) rvPRS can be constructed from RVs for various trait types through GWAS and PRSice-2, regardless of genotype data source (imputed or sequencing-based); (2) rvPRSs from IMP data contribute a minor but significant portion to stratifying individuals at higher disease risk when combined in a tPRS; (3) identifying more high-quality variants, for example through WGS, may further improve rvPRS performance; and (4) advancements in GWAS and PRS methodologies could further improve rvPRS predictive efficiency.

Methods

Ethical approval and consent to participate

The UK Biobank (UKB) is a large-scale, population-based cohort study that recruited 502,369 individuals aged 40–69 years across the United Kingdom from 2006 to 2010 (ref. 25). All participants in the UKB provided informed consent before participating in the study. Ethical approval for data collection in the UKB was obtained from the North West Multicentre Research Ethics Committee, with additional oversight from the National Information Governance Board for Health & Social Care and the Community Health Index Advisory Group. The UKB holds a generic Research Tissue Bank approval from the National Research Ethics Service, allowing researchers to utilize UKB data for various studies without requiring additional ethical approvals. Access to UKB data for this study was granted under application reference number 83990.

Study population

We utilized two genetic datasets from the UKB: rare variants (RVs) were derived from whole exome sequencing (WES) data (UKB data field: 23158) and imputed genotype (IMP) data (data field: 22418; v3 release); and common variants (CVs) were extracted from IMP data. The WES dataset contained 20,836,772 variants from 469,835 samples, while the IMP data included 93,095,623 variants across 487,409 participants. Quality control (QC) procedures for both genetic datasets are described in the following section.

For sample QC, we first selected 472,574 European samples self-reporting as “white British,” “Irish,” or “White.” From this group, 441,647 samples with both WES and IMP data were identified. We further excluded 689 samples with self-reported mixed ancestral background, high heterozygosity rate, or high missing rate (data fields: 22018, 22027), 514 samples with sex chromosome aneuploidy marker (data field: 22019), 4 participants excluded from kinship inference process, and 143 participants with ten or more third-degree relatives (data field: 22021), ensuring the removal of individuals with excessive relatedness. This resulted in 440,297 eligible samples, which we divided into three datasets: (1) discovery, (2) training, and (3) validation.

The discovery dataset was used to perform GWAS on RVs and CVs, the training set to calculate and optimize PRS, and the validation set to assess PRS performance. To divide the samples, we first extracted a validation set (N = 73,970) from the Testing Subgroup (data field: 26200, N = 104,107), excluding White British Unrelated samples as described by Thompson et al. (2024) and Jung et al. (2024)26,27. After the validation set extraction, the remaining 366,327 samples were randomly divided into the discovery set (N = 244,218) and the training set (N = 122,109) at a 2:1 ratio. In each dataset, we removed one individual from each pair of related participants (data field: 22011) to ensure independence, resulting in final sets of 242,497 unrelated individuals for discovery, 121,786 for training, and 73,919 for validation.
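As a quick consistency check, the sample counts reported in this section can be reproduced arithmetically. The sketch below uses only the numbers stated in the text; the variable names are illustrative.

```python
# Samples with both WES and IMP data, and the sample-QC exclusions reported above.
with_both_data = 441_647
exclusions = {
    "mixed ancestry, high heterozygosity, or high missingness": 689,
    "sex chromosome aneuploidy": 514,
    "excluded from kinship inference": 4,
    "ten or more third-degree relatives": 143,
}
eligible = with_both_data - sum(exclusions.values())

# Validation set is extracted first; the remainder is split 2:1 (discovery:training).
validation = 73_970
remaining = eligible - validation
training = remaining // 3
discovery = remaining - training

# Individuals dropped when one member of each related pair is removed.
final_sizes = {"discovery": 242_497, "training": 121_786, "validation": 73_919}
removed_related = {
    "discovery": discovery - final_sizes["discovery"],
    "training": training - final_sizes["training"],
    "validation": validation - final_sizes["validation"],
}
```

The computed values match the reported 440,297 eligible samples, the 366,327 remaining after validation extraction, and the 244,218 and 122,109 samples in the discovery and training sets.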

Genotype data

The WES data in PLINK format were accessed on the UK Biobank Research Access Platform (RAP) under application 83990 (UKB data field: 23158). Exomes were captured using a modified version of the IDT xGen Exome Research Panel v.1.0, targeting 39 Mbp of the human genome (covering 19,396 genes on autosomes and sex chromosomes) with additional probes to improve coverage at underperforming loci. Sample read mapping, variant calling, aggregation, and QC followed the OQFE protocol as previously described6,28. Briefly, NovaSeq WES reads were mapped with BWA MEM to the GRCh38 reference genome, and small variants were identified using WeCall and reported as per-sample gVCFs. These gVCFs were then aggregated with GLnexus into a joint-genotyped, multi-sample project-level VCF (pVCF), which was subsequently converted to PLINK format. A “90pct10dp” QC filter was applied to retain variants for which at least 90% of genotypes had a read depth (DP) of at least 10 (refs. 6, 28).

Following this initial QC, we began with 20,836,772 WES variants across 469,835 samples. For the analysis of RVs, we conducted further QC using PLINK v.1.90, excluding variants with a minor allele frequency (MAF) greater than 0.01, minor allele count (MAC) less than 10, missing genotype call rates greater than 0.1, or Hardy-Weinberg equilibrium P-value less than 1.00 × 10−15 (ref. 29). After this second QC process, 2,080,272 variants with MAF less than 0.01 remained for further analysis.

The IMP data (UKB data field: 22418) included 93,095,623 variants for 487,409 participants, generated using the UKB Axiom Array and the UK BiLEVE Axiom Array (Affymetrix, Santa Clara, CA, United States)30. Genotype imputation was conducted with the UK10K Project and Haplotype Reference Consortium (HRC) reference panels (build GRCh37). For QC of IMP data in RV analyses, we excluded variants with MAF > 0.01, MAC < 10, missing genotype call rates > 0.05, Hardy-Weinberg equilibrium P-value < 1.00 × 10−6, or imputation quality (INFO) scores < 0.3 (ref. 31). This process retained 26,194,733 variants with MAF < 0.01 for further analysis.

For CV analysis, we excluded variants with MAF < 0.01, missing genotype call rates > 0.05, or Hardy-Weinberg equilibrium P-value < 1.00 × 10−6. After applying these QC filters, 8,638,862 variants with MAF ≥ 0.01 were retained for further analysis.
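The variant-level QC rules above can be summarized as simple predicates. The following is a minimal sketch, not the PLINK commands used in the study: the `Variant` container and function names are hypothetical, and each rule is written as the complement of the exclusion criterion stated in the text (so a variant at exactly MAF = 0.01 is not excluded by either the RV or the CV rule).

```python
from dataclasses import dataclass

@dataclass
class Variant:
    maf: float           # minor allele frequency
    mac: int             # minor allele count
    missing_rate: float  # fraction of missing genotype calls
    hwe_p: float         # Hardy-Weinberg equilibrium P-value
    info: float = 1.0    # imputation INFO score (ignored for WES)

def passes_rv_qc(v, max_missing, min_hwe_p, min_info=None):
    """Shared rare-variant rule: keep unless any exclusion criterion is met."""
    if v.maf > 0.01 or v.mac < 10:
        return False
    if v.missing_rate > max_missing or v.hwe_p < min_hwe_p:
        return False
    if min_info is not None and v.info < min_info:
        return False
    return True

def passes_wes_rv_qc(v):
    # WES thresholds: missingness > 0.1 and HWE P < 1e-15 are excluded.
    return passes_rv_qc(v, max_missing=0.1, min_hwe_p=1e-15)

def passes_imp_rv_qc(v):
    # IMP thresholds: missingness > 0.05, HWE P < 1e-6, and INFO < 0.3 are excluded.
    return passes_rv_qc(v, max_missing=0.05, min_hwe_p=1e-6, min_info=0.3)

def passes_cv_qc(v):
    # Common variants: MAF < 0.01, missingness > 0.05, and HWE P < 1e-6 are excluded.
    return v.maf >= 0.01 and v.missing_rate <= 0.05 and v.hwe_p >= 1e-6

# Hypothetical examples.
rare = Variant(maf=0.005, mac=50, missing_rate=0.08, hwe_p=1e-10, info=0.9)
common = Variant(maf=0.2, mac=10_000, missing_rate=0.01, hwe_p=0.5)
```

In this example, `rare` passes the WES rule but fails the stricter missingness and HWE thresholds applied to IMP data, which shows how the same variant can be treated differently by the two pipelines.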

Correlation analysis between IMP and WES dosage genotypes

We analyzed IMP data from European individuals to examine imputation accuracy. Although variants with an INFO of 0.3 or higher were included, we were concerned that poor imputation of RVs could reduce association strength and lower PRS prediction efficiency. To assess imputation accuracy, we compared IMP dosage genotypes to WES genotypes at exome variants. Because the IMP genotypes were aligned to GRCh37 (hg19) whereas the WES data were aligned to GRCh38 (hg38), we converted IMP variant coordinates to GRCh38 using UCSC LiftOver; variants that failed liftover or mapped ambiguously were excluded. Markers were grouped into bins based on five MAF regions (0.005–0.01, 0.001–0.005, 0.0005–0.001, 0.0001–0.0005, and ≤0.0001). Across three datasets, we evaluated the correlation between IMP and WES dosages with the Pearson correlation method in the R stats package (version 4.3.2). In each bin, we report the absolute correlation (|r|) between IMP and WES dosages at the same markers (N = 505,002). The average correlation between IMP and WES dosages declined as MAF decreased, with |r| dropping from ~0.93 for MAF 0.005-0.01 to ~0.16 for MAF ≤ 0.0001, reflecting poorer imputation of very rare variants due to weak LD with tagging SNPs (Supplementary Fig. 1, Supplementary Data 19)6,21,28.
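The binned comparison can be sketched as follows. This is an illustration only (the study used the R stats package): the bin boundary convention (lower bound exclusive, upper bound inclusive), the function names, and the toy dosage vectors are assumptions.

```python
# MAF bins from the text, as (lower, upper] intervals; the last bin is MAF <= 0.0001.
MAF_BINS = [(0.005, 0.01), (0.001, 0.005), (0.0005, 0.001), (0.0001, 0.0005), (0.0, 0.0001)]

def maf_bin(maf):
    """Return the bin containing maf, or None if it lies outside all bins."""
    for low, high in MAF_BINS:
        if low < maf <= high:
            return (low, high)
    return None

def pearson_r(x, y):
    """Pearson correlation; returns None when either vector has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / (sxx * syy) ** 0.5

def mean_abs_r_by_bin(markers):
    """markers: iterable of (maf, imputed_dosages, wes_genotypes) per shared marker."""
    totals = {}
    for maf, imp, wes in markers:
        bin_key = maf_bin(maf)
        r = pearson_r(imp, wes)
        if bin_key is None or r is None:
            continue
        total, count = totals.get(bin_key, (0.0, 0))
        totals[bin_key] = (total + abs(r), count + 1)
    return {k: total / count for k, (total, count) in totals.items()}

# Toy markers (hypothetical): a well-imputed variant and a poorly imputed ultra-rare one.
toy_markers = [
    (0.008, [0.0, 0.9, 0.0, 1.8, 0.1], [0, 1, 0, 2, 0]),
    (0.00005, [0.0, 0.1, 0.0, 0.0, 0.1, 0.0], [0, 0, 0, 1, 0, 0]),
]
by_bin = mean_abs_r_by_bin(toy_markers)
```

Taking the absolute value before averaging prevents correlations of opposite sign from cancelling within a bin.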

Phenotype data

For 13 binary traits, cases and controls were defined according to FinnGen disease endpoints and controls (available at https://www.finngen.fi/en/researchers/clinical-endpoints)32. Disease endpoints were mapped to International Classification of Diseases (ICD)-9 or ICD-10 codes following phecode mappings33. Participants were assigned a phecode if they had at least one ICD-9 or ICD-10 code associated with the phecode (data fields: 41203, 41271, 41202, and 41270) (see Supplementary Data 1). Cases included participants with (1) the phecode of interest, (2) relevant medical treatments/medications (data field: 20003), and (3) related operative procedures (OPCS-3 and OPCS-4; data fields: 41273, 41272). Controls included participants who lacked (1) the phecode of interest, (2) related phecodes, (3) self-reported diseases (data fields: 20001, 20002), (4) medical treatments/medications (data fields: 20003, 20004, 22167-22170), and (5) related operative procedures (OPCS-3 and OPCS-4; data fields: 41273, 41272)32,34. For sex-specific endpoints, such as breast cancer (BRE) and prostate cancer (PRO), only females were included for BRE and only males for PRO.

We also analyzed five quantitative traits: SBP, DBP, RHR, BMI, and FEV.

The mean SBP and DBP were derived from automated blood pressure readings (data fields: 4080 and 4079) at the initial assessment visit (2006-2010, instance 0). When automated readings were unavailable or only a single reading was recorded, manual measurements (data fields 93 and 94) were used35. To account for medication use (data fields: 6153 and 6177; code 2), we adjusted SBP by adding 15 mmHg and DBP by adding 10 mmHg36. The mean RHR was calculated using pulse rates recorded during automated blood pressure measurements (data fields: 102 and 95)37. If multiple measurements were available from the first visit, the mean of these values was used. We excluded individuals with conditions or medications that could affect RHR, including a history of tachycardia, atrial fibrillation, heart arrhythmia, heart valve issues, cardiac arrest, bundle-branch-block, pacemaker implantation, or medications like beta-blockers, calcium channel blockers, and others impacting heart rate38,39.

BMI was calculated based on the height and weight at the initial assessment (data field: 21001). If this was missing, the BMI derived from body composition estimation by impedance (data field: 23104) was used instead37. FEV was estimated from FEV1 values (data field: 3063), using the NHANES III spirometry Caucasian reference panel40,41. Individuals outside the height range of 147–207 cm for males and 145–189 cm for females were excluded from this reference.

Individuals with missing data were excluded from further analyses. Additional cohort characteristics are provided in Supplementary Data 1.

Genetic association studies

Genome-wide association studies (GWASs) were conducted in the discovery dataset (N = 242,497) using RVs from WES and IMP data, as well as CVs (only for quantitative traits) from IMP data. We first analyzed WES data from individuals of European ancestry, targeting 13 binary traits and 5 quantitative traits. The prevalence for binary traits ranged from 2% for heart failure (HF) to 26% for hypertension (HT) in the discovery set (Supplementary Data 2).

GWAS was performed using the Scalable and Accurate Implementation of Generalized mixed model (SAIGE) GENE+ program (v.1.3.0), an extension of the SAIGE R package designed for genome-wide association testing in large-scale datasets8,42. SAIGE-GENE+ accounts for sample relatedness through generalized mixed models, enabling efficient model fitting with either a full or sparse genetic relationship matrix (GRM). This method supports analyses of both binary and quantitative traits, performing single-variant association tests as well as set-based tests like BURDEN, SKAT, and SKAT-O. We employed a sparse GRM to estimate score statistic variances, as it requires less computational power than a full GRM. The sparse GRM was constructed by performing linkage disequilibrium (LD) pruning separately for WES and IMP data using PLINK v1.90, with parameters of a 50 bp window size, 5 bp step-size, and pairwise r2 < 0.05. This resulted in two GRMs calculated for the discovery samples: one with 127,844 LD-pruned variants for WES and another with 249,441 for IMP data43.

For gene-burden testing in WES data, we applied 9 variant mask categories (mask1 to mask9) defined by combinations of functional annotations and MAF thresholds. Specifically, we used three annotation categories (loss of function [lof] only, missense+lof, and missense+lof+synonymous) and three MAF thresholds (0.0001, 0.001, and 0.01). The nine resulting variant masks were defined as follows: mask1, lof only & MAF < 0.0001; mask2, lof only & MAF < 0.001; mask3, lof only & MAF < 0.01; mask4, missense+lof & MAF < 0.0001; mask5, missense+lof & MAF < 0.001; mask6, missense+lof & MAF < 0.01; mask7, missense+lof+synonymous & MAF < 0.0001; mask8, missense+lof+synonymous & MAF < 0.001; mask9, missense+lof+synonymous & MAF < 0.01. For each mask, gene-level association analyses were conducted using SAIGE-GENE+ (and SAIGE), adjusting for age, sex, genotyping array, and the first 20 principal components (PCs). All variants within the same gene and mask were assigned equal weights for estimating aggregate effect sizes (e.g., the estimated effect size for lof variants in gene A was applied uniformly to all lof variants within gene A)3.
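The nine masks are the Cartesian product of the three annotation sets and the three MAF thresholds. A minimal sketch that enumerates them in the order given above is shown below; the dictionary layout and the `in_mask` helper are illustrative and do not reflect the SAIGE-GENE+ group-file format.

```python
from itertools import product

# Annotation sets and MAF upper bounds, in the order used for mask1 to mask9.
ANNOTATION_SETS = [
    ("lof",),
    ("missense", "lof"),
    ("missense", "lof", "synonymous"),
]
MAF_THRESHOLDS = [0.0001, 0.001, 0.01]

MASKS = {
    f"mask{i}": {"annotations": set(annotations), "max_maf": max_maf}
    for i, (annotations, max_maf) in enumerate(
        product(ANNOTATION_SETS, MAF_THRESHOLDS), start=1
    )
}

def in_mask(annotation, maf, mask):
    """True if a variant with this annotation and MAF belongs to the mask."""
    return annotation in mask["annotations"] and maf < mask["max_maf"]
```

For example, a synonymous variant with MAF 0.00005 falls in mask7 to mask9 only, whereas a loss-of-function variant with the same MAF falls in all nine masks.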

Single-SNP association was performed using SAIGE on RVs with MAF between 0.0001 and 0.01. SAIGE has shown robust performance for testing RVs with MAF above 0.001 or 0.0001 for diseases with a prevalence above 3% (ref. 8). To control type I error, Firth logistic regression was applied in SAIGE for binary traits, effectively managing error rates in RV analyses for diseases with prevalence up to 10% (refs. 44, 45). Quantile-quantile (QQ) and Manhattan plots for single-SNP association results from WES data are provided in Supplementary Fig. 2.

We next analyzed IMP data from European individuals (N = 242,497) in the discovery set. Single-SNP associations were performed, applying Firth logistic regression with SAIGE for binary traits. QQ and Manhattan plots for IMP data are provided in Supplementary Fig. 3.

This study primarily focused on the impact of RVs on PRS. To streamline analysis, we used association summaries from FinnGen to obtain CV effect sizes for 13 binary traits (Supplementary Data 1). Genetic association analyses for CVs from IMP data were conducted via SAIGE for the five quantitative traits.

Construction of PRS

In the training dataset, we developed nine different rvPRSs using gene-burden association statistics and one rvPRS using single-SNP association statistics from WES data. Additionally, twelve distinct rvPRSs were constructed using single-SNP association statistics from the IMP data.

PRS derivation for CVs

For CVs, GWAS summary statistics for 13 binary traits were obtained from the FinnGen repository (https://www.finngen.fi/en/researchers/clinical-endpoints), and those for five quantitative traits were generated using SAIGE. To derive the cvPRS, we used PRS-CS, a Bayesian method that estimates the posterior mean effect size of each variant based on GWAS summary statistics and an external LD reference panel. PRS-CS applies a continuous shrinkage prior on variant effect sizes, enhancing the robustness of effect size estimates12. For the LD reference panel, we used HapMap3, optimized for European ancestry populations. Following the PRS-CS protocol, ambiguous A/T or G/C variants were filtered to avoid strand ambiguity, and only variants present in HapMap3 were analyzed12. We considered two shrinkage prior values (phi = 1 × 10−3 and 1 × 10−4) and used the PRS-CS auto option, which infers the optimal shrinkage prior from the data. For binary traits in case-control studies, we calculated the effective sample size (\({N}_{{eff}}\)) as follows:

$${N}_{{eff}}=\frac{4\times {N}_{{cases}}\times {N}_{{controls}}}{{N}_{{cases}}+{N}_{{controls}}}$$
(1)

This approach accounts for case-control imbalance, leading to more accurate polygenic risk estimates. For quantitative traits, the total sample size was used as \({N}_{{eff}}\). The final cvPRS was computed using PLINK v1.90 with the --score sum option, which calculates a polygenic score by summing the weighted effect sizes of each variant for each individual. The PRS for individual j, as a weighted sum of SNP allele counts, was defined as follows:

$$\widehat{{{PRS}}_{j}}={\sum }_{i=1}^{m}\widehat{{{{\rm{\beta }}}}_{i}}{x}_{{ij}}$$
(2)

where \(m\) is the number of SNPs, \(\hat{{{\beta }}_{i}}\) is the estimated effect size for SNP \(i\), and \({x}_{{ij}}\) is the allele dosage for SNP \(i\) in individual \(j\).
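Equations (1) and (2) can be illustrated with a short worked example. The function names and the numbers below are hypothetical; the sketch mirrors the formulas rather than the PRS-CS or PLINK implementations.

```python
def effective_sample_size(n_cases, n_controls):
    """Eq. (1): N_eff = 4 * N_cases * N_controls / (N_cases + N_controls)."""
    return 4 * n_cases * n_controls / (n_cases + n_controls)

def polygenic_score(betas, dosages):
    """Eq. (2): weighted sum of allele dosages for one individual."""
    if len(betas) != len(dosages):
        raise ValueError("betas and dosages must have the same length")
    return sum(beta * x for beta, x in zip(betas, dosages))

# A balanced design keeps the full sample size; an imbalanced one is down-weighted.
balanced = effective_sample_size(1000, 1000)
imbalanced = effective_sample_size(1000, 9000)

# Three SNPs with effect sizes 0.1, -0.2, 0.5 and dosages 2, 1, 1.
score = polygenic_score([0.1, -0.2, 0.5], [2, 1, 1])
```

With 1,000 cases and 9,000 controls, the effective sample size is 3,600 rather than 10,000, which reflects the information lost to case-control imbalance.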

PRS derivation for RVs

From gene-burden association statistics in WES data, we assigned identical weights to all variants within each gene in the same mask, applying the aggregate effect size from a given gene (e.g., all loss-of-function (lof) variants in gene A) to derive rvPRSGENE (ref. 3). Calculation of rvPRSGENE was conducted using PLINK v.1.90 with the --score sum option. The rvPRS of individual j was calculated as follows:

$$\widehat{{{PRS}}_{j}}={\sum }_{h=1}^{N}({\sum }_{i=1}^{{M}_{h}}\widehat{{{{\rm{\beta }}}}_{{hi}}}{x}_{{hij}})$$
(3)

where \(N\) is the number of genes, \({M}_{h}\) is the number of SNPs in gene \(h\), \(\hat{{{\beta }}_{{hi}}}\) is the gene-burden coefficient assigned to SNP \(i\) in gene \(h\), and \({x}_{{hij}}\) is the allele dosage of SNP \(i\) in gene \(h\) for individual \(j\).
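Equation (3) amounts to giving every qualifying variant in a gene that gene's burden coefficient and summing over genes. A minimal sketch follows, with hypothetical gene names, variant identifiers, and coefficients.

```python
def burden_prs(gene_beta, gene_variants, dosages):
    """Eq. (3): sum over genes of (gene-level beta * dosage) for each variant in the mask.

    gene_beta:     gene -> burden coefficient shared by all of its masked variants
    gene_variants: gene -> list of variant IDs in the mask
    dosages:       variant ID -> allele dosage for one individual (absent = 0)
    """
    score = 0.0
    for gene, variants in gene_variants.items():
        beta = gene_beta[gene]
        score += sum(beta * dosages.get(variant, 0.0) for variant in variants)
    return score

# Hypothetical example: two genes, three masked variants, one individual.
gene_beta = {"GENE_A": 0.8, "GENE_B": -0.3}
gene_variants = {"GENE_A": ["a1", "a2"], "GENE_B": ["b1"]}
individual = {"a1": 1, "b1": 2}
rv_score = burden_prs(gene_beta, gene_variants, individual)
```

Here the individual carries one allele in GENE_A and two in GENE_B, giving 0.8 × 1 + (−0.3) × 2 = 0.2.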

To generate rvPRS and cvPRS from single-SNP association summaries, we used PRSice-2 (v2.3.3), which is faster and more memory-efficient than PRSice-1 and alternative PRS software, such as LDpred2 (ref. 11) and lassosum (ref. 13), while maintaining similar predictive power9,24. Unlike LDpred2 and lassosum, which require an external LD reference for effect size shrinkage, PRSice-2 uses a clumping approach (retaining the SNP with the smallest GWAS P-value per locus) to minimize LD dependency9. PRSice-2 follows standard PRS calculation methods, involving clumping and P-value thresholding, and provides features such as automatic strand flipping10,24. For clumping, we used the training sample of 121,786 unrelated European individuals to estimate LD, with r2 = 0.2 and a region size of 1 Mb. P-value thresholds were varied from 5 × 10−8 to 1, incrementing by 5 × 10−5. PRS calculation was completed in PRSice-2 using the embedded PLINK (--score) function.

For WES data, we constructed one rvPRS (rvPRSWES) using variants with a MAF between 0.0001 and 0.01. For IMP data, we constructed twelve distinct rvPRS models (rvPRSIMP) using variant sets defined by combinations of MAF ranges, imputation quality (INFO scores), and genomic regions (exonic or not). The first four rvPRSIMP models (INFO > 0.3) were derived from variants with INFO > 0.3 and four different MAF ranges: Maf1 (0.005–0.01), Maf2 (0.001–0.01), Maf3 (0.0001–0.01), and Maf4 (< 0.01). The second four rvPRSIMP models (Exon only) used variants restricted to exonic regions, also with INFO > 0.3 and the same four MAF ranges. The last four rvPRSIMP models (INFO > 0.8) were based on variants with INFO > 0.8 and the same four MAF ranges as above.

Specifically, exon regions were defined from RefSeq annotations obtained via the UCSC Table Browser (GRCh37/hg19). We intersected these intervals with the IMP genotype dataset to extract RVs within exons and generated exon-restricted binary files. Applying the same INFO ( > 0.3) and MAF filters described above, these exon-only variant sets were then used to construct the rvPRSIMP (Exon only) models across all traits and MAF ranges in PRSice-2.
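The clumping and thresholding procedure used for these models can be sketched in a few lines. This is a simplified greedy illustration under stated assumptions, not the PRSice-2 algorithm itself: pairwise r2 values are supplied as a precomputed lookup, the window is applied as a maximum distance from the index SNP, and the SNP identifiers and statistics are hypothetical.

```python
def clump(snps, r2, r2_threshold=0.2, window_bp=1_000_000):
    """Greedy clumping: walk SNPs by ascending P-value and keep each index SNP,
    dropping nearby SNPs in LD with it.

    snps: SNP ID -> (chromosome, position, p_value)
    r2:   frozenset({id1, id2}) -> pairwise r2 (missing pairs are treated as 0)
    """
    kept, removed = [], set()
    for index_snp in sorted(snps, key=lambda s: snps[s][2]):
        if index_snp in removed:
            continue
        kept.append(index_snp)
        chrom, pos, _ = snps[index_snp]
        for other, (other_chrom, other_pos, _) in snps.items():
            if other == index_snp or other in removed or other in kept:
                continue
            close = other_chrom == chrom and abs(other_pos - pos) <= window_bp
            if close and r2.get(frozenset((index_snp, other)), 0.0) >= r2_threshold:
                removed.add(other)
    return kept

def threshold(kept, snps, p_threshold):
    """Retain clumped SNPs whose GWAS P-value is at or below the threshold."""
    return [s for s in kept if snps[s][2] <= p_threshold]

# Hypothetical summary statistics and LD values.
toy_snps = {
    "rs1": ("1", 1_000, 1e-9),
    "rs2": ("1", 5_000, 1e-6),
    "rs3": ("1", 900_000, 0.03),
    "rs4": ("2", 1_000, 1e-4),
}
toy_r2 = {frozenset(("rs1", "rs2")): 0.8, frozenset(("rs1", "rs3")): 0.05}

clumped = clump(toy_snps, toy_r2)
selected = threshold(clumped, toy_snps, p_threshold=1e-3)
```

In this toy example, rs2 is removed because it is in strong LD with the more significant rs1, while rs3 survives clumping but is dropped at a P-value threshold of 1 × 10−3.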

Estimation of polygenic risk score performance

To evaluate the association between PRS and traits, we performed logistic regression analysis for binary traits and linear regression analysis for quantitative traits in both the training set (N = 121,786) and validation set (N = 73,919). These models were fitted using the R statistical package (v4.1.0), with adjustments for age, sex, genotyping array, and PCs 1-20. The following models were applied:

Logistic regression model (binary trait):

$$\begin{aligned}{\rm{logit(trait)}} = {} & {\beta}_1 \times {\rm{PRS}} + \beta_2 \times {\rm{age}} + \beta_3 \times {\rm{sex}} + \beta_4 \times {\rm{genotyping\ array}} \\ & + \beta_5 \times {\rm{PC}}1 + \ldots + \beta_{24} \times {\rm{PC}}20\end{aligned}$$
(5)

Linear regression model (quantitative trait):

$$\begin{aligned}{\rm{trait}} = {} & {\beta}_1 \times {\rm{PRS}} + \beta_2 \times {\rm{age}} + \beta_3 \times {\rm{sex}} + \beta_4 \times {\rm{genotyping\ array}} \\ & + \beta_5 \times {\rm{PC}}1 + \ldots + \beta_{24} \times {\rm{PC}}20\end{aligned}$$
(6)

Nagelkerke’s pseudo-R2 for binary traits and adjusted R2 for quantitative traits were calculated to assess the explanatory power of the models. Prior to inclusion in the models, the PRS was standardized to have a mean of zero and a standard deviation of one. This standardization allows the PRS effect to be interpreted as the change per one standard deviation increase (e.g., odds ratio per 1 standard deviation (perSD OR) for binary traits and beta per 1 standard deviation (perSD Beta)), making the results more interpretable across traits.
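For reference, Nagelkerke's pseudo-R2 can be computed from the log-likelihoods of a fitted model and a reference model. The sketch below uses the standard definition (the Cox-Snell R2 rescaled by its maximum attainable value); the function names and the four-observation example are illustrative, and the choice of reference model is an assumption of the sketch.

```python
import math

def bernoulli_loglik(y, p):
    """Log-likelihood of binary outcomes y under predicted probabilities p."""
    return sum(math.log(pi) if yi == 1 else math.log(1 - pi) for yi, pi in zip(y, p))

def nagelkerke_r2(loglik_null, loglik_full, n):
    """Cox-Snell R2 divided by its upper bound."""
    cox_snell = 1 - math.exp(2 * (loglik_null - loglik_full) / n)
    max_cox_snell = 1 - math.exp(2 * loglik_null / n)
    return cox_snell / max_cox_snell

# Toy example: the null model predicts the prevalence (0.5) for everyone.
y = [1, 0, 1, 0]
ll_null = bernoulli_loglik(y, [0.5, 0.5, 0.5, 0.5])
ll_full = bernoulli_loglik(y, [0.9, 0.1, 0.9, 0.1])
r2 = nagelkerke_r2(ll_null, ll_full, n=len(y))
```

In this example the Cox-Snell value is about 0.691 and its maximum is 0.75, giving a Nagelkerke R2 of about 0.92.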

The logistic regression model, using the logit transformation, provides the odds ratio (OR) for the PRS, where:

$${\rm{OR}}={e}^{{\beta }_{1}}$$
(7)

Here, β1 represents the log-odds increase in disease risk per one standard deviation increase in the standardized PRS. The OR, along with 95% confidence intervals (CIs), was reported to assess the significance and precision of the PRS effect on disease risk.
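The standardization and the conversion from coefficient to odds ratio can be sketched as follows. The Wald-type interval (β1 ± 1.96 × SE on the log-odds scale) is an assumption of this sketch, since the text does not state how the confidence intervals were obtained; the function names and inputs are hypothetical.

```python
import math

def standardize(values):
    """Rescale to mean 0 and (sample) standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / (n - 1)) ** 0.5
    return [(v - mean) / sd for v in values]

def odds_ratio_with_ci(beta, se, z=1.96):
    """Return (OR, lower, upper) from a log-odds coefficient and its standard error."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

# Hypothetical coefficient for a standardized PRS.
or_per_sd, ci_low, ci_high = odds_ratio_with_ci(beta=math.log(1.5), se=0.05)
z_scores = standardize([1.0, 2.0, 3.0])
```

Because the interval is built on the log-odds scale and then exponentiated, it is asymmetric around the OR.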

For binary traits, we further assessed the improvement of rvPRS in risk prediction by comparing it with the “null model” and “cvPRS model” using two analyses: (1) net reclassification improvement (NRI) values and (2) integrated discrimination improvement (IDI). NRI quantifies the improvement in reclassification of individuals into the high-risk group when a new prediction model is added, while IDI measures the improvement in the average sensitivity (true positive rate) and specificity (true negative rate). The formulas for calculating NRI and IDI are as follows:

$$\begin{aligned}{\rm{NRI}} = {} & \left(\text{Number of correctly reclassified events} / \text{Total events}\right) \\ & + \left(\text{Number of correctly reclassified non-events} / \text{Total non-events}\right)\end{aligned}$$
(8)
$${\rm{IDI}} = \frac{\sum \left(\text{Predicted probability for events} - \text{Predicted probability for non-events}\right)}{N}$$
(9)

where N is the total number of individuals.

For these analyses, we split both the training set (N = 121,786) and validation set (N = 73,919) into two subsets (modeling and evaluation sets) with a 1:1 ratio. The rvPRS was fitted to a logistic regression model adjusted for age, sex, genotyping array, and 20 PCs. Predictive performance was assessed using continuous NRI and IDI metrics via the “PredictABEL” package in R (ref. 46). We generated NRI and IDI indices for two sets of comparison models: “null model” versus “full model (null model + rvPRS)”, and “cvPRS model” versus “rvPRS+cvPRS model”, where the “null model” is defined by the formula: trait ~ age + sex + genotyping array + PC1-20. We compared these indices to assess the improvement in predictive performance attributable to rvPRS.
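The two metrics can be illustrated with a minimal sketch based on their standard definitions: the continuous NRI is the net proportion of events whose predicted risk increases plus the net proportion of non-events whose predicted risk decreases, and the IDI is the change in the difference between mean predicted risk in events and in non-events. These definitions are more explicit than the simplified Eqs. (8) and (9); the function names and toy predictions are hypothetical and the code is not the PredictABEL implementation.

```python
def _mean(values):
    return sum(values) / len(values)

def continuous_nri(y, p_old, p_new):
    """Net upward movement in events minus net upward movement in non-events."""
    def net_up(indices):
        up = sum(p_new[i] > p_old[i] for i in indices)
        down = sum(p_new[i] < p_old[i] for i in indices)
        return (up - down) / len(indices)
    events = [i for i, label in enumerate(y) if label == 1]
    non_events = [i for i, label in enumerate(y) if label == 0]
    return net_up(events) - net_up(non_events)

def idi(y, p_old, p_new):
    """Gain in mean predicted risk for events minus the gain for non-events."""
    events = [i for i, label in enumerate(y) if label == 1]
    non_events = [i for i, label in enumerate(y) if label == 0]
    gain_events = _mean([p_new[i] for i in events]) - _mean([p_old[i] for i in events])
    gain_non_events = (_mean([p_new[i] for i in non_events])
                       - _mean([p_old[i] for i in non_events]))
    return gain_events - gain_non_events

# Toy predictions: baseline model versus baseline plus a new score.
y = [1, 1, 0, 0]
p_old = [0.5, 0.5, 0.5, 0.5]
p_new = [0.7, 0.6, 0.3, 0.6]
nri_value = continuous_nri(y, p_old, p_new)
idi_value = idi(y, p_old, p_new)
```

In the toy data both events move up (contributing +1.0), the two non-events move in opposite directions (contributing 0), and the mean risk separation between events and non-events widens by 0.20.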

Additionally, we stratified individuals into high-risk groups (top 5%, 10%, 15%, 20%, 25%, and 30%) and a normal group (41-60%) based on PRS ranking. We then estimated the odds ratio (OR) or beta for each high-risk group relative to the normal group for each trait. Logistic or linear regression models, as appropriate for the trait type, were fitted as:

trait ~ risk group (1 = high-risk group, 0 = normal group) + age + sex + genotype array + PC1-20
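The group assignment that feeds this model can be sketched as a rank-based rule. The function name, the use of integer percentages, and the handling of ties by sort order are assumptions of the sketch; individuals outside both groups are left unassigned and would be excluded from the comparison.

```python
def assign_risk_group(prs_values, top_percent=5, normal_range=(40, 60)):
    """Label each individual 1 (top `top_percent` of PRS), 0 (percentiles 41-60), or None.

    Percentile membership is evaluated with integer arithmetic on ranks to avoid
    floating-point edge cases at the group boundaries.
    """
    n = len(prs_values)
    order = sorted(range(n), key=lambda i: prs_values[i])
    groups = [None] * n
    low, high = normal_range
    for rank, i in enumerate(order, start=1):
        if rank * 100 > n * (100 - top_percent):
            groups[i] = 1
        elif n * low < rank * 100 <= n * high:
            groups[i] = 0
    return groups

# Toy example: 100 individuals whose PRS equals their index.
toy_groups = assign_risk_group(list(range(100)), top_percent=5)
```

Repeating the call with `top_percent` set to 10, 15, 20, 25, and 30 yields the other high-risk definitions while the reference group stays fixed.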

Genetic heritability (h2SNP)

We estimated the SNP genetic heritability (h2SNP) using the software package Genome-wide Complex Trait Analysis (GCTA) v1.94.1 (ref. 20). The GCTA method estimates additive genetic variance in unrelated individuals based on LD between variants. Using GCTA, we first calculated the GRM among individuals using only the SNPs that were used for PRS construction in the training and validation sets. Heritability (h2SNP) was then estimated using the genome-based restricted maximum likelihood (GREML) method implemented in the GCTA package, with adjustments for age, sex, genotyping array, and 20 PCs. Due to computational limitations, RV heritability (h2RV) was estimated separately for females and males in the training set, while it was computed for the entire sample in the validation set.

Statistics and reproducibility

We used Student’s t-test to compare the mean differences of quantitative variables between two groups and the chi-squared test to assess differences in the prevalence of binary traits between female and male individuals, using the R statistical package (version 4.3.2; www.r-project.org). To assess the improvement of tPRS (cvPRS + rvPRS) over cvPRS alone, we applied the r2redux package in R, which provides test statistics for comparing R2 values between models. Plots were generated using ggplot2 (version 3.5.1) in R.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.