Main

A principal promise of modern genetics is the ability to predict complex disease risk on the basis of a person’s genetic profile. If successful, health management strategies can be developed to mitigate risk (disease prevention) and to optimize care (early diagnosis and effective treatment). Large-scale studies by the UK Biobank (UKB) and the Electronic Medical Records and Genomics Network show that risk prediction on the basis of genetics holds promise, and several countries are exploring ways to implement risk-based management in clinical practice1,2. Using polygenic risk scores (PRS) to predict disease risk and identify individuals at high risk is an emerging ‘precision medicine’ approach to leverage genetic findings in clinical practice. However, a substantial limitation is that current PRS models are predominantly based on genome-wide association studies (GWAS) with participants of European ancestry (EUR)4,5, often leading to reduced predictive performance in groups of other ancestry6,7. To fully realize the potential of precision medicine for diverse global populations, population-specific phenome-wide genomic discovery must be performed at scale and clinically applicable polygenic risk models must be optimized in and across populations. To fill this research gap in a population of East Asian ancestry (EAS), we characterized the complex genetic architecture of the population of Han Chinese ancestry phenome wide, developed population-specific PRS and assessed the external validity of the models across populations with varying degrees of genetic similarity.

Populations of EAS represent nearly a quarter of the global population, but they account for only 3.95% of the participants in previous GWAS3. Although several biobanks have been built to recruit subjects from East Asia, they have moderate sample size (72,000–212,000), and many focus on specific conditions8,9,10,11,12. By contrast, biobanks with predominantly EUR participants13,14,15,16 have significantly larger sample sizes (224,000–635,000) and access to more comprehensive clinical data. The moderate sample size and limited phenotypes in existing EAS biobanks hamper discovery of unique genetic effects and preclude the development of robust and clinically useful PRS models for EAS.

We assembled a large non-EUR cohort, the Taiwan Precision Medicine Initiative (TPMI), and genotyped more than half a million participants across 16 medical centres in Taiwan from 2019 to 2023. All the participants, who are overwhelmingly of Han Chinese ancestry, contributed DNA samples for genetic profiling with a custom-designed genotyping array and consented to provide their longitudinal electronic medical records (EMRs) from 5 years before enrolment and into the future. The EMR dataset includes rich and accurate health-related phenotypes, including medical diagnoses and biochemical examinations17. Here we present the results of comprehensive genomic analyses with extensive genetic and medical data derived from the TPMI cohort, including phenome-wide GWAS and PRS model development. We identified numerous population-specific risk variants/genes, observed evidence of genetic pleiotropy and pinpointed clusters of traits that shared similar genetic aetiology. Then, we developed and validated PRS prediction models for numerous conditions against external datasets including those from the Taiwan Biobank (TWB), the UKB and the All of Us Project. Our results show the benefits of leveraging a large cohort from an understudied population to identify unique genetic underpinnings of the human phenome, interpret causal effects by means of fine mapping and colocalization and improve the performance of population-specific PRS models, which together better illuminate the clinical implications of genetic risk.

Diseases and quantitative traits in TPMI

We performed comprehensive genomic analyses, including GWAS, heritability estimation and PRS model building and evaluation, across a wide range of diseases and quantitative traits using 463,447 individuals genetically similar to Han Chinese reference populations from TPMI. We examined 695 dichotomized phenotypes (phecodes; case n > 2,000) and 24 quantitative traits (sample size > 100,000), spanning numerous disease categories (defined by phecode groupings18,19), such as neoplasms, metabolic disorders, circulatory conditions, autoimmune diseases and more (Fig. 1). The phecodes, derived from International Classification of Diseases codes18,19, alongside quantitative traits such as blood pressure, body mass index (BMI), liver enzymes and lipid levels, provide a robust dataset for exploring genetic contributions to human health (Supplementary Tables 1 and 2). The log-transformed case proportion identified from EMRs showed a moderate but significant correlation with the log-transformed 5-year disease prevalence from the National Health Insurance Research Database (NHIRD) in Taiwan20 (r = 0.656, P = 2.69 × 10−84) (Fig. 1a and Extended Data Fig. 1), indicating that the TPMI’s hospital-based design may not fully capture mild and common illnesses, which are primarily observed in local and primary care clinics. Figure 1b displays the sample sizes for 24 quantitative traits in the TPMI and highlights sample size variation across traits, a key measure affecting the power and precision of association analyses in the cohort.
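The comparison between EMR-derived case proportions and NHIRD prevalence can be illustrated with a minimal sketch. It computes a Pearson correlation on log-transformed proportions, as described above. The five value pairs are hypothetical placeholders and not study data, and the helper name `pearson_r` is introduced here for illustration only.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-phecode values (NOT taken from the study).
nhird_prevalence = [0.002, 0.010, 0.050, 0.120, 0.300]
tpmi_case_proportion = [0.004, 0.006, 0.030, 0.045, 0.090]

# Log-transform both axes before correlating, as in the comparison above.
log_prevalence = [math.log10(p) for p in nhird_prevalence]
log_case_proportion = [math.log10(p) for p in tpmi_case_proportion]

r_log = pearson_r(log_prevalence, log_case_proportion)
print(f"r on log scale = {r_log:.3f}")
```

Log transformation keeps a handful of very common phecodes from dominating the correlation, which is presumably why both axes are compared on that scale.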

Fig. 1: Scatter plot of the case proportion for phecodes and bar chart of sample size for quantitative traits in TPMI dataset.

a, The case proportion in TPMI is compared with the 5-year prevalence in the NHIRD for each phecode. Each dot represents a specific phecode, with the x axis showing the prevalence in NHIRD and the y axis showing the case proportion in TPMI. The correlation coefficient and significance were calculated using the two-sided Pearson correlation test. b, Bar chart showing the sample sizes for quantitative traits in the TPMI cohort. Each bar represents a trait, with the x axis indicating the category of quantitative trait and the y axis representing the corresponding sample size. Definitions for abbreviations in panel b can be found in Supplementary Table 2.

GWAS, fine mapping and results

Our GWAS identified at least one significant locus (P < 5 × 10−8) for 265 phecodes of the 695 tested and all 24 quantitative traits. Highlighting the robustness of the TPMI data, we observed a high replication rate of reported disease loci from EAS GWAS on the GWAS catalogue (actual/expected ratio (AER) = 78.17%, considering the statistical power with the published tool, PGRM21), particularly for endocrine and metabolic/hematopoietic diseases (AER = 88.68% and 84.62%, respectively; Extended Data Fig. 2 and Supplementary Table 3). Lower replication rates for respiratory disease (AER = 23.53%) may reflect limited case numbers; untyped genetic variants, such as rare variants, copy number variation and structural variants; or recruitment bias such as age distribution.

We applied the sum-of-single-effects model for fine mapping to identify the independent variant–trait associations and reported the genetic variant with highest posterior inclusion probability of identified credible sets as well as the single lead variant for the failed fine-mapping regions and major histocompatibility complex region (MHC region; chromosome 6: 25,391,792–33,424,245). Our analyses showed a total of 2,656 fine-mapping-identified independent association signals, including 1,309 from phecodes GWAS and 1,347 from quantitative traits. Notably, 95 new associations, defined as having no previously reported results within 1 Mb in the NHGRI-EBI GWAS Catalogue21 of relevant GWAS, were identified across 50 phecodes and seven quantitative traits. In addition, we identified 217 new hits from previously reported regions, defined as having low linkage disequilibrium (r2 < 0.1) with any variant observed in the NHGRI-EBI GWAS Catalogue within 1 Mb for the same phenotype (Supplementary Tables 4 and 5). After applying the multiple testing correction, 1,502 fine-mapped associations passed a Bonferroni-adjusted threshold (5 × 10−8/(695 + 24) = 6.95 × 10−11), as well as 21 previously unreported variants and 115 new hits.
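The Bonferroni-adjusted threshold quoted above follows directly from the number of traits tested. The short sketch below reproduces the arithmetic; the variable and function names are illustrative, and the example P value is the thyroid cancer association reported in this section.

```python
# Genome-wide significance level and number of traits, as stated in the text.
GENOME_WIDE_ALPHA = 5e-8
N_PHECODES = 695
N_QUANTITATIVE_TRAITS = 24

# Bonferroni correction: divide the per-trait threshold by the number of traits.
bonferroni_threshold = GENOME_WIDE_ALPHA / (N_PHECODES + N_QUANTITATIVE_TRAITS)


def passes_bonferroni(p_value):
    """True if an association survives the phenome-wide correction."""
    return p_value < bonferroni_threshold


print(f"threshold = {bonferroni_threshold:.3g}")
# An association at P = 2.8e-9 is genome-wide significant but does not
# survive the stricter phenome-wide threshold.
print(passes_bonferroni(2.8e-9))
```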

Of the 95 new genetic associations, 30 variants are rare (minor allele frequency (MAF) < 0.05) in populations of other ancestries (African or African American (AFR), Admixed American (AMR), South Asian (SAS) and non-Finnish European) in the Genome Aggregation Database (gnomAD), and 33 variants have a MAF below 0.01 in EUR, the most extensively studied population, which explains why they were not reported in previous GWAS. For example, single nucleotide polymorphism (SNP) rs17089782, a missense variant in PIBF1 (p.R405Q) on chromosome 13, is significantly associated with thyroid cancer (P = 2.8 × 10−9) in the TPMI cohort. This SNP has a MAF of 5.65% in TPMI but 0.01% in EUR, which may explain why this association was only detectable in TPMI. PIBF1 is essential for immune regulation, especially during pregnancy, and is relevant to autoimmune diseases and cancer22. Another variant identified in our analysis of BMI (rs761018157; P = 4.8 × 10−9, MAF in TPMI = 4.34%, MAF in EUR < 0.01%) maps to PHOX2B. This gene, highly expressed in the nervous system, had previously been linked to obesity hypoventilation syndrome in a small study (n = 30)23 and associated with bone mineral density24. In addition, when we compared the effect sizes in TPMI and UKB for the rest of the new findings, 25 exhibited a significantly different effect size (P < 0.05). For instance, a TPMI-identified platelet-count-associated variant, rs12955741, located in the intergenic region between TGIF1 and DLGAP1, exhibits a different effect size (β) compared to that in UKB (βTPMI = 0.044, βUKB = −0.005, P = 1.7 × 10−9). Moreover, the high hepatitis B virus carrier rate in Taiwan25 contrasts sharply with its rarity in European cohorts, enabling TPMI to identify new loci associated with viral hepatitis B (case number in TPMI = 23,618 versus UKB = 132). Among the 26 independent loci identified in our analysis of hepatitis B, 19 fine-mapped loci are new (Extended Data Fig. 3). Notably, 18 of these 19 loci were found to be associated with liver function or diseases (Supplementary Table 5). These new associations highlight the uniqueness of certain disease loci in the TPMI cohort, presenting opportunities for developing population-specific therapeutic interventions and advancing precision medicine.
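One common way to compare effect sizes from two independent cohorts is a two-sided z-test on the difference in β. The sketch below illustrates that idea only. The text does not state which test was used, so this is an assumption; the β values are those quoted for rs12955741, but both standard errors are hypothetical, and the resulting P value is therefore not expected to match the reported one.

```python
import math


def effect_size_difference_test(beta_a, se_a, beta_b, se_b):
    """Two-sided z-test for a difference between two independent effect sizes.

    Returns (z, p). Assumes the two estimates come from non-overlapping samples.
    """
    z = (beta_a - beta_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    # Two-sided P value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Betas as quoted in the text; standard errors are HYPOTHETICAL placeholders.
beta_tpmi, se_tpmi = 0.044, 0.006
beta_ukb, se_ukb = -0.005, 0.006

z_stat, p_diff = effect_size_difference_test(beta_tpmi, se_tpmi, beta_ukb, se_ukb)
print(f"z = {z_stat:.2f}, P = {p_diff:.2e}")
```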

All identified independent associations are summarized in Fig. 2. The identification of the MHC region as a significant hotspot on chromosome 6 emphasizes its extensive involvement in immune-related diseases across several categories. Similarly, the short arm of chromosome 11 (INS-KCNQ1 region) affects various traits, including metabolic, endocrine and genitourinary diseases. These hotspots of trait-relevant variants imply shared genetic mechanisms among diseases and the potential for pleiotropic effects.

Fig. 2: Phenome-wide independent variant–trait associations.

Vertical bars show the accumulated number of independent variant–trait associations for dichotomized phecodes (top) and quantitative traits (bottom). Each category of diseases and traits is represented by a corresponding colour. The x axis is chromosome number and the y axis represents the accumulated number of associations, highlighting the uneven distribution of trait-associated variants across phenotypes.

Heritability and colocalization

Linkage disequilibrium score regression analysis (LDSC)26 showed strong liability-scaled SNP heritability (h2) for conditions such as alcoholism (h2 = 0.213), retention of urine (h2 = 0.163) and open-angle glaucoma (h2 = 0.160). Among quantitative traits, body height (h2 = 0.323), BMI (h2 = 0.218) and high-density lipoprotein cholesterol (h2 = 0.191) exhibited the highest heritability estimates (Supplementary Table 6 and Supplementary Information), highlighting the significant role of genetics in these traits. These results have far-reaching implications for precision medicine, as higher heritability signals indicate the potential for more accurate genetic risk-prediction models that could improve personalized disease risk assessments.
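For readers unfamiliar with liability-scale heritability, the sketch below shows the standard conversion from an observed-scale estimate for a case-control trait, which corrects for population prevalence K and the case fraction P in the sample. This is the widely used textbook transformation and is shown as an illustration; whether the LDSC runs here used exactly these inputs is an assumption, and the numbers in the example are hypothetical.

```python
from statistics import NormalDist


def observed_to_liability_h2(h2_observed, population_prevalence, sample_case_fraction):
    """Convert observed-scale SNP heritability to the liability scale.

    h2_liab = h2_obs * K^2 (1 - K)^2 / (z^2 * P (1 - P)),
    where z is the standard normal density at the liability threshold
    corresponding to population prevalence K.
    """
    k = population_prevalence
    p = sample_case_fraction
    standard_normal = NormalDist()
    liability_threshold = standard_normal.inv_cdf(1.0 - k)
    density_at_threshold = standard_normal.pdf(liability_threshold)
    scale = (k ** 2 * (1.0 - k) ** 2) / (density_at_threshold ** 2 * p * (1.0 - p))
    return h2_observed * scale


# HYPOTHETICAL inputs: 5% population prevalence, 10% cases in the sample.
h2_liability = observed_to_liability_h2(0.05, 0.05, 0.10)
print(f"liability-scale h2 = {h2_liability:.3f}")
```

Because the conversion depends on K, an inaccurate prevalence estimate propagates directly into the liability-scale heritability.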

We then partitioned heritability at the gene level and identified 329 unique genes contributing significantly to phenotypic variation (h2 > 0.1% and Z-score > 1.64). Of these, 45 affected more than one phecode and/or quantitative category, including key genes such as APOE, APOC1, TOMM40, ABCG2 and KCNQ1 (Fig. 3 and Supplementary Table 7). We also conducted a colocalization analysis to elucidate the potential molecular function of identified GWAS signals with three expression quantitative trait locus (eQTL) datasets: the Genotype-Tissue Expression Project (GTEx)27, Multi-ancestry Analysis of Gene Expression (MAGE)28 and the Japan COVID-19 Task Force (JCTF)29 (Fig. 3 and Supplementary Table 8). Our results identified 391 unique genes that potentially mediate the outcome through their expression level (posterior probability > 0.9), including GBAP1, which colocalized with five different traits (uric acid, serum creatinine, hematocrit, hypertension and gout). Among the colocalized genes, 75 can be identified only in the multi-ancestry lymphoblastoid cell line eQTL dataset (MAGE; 20 genes) and/or the Japanese whole blood eQTL dataset (JCTF; 59 genes). Our findings demonstrate the effect of these genes (such as APOE, ABCG2 and KCNQ1) on several traits and disorders. By elucidating shared genetic effects, these results offer opportunities to develop precision medicine approaches that address comorbidities, such as treating hyperlipidaemia and reducing dementia risk through a single intervention targeting APOE, which influences both lipid metabolism and Alzheimer’s disease risk. This gene-level understanding emphasizes the potential to optimize therapeutic strategies by leveraging genetic pleiotropy in disease management.

Fig. 3: Gene-level heritability and colocalization with gene expression.

a,b, Circle plots showing gene-level heritability and colocalization (coloc) with gene expression for phecodes (a), summarized in parent (integer) phecodes, and quantitative traits (b). Dots represent gene-level h2 > 10−3; squares indicate colocalization posterior probability (PP) for the hypothesis that both traits are associated and share a single causal variant (H4) > 0.9; and triangles show both. Inner circle indicates the number of associated traits for each identified gene. Outer circle indicates chromosome number. The bar chart shows the number of identified genes by category and grouped by type of pleiotropy. We estimated h2 with h2gene57 and performed the colocalization analysis with coloc58.

Genetic correlation and clusters

Pairwise genetic correlation and clustering analyses showed three main phenotype clusters: cardiometabolic traits, autoimmune and infectious diseases, and kidney-related traits (Fig. 4 and Extended Data Fig. 4). The cardiometabolic cluster, which includes type 2 diabetes, hypertension and BMI, reinforces the interconnected phenotypic and genetic architectures of cardiovascular and metabolic diseases. The cluster of autoimmune and infectious diseases, which includes viral hepatitis B, psoriasis and systemic lupus erythematosus, illuminates shared immune system pathways and potential gene–pathogen interaction. The kidney-related cluster involved gout, chronic kidney disease, calculus of kidney and ureter, ankylosing spondylitis and measures of urea nitrogen, creatinine and uric acid. The shared genetic architecture provides opportunities to leverage the genetic risk of correlated traits while developing the PRS model.

Fig. 4: Genetic correlation among three identified trait clusters.

Heatmap displaying genetic correlations between trait clusters: cardiometabolic traits, autoimmune/infectious diseases and kidney-related traits. Genetic correlation was estimated using LDSC, with colours representing the correlation coefficients between traits. NOS, not otherwise specified.

Cross-population comparison

Cross-population comparisons30 with EUR GWAS from UKB showed varying degrees of transethnic genetic-effect correlation (ρge), with strong, statistically significant correlations for traits like cholelithiasis (ρge > 0.999), type 2 diabetes (ρge = 0.829) and ischaemic heart disease (ρge = 0.756), but moderate correlations for gout (ρge = 0.616) and psoriasis (ρge = 0.418) (Supplementary Table 6). The moderate correlations indicate the differentiated genetic mechanism and disease distribution across populations (gout case n = 24,411 in TPMI and 3,179 in UKB; psoriasis case n = 4,166 in TPMI and 2,197 in UKB). Therefore, these findings demonstrate the importance of population-specific genetic studies, as differences in genetic architectures between populations can significantly affect the accuracy of PRS models.

PRS development

Building on these insights, we developed and validated PRS models that demonstrated strong predictive performance for a wide range of diseases. Although we used five PRS tools, including LDpred231, Lassosum232, PRS-CS33, SBayesR34 and MegaPRS35 (Supplementary Tables 9–13), we found that LDpred2 outperformed the others for most traits (Extended Data Fig. 5). Therefore, we took the results of LDpred2 for further comparisons. Of the 265 PRS models for phecodes, area under the receiver operating characteristic curve (AUC) values exceeded 0.55 for 105 dichotomized phecodes with a significant P value (P < 0.05). Additionally, the explained variance of models for 24 quantitative traits ranged from 0.028 (aspartate aminotransferase) to 0.227 (height) (Supplementary Table 9 and Extended Data Fig. 6). The most predictive PRS models included highly heritable traits such as ankylosing spondylitis (AUC = 0.812 ± 0.016), psoriasis (0.709 ± 0.016), atrial fibrillation (0.702 ± 0.014), prostate cancer (0.696 ± 0.018), systemic lupus erythematosus (0.696 ± 0.015), rheumatoid arthritis (0.646 ± 0.011), type 2 diabetes (0.640 ± 0.005), female breast cancer (0.611 ± 0.010) and hypertension (0.610 ± 0.004). Interestingly, the PRS for hepatitis B also demonstrated high genetic predictability (0.654 ± 0.008). Because h2 represents the upper bound of variance that can be explained by PRS, we examined the proportion of heritability captured by our models (r2/h2). A total of 36 traits, including prostate cancer (r2/h2 = 0.054/0.070), type 2 diabetes (0.066/0.126) and high-density lipoprotein cholesterol (0.136/0.191), reached more than 50% of their SNP heritability, indicating that PRS can achieve near-optimal predictive accuracy for highly heritable traits. However, for complex diseases influenced by both genetic and environmental factors, PRS performance is inherently constrained by the fraction of heritability attributable to common variants. These findings reinforce the importance of SNP heritability as a reference point for evaluating PRS utility and highlight the need for larger, ancestrally diverse datasets to further enhance genetic prediction models (Extended Data Fig. 6).
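The r2/h2 ratios quoted above can be checked with a few lines of arithmetic. The sketch below uses only the three example traits and values given in the text; the dictionary and variable names are illustrative.

```python
# (r2, h2) pairs as quoted in the text for three example traits.
prs_r2_and_h2 = {
    "prostate cancer": (0.054, 0.070),
    "type 2 diabetes": (0.066, 0.126),
    "HDL cholesterol": (0.136, 0.191),
}

# Fraction of SNP heritability captured by the PRS for each trait.
fraction_captured = {
    trait: r2 / h2 for trait, (r2, h2) in prs_r2_and_h2.items()
}

for trait, fraction in fraction_captured.items():
    print(f"{trait}: {fraction:.1%} of SNP heritability captured")
```

All three ratios exceed 0.5, consistent with the statement that these traits reached more than 50% of their SNP heritability.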

Leveraging the identified clusters, we performed a multitrait PRS training, PRSmix+36, for the traits in each cluster (Fig. 5 and Supplementary Table 14). Notably, multitrait PRS models improved prediction accuracy for the cardiometabolic disease cluster with a 0.040 increase in AUC (from 0.608 to 0.648) and a 1.770-fold improvement in phenotypic variance explained (r2). The performances of autoimmune and kidney-related disease clusters were also enhanced, with average AUC improvements of 0.018 and 0.009, respectively (from 0.641 to 0.659 and 0.601 to 0.610, respectively) and 1.351-fold and 1.349-fold improvements in r². The significant enhancement of multitrait PRS prediction (comparing r2 of LDpred2 and PRSmix+ with a paired t-test, P = 1.07 × 10−13) highlights the potential of leveraging shared genetic architecture to enhance disease risk prediction. Figure 5 demonstrates the performance of single-trait and multitrait PRS across three disease clusters, as well as the differing effectiveness of PRS in predicting genetic risk across various disease categories.
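The comparison of single-trait and multitrait r2 relies on a paired t-test, in which each trait contributes one before/after pair. A minimal sketch of the test statistic follows. The r2 values are hypothetical and not study data, and the P value lookup against the t distribution is omitted to keep the example dependency-free.

```python
import math


def paired_t_statistic(single_trait, multitrait):
    """Paired t statistic and degrees of freedom for per-trait differences."""
    differences = [m - s for s, m in zip(single_trait, multitrait)]
    n = len(differences)
    mean_diff = sum(differences) / n
    # Sample variance of the differences (n - 1 denominator).
    variance = sum((d - mean_diff) ** 2 for d in differences) / (n - 1)
    standard_error = math.sqrt(variance / n)
    return mean_diff / standard_error, n - 1


# HYPOTHETICAL r2 values for five traits (single-trait vs multitrait PRS).
r2_single = [0.020, 0.035, 0.050, 0.012, 0.041]
r2_multi = [0.031, 0.049, 0.066, 0.019, 0.058]

t_stat, degrees_of_freedom = paired_t_statistic(r2_single, r2_multi)
mean_fold_change = sum(m / s for s, m in zip(r2_single, r2_multi)) / len(r2_single)
print(f"t = {t_stat:.2f} (df = {degrees_of_freedom}), mean fold change = {mean_fold_change:.2f}")
```

Pairing by trait removes the large between-trait differences in baseline r2, so the test responds only to the consistency of the per-trait improvement.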

Fig. 5: PRS performance for the three identified trait clusters.

a–c, Bar plots showing h2 and r2 for cardiometabolic trait cluster (a), autoimmune trait cluster (b) and kidney-related trait cluster (c). Grey bars indicate SNP heritability (estimated from TPMI GWAS unrelated set (n = 248,754) with LDSC) and the coloured bars represent the r2 values, indicating the proportion of variance explained by the PRS among TPMI validation set (n = 20,000) from single-trait PRS (LDpred2, red bar) or multitrait PRS (PRSmix+, blue bar). The dot-and-whisker plots showcase predictive accuracy using AUC with 95% confidence intervals for dichotomized traits. Asterisks indicate h2 estimates that consider the MHC region.

PRS external validation and comparison

To evaluate the robustness and generalizability of our PRS models, we performed an external validation of the models (hypertension, type 2 diabetes, viral hepatitis B, gout, calculus of kidney from PRSmix+ and others from LDpred2) in TWB (unrelated individuals genetically similar to Han Chinese reference populations, n = 88,628), UKB (self-reported EAS, n = 9,893) and All of Us (genetically inferred EAS, n = 6,895). We found that the prediction accuracy, measured by AUC, of our models ranged from 0.548 (glaucoma) to 0.712 (prostate cancer) in TWB, 0.557 (female breast cancer) to 0.634 (hypertension) in UKB and 0.520 (migraine) to 0.709 (gout) in All of Us (Extended Data Fig. 7). Although the TWB questionnaire did not contain specific details on hepatitis B status, we used antihepatitis B core total antibodies (Anti-HBc) as an indicator of infection or past infection and hepatitis B surface-antigen (HBsAg) as a marker of acute/chronic infection. Intriguingly, the AUCs for the TPMI-derived model of hepatitis B were 0.674 ± 0.003 for HBsAg and 0.530 ± 0.002 for Anti-HBc in TWB. These results demonstrate the high predictive value of the PRS for hepatitis B for predicting symptoms and severity of the disease.
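AUC, the accuracy measure reported throughout the validation, equals the probability that a randomly chosen case has a higher PRS than a randomly chosen control. The sketch below computes it directly from that definition. The scores are hypothetical, and the pairwise loop is written for clarity rather than for biobank-scale data.

```python
def auc_from_scores(case_scores, control_scores):
    """AUC as the fraction of case-control pairs ranked correctly (ties count 0.5)."""
    concordant = 0.0
    for case_score in case_scores:
        for control_score in control_scores:
            if case_score > control_score:
                concordant += 1.0
            elif case_score == control_score:
                concordant += 0.5
    return concordant / (len(case_scores) * len(control_scores))


# HYPOTHETICAL standardized PRS values for a handful of individuals.
prs_cases = [0.8, 1.2, 0.1, 1.5]
prs_controls = [-0.3, 0.4, 0.9, -1.0, 0.0]

auc = auc_from_scores(prs_cases, prs_controls)
print(f"AUC = {auc:.2f}")
```

An AUC of 0.5 corresponds to a score that cannot separate cases from controls, which is why values only slightly above 0.5 indicate limited discrimination.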

TPMI-derived PRS models perform better than the UKB EUR-derived models when applied to EAS for viral hepatitis B, type 2 diabetes, hypertension, gout and migraine (Extended Data Fig. 7). For the other traits, TPMI-derived models consistently outperform the UKB ones, although the confidence intervals overlap. The overlapping confidence intervals in UKB and All of Us may be due to their limited EAS sample sizes. These results indicate that population-specific PRS models allow for more accurate risk stratification and enable personalized healthcare interventions for EAS. Additionally, we assessed the performance of TPMI-derived PRS and TPMI-included cross-population PRS across various ancestry groups, including populations of EUR, AFR, AMR and SAS ancestry from the UKB and All of Us cohorts (Extended Data Fig. 8). Performance varied by disease, but consistent results were observed for female breast cancer and glaucoma across populations, and TPMI-included cross-population PRS slightly but not significantly improved performance in populations other than EAS and EUR.

Genetic risks on overall health measures

Although overall health is hard to define with a few metrics, herein, we used the count of clinical visits and duration of hospitalization to roughly describe individuals’ overall health. We found that 131 of the top-performing PRS models (LDpred2 models with AUC > 0.55 and all PRSmix+ models for phecodes and all models for quantitative traits) are significantly associated with overall health indices, explaining 8.47% of the variation in clinical visit frequency (P = 2.69 × 10−14) and 10.29% of the variation in hospitalization duration (P = 5.62 × 10−27; Extended Data Table 1 and Supplementary Table 15) in the comparison between top and bottom 5% groups after adjusting for sex, age and recruiting hospital. Among the identified clusters, the cardiometabolic disease cluster contributed the most to the indices, accounting for 1.32% of clinical visits (P = 0.02) and 3.55% of hospitalizations (P = 7.10 × 10−9). This may reflect the high prevalence of cardiometabolic diseases in the hospital-based TPMI cohort. In short, quantification of the effect of PRS for various diseases and traits on human health opens up opportunities for developing precision health management strategies.

Discussion

This study represents a large-scale GWAS in the population of Han Chinese ancestry, using data from around 500,000 individuals recruited from 16 medical centres across Taiwan. We investigated the genetic architecture of 695 dichotomized phecodes and 24 quantitative traits, identified 2,656 independent variant–trait associations and showed that population-specific PRS models for a wide range of diseases performed well in the population. Indeed, for the traits with sufficient sample size in the cohort, PRS performance rivals that of models developed for EUR using UKB data. These findings show that population-specific PRS models can be developed successfully for non-EUR populations, and our project serves as a model for large-scale genetic studies in other populations.

Recent large-scale projects that emphasize ancestral diversity in human genetic studies have yielded new findings through the inclusion of non-EUR subjects. The Million Veteran Program (MVP) conducted multi-ancestry GWAS on 635,000 participants, identifying more than 2,000 signals unique to non-EUR populations16. With the TPMI dataset, we performed larger GWAS for several traits in subjects genetically similar to Han Chinese reference populations than previously published studies. For instance, the previous largest meta-analysis for type 2 diabetes included 20,573 cases who were of Han Chinese ancestry37. By contrast, our GWAS included 59,289 cases of type 2 diabetes, almost tripling the number of cases ever tested, and identified five unreported type 2 diabetes-associated loci in known regions, demonstrating the power of the TPMI sample size. Identification of new and population-specific risk variants may lead to further understanding of their molecular mechanisms and underlines the need for population-specific weightings in PRS models. Moreover, population-specific findings also better explain the performance of population-specific PRS models in the population in question. In short, our population-specific genomic profiles for comprehensive phenotypes provide a solid foundation for PRS development.

Our understanding of the genetic factors influencing hepatitis B, an endemic infectious disease in Taiwan with an estimated hepatitis B virus carrier rate of 9.78% among the unvaccinated cohort (born before 1984)25, also benefited from the large dataset. With 23,618 cases, a significant increase from previous studies of only a few thousand cases38,39,40, we identified 26 fine-mapped signals, including 19 new loci, and showed a significant negative correlation between hepatitis B and autoimmune diseases such as Sicca syndrome, psoriasis and systemic lupus erythematosus. Our well-performing and validated PRS model for hepatitis B demonstrated that the host genome may determine the severity and symptoms of this infectious disease. This is similar to previous reports for COVID-19 and pneumonia, in which genetic factors have been shown to influence disease outcomes41,42,43,44. The unexpected success of GWAS and PRS for hepatitis B not only demonstrates the power of the large sample size of TPMI but also shows the necessity of population-specific genetic studies for population-enriched diseases. The benefits extend beyond differences in ancestry to include environmental factors such as pathogen exposure, food intake and lifestyle influences. Exploring how the human genome interacts with these diverse external and environmental factors can greatly enhance our understanding of how genetic variants contribute to disease susceptibility or severity.

In addition, the comprehensive phenotypic data allow us to investigate the genetic correlation among several traits that have substantial implications for clinical applications and leverage them to improve the performance of PRS models. By identifying shared genetic risks across diseases, at-risk individuals can be alerted to pursue early detection of comorbidities and targeted prevention strategies. For example, the clustering of cardiometabolic traits, such as type 2 diabetes, hypertension and BMI, highlights their interconnected genetic basis and indicates that individuals with a high genetic risk for one condition may benefit from early screening and intervention for related conditions45. Additionally, the shared genetic architecture allows the development of multitrait PRS models that integrate genetic risks across correlated traits, improving prediction accuracy and enabling precision medicine approaches that address several health outcomes simultaneously46. Including the correlated traits in PRS model development improved performance, resulting in an average 1.55-fold increase in the explained percentage of phenotypic variation (P = 1.07 × 10−13). Although previous studies have proven the utility of multitrait approaches for target diseases36,47,48, we have extended the use of this approach to a phenome-wide level and demonstrated the improvement across different types of traits. As a result, we produced well-performing PRS for various categories of diseases, including cardiometabolic diseases, autoimmune disorders and infectious diseases.

We evaluated our PRS models across several large cohorts, including the TWB, UKB and All of Us. The TPMI-derived PRS models consistently outperformed those developed from EUR when applied to diseases in people with Han Chinese or EAS ancestry from the three large cohorts. When comparing with EUR-derived PRS models, we also observed better performance across several traits in EAS, particularly for cardiometabolic and autoimmune diseases. Similarly, the TPMI-included cross-population model slightly improves performance in populations of other ancestries. These results highlight the need for population-specific models and emphasize the importance of genetic data from diverse populations to advance cross-population models. By integrating these well-developed PRS models, we estimate that genetics account for 10.3% of the variation in hospitalization duration in TPMI. Although the estimates of genetic contributions to health measures may be influenced by disease prevalence and ascertainment biases, our results indicate that integrating genetic risk-based health management strategies with traditional risk factors, such as age, sex, smoking and BMI, may enhance prediction models and refine personalized risk stratification.

As with other large-scale epidemiological studies, ascertainment bias is also observed in our study. TPMI’s case proportion shows a significant but moderate correlation with the prevalence from NHIRD, implying potential ascertainment bias of TPMI’s hospital-based design. Compared to the general population in Taiwan (NHIRD), TPMI participants are overrepresented in the middle-aged group (year of birth 1940–1970, 54.3% versus 38.3%), include slightly more females (55.1% versus 50.6%) and have a higher proportion of participants from northern Taiwan (59.5% versus 47.3%). These demographic differences, along with the volunteer-based recruitment process, probably contribute to the lower case proportions observed in TPMI (Fig. 1a). Importantly, a significant portion of disease records in the NHIRD originate from local clinics and primary care settings, which are not covered in TPMI. Notably, the EMRs of the participants are incomplete, as some participants receive care from several health providers, but the TPMI only has access to EMRs from their enrolment hospitals. We acknowledge that ascertainment biases may influence disease prevalence and heritability estimates. Thus, we accounted for case-control ascertainment by applying liability-scale transformations using population prevalence data from NHIRD and used independent validation for PRS (TWB, UKB and All of Us) to mitigate the effect of ascertainment biases, while acknowledging that residual biases may persist. Methods like inverse probability weighting could mitigate such biases49, but these require detailed external reference data that are unavailable at present. Additionally, we observed a relatively low estimated heritability for body height and BMI in TPMI, compared to values reported in the literature50,51. These estimates may be affected by factors such as inconsistencies in assessment across EMRs, variations in statistical approaches and reduced bias of assortative mating in TPMI population52,53,54. These limitations also emphasize the need for future adjustments to enhance generalizability.

In addition to ascertainment bias, our study has other common limitations. First, the TPMI cohort is not sufficiently large to study some of the severe subtypes of many diseases, such as diabetes insipidus and neurofibromatosis. Second, we attempted to use eQTLs to elucidate the molecular mechanism of diseases, but the underrepresentation of EAS in current eQTL datasets, such as GTEx, poses challenges27. Gene expression regulation varies across ancestries55,56, and differences in LD structures further complicate colocalization analyses. Compared with the GTEx whole blood eQTL, the multi-ancestry lymphoblastoid eQTL and Japanese whole blood eQTL showed 309 more gene–trait pairs. Therefore, ancestral diversity is urgently needed not only in genomic data but also in transcriptomic, proteomic, metabolomic and epigenomic datasets. Third, the current project retrieved EMRs from an average of 5 years before enrolment, so some important data, such as age of disease onset for the older participants, are not available. Incomplete EMRs lead to less precise case definition for some participants. Fourth, some of the younger participants have high-risk genetic profiles but are free of the corresponding diseases. The duration of the project is too short to determine whether they will eventually develop those diseases.

An effort is underway to gain access to the complete EMRs of the TPMI participants and to recruit more participants with severe subtypes of common diseases. The high-risk participants who are symptom-free are being followed to monitor disease development. Future studies are being planned to study the high-risk individuals who escape disease development to identify genetic and non-genetic factors that mitigate their disease risk. Furthermore, the meta-analysis integrating TPMI with other large-scale EAS biobanks, such as TWB11,12, Korean Genome and Epidemiology Study9, China Kadoorie Biobank10 and Biobank Japan8, may further enhance our understanding of the genetic aetiology in EAS and improve prediction models.

This study demonstrates that population-specific risk-prediction models, such as those developed for EAS in this work, can achieve strong predictive performance for traits with high relevance in that population. The PRS we developed for EAS performed well for several traits, including diseases with significant public health implications, such as type 2 diabetes and systemic lupus erythematosus. However, for certain traits, such as female breast cancer and glaucoma, PRS derived from both UKB and TPMI performed comparably, indicating that the genetic architecture of some traits allows for generalizable models. These findings emphasize the importance of developing and validating PRS models in diverse global populations to maximize their utility and equity in genetic risk prediction. Although our results emphasize the utility of developing population-specific PRS, further research is needed to directly compare their performance with that of multipopulation models, to assess their generalizability and to evaluate their effect on disease prevention and management. In particular, longitudinal studies and real-world implementations will be critical to determine the extent to which PRS-guided interventions can delay disease onset or improve health outcomes. Furthermore, we hope that if everyone can obtain their genetic profile and learn their risk for major diseases, many diseases can be prevented or their onset delayed significantly, thereby fulfilling the promise of modern genetics.

In conclusion, we used a large-scale dataset of individuals genetically similar to Han Chinese reference populations, produced by the TPMI, to conduct phenome-wide genetic analyses and leveraged these genetic findings to train risk-prediction models for several diseases and traits. The developed models were validated in EAS participants from different biobanks and demonstrated a consistent performance that bodes well for their use in populations of Han Chinese and EAS ancestry. Our approach can serve as a template for developing PRS models in populations that are currently without such resources, anticipating the time when all populations around the world can benefit from risk-based health management as part of the precision health movement.

Methods

Study population and phenotyping

We used the TPMI dataset, which links extensive EMRs with genotypic data for 486,956 individuals. Dichotomized disease status was defined by phecodes, which were based on information extracted from the EMR using International Classification of Diseases codes18,19. To ensure robustness, cases were defined by having the diagnosis of the relevant condition on two or more clinical visits. We also extracted quantitative traits from the EMR, including anthropometric, vital sign and laboratory measurements; we excluded the extreme outliers and removed or adjusted the treated and/or medicated measures on the basis of previous research; and the median value was kept if the participant had several qualified measures59 (Supplementary Information). In this study, we focused on 695 phecodes that had at least 2,000 cases and 24 quantitative traits that were measured in at least 100,000 individuals. These phecodes spanned 16 disease categories, including but not limited to infectious diseases, neoplasms, endocrine/metabolic disorders and circulatory system diseases. The 24 quantitative traits were categorized into anthropometric, circulatory, hematological, kidney-related, liver-related and metabolic measurements.
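The two-visit case definition above can be illustrated with a minimal Python sketch. It assumes that diagnoses arrive as (participant, phecode, visit date) records and that distinct visit dates stand in for distinct clinical visits; the function name, record layout and example identifiers are hypothetical and are not taken from the TPMI pipeline.

```python
from collections import defaultdict


def define_cases(diagnoses, min_visits=2):
    """Return {phecode: set of participant IDs} for participants whose phecode
    was recorded on at least `min_visits` distinct visit dates."""
    visit_dates = defaultdict(set)
    for participant_id, phecode, visit_date in diagnoses:
        # A set collapses repeated records from the same visit date.
        visit_dates[(participant_id, phecode)].add(visit_date)

    cases = defaultdict(set)
    for (participant_id, phecode), dates in visit_dates.items():
        if len(dates) >= min_visits:
            cases[phecode].add(participant_id)
    return dict(cases)


# Hypothetical records: P1 has two visits, P2 has one visit recorded twice.
records = [
    ("P1", "250.2", "2020-01-05"),
    ("P1", "250.2", "2020-03-11"),
    ("P2", "250.2", "2021-06-30"),
    ("P2", "250.2", "2021-06-30"),
]
cases = define_cases(records)
```

In this toy input only P1 qualifies as a case, because the two records for P2 fall on the same date.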

Genotyping and quality control

We performed genotyping using two customized high-density Axiom SNP arrays produced by Thermo Fisher, TPMv1 and TPMv2. The genotyping experiments were conducted in six genotyping centres in Taiwan17. The raw genotypic data underwent quality control: genetic variants were excluded when they had a call rate less than 0.98, MAF < 0.01 or Hardy–Weinberg equilibrium test P < 1 × 10⁻⁶. We also excluded individuals with an overall call rate less than 0.95, a failed heterozygosity check or a mismatch between documented and genetically determined sex. For this study, we included only the genetic variants found on both genotyping arrays and excluded variants with a significant batch effect in GWAS. The proportion of genetic ancestry was determined by ADMIXTURE60, and the projected principal component scores with 1000 Genomes as a reference panel were applied to determine individuals’ ancestry61. As a result, 401,710 genetic variants and 463,447 Han Chinese participants passed all quality control measures and were used in the subsequent studies. Details are found in Supplementary Information and on GitHub (https://github.com/TPMI-Taiwan/tpmi-qc).
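As an illustration of the variant-level thresholds stated above, the following Python sketch applies the three exclusion criteria to precomputed summary statistics. The class, field names and example variant identifiers are hypothetical; the actual pipeline is the one documented in the GitHub repository cited above.

```python
from dataclasses import dataclass


@dataclass
class VariantStats:
    variant_id: str
    call_rate: float  # fraction of samples with a non-missing genotype
    maf: float        # minor allele frequency
    hwe_p: float      # Hardy-Weinberg equilibrium test P value


def passes_variant_qc(v, min_call_rate=0.98, min_maf=0.01, min_hwe_p=1e-6):
    """Keep a variant unless it meets any of the stated exclusion criteria:
    call rate < 0.98, MAF < 0.01 or HWE P < 1e-6."""
    return (
        v.call_rate >= min_call_rate
        and v.maf >= min_maf
        and v.hwe_p >= min_hwe_p
    )


# Hypothetical variants, each of the last three failing one criterion.
variants = [
    VariantStats("var_ok", 0.995, 0.12, 0.40),
    VariantStats("var_low_call", 0.970, 0.12, 0.40),
    VariantStats("var_rare", 0.995, 0.004, 0.40),
    VariantStats("var_hwe", 0.995, 0.12, 1e-9),
]
kept = [v.variant_id for v in variants if passes_variant_qc(v)]
```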

Phasing and imputation

Phasing was conducted on quality-control-passed genotype data with SHAPEIT5 (ref. 62). Genome imputation was carried out with IMPUTE5 using a reference panel of 1,498 whole-genome-sequenced TWB subjects12,63. We also conducted postimputation quality control with exclusion criteria of INFO score ≤ 0.7 and MAF ≤ 0.01. In addition, we performed a chip-GWAS to minimize bias from the different chips, resulting in a dataset of 8,046,864 well-imputed common genetic variants.

Population structure and relatedness estimation

We performed a principal component analysis (PCA) on the basis of genotyped variants to capture the effect of population structure. To diminish the effect of close relatives, the main PCA was conducted in a genetically unrelated subset, and other subjects were projected with the calculated PC weightings. Then these PCA scores were leveraged to accurately quantify the proportion of identity by descent and degree of relatedness. The maximum unrelated set was determined on the basis of these estimated degrees of relatedness. PC-AiR and PC-Relate were used for PCA and relatedness estimation, and PRIMUS was used for identifying the maximum unrelated set with the third degree as threshold64,65,66.

GWAS

The entire dataset was divided into three subgroups: the GWAS set (n = 363,447), the training set (n = 80,000) and the testing set (n = 20,000). To maximize statistical power, we used a mixed-effect regression model to examine the association between genotype and the outcome of interest, with logistic regression for dichotomized phecodes and linear regression for quantitative traits. Quantile normalization was applied to quantitative traits to ensure a normal distribution. The mixed-effect model accounted for relatedness among individuals by including a random effect for pairwise kinship. The model was also adjusted for key covariates, including age, sex, age², interactions between age/age² and sex, genotyping chip, enrolment hospital and ten genetic principal components to control for population stratification. SAIGE was applied for the mixed-effect model GWAS67. In the GWAS set, we selected an unrelated subset (n = 248,754) to perform GWAS using a generalized linear model with PLINK2, and we conducted 1:10 age- and sex-matching for the traits with an imbalanced case/control ratio (less than 1/20). These PLINK2 GWAS statistics were then used for heritability and genetic correlation estimation68.
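The text does not specify which quantile normalization was applied to the quantitative traits. A common choice in GWAS is the rank-based inverse normal transformation, sketched below in Python under that assumption; the function name and the offset of 0.5 are illustrative choices, not details taken from this study.

```python
from statistics import NormalDist


def inverse_normal_transform(values, offset=0.5):
    """Map values to standard normal quantiles through their ranks.
    Tied values share their average rank. This is an assumed variant of
    quantile normalization, not necessarily the one used in the study."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    std_normal = NormalDist()
    # (rank - offset) / n always lies strictly between 0 and 1.
    return [std_normal.inv_cdf((r - offset) / n) for r in ranks]


z = inverse_normal_transform([3.0, 1.0, 2.0])
```

The transformed values preserve the ordering of the inputs and are symmetric around zero, which is what makes the trait suitable for a linear model that assumes normally distributed residuals.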

Replication evaluation

To systematically evaluate the performance of our GWAS, we leveraged a presummarized phenotype–genotype reference map69, which collected 5,879 genetic associations for 149 unique phecodes from 523 published GWAS, including 1,215 associations from EAS. We calculated the overall and power-adjusted replication rates and the actual over expected ratio for each available phecode and category. The R package PGRM was used to measure the quality of biobank data through replication69.

Fine mapping

We performed fine mapping to identify the independent GWAS signals in all genomic regions containing any variant with a P value less than 5 × 10⁻⁸, with each region defined as plus or minus 1.5 Mb around the regional lead variant14. The MHC region (chromosome 6: 25,391,792–33,424,245) was excluded because of its complex linkage disequilibrium structure. We used the reported 95% credible set to determine the independent signals, and up to ten signals were allowed for each region. The genome-wide significant threshold was applied for defining a credible set as an independent hit, and a further requirement of log Bayes factor > 2 was applied for the second hit. For regions in which fine mapping failed and for the MHC region, we used the lead SNP as the hit of each significant region. SuSiE was used for this summary statistics-based fine mapping with linkage disequilibrium derived from our imputation reference panel70, which reflects our study population’s genetic architecture. Although using linkage disequilibrium from the GWAS sample might improve accuracy, we used the imputation panel for computational efficiency.
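The region definition can be sketched in Python as follows. The thresholds and MHC coordinates come from the text above, but the greedy procedure (take the most significant remaining variant as a lead and skip hits already covered by an earlier window) is an assumption made for illustration, because the text does not state how overlapping windows were handled.

```python
GW_SIG = 5e-8          # genome-wide significance threshold
FLANK_BP = 1_500_000   # plus or minus 1.5 Mb around each lead variant
MHC = ("6", 25_391_792, 33_424_245)


def in_mhc(chrom, pos):
    return chrom == MHC[0] and MHC[1] <= pos <= MHC[2]


def define_regions(variants):
    """variants: iterable of (chrom, pos, p_value) tuples.
    Returns a list of (chrom, start, end, lead_pos) regions, ordered by
    lead significance. Greedy window construction is an assumption."""
    hits = sorted((v for v in variants if v[2] < GW_SIG), key=lambda v: v[2])
    regions = []
    for chrom, pos, _p in hits:
        if in_mhc(chrom, pos):
            continue  # MHC hits are handled separately via their lead SNP
        covered = any(
            c == chrom and start <= pos <= end for c, start, end, _ in regions
        )
        if not covered:
            regions.append((chrom, max(1, pos - FLANK_BP), pos + FLANK_BP, pos))
    return regions


# Hypothetical summary statistics.
example = [
    ("1", 5_000_000, 1e-12),
    ("1", 5_800_000, 1e-9),    # falls inside the window of the first lead
    ("1", 9_000_000, 1e-10),
    ("6", 30_000_000, 1e-30),  # inside the MHC, skipped
    ("2", 1_000_000, 1e-5),    # not genome-wide significant
]
regions = define_regions(example)
```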

New association identification

We comprehensively compared our GWAS results with reported significant signals on the NHGRI-EBI GWAS Catalogue21, downloaded on 11 March 2024. The mapping of phecodes and quantitative traits to GWAS catalogue phenotypes is summarized in Supplementary Table 16. We classified a variant as new if the fine-mapped independent signal was not located within 1 Mb of any reported genome-wide significant association (P < 5 × 10⁻⁸) for the corresponding phenotype. Additionally, a variant was considered a new hit if its highest linkage disequilibrium r² with any reported significant association within 1 Mb was less than 0.1. Associations derived from uncertain and umbrella phecodes were excluded, and for duplicated genetic variants or regions, we reported only the association with the smaller P value or from the phecode with the more specific definition. Finally, we used ANNOVAR to annotate the new variants with data from the RefSeqGene database (updated 17 August 2020)71,72. For the new variants, we explored their allele frequencies in non-Finnish European, AFR, SAS and AMR from gnomAD73. We also compared the effect size between TPMI and UKB with a t-test to investigate ancestry-specific effects.
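The two novelty criteria can be expressed as a short decision rule. The Python sketch below assumes that reported associations for the matching phenotype are available as (chromosome, position) pairs and that pairwise linkage disequilibrium r² can be looked up through a caller-supplied function; both the data layout and the function names are hypothetical.

```python
WINDOW_BP = 1_000_000   # 1 Mb window around the fine-mapped signal
R2_THRESHOLD = 0.1      # maximum r² allowed for a signal to count as new


def is_new_signal(signal, reported, ld_r2):
    """signal: (chrom, pos); reported: list of (chrom, pos) for the same
    phenotype; ld_r2: function returning r² between two variants.
    Treating a distance of exactly 1 Mb as 'within' is an assumption."""
    nearby = [
        r for r in reported
        if r[0] == signal[0] and abs(r[1] - signal[1]) <= WINDOW_BP
    ]
    if not nearby:
        return True  # criterion 1: no reported association within 1 Mb
    # criterion 2: low LD with every reported association in the window
    return max(ld_r2(signal, r) for r in nearby) < R2_THRESHOLD


# Hypothetical reported hits and LD values.
reported_hits = [("1", 1_200_000), ("1", 9_000_000)]
ld_table = {(("1", 1_000_000), ("1", 1_200_000)): 0.65,
            (("1", 8_500_000), ("1", 9_000_000)): 0.02}


def lookup_r2(a, b):
    return ld_table.get((a, b), 0.0)


known = is_new_signal(("1", 1_000_000), reported_hits, lookup_r2)
independent = is_new_signal(("1", 8_500_000), reported_hits, lookup_r2)
isolated = is_new_signal(("2", 5_000_000), reported_hits, lookup_r2)
```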

Heritability, genetic correlation and clustering

To quantify the genomic contribution to specific traits, we applied linkage disequilibrium score regression to estimate the SNP-based heritability with LDSC26. The GWAS summary statistics and the precalculated linkage disequilibrium scores from the EAS superpopulation of 1000 Genomes were used61. For the dichotomized traits, we performed a liability-scale transformation on the observed heritability using the 5-year population prevalence from the NHIRD of the Health and Welfare Data Science Center20,74. For traits with a higher prevalence in our dataset (TPMI) than in the population (NHIRD), we applied the equation from ref. 75. For other traits, we used the adapted equation from ref. 74. Additionally, we conducted LDSC to obtain pairwise genetic correlations to assess the similarity of genetic mechanisms between traits76. On the basis of the genetic correlation matrix, we used a hierarchical cluster analysis to identify groups of traits that share genetic mechanisms. We used the weighted pair group method with arithmetic mean for clustering, and the resulting cluster tree was used for group identification. Moreover, we estimated the genetic correlation across populations, TPMI and UKB, to assess differences in genetic architecture between populations of different ancestry. For the UKB GWAS, we applied a generalized linear model from PLINK2 with the predefined phecode (https://github.com/umich-cphds/createUKBphenome) and corresponding baseline quantitative measures among the identified unrelated set (n = 378,544). We used Popcorn for the cross-population genetic correlation, and two correlation coefficients were calculated: the transethnic genetic-effect correlation (ρge) and transethnic genetic-impact correlation (ρgi)30.
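The equations from refs. 74 and 75 are not reproduced in the text. As a hedged illustration, the Python sketch below implements a widely used ascertainment-corrected transformation from the observed scale to the liability scale, h²_liab = h²_obs × K²(1 − K)² / (z² × P(1 − P)), where K is the population prevalence, P is the case proportion in the sample and z is the standard normal density at the liability threshold. Whether this matches either of the cited equations exactly is an assumption, and the function and parameter names are illustrative.

```python
from statistics import NormalDist


def liability_scale_h2(h2_obs, population_prevalence, sample_prevalence):
    """Convert observed-scale heritability to the liability scale.
    Uses a commonly applied ascertainment-corrected formula; it is shown as an
    illustration and is not confirmed to be the equation used in the study."""
    K = population_prevalence
    P = sample_prevalence
    std_normal = NormalDist()
    threshold = std_normal.inv_cdf(1 - K)  # liability threshold for prevalence K
    z = std_normal.pdf(threshold)          # normal density at the threshold
    return h2_obs * (K * (1 - K)) ** 2 / (z ** 2 * P * (1 - P))


# Hypothetical trait: 5% population prevalence, 10% case proportion in the sample.
h2_liab = liability_scale_h2(0.05, population_prevalence=0.05, sample_prevalence=0.10)
```

When the sample case proportion equals the population prevalence, the expression reduces to the unascertained form h²_obs × K(1 − K) / z².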

Gene-level heritability and colocalization

We used both gene-level heritability estimation and colocalization analysis to map our GWAS findings to functional units, specifically genes. We conducted h2gene analysis to partition SNP-based heritability to the gene level57. We estimated heritability for genes that overlapped with fine-mapped regions, where gene regions were defined as the gene body plus or minus 10 kb for gene-level heritability. Additionally, to illustrate the molecular functions of genes of interest, we used colocalization analysis to examine whether tissue-specific gene expression and the traits of interest share causal genetic variants. We used eQTL resources from 49 tissues in GTEx v.8 (ref. 27), lymphoblastoid cell lines in MAGE28 and whole blood in JCTF29, testing any gene with genome-wide significant signals in the cis-regulation region (transcription start site plus or minus 1 Mb). The posterior probabilities were used to evaluate colocalization between gene expression and the trait of interest. The R package coloc was used with SuSiE, relaxing the single causal variant assumption58,77,78.

Single-trait and multitrait PRS

The reserved dataset of 100,000 unrelated TPMI subjects was split into two subsets, training (n = 80,000) and validation (n = 20,000), for PRS model building. Five popular PRS tools were used: LDpred2 (ref. 31), Lassosum2 (ref. 32), PRS-CS33, SBayesR34 and MegaPRS35. The training subset was used for parameter selection and model optimization if needed. LDpred2, PRS-CS and SBayesR assumed that the effects of genetic variants follow a mixture distribution with different predefined parameters and applied a Bayesian framework for distribution estimation. Lassosum2 used a penalized regression (LASSO) to estimate weights, and MegaPRS leveraged MAF and linkage disequilibrium for model building. We then used the validation subset to evaluate the performance of PRS models. Individual scores were calculated with PLINK2 (ref. 68). The explained variance (r²) was used to evaluate the performance of PRS for quantitative traits75,79, and two indices, AUC and liability-scaled r², were used for PRS of phecodes. We followed the approach of ref. 79 and report both raw r² and r² adjusted for covariates (sex, age and PCs). Additionally, we include partial r² estimates, calculated using the R package rsq. To account for population stratification in cross-cohort predictions, we also report r² with PCs as covariates in Supplementary Tables 9–14. For AUC comparisons, we include a baseline model incorporating standard covariates (sex, age and PCs) to better assess the added predictive power of PRS. We used the likelihood ratio test to obtain the significance for r² with the R package lmtest, and we calculated the standard error for AUC with the R package auctestr. To further leverage pleiotropy and the genetic mechanisms shared among traits, we conducted multitrait PRS model building for the traits in the same genetic cluster, based on the pairwise genetic correlations identified in the previous step. We pooled all PRS models from the five tools for those identified traits and applied an elastic net regression to combine their weights and find the optimal model for the target trait. We used PRSmix+ for multitrait PRS model building36. The cross-population PRS models were based on both TPMI and UKB European GWAS (https://pheweb.org/UKB-TOPMed/), and PRS-CSx was applied80.
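To make the scoring and evaluation steps concrete, the following Python sketch computes an individual score as a weighted sum of allele dosages and evaluates discrimination with the AUC, here computed as the probability that a randomly chosen case scores higher than a randomly chosen control. The study used PLINK2 for scoring and R packages for evaluation; this simplified stand-in omits covariates and standard errors, and the variant identifiers and weights are hypothetical.

```python
def polygenic_score(dosages, weights):
    """Weighted sum of effect-allele dosages; variants without a weight are ignored."""
    return sum(weights[v] * d for v, d in dosages.items() if v in weights)


def auc(case_scores, control_scores):
    """Area under the ROC curve as the fraction of case-control pairs in which
    the case has the higher score, with ties counted as one half."""
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical weights and one participant's dosages (0, 1 or 2 effect alleles).
weights = {"rs1": 0.2, "rs2": -0.1}
score = polygenic_score({"rs1": 2, "rs2": 1, "rs3": 1}, weights)

# Hypothetical scores for two cases and three controls.
example_auc = auc([0.9, 0.4], [0.1, 0.4, 0.5])
```

The pairwise formulation is quadratic in sample size, so it suits small illustrations only; rank-based implementations give the same value more efficiently on cohort-scale data.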

External validation and comparison

We conducted an external validation of our developed PRS using data from the TWB, EAS from UKB and All of Us. TWB is a community-based biobank that has recruited over 200,000 participants in Taiwan. Herein, we used 88,628 unrelated subjects (more distant than third degree; 5,242 overlapping individuals were removed), who were genotyped with the Axiom customized chip TWB2 (equivalent to TPMv1); their genotyping quality control, phasing and imputation followed the same protocol as described above. The self-reported disease condition was queried from their baseline questionnaire, except for cancer. Because the study design of TWB excluded cancer patients at recruitment, we used both baseline and follow-up self-reporting data to define cancer cases and controls. UKB has enrolled approximately 500,000 participants since 2006 and linked their genetic data with rich phenotypic data. For UKB validation, we used the inpatient records for case definition. Ancestral population was determined by self-reported ethnic background: Chinese as EAS (n = 1,572); white, British, Irish and any other white background as EUR (n = 472,869); Black or Black British, Caribbean, African and any other Black background as AFR (n = 8,074); and Asian or Asian British, Indian, Pakistani, Bangladeshi and any other Asian background as SAS (n = 9,893). All of Us intends to enrol more than 1 million participants in the United States and has released whole-genome genotyping data for approximately 312,000 participants as of the first quarter of 2024. We applied ADMIXTURE with 1000 Genomes as a reference panel to assign the genetically inferred ancestral populations, including EAS (n = 6,895), EUR (n = 152,754), AFR (n = 60,964), AMR (n = 32,394) and SAS (n = 2,334). The genetically confirmed EAS as well as other superpopulations and their linked EMR were used for validating our PRS models. Moreover, we compared the TPMI-derived PRS model with UKB-derived models to investigate the performance of population-specific PRS. The UKB-derived models were based on published UKB European GWAS (https://pheweb.org/UKB-TOPMed/), and LDpred2-auto was applied for model building.

Overall health measures evaluation

We evaluated the genetic effect on overall health measures. We used the number of clinical visits and the aggregate duration of hospitalization as overall health indices. Owing to collinearity among PRS for different traits, we used a partial least squares generalized linear model to extract components from the PRS of qualified traits with the R package plsRglm81. The number of extracted components was determined by the Akaike Information Criterion. We then estimated the covariate-adjusted proportion of genetic contribution (r²) by comparing the full model with the null model, which included only covariates such as sex, age and hospital. We used a likelihood ratio test to obtain the significance of the regression models. For each index, we used three models to compare the top and bottom 5%, 10% and 20%. We selected covariate-matched controls from subjects without hospitalization records as the bottom group for hospitalization models.

Ethics statement

This study was approved by the Institutional Review Boards of Taipei Veterans General Hospital (2020-08-014A), National Taiwan University Hospital (201912110RINC), Tri-Service General Hospital (2-108-05-038), Chang Gung Memorial Hospital (201901731A3), Taipei Medical University Healthcare System (N202001037), Chung Shan Medical University Hospital (CS19035), Taichung Veterans General Hospital (SF19153A), Changhua Christian Hospital (190713), Kaohsiung Medical University Chung-Ho Memorial Hospital (KMUHIRB-SV(II)-20190059), Hualien Tzu Chi Hospital (IRB108-123-A), Far Eastern Memorial Hospital (110073-F), Ditmanson Medical Foundation Chia-Yi Christian Hospital (IRB2021128), Taipei City Hospital (TCHIRB-10912016), Koo Foundation Sun Yat-Sen Cancer Center (20190823 A), Cathay General Hospital (CGH-P110041), Fu Jen Catholic University Hospital (FJUH109001) and Academia Sinica (AS-IRB01-18079). Written informed consent was obtained from the subjects in accordance with institutional requirements and the Declaration of Helsinki principles. All collected information was de-identified before statistical data analysis. The analysis with TWB was approved by Institutional Review Boards of Academia Sinica (AS-IRB-BM-19014), and the NHIRD analysis with the Health and Welfare Data Science Center (HWDC) was approved by Institutional Review Boards of Academia Sinica (AS-IRB-BM-23056). This research has been conducted using the UKB Resource under UKB Main Application 15326. We worked with All of Us data using the All of Us Researcher Workbench under the workspace ‘Duplicate of Prediction of Polygenic Traits’.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.