Introduction

As of 2021, approximately 537 million people worldwide had diabetes, and approximately 6.7 million people die from the disease annually1. In Taiwan, diabetes is the sixth leading cause of death, with a 2022 mortality rate of 52.7 per 100,000 people2. Type 2 diabetes (T2D) comprises approximately 90% of diabetes cases3. Risk factors for T2D include genetics, obesity, environmental factors, and an unhealthy lifestyle4,5,6. These risk factors lead to insulin resistance, which can result in poor blood sugar control and hyperglycemia7. The complications of T2D include cardiovascular disease, various neuropathies (such as peripheral, autonomic, focal, and proximal neuropathy), heart disease, kidney disease (leading to kidney failure), retinopathy, injuries to limb extremities, skin lesions (which are prone to infection and heal slowly), and sexual dysfunction (such as erectile dysfunction)8,9,10. Additionally, more diseases related to T2D are emerging.

The correlation between polygenic risk scores (PRSs) and T2D has been extensively studied, and advancements in PRS models since 2007 have improved T2D risk assessment. Early PRSs were constructed using a small number of genome-wide significant variants, and although recent studies have explored more complex approaches involving larger sets of variants, the number and selection of SNPs often vary depending on study design, target population, and predictive goals. Advances in GWAS datasets and computational methods have generally improved the predictive accuracy and broader applicability of PRSs across populations. PRSs are crucial in clinical medicine for risk prediction, early diagnosis, and treatment response prediction, leading to better disease outcomes and fewer adverse drug reactions. In T2D research, PRSs have been increasingly applied to predict disease onset, complications, and treatment outcomes, particularly when combined with clinical variables11,12. Despite this, large-scale efforts to establish and apply PRS models for T2D remain limited in Taiwan. Combining PRSs with clinical data allows for more interpretable risk scores that facilitate population-level risk stratification and individualized prevention. Furthermore, constructing risk score models is beneficial to convert the predicted probabilities of outcome events into a simple score. This transformation enhances the readability of the results and facilitates the risk assessment of subjects. Risk scores allow stratification of a population for targeted screening. They use risk factor data to calculate an individual’s total score; a higher score represents a higher risk of disease. Over the past decade, a plethora of risk score methods have been developed and validated for detecting those at risk of T2D13,14,15,16. 
In this study, we conducted a GWAS to identify and validate genetic variants related to T2D, constructed a risk score model that included PRSs and clinical variables to predict the risk of T2D, and used a phenome-wide association study (PheWAS) to identify other diseases associated with T2D in a Taiwanese population. Additionally, bioinformatics network analysis was conducted to explore the signaling pathways involved in the pathogenesis of T2D.

Materials

Data source

In this study, electronic medical records (EMRs) from the China Medical University Hospital (CMUH), one of the largest medical centers in Taiwan, were utilized. For research purposes, we accessed demographic information, laboratory data, International Classification of Diseases (ICD) codes, and medical procedure data, including personal medical information contained within the CMUH EMRs. Additionally, the CMUH database contains genetic information from patients as part of the CMUH Precision Medicine Project (CMUH-PMP, also known as the HiGenome database), approved by the Ethics Committee of the Institutional Review Board (IRB) of CMUH (CMUH107-REC3-058, CMUH110-REC3-005, and CMUH111-REC1-176)17. The current study received approval from the IRB of CMUH (CMUH109-REC1-003).

Identification of T2D cohorts for GWAS and PRS generation

A total of 15,424 T2D cases and 141,484 control subjects were included in the T2D cohort. Cases were individuals aged 20 years or older who had been diagnosed with T2D at least twice according to the International Classification of Diseases, 9th/10th Revision, Clinical Modification (ICD-9-CM codes 250.xx, except 250.x1 and 250.x3; ICD-10-CM codes E11.xx). Control subjects were individuals without any diagnosis of diabetes or prescription for antidiabetic drugs in their EMR.
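
The case definition above can be sketched as follows. This is only an illustration under stated assumptions: diagnosis codes are taken to be plain strings, one per recorded diagnosis, and the function names and input layout are hypothetical rather than taken from the study's pipeline.

```python
import re

# ICD-9-CM diabetes codes have the form 250.xx; the last digit is captured
# so that 250.x1 and 250.x3 can be excluded, as in the case definition.
ICD9_DIABETES = re.compile(r"^250\.\d(\d)$")
# ICD-10-CM type 2 diabetes codes start with E11 (the exact suffix length
# accepted here is an assumption).
ICD10_T2D = re.compile(r"^E11(\.[0-9A-Z]{1,4})?$")


def is_t2d_code(code: str) -> bool:
    """Return True if a diagnosis code counts as a T2D code."""
    code = code.strip().upper()
    match = ICD9_DIABETES.match(code)
    if match:
        return match.group(1) not in {"1", "3"}
    return bool(ICD10_T2D.match(code))


def is_t2d_case(age: float, diagnosis_codes: list[str],
                min_diagnoses: int = 2, min_age: int = 20) -> bool:
    """Aged 20 or older and diagnosed with T2D at least twice."""
    n_t2d = sum(is_t2d_code(code) for code in diagnosis_codes)
    return age >= min_age and n_t2d >= min_diagnoses
```

Control selection (no diabetes diagnosis and no antidiabetic prescription) would need prescription records as well and is not covered by this sketch.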

Genotyping and GWAS

Genomic DNA was extracted from peripheral blood leukocytes of the participants using a commercial kit (Qiagen, Redwood City, CA, USA), as previously reported18. In CMUH-PMP, a TPMv1 customized single-nucleotide polymorphism (SNP) array (Thermo Fisher Scientific, Santa Clara, CA, USA) was used for genotyping. The Affymetrix Axiom genotyping platform was used according to the standard quality control procedures. Genotype data were imputed using Beagle 5.2. Imputed SNPs with an INFO R2 0.3 were kept for further GWAS. Detailed genotyping and imputation procedures have been described earlier19.

The study participants were randomly divided into three groups in a 7:1.5:1.5 ratio: 70% of subjects formed the discovery cohort for GWAS, and 15% each formed the target and validation cohorts. The quality control (QC) process was performed as follows: (1) samples were excluded if they had a call rate < 98%, an extreme heterozygosity rate (> 3 standard deviations [SDs] from the mean), or were duplicates or related individuals (kinship > 0.354); (2) variants were filtered out if they had a missing rate > 2%, deviated from Hardy–Weinberg equilibrium (P < 1 × 10−10), or had a minor allele frequency < 0.01. After the QC process, additive models adjusted for age, sex, and principal components (PCs) 1–10 were fitted using PLINK (version 2.0)20. A Manhattan plot was constructed using the CMplot R package.
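
The variant-level QC thresholds and the 7:1.5:1.5 split described above can be illustrated with a short sketch. It is not the PLINK workflow used in the study; the function names, the fixed random seed, and the rounding of group sizes are assumptions made for the example.

```python
import random


def variant_passes_qc(missing_rate: float, hwe_p: float, maf: float) -> bool:
    """Keep a variant unless it fails any of the three variant-level filters."""
    return (
        missing_rate <= 0.02   # missing rate > 2% is filtered out
        and hwe_p >= 1e-10     # Hardy-Weinberg P < 1e-10 is filtered out
        and maf >= 0.01        # minor allele frequency < 0.01 is filtered out
    )


def split_cohort(sample_ids, seed: int = 0) -> dict:
    """Randomly split samples into discovery/target/validation (70/15/15)."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)   # seed is arbitrary, for reproducibility
    n_discovery = round(len(ids) * 0.70)
    n_target = round(len(ids) * 0.15)
    return {
        "discovery": ids[:n_discovery],
        "target": ids[n_discovery:n_discovery + n_target],
        "validation": ids[n_discovery + n_target:],
    }
```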

Identification of T2D associated SNPs and PRSs

The SNPs identified from the GWAS with P < 0.05 (786,845 SNPs) were included for PRS construction. The PRSs were constructed using the “clumping and thresholding” approach in PRSice-2 (v2.3.5)21. This algorithm iteratively selects SNPs (with a P-value threshold of 0.05) to form clumps around index SNPs. Each clump was composed of SNPs located within 250 kb of the index SNP and in linkage disequilibrium (LD) with it, based on a pairwise threshold of r2 = 0.2. A candidate PRS was computed from the resultant index SNPs, with the estimated β-coefficient of each effect allele used as its weight, via the “score” procedure in PLINK (v2.0)20. A total of 14 SNPs with P < 5 × 10−8 were included in the PRS using PRSice-2 (Table 1).
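
To make the clumping-and-thresholding idea concrete, the sketch below implements a simple greedy version in plain Python. It is not the PRSice-2 or PLINK implementation: the data layout, the `r2` lookup function, and the use of an unscaled weighted sum are assumptions chosen for illustration.

```python
def clump(snps, r2, p_threshold=0.05, window_bp=250_000, r2_threshold=0.2):
    """Greedy clumping.

    snps: dict mapping SNP id -> (chromosome, position, p_value)
    r2:   function (snp_a, snp_b) -> pairwise LD r^2 (hypothetical lookup)
    Returns the index SNPs, most significant first.
    """
    candidates = sorted(
        (s for s in snps if snps[s][2] < p_threshold),
        key=lambda s: snps[s][2],
    )
    clumped, index_snps = set(), []
    for snp in candidates:
        if snp in clumped:
            continue
        index_snps.append(snp)          # most significant remaining SNP
        chrom, pos, _ = snps[snp]
        for other in candidates:
            if other == snp or other in clumped or other in index_snps:
                continue
            o_chrom, o_pos, _ = snps[other]
            in_window = o_chrom == chrom and abs(o_pos - pos) <= window_bp
            if in_window and r2(snp, other) >= r2_threshold:
                clumped.add(other)      # absorbed into this index SNP's clump
    return index_snps


def polygenic_score(dosages, betas):
    """Weighted sum of effect-allele dosages (0, 1, or 2) over index SNPs."""
    return sum(beta * dosages[snp] for snp, beta in betas.items())
```

The resulting raw score would typically be standardized within a cohort before being compared between cases and controls.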

Table 1 Fourteen independent genetic variants related to the type 2 diabetes PRS (P < 5 × 10−8).

External validation database

For external validation, we used the Taiwan Biobank (TWB) database, which contains genetic information for 95,233 individuals. The TWB included individuals aged 30–70 years from the general Taiwanese population who self-reported as being of Taiwanese Han Chinese ancestry and had no history of cancer. For the TWB, specimens and associated data were collected at enrollment and individuals will receive follow-ups every two to four years via questionnaire. Details of the information contained in the TWB database have been reported previously22. Written informed consent was obtained from the TWB participants in accordance with institutional requirements and the principles of the Declaration of Helsinki. The self-reported T2D disease status and laboratory tests, including fasting glucose value and HbA1c, were used to identify the T2D status of individuals, and external validation was performed with 19,139 subjects (851 T2D cases and 18,288 controls).

Phenotype generation and PheWAS

The PheWAS was based on the T2D PRS and data on 1,335 health outcomes obtained from CMUH. We extracted the diagnosis data (ICD-9/10 codes) from the CMUH EMRs and mapped the codes to Phecodes using the R package “PheWAS”23. Logistic regression models were generated to evaluate the association between T2D PRSs and each binary trait. The PheWAS was conducted for phenotypes with more than 50 cases to ensure sufficient statistical power. Individuals were classified as phenotypic cases for a specific disease phenotype if they had at least two encounters corresponding to the designated Phecode diagnosis. Conversely, phenotypic control subjects included individuals who had never been diagnosed with the corresponding Phecode, as well as those who met the exclusion criteria for the Phecode. Unless otherwise stated, both binary and quantitative traits were adjusted for age, sex, and the first ten principal components (PCs) estimated using EIGENSOFT (version 6).
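
The case and control rules for each Phecode can be sketched as below. This is an illustration only, not the R “PheWAS” package: the input layout (one Phecode string per encounter) is assumed, Phecode exclusion ranges are not modeled, and treating a single encounter as unclassified is an assumption, since the text does not state how such individuals were handled.

```python
from collections import Counter


def classify(encounter_phecodes, phecode, min_encounters=2):
    """Return 'case', 'control', or None for one person and one Phecode."""
    n = Counter(encounter_phecodes)[phecode]
    if n >= min_encounters:
        return "case"
    if n == 0:
        return "control"
    return None  # assumption: a single encounter is left unclassified


def eligible_phecodes(patients, min_cases=50, min_encounters=2):
    """Phecodes with more than `min_cases` cases.

    patients: dict mapping patient id -> list of Phecodes, one per encounter
    """
    case_counts = Counter()
    for codes in patients.values():
        for code, n in Counter(codes).items():
            if n >= min_encounters:
                case_counts[code] += 1
    return {code for code, n in case_counts.items() if n > min_cases}
```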

Biological network and pathway analyses

The significance threshold for selecting PRS SNPs for genome-wide biological network analysis was P < 1 × 10–5. Molecular networks and genetic correlations between SNPs and diseases were analyzed using the core analysis module of the Ingenuity Pathway Analysis (IPA) software (Qiagen Sciences, Inc.). The statistical significance of the available networks was assessed using Fisher’s exact test with a significance level of 0.05.
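
As a rough illustration of how a pathway enrichment P value can be obtained from Fisher’s exact test, the sketch below computes a right-tailed hypergeometric probability. It assumes a right-tailed test on a 2 × 2 overlap table; the function name, argument names, and any counts supplied to it are hypothetical and are not values from the IPA analysis.

```python
from math import comb


def fisher_right_tailed(overlap, list_size, pathway_size, universe_size):
    """P(X >= overlap) for X ~ Hypergeometric.

    overlap:       genes in both the study gene list and the pathway
    list_size:     genes in the study gene list
    pathway_size:  genes annotated to the pathway
    universe_size: all genes considered (reference set)
    """
    total = comb(universe_size, list_size)
    upper = min(list_size, pathway_size)
    tail = sum(
        comb(pathway_size, i) * comb(universe_size - pathway_size, list_size - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total
```

A pathway would then be called significant when this probability falls below the chosen level (0.05 in the text).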

Statistical analysis

Continuous variables are presented as means with SDs, and categorical variables as proportions. Comparisons between the two groups were conducted using an independent sample t-test for continuous variables and a chi-square test for categorical variables. Baseline BMI was calculated as baseline weight in kilograms divided by height in meters squared (kg/m2) and categorized into four categories: < 18.5, 18.5 to < 25, 25 to < 30, and ≥ 30 kg/m2. Waist circumference > 80 cm for women and > 90 cm for men was considered abnormal. For blood pressure, a systolic blood pressure > 130 mmHg or diastolic blood pressure > 85 mmHg was considered abnormal.
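
The clinical variable definitions above translate directly into a few small functions. The sketch below follows the stated cut-offs; the function names and the string labels used for sex and BMI categories are assumptions for illustration.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index in kg/m^2."""
    return weight_kg / height_m ** 2


def bmi_category(value: float) -> str:
    """Assign one of the four BMI categories used in the analysis."""
    if value < 18.5:
        return "<18.5"
    if value < 25:
        return "18.5-<25"
    if value < 30:
        return "25-<30"
    return ">=30"


def abnormal_waist(waist_cm: float, sex: str) -> bool:
    """Waist > 80 cm (female) or > 90 cm (male) is abnormal."""
    return waist_cm > (90 if sex == "male" else 80)


def abnormal_blood_pressure(systolic: float, diastolic: float) -> bool:
    """Systolic > 130 mmHg or diastolic > 85 mmHg is abnormal."""
    return systolic > 130 or diastolic > 85
```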

To investigate the association between PRSs and T2D, logistic regression models were used to estimate odds ratios (ORs) and 95% confidence intervals (CIs). The T2D PRS was normalized and analyzed as both a continuous and categorical variable. For categorical analysis, participants were divided into quartiles, with the lowest PRS quartile used as the reference group. Two models were developed to predict the risk of T2D. The clinical model included demographic and clinical variables that were selected based on univariate logistic regression analyses, with variables showing a P value < 0.05 considered for inclusion. The combined model incorporated all variables from the clinical model, along with the T2D PRS, to evaluate the added predictive value of genetic information.
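
The quartile-based comparison can be illustrated without a regression library. The sketch below standardizes a PRS, assigns quartiles, and computes an unadjusted odds ratio with a Wald 95% CI from a 2 × 2 table; this matches an unadjusted logistic regression on a single binary contrast but, unlike the models described above, includes no covariates. All function names are hypothetical.

```python
from math import exp, log, sqrt
from statistics import mean, pstdev, quantiles


def standardize(scores):
    """Z-score normalization of raw PRS values."""
    mu, sd = mean(scores), pstdev(scores)
    return [(s - mu) / sd for s in scores]


def quartile_labels(scores):
    """Label each score 1 (lowest quartile) to 4 (highest quartile)."""
    q1, q2, q3 = quantiles(scores, n=4)
    return [1 + (s > q1) + (s > q2) + (s > q3) for s in scores]


def odds_ratio(cases_q, controls_q, cases_ref, controls_ref):
    """Unadjusted OR and Wald 95% CI for one quartile versus the reference."""
    estimate = (cases_q * controls_ref) / (controls_q * cases_ref)
    se = sqrt(1 / cases_q + 1 / controls_q + 1 / cases_ref + 1 / controls_ref)
    lower = exp(log(estimate) - 1.96 * se)
    upper = exp(log(estimate) + 1.96 * se)
    return estimate, lower, upper
```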

The development of the prediction models followed the framework of the Framingham Heart Study24. Variables with P < 0.05 in the univariate model were entered simultaneously into a multivariate model, and those with P < 0.05 were retained in the final model. Receiver operating characteristic (ROC) curves were generated to evaluate the predictive accuracy of the models, and the areas under the ROC curves (AUCs) were calculated to assess the discriminatory abilities of the models. The DeLong test was applied to determine whether the AUCs of the correlated ROC curves differed significantly.
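
The AUC has a simple interpretation: it is the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen control, with ties counted as one half. The sketch below computes it directly from that definition; it is a quadratic-time illustration rather than the routine used in the study, and it does not implement the DeLong test.

```python
def auc(case_scores, control_scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0      # case ranked above control
            elif case == control:
                wins += 0.5      # ties count as half
    return wins / (len(case_scores) * len(control_scores))
```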

For the PheWAS, only the T2D PRS was used as the primary independent variable. Logistic regression was performed for each phenotype, adjusting for age and sex, to assess its association with the T2D PRS. Bonferroni correction was applied to account for multiple comparisons.
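
The Bonferroni correction divides the significance level by the number of tests, so with 1,335 phenotypes a nominal level of 0.05 corresponds to a per-test threshold of 0.05 / 1,335. A minimal sketch follows; the function name and the dictionary input are assumptions.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return phenotypes whose P value passes the Bonferroni threshold.

    p_values: dict mapping phenotype name -> unadjusted P value
    """
    threshold = alpha / len(p_values)   # one test per phenotype
    return {name: p for name, p in p_values.items() if p < threshold}
```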

All statistical analyses were performed using SAS (v9.4; SAS Institute, Cary, NC, USA), SPSS (v21.0; IBM, Armonk, NY, USA), and R (v3.4.4, R Core Team, 2018). All tests were 2-sided; P < 0.05 was considered statistically significant.

Data and resource availability

The data supporting this study’s findings are available from the CMUH, but restrictions apply to the availability of these data, which are not publicly available.

Results

Study subjects

The study design is illustrated in Fig. 1. A total of 140,277 subjects, including 15,363 patients with T2D and 124,914 control subjects without diabetes, were identified from the EMR. The proportion of men was higher in the T2D group than in the control group (53.9% vs. 43.4%, P < 0.001). The mean (SD) ages of individuals in the T2D and control groups were 60.14 (12.69) years and 48.17 (16.60) years, respectively (P < 0.001). Baseline BMI differed significantly between the groups (mean [SD] BMIs were 26.90 [4.65] and 24.35 [4.33] for the case and control groups, respectively; P < 0.001). Higher glucose, HbA1c, and TG values and lower TC, HDL, and LDL values were observed in the T2D group than in the control group (all P values < 0.001) (Table S1).

Fig. 1 Flowchart of the study design.

GWAS of Taiwanese patients with T2D

To identify T2D-associated SNPs and calculate PRSs, the dataset was divided into three cohorts (discovery, target, and validation) in a ratio of 7:1.5:1.5. Each group was regarded as an independent sample. GWAS was performed in the discovery cohort, which comprised 10,754 cases and 87,440 controls. The Manhattan and Q-Q plots are shown in Fig. S1A and S1B, respectively; the genomic inflation factor (λGC) was 1.078. A total of 202 SNPs surpassed the genome-wide significance threshold of 5 × 10−8 for the Cochran–Armitage trend P-value. After clumping, 14 SNPs remained, including rs1337028053 on Chr5; rs9356744 (CDKAL1, intron) on Chr6, rs1431838560; rs806214 (FSCN3, intron), rs2233580 (PAX4), and rs201976370 (SND1) on Chr7; rs10811661 (CDKN2B-AS1; DMRTA1) on Chr9; rs2283228 and rs2237895 (KCNQ1) on Chr11; rs17676332 (TYRO3), rs76072851 (CORO2B), and rs527631646 (LINC01582; LINC02351) on Chr15; rs551262605 (PSMD3) on Chr17; and rs148612115 (ZNF257) on Chr19. These SNPs were associated with T2D in our study population. The most significant SNP, rs76072851, is located in the intronic region of CORO2B on 15q23 and had a P-value of 8.31 × 10−74. Novel markers for T2D included the intronic SNP rs551262605 (P = 4.50 × 10−8) in PSMD3 on 17q21.1, the intronic SNP rs17676332 (P = 1.34 × 10−12) in TYRO3 on 15q15.1, and the intergenic SNP rs527631646 (P = 1.73 × 10−8) on 15q26.3 (Table 1). Figures S2A–D show the regional plots for rs76072851 in CORO2B as well as the three novel loci for T2D.
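
For readers unfamiliar with λGC, it is commonly computed as the median of the observed 1-degree-of-freedom chi-square statistics divided by the expected median under the null (about 0.4549). The sketch below derives the chi-square statistics from two-sided P values using only the standard library; it illustrates the usual definition and is not the exact routine used to obtain the value of 1.078 reported above.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation(p_values):
    """Genomic inflation factor from two-sided GWAS P values (0 < P <= 1)."""
    normal = NormalDist()
    # A two-sided P value corresponds to |z| = -inv_cdf(P / 2); z^2 is the
    # 1-df chi-square statistic.
    chi2 = [normal.inv_cdf(p / 2) ** 2 for p in p_values]
    return median(chi2) / CHI2_1DF_MEDIAN
```

Values close to 1 indicate little inflation of test statistics from population stratification or cryptic relatedness.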

PRS models for predicting T2D risk

After the GWAS, 786,845 SNPs with a significance level of P < 0.05 were used to generate PRSs in the target cohort, and 30% of the sample data was used to generate T2D PRSs for each individual. Figure 2A and B show that the mean T2D PRSs differed significantly between the two groups in the target cohort (mean [SD] PRSs were −0.3 [0.98] and 0.23 [1.10] for the control and case groups, respectively; P < 0.001) and in the validation cohort (mean [SD] PRSs were −0.3 [0.98] and 0.24 [1.11] for the control and case groups, respectively; P < 0.001). The strata plot demonstrated a positive trend, indicating that higher PRS percentiles were associated with higher ORs for T2D compared with the reference quantile (Fig. 2C and D).

Fig. 2 Distribution of the polygenic risk scores, corresponding strata plot, and the receiver operating characteristic curve of the polygenic risk scores for the prediction of type 2 diabetes in the (A, B, C) target and (D, E, F) validation groups. Clinical variables included T2D_Diag_Age, gender, waist circumference, blood pressure, and diabetic family history.

In the target cohort, the ORs (95% CIs) for T2D associated with older age, male sex, abnormal waist circumference, high blood pressure, a family history of diabetes, and a higher PRS were 1.04 (1.03, 1.05), 1.46 (1.03, 2.06), 3.75 (2.63, 5.34), 1.83 (1.27, 2.63), 3.16 (1.21, 8.25), and 1.37 (1.18, 1.60), respectively (Table 2). T2D PRSs and family history of diabetes were independent risk factors. A T2D risk model including age, sex, waist circumference, blood pressure, family history of diabetes, and PRS was constructed (Supplemental Table S2). The total risk scores were calculated by adding the scores of all factors, and they ranged from 0 to 19. The estimated risks of T2D for each possible sum of points under the model combining clinical and genetic risk factors in the target cohort are displayed in Supplemental Table S3.

Table 2 Odds ratios for type 2 diabetes predictors in the target cohort.

The AUROC values (95% CI) for the T2D prediction model combining the clinical factors and PRS were 0.804 (0.772, 0.835) and 0.842 (0.816, 0.867) for the target and validation cohorts, respectively (Fig. 2). These results indicate that our prediction model had good discrimination ability.

External validation

The external validation set consisted of 19,139 subjects from the TWB database, including 851 T2D cases. Among the 14 SNPs identified in the GWAS of the CMUH-TPMI chip, eight SNPs were also present on the TWB chip and six SNPs were replaced with highly linked SNPs located in the same gene. The associations between these 14 SNPs and T2D in the TWB validation set are shown in Table S4. In the external validation set, the OR (95% CI) for T2D for those with higher PRSs was 1.08 (1.01, 1.16) (Table S5). The AUROC value (95% CI) for the T2D prediction model, which combined clinical factors and the weighted genetic risk score (wGRS), was 0.733 (0.718, 0.748; P < 0.001) (Fig. S3). The external validation set demonstrated that the prediction model had good discrimination and calibration.

T2D polygenic risk score and phenome-wide association study

We evaluated T2D PRSs across 1,335 EMR-derived phenotypes with at least 50 case and control subjects. The PheWAS plot in Fig. 3 displays the −log10 (FDR-corrected) P values for each association test of H0: βPRS = 0. The top 10 phenotypes related to T2D PRSs included T2D with renal manifestations (P = 1.68 × 10−142), diabetic retinopathy (P = 5.99 × 10−66), and T2D with ophthalmic manifestations (P = 3.22 × 10−51) (Table S6). Furthermore, we identified associations between T2D PRSs and other endocrine/metabolic disorders such as hyperlipidemia (P = 5.56 × 10−109), disorders of lipid metabolism (P = 3.28 × 10−99), and hypercholesterolemia (P = 2.02 × 10−58). Additionally, we explored associations between the PRS and traits beyond endocrine/metabolic disorders and observed positive associations between the T2D PRS and abnormal kidney function (P = 4.82 × 10−62) and essential hypertension (P = 1.18 × 10−49). Moreover, the T2D PRS was associated with diseases of the circulatory system, digestive system, skin, sense organs, muscles, and nervous system, as well as neoplasms. Table S7 lists the genetic associations between T2D and the top 10 phenotypes in three disease categories: (1) the circulatory system; (2) the digestive system; and (3) neoplasms. To determine whether the secondary associations of kidney disease, hypertension, and neoplasms with the T2D PRS were due to patients having both T2D and these phenotypes, we reexamined the T2D PRS-phenotype associations after excluding T2D cases. The results were similar (Fig. S4).

Fig. 3 Phenome-wide association study analysis of the correlation between polygenic risk scores and human diseases (SNPs gene loci, denoting a P < 5 × 10–8).

Bioinformatics network analysis of SNPs associated with T2D in PRS models

The significance of 1,566 SNPs was assessed in the context of the entire genome, using a threshold of P < 1 × 10−5. A comprehensive analysis of 82 SNPs within 72 genes associated with T2D was performed using the IPA software. The genes were cross-analyzed with pathways, resulting in the ranking of various biological processes including apoptosis, cellular immune response, cytokine signaling, cancer progression, disease-specific pathways, organismal growth and development, cell cycle regulation, ingenuity toxicity list pathways, nuclear receptor signaling, xenobiotic metabolism, cellular growth, proliferation and development, cellular stress and injury, and biosynthesis (Fig. S5A). Significant associations were also found between FTO, HNF1A, IDE, KCNQ1, KIF11, NTRK1, PAX4, SLC8A1, TCF7L2, TLE4, TYRO3, CDKAL1, CTSS, DPP6, and FOXP1 in the GWAS (Fig. S5B). Furthermore, our findings demonstrated that T2D is characterized by multiple key Ingenuity canonical pathways, including IL-15 production, thyroid cancer signaling, osteoblasts in rheumatoid arthritis signaling pathways, sperm motility, aryl hydrocarbon receptor signaling, HOTAIR regulatory pathway, WNT/β-catenin signaling, protein citrullination, granulocyte adhesion and diapedesis, basal cell carcinoma signaling, and pulmonary healing signaling pathway (Fig. S5C and Table 3).

Table 3 The canonical network analysis of PRS results in type 2 diabetes (P < 1 × 10−5).

Discussion

Existing T2D risk prediction models/nomograms are based on Asian and non-Asian populations25,26,27,28. As more T2D genetic variants have been identified, the genetic information incorporated in the models has increased the accuracy of models. However, there are few nomograms that combine clinical variables and genetic information. In the current study, we used the CMUH cohort database to conduct GWAS and PRS analysis to validate previous findings and develop new risk prediction formulas for clinical use. We conducted a GWAS to screen for T2D-related polymorphisms and validated them. Based on the associated SNPs identified, we established a nomogram model based on 14 T2D-associated SNPs and five clinical variables, including age, sex, waist circumference, blood pressure, and family history to predict the risk of T2D. This model has promise for aiding early identification and prevention of T2D.

The GWAS results revealed a total of 14 independent gene variants linked to T2D (P < 5 × 10−8) (Table 1). These SNP loci span 13 genes: CDKAL1, FSCN3, PAX4, SND1, CDKN2B-AS1, DMRTA1, KCNQ1, TYRO3, CORO2B, LINC01582, LINC02351, PSMD3, and ZNF257. Among them, CDKAL1, FSCN3, PAX4, SND1, DMRTA1, KCNQ1, CORO2B, and ZNF257 have been previously reported to be associated with T2D. TYRO3, PSMD3, LINC01582, and LINC02351 were newly identified in our study as associated with T2D. While TYRO3 and PSMD3 have been previously linked to BMI or metabolic syndrome29,30, their association with T2D has not been reported before, to the best of our knowledge. The relevant associations of these gene variants with T2D and related metabolic traits, such as HbA1c, insulin, and glucose levels across different ethnic populations, are summarized in Table S831,32,33,34. TYRO3 encodes a receptor tyrosine kinase of the TAM family and is highly expressed in podocytes, specialized epithelial cells essential for glomerular filtration. Studies have demonstrated that TYRO3 activation in podocytes protects against apoptosis and reduces albuminuria, suggesting a role in maintaining glomerular integrity under diabetic conditions35,36,37. These findings support the hypothesis that TYRO3 variants may influence T2D risk through mechanisms involving immune regulation, inflammation, and renal microvascular integrity. PSMD3 encodes a non-ATPase regulatory subunit of the 26S proteasome, which plays a critical role in ubiquitin-mediated protein degradation and signaling regulation. Genetic variants in PSMD3 have been associated with insulin resistance, as measured by HOMA-IR, particularly in the context of dietary fat and carbohydrate intake38. PSMD3 may influence T2D susceptibility through pathways related to proteostasis, metabolic inflammation, and endocrine signaling. Additional research is warranted to explore the potential roles of TYRO3 and PSMD3 in T2D pathogenesis, including their involvement in gene regulation, metabolic processes, and diabetic complications.

While numerous PRS models for T2D have been developed, including more than 45 studies involving East Asian populations according to the GWAS Catalog, most are derived from specific Asian or trans-ancestry approaches39,40,41,42. Some recent efforts have focused on improving cross-population applicability or on classifying mechanistic subtypes using multi-ancestry data. However, these models may overlook signals specific to subpopulations such as Taiwanese individuals. Moreover, recent studies have shown that PRS performance can vary depending on the ancestry of the discovery cohort, highlighting the need for population-specific models. To assess the added value of population-specific PRSs, we compared our 14-SNP model with a 287-SNP PRS from prior studies and a 4-SNP model from uniquely Taiwanese variants. All three models, when combined with clinical variables, showed similar predictive performance, suggesting that a concise, locally tailored PRS can match larger models while improving interpretability (Fig. S6).

Currently, there are limited studies utilizing PheWAS to explore SNPs and PRSs in the context of T2D. In our previous work using TWB data, we found that the T2D PRS was inversely associated with age at diagnosis (β = − 0.91 years) and linked to diabetic retinopathy and foot complications43. A PheWAS by Lee et al. identified pleiotropic loci related to glycemic and metabolic traits among 10 diseases and 34 quantitative traits using SNP-level associations44. Expanding on these efforts, we applied the T2D PRS to 1,335 EMR-derived phenotypes and observed strong associations with diabetic complications and metabolic comorbidities, including renal manifestations (P = 1.68 × 10–142), lipid disorders, and hypertension. These associations have been corroborated in previous research on T2D45,46,47. By leveraging real-world EMR data, this analysis offers enhanced clinical relevance and phenotypic resolution. The use of polygenic risk scores enables quantification of cumulative genetic burden and its relationship to a broad spectrum of disease phenotypes in routine healthcare settings. These findings underscore the pleiotropic nature of T2D genetic risk and highlight the potential of PRS as a tool for early risk stratification. Key associations are summarized in Fig. S4 and Table S6.

We conducted bioinformatics network analysis on high-risk gene loci (P < 5 × 10−8) in the PRS model. The results indicated that numerous signaling pathways are involved in the pathogenesis of T2D, including IL-15 production, aryl hydrocarbon receptor (AHR) signaling, the HOTAIR regulatory pathway, and WNT/β-catenin signaling (Table 3 and Fig. S5C). IL-15, a pro-inflammatory cytokine, plays a role in lymphocyte development, survival, proliferation, and activation48. Our results identified genes involved in the IL-15 production pathway, such as EPHA6, NTRK1, PEAK1, and TYRO3, as associated with T2D (Table 3). T2D myotubes produce higher levels of IL-15 than non-diabetic myotubes49. Ye et al.50 suggested that IL-15 may increase energy expenditure, thereby protecting against obesity and T2D. To our knowledge, our study is the first to demonstrate a significant association between the IL-15 production pathway and T2D. However, the regulatory role of IL-15 in T2D needs further investigation.

The AHR is a ligand-activated transcription factor, whose transcriptional activity is regulated by small compounds. AHR has been shown to be involved in energy homeostasis and inflammation51. Epidemiological analyses suggest that exposure to xenobiotic AHR ligands contributes to the development of obesity and T2D. Furthermore, increased levels of IL-22 and high frequencies of Th22 cells are associated with insulin resistance in patients with T2D. In patients with T2D, elevated plasma concentrations of IL-22 and IL-17 are associated with increased AHR transcription51. Our findings revealed that, in patients with T2D, the genes involved in AHR signaling include MED1, NFIA, and SMARCA4. Thus, the role of AHR signaling in the regulation of T2D is a new direction for future research.

Homeobox transcript antisense RNA (HOTAIR), an oncogenic long non-coding RNA (lncRNA), may act as an epigenetic hub that drives T2D pathogenesis52,53. By recruiting the PRC2 complex, HOTAIR can deposit H3K27me3 marks on downstream loci, thereby suppressing HNF1A and modulating TCF7L2, two master regulators of β-cell insulin synthesis and glucose homeostasis. Concurrently, HOTAIR-mediated activation of MMP3 promotes extracellular-matrix remodeling, a precursor to diabetic micro-vascular complications52. These coordinated effects provide a plausible mechanistic link between elevated HOTAIR activity and both the initiation and progression of T2D.

The correlation between WNT/β-catenin signaling and T2D is well-established54. The gene TCF7L2, which is related to the Wnt signaling pathway, has been identified as a risk factor for T2D through GWAS55,56. Two TCF7L2 SNPs, rs12255372 and rs7903146, which were also significantly associated with T2D in our population (P = 2.74 × 10−7 and 7.35 × 10−3, respectively), have been shown to exhibit strong correlations with T2D risk57,58. Moreover, certain SNPs of HNF1A increase susceptibility to T2D59. Our results corroborate earlier findings that genes involved in WNT/β-catenin signaling, including HNF1A, TCF7L2, and TLE4, are associated with T2D.

Our study has some strengths and limitations. We utilized a large EMR database with longitudinal information on various diseases and conducted a large-scale GWAS using a Taiwanese population. This provided a valuable opportunity to understand the correlation between T2D-related genes and various diseases. For external validation, we used the TWB database, which contains data from the general population, helping to reduce selection bias. However, the PRS construction approach used in this study was based on the “clumping and thresholding” method, which may not fully account for the effects of LD. Alternative analytical methods such as PRS-CS43 or machine learning approaches60 could potentially identify additional significant SNPs and enhance the models’ accuracy. Additionally, we developed a risk score model that may be useful for clinical screening in Taiwanese populations. However, the accuracy of this model requires validation using an independent database. The generalizability of our study is limited, as our study population included only Taiwanese individuals. Due to genetic variations and clinical differences across ethnicities, future studies should incorporate more genetic and environmental variables related to T2D. A more comprehensive nomogram model should also be established to predict T2D risk, and its diagnostic efficacy should be verified in different populations. Moreover, while the sample size of this study is large, particularly in the context of PheWAS, only phenotypes with sufficient occurrences were analyzed. This may lead to a lack of power for subgroup analyses or for specific phenotypes. Additionally, the causative relationships among these observed associations could be explored in future research.

Conclusion

Our predictive risk model based on key genetic and clinical factors can effectively distinguish individuals with different risk levels of T2D. Individualized risk estimation for T2D is important to help inform decision-making when considering measures for primary prevention of T2D.