Introduction

Transcription factors (TFs) play a pivotal role in mediating the effects of genetic risk variants by regulating gene expression in a context-specific manner, thereby influencing disease risk1,2. A better understanding of the regulatory mechanisms by which TFs mediate these genetic variants is critical for elucidating the molecular basis of complex traits, including cancer. Previous studies have identified susceptible TFs through enrichment analyses of genome-wide association studies (GWAS)-identified risk variants in TF binding sites through chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq)3,4,5,6. However, these approaches often suffer from suboptimal statistical power, as they focus primarily on a narrow range of GWAS-identified risk variants. Recently, we developed a statistical method using a generalized linear mixed model (GLMM) to enhance power by analyzing TF ChIP-seq data in conjunction with GWAS summary statistics of variants across the genome7. This method evaluated associations between variants’ risk (e.g., Chi-squared values from GWAS) and their binding status by a TF, and has identified known susceptible TFs, such as FOXA1, ESR1 and GATA3 for breast cancer, by showing significantly higher risk for variants occupied by these TFs compared to genome-wide controls7. However, the specific TFs involved in colorectal cancer (CRC) susceptibility remain unclear. Thus, identifying TFs that interact with CRC risk variants could provide insights into TF-mediated gene regulation and enhance our understanding of CRC susceptibility.

GWASs have successfully identified more than 250 risk loci for CRC1,2,8, and integrative analyses of expression quantitative trait loci (eQTL) and these GWAS loci have revealed numerous candidate susceptibility genes9,10,11,12. Further transcriptome-wide association studies (TWAS) have linked predicted gene expression levels with disease risk, offering higher statistical power and potentially revealing novel associations overlooked in GWAS13,14. Prior research, including our own, has conducted TWAS in CRC and identified more than 200 putative susceptibility genes8,15,16. However, conventional TWAS methods are prone to false positives due to the inclusion of non-regulatory variants17,18,19. To address this, we developed the sTF-TWAS approach, which refines gene expression prediction by focusing only on regulatory variants within TF-occupied cis-regulatory elements (STFCREs)7,20. This approach significantly enhances disease gene detection, outperforming conventional TWAS methods20. In addition, our recent alternative polyadenylation (APA)-WAS framework has uncovered new cancer risk genes21, while its application to CRC has yet to be explored. Moreover, while TWAS has primarily been conducted in European populations, CRC studies in East Asian populations are limited due to a lack of relevant RNA-seq data from colon tissue.

In this study, we integrate 218 ChIP-seq datasets from CRC-related cells with GWAS data from 100,204 CRC cases and 154,587 controls of both European and East Asian ancestries to identify susceptible TFs for CRC. We further utilize RNA-seq data from 364 Asian-ancestry and 707 European-ancestry individuals to build ancestry-specific models for gene expression, alternative splicing (AS), and APA under the sTF-TWAS framework. Our goal is to identify novel CRC susceptibility genes through multi-ancestry TWAS analyses, providing a deeper understanding of the genetic basis of CRC.

Results

Identification of susceptible CRC-associated TFs and key TF-cofactor interactions

We integrated 218 ChIP-seq datasets for 84 TFs (Supplementary Data 1) with CRC GWAS summary statistics from European and East Asian ancestry populations to identify CRC-susceptible TFs (Fig. 1A). At Bonferroni-corrected P < 0.05, we identified 51 TFs associated with CRC susceptibility. Among the top TFs were VDR, MYC, JUN, CDX2, NIPBL, MED12, and ETV5 (Fig. 1B and Supplementary Data 2, “Methods”). We conducted motif analysis for each TF-binding region derived from ChIP-seq data and observed significant enrichment for host motifs, including those for VDR (24.2%), JUN (11.1%), KLF5 (49.7%), and HNF4A (36.4%) (Fig. 1C). We further explored the impact of genetic variations in co-occupied cis-regulatory regions of these 51 TFs and observed a substantial proportion of genetic variants located in co-occupied binding sites (Fig. 1D and Supplementary Data 3). Pair-wise analyses identified 154 significant interactions at Bonferroni-corrected P < 0.05, showing that genetic variants in regions co-occupied by multiple TFs exhibited stronger associations with CRC risk than variants bound by a single TF (Fig. 1D and Supplementary Data 3). Notably, VDR emerged as a key player in these interactions, highlighting its central role in CRC susceptibility (Fig. 1D and Supplementary Data 3). In addition, we applied stratified LD Score Regression (S-LDSC)22 as an alternative approach and found that 40 of the 51 risk TFs identified by our method were also detected by S-LDSC at a Bonferroni-corrected P < 0.05. Most of the remaining 11 TFs showed nominal significance (P < 0.05) (Supplementary Fig. 1A), providing further support for the validity of our findings. We also conducted simulation analyses to compare the statistical power of our GLMM with that of S-LDSC. GLMM consistently outperformed S-LDSC in detecting causal TFs across varying levels of heritability (Supplementary Fig. 1B). Quantile-quantile (Q-Q) plots show that Type I error rates are under control for both approaches, with overall comparable performance (Supplementary Fig. 1C–F). These simulation results are consistent with our observations in real data, supporting GLMM as a robust approach for identifying risk TFs in colorectal cancer.
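The Bonferroni threshold for the pairwise analysis follows directly from the number of TF pairs that can be formed from the 51 significant TFs. The short sketch below reproduces that arithmetic (0.05/1,275, as stated in the Fig. 1 legend); the variable names are illustrative only.

```python
from math import comb

n_tfs = 51                    # TFs significant in the single-TF analysis
n_pairs = comb(n_tfs, 2)      # number of unordered TF pairs
alpha = 0.05                  # family-wise error rate
threshold = alpha / n_pairs   # Bonferroni-corrected per-pair threshold

print(n_pairs)                # 1275
print(f"{threshold:.2e}")     # 3.92e-05
```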

Fig. 1: Associations between TFs with CRC risk using generalized linear mixed models.

A A flow chart to illustrate the integrative analysis of ChIP-seq data (n = 218) for 84 TFs and CRC GWAS summary statistics from 100,204 cases and 154,587 controls of European and East Asian ancestry. B A total of 51 identified TFs with genetic variation of TF-DNA bindings significantly associated with CRC risk. P-values were determined by a two-sided Wald Z test. The dashed line represents a Bonferroni-corrected P < 0.05. C The host motifs of identified TFs were enriched in their ChIP-seq peaks. D Analysis of co-occupied binding regions of the top 10 CRC risk-associated TFs. Venn diagrams in the upper-right triangle show the number of genetic variants (multiplied by 1000) that are occupied by specific TFs or co-occupied by two TFs in each TF pair. Bar plots in the lower-left triangle show the association strengths (regression coefficients) for the genetic variants occupied by both TFs, only the first TF, and only the second TF, respectively, as indicated from left to right. TF pairs with significant interactions at the Bonferroni threshold of P < 3.92 × 10⁻⁵ (0.05/1,275 TF pairs from 51 TFs) are highlighted in red. P-values were determined by a two-sided Wald Z test.

Associations of genetically predicted gene expression with CRC risk

For the multi-ancestry TWAS under the sTF-TWAS framework, we built gene expression prediction models using RNA-seq data from 364 participants of East Asian ancestry (from the ACCC) and 707 participants of European ancestry (N = 423 from the BarcUVa-seq and N = 284 from the GTEx projects) (“Methods”)16. A total of 7096 (ACCC), 14,419 (BarcUVa-seq), and 11,361 (GTEx) gene expression prediction models were built with a coefficient of determination R2 > 0.01 (10% correlation) using the elastic net approach (Supplementary Data 4). We then applied these ancestry-specific prediction models to summary statistics of CRC GWAS from 80,774 cases and 105,298 controls of European ancestry and from 23,572 cases and 48,700 controls of East Asian ancestry. At a Bonferroni-corrected P < 0.05, we identified 138 genes whose genetically predicted expression was associated with CRC risk in a meta-analysis of TWAS results based on models built in both European and East Asian populations (Supplementary Fig. 2, “Methods”). In addition, we evaluated the associations separately in European and East Asian populations. At the Bonferroni-corrected threshold, we identified two additional genes (SENP8 and PCNT) in the European population, while no additional significant genes were identified in the East Asian population (Supplementary Data 5).

Together, 140 genes showed an association between their genetically predicted expression and CRC risk (Fig. 2A). Of them, 41 genes (29% of 140) were not previously reported to be associated with CRC risk1,2,8,10,15,16 (Table 1). Of note, seven putative susceptibility genes (PPAN, DERL3, PROX1, TMEM87B, SV2A, FBXL20, and MED1) are located at loci more than 2 Mb away from the closest GWAS-identified risk variants (Table 1). To assess whether the genes identified were independent of the established GWAS association signals, we conducted conditional analyses for their associations with CRC risk, adjusting for the closest lead SNP at each locus (“Methods”). Of the 99 previously reported genes located in GWAS-identified risk loci, conditional analysis showed that 39 genes remained statistically significant at a nominal P < 0.05 (Supplementary Data 6).

Fig. 2: Manhattan plots of the association results from TWAS, AS-WAS and APA-WAS.

A The meta-analysis result from the associations among European ancestry and East Asian ancestry participants. P-values were calculated from a two-sided Wald Z test conducted by sTF-TWAS. The blue line indicates a Bonferroni-threshold of nominal P < 5.8 × 10−6 for TWAS (A), and P < 1.95 × 10−6 for AS-WAS (B), and P < 8.25 × 10−6 for APA-WAS (C), respectively. The novel identified risk genes from this study are highlighted in green. Four genes showed association signals with P-values exceeding the y-axis scale: POU5F1B (P = 1.09 × 10⁻⁶⁴) in (A); COLCA2 (P = 5.49 × 10⁻⁴⁵) in (B); LAMC1 (P = 9.57 × 10⁻²⁴) and DIP2B (P = 1.25 × 10⁻²²) in (C).

Table 1 TWAS novel genes presented in novel and known GWAS loci

Additional risk genes identified by AS-WAS and APA-WAS

We built prediction models for AS and APA using the aforementioned RNA-seq data in colon tissues, generating 9680 AS models for ACCC, 10,025 for BarcUVa-seq, and 20,240 for GTEx, at prediction performance R2 > 0.01 (“Methods”, Supplementary Data 7). In a meta-analysis of AS-WAS results from these models, we identified 82 risk genes at a Bonferroni-corrected P < 0.05 (Fig. 2B). We identified three additional genes, POLD3, PARP11, and CEP89, specifically in the European population (Supplementary Data 8). In total, we identified 85 genes after combining the results from the meta-analysis and population-specific analysis. For APA, a total of 2569 (BarcUVa-seq), 4399 (GTEx), and 2744 (ACCC) prediction models were built successfully with R2 > 0.01 (Supplementary Data 9). In a meta-analysis of APA-WAS results from these models, we identified 29 risk genes at a Bonferroni-corrected P < 0.05 (Fig. 2C). Similarly, we identified four additional genes (ARPC5, PHLDB2, COA4, NEIL1) specifically in the European population (Supplementary Data 10). In total, we identified 33 genes after combining the results from the meta-analysis and population-specific analysis.

In total, 112 genes were identified for CRC risk by combining the results of AS-WAS and APA-WAS. Of them, 58 were not previously reported for CRC risk (Tables 2 and 3). Among the remaining 54 previously reported genes, conditional analysis showed that 24 genes remained statistically significant at a nominal P < 0.05 (Supplementary Data 11).

Table 2 AS-WAS novel genes presented in novel and known GWAS loci
Table 3 APA-WAS novel genes presented in novel and known GWAS loci

Risk genes identified by combining TWAS, AS-WAS, and APA-WAS analyses

By integrating results from TWAS, AS-WAS, and APA-WAS, we identified 222 unique CRC risk genes (Fig. 2). Among them, four genes (COLCA2, METRNL, ARPC5, COX14) were identified by all three analyses and 28 genes were detected by any two of the analyses (Supplementary Fig. 3). Compared with genes reported in previous CRC TWAS and other genetic studies1,2,8,10,15,16, 95 of the 222 genes had not been previously identified as CRC susceptibility genes, including 13 genes located in new loci (Tables 1–3).

Using scRNA-seq data to explore risk genes potentially contributing to CRC development

We next analyzed scRNA-seq datasets spanning CRC developmental stages, including normal colon, serrated polyp, conventional adenoma, and carcinoma tissues from the COLON MAP23. Our scRNA-seq data analysis identified several major cell types (e.g., ABS, GOB, STM) in normal tissues, SSC in serrated polyps, and ASC in conventional adenomas, consistent with previous studies23 (Supplementary Fig. 4). Differential analysis of these major cell types was performed using DESeq224 to evaluate their contribution to two CRC etiology pathways: the normal-serrated polyp-carcinoma pathway and the normal-adenoma-carcinoma pathway (“Methods”). At an FDR-adjusted P < 0.05, 48 genes exhibited significant differential expression in at least one of these major cell types, in a direction consistent with the TWAS associations, in the normal-serrated polyp-carcinoma pathway (Supplementary Data 12), while 92 genes showed significant differential expression in the normal-adenoma-carcinoma pathway (Supplementary Data 13). Combining these results, we provide additional evidence that 99 genes (44.6% of the 222 identified risk genes) are potentially involved in CRC development.

Druggable genes for CRC prevention and intervention

We comprehensively annotated the 222 identified risk genes as therapeutic targets for existing drugs, including those either approved or in clinical development (“Methods”). Our analysis identified nine CRC susceptibility genes that are druggable by 125 drugs approved or in clinical trials for CRC treatment (Fig. 3 and Supplementary Data 14). In addition, we identified 26 genes targeted by 519 drugs approved or in clinical trials for the treatment of other cancers (Supplementary Data 15). Furthermore, 40 genes were identified as targets of 585 drugs approved or in clinical trials for non-cancer indications (Supplementary Data 16). In total, 48 of the 222 CRC susceptibility genes (21.6%) were found to be druggable, with potential links to 1216 drugs, either approved or under investigation.

Fig. 3: A circular plot showing nine druggable risk genes potentially targeted by 125 drugs either approved or undergoing clinical trials for treatment of CRC.

The innermost to outermost layers represent CRC, the identified druggable risk genes, and therapeutic drugs - approved (green) or currently under clinical trials (Phase 3: yellow, Phase 2: red).

Functional assays for three putative oncogenes, RHPN2, IRS2, and TXN

To functionally examine the CRC susceptibility genes, we performed CRISPR/Cas9-mediated gene knockout. Three risk genes, one previously reported gene (RHPN2) and two newly identified risk genes (IRS2 and TXN), were selected for downstream experiments based on evidence of their potential oncogenic functions and a literature review (“Methods”). This selection was supported by the association of their predicted elevated expression with increased CRC risk, as well as by differential analysis of single-cell data highlighting their oncogenic roles. For each gene, we employed two independent sgRNAs targeting the protein-coding region (Supplementary Data 17). We used western blotting to confirm target protein reduction, and CCK-8 proliferation and colony formation assays (Fig. 4) to assess gene function. TXN protein levels were reduced by 65–98% in knockout cells relative to control cells (sgRNA targeting LacZ) in the CRC cell lines (RKO, HCT116, and SW480) (Supplementary Fig. 5). TXN knockout significantly inhibited cell proliferation and colony formation in all three CRC cell lines (Fig. 4A–C). IRS2 protein levels were reduced by 66–86% in knockout cells relative to control cells in the three CRC cell lines (Supplementary Fig. 5). For IRS2, only colony formation was significantly inhibited (Fig. 4E, F), suggesting a long-term effect of IRS2 knockout. RHPN2 protein levels were reduced by 34–61% in knockout cells relative to control cells (Supplementary Fig. 5). However, RHPN2 knockout significantly inhibited cell proliferation and colony formation only in RKO cells, with a trend toward inhibition in HCT116 cells (Fig. 4G–I), and had a minimal effect in SW480 cells. The variable extents of RHPN2 dependency in cell proliferation and growth were matched to the consistent protein expression levels in these cells. Our experimental verification demonstrated the potential oncogenic functions of TXN, IRS2 and RHPN2, aligning with our TWAS analysis.

Fig. 4: The effects on cell viability and colony formation for selected genes, TXN, IRS2, and RHPN2 in three CRC cell lines (HCT116, RKO, SW480).

CRC cell lines with knockdown by two sgRNAs for each gene were compared to CRC cell lines with a non-targeting sgRNA control. Cell viability was measured with the CCK-8 assay for TXN (A), IRS2 (D), and RHPN2 (G). Colony formation assays were performed with cells stained by crystal violet and were further quantified for TXN (B and C), IRS2 (E and F), and RHPN2 (H and I). The independent experiments were performed in triplicate. P-values were determined by a two-sided t test comparing knockdown and control cells. “*”, P < 0.05; “**”, P < 0.01; “***”, P < 0.001. The height of each bar represents the mean value across replicates, and the error bar indicates the standard error of the mean.

Discussion

Key TFs are transcribed in a cell type-specific manner and co-occupy cis-regulatory elements that control gene expression programs to establish and maintain cell identity25,26,27,28,29. Identifying these TFs, whose DNA-binding activities are altered by risk-associated genetic variations, and understanding the genes they regulate, is crucial for advancing our knowledge of transcriptional dysregulation in human cancers30,31,32,33. In this study, we identified 51 TFs associated with CRC susceptibility, advancing our understanding of key TF–gene regulatory networks underlying CRC etiology. By incorporating prior knowledge on TF-binding sites for CRC into the sTF-TWAS framework, we demonstrated that our approach significantly improves predictive model construction and facilitates the discovery of risk genes. Compared to traditional TWAS analysis in our previous study16, the sTF-TWAS method improved the prediction of gene expression and AS, increasing the number of predicted genes by approximately 1500 and 2000, respectively. Moreover, sTF-TWAS identified 94 additional genes beyond those detected by the traditional TWAS and AS-WAS. For the APA analysis, in contrast to our previous study, which relied on GTEx data21, we additionally integrated RNA-seq data from ACCC and BarcUVa-seq and identified 30 novel risk genes. Notably, the sTF-TWAS analysis, using these diverse prediction models, complements and strengthens risk gene discovery. Among the 222 identified risk genes, only four genes (COLCA2, METRNL, ARPC5, COX14) were detected by TWAS, AS-WAS, and APA-WAS. This observation aligns with prior research showing that different molecular traits within the same gene can represent largely independent signals21,34. For these four commonly detected genes, we observed very few variants shared across the molecular traits, indicating that genetic heritability may contribute independently to each molecular trait, while a small proportion of shared risk variants may influence expression, splicing, and APA through interconnected mechanisms under multi-layered regulatory control.

Our sTF-TWAS analysis highlighted risk TF-based transcriptional networks underlying genetic susceptibility to CRC. By examining TF occupancy information for the genetic variants included in the prediction models, we found that 117 of the 222 putative susceptibility genes (52.7%) are commonly regulated by the top five TFs (RAD21, CBX3, MAX, SP1, and POLR2A). In addition, we conducted a new TF enrichment analysis on these 222 genes using publicly available TF ChIP-seq data from Enrichr35,36. This analysis identified four CRC-associated TFs (VDR, EP300, MYC, and SOX2) significantly enriched at FDR < 0.05, and an additional set of 17 TFs (Supplementary Data 18) enriched at nominal P < 0.05. Of note, among the 222 risk genes, several genes that encode TFs, such as MYC, PROX1, SOX9, KLF5, and RAD21, were themselves identified as significant. Most CRC-associated TFs were not detected by TWAS, likely because TWAS captures genetic heritability primarily from cis-variants. However, our GLMM analysis using genome-wide variants provided evidence of their susceptibility. Taken together, these findings underscore the role of TFs and TF-based transcriptional networks in CRC genetic susceptibility. Furthermore, in our prediction models, we identified 144 lead variants (those with the strongest association with CRC risk in each model) at P < 5 × 10⁻⁸. These variants are predicted to influence the binding affinity of CRC-associated TFs and are linked to 162 risk genes (Supplementary Data 19). These findings offer valuable insights for prioritizing functional studies aimed at fine-mapping causal variants and genes for CRC through in vitro and in vivo assays. Our integration of scRNA-seq data analyses enhances causal gene discovery and improves our understanding of the cellular processes driving CRC. We acknowledge that some genes identified through differential expression analysis across developmental stages may not reflect true causal associations, particularly given the high correlation among many gene expression levels. In the trans-ancestry TWAS analysis, meta-analysis improved statistical power, enabling robust gene prediction in both East Asian and European populations. Notably, 37 CRC risk genes identified through meta-analysis exhibited consistent associations with CRC risk (P < 0.05) across both populations, providing strong evidence for these genes as potential causal factors for CRC (Supplementary Data 5, 8, 10).

Our identification of risk genes has significantly advanced our understanding of CRC biology and etiology. Among the 222 identified genes, nine are known CRC driver genes (CDH1, ERBB2, KLF5, SOX9, SRSF6, TBX3, TET2, TRPS1, and ZNRF3), highlighting their genetic contributions to CRC development37,38,39,40. We also uncovered seven previously unreported CRC risk genes that map to six established pathways crucial for CRC development41,42,43,44. Specifically, PPP2R1B regulates both the Wnt/β-catenin and TGF-β signaling pathways, while RICTOR modulates the PI3K-AKT pathway, a key driver of cell proliferation and survival. RAD21 is involved in both the Cell Cycle and Mismatch Repair pathways, reinforcing its role in genomic maintenance. NFATC2 supports MAPK signaling and DNA mismatch repair, while ARRB1 and MAPK13 further emphasize the significance of MAPK signaling. RTEL1 strengthens the Mismatch Repair pathway. In addition to these established oncogenic pathways, we provided strong evidence for three oncogenes that advance the understanding of CRC susceptibility. RHPN2, a Rho GTPase-binding protein that regulates the cell cytoskeleton, may contribute to the malignant growth of cancer cells45. IRS2 has been shown to activate the oncogenic PI3 kinase pathway and to increase the adhesion of CRC cell lines (a signal of invasive cells)46. Moreover, over-expression of IRS2 is also observed in many other cancers47,48, and its amplification has been proposed as a biomarker for inhibitor-based cancer therapy49. TXN, a redox protein, may regulate antioxidant reactions that promote cancer cell metabolism and survival50,51. In our prior work, we validated the functionality of another five reported genes, DIP2B, SFMBT1, TRPS1, METRNL, and C14orf166, using siRNA/shRNA-mediated gene knockdown cell lines in vitro10,16. Furthermore, public CRISPR-Cas9 data from the Cancer Dependency Map (DepMap) provided additional evidence for the essential roles of 21 risk genes in colorectal adenocarcinoma cell proliferation (Supplementary Data 20). Together, these discoveries expand the repertoire of CRC susceptibility genes and reveal previously unexplored molecular mechanisms underlying CRC development.

Through the analysis of drug-protein interaction databases52,53,54,55, we identified nine druggable proteins targeted by drugs that are either approved or in clinical trials for CRC treatment. Notably, six of the corresponding genes (DDR1, SIK2, EHMT2, MAPK13, PKM and PARP11) are novel CRC risk genes reported in this study. Among these six novel genes, PKM2, a member of the PKM family, has been reported to mediate VEGF-induced angiogenesis in human endothelial cells. Resveratrol inhibits this process by suppressing aerobic glycolysis, potentially through modulation of PKM2 nuclear translocation56. PARP11, implicated in DNA repair, can be inhibited by approved PARP inhibitors, including Olaparib, Niraparib, and Talazoparib57. However, these observations are exploratory and require further validation. Overall, our protein–drug mapping expands the catalog of druggable genes and candidate therapeutics and supports the development of precision medicine strategies for CRC prevention and intervention.

Methods

Study populations

This study utilized a meta-analysis summary statistics dataset of 31 GWAS comprised of 186,072 individuals of European ancestry (80,774 CRC cases and 105,298 controls) and 72,272 individuals of Asian ancestry (23,572 CRC cases and 48,700 controls). The meta-analysis was conducted using the fixed-effects inverse variance weighted model implemented in METAL58. Details on sample selection and matching, sample numbers, and demographic characteristics of study participants have been described previously59,60,61,62. All participants provided written informed consent, and all study protocols were approved by the relevant research ethics committee or institutional review board.
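As an illustration of the fixed-effects inverse variance weighted scheme named above, the sketch below combines per-study effect estimates for a single variant. It is a minimal sketch of the standard textbook formula, not the METAL implementation itself; the function name and the input values are hypothetical.

```python
from math import sqrt

def ivw_fixed_effects(betas, ses):
    """Combine per-study effects using inverse-variance weights (fixed effects)."""
    weights = [1.0 / se ** 2 for se in ses]          # w_k = 1 / SE_k^2
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = sqrt(1.0 / total)                            # SE of the pooled estimate
    return beta, se, beta / se                        # pooled beta, SE, Z score

# Hypothetical log-odds ratios and standard errors from two studies.
beta, se, z = ivw_fixed_effects([0.10, 0.06], [0.02, 0.04])
print(round(beta, 4), round(se, 4), round(z, 2))
```

The study with the smaller standard error receives the larger weight, so the pooled estimate lies closer to its effect size.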

We included RNA-seq data from tumor-adjacent normal colon tissues and genotyping data of 364 individuals of East Asian ancestry who participated in the Asia Colorectal Cancer Consortium (ACCC). We also included data from individuals of European ancestry, including RNA-seq in normal colon tissues and genotyping data of 423 individuals in the BarcUVa-Seq project and 284 individuals in the GTEx project. The sample collection for these data has been described in detail in previous studies1,63,64.

Generalized linear mixed models

We systematically analyzed TF ChIP-seq data generated in normal colorectal epithelium and CRC cell lines from the Cistrome database65. Only ChIP-seq datasets meeting Cistrome’s recommended quality control standards were included. After evaluation, we curated 218 ChIP-seq datasets corresponding to 84 TFs. To ensure robust analysis, we focused on datasets with more than 1000 peaks and a fold change greater than 10. For TFs with multiple ChIP-seq datasets, only the dataset with the highest number of peaks was retained. This process resulted in a high-quality set of ChIP-seq data for 69 TFs, which were used in downstream analyses.
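The dataset selection rule described above can be sketched as follows. This is an illustration under stated assumptions: the record fields (`tf`, `n_peaks`, `fold_change`) and the example values are hypothetical, and the fold-change criterion is treated as a single per-dataset value, which the text does not specify.

```python
def select_datasets(datasets, min_peaks=1000, min_fold=10):
    """Keep, per TF, the passing dataset with the most peaks."""
    best = {}
    for d in datasets:
        # Require more than 1000 peaks and a fold change greater than 10.
        if d["n_peaks"] <= min_peaks or d["fold_change"] <= min_fold:
            continue
        current = best.get(d["tf"])
        if current is None or d["n_peaks"] > current["n_peaks"]:
            best[d["tf"]] = d
    return best

# Hypothetical records; identifiers and values are made up for illustration.
datasets = [
    {"id": "ds1", "tf": "VDR", "n_peaks": 5200, "fold_change": 14.0},
    {"id": "ds2", "tf": "VDR", "n_peaks": 8100, "fold_change": 12.5},
    {"id": "ds3", "tf": "MYC", "n_peaks": 900, "fold_change": 20.0},
    {"id": "ds4", "tf": "JUN", "n_peaks": 3000, "fold_change": 8.0},
    {"id": "ds5", "tf": "JUN", "n_peaks": 2500, "fold_change": 11.0},
]
selected = select_datasets(datasets)
print({tf: d["id"] for tf, d in selected.items()})
```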

We used our recently developed generalized linear mixed model approach7 to investigate associations of CRC risk with variations of TF-DNA binding by a single TF (Eq. (1)). In brief, we constructed an N × M matrix from all genetic variants in the GWAS summary statistics and the annotations from all available TF-DNA binding regions, where N is the number of genetic variants and M is the number of TF ChIP-seq datasets; a variant located in a ChIP-seq peak is coded as “1” and a variant not located in a ChIP-seq peak is coded as “0”. We used the Chi-squared value for each genetic variant reported in the GWAS summary data to measure its association with CRC risk. We then used generalized linear mixed models to estimate the associations between the Chi-squared values (Y) and the TF binding status of genetic variants located in binding sites of each TF, grouping variants into LD blocks (non-overlapping segments of 100 kb) to handle the dependence between genetic variants (Eq. (1)).

$${Y}_{{ij}}={\beta }_{0}+{\beta }_{1}{TF}_{{ij}}+{V}_{i}+{\varepsilon }_{{ij}}$$
(1)

In Eq. (1), \({Y}_{{ij}}\) is the Chi-squared value for the j-th variant in the i-th LD block; β0 is the fixed intercept, and β1 is the fixed slope, which measures the mean difference of the Chi-squared values (\(\Delta {\bar{X}}^{2}\)) between TF binding statuses; \({{TF}}_{{ij}}\) is the TF binding status of the j-th variant in the i-th LD block (i.e., 1 for a variant located in a TF binding site, 0 otherwise); \({V}_{i}\) is the random intercept for the i-th LD block; and εij is the error term. Based on this statistical model, we identified cis-regulatory elements occupied by TFs whose genetic variations of TF-DNA bindings are associated with CRC risk at Bonferroni-corrected P < 0.05. We additionally used generalized linear mixed models to test for interactions between pairs of TFs by estimating the associations of the Y values of variants with TF-pair occupancy (Eq. (2)). Of note, we used genetic variants not occupied by any of the TFs investigated as the reference control group for all the analyses.

$${Y}_{{ij}}={\beta }_{0}+{\beta }_{1}{TF}{1}_{{ij}}+{\beta }_{2}{TF}{2}_{{ij}}+{\beta }_{3}{TF}{1}_{{ij}}\times {TF}{2}_{{ij}}+{V}_{i}+{\varepsilon }_{{ij}}$$
(2)
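The data layout behind Eqs. (1) and (2) can be sketched as follows: each variant receives a 0/1 occupancy code per TF and an LD block identifier from non-overlapping 100-kb segments. This is a sketch under stated assumptions: the TF names, coordinates, and Chi-squared values are made up, peaks are treated as closed intervals, and the final function only computes the unadjusted mean difference that β1 is described as measuring. It does not fit the random intercept for LD blocks, so it is not the GLMM itself.

```python
from statistics import mean

BLOCK_SIZE = 100_000  # non-overlapping 100-kb segments used as LD blocks

def in_any_peak(pos, peaks):
    """Return 1 if pos lies inside any (start, end) peak, else 0."""
    return int(any(start <= pos <= end for start, end in peaks))

def annotate(variants, peaks_by_tf):
    """One row per variant: LD block id, Chi-squared value, 0/1 code per TF."""
    rows = []
    for chrom, pos, chi2 in variants:
        row = {"block": (chrom, pos // BLOCK_SIZE), "chi2": chi2}
        for tf, peaks in peaks_by_tf.items():
            row[tf] = in_any_peak(pos, peaks.get(chrom, []))
        rows.append(row)
    return rows

def naive_mean_difference(rows, tf, all_tfs):
    """Mean Chi-squared of TF-bound variants minus that of unbound controls."""
    bound = [r["chi2"] for r in rows if r[tf] == 1]
    control = [r["chi2"] for r in rows if all(r[t] == 0 for t in all_tfs)]
    return mean(bound) - mean(control)

# Hypothetical peaks and variants (chromosome, position, Chi-squared).
peaks_by_tf = {
    "TF_A": {"chr1": [(1000, 2000)]},
    "TF_B": {"chr1": [(150_000, 151_000)]},
}
variants = [
    ("chr1", 1500, 6.0), ("chr1", 1800, 4.0), ("chr1", 50_000, 1.0),
    ("chr1", 150_500, 3.0), ("chr1", 260_000, 1.5),
]
rows = annotate(variants, peaks_by_tf)
diff_a = naive_mean_difference(rows, "TF_A", list(peaks_by_tf))
print([r["block"][1] for r in rows], diff_a)
```

In a full analysis, the `block` column would serve as the grouping factor for the random intercept in a mixed-model fit.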

Simulation studies

To benchmark the statistical power of GLMM against S-LDSC, we conducted simulations under a polygenic causality scenario where genotypes influence phenotypes through TF binding sites. Specifically, we assumed that only SNPs located within TF binding sites contribute to disease risk, while SNPs outside these regions have no effect.

For each replicate, we randomly selected K causal variants (K = 500, 1000, 5000, or 10,000) from a pool of 600,000 SNPs annotated to 84 TFs in CRC cell lines (Supplementary Data 1). Under an additive architecture, phenotypes were simulated using Eq. 3:

$${{\rm{Y}}}_{{\rm{i}}}^{*}={\sum }_{{{\rm{j}}}=1}^{{\rm{K}}}{{\rm{\beta }}}_{{\rm{1}}} \times {{\rm{X}}}_{{\rm{ij}}}^{{\rm{TF}}}+{\varepsilon }_{{\rm{i}}}$$
(3)

where \({{{\rm{X}}}}_{{{\rm{ij}}}}^{{{\rm{TF}}}}\) is the genotype (0, 1, or 2) of the j-th variant for subject i, taken from 489 individuals of European ancestry in the 1000 Genomes Project (phase 3)66. β1 was set to 1, and \({{{\rm{\varepsilon }}}}_{{{\rm{i}}}}\) is sampled from a normal distribution \({{\rm{N}}}\left(0,\frac{1-{{{\rm{h}}}}^{2}}{{{{\rm{h}}}}^{2}}{\mathrm{var}}\left({\sum }_{{{\rm{j}}}=1}^{{{\rm{K}}}}{{{\rm{\beta }}}}_{1}\times {{{\rm{X}}}}_{{{\rm{ij}}}}^{{{\rm{TF}}}}\right)\right)\), with heritability \({{{\rm{h}}}}^{2}\,\)= 0.5, 0.25, 0.1, or 0.05, where \({\mathrm{var}}\left(\mathop{\sum }_{{{\rm{j}}}=1}^{{{\rm{K}}}}{{{\rm{\beta }}}}_{1}\times {{{\rm{X}}}}_{{{\rm{ij}}}}^{{{\rm{TF}}}}\right)\) is the variance of the genetic component.

To facilitate downstream GWAS, we binarized the continuous trait into a case–control phenotype \({{{\rm{C}}}}_{{{\rm{i}}}}^{*}\) based on the median value of \({{{\rm{Y}}}}_{{{\rm{i}}}\,}^{*}\) across individuals, as defined in Eq. (4).

$${C}_{i}^{*}=\left\{\begin{array}{c}1,\,{{\rm{Y}}}_{{\rm{i}}}^{*}\ge {{\rm{median}}}\left(\mathop{\sum }_{{{\rm{i}}}=1}^{{\rm{N}}}{{\rm{Y}}}_{{\rm{i}}}^{*}\right)\\ 0,\,{{\rm{Y}}}_{{\rm{i}}}^{*} < {{\rm{median}}}\left({\sum }_{{{\rm{i}}}=1}^{{\rm{N}}}{{\rm{Y}}}_{{\rm{i}}}^{*}\right)\end{array}\right.$$
(4)
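
The simulation in Eqs. (3) and (4) can be sketched as follows. This is a minimal illustration rather than the code used in the study: `toy_genotypes` is a hypothetical stand-in for the 1000 Genomes genotypes, and the function names, seeds, and allele frequency are assumptions made for the example.

```python
import random
import statistics


def toy_genotypes(n_subjects, n_causal, maf=0.3, seed=1):
    """Hypothetical stand-in for real genotypes: 0/1/2 dosages at K causal SNPs."""
    rng = random.Random(seed)
    return [
        [int(rng.random() < maf) + int(rng.random() < maf) for _ in range(n_causal)]
        for _ in range(n_subjects)
    ]


def simulate_case_control(genotypes, h2, beta1=1.0, seed=0):
    """Simulate a liability (Eq. 3) and split it at the median (Eq. 4)."""
    rng = random.Random(seed)
    # Genetic component: sum over causal SNPs of beta1 * X_ij
    genetic = [beta1 * sum(row) for row in genotypes]
    # Noise variance (1 - h2) / h2 * var(genetic), so the genetic share is about h2
    noise_sd = ((1.0 - h2) / h2 * statistics.pvariance(genetic)) ** 0.5
    liability = [g + rng.gauss(0.0, noise_sd) for g in genetic]
    # Cases are subjects at or above the median liability
    cutoff = statistics.median(liability)
    status = [1 if y >= cutoff else 0 for y in liability]
    return liability, status


genotypes = toy_genotypes(n_subjects=489, n_causal=50)
liability, status = simulate_case_control(genotypes, h2=0.5)
```

Splitting at the median yields a roughly balanced case–control sample, which keeps the downstream association test comparable across replicates.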

We performed GWAS using PLINK 2.067 and applied both GLMM and S-LDSC to the resulting summary statistics to estimate TF–disease associations. For each K, we conducted 50 replicates, each with a different random subset of causal SNPs, ensuring reproducibility while keeping K fixed. Results were summarized as the number of true-positive TFs detected at a nominal threshold (P < 0.05) by each method.

To evaluate the type I error rate of GLMM, we conducted simulation analyses under the null hypothesis of no association between genotype and phenotype. Specifically, we randomly assigned binary phenotype values (0 or 1) to 489 individuals of European ancestry and performed logistic regression to generate GWAS summary statistics under the null. We then applied GLMM and S-LDSC using different numbers of TF binding variants (500, 1000, 5000, and 10,000) and repeated the entire procedure 1000 times. Finally, we compared the type I error rates of GLMM and S-LDSC using quantile-quantile (QQ) plots of p-values across the 1000 replicates for each binding variant set size.
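
A minimal sketch of how null p-values from such replicates might be summarized is given below. The function names are hypothetical, and the uniform draws merely stand in for p-values produced by a well-calibrated method; they are not results from the study.

```python
import math
import random


def type1_error_rate(p_values, alpha=0.05):
    """Fraction of null replicates declared significant at level alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)


def qq_points(p_values):
    """(expected, observed) pairs of -log10(p) for a QQ plot."""
    n = len(p_values)
    observed = sorted(-math.log10(p) for p in p_values)
    # Expected quantiles of Uniform(0, 1) p-values: i / (n + 1)
    expected = sorted(-math.log10(i / (n + 1)) for i in range(1, n + 1))
    return list(zip(expected, observed))


# Stand-in for 1000 null replicates: p-values drawn from (0, 1]
rng = random.Random(42)
null_p = [1.0 - rng.random() for _ in range(1000)]
rate = type1_error_rate(null_p)
points = qq_points(null_p)
```

For a calibrated method, the empirical rate should be close to the nominal level and the QQ points should track the diagonal.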

Gene expression profiling and data processing

The RNA-seq library preparation, sequencing, data processing, and quality control for the ACCC1 and the BarcUVa-Seq project63 have been described previously. In brief, paired-end sequencing was conducted for mRNA obtained from the BarcUVa-Seq project using either an Illumina HiSeq 2500 or a NovaSeq 6000 instrument and for mRNA obtained from the ACCC project using a HiSeq 3000 Sequencer. The raw RNA-seq data were processed following the GTEx Consortium pipeline64. BAM files containing reads aligned to the reference genome (GRCh37 for ACCC; GRCh38 for BarcUVa-Seq and GTEx) were generated using STAR68 (v2.5.4). These BAM files, along with GENCODE release 19 (for ACCC) and release 26 (for BarcUVa-Seq and GTEx), were used for quantification of gene expression, AS and APA.

Building on our previous work in processing gene expression and AS16, gene expression quantification was performed using RNA-SeQC69. Read counts and Transcripts Per Million (TPM) values were calculated from aligned reads based on the following criteria: (1) reads were uniquely mapped; (2) aligned reads were properly paired; (3) the read alignment distance was < 6. Genes with expression of ≥ 0.1 TPM in ≥ 20% of samples and ≥ 6 reads (unnormalized) in ≥ 20% of samples were selected. Quantile normalization was applied to standardize the gene expression data within each sample to the same distribution, followed by rank-based inverse normal transformation across all samples.
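
The gene filter and the rank-based inverse normal transformation can be illustrated with the sketch below. It is not the GTEx pipeline code: the function names are hypothetical, and the rank offset of 0.5 and the averaging of tied ranks are assumptions, since the text does not specify them.

```python
from statistics import NormalDist


def passes_expression_filter(tpm, counts, min_tpm=0.1, min_reads=6, min_frac=0.2):
    """Keep a gene with >= min_tpm TPM and >= min_reads reads in >= min_frac of samples."""
    n = len(tpm)
    frac_tpm = sum(v >= min_tpm for v in tpm) / n
    frac_reads = sum(c >= min_reads for c in counts) / n
    return frac_tpm >= min_frac and frac_reads >= min_frac


def inverse_normal_transform(values, offset=0.5):
    """Map one gene's values across samples to normal quantiles by rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based rank, averaged over ties
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    normal = NormalDist()
    return [normal.inv_cdf((r - offset) / n) for r in ranks]


z = inverse_normal_transform([3.0, 1.0, 2.0])
```

The transformation depends only on the ordering of samples, so it limits the influence of outlying expression values on downstream model fitting.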

AS was quantified using LeafCutter70 and RegTools71. In brief, intron usage was quantified from aligned BAM files using the junctions extract command of RegTools with the default options. Next, the introns were clustered using the leafcutter_cluster_regtools.py script from LeafCutter, with the following options: -m 50 -l 500000. The clusters were mapped to genes using the map_clusters_to_genes.R script used by cluster_prepare_fastqtl.py in the GTEx pipeline. Introns with low counts or low complexity were filtered based on the criteria in the GTEx pipeline (e.g., cluster_prepare_fastqtl.py). The filtered counts were normalized using the prepare_phenotype_table.py script from LeafCutter.

For APA analysis, following our recent APA-WAS study21, APA was quantified using the Percentage of Distal polyA site Usage Index (PDUI) estimated from DaPars v2.021. We performed quantile normalization to transform the PDUI values for each sample to the same distribution. We used 60 probabilistic estimation of expression residuals (PEER) factors72 to correct for batch effects and experimental confounders in downstream prediction model building for gene expression, AS and APA.

Genotyping data processing

The generation and processing of genotype data have been described in our previous work1,16. Briefly, CRC patients of East Asian ancestry from the ACCC were genotyped using the Illumina OncoArray and the expanded Illumina MEGA Array. For genotype quality control, genetic variants were excluded based on the following criteria: (1) genotype call rate < 95%; (2) ambiguous variants; (3) duplicated variants; (4) P for Hardy-Weinberg equilibrium (HWE) < 1.0 × 10−6. For quality control of samples, we excluded those with: (1) genotype call rate < 95%; (2) genetically identical or duplicated samples; (3) first- or second-degree relatives; (4) racial outliers. To estimate genetic relatedness among samples, the quality-controlled common SNPs (minor allele frequency [MAF] > 0.01) were first pruned using a window size of 1500 SNPs and a pairwise r2 threshold of 0.2. Identity-by-descent (IBD) was then calculated using the pruned genotype data. Quality-controlled genotypes were imputed using the 1000 Genomes Project Phase 3 mixed reference haplotypes with the Michigan Imputation Server (Minimac4 for imputation and Eagle v2.4 for phasing). Variants with MAF < 0.01 were excluded before imputation. Approximately 4.5 million variants with MAF > 0.05 and R2 > 0.3 were included for prediction model building.
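
The call-rate and HWE criteria for variants can be sketched as a simple filter. This is an illustration under stated assumptions: the function names are hypothetical, and the 1-degree-of-freedom chi-square test is used here only for brevity (the text does not say which HWE test was applied, and exact tests are common in practice).

```python
import math


def hwe_chisq_p(n_aa, n_ab, n_bb):
    """P-value of a 1-df chi-square test for Hardy-Weinberg equilibrium."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    if min(expected) == 0:
        return 1.0  # monomorphic variant: nothing to test
    observed = [n_aa, n_ab, n_bb]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 df
    return math.erfc(math.sqrt(chi2 / 2.0))


def keep_variant(n_aa, n_ab, n_bb, n_missing, min_call_rate=0.95, hwe_p=1e-6):
    """Apply the call-rate and HWE exclusion criteria to one variant."""
    called = n_aa + n_ab + n_bb
    if called == 0:
        return False
    if called / (called + n_missing) < min_call_rate:
        return False
    return hwe_chisq_p(n_aa, n_ab, n_bb) >= hwe_p
```

For example, `keep_variant(250, 500, 250, 0)` retains a variant whose genotype counts match HWE expectations, whereas a variant with no heterozygotes among 1000 samples is dropped.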

For the BarcUVa-Seq project, approximately 400,000 SNPs were genotyped from DNA isolated from peripheral blood samples, using the OncoArray BeadChip. Quality controls and genotyping data processing details were provided in a previous study63. Genotypes were aligned to the reference genome GRCh38 and imputed using the TOPMed (Version R2) reference panel63. Genetic variants with an imputation quality of R2 < 0.3 were excluded.

In the GTEx project, whole genome sequencing (WGS) was performed on DNA samples from 284 individuals of European ancestry. WGS-based genotype data for these individuals, in VCF format, were accessed from dbGaP (phs000424.v8.p2). Detailed information on WGS variant quality control is provided in GTEx documentation64.

Conditional analyses based on adjusted GWAS association signals

To investigate whether the identified associations between genetically predicted gene expression (or AS, APA) and CRC risk are independent of the association signals identified in previous GWAS, we performed conditional analyses by adjusting for the lead variant using the GCTA-COJO method73. The lead variant in each risk locus was determined based on the strongest association from GWAS summary statistics derived from the meta-analysis of European and East Asian populations or from Europeans alone. Genotype data from either European or East Asian populations in the 1000 Genomes Project Phase 3 were used to estimate linkage disequilibrium (LD). We then conducted TWAS analyses (or AS, APA) using the adjusted GWAS summary statistics to explore the association between genetically predicted gene expression and CRC risk. In addition, joint analyses were performed to combine p-values for the European population, as well as datasets from both European and East Asian populations, for gene expression (or AS, APA).

Building genetically predicted models of gene expression, AS, and APA

To construct genetically predicted models for gene expression, AS, and APA based on our sTF-TWAS framework20, we selected only genetic variants located in the binding regions of any susceptible TF, restricted to variants with MAF > 0.05 and HWE P > 10−6. We also excluded variants missing from the CRC GWAS summary statistics of either the European or Asian populations, which resulted in approximately 378,000 genetic variants. Prediction models were trained using the elastic net approach based on flanking variants within a ± 1 Mb region, accounting for potential confounding factors, such as the top five principal components (PCs), gender, potential batch effects, and other factors derived from the PEER method. A tenfold cross-validation approach was used to fine-tune the model parameters. The performance of the prediction models was evaluated by calculating the squared correlation (R2) between the predicted values and the observed levels for gene expression, AS and APA, respectively.
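
To make the elastic net step concrete, the sketch below fits the penalized regression by coordinate descent and scores it with the squared correlation. It is a toy illustration, not the software used in the study: the function names and the penalty settings (`lam`, `alpha`) are hypothetical, inputs are assumed to be centered, and covariate adjustment and tenfold cross-validation are omitted for brevity.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the lasso part of the penalty)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net(X, y, lam=0.1, alpha=0.5, n_iter=200):
    """Minimize (1/2n)||y - Xb||^2 + lam * (alpha*|b|_1 + (1 - alpha)/2 * |b|_2^2)."""
    n, p = len(X), len(X[0])
    weights = [0.0] * p
    residual = list(y)
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of variant j with the residual that excludes its own effect
            rho = sum(X[i][j] * (residual[i] + X[i][j] * weights[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam * alpha) / (col_sq[j] + lam * (1.0 - alpha))
            delta = new - weights[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                weights[j] = new
    return weights


def squared_correlation(predicted, observed):
    """Squared Pearson correlation between predicted and observed values."""
    mp = sum(predicted) / len(predicted)
    mo = sum(observed) / len(observed)
    cov = sum((a - mp) * (b - mo) for a, b in zip(predicted, observed))
    vp = sum((a - mp) ** 2 for a in predicted)
    vo = sum((b - mo) ** 2 for b in observed)
    return cov * cov / (vp * vo)


# Two centered toy variants; only the first one drives the phenotype
X = [[-2, 1], [-1, -1], [0, 0], [1, -1], [2, 1]]
y = [-4, -2, 0, 2, 4]
weights = elastic_net(X, y, lam=0.01, alpha=0.5)
predicted = [sum(w * x for w, x in zip(weights, row)) for row in X]
r2 = squared_correlation(predicted, y)
```

In this toy example the uninformative variant receives a weight of exactly zero, which is the sparsity property that makes the resulting weights usable as a compact prediction model.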

Association analyses of genetically predicted gene expression, AS, and APA with CRC risk

Using the weight matrix and summary statistics from CRC GWAS datasets consisting of 186,072 individuals of European ancestry and 72,272 individuals of East Asian ancestry, we evaluated the association between gene expression (or AS, APA) and CRC risk under the sTF-TWAS framework20. The statistical method is outlined in Eq. (5):

$${Z}_{g}\approx {\sum }_{l\in {{Model}}_{g}}{\omega }_{lg}\frac{{\hat{\sigma }}_{l}}{{\hat{\sigma }}_{g}}\frac{{\hat{\beta }}_{l}}{{se}\left({\hat{\beta }}_{l}\right)}$$
(5)

Briefly, the Z-score \({Z}_{g}\) was used to estimate the association between predicted gene expression (or AS, APA) and CRC risk; \({\omega }_{lg}\) is the weight of variant \(l\) for predicting the expression of gene \(g\) (or AS, APA); \({\hat{\beta }}_{l}\) and \({{se}}(\hat{\beta }_{l})\) are the association regression coefficient and its standard error, respectively, for variant \(l\) in GWAS; and \({\hat{\sigma }}_{l}\) and \({\hat{\sigma }}_{g}\) are the estimated variances of variant \(l\) and the predicted expression of gene \(g\) (or AS, APA). We further conducted a joint analysis to combine TWAS p-values for each gene from the BarcUVa-Seq and GTEx datasets for the European population, as well as from datasets across European and East Asian populations (ACCC), using the aggregated Cauchy association test (ACAT) method74. Similarly, joint analyses were performed to combine p-values for AS and APA, respectively. Statistically significant genes were determined based on Bonferroni-corrected P < 0.05, considering the total number of tests for the models of gene expression, AS, and APA, respectively.
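
The Z-score formula and the ACAT combination step can be written out as a short sketch. The function names are hypothetical, and equal ACAT weights are an assumption, since the weighting used in the study is not stated here.

```python
import math


def twas_z_score(weights, sigma_l, sigma_g, beta, se):
    """Sum over model variants of w_lg * (sigma_l / sigma_g) * (beta_l / se_l)."""
    return sum(
        w * (s / sigma_g) * (b / e)
        for w, s, b, e in zip(weights, sigma_l, beta, se)
    )


def acat(p_values, weights=None):
    """Combine p-values with the aggregated Cauchy association test."""
    k = len(p_values)
    if weights is None:
        weights = [1.0 / k] * k  # equal weights: an assumption for this sketch
    # Each p-value is mapped to a standard Cauchy variate, then averaged
    statistic = sum(w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, p_values))
    return 0.5 - math.atan(statistic / sum(weights)) / math.pi


z_g = twas_z_score(
    weights=[0.5, -0.2], sigma_l=[1.0, 1.0], sigma_g=1.0,
    beta=[0.1, 0.2], se=[0.05, 0.1],
)
combined_p = acat([0.01, 0.20])
```

Because the Cauchy transformation is dominated by the smallest p-values, the combined result stays sensitive to a strong signal in any single dataset.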

Differential gene expression analysis from single cell data

We downloaded and analyzed single-cell RNA sequencing (scRNA-seq) data spanning CRC developmental stages, including normal colon tissues (n = 31), serrated polyps (n = 19), conventional adenomas (n = 29), and carcinomas (microsatellite stable [MSS], n = 17; and microsatellite instability-high [MSI-H], n = 15) from the Colorectal Molecular Atlas Project (COLON MAP)23. We filtered the cell count matrix for each sample by removing lowly expressed genes (expressed in ≤ 3 cells) and excluding cells with high mitochondrial activity (≥ 20% of transcripts attributed to mitochondrial genes) using Seurat (version 4.2.1)75,76. We next identified the top 2000 highly variable genes for principal component analysis (PCA) and clustered cells based on the first 30 principal components using the Shared-Nearest-Neighbors (SNN) graph and the Louvain community detection algorithm. These clusters were visualized using the dimensionality reduction technique Uniform Manifold Approximation and Projection (UMAP).
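
The two filtering rules can be expressed on a plain count matrix as below. This is an illustrative sketch, not the Seurat code used in the study: the function name is hypothetical, mitochondrial genes are assumed to be those with the `MT-` prefix, and both filters are assumed to be evaluated on the unfiltered matrix.

```python
def filter_counts(counts, gene_names, min_cells=3, max_mito_frac=0.20):
    """counts holds one list of per-gene counts for each cell."""
    mito = [g for g, name in enumerate(gene_names) if name.startswith("MT-")]
    keep_cells = []
    for c, cell in enumerate(counts):
        total = sum(cell)
        mito_frac = sum(cell[g] for g in mito) / total if total else 1.0
        if mito_frac < max_mito_frac:  # drop cells with >= 20% mitochondrial reads
            keep_cells.append(c)
    keep_genes = [
        g for g in range(len(gene_names))
        if sum(1 for cell in counts if cell[g] > 0) > min_cells  # drop genes seen in <= 3 cells
    ]
    filtered = [[counts[c][g] for g in keep_genes] for c in keep_cells]
    return filtered, [gene_names[g] for g in keep_genes], keep_cells


genes = ["A", "B", "MT-CO1"]
cells = [[5, 0, 1], [3, 0, 0], [2, 1, 0], [4, 0, 0], [1, 0, 9]]
filtered, kept_genes, kept_cells = filter_counts(cells, genes)
```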

For differential expression analysis across CRC developmental stages, we examined risk genes associated with the normal-serrated polyp-carcinoma pathway and the normal-adenoma-carcinoma pathway. We performed integrative analysis of the cell count matrix across stages, generating a pseudobulk gene count matrix for each cell type within each tissue sample. Using these pseudobulk count data, we applied DESeq224 to identify differentially expressed genes between developmental stages. DESeq2 accounts for heteroscedasticity in gene expression, where the variance depends on the mean, by modeling this relationship explicitly. This adjusts for differences in variability between low- and high-expressing genes and leads to more accurate differential expression analysis. Specifically, we focused on potential CRC cell origins in major cell types, including absorptive cells (ABS), goblet cells (GOB), and stem cells (STM), comparing normal tissues with serrated polyps and normal tissues with conventional adenomas. Furthermore, we analyzed differential gene expression between adenoma-specific cells (ASC) and MSS, as well as between ASC and MSI-H. Similarly, differential gene expression analyses were performed between serrated-specific cells (SSC) and MSS, and between SSC and MSI-H. Genes with a False Discovery Rate (FDR)-adjusted p-value < 0.05 were considered statistically significant.
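
Pseudobulk aggregation and FDR adjustment can be illustrated as follows. The function names and the input layout are hypothetical, and the Benjamini-Hochberg procedure is shown as the assumed form of FDR adjustment; the DESeq2 model itself is not reproduced here.

```python
def pseudobulk(cells):
    """Sum per-gene counts over cells sharing a (sample, cell type) pair."""
    totals = {}
    for sample_id, cell_type, counts in cells:
        key = (sample_id, cell_type)
        if key not in totals:
            totals[key] = [0] * len(counts)
        totals[key] = [a + b for a, b in zip(totals[key], counts)]
    return totals


def benjamini_hochberg(p_values):
    """Return FDR-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


bulk = pseudobulk([
    ("s1", "ABS", [1, 2]),
    ("s1", "ABS", [3, 0]),
    ("s1", "GOB", [0, 1]),
])
padj = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

Aggregating to one profile per cell type and sample treats the tissue sample, rather than the individual cell, as the unit of replication in the stage comparisons.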

Identifying druggable targets for the identified risk genes

We identified potentially druggable risk genes and associated drugs by integrating drug-protein interaction data from four well-established drug databases: DrugBank52, ChEMBL53, the Therapeutic Target Database54 (TTD) and OpenTargets55. Specifically, we extracted key information for each drug, including commercial names, maximum clinical trial phases, and indications. Using these data, we annotated the identified risk genes as druggable targets and linked them to corresponding drugs that are either approved or in clinical trials. In addition, we categorized therapeutic drugs based on their indications, grouping them into those targeting CRC, other types of cancer, and non-cancer diseases.

Cell culture and cell proliferation and colony formation assays

Human colorectal cancer cell lines (HCT116, RKO, and SW480) from ATCC were cultured in RPMI 1640 supplemented with 10% FBS and penicillin/streptomycin. Cells were kept in a humidified incubator at 37 °C with 5% CO2. CCK-8 cell proliferation assays were performed over four days. In detail, 5 × 10³ cells were seeded into a 96-well plate and cultured for four days. The CCK-8 reagent (APExBIO) was then added to each well, and the plate was read at OD 450 nm using a microplate reader (BioTek). Relative cell growth was calculated by normalizing the OD value of knockout cells to that of the control cells. To assess long-term effects, colony formation assays were performed for 10–15 days. In detail, 200 cells (HCT116 and RKO) or 500 cells (SW480) were seeded into a 12-well plate. Cell colonies were visualized by crystal violet staining. The colonies were photographed and counted.

CRISPR/Cas9-mediated gene knockout

For gene knockout, the lentiviral vector LentiCRISPR v2 (Addgene) was used to clone the sgRNA of each target gene. Lentiviruses were amplified in HEK 293T cells. Briefly, the lentiviral vector and the packaging vectors, psPAX2 and pVSV-G, were transfected into 293T cells with Lipofectamine 2000 (Thermo Fisher Scientific). Viral supernatants were collected by centrifugation at 48 and 96 h after transfection. CRC lines were infected with the corresponding lentiviruses for 24 h. Cells were then changed to new growth medium for another 24 h and selected with puromycin (2 μg/ml, Sigma-Aldrich) for 5 days. The CRISPR/Cas9-edited cells were verified by western blotting of the target protein before use in the functional assays.

Western blotting

Total proteins were extracted from the corresponding cells lysed with RIPA buffer plus protease and phosphatase inhibitors (Roche). Twenty μg of total protein was separated by SDS-PAGE, transferred to a PVDF membrane and probed with the indicated primary antibodies. Anti-rabbit or anti-mouse secondary antibody was subsequently applied, and western blot signals were detected using enhanced chemiluminescence substrate (West Femto, Thermo Fisher Scientific). β-Tubulin was used as the loading control. Band intensity was calculated with NIH ImageJ software.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.