Introduction

Colorectal cancer (CRC) is one of the leading causes of cancer-related morbidity and mortality worldwide; in 2022, it accounted for 9.6% of all new cancer cases and 9.3% of all cancer deaths1. Its development involves a complex interplay of genetic predisposition and modifiable lifestyle factors2. Numerous studies have linked physical activity (PA) to a decreased risk of CRC, suggesting that regular PA lowers inflammation, improves insulin sensitivity, and modulates gut motility, all of which may contribute to reduced carcinogenesis3,4,5,6. However, the extent to which these benefits depend on individual genetic susceptibility is not fully understood.

Large-scale genome-wide association studies (GWAS) have identified over 100 genetic loci associated with CRC susceptibility, implicating biological pathways such as Wnt signaling, immune regulation, and cell cycle control7,8. These findings have improved our understanding of CRC heritability, but most GWAS have focused solely on main genetic effects without considering interactions with behavioral or environmental exposures.

While candidate gene studies have provided valuable insights into specific pathways, they are inherently limited in scope and fail to capture the broader genetic landscape influencing CRC risk. For example, polymorphisms in IL6 and TNF (inflammation regulation), FTO and PPARG (energy metabolism), and ABCA1 (lipid transport) have been associated with PA-modulated effects on biomarkers such as C-reactive protein, obesity-related traits, and lipid profiles9,10,11,12. In addition, genes such as PITX1, a tumor suppressor associated with IGF-I pathways, and oxidative stress-related genes such as CAT, GSTP1 and MPO have been shown to interact with PA to influence cancer risk and antioxidant capacity13,14. These studies measured outcomes such as inflammatory markers, adiposity indices and oxidative stress levels, illustrating how genetic predisposition interacts with lifestyle factors in shaping disease risk. However, their reliance on prior biological assumptions and limited genomic coverage restricts their utility in discovering novel interactions15.

A genome-wide interaction study (GWIS) provides a robust, exploratory framework for uncovering novel gene-environment interactions. This approach is particularly useful for identifying genetic variants and pathways not previously reported to be associated with CRC, thereby complementing the conventional GWAS approach and expanding our understanding of the interplay between PA and genetic factors16. In the context of CRC, only a few GWIS have been conducted to date, and these have focused on alcohol consumption17, NSAIDs18, and diet19. This highlights the need for systematic investigation into gene-PA interactions.

In this study, we aimed to investigate the interaction between genetic susceptibility and PA on CRC risk at a genome-wide level using data from the UK Biobank. We employed a nested case-control design to minimize potential biases associated with excessively large unmatched control groups and to preserve temporal comparability between cases and controls.

Methods

Study population

We used data from the UK Biobank (application #94695), a prospective cohort study of over 500,000 participants aged 40–69 years at baseline between 2006 and 2010.

After excluding participants who withdrew their consent, we applied the following exclusion criteria: missing information on the year or month of birth (UK Biobank Field IDs: 34, 52), physical activity (Field ID: 22035) or smoking status (Field ID: 20116); missing information for all of the following: deprivation index (Field ID: 189), body mass index (BMI; Field ID: 21001) and alcohol consumption status (Field ID: 20117); non-European genetic ancestry (Field ID: 22006); a diagnosis of any cancer prior to the baseline assessment (Field IDs: 53, 40005, 40006); and a genotyping call rate of less than 99% (Field ID: 22005). After applying these criteria, a total of 272,270 participants remained eligible for analysis (Fig. 1). Controls were selected through incidence density sampling from the eligible study population and were required to meet the same exclusion criteria as applied to cases, including no history of cancer prior to baseline.

Fig. 1

Flowchart of study population selection based on incidence density matching using UK Biobank Data.

We conducted a nested case-control study using incidence density matching to evaluate the interaction between physical activity (PA) and genetic susceptibility to CRC, ensuring that the controls represented the same risk set as the cases and preserving the temporal structure of exposure while maintaining comparability in a time-sensitive context. Incident CRC cases (n = 2,974) were identified through linkage with national cancer registries. For each case, up to four controls (n = 11,424) were selected from participants who were at risk at the time of case diagnosis. Controls were matched on sex, age at recruitment (± 5 years), smoking status and follow-up duration (± 6 months).
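The risk-set logic described above can be illustrated with a minimal Python sketch. It is not the authors' code: the `Participant` fields and the `sample_risk_sets` function are hypothetical names introduced here, follow-up is assumed to be measured in years from baseline, and the sketch applies only the at-risk condition plus matching on sex, smoking status and age (± 5 years). It does not reproduce the ± 6 month follow-up matching, and it assumes controls are drawn independently for each case.

```python
import random
from dataclasses import dataclass


@dataclass
class Participant:
    pid: int
    sex: str
    age: float        # age at recruitment, in years
    smoking: str      # smoking status category
    followup: float   # years from baseline to CRC diagnosis or censoring
    is_case: bool     # True if follow-up ended with an incident CRC diagnosis


def sample_risk_sets(cohort, n_controls=4, age_tol=5.0, seed=0):
    """Return {case pid: [control pids]} using incidence density sampling.

    Assumption: a participant is eligible as a control for a case if they are
    still under follow-up (and cancer-free) at the case's diagnosis time, so a
    later case may serve as a control for an earlier one.
    """
    rng = random.Random(seed)
    matched = {}
    for case in (p for p in cohort if p.is_case):
        eligible = [
            p for p in cohort
            if p.pid != case.pid
            and p.followup > case.followup          # still at risk at diagnosis time
            and p.sex == case.sex
            and p.smoking == case.smoking
            and abs(p.age - case.age) <= age_tol
        ]
        k = min(n_controls, len(eligible))          # "up to" n_controls per case
        matched[case.pid] = [p.pid for p in rng.sample(eligible, k)]
    return matched
```

Because fewer than four eligible controls may exist for some cases, the total number of controls can fall below four times the number of cases, as in the counts reported above.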

Genotyping data

We accessed genomic data provided by the UK Biobank, generated using the Affymetrix UK BiLEVE and UK Biobank Axiom Array platforms. These datasets contained over 800,000 single nucleotide polymorphisms (SNPs). Quality control procedures were applied to retain SNPs that met the following criteria: genotyping call rates > 99%, Hardy-Weinberg equilibrium p-values > 1 × 10⁻⁶, and minor allele frequencies > 0.03. After filtering, 409,059 SNPs were included in the analysis.
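As an illustration of the three SNP-level filters, the following Python sketch applies them to genotype counts for a single variant. The function names are hypothetical, and the Hardy-Weinberg check uses a one-degree-of-freedom chi-square approximation chosen here for brevity; the study's own pipeline may have used a different test (for example, an exact test).

```python
import math


def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """Chi-square (1 df) p-value for departure from Hardy-Weinberg equilibrium."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)       # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    if min(expected) == 0:
        return 1.0                        # monomorphic site: the test is undefined
    chi2 = sum((o - e) ** 2 / e for o, e in zip((n_aa, n_ab, n_bb), expected))
    # Survival function of a chi-square variable with 1 degree of freedom
    return math.erfc(math.sqrt(chi2 / 2.0))


def passes_qc(n_aa, n_ab, n_bb, n_missing,
              min_call=0.99, min_hwe_p=1e-6, min_maf=0.03):
    """Apply the call-rate, HWE and MAF thresholds stated in the text."""
    called = n_aa + n_ab + n_bb
    if called == 0:
        return False
    call_rate = called / (called + n_missing)
    freq_a = (2 * n_aa + n_ab) / (2 * called)
    maf = min(freq_a, 1.0 - freq_a)
    return (call_rate > min_call
            and maf > min_maf
            and hwe_chi2_pvalue(n_aa, n_ab, n_bb) > min_hwe_p)
```

For example, a variant with genotype counts 8,100 / 1,800 / 100 and no missing calls has a minor allele frequency of 0.10 and sits exactly at Hardy-Weinberg proportions, so it passes all three filters.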

We limited our analyses to directly genotyped SNPs, rather than imputed variants. Given the exploratory nature of the GWIS, we deliberately adopted a conservative approach. Although genotype imputation is widely used and generally reliable, interaction models are more susceptible to uncertainty in imputation, especially when modeling subtle gene-environment interactions. By focusing on high-confidence genotyped variants, we aimed to improve the robustness and interpretability of the results, even if it meant reducing genomic coverage.

Statistical analysis

We investigated the interaction between genetic variants and PA on CRC risk using conditional logistic regression models. The interaction between genetic variants and PA was assessed using the p for interaction, derived from the statistical significance of the interaction term (SNP × PA) in the regression model under an additive genetic model. Physical activity was categorized based on the 2017 WHO guidelines20, with sufficient PA defined as at least 150 min of moderate-intensity or 75 min of vigorous-intensity activity per week.
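For clarity, the interaction model can be written out explicitly. The notation below is ours rather than the authors': G is the minor-allele count (0, 1 or 2) under the additive model, E is an indicator for sufficient PA, Z collects the adjustment covariates, and R_s is the matched set containing case 0 and its controls.

```latex
% Linear predictor for participant j in matched set s
\eta_{sj} = \beta_G \, G_{sj} + \beta_E \, E_{sj} + \beta_{GE} \, G_{sj} E_{sj}
            + \boldsymbol{\gamma}^{\top} \mathbf{Z}_{sj}

% Conditional likelihood: each case (j = 0) is compared only with its own matched set
L(\beta_G, \beta_E, \beta_{GE}, \boldsymbol{\gamma})
  = \prod_{s} \frac{\exp(\eta_{s0})}{\sum_{j \in R_s} \exp(\eta_{sj})}

% The p for interaction tests
H_0 : \beta_{GE} = 0
```

Conditioning on the matched set removes any set-specific intercept, which is why variables that are constant within a set cannot be estimated as main effects.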

To account for potential confounding, we adjusted the models for the following variables: BMI group, alcohol drinking status, socioeconomic deprivation index, the first 30 genetic principal components (to adjust for population stratification; Field ID: 22009), and genotyping batch (Field ID: 22000). The first 30 PCs were used as they collectively explained approximately 84.3% of the total genetic variance. Matching variables (age, sex, and smoking status) were not included in the adjustment as they were already accounted for by the matched study design.

To compare the baseline characteristics of cases and controls while accounting for the matched study design, we used conditional logistic regression with likelihood ratio tests (LRTs) for each covariate. This approach reflects the incidence density matching structure appropriately by conditioning on matched sets, and enables us to obtain a single overall P-value per variable. This is particularly useful for summarizing group differences in Table 1. Matching variables (age, sex and smoking status) were excluded from statistical testing as they were fixed by design and were not subject to comparison.

Table 1 Baseline characteristics of colorectal cancer cases and matched controls in the nested case-control study from UK Biobank.

We performed pathway enrichment analysis using MAGMA (Multi-marker Analysis of GenoMic Annotation; version 1.10)21, following a three-step approach: SNP annotation to genes, gene-level association analysis, and pathway-level analysis. First, we annotated SNPs to genes based on their physical location using the NCBI37.3 gene reference file. SNPs located within a 10 kb window upstream or downstream of each gene were included in this process, resulting in a dataset that linked SNPs to their corresponding genes. We then calculated gene-level p-values by aggregating SNP-level p-values using a multiple regression framework implemented in MAGMA22. To account for linkage disequilibrium between SNPs, we used the European reference panel from the 1000 Genomes Project (phase 3). Finally, we performed pathway-level analysis by aggregating gene-level p-values into predefined pathways based on KEGG annotations23. Using MAGMA’s competitive testing framework, we compared the observed associations within each pathway to the genome-wide background distribution.
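The first step, positional annotation with a 10 kb window, can be sketched in Python as follows. This is only an illustration of the windowing rule, not MAGMA's implementation; the function name, the tuple layouts and the identifiers in the usage note are hypothetical.

```python
from collections import defaultdict


def annotate_snps_to_genes(snps, genes, window=10_000):
    """Map each gene to the SNPs lying within `window` bp of its boundaries.

    snps:  iterable of (snp_id, chrom, pos)
    genes: iterable of (gene_id, chrom, start, end), with start <= end
    """
    by_chrom = defaultdict(list)
    for snp_id, chrom, pos in snps:
        by_chrom[chrom].append((pos, snp_id))

    annotation = {}
    for gene_id, chrom, start, end in genes:
        lo, hi = start - window, end + window     # extend the gene on both sides
        annotation[gene_id] = sorted(
            snp_id for pos, snp_id in by_chrom[chrom] if lo <= pos <= hi
        )
    return annotation
```

With a hypothetical gene spanning positions 100,000 to 110,000, a SNP at 95,000 is assigned to the gene, whereas SNPs at 89,999 or 120,500 fall outside the extended window. A SNP can be assigned to more than one gene when windows overlap.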

False discovery rate (FDR) correction was applied at the SNP level to account for multiple testing, using the Benjamini–Hochberg method. No variants passed the significance threshold of an FDR-adjusted p-value of less than 0.05. Consequently, the top 10 variants with the lowest FDR-adjusted p-values were reported and interpreted as exploratory findings. FDR-adjusted p-values were also calculated for gene- and pathway-level analyses conducted using MAGMA. The full results, including the FDR-adjusted p-values, are presented in Tables S1 and S2.
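The Benjamini–Hochberg adjustment can be sketched in a few lines of Python. This is a generic implementation of the step-up procedure, not the authors' code, and the function name is ours.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])   # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of adjusted values
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

As a rough check, and assuming all 409,059 QC-passing SNPs were tested, the smallest p-value scales to 1.11 × 10⁻⁶ × 409,059 / 1 ≈ 0.454 and the second smallest to 2.16 × 10⁻⁶ × 409,059 / 2 ≈ 0.442. Because the procedure takes the running minimum from the larger ranks downward, both variants receive an adjusted value of about 0.44, which explains why the two top SNPs can share the same FDR-adjusted p-value despite different raw p-values.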

Data preprocessing and quality control were performed using PLINK v2.0 and Python 3. The genome-wide interaction analysis was conducted using the clogit function in R version 4.3.1 to evaluate the interaction between SNPs and physical activity on colorectal cancer risk. Visualization was performed in R and included Manhattan plots, quantile–quantile (QQ) plots, and a volcano plot to display the direction and strength of interaction effects.

Results

Table 1 presents the baseline characteristics of colorectal cancer (CRC) cases (n = 2,974) and matched controls (n = 11,424) included in the final analysis. As the matching variables (age, sex, and smoking status) were fixed by design, statistical tests were not conducted for these variables. Statistically significant differences were observed in alcohol drinking status (p < 0.001), body mass index (BMI) category (p < 0.001), deprivation index (p < 0.001), and physical activity levels (p < 0.001). A slightly higher proportion of controls met the WHO guidelines for sufficient physical activity compared to cases (53.2% vs. 51.1%).

In the genome-wide interaction analysis, we assessed the interaction between genotyped SNPs and physical activity on CRC risk. No SNPs reached statistical significance after correction for multiple testing (FDR < 0.05). Table 2 summarizes the top 10 SNPs with the lowest FDR-adjusted p-values. The variant rs61856638 in the ABI3 gene showed the strongest signal (p = 1.11 × 10⁻⁶; FDR-adjusted p = 0.44), followed by rs8043440 in GABRB3 (p = 2.16 × 10⁻⁶; FDR-adjusted p = 0.44) and rs1672718 in ZBTB16 (p = 4.62 × 10⁻⁶; FDR-adjusted p = 0.63). Several of these SNPs showed moderate interaction effect sizes, though none surpassed the FDR-corrected significance threshold. The main effects of these SNPs on CRC risk were generally opposite in direction to the interaction terms. While none of these findings were statistically significant, this pattern may indicate potential interactions between genetic variation and physical activity that need to be investigated further.

Table 2 Top 10 SNPs with the lowest FDR-adjusted p-values in the genome-wide interaction analysis of physical activity and colorectal cancer risk. None of the SNPs passed the FDR significance threshold (FDR < 0.05).

Figure 2 shows the QQ plot of observed versus expected p-values, which closely followed the null distribution. Figure 3 displays the Manhattan plot of interaction p-values across the genome. No locus exceeded the genome-wide significance threshold, but several SNPs showed suggestive signals. Figure 4 illustrates the volcano plot highlighting the direction and magnitude of interaction effects, with the top 10 SNPs (based on FDR-adjusted p-values) marked.

Fig. 2

Quantile–quantile plot of genome-wide SNP × physical activity interaction p-values.

Fig. 3

Manhattan plot of genome-wide SNP × physical activity interaction on colorectal cancer risk. SNPs with the top 10 smallest FDR-adjusted p-values are highlighted in green.

Fig. 4

Volcano plot of SNP × physical activity interaction effects. SNPs with the top 10 smallest FDR-adjusted p-values are highlighted in green.

A gene-level analysis was performed using MAGMA, which annotated 15,956 genes (Table S1). Although none of the genes passed the FDR-adjusted p < 0.05 threshold, several demonstrated low nominal p-values and may be of potential interest. These included PTGFR (p = 7.48 × 10⁻⁵; FDR-adjusted p = 0.60), RNASEL (p = 7.76 × 10⁻⁵; FDR-adjusted p = 0.60), NSD1 (p = 1.12 × 10⁻⁴; FDR-adjusted p = 0.60) and PTGER3 (p = 1.83 × 10⁻⁴; FDR-adjusted p = 0.65). While these results did not meet the FDR-corrected threshold, they suggest candidate loci that may modulate CRC risk in relation to physical activity.

In the pathway-level analysis based on KEGG annotations, no pathways reached statistical significance after FDR correction. However, several pathways ranked among the top results based on their unadjusted p-values. These included platinum drug resistance (p = 0.0083; FDR-adjusted p = 0.85), heparan sulfate/heparin biosynthesis (p = 0.0084; FDR-adjusted p = 0.85), efferocytosis (p = 0.0148; FDR-adjusted p = 0.85), and transcriptional misregulation in cancer (p = 0.0168; FDR-adjusted p = 0.85). Inflammation-related pathways such as NF-κB signaling and Notch signaling also appeared among the top-ranked findings. While these pathways did not meet the significance threshold after multiple testing correction, they may offer biologically plausible leads for future investigation. Full pathway-level results are presented in Table S2.

Discussion

In this genome-wide interaction study, we investigated whether physical activity modifies genetic susceptibility to CRC. After applying false discovery rate correction, no variants, genes, or pathways reached statistical significance. These results highlight the difficulty of identifying gene-environment interactions in complex diseases and emphasize the exploratory nature of our analysis. The absence of significant interaction signals may be partly due to the lack of a significant association between physical activity and CRC risk in this cohort (odds ratio [95% confidence interval] 1.07 [0.99–1.16] in the multivariable model). Limited power to detect modest interaction effects and potential exposure misclassification may also have contributed.

Some of the variants with relatively low interaction p-values were located within genes that may have biological relevance to CRC, particularly in the context of immune regulation or epigenetic control. ABI3 and ZBTB16 are both involved in immune cell signaling and differentiation, with ZBTB16 previously linked to colorectal tumorigenesis through modulation of Wnt signaling and inflammatory pathways24. RNASEL, a gene involved in interferon-mediated antiviral responses, has been linked to cancer susceptibility in previous studies25. Although a recent study reported no significant association between RNASEL variants and colorectal adenoma risk26, its relevance to colorectal cancer may vary across stages of tumor development. NSD1 encodes a histone methyltransferase involved in epigenetic regulation and chromatin remodeling. Although its role in colorectal cancer is not well established, NSD1 has been implicated in other malignancies and overgrowth syndromes through altered transcriptional regulation27,28. GABRB3 has been associated with neurological and psychiatric conditions, including epilepsy29, autism spectrum disorders30, and bipolar disorder31. Although its role in colorectal cancer remains unclear, GABAergic signaling has been investigated in relation to epithelial cell function32 and intestinal homeostasis33. While none of these findings reached statistical significance after multiple testing correction, they may offer tentative biological clues that merit further investigation.

In the pathway-level analysis, no KEGG-defined pathways reached statistical significance after FDR correction. Nonetheless, several pathways had relatively low unadjusted p-values and were ranked among the top results, which may offer preliminary insights for future hypothesis-driven research. Among them, the platinum drug resistance pathway includes genes involved in DNA repair mechanisms and apoptosis regulation—cellular processes that have been associated with colorectal cancer progression and may be modulated by physical activity–induced changes in oxidative stress and cellular stress response pathways34,35. The heparan sulfate/heparin biosynthesis pathway is related to glycosaminoglycan metabolism, which can affect cell adhesion and extracellular matrix interactions. These factors influence epithelial integrity and tumor invasion, and may also respond to biomechanical or hormonal changes induced by regular physical activity36.

Efferocytosis, the process by which phagocytes clear apoptotic cells, is essential for resolving inflammation and maintaining immune tolerance. Disruption of this process has been linked to chronic inflammation and the formation of tumor-promoting microenvironments in the colon37,38. Physical activity is known to influence systemic inflammatory tone, which may interact with cell death and efferocytic pathways in shaping CRC risk39. Lastly, transcriptional misregulation in cancer encompasses a diverse set of genes frequently altered in tumorigenesis, including those related to cell cycle control and differentiation. Though broad, this category may capture regulatory pathways influenced by both genetic variation and lifestyle exposures40. Although these pathways did not reach FDR-adjusted significance and should be interpreted with caution, the convergence of inflammation-, repair-, and differentiation-related processes among the top results may suggest biological pathways through which physical activity and genetic variation could jointly influence CRC development.

This study has several limitations. The sample size may have limited power to detect modest gene–environment interactions, and physical activity was self-reported, raising the possibility of measurement error. Although major covariates were adjusted for, residual confounding cannot be excluded. Despite these constraints, the use of incidence density matching helped minimize time-related bias and ensured appropriate comparability between cases and controls. Analyses were restricted to directly genotyped variants to reduce uncertainty from imputation, and population structure was controlled using principal components. These design features strengthen the internal validity of the findings, even in the absence of statistically significant associations.

This exploratory genome-wide interaction analysis did not identify statistically significant associations between physical activity and genetic variants in colorectal cancer. Nonetheless, several variants and pathways showed nominal signals that may inform future hypotheses. Although these findings require cautious interpretation, the use of a matched design, direct genotyping, and multi-level analysis supports the integrity of the results and their potential value as a starting point for further research.