Introduction

Mortality in colorectal carcinoma (CRC) is mediated by metastasis, despite expanding multimodal treatment options1. Low-stage primary tumors often have excellent outcomes after surgical resection, but CRC can recur and/or metastasize several years after surgery. For example, locally advanced T-stage tumors (T2-3) without lymph node metastasis (N0) can be managed without adjuvant chemotherapy. However, follow-up is needed to identify the 10–20% of patients who will have distant metastases within 5 years.

The CRC microbiome is composed of intestinal commensal bacteria, and “pathobionts” that contribute to carcinogenesis and disease progression. Intestinal bacteria also colonize CRC distant metastases and modulate the immune microenvironment, such as tumoricidal natural killer cell activity2,3. The most extensive prior research of pathobionts describes Fusobacterium nucleatum ssp. animalis, which adheres to host enterocytes through Fap2 and stimulates Wnt signaling through FadA interactions with E-cadherin to promote carcinogenesis and the epithelial-to-mesenchymal transition4,5. In animal models, F. nucleatum increases the risk of metastasis and may directly contribute to lymphatic or hematogenous spread3,6. F. nucleatum colonizes 10–20% of CRC with a predilection for right-sided location7, and distant metastases3. Other pathobionts have been mechanistically linked to CRC carcinogenesis and progression, such as Escherichia coli strains with biosynthetic genes (pks locus) to produce colibactin, a host DNA alkylating genotoxin8. Pathobiont pks + E. coli promotes carcinogenesis in the setting of colitis8, and the signature mutagenic effects of colibactin are detectable in human CRC9,10. Recent advances toward colibactin synthesis inhibition highlight the promise of microbiome-targeted therapies for prevention and/or treatment of CRC11. Host DNA damage can also be inflicted by protein toxins, such as the type III secretion system effector UshA in Citrobacter rodentium and cytolethal distending toxin (CDT) in Campylobacter jejuni that promote carcinogenesis in mouse models12,13. Some strains of the prevalent intestinal colonizer Bacteroides fragilis produce fragilysin or “enterotoxin” (ETBF), which promotes carcinogenesis14. ETBF enhances CRC cell migration in culture, likely through a miR-139-3p-dependent pathway15. ETBF also promotes “stemness” of CRC in culture through TLR4-NFAT516.

Reflecting the importance of bacteria in CRC biology, tumor microbiomes have potential links to survival outcomes in prior research. A microbiome signature classifier with substantial contributions from Ruminococcus spp. was predictive of survival in several CRC cohorts17. However, whether the classifier provides prognostic value beyond other dominant factors like the pathologic stage remains to be investigated. Highly powered correlational studies using the Nurses’ Health Study (NHS) and the Health Professionals Follow-up Study (HPFS) cohorts and a Chinese stage III/IV CRC cohort have suggested that F. nucleatum abundance predicts CRC-specific mortality (OR 1.52 [1.04–2.39]), though other studies found no correlation18,19,20. Several known covariates with F. nucleatum colonization, such as right-sided anatomic location and microsatellite instability, may confound this type of single-species retrospective analysis21. At least two studies have suggested that high ETBF abundance correlates with advanced stage and shorter survival22,23. One mechanism by which bacteria may contribute to survival outcomes in CRC is interaction with chemotherapeutic agents. Prior research indicates that 5-fluorouracil (5-FU), a cornerstone of CRC chemotherapy since the 1980s, inhibits F. nucleatum growth, while tumor-resident E. coli can metabolize 5-FU to eliminate its anti-tumor activity24.

Specific tumor-associated pathobionts have established causal roles in colorectal carcinogenesis and progression. For several organisms, rigorous prior research has identified mechanisms such as host genome modification by genotoxins and stimulation of pro-carcinogenic signaling25. Knowledge gaps remain regarding the roles of human CRC microbiomes in disease progression, particularly metastasis risk. We hypothesized that metastasis-promoting bacteria are enriched in primary tumors of patients who eventually develop distant metastasis. We designed a retrospective cohort study using a large multi-institutional CRC population with available tumor RNAseq data. We validated the findings using 16S rDNA sequencing of an independent stage-matched validation CRC cohort.

Results

Descriptive statistics of independent discovery and validation cohorts

Uniformly tabulated clinicopathologic data from the ORIEN cohort showed younger age in the pre-existing metastasis group, modest differences in BMI (lowest in >5-year MFS), slightly more current smokers in the 0–5 yr MFS group, and differences in known vs. unknown performance status (Table 1). Since the primary tumor stage is a dominant predictor of MFS26, the higher pathologic and clinical stage parameters and death rates among the groups with metastasis were expected. Neoadjuvant therapy is more commonly indicated and given in the advanced stage and short MFS groups. The length of follow-up difference was expected due to study design, because only the >5-year MFS group was required to have at least 5 years of follow-up.

Table 1 Clinicopathologic features of the ORIEN CRC cohort grouped by metastasis-free survival outcomes

Clinicopathologic data from the UIHC cohort showed more frequent microsatellite instability in the >5 yr MFS group (Table 2). The M stage was specified by study design. Adjuvant chemotherapy was more frequently given to metastasizers, as expected. Since adjuvant therapy was given after biospecimen collection in all cases, it does not affect the primary tumor microbiome.

Table 2 Clinicopathologic features of the UIHC cohort

CRC tumor microbiome diversity predicts metastasis-free survival

We first asked whether bacterial communities colonizing primary tumors are related to the risk of metastasis. To address this question, we compared tumor microbiomes from patients with metastasis preceding the biospecimen collection (pre-existing metastasis), those that metastasized within 5 years, and those with more than 5 years of MFS. Primary tumor microbiome alpha diversity was slightly lower in patients with long-term MFS, having the same trend and borderline statistical significance with multiple metrics in the discovery and validation cohorts (Fig. 1A–C, 2-way ANOVA p 0.02–0.08). All analyses were stratified by site (metastasis vs. primary tumor) and metastases consistently showed lower alpha diversity. Tumor microbiome beta diversity analysis also showed significant differences between the MFS groups, as well as sites (Fig. 1D, E). All the above global diversity trends were found in both cohorts. The ORIEN discovery cohort has several clinicopathologic differences among the MFS groups that could confound the microbiome associations with disease progression. Since the cohort is large enough to power multivariate modeling, we asked whether the MFS-microbiome link is independent of several potential confounders using a permutational clustering approach implemented in adonis2 (Table 3). A statistically significant correlation between microbiome beta diversity and MFS persisted after adjustment for potential confounders. Since neoadjuvant therapy can alter tumor microbiomes, we performed the same modeling approach while excluding 151 neoadjuvant treated patients. The microbiome association with MFS remained significant (adjusted adonis p = 0.015). High microsatellite instability (MSI-H) can alter CRC biology, including metastasis outcomes and response to therapy27. We performed the same modeling while excluding 39 patients with MSI-H status, and the microbiome association with MFS remained significant (adjusted adonis p = 0.03). Within the MSI-H subset, a simplified analysis with adjustment only for primary/metastasis biospecimen site found no significant association between microbiome and MFS (adjusted adonis p = 0.4). This could reflect differences in MSI-H tumor microbiome relationships to metastasis outcomes, or inadequate statistical power (n 39).

Fig. 1: CRC tumor microbiomes predict metastasis-free survival.
Fig. 1: CRC tumor microbiomes predict metastasis-free survival.The alternative text for this image may have been generated using AI.
Full size image

A, B In the ORIEN cohort, alpha diversity by several metrics is significantly lower in distant metastases (met), and among primary tumors, modestly lower in patients with long term metastasis-free survival (MFS). C Similarly, low alpha diversity in primary tumors correlated with shorter MFS in the UIHC cohort. Metastases had lower Shannon indices that did not reach statistical significance. Microbiome beta diversity in both the ORIEN D and UIHC E cohorts is also distinguishable by site (metastasis vs. primary tumor) and MFS group. Larger symbols represent calculated centroids. The findings indicate that tumor microbiome diversity corresponds to MFS, and therefore may have prognostic value.

Table 3 Tumor microbiomes still predict MFS after adjustment for potential confounders

To identify specific taxa that distinguish metastasis risk, some of which may be causally related to disease progression and thus therapeutic targets, we applied LefSe28 linear discriminant analysis (Fig. 2). Among the most discriminating features in the ORIEN dataset were at least six known pathobiont-containing taxa that contribute to carcinogenesis through known mechanisms (Fig. 2A, blue text). Surprisingly, F. nucleatum was highest in primary tumors from patients with long-term metastasis-free survival. This may be explained by an inverse correlation between F. nucleatum abundance and stage group, i.e., F. nucleatum is most abundant in tumors with a shallow depth of invasion and a lack of lymph node metastases (Kruskal-Wallis p = 0.003). Analysis of the stage-matched UIHC validation cohort revealed different discriminating taxa, although some overlap with the discovery cohort was seen (Fig. 2B, red text). Thus, beta diversity differences according to MFS are strongly consistent, but the specific discriminating taxa differ between the two cohorts, which could be due to covariance with their distinct clinicopathologic features (Tables 1, 2).

Fig. 2: Pathobiont taxa predict CRC metastasis-free survival.
Fig. 2: Pathobiont taxa predict CRC metastasis-free survival.The alternative text for this image may have been generated using AI.
Full size image

A Linear discriminant analysis of the ORIEN cohort identified several distinguishing taxonomic groups that contain known CRC-related pathobionts (blue). The Enterobacteriaceae S. enterica and E. coli are enriched in distant metastases. B. fragilis and Peptostreptococcus spp. in primary tumors predicted metastasis within 5 years, while F. nucleatum associated with long-term metastasis-free survival (MFS). B Discriminating taxa in the stage-matched UIHC cohort were not the same, but overlapping taxonomic groups are highlighted in red. C Tissue RNAseq- and 16S-derived microbiome data does not distinguish pathobiont strains, so qPCR for F. nucleatum, bft1-3 from enterotoxigenic B. fragilis (ETBF), and cnf2 and clbB genes from the pks locus in colibactin-producing E. coli were performed on tumors and non-neoplastic margins. Detection rates were significantly different (Chi-squared p < 0.01) with ETBF and pks + E. coli more frequently detected in metastasizer biospecimens. F. nucleatum detection rates did not differ in the UIHC cohort by MFS group.

The 16S rDNA sequencing cannot resolve specific pathobiont strains, and the RNAseq data were of insufficient depth to confidently discriminate the closely related genomes. For example, 16S rDNA sequences do not distinguish enterotoxigenic from nonenterotoxigenic B. fragilis. We selected three of the most studied CRC-promoting pathobionts for strain-specific detection and quantitation by PCR: F. nucleatum, enterotoxigenic B. fragilis (ETBF), and pks + E. coli. qPCR performed on all non-metastasis biospecimens from the UIHC cohort (primers listed in Table S1) showed more frequent detection of both ETBF and pks + E. coli in metastasizers than patients with long-term MFS (Fig. 2C, 33–40% vs. 20–28%, Chi squared p < 0.01). Detection of F. nucleatum was not associated with MFS in the stage-matched UIHC cohort.

Microbial co-occurrence predicts metastasis-free survival

We identified a microbial clique of the simultaneous presence of Escherichia coli and the absence of a Pseudomonas sp. and Bacteroides fragilis in the training data (randomly selected 65% of the samples). Almost half (47.6%, n = 565) of the total study sample had this microbial clique. This microbial clique was strongly associated with increased odds of metastasis (OR [95% CI]: 1.91 [1.40, 2.59], p value < 0.0001) across the whole sample (Fig. 3). It was also strongly associated with increased odds of metastasis (OR [95% CI]:1.88 [1.11, 3.20]) in the test data (35% of the sample).

Fig. 3: A microbial clique identified with the MiCA algorithm can predict metastasis.
Fig. 3: A microbial clique identified with the MiCA algorithm can predict metastasis.The alternative text for this image may have been generated using AI.
Full size image

A A forest plots shows the association between the identified microbial clique and the odds of metastasis. B A violin plot shows the distribution of the outcome among patients with and without the clique.

Primary tumor microbiome capacities for carbohydrate metabolism and quinone biosynthesis reflect metastasis risk

Although specific MFS-associated taxa, other than pathobionts, did not show a high degree of concordance between the two cohorts, we reasoned that the functional capacities of the microbiome community may correspond to tumor microenvironment differences underlying metastasis risk. These functions may also highlight targets for future mechanistic studies and microbiome-targeted interventions. Using predicted gene function in the UIHC cohort, we examined correlations with MFS (Fig. 4A). The dominant trends were the enrichment of certain carbohydrate catabolism pathways and quinone biosynthesis pathways in tumors from patients with long-term MFS. Several enriched pathways are involved in the biosynthesis of menaquinones, which are components of the cell membrane that participate in electron transport. Menaquinones are prevalent, reversible redox components in anaerobes and most Gram-positive aerobes29. For the ORIEN cohort, we leveraged a large database of phenotype inferences, ProTraits30, and mapped ~1300 of the detected taxa in our biospecimens to derive predicted trait abundances. A comparison of the 0–5 yr and >5 yr MFS group microbiomes revealed a few traits significantly enriched in the long-term MFS group (Fig. 4B). Among these were several carbohydrate utilization traits and strict anaerobic preference, which correspond well to the metabolism and anaerobe-enriched menaquinone biosynthesis in the UIHC cohort.

Fig. 4: Predicted primary tumor microbiome functional capacities are related to metastasis risk.
Fig. 4: Predicted primary tumor microbiome functional capacities are related to metastasis risk.The alternative text for this image may have been generated using AI.
Full size image

A For the UIHC cohort, linear discriminant analysis was performed on 16S rDNA sequence-based pathway abundance prediction from Picrust2. Several patterns emerged as enriched in the >5-year MFS group: carbohydrate catabolism pathways (blue), quinone biosynthesis (brown), and some amino acid biosynthesis pathways (red). The most enriched pathways in metastasizer primary tumors were nucleic acid metabolism pathways. B For the marker gene-derived microbiome data in the ORIEN cohort, the identified species were mapped to the Protraits bacterial traits database. Among the ~1300 species with phenotypic data, few specific associations were found related to MFS: strict anaerobes, alpha galactosidase, and hydrogen sulfide producers were modestly enriched in the >5-year MFS group.

Metastasis microbiomes derive from the primary tumor and have reduced alpha diversity

CRC metastases are colonized by intestinal bacteria including pathobionts3. We assessed correlations between the microbiomes of primary tumors and metastases through paired biospecimens from individual patients. We focused on the subset of 103 ORIEN cohort participants and 50 UIHC cohort participants with paired primary tumor and metastasis biospecimens. Consistent with our observation that metastasis microbiomes in general have lower alpha diversities (Fig. 1A–C), alpha diversity was lower in metastases compared to the paired primary tumor from the same subject in both the ORIEN and UIHC cohorts (Fig. 5A–C). To assess beta diversity relationships between primary/metastasis pairs, we quantified UniFrac distances. In the ORIEN cohort, paired primary tumor and metastasis microbiomes within one subject were more closely related than unpaired comparisons (Fig. 5D). This pattern did not hold in the UIHC validation cohort, though statistical power was limited by the number of paired biospecimens (Fig. 5E). Overall, the findings suggest that similar bacterial populations colonize both primary tumors and metastases within an individual. However, a restricted set of taxa (lower alpha diversity) can colonize and persist at distant sites.

Fig. 5: Metastasis microbiomes are closely related to the primary tumor microbiome but with reduced alpha diversity.
Fig. 5: Metastasis microbiomes are closely related to the primary tumor microbiome but with reduced alpha diversity.The alternative text for this image may have been generated using AI.
Full size image

Paired primary tumor and metastasis biospecimens with microbiome data were available for 103 patients in the ORIEN cohort. A, B A paired analysis demonstrated lower alpha diversity in distant metastases, a finding validated in the stage-matched UIHC cohort (C). D ORIEN cohort paired metastasis and primary tumor microbiomes were closely related by beta diversity distance metrics in contrast with unpaired biospecimens, compatible with similar bacterial communities from the primary tumors colonizing distant sites. Inter-subject primary tumors were the most distant. E However, this pattern was not seen in the UIHC cohort, where primary tumor microbiomes were most similar.

We then examined differentially abundant taxa in the paired primary tumor and metastasis microbiomes. Several taxonomic groups were significantly depleted in metastases compared to primary tumors (Fig. 6A, B). The facultative anaerobe and pathobiont-containing Bacteroidaceae family and Fusobacterium genus, as well as the Bifidobacterium genus, were significantly less abundant in metastases from both cohorts. Enterobacteriaceae, including E. coli, had similar abundances at both sites. The species Fusobacterium periodonticum and a Bacillus genus ASV were significantly enriched in the metastases of the UIHC cohort. The findings suggest that some major microbiome families, predominantly anaerobes, are less fit for survival in distant metastasis microenvironments. ProTraits functional group analysis (Fig. 6C) identified strict anaerobicity and several carbohydrate and amino acid utilization traits as depleted in metastases. Aerobes, facultative anaerobes, and alcohol utilization traits had modest advantages in metastases. Analysis with PICRUSt2 in the UIHC cohort identified few pathways enriched in metastases (none in the primary tumors), with themes of nucleic acid metabolism, quinone biosynthesis, and peptidoglycan biosynthesis (Fig. 6D). The findings suggest that nutrient sources (carbohydrates and amino acids) and oxygen levels in distant metastasis microenvironments negatively select some of the taxa that colonize primary tumors.

Fig. 6: Colonization of metastases by tumor-associated bacteria is related to oxygen tolerance and metabolic capacities.
Fig. 6: Colonization of metastases by tumor-associated bacteria is related to oxygen tolerance and metabolic capacities.The alternative text for this image may have been generated using AI.
Full size image

A In the ORIEN cohort, relative abundance comparison to primary tumors showed metastasis depletion of Fusobacterium spp., Bacteroidaceae, and Bifidobacterium spp., among others, while Enterobacteriaceae and Bacillus spp. were unchanged or enriched. B A similar analysis in the UIHC cohort identified similar patterns, except that Fusobacterium was not significantly depleted. C The Protrait database was used for enrichment analysis of bacterial functions based on taxonomy and relative abundance in the ORIEN cohort. Anaerobic bacteria were depleted in metastases, as well as H2S-producers and bacteria with the capacity to utilize certain sugars. The findings may reflect key selective pressures in the metastasis microenvironment, which tends to be oxygenated and likely lacks the diet-derived carbon sources available to bacteria in the gut. D Picrust2 predicted pathway analysis of the UIHC cohort identified modest enrichment of nucleic acid metabolism and peptidoglycan-related pathways in metastasis microbiomes. No pathways were significantly enriched in primary tumors.

Discussion

The main finding of this study is that primary CRC tumor microbiome diversity, gene function capacities, and microbial cliques are predictive of metastasis-free survival. While most prior detailed mechanistic investigations have focused on the microbiome’s contributions to initiating carcinogenesis, we highlight that human tumor-resident bacteria also associate with disease progression and can help predict the critical clinical outcome of distant metastasis. Several pathobiont taxa, including pks + E. coli and ETBF, are known to promote carcinogenesis and, in limited studies, have been shown to promote epithelial-mesenchymal transition15,16,31, are among the bacteria enriched in primary tumors that progress to distant metastasis. In a recent global study of CRC mutation signatures, colibactin-associated mutations showed regional variability and enrichment in early-onset colorectal carcinoma10. Colibactin produced by pks + E. coli directly influences cancer cell genetics by promoting dsDNA32, which may contribute to metastasis risk25. Pks + E. coli influence cancer-related signaling pathways and microRNAs that are linked to epithelial-mesenchymal transition32,33. Indirect mechanisms of pathobionts promoting metastasis through effects on the tumor microenvironment have been described, such as pks + E. coli-mediated induction of cellular senescence, promotion of growth factor release, and tumor progression34. Colibactin-producing E. coli promote the emergence of cancer stem cell populations that are resistant to chemotherapy32. ETBF can promote CRC metastasis through microRNA and epigenetic modifications in cancer cells35. B. fragilis can promote chemoresistance through host Notch1 signaling, independent of enterotoxin36. The capacities of these pathobionts to mutagenize host cells, stimulate pro-proliferation and -invasion signaling pathways, promote chemoresistance, and manipulate tumor immune cells likely contribute to metastasis risk. Thus, some mechanistic insight for bacterial promotion of CRC metastasis already exists. However, further study is needed to fully understand these mechanisms and identify causal relationships between other bacteria and metastasis. Ongoing efforts to inhibit B. fragilis enterotoxin and colibactin biosynthetic enzymes highlight a promising opportunity to target pathobionts in CRC11,37,38.

F. nucleatum is one of the most highly studied CRC pathobionts that promotes metastasis in animal models through mechanisms of host cell signaling and non-coding RNA modulation events linked to the epithelial-mesenchymal transition3,39,40. Our study did not find associations between F. nucleatum abundance or qPCR detection in tumor microbiomes and MFS outcomes. In fact, initial linear discriminant analysis indicated higher F. nucleatum in the >5 yr MFS group (Fig. 2A). However, further analysis determined that locally less advanced tumors (low T stage) tended to have more F. nucleatum, confounding the association with long-term MFS. This issue did not affect the UIHC cohort because of stage matching. Thus, tumor F. nucleatum abundances and/or functions may change during primary tumor progression, making associations to metastasis outcomes complex. Specific sub-species Fusobacterium taxa (the Fna C2 clade) have been implicated as key colonizers of human CRC41, and this level of granular genetic diversity is not measured with our 16S metagenomics approach. Differences between human CRC biology and animal models of metastasis, such as intact tumor immunity and differing microbiome communities, could also underlie the lack of a detectable F. nucleatum and MFS association in our study.

Our findings indicate that primary tumor microbiomes, pathobiont qPCR markers and specific microbial cliques have prognostic value for metastasis risk, independent of the pathologic stage at resection42. The pathobiont-containing species E. coli and B. fragilis are central to the predictive microbiome features in three parts of the analysis: 1) linear discriminant analysis by MFS outcomes, 2) pathobiont strain detection, and 3) their inclusion in a highly MFS-predictive microbial clique. These features may be useful prognostic markers for determining post-resection surveillance and decisions to treat stage II CRC with adjuvant therapy. While tumor microbiome metagenomics is not part of current clinical practice, detection of specific metastasis risk-related pathobionts such as pks + E. coli and ETBF by qPCR and/or in situ hybridization in CRC pathology specimens is imminently feasible as a clinical test43. Further study is needed to develop these tests and determine their potential clinical utility as precision oncology markers.

Tumor microbiome profiling requires tissue biospecimens from invasive procedures, and fecal biospecimens are a potential alternative for prognostic tests. Our findings of metastasis outcome prediction by tumor microbiomes differ from a recent meta-analysis of 3741 fecal microbiomes, which found no alpha or beta diversity associations with clinical stage (including distant metastases)44. Comparative studies have demonstrated differences between CRC tumor microbiomes, adjacent normal mucosal microbiomes, and the fecal microbiome, likely related to differences in the respective microenvironments45,46,47. Thus, diversity of the microbiome localized to tumors may be more predictive for this particular outcome.

The second significant finding of this study is that CRC metastasis microbiomes have a composition that is related to the primary tumor but with lower alpha diversity and depletion of specific bacterial traits. Microbiome capacities for carbohydrate utilization and anaerobic electron transport in tumor microbiomes are associated with long-term metastasis-free survival. One potential explanation for lower anaerobe abundances in metastasizing tumors could be greater oxygen content (less hypoxia) within the tumor microenvironment that selects against anaerobes. Bacteria in beneficial tumor microbiomes (patients with long-term MFS) have a high capacity for specific carbohydrate utilization, indicating the potential for dietary interventions to influence metastasis risk. The tissue microenvironments in the liver and lungs likely exert selective pressures on the CRC-associated microbiome due to factors such as high oxygen tension, less diverse food-derived carbon sources, and distinct immune cell populations. Prior research indicates that bacteria within metastases can influence tumor growth and response to therapy3,6,48. While the prior research in this area has focused almost exclusively on F. nucleatum, we find that Fusobacteriaceae and several other pathobiont-containing taxonomic groups are depleted in metastases compared to the primary tumor from the same patient. Understanding the drivers of microbiome composition in distant metastases may provide insight into bacterial roles in disease progression and highlight potential interventions such as dietary changes and selective antibiotics directed at tumor microbiomes.

An outstanding goal in the field of tumor microbiome studies is intervention with microbiome-targeted therapies. Our study suggests several avenues for further investigation toward this long-term goal. First, antimicrobials or biosynthetic gene inhibitors directed at some of the known CRC carcinogenic pathobionts (colibactin-producing E. coli and ETBF)11,38 are likely to also favorably impact disease progression. The differential capacity for carbohydrate and amino acid metabolism in primary tumors that metastasize suggests that it may be possible to manipulate the tumor microbiome with dietary (prebiotic) interventions. Dietary patterns such as the sulfur microbial diet are known to modify both the fecal microbiome and the risk of CRC49, but little is currently known regarding dietary effects on tumor-resident bacteria and disease progression. Finally, while this study emphasizes pathobionts, tumor bacteria associated with long-term MFS, such as B. thetaiotaomicron and Prevotella spp. (Fig. 2B) are candidate probiotics that can be further studied for metastasis prevention. Although neither taxonomic group has been extensively studied in CRC animal models, prior research suggests that short-chain fatty acids produced by B. thetaiotaomicron and the presence of Prevotella may suppress tumor growth and/or enhance response to chemotherapeutics50,51,52.

All tumor microbiome studies work from low bacterial biomass specimens, which, together with non-sterile handling of pathology specimens, increases the risks of contamination by exogenous bacterial DNA compared to fecal microbiome studies. These risks are mediated by the inclusion of controls and bioinformatic decontamination approaches (see methods). Although some degree of contamination undoubtedly remains, it is expected to be biased toward the null hypothesis (high variability, no difference between groups) because the study design compares biospecimens handled in the same way and is not expected to have systematically different contamination risks. Perhaps the greatest strength of this study is the use of two independent cohorts with different methodologies (RNAseq and 16S rDNA metagenomics), yielding highly consistent findings. As a retrospective biospecimen study, the results are primarily correlative and, in isolation, do not allow conclusions about the causal roles of the microbiome in metastasis. However, we highlight several organisms with known mechanistic causal roles in CRC carcinogenesis and progression that also emerged in our study, supporting the conclusion that some of the correlations do have an underlying causal relationship.

Methods

Human subject cohort and study design

All human subjects research conforms to the standards of the Declaration of Helsinki. Two subject cohorts were constructed, an Oncology Research Information Exchange Network (ORIEN) discovery cohort consisting of 1003 colorectal carcinoma biospecimens from 900 patients (103 with paired primary tumor and metastasis), and a University of Iowa Health Care (UIHC) cohort of 100 patients with colorectal carcinoma biospecimens. ORIEN is an alliance of US cancer centers established in 2014 using a standardized protocol for prospective clinical and biospecimen data collection with patients’ informed consent. Eligible patients had a tumor RNAseq dataset from at least one CRC biospecimen (primary tumor, metastasis, or both) and complete 5-year survival data (either metastasis within 5 years or at least 5 years of metastasis-free follow-up). The primary outcome was metastasis-free survival in 3 categories: 1) metastasis pre-existing at the time of biospecimen collection, 2) metastasis 0–5 years after collection, and 3) no metastasis in >5 years of follow-up. From 1315 potential candidates, 900 remained after application of the inclusion criteria. All data used in this study are de-identified with coded individual subject indicators. For the ORIEN cohort, clinicopathologic data (Table 1) were combined with microbiome relative abundance data derived from primary tumor and metastasis RNAseq with the exotic2 pipeline for processing, contaminant removal, and normalization53. Briefly, it accepts unaligned reads from tumor RNA-seq human gene expression processing (hg38 reference genome aligned with Bowtie254) and re-aligns them to the T2T-CHM13v2.0 human reference with STAR55. Unaligned reads are aligned to a custom KrakenUniq database containing finished bacterial genomes, high-quality fungal and viral genomes, as well as the T2T human reference genome and the Univec contaminants database as additional filters. Contaminants are filtered using decontam56, correlating the RNA concentration to the library preparation step, and removing common contaminants in sequencing preparations57.

The UIHC cohort was assembled using a pathology database under a protocol approved by the University of Iowa Institutional Review Board (#202212095). Fifty patients with paired primary tumor and metastasis resections (cases) were selected. Controls with only a primary tumor resection biospecimen and at least 5-year metastasis-free survival were matched 1:1 with cases on pathologic T/N stage, age, sex, and exposure to neoadjuvant therapy. Patients with less than 5 years of follow-up and no known metastases were excluded. Eligible participants had formalin-fixed paraffin-embedded (FFPE) pathology biospecimens and slides available and complete clinicopathologic data (Table 2). Starting from 200 potential controls, 57 were excluded due to incomplete clinicopathologic data or follow-up, and 50 were selected for optimal matching to cases. Clinicopathologic data were assembled for this cohort (Table 2) using Iowa Pathology databases and the electronic health record (Epic). A board-certified gastrointestinal pathologist reviewed slides to confirm all pathologic features and select tissue at a non-neoplastic mucosal margin, the primary colon tumor, and distant metastases. Several features of this cohort were designed to complement the discovery cohort: 1) orthogonal methods for microbiome profiling using 16S rDNA sequencing and strain-specific qPCR from FFPE biospecimens, and 2) control of the known potential microbiome confounders of pathologic T/N stage and neoadjuvant therapy using matching.

UIHC cohort 16S metagenomic sequencing

The selected FFPE tissue was dissected from glass slides and total DNA was extracted (QIAgen FFPE Advanced UNG kit). 16S rDNA amplification, library generation, and sequencing were performed as previously described58. Briefly, the V3-V4 region of the 16S rRNA gene was amplified using primers 5’- TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCCTACGGGNGGCWGCAG-3’ and 5’-GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGACTACHVGGGTATCTAATCC-3’. Indexing and library construction were performed using a second round of PCR. Multiplexed samples were sequenced by paired-end sequencing on a MiSeq instrument (Illumina). The data were processed using QIIME 259. Sequences were demultiplexed, and quality profiles were visualized using demux and summarize functions. The DADA2 pipeline60 was used for sequence quality control and feature table generation. Phylogenetic trees were generated using the QIIME 2 phylogeny function. Alpha and beta diversity metrics with group significance statistics were calculated using the q2-diversity plugin in QIIME 2. Amplicon sequence variants were classified using a classifier trained on the Greengenes database (gg_12_8)61.

In contrast with fecal metagenomics, microbiome profiling from formalin-fixed paraffin-embedded tissue (FFPE) presents unique challenges. Despite this, prior research demonstrates that bacterial communities can be accurately profiled from this matrix62,63,64,65. Among the challenges are a predominance of human DNA extracted from tissue, DNA damage and cross-linking related to formalin fixation, and the potential for contaminating bacterial DNA during non-sterile specimen handling. Negative control samples included tissue-free FFPE blocks and reagents only. We applied several additional “decontamination” approaches: 1) exclusion of reads mapping to taxonomic groups that are known contaminants from FFPE processing at our institution, identified by a previous 10 subject intra-individual FFPE and fresh frozen tissue comparison, 2) statistical identification of likely contaminants with the decontam package in R56, and 3) exclusion of taxonomic groups that are not identified in at least 1% of high-quality fecal metagenomes from the Human Microbiome Project66 as measured with MetaPhlAn67. After decontamination, there was a median of 14,361 high-quality reads per sample (interquartile range 7048–23,975), comparable sequencing depth to previous studies of colon FFPE samples63.

Pathobiont gene quantitative PCR

Multiple qPCR primer pairs were designed for targeting F. nucleatum nusG and 16S rDNA, B. fragilis enterotoxin gene bft1, and E. coli pks locus genes clbB and cnf1. Total bacteria were quantified with primers targeting the V7 region of 16S rDNA, and human DNA using the actin gene. Primers were bioinformatically filtered for off-target effects using BLAST, NCBI databases, and a large collection of metagenome-assembled genomes (MAGs)68. qPCR was performed using SYBR green and thermal melt analysis to confirm the expected amplicon. Control reactions were performed on DNA extracted from pure pathobiont cultures. Pathobiont detection in biospecimens was defined as a CT value of less than 35 cycles and a thermal melt curve like pure culture controls. Detection rates were compared using chi-square testing. Relative quantities were calculated using the 2-ΔCt method69 with total bacteria (16S) as the referent. All qPCR primer sequences are in Supplementary Table 1.

Microbiome data analysis and statistics

The analysis was conducted in two primary stages. The first stage includes analyses with overall diversity measures, whereas the next stage includes identifying a group of microbes, or microbial cliques, associated with the 5-year metastasis-free survival (MFS) outcome. The alpha and beta diversity metrics to be analyzed were selected before the study. Alpha diversity was quantified with Shannon’s index, Simpson’s index, and Chao1. Beta diversity was quantified with unweighted UniFrac distance, Bray-Curtis dissimilarity, and principal component analyses of both metrics. For RNAseq-derived taxonomic abundances (ORIEN), a random taxonomy-based tree was used where required; for 16S rDNA data (UIHC), a tree was constructed from the alignment of sequence clusters. Statistical testing for alpha diversity differences by site and MFS was 2-way ANOVA. Beta diversity differences were identified by the adonis2 test in R for the ORIEN dataset and the PERMANOVA test in QIIME2 for the UIHC dataset. Differential taxon abundance testing by linear discriminant analysis (LDA) was performed with LefSe using a cutoff LDA score of >2.028. For paired primary tumor and metastasis biospecimens from the same subject, paired Wilcoxon tests were applied to alpha diversity metrics, and UniFrac distances were compared using Kruskal-Wallis tests. Gene function and bacterial trait analysis of the ORIEN dataset was performed by mapping ~1300 detected bacteria and their relative abundances to the Protraits database30. Statistical testing for trait enrichment was Mann-Whitney tests, corrected for multiple comparisons with a false discovery rate of 0.01. Gene function and pathway enrichment analysis of the UIHC 16S rDNA data was sequence-based using PICRUSt270 and LDA. Clinicopathologic data comparisons were Chi-square or Fisher’s exact tests for categorical data and Kruskal-Wallis or Student’s t-test for numerical data. Analysis outputs were visualized using R, QIIME 2 View (view.qiime2.org), and GraphPad Prism 9 (La Jolla, CA, USA).

To discover a microbial clique associated with the outcome with a combination of an interpretable machine learning tool in a regression-based framework called the microbial co-occurrence analysis (MiCA)71,72. Identifying a microbial clique consisting of multiple microbes is challenging because of the inherently large and computationally expensive combinatorial search space. MiCA utilizes a tree-based interpretable machine-learning tool to discover the most prevalent combination of microbes based on their relative abundance. Then, using a quantile-based threshold-finding algorithm creates a predictive rule associated with the outcome. Based on the relative abundances of the most frequently co-occurring microbes, this predictive rule intuitively creates a mathematical equivalence of a microbial clique. Lastly, we use a covariate-adjusted regression to estimate the effect of this microbial clique. Since machine-learning tools naturally tend to overfit, we trained the tree-based algorithm in MiCA using 65% of the data (n = 769). Then the remaining 35% of the data (n = 416) was used to validate the identified microbial clique. This two-stage procedure ensured the robustness of the identified microbial clique. The regression part of the MiCA algorithm incorporates multiple covariates, including age at diagnosis, reported sex, race, neoadjuvant therapy (yes/no), clinical stage group at biospecimen collection, and pathologic stage group. The dominant clique identified includes Escherichia coli, Bacteroides fragilis, and a Pseudomonas sp. ASV. The Kraken database and algorithm (the highest stringency parameter for taxon assignment was used) assigned this ASV as Pseudomonas yamanorum, which is neither a common gut commensal nor a common environmental contaminant. BLAST of several sequences from the taxonomic unit showed 100% identity to P. yamanorum genomes and >90% identity to at least five other Pseudomonas spp. Since we believe this could be an over-call of the species-level classification by Kraken, we refer to it as a Pseudomonas sp. in the results.