Introduction

Colorectal cancer (CRC) is a defining challenge of modern oncology, ranking among the top three most diagnosed malignancies and the second leading cause of cancer death worldwide1,2. Nearly two million new cases occur each year, and while screening has stabilized incidence in high-income regions, rates are accelerating across low- and middle-income countries3,4,5. In Africa alone, CRC prevalence is expected to more than double by 2040, reflecting shifts in diet, urbanization, and population aging6,7,8. These global trends demand new frameworks that integrate environmental, microbial, and molecular determinants of disease.

Among these determinants, the gut microbiome has emerged as a dynamic ecological system associated with CRC risk. The intestinal microbiota maintains metabolic homeostasis, epithelial integrity, and immune balance, yet disruption of this equilibrium, termed “microbial dysbiosis”, has been linked to inflammation, genotoxicity, and tumor progression9,10,11,12,13,14. Multiple bacterial species, including Fusobacterium nucleatum, enterotoxigenic Bacteroides fragilis, colibactin-producing Escherichia coli, Parvimonas micra, and Hungatella hathewayi, exemplify how microbial colonization, toxin secretion, and immune modulation may influence CRC-associated pathology15,16,17,18. Consistently, CRC-associated microbiomes display enrichment of oral-derived pathobionts and depletion of butyrate-producing commensals such as Faecalibacterium and Roseburia, reflecting broader ecological reorganization19,20,21,22.

Importantly, microbial function, not only composition, appears deeply relevant to CRC biology. Virulence factors (VFs) and antimicrobial-resistance genes (ARGs) provide insight into how microbes can persist within, or adapt to, the tumor microenvironment. VFs contribute to adhesion and barrier disruption, while an expanded resistome may support survival under oxidative stress, inflammation, or chemotherapeutic pressure23,24,25,26,27. These functional attributes suggest that CRC-associated microbiota occupy a selective niche shaped not merely by taxonomic shifts but also by metabolic and defensive capacity28.

Despite this evidence, most CRC-microbiome studies remain taxonomic, reference-dependent, and cross-sectional, limiting reproducibility and hindering functional interpretation. Assembly-based, multi-cohort frameworks that integrate functional inference with ecological modeling remain scarce.

Here, we present a multi-cohort, assembly-based metagenomic analysis of more than 500 samples spanning healthy, adenoma, and carcinoma states. Our pilot investigation focuses on the functional ecology of the CRC microbiome by combining de novo assembly with Double Machine Learning (DML) to identify latent statistical components in the resistome–virulome space. Rather than proposing predefined ecological categories, we examine whether reproducible functional gradients emerge across geographically distinct cohorts and explore how these gradients relate to CRC status. Our goal is to provide an initial statistical and ecological framework for understanding how functional gene patterns, particularly virulence- and stress-associated pathways, vary across CRC progression.

Results

Data processing and quality assessment

Rigorous preprocessing eliminated host and contaminant sequences, yielding high-quality assemblies across all cohorts (Figure S1a-i, Data S1). Cohort 3 produced the most contiguous assemblies, whereas Cohort 1 exhibited greater fragmentation, consistent with variation in sequencing depth (Figure S1a-ii, Data S1). Batch correction effectively minimized inter-cohort bias while preserving biological structure. For instance, species-level variation among clinical groups remained significant (PERMANOVA R² = 0.018, p = 0.001), whereas functional (ARG/VF) profiles showed no residual batch structure (Figure S1b-i, S2a; Data S2). These quality-control steps ensured that downstream ecological and functional analyses reflected biological rather than technical variation. The overview of the study design and analytical workflow is shown in Fig. 1.

Fig. 1: Overview of study design and analytical workflow.

Publicly available CRC metagenomic datasets from three international cohorts were processed through a harmonized pipeline. Following host read removal, decontamination, and de novo assembly, functional profiling was performed to extract antimicrobial resistance (ARG) and virulence factor (VF) gene matrices, followed by normalization and batch correction. Machine-learning classification was used to identify discriminative taxa and ecological signatures across disease stages, while Partial Least Squares–Double Machine Learning (PLS-DML) analysis was applied to the ARG/VF dataset to model directional associations with CRC progression. Validation included bootstrap resampling, external cohort testing, and ecological interpretation through the Dual-Axis Index (DAI), which distinguished two major microbial strategies: pathogenic expansion (red) and protective adaptation (blue).

Microbial community ecology across CRC stages

Microbial diversity remained largely stable across disease stages, with no significant differences in alpha diversity (Richness p = 0.51; Shannon p = 0.11; Simpson p = 0.13) (Fig. 2a; Data S2). However, beta-diversity analyses revealed subtle but consistent restructuring of community composition (PERMANOVA R² = 0.018, p = 0.001) across healthy, adenoma, and cancer groups (Fig. 2b). These differences reflected genuine ecological reorganization rather than dispersion effects (ANOVA F = 0.96, p = 0.39) (Data S2).

Fig. 2: Ecological organization and reproducibility of colorectal cancer–associated gut microbiomes.

a Alpha-diversity metrics across clinical stages showing comparable within-sample diversity (i, Richness; ii, Shannon; iii, Simpson). b Non-metric multidimensional scaling (NMDS, Euclidean distance) reveals modest but significant compositional restructuring among Healthy, Adenoma and Cancer groups. c Reproducibility of CRC-associated taxa: i, core species enriched across cohorts; ii, Venn comparison of taxa shared with reference CRC datasets16,29; iii, clustered heatmap showing consistent stage-associated abundance profiles (Z-score). d Co-occurrence networks of reproducible taxa highlighting core commensals (R. intestinalis, A. hallii, B. wexlerae) as major ecological hubs forming stable, functionally cohesive modules across CRC progression.

Core species were highly conserved across cohorts (> 99% overlap) and dominated by commensals such as Anaerobutyricum hallii, Roseburia intestinalis, and Blautia wexlerae, suggesting metabolic resilience despite disease progression (Fig. 2c-i; Data S2). Peripheral taxa accounted for the majority of diversity (approximately 85%), highlighting adaptive turnover along the CRC continuum. Comparison with previously published CRC datasets16,29 confirmed the reproducibility of our results, identifying 64 overlapping taxa (20.9%) shared with reference studies. These included Parvimonas micra, Hungatella hathewayi, and Bacteroides fragilis, which were consistently enriched in cancer, as well as A. hallii, B. wexlerae, and Eggerthella lenta, which were more abundant in controls (Fig. 2c-ii, iii; Data S2). Importantly, these overlapping species correspond to canonical CRC markers reported in the GMrepo v2 database30, further validating the cross-cohort reproducibility of our findings. Network analysis additionally identified R. intestinalis as a central hub associated with stage-specific microbial modules (Fig. 2d). Collectively, these findings reveal a stable commensal core accompanied by a dynamic peripheral microbiome undergoing progressive ecological restructuring across CRC progression.

Stage-specific microbial indicators and predictive features

Indicator-species analysis identified 42 taxa significantly associated with clinical stages (adj. p ≤ 0.05) (Fig. 3a). Health-associated microbiomes were enriched in Ligilactobacillus ruminis, Faecalibacillus intestinalis, and Clostridium thermarum, whereas Bacteroides sp. CACC 737 and B. cellulosilyticus increased in adenoma and cancer. Novosphingobium sp. 9 and Bifidobacterium longum were distinguishing indicators between cancer and healthy samples (Data S3).

Fig. 3: Indicator and predictive species defining stage-specific microbiome signatures.

a Indicator-species analysis (IndVal.g) identifies taxa significantly associated with Healthy, Adenoma and Cancer stages (Benjamini-Hochberg-adjusted p ≤ 0.05). b Performance of supervised classifiers discriminating CRC stages from species-level abundance profiles. Models (Random Forest i, and Gradient Boosting ii) were trained as multi-class classifiers. ROC curves were generated using the standard one-vs-rest (OvR) extension of ROC analysis, where a separate ROC curve is computed for each class based on its predicted probability scores. iii Consensus feature ranking across models highlights robust discriminative taxa, including A. hallii, Blautia hansenii, Butyrivibrio fibrisolvens, and R. intestinalis.

Supervised learning further supported these stage-specific patterns. Among all models, Gradient Boosting showed the highest overall performance (mean accuracy = 0.69; balanced accuracy = 0.61; MCC = 0.48) and consistently strong discrimination across clinical groups (AUC = 0.71 for cancer, 0.58 for adenoma, and 0.73 for healthy; Fig. 3b-i, ii; Figure S1c-i). Random Forest also performed relatively well (mean accuracy = 0.66; MCC = 0.43), but with slightly lower discrimination compared to Gradient Boosting. Consensus feature ranking of the two models identified Prevotella copri, Sellimonas intestinalis, Butyrivibrio crossotus, and Faecalibacterium prausnitzii among the most predictive taxa, together with several other species overlapping with already established CRC markers (Fig. 3b-iii; Data S3). These findings indicate a reproducible microbial signature that discriminates disease stages and captures coherent ecological shifts along the CRC trajectory.
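As an illustration of the one-vs-rest (OvR) evaluation described above, the following minimal Python sketch computes one AUC per clinical class from per-class predicted probabilities. It uses the rank-based definition of AUC (the probability that a positive sample outscores a negative one, with ties counted as one half). The function names and the six example samples are hypothetical and are not taken from the study data; this is not the authors' evaluation code.

```python
def binary_auc(scores, is_positive):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    positives = [s for s, flag in zip(scores, is_positive) if flag]
    negatives = [s for s, flag in zip(scores, is_positive) if not flag]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def one_vs_rest_auc(probabilities, labels, classes):
    """Compute one AUC per class, treating that class as positive and all others as negative."""
    result = {}
    for index, cls in enumerate(classes):
        class_scores = [row[index] for row in probabilities]
        result[cls] = binary_auc(class_scores, [label == cls for label in labels])
    return result


# Hypothetical predicted probabilities (columns follow the order of `classes`).
classes = ["Healthy", "Adenoma", "Cancer"]
labels = ["Healthy", "Healthy", "Adenoma", "Adenoma", "Cancer", "Cancer"]
probabilities = [
    [0.7, 0.2, 0.1],
    [0.5, 0.3, 0.2],
    [0.4, 0.4, 0.2],
    [0.5, 0.2, 0.3],
    [0.1, 0.3, 0.6],
    [0.3, 0.3, 0.4],
]
aucs = one_vs_rest_auc(probabilities, labels, classes)
for cls in classes:
    print(cls, round(aucs[cls], 4))
```

In this toy example the adenoma class is the hardest to separate, which mirrors the qualitative pattern reported above, although the numbers themselves are purely illustrative.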

Functional variation across host and geographic contexts

To focus on biologically interpretable functional variation, we quantified the total abundance (“load”) of antimicrobial-resistance (ARG) and virulence-factor (VF) genes per sample rather than re-estimating diversity indices. Total ARG and VF loads did not differ significantly among disease stages (p > 0.05) (Fig. 4a, b; Data S3), indicating that CRC progression alone does not substantially alter the overall functional capacity of the microbiome. Instead, ARG and VF variation showed stronger associations with host and geographic factors: body-mass index (BMI) correlated negatively with both ARG (ρ = −0.26, p < 0.05) and VF (ρ = −0.21, p < 0.05) loads, and ARG abundance varied across countries (p = 0.001), with higher values in Austria than in France (p = 0.006) or Germany (p = 0.017). No associations were observed with age or sex (Data S3), suggesting that resistome and virulome magnitude covaries primarily with host physiology and environmental context rather than tumor stage.

Fig. 4: Functional architecture of antimicrobial-resistance and virulence signatures in CRC.

a, b Total antimicrobial-resistance gene (ARG) and virulence-factor (VF) loads across Healthy, Adenoma and Cancer groups show no global differences in abundance. c Bootstrapped PLS-Double-Machine-Learning (PLS-DML) effects showing component-level average treatment effects (ATEs). d Sensitivity of estimated effects to unmeasured confounding quantified by E-values; PLS₃ shows greatest robustness. e Directed acyclic graph (DAG) illustrating assumptions linking ARG/VF abundance (indicated in the figure as Microbial_abundance) to CRC stage while adjusting for age, BMI and sex. f Gene co-occurrence network highlighting interactions between dominant ARG and VF modules along the CRC continuum. Red node = PLS₁ and green node = PLS₃. g External-validation ROC curve showing moderate discrimination of the derived functional signatures (AUC = 0.603). h Directional concordance of PLS₁/PLS₃ genes between internal and external datasets indicating consistent trends. i Power analysis for PLS₁ and PLS₃ components demonstrating statistical power > 80% at current sample size. j Gene-level ANOVA-based power estimates confirming adequate detectability for the most robust genes under about 100 samples per group.

Directional modeling of functional axes

To explore potential functional patterns associated with CRC progression, we applied a cross-validated Partial Least Squares-Double Machine Learning (PLS-DML) framework controlling for age, BMI, and sex. Ten components were derived, of which two captured dominant functional axes: PLS₁, enriched for virulence and resistance functions, and PLS₃, reflecting protective, functionally adaptive traits (Fig. 4c–e; Data S3).

PLS₁ showed a modest positive average treatment effect (ATE = 0.004; 95% CI = –0.023 to 0.030), whereas PLS₃ displayed a negative association with CRC risk (ATE = –0.038; 95% CI = –0.092 to 0.015) and the highest robustness to unmeasured confounding (E-value = 1.24). Gene-loading analysis identified virulence genes (eptA, irp1, entE) within PLS₁ and protective elements (Cdif_EFTu_ELF and Ecol_EFTu_KIR, EC-5) within PLS₃ (Figure S2e, f). Functional enrichment showed that PLS₁ was dominated by exotoxin, adherence, biofilm-formation, and RND efflux mechanisms, while PLS₃ featured effector-delivery, immune-modulation, and target-alteration pathways (Data S3). Post-hoc tests confirmed that virulence-linked functions such as biofilm formation, adherence, and efflux pumps were significantly elevated in cancer compared with the other groups (p < 0.05).
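For readers unfamiliar with E-values, the short Python sketch below shows the standard closed-form expression for a point estimate on the risk-ratio scale, E = RR + sqrt(RR × (RR − 1)), with protective estimates inverted first. How the component-level ATEs reported here were converted to a risk-ratio scale is not stated in this text, so the inputs used below are arbitrary illustrative values rather than study results.

```python
import math


def e_value(risk_ratio):
    """E-value for a point estimate expressed as a risk ratio."""
    # Protective estimates (RR < 1) are inverted before applying the formula.
    rr = risk_ratio if risk_ratio >= 1.0 else 1.0 / risk_ratio
    return rr + math.sqrt(rr * (rr - 1.0))


# Arbitrary illustrative risk ratios (not taken from the study).
for rr in (1.0, 1.5, 2.0, 0.5):
    print(rr, round(e_value(rr), 3))
```

An E-value of 1 means that no unmeasured confounding is needed to explain away the estimate; larger values indicate that a stronger confounder would be required.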

Network topology further distinguished the two axes: PLS₁ hubs (stgC, Kpne_phoQ_CST, ygeH, fliA) formed cooperative resistance-virulence clusters, whereas PLS₃ hubs (upaH, tet(37), TolC, bacA) organized into cohesive protective networks (Fig. 4f, Data S3). As an exploratory proof-of-concept, we assessed whether the two stable functional axes identified by PLS-DML (PLS₁ and PLS₃) were sufficient to capture disease-associated functional variation. A minimal logistic model using only these two components achieved an out-of-fold ROC-AUC of 0.68 (95% CI 0.58–0.78) for distinguishing cancer from non-cancer samples. Risk stratification based on out-of-fold predicted probabilities reflected clinical stage patterns where high-risk assignments were most common in cancer (35.7%), followed by healthy (29.6%) and adenoma (12.5%), whereas low-risk classifications were rare in cancer (7.1%) but more frequent in non-cancer groups (18.5% in healthy; 20.8% in adenoma) (Data S4). These results indicate that the Dual-Axis Index (DAI) summarizes broad functional tendencies associated with CRC status, while its moderate discriminative performance underscores that it should be interpreted as an associative ecological score rather than a diagnostic classifier.

Validation and reproducibility of PLS-DML gene signatures

To assess reproducibility, the stable PLS₁ and PLS₃ gene signatures were first evaluated within the internal dataset and then validated in an independent CRC cohort (Data S4). Logistic regression, identified during benchmarking as the top-performing classifier, achieved strong internal discrimination (AUC = 0.776, MCC = 0.54). Before applying the model externally, we performed a leave-one-cohort-out (LOCO) analysis to examine cross-cohort robustness. Performance differed between the held-out cohorts, with Cohort 2 showing moderate discrimination (AUC = 0.71) and Cohort 3 showing lower discrimination (AUC = 0.46). However, Cohort 1 could not be assessed because only a single sample remained after quality filtering and alignment. Further, when the model was applied to the external validation cohort, it retained moderate discrimination (AUC = 0.603; balanced accuracy = 0.54; F₁ = 0.51; MCC = 0.10) (Fig. 4g; Figure S2g). Subsequently, gene-level concordance showed that 17 of 34 genes, including stgC, irp1, tet(37), upaH, Kpne_phoQ_CST, phoQ, and espX5, preserved consistent directionality across cohorts (Fig. 4h). Although effect magnitudes differed (Mann-Whitney p = 0.002; Wilcoxon p = 0.003), overall trends were conserved, supporting the biological stability of PLS components (Figure S2h-i).
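The notion of directional concordance used above can be made concrete with a small Python sketch: a gene is counted as concordant when its effect has the same sign in the internal and external datasets. The gene identifiers and effect sizes below are hypothetical placeholders, and the function name is an assumption; the study's own effect estimates are in Data S4.

```python
def directional_concordance(internal_effects, external_effects):
    """Fraction of shared genes whose effect has the same sign in both datasets."""
    shared = sorted(set(internal_effects) & set(external_effects))
    # A positive product means both effects point in the same direction.
    same_sign = [g for g in shared if internal_effects[g] * external_effects[g] > 0]
    return len(same_sign) / len(shared), same_sign


# Hypothetical effect sizes for four placeholder genes.
internal = {"gene_a": 0.8, "gene_b": 0.5, "gene_c": -0.4, "gene_d": -0.2}
external = {"gene_a": 0.3, "gene_b": -0.1, "gene_c": -0.6, "gene_d": 0.1}

fraction, concordant = directional_concordance(internal, external)
print(fraction, concordant)
```

Note that this measure ignores effect magnitude, which is why concordance in direction can coexist with the significant differences in magnitude reported above.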

Power analysis indicated robust statistical reliability for PLS₁ (power = 86.2%; Cohen’s f = 0.33) and near-sufficient power for PLS₃ (78.9%; f = 0.30) at the current sample size (N = 107). These results confirm that the PLS-DML gene signatures capture generalizable microbial determinants of CRC risk across populations (Fig. 4i, j; Figure S2d; Data S4).

Overall, these analyses delineate a reproducible ecological and functional continuum of CRC progression anchored by a stable commensal core and dynamically shifting peripheral taxa. The directional modeling framework identifies two consistent microbial strategies that may be linked to the functional reorganization of the gut microbiome during colorectal tumorigenesis. We emphasize that PLS₁ and PLS₃ represent latent statistical components extracted from the resistome–virulome matrix using five-fold cross-validated PLS regression. Their ecological interpretation is therefore post hoc, inferred from the structure of the gene loadings rather than imposed a priori. PLS₁ and PLS₃ capture the dominant axes of covariation in antimicrobial-resistance and virulence-factor profiles after adjustment for age, sex, and BMI, and their loading patterns remained stable when functional categories were permuted, supporting that these components reflect genuine covariation structure rather than annotation artifacts.

For clarity, the descriptors “pathogenic enrichment” and “protective adaptation” are used only as interpretive summaries of the enriched functional signatures: adhesion, efflux, and biofilm-associated genes in PLS₁, and immunomodulatory or barrier-associated elements in PLS₃. These labels are not intended to denote discrete ecological strategies. Instead, PLS₁ and PLS₃ should be understood as reproducible statistical gradients that align with well-reported CRC-associated microbial adaptations without implying mechanistic causality.

Discussion

This study provides a multi-population characterization of the CRC resistome and virulome using an assembly-based metagenomic framework. By reconstructing contigs through de novo assembly rather than relying solely on read-level profiling, our approach captured a broader spectrum of microbial functional diversity and improved the resolution of gene-level analyses, enabling consistent cross-cohort comparison of microbial traits. Integrating these assembly-derived features with batch-aware normalization produced reproducible ecological and functional patterns despite demographic and geographic variability. These findings demonstrate that harmonized assembly workflows can bridge technical gaps that have long limited cross-study comparability in CRC microbiome research.

Across populations, the CRC microbiome showed gradual ecological restructuring rather than abrupt community turnover. Core commensals such as A. hallii, R. intestinalis, and B. wexlerae remained conserved across disease stages, while opportunistic taxa including P. micra, H. hathewayi, and B. fragilis expanded progressively along the adenoma–carcinoma sequence. These species, consistently identified in global meta-analyses16,22,29, highlight the universality of CRC-associated microbial shifts across populations and sequencing platforms. Our harmonized cross-cohort framework reproduced these patterns in three independent datasets, confirming that CRC-related microbial transitions are ecologically stable and globally reproducible. The enrichment of oral-derived pathobionts, particularly P. micra, F. nucleatum, and H. hathewayi, together with the depletion of butyrate-producing commensals, is consistent with a continuous microbial gradient from health to malignancy accompanied by the gradual colonization of oral anaerobes.

Together, these compositional shifts imply not only taxonomic restructuring but also deeper functional adaptation of the CRC microbiome. Accordingly, we investigated the resistome and virulome to determine how antimicrobial resistance and virulence traits contribute to microbial persistence in the tumor-associated niche. Fredriksen et al. (2023) reported mild but consistent ARG enrichment in CRC despite limited antibiotic exposure, attributing it to inflammation-driven selection31. Our results align with this interpretation, as CRC-associated communities exhibited elevated efflux and biofilm potential, traits consistent with persistence under oxidative and inflammatory stress. This suggests that resistome expansion in CRC reflects adaptation to the tumor niche rather than direct antibiotic pressure.

Functionally, our analysis reveals two opposing microbial strategies that organize CRC-associated communities. Using a cross-validated PLS-DML framework, we identified a pathogenic enrichment axis (PLS₁) characterized by virulence and resistance mechanisms, and a protective adaptation axis (PLS₃) defined by immunomodulatory and barrier-stabilizing functions. These axes represent a directional ecological gradient rather than binary dysbiosis. Mechanistically, genes such as eptA and phoQ remodel bacterial surfaces to resist antimicrobial peptides and oxidative stress32,33,34, while irp1 and entE mediate siderophore-driven iron acquisition, thereby promoting microbial survival under tumor-associated hypoxia35,36,37,38. Conversely, upaH and bacA, which are central to the protective axis, have been reported to enhance mucosal adhesion and membrane integrity, which could help maintain commensal stability in inflammatory environments39,40,41. The coordinated behavior of these antagonistic gene networks (for example, stgC-phoQ-ygeH versus upaH-bacA-TolC) captures the dual ecological logic of the CRC microbiome, in which virulent expansion coexists with compensatory symbiotic resilience, reflecting an adaptive equilibrium associated with selective pressures within the tumor microenvironment. Detailed annotations, mechanisms, and species associations for all PLS₁- and PLS₃-associated genes, based on CARD and VFDB classifications, are provided in Supplementary Data S1 and S2. Because PLS-DML derives components that maximize covariance between microbial functions and disease stage, the resulting axes reflect statistical structures, not predefined ecological categories. Accordingly, their interpretation as “pathogenic enrichment” and “protective adaptation” is intended to be conceptual rather than categorical. These descriptors merely summarize the dominant functional themes associated with each axis, for example, adhesion, efflux, and biofilm pathways loading positively on PLS₁, and immunomodulatory or stress-adaptive elements loading on PLS₃.

We do not claim that these components correspond to discrete in vivo ecological strategies. Rather, they represent cohort-independent functional gradients that are consistent with patterns frequently observed across CRC microbiome studies. This distinction is important as the PLS axes provide an interpretable statistical framework for organizing resistome–virulome variation, but they should not be interpreted as mechanistic or causative ecological modes.

Overall, nearly half of the 34 overlapping genes driving these two axes retained consistent directionality across cohorts, supporting their biological stability. Power analysis confirmed adequate statistical strength for both axes, reinforcing that the identified components represent reproducible functional gradients rather than stochastic variation. The robustness of these functional axes across cohorts provides a foundation for comparison with earlier CRC microbiome studies.

In contrast to the stability of gene-level patterns, predictive transferability showed greater variability. The moderate decrease in performance observed during external validation (AUC = 0.603), compared with higher internal discrimination (AUC = 0.776), reflects a well-recognized challenge in microbiome-based prediction: models trained in one population frequently exhibit attenuated performance when applied to independent cohorts. Our leave-one-cohort-out (LOCO) analysis demonstrated the same pattern, with substantial variability among internal cohorts (AUC range: 0.46–0.71), despite the absence of residual batch effects in the functional space. This heterogeneity is consistent with prior multi-cohort CRC studies, which show that differences in sequencing depth, assembly contiguity42, population-specific strain composition43, and environmental exposures (diet, geography, and antibiotic history)16 can markedly influence functional gene recovery and therefore limit cross-cohort generalizability. Given that ARGs and VFs are highly sensitive to such local ecological and technical factors, moderate variability in absolute predictive performance is expected. Importantly, the directional consistency of the top-loading genes across cohorts supports the stability of the underlying PLS components, whereas the observed AUC differences underscore the inherent difficulty of achieving universal transportability in metagenomic classifiers18.

Building on this observed gene-level stability, the virulence-associated patterns identified here are broadly consistent with prior CRC microbiome studies. For example, Yang et al. (2021) reported early enrichment of adhesion- and virulence-related functions in young-onset CRC, while Wirbel et al. (2019) described a global virulence signature dominated by oral pathobionts such as Fusobacterium and Parvimonas16,44. The recurrence of these functional themes across geography, age groups, and analytical frameworks suggests that virulence-associated modules represent a reproducible aspect of CRC-associated microbiomes rather than cohort-specific artifacts. This cross-study consistency complements the directional stability of our PLS-derived components, reinforcing the interpretation that certain functional motifs appear robustly across CRC datasets even when taxonomic compositions vary.

Within this context, the Dual-Axis Index (DAI) serves as a concise, exploratory summary of the two dominant functional gradients captured by PLS₁ and PLS₃. While its discriminative performance is moderate (AUC = 0.68), the DAI provides interpretive value by positioning samples along these functional axes rather than classifying them into discrete diagnostic categories. Higher scores occurred more frequently among cancer samples, whereas lower scores were more common among non-cancer groups. However, the substantial overlap, particularly between healthy and adenoma samples, indicates that functional gene profiles alone cannot reliably resolve early disease states. Accordingly, we do not present the DAI as a diagnostic or prognostic tool. Instead, it offers a structured statistical framework for summarizing resistome–virulome variation and may complement taxonomic, metabolic, inflammatory, or clinical markers in future multimodal approaches18,19,20,22,25. In this respect, the consistently enriched modules identified here (for example, adhesion, iron-associated functions, and immune-interaction pathways) could serve as adjunct functional features to established stool-based assays, including fecal immunochemical tests or F. nucleatum detection45,46,47. Ultimately, any translational application would likely require integration of multiple data layers and validation in prospective designs, with the DAI providing a conceptual starting point for such integrated frameworks.

A major strength of this study lies in its integrative analytical design, combining assembly-based contig reconstruction, functional profiling, and Double Machine Learning (DML) modeling to identify stable microbial features across populations. The contig-level quantification strategy enabled consistent cross-cohort comparison by preserving low-abundance but informative ARG and VF annotations. To ensure analytical reproducibility, each functional gene was counted once per contig and normalized to the total number of ARG/VF annotations within each sample, thereby minimizing biases from sequencing depth, assembly fragmentation, and multi-hit inflation42,48. Retaining the complete functional repertoire avoided over-weighting abundant features and allowed us to capture the full breadth of resistome and virulome variation. Importantly, the cross-fitted PLS-DML components were driven by consistently detected functional modules, such as adhesion, efflux, and siderophore-associated pathways, rather than by rare features more sensitive to sequencing variation49,50,51. The absence of residual batch effects in the functional space (PERMANOVA R² < 0.01), together with directional concordance for approximately half of the high-loading genes in the independent validation cohort, underscores the stability of these functional gradients across populations. Future incorporation of long-read sequencing or prevalence-based filtering may further enhance resolution where needed.
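The contig-level counting rule described above (each gene counted once per contig, then normalized to the sample's total ARG/VF annotations) can be sketched in a few lines of Python. This is a hypothetical illustration: the input format of (contig, gene) pairs, the function name, and the choice to compute the denominator after de-duplication are assumptions, since the exact implementation is not given in this text.

```python
from collections import Counter


def normalized_gene_profile(annotations):
    """Count each gene once per contig, then scale by the sample's total annotations."""
    # De-duplicate repeated hits of the same gene on the same contig.
    unique_hits = set(annotations)
    counts = Counter(gene for _contig, gene in unique_hits)
    # Assumption: the denominator is the de-duplicated annotation total.
    total = sum(counts.values())
    return {gene: count / total for gene, count in counts.items()}


# Hypothetical annotation hits for one sample; geneA is hit twice on contig_1.
hits = [
    ("contig_1", "geneA"),
    ("contig_1", "geneA"),
    ("contig_2", "geneA"),
    ("contig_2", "geneB"),
    ("contig_3", "geneC"),
]
profile = normalized_gene_profile(hits)
print(sorted(profile.items()))
```

Because the duplicate hit on contig_1 is collapsed, geneA contributes two of four annotations rather than three of five, which is the kind of multi-hit inflation the normalization is meant to avoid.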

Several limitations warrant acknowledgment. First, the cross-sectional study design precludes causal inference, and residual confounding by unmeasured factors (such as diet, medication, sampling protocol, or inflammatory comorbidities) cannot be excluded. Second, the absence of genome-resolved binning limits strain-level interpretation of resistance and virulence determinants. Future work integrating long-read sequencing, genome dereplication, and metabolically coupled multi-omic data could refine functional resolution and facilitate investigation of horizontal gene transfer and the evolution of resistance or virulence traits within the CRC microenvironment. Applying this framework to larger, prospectively sampled cohorts and combining functional, taxonomic, metabolic, and immunologic layers will help clarify how resistome–virulome patterns relate to diagnostic potential, host physiology, and broader CRC functional ecology.

In summary, our findings delineate reproducible functional gradients associated with CRC progression, illustrating how microbial communities vary along axes of virulence-associated enrichment and commensal-like functional adaptation. This function-centered perspective provides an interpretable statistical framework for organizing resistome–virulome variation and offers a conceptual bridge between microbiome ecology and emerging translational applications.

Methods

Analytical framework

All analyses were conducted using the GRUMB pipeline, which integrates read processing, assembly, taxonomic profiling, and machine-learning modules within a reproducible workflow52,53. For this study, GRUMB was adapted for CRC microbiomes and extended with a Partial Least Squares (PLS)-Double Machine Learning (DML) framework to model directional associations between microbial functions (ARGs and VFs) and disease stage. The analytical workflow is illustrated in Fig. 1.

Metagenomic cohorts and quality processing

Publicly available shotgun metagenomic datasets (553 samples) were retrieved from the NCBI SRA on 10 November 2024, encompassing three geographically distinct cohorts: Cohort 1 (USA/Canada, n = 84)54, Cohort 2 (France/Germany, n = 316)18, and Cohort 3 (Austria, n = 153)19. These cohorts were selected for their high-quality sequencing data, harmonizable metadata (age, sex, body mass index (BMI), and disease stage: Healthy, Adenoma, Cancer), uniform Illumina sequencing platforms, and consistent read lengths (100–125 bp). As these datasets have been widely used in benchmarking studies, they enable direct comparison with previous CRC microbiome research while minimizing technical heterogeneity that could confound functional inference. Focusing on these well-curated cohorts ensures that observed patterns reflect biological rather than batch-driven variability and provides a robust, reproducible basis for future multi-cohort extensions.

The raw reads obtained from these datasets were then quality-filtered and trimmed using BBTools v37.6255, while host or contaminant sequences were removed with FastQ Screen v0.15.356 against a multi-genome reference. Clean reads were assembled de novo using metaSPAdes v3.15.557 with default parameters, and contigs shorter than 1 kb were discarded. Assembly quality was assessed using QUAST v5.2.058, and outputs were formatted for downstream analyses.

Taxonomic profiling and ecological analysis

Contigs were taxonomically classified using Kraken2 (v2.1.3)59, and abundance estimates were refined with Bracken v2.960 against the standard Kraken2 bacterial database (https://benlangmead.github.io/aws-indexes/k2). Species detected in < 5% of samples and non-microbial taxa were excluded.

Relative-abundance matrices were centered-log-ratio (CLR) transformed after pseudocount regularization61,62 and batch-corrected with empirical-Bayes linear modeling in limma::removeBatchEffect v3.64.163,64, using Project and Center Name as technical covariates. Clinical stage (Healthy, Adenoma, Cancer) was not included as a batch variable to avoid artificial inflation of disease-associated signals. Correction quality was verified by t-SNE visualization and PERMANOVA variance partitioning.
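The CLR step described above can be illustrated with a minimal sketch. The pseudocount value (1e-6) and the toy abundance vector are assumptions made for illustration only; the pseudocount actually used in the study is not stated here.

```python
import math

def clr_transform(abundances, pseudocount=1e-6):
    """Centered log-ratio transform of one sample's relative abundances.

    The pseudocount is an illustrative assumption, not the study's value.
    """
    # Add a pseudocount so that zero abundances have a defined logarithm.
    logs = [math.log(a + pseudocount) for a in abundances]
    # Subtracting the mean log is equivalent to dividing by the geometric mean.
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

# Toy relative abundances for four species in one sample (hypothetical).
sample = [0.60, 0.30, 0.10, 0.0]
clr_values = clr_transform(sample)
```

By construction the CLR values of a sample sum to zero, which is why Euclidean distances computed on them are comparable across samples with different sequencing depth.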

Alpha-diversity (Shannon, Simpson, Richness) and beta-diversity (Euclidean distance, NMDS) were computed in vegan v2.6-1064. Group differences were assessed with Kruskal–Wallis and PERMANOVA, and dispersion with betadisper. Indicator species were identified by IndVal.g (999 permutations; indicspecies v1.8.0) and classified as core (≥ 80%), secondary (50–79%), or peripheral (< 50%). Ecological networks and prevalence intersections were visualized with igraph, ggraph, and UpSetR.
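For readers less familiar with the three alpha-diversity indices, the following sketch shows how they are commonly defined. It assumes the natural-log Shannon index and the Simpson index expressed as one minus the sum of squared proportions; the count vector is hypothetical, and the sketch is not a substitute for the vegan implementation used in the study.

```python
import math

def alpha_diversity(counts):
    """Return (Shannon, Simpson, richness) for one sample's species counts."""
    total = sum(counts)
    # Only species that are present contribute to the indices.
    proportions = [c / total for c in counts if c > 0]
    shannon = -sum(p * math.log(p) for p in proportions)
    simpson = 1.0 - sum(p * p for p in proportions)
    richness = len(proportions)
    return shannon, simpson, richness

# Hypothetical sample: four equally abundant species and one absent species.
shannon, simpson, richness = alpha_diversity([25, 25, 25, 25, 0])
```

For a perfectly even community of four species, the Shannon index equals ln(4) and the Simpson index equals 0.75, which gives a quick sanity check on the definitions.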

Validation of reported CRC markers

Previously reported CRC-associated taxa were compiled from Wirbel et al. (2019) and Piccinno et al. (2025)16,29 to assess cross-study reproducibility. In addition, we consulted the GMrepo v2 database30 to further verify the presence of established CRC biomarkers and confirm that the species detected in our assemblies corresponded to those consistently reported across independent human gut microbiome cohorts. Overlap between the current study’s taxa and reference markers was visualized using a Venn diagram and hierarchical clustering (ComplexHeatmap v2.24.0). Spearman correlation networks (r > 0.5) and Louvain clustering were used to detect conserved co-occurrence modules across clinical stages.

Functional annotation and load analysis

Contigs were aligned to the Comprehensive Antibiotic Resistance Database (CARD)65 and Virulence Factor Database (VFDB)66 using DIAMOND blastx (v2.1.10)67 (e-value < 1 × 10⁻⁵, identity ≥ 80%). To generate reproducible functional profiles across cohorts, each contig-level ARG or VF annotation was counted once and converted into a within-sample relative frequency by scaling to the total number of detected functional annotations. These normalized values were assembled into gene-by-sample matrices with standardized identifiers (ARG_ or VF_) and merged only after strict alignment of sample IDs to their corresponding metadata. Subsequently, group differences were evaluated by Kruskal–Wallis with Benjamini–Hochberg correction, and associations with host factors (age, BMI, sex, country) were assessed by Spearman correlation.
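The within-sample normalization can be sketched as follows. The sketch assumes that "counted once" means each unique contig–gene pair contributes a single count and that ARG and VF annotations share one denominator; both are interpretations rather than confirmed details. The contig and gene identifiers are purely illustrative.

```python
from collections import Counter

def functional_profile(annotations):
    """Convert (contig_id, gene_id) hits for one sample into relative frequencies.

    Assumption: each unique contig-gene pair is counted once, and frequencies
    are scaled to the total number of annotations detected in the sample.
    """
    unique_hits = set(annotations)  # collapse repeated hits of the same pair
    counts = Counter(gene for _, gene in unique_hits)
    total = sum(counts.values())
    return {gene: n / total for gene, n in counts.items()}

# Hypothetical hits for one sample; identifiers follow the ARG_/VF_ prefix scheme.
hits = [
    ("contig_1", "ARG_tetQ"),
    ("contig_1", "ARG_tetQ"),  # duplicate hit, counted once
    ("contig_2", "ARG_tetQ"),
    ("contig_3", "VF_fimH"),
    ("contig_4", "ARG_ermB"),
]
profile = functional_profile(hits)
```

In this toy sample the four unique annotations yield frequencies of 0.5, 0.25, and 0.25, and the profile sums to one regardless of assembly size.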

Machine-learning classification

Species-level classification used scikit-learn68 implementations of Random Forest (RF), Gradient Boosting (GB), Support Vector Machine (SVM), Decision Tree (DT), and Logistic Regression (LR). Data were split 80 : 20 (training : test) with stratified sampling and evaluated by 10-fold cross-validation. Performance metrics included accuracy, balanced accuracy, macro F₁, Cohen’s κ and Matthews Correlation Coefficient (MCC). The best-performing classifiers (RF and GB) were further evaluated using nested cross-validation with grid-search optimization. Feature importance was computed using both Gini and permutation-based methods, averaged over 50 iterations to produce consensus predictor rankings.

To prevent information leakage, all hyperparameter tuning was performed exclusively within the inner loop of the nested cross-validation framework. The outer loop (10-fold) provided an unbiased estimate of model performance, while the inner loop (5-fold) optimized hyperparameters using GridSearchCV. The complete hyperparameter search space for all taxonomic and functional models is provided in Data S3.
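The nested scheme can be expressed compactly with scikit-learn. The sketch below is illustrative only: the data are synthetic, the parameter grid is hypothetical (the study's full search space is in Data S3), and balanced accuracy is used as the single scoring metric for brevity.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic three-class data standing in for a species-by-sample matrix.
X, y = make_classification(
    n_samples=120, n_features=20, n_informative=5, n_classes=3, random_state=0
)

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)   # tuning
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # evaluation

# Hypothetical grid; not the search space reported in Data S3.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=inner_cv,
    scoring="balanced_accuracy",
)

# Each outer fold refits the whole grid search on its own training portion,
# so the held-out outer fold never influences hyperparameter selection.
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="balanced_accuracy")
mean_score = outer_scores.mean()
```

Passing the `GridSearchCV` object itself to `cross_val_score` is what keeps tuning inside the inner loop; tuning once on all data and then cross-validating the chosen model would leak information.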

Directional functional modeling with PLS–DML

To explore directional relationships between microbial functions and CRC progression, Partial Least Squares (PLS) regression was combined with Double Machine Learning (DML)69,70,71. The ARG and VF datasets were first benchmarked using the same machine-learning framework described above to identify suitable nuisance estimators. RF and GB provided the most stable fits and were therefore adopted in the PLS-DML workflow. Hyperparameter tuning for these models also followed the same nested framework, with optimization restricted to inner folds to preserve the integrity of out-of-fold estimates. A directed acyclic graph (DAG) was constructed in dagitty v0.3-4 to identify the minimal adjustment set of covariates (age, sex, BMI)72,73. Prior to analysis, five-fold cross-validated PLS regression was applied to the data to extract latent functional components summarizing ARG/VF-CRC covariation.

Average treatment effects (ATEs) and 95% confidence intervals were estimated using a cross-fitted Double Machine Learning (DML) framework (econml v0.14.2) with 200 bootstrap resamples applied to the two stable PLS components. Residual confounding was evaluated using E-value sensitivity analysis. Functional enrichment was used to identify genes contributing most strongly to each component, and component interpretation was assigned strictly post hoc, based on gene loadings and enriched functional categories, without imposing any biological labels during model training. Network topology of the stable components was visualized using NetworkX v3.3, with hub genes defined by degree and eigenvector centrality.
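The cross-fitting logic behind DML can be illustrated with a hand-written partialling-out estimator. This is a simplified sketch under stated assumptions, not the econml workflow used in the study: the data are simulated with a known effect of 0.5, the exposure stands in for a PLS component score, the outcome is continuous rather than a disease stage, and the confidence interval uses a plug-in standard error instead of the 200 bootstrap resamples.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(0)
n = 500
covariates = rng.normal(size=(n, 3))  # stand-ins for age, sex, BMI
exposure = 0.5 * covariates[:, 0] + rng.normal(size=n)  # stand-in for a PLS score
outcome = 0.5 * exposure + covariates[:, 1] + rng.normal(size=n)  # true effect 0.5

cv = KFold(n_splits=5, shuffle=True, random_state=0)

def out_of_fold(target):
    """Cross-fitted nuisance prediction: each sample is predicted by a model
    that was trained without it."""
    model = RandomForestRegressor(n_estimators=100, min_samples_leaf=5, random_state=0)
    return cross_val_predict(model, covariates, target, cv=cv)

# Remove the part of exposure and outcome explained by the covariates.
exposure_resid = exposure - out_of_fold(exposure)
outcome_resid = outcome - out_of_fold(outcome)

# Residual-on-residual regression gives the effect estimate.
theta_hat = float(exposure_resid @ outcome_resid / (exposure_resid @ exposure_resid))

# Plug-in standard error from the orthogonal score.
psi = (outcome_resid - theta_hat * exposure_resid) * exposure_resid
std_error = float(np.sqrt(np.mean(psi**2) / n) / np.mean(exposure_resid**2))
ci_low, ci_high = theta_hat - 1.96 * std_error, theta_hat + 1.96 * std_error
```

Because both nuisance models are evaluated only on held-out folds, overfitting in the Random Forests does not feed directly into the effect estimate, which is the property that cross-fitting is meant to provide.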

As an exploratory proof-of-concept, we trained a simple logistic-regression model using only PLS₁ and PLS₃ to evaluate whether these latent components captured meaningful variation associated with CRC status. Model performance was estimated using five-fold out-of-fold (OOF) predictions, with ROC-AUC confidence intervals obtained from 1000 bootstrap resamples and statistical significance assessed by a 1000-iteration permutation test. Calibration curves and OOF-derived predicted probabilities were used to construct low-, medium-, and high-score strata. This cross-validated logistic model constitutes the Dual-Axis Index (DAI), which we present as an exploratory summary index rather than a diagnostic classifier.
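The evaluation steps for such a two-component model can be sketched as follows. The component scores are synthetic, the permutation count is reduced to 200 to keep the example fast (the study used 1000), and the permutation scheme shown here, which refits the model on shuffled labels, is one reasonable reading of the procedure rather than a confirmed detail.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)  # synthetic case/control labels
# Two synthetic component scores, shifted upward in cases.
scores = rng.normal(size=(n, 2)) + 0.8 * y[:, None]

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

def oof_auc(features, labels):
    """Five-fold out-of-fold probabilities and the resulting ROC-AUC."""
    proba = cross_val_predict(
        LogisticRegression(), features, labels, cv=cv, method="predict_proba"
    )[:, 1]
    return roc_auc_score(labels, proba), proba

observed_auc, oof_proba = oof_auc(scores, y)

# Percentile bootstrap of the OOF predictions for a 95% confidence interval.
boot_aucs = []
for _ in range(1000):
    idx = rng.integers(0, n, size=n)
    if len(np.unique(y[idx])) == 2:  # AUC needs both classes in the resample
        boot_aucs.append(roc_auc_score(y[idx], oof_proba[idx]))
ci_low, ci_high = np.percentile(boot_aucs, [2.5, 97.5])

# Permutation test: refit on shuffled labels to build a null AUC distribution.
null_aucs = [oof_auc(scores, rng.permutation(y))[0] for _ in range(200)]
p_value = (1 + sum(a >= observed_auc for a in null_aucs)) / (1 + len(null_aucs))
```

The bootstrap resamples the out-of-fold predictions rather than refitting the model, so the interval reflects sampling variability of the AUC and not variability of the fitting procedure.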

External validation and power analysis

To evaluate cross-cohort robustness, the two stable PLS components (PLS₁ and PLS₃) were validated in an independent CRC dataset using logistic regression with L1/L2 regularization. Prior to external validation, internal generalizability was assessed using a leave-one-cohort-out (LOCO) framework, in which the classifier was trained on two internal cohorts and tested on the third.
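A leave-one-cohort-out loop can be written with scikit-learn's `LeaveOneGroupOut` splitter. In the sketch below the features and labels are synthetic, the cohort labels only mimic the three cohort sizes, and the classifier uses the default L2 penalty; the exact regularization settings of the study are not reproduced.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(1)
# Cohort membership per sample; sizes mirror the three cohorts, data are synthetic.
cohort = np.repeat(["cohort_1", "cohort_2", "cohort_3"], [84, 316, 153])
y = rng.integers(0, 2, size=cohort.size)
X = rng.normal(size=(cohort.size, 2)) + 0.8 * y[:, None]  # two component scores

loco_auc = {}
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=cohort):
    # Train on two cohorts, test on the held-out third.
    model = LogisticRegression()  # default L2 penalty (assumption)
    model.fit(X[train_idx], y[train_idx])
    held_out = cohort[test_idx][0]
    proba = model.predict_proba(X[test_idx])[:, 1]
    loco_auc[held_out] = roc_auc_score(y[test_idx], proba)
```

Each entry of `loco_auc` reports performance on a cohort the model never saw during training, which separates cohort-specific signal from signal that transfers across study populations.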

To assess biological consistency at the gene level, we compared effect directionality across cohorts using the difference in CLR-transformed abundance (ΔCLR = mean[Case] – mean[Control]) and evaluated significance using both Mann–Whitney U and Wilcoxon signed-rank tests. Statistical power for detecting component-level differences was estimated using a one-way ANOVA–based power test (pwr.anova.test) assuming a moderate effect size (Cohen’s f = 0.3), α = 0.05, and a target power ≥ 0.8.
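The power calculation can be reproduced approximately outside R. The sketch below assumes a balanced design with three groups (Healthy, Adenoma, Cancer) and uses the noncentral F distribution from SciPy, with noncentrality k·n·f² as in standard one-way ANOVA power analysis; the number of groups is an assumption, since it is not stated alongside the other parameters.

```python
from scipy.stats import f as f_dist, ncf

def anova_power(n_per_group, k_groups=3, effect_f=0.3, alpha=0.05):
    """Power of a balanced one-way ANOVA for Cohen's f (k_groups is assumed)."""
    df_between = k_groups - 1
    df_within = k_groups * (n_per_group - 1)
    noncentrality = k_groups * n_per_group * effect_f**2
    # Critical value under the null, then tail probability under the alternative.
    critical = f_dist.ppf(1 - alpha, df_between, df_within)
    return float(ncf.sf(critical, df_between, df_within, noncentrality))

def min_n_per_group(target_power=0.8, **kwargs):
    """Smallest per-group sample size reaching the target power."""
    n = 2
    while anova_power(n, **kwargs) < target_power:
        n += 1
    return n

required_n = min_n_per_group()
```

Comparing `required_n` with the smallest stage-specific group in a cohort indicates whether a moderate component-level difference would be detectable at the stated α and power.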