Introduction

Colorectal cancer (CRC) ranks as the third most common cancer and the second leading cause of cancer-related death worldwide1. Accurate prediction of individual prognosis is of great significance for CRC patients. However, current prognostic stratification and treatment decisions mainly rely on the pathological TNM staging system, which is suboptimal due to the substantial tumor heterogeneity observed in histopathological CRC tissues2,3. Therefore, there is an urgent need for accurate risk stratification at the time of diagnosis to guide individualized treatment and surveillance strategies.

Contrast-enhanced computed tomography (CT) is the preferred imaging modality for CRC evaluation4, but conventional CT features offer limited prognostic information. In recent years, radiomics and deep learning (DL) have emerged as powerful tools in oncology, enabling the extraction of high-dimensional quantitative imaging features that capture tumor heterogeneity noninvasively and improve prognostic prediction5,6,7,8,9. Machine learning (ML) can further integrate these features to build predictive models with high accuracy10,11,12. Nevertheless, the performance of such models is highly dependent on the choice of algorithm, and different ML algorithms may yield substantially different results. To date, there is no consensus on the most effective algorithm for prognostic modeling in CRC. Moreover, the underlying biological underpinnings of imaging-based prognostic models remain poorly understood, thereby hindering their clinical application.

In this study, we aimed to develop a prognostic risk stratification based on a deep learning radiomics model (DLRM), optimized through systematic evaluation of ten machine learning algorithms across 117 combinations, to stratify CRC patients according to their overall survival (OS). Furthermore, we integrated metabolomic and transcriptomic analyses to explore the biological mechanisms underlying distinct prognostic outcomes, aiming to provide biological insights and interpretability for the imaging-based predictive models.

Results

Patient characteristics

A total of 1183 patients (mean age: 59 ± 11 years; 438 females) were enrolled for model construction. Among the included patients, 622 from Center 1 were divided into a training cohort (n = 435; mean age: 60 ± 10 years; 159 females) and an internal validation cohort (n = 187; mean age: 58 ± 11 years; 72 females) at a ratio of 7:3. Additionally, 313 patients (mean age: 59 ± 10 years; 122 females) from Center 2 and Center 3 were assigned to external validation cohort 1, and 248 patients (mean age: 58 ± 11 years; 85 females) from Center 4 were assigned to external validation cohort 2. The median OS for the four cohorts was 69.03, 68.97, 67.60, and 67.85 months, respectively. No significant differences were observed among the four cohorts in terms of age, sex, carcinoembryonic antigen (CEA), carbohydrate antigen 199 (CA199), T stage, N stage, location, postoperative chemotherapy, or differentiation (all p > 0.05), as shown in Table 1. An overview of the entire research process is shown in Fig. 1.

Fig. 1: Overview of the entire research process.

The study framework was organized into five main phases. A Data collection: Data were obtained from multiple sources to support different analytical stages. B Risk stratification: Radiomics and DL features from CT images were evaluated using 117 machine learning algorithm combinations to identify the optimal model for patient risk stratification. In parallel, three models (clinical, DLRM-RS, and combined models) were constructed and evaluated for performance comparison. C Metabolomic analysis: NMR-based profiling identified differential metabolic signatures between risk groups. D Transcriptomic analysis: Transcriptomic differences and immune infiltration patterns between risk groups were explored using TCIA data. E Pathway validation: Two commonly enriched pathways (butanoate metabolism and nitrogen metabolism) were identified in both transcriptomic and metabolomic analyses. Their prognostic significance was confirmed using GSVA scoring and Kaplan-Meier analysis in the TCGA cohort. CRC colorectal cancer, DLRM deep learning-radiomics model, DL deep learning, TCIA The Cancer Imaging Archive, TCGA The Cancer Genome Atlas, NMR Nuclear Magnetic Resonance, GSVA Gene Set Variation Analysis.

Table 1 Patient characteristics for DLRM-RS development

Feature selection and development of the DLRM-RS

A total of 1437 radiomics and 512 DL features were initially extracted from CT images of CRC patients. After deleting the features that showed poor reproducibility [intra- and inter-observer intraclass correlation coefficients (ICCs) ≤ 0.8], 1130 radiomics and 504 DL features were retained. Subsequent Spearman correlation analysis removed redundant features, yielding 174 radiomics and 142 DL features. Finally, univariate Cox regression identified 38 prognostic features (Supplementary Fig. S1 and Table S1).
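The redundancy-removal step can be illustrated with a minimal sketch. It is not the pipeline used in this study: the correlation threshold (0.9), the greedy keep-first rule, and the toy feature names are assumptions introduced only for illustration, because the cutoff applied to the Spearman coefficients is not stated here.

```python
from math import sqrt

def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of sorted positions i..j, shifted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rho, computed as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(vx * vy)  # undefined for constant columns

def drop_redundant(features, threshold=0.9):
    """Keep a feature only if |rho| with every already-kept feature is below threshold."""
    kept = []
    for name, column in features.items():
        if all(abs(spearman(column, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Toy feature table (hypothetical values, one list per feature).
toy = {
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [2.1, 3.9, 6.2, 8.0, 9.7],  # monotone in f1, so rho = 1 and f2 is dropped
    "f3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
print(drop_redundant(toy))
```

In this toy table, f2 is removed because it is perfectly rank-correlated with f1, whereas f3 is retained.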

These 38 prognostic features were analyzed using an ML-based integrative framework to construct the DLRM. As an initially screened feature set, these features were further refined through intrinsic feature selection within the ML models. Evaluation of 117 algorithmic combinations revealed that the combination of Lasso + GBM achieved the highest mean concordance index (C-index) (0.768) across all cohorts (Fig. 2a). In this approach, 15 features were initially selected by Lasso and subsequently used to train the GBM model. The final DLRM comprised nine radiomics features and six DL features, with their respective coefficients shown in Fig. 2b and Supplementary Table S2. Notably, the DL feature of D2 (feature_172) exhibited the highest coefficient weight (Fig. 2b). Spearman correlation analysis confirmed minimal multicollinearity, with all inter-feature correlations <0.6 (Fig. 2c). To assess the incremental prognostic value of the proposed DLRM, its discrimination performance was compared with simpler Cox-based baseline models. As shown in Supplementary Fig. S2 and Table S3, the DLRM consistently achieved higher C-index values than the radiomics-only Cox and DL-only Cox models across the training, internal validation, and two external validation cohorts. The improvements in C-index were statistically significant in all cohorts (all p < 0.05), indicating that the DLRM (Lasso + GBM) provided superior discriminatory ability over conventional single-modality Cox models. The GBM model generated a risk score for each patient. The optimal risk score cutoff (−0.380) (Fig. 2d) was determined in the training cohort using the ‘surv_cutpoint’ function (‘survminer’ package), which established the DLRM-based risk stratification (DLRM-RS) threshold for classifying patients into low- and high-risk groups.
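For reference, the C-index used to rank the algorithm combinations can be computed as sketched below. This is a plain implementation of Harrell's concordance index on toy data; the function name and the inputs are illustrative assumptions, and the software used in this study may treat ties and censoring differently.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs in which the patient who
    died earlier also has the higher predicted risk (ties count 0.5)."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i has an observed event
            # strictly before patient j's follow-up time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable

# Toy cohort (hypothetical): follow-up in months, event flag (1 = death), risk score.
times = [5, 10, 15, 20]
events = [1, 0, 1, 0]
risk = [0.9, 0.4, 0.1, 0.5]
c_index = concordance_index(times, events, risk)
print(c_index)
```

In this toy cohort, three of the four comparable pairs are ordered correctly, so the C-index is 0.75. A value of 0.5 corresponds to random ordering and 1.0 to perfect ranking.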

Fig. 2: Integration of DL and radiomics features for survival risk stratification in CRC patients using a machine learning framework.

a Construction of predictive models with subsequent C-index evaluation across all cohorts. Coefficient weights (b) and Spearman correlations (c) of radiomics and DL features in the training cohort. d Optimal cutoff value of the risk score determined by the “surv_cutpoint” function in the training cohort. Kaplan-Meier curves of OS according to the DLRM-RS in the training cohort (log-rank test: p < 0.001) (e), internal validation cohort (log-rank test: p < 0.001) (f), external validation cohort 1 (log-rank test: p < 0.001) (g), and external validation cohort 2 (log-rank test: p < 0.001) (h). i–m Time-dependent ROC curves of the DLRM-RS for predicting 3- and 5-year OS in different cohorts. DL deep learning, DLRM-RS deep learning radiomics model-based risk stratification, C-index concordance index.

Kaplan-Meier (K-M) analysis revealed significantly poorer OS in high-risk versus low-risk patients across all cohorts (all p < 0.001; Fig. 2e–h). Time-dependent receiver operating characteristic (ROC) curves further demonstrated the prognostic performance of the DLRM-RS: the area under the curve (AUC) values for 3- and 5-year OS were 0.720 and 0.744 in the training cohort, 0.700 and 0.730 in the internal validation cohort, 0.663 and 0.693 in external validation cohort 1, and 0.759 and 0.720 in external validation cohort 2 (Fig. 2i–m). Collectively, these findings confirmed that the DLRM-RS provided robust prognostic stratification for CRC patients.
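The K-M curves compared here rest on the product-limit estimator, which can be sketched in a few lines. The function name and the toy follow-up data are assumptions for illustration only and do not reproduce the curves in Fig. 2.

```python
def kaplan_meier(times, events):
    """Product-limit estimate: return [(t, S(t))] at each distinct event time."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Collect every patient whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set after t.
        at_risk -= leaving
    return curve

# Toy group (hypothetical): months of follow-up and event flag (1 = death, 0 = censored).
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 3))
```

Applying the same function separately to the low- and high-risk groups gives the two step curves that a log-rank test then compares.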

DLRM-RS performance across different clinical subgroups

To assess the prognostic utility of DLRM-RS across different clinical subgroups, we conducted stratified survival analyses within the entire cohort. The DLRM-RS was consistently a strong prognostic indicator across all subgroups, including age, sex, CEA/CA199 levels, T/N stage, location, postoperative chemotherapy (received vs. not received), and differentiation grade (all p < 0.001; Fig. 3, Supplementary Fig. S3 and Table S4). High-risk patients consistently showed significantly worse OS than low-risk patients in all subgroups.

Fig. 3: The HRs of OS predicted using the DLRM-RS across clinical subgroups in 1183 CRC patients.

HRs, 95% CIs and two-sided p values were calculated using the Cox proportional hazards model. HRs were depicted as the central point for the error bars, while the 95% CI was represented by the length of the error bars. CEA carcinoembryonic antigen, CA199 carbohydrate antigen 199, HR hazard ratio, OS overall survival, CRC colorectal cancer, CI confidence interval.

Comparative analysis of clinical characteristics revealed that high-risk patients were more likely to present with adverse clinical features, including advanced T/N stages, elevated levels of CEA and CA199, and poor tumor differentiation (all p < 0.05; Fig. 4a–h). However, no significant differences were observed in age, sex, or location between risk groups (all p > 0.05). DLRM-RS scores were significantly higher in patients with advanced T/N stages, elevated levels of CEA and CA199, and poor differentiation (all p < 0.05; Fig. 4i–p). No significant variations in DLRM-RS scores were observed across age, sex, or location (all p > 0.05). These findings collectively demonstrated the reliability of DLRM-RS in reflecting tumor aggressiveness and malignant potential.

Fig. 4: Associations between DLRM-RS-defined risk groups and clinical characteristics.

a–h Bar plots depicted the distribution of clinical subgroups (age, sex, CEA, CA199 levels, T stage, N stage, tumor location, and differentiation) between low- and high-risk groups as defined by DLRM-RS in the entire CRC cohort. i–p Box plots illustrated the differences in DLRM-RS scores across corresponding clinical subgroups. CEA carcinoembryonic antigen, CA199 carbohydrate antigen 199, DLRM deep learning and radiomics model, DLRM-RS DLRM-based risk stratification.

Development and evaluation of the nomogram model

Univariate and multivariate Cox regression analyses were conducted to assess the independent prognostic value of DLRM-RS. The univariate Cox regression analysis identified DLRM-RS as a significant predictor of OS in CRC patients (HR = 8.00, 95% CI: 5.40–11.86, p < 0.001; Fig. 5a; Supplementary Table S5), with this association remaining significant after multivariate Cox regression adjustment (HR = 5.88, 95% CI: 3.81–9.06, p < 0.001; Fig. 5a; Supplementary Table S5). Multivariate Cox regression analysis identified CEA, N stage, and DLRM-RS as independent prognostic factors for OS. Using these three variables, a prognostic nomogram was constructed through multivariate Cox regression with stepwise selection to generate individualized risk predictions (Fig. 5b).
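The relationship between a Cox coefficient and the reported HR with its 95% CI can be made explicit with a short sketch. It assumes a Wald-type interval with the normal quantile 1.96, which is a common convention but is not confirmed here as the method used; the function names are hypothetical. The example back-calculates the coefficient and standard error from the univariate result reported above and checks that the interval is recovered.

```python
from math import exp, log

def hazard_ratio_ci(beta, se, z=1.96):
    """Return (HR, lower, upper) from a Cox coefficient and its standard error."""
    return exp(beta), exp(beta - z * se), exp(beta + z * se)

def coef_from_reported(hr, lower, upper, z=1.96):
    """Back-calculate (beta, se) from a reported HR and its CI (Wald assumption)."""
    return log(hr), (log(upper) - log(lower)) / (2 * z)

# Univariate DLRM-RS result reported in the text: HR = 8.00, 95% CI 5.40-11.86.
beta, se = coef_from_reported(8.00, 5.40, 11.86)
hr, lower, upper = hazard_ratio_ci(beta, se)
print(round(beta, 3), round(se, 3))
print(round(hr, 2), round(lower, 2), round(upper, 2))
```

Because the reported interval is close to symmetric on the log scale, the reconstructed bounds agree with the published ones to within rounding, which is what a Wald interval would produce.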

Fig. 5: Development and evaluation of the nomogram model.

a Univariate and multivariate analyses of clinical characteristics and DLRM-RS in the training cohort. b A nomogram integrating clinical variables and DLRM-RS was constructed to predict 3- and 5-year OS in CRC patients from the training cohort. c, d Time-dependent ROC curves evaluating the predictive performance of the nomogram for 3- and 5-year OS across all cohorts. Calibration plots for 3- and 5-year OS predictions in the training cohort (e), internal validation cohort (f), external validation cohort 1 (g), and external validation cohort 2 (h). CEA carcinoembryonic antigen, CA199 carbohydrate antigen 199, OS overall survival, DLRM-RS deep learning radiomics model-based risk stratification, ROC receiver operating characteristic, AUC area under curve.

Our results demonstrated that the newly developed nomogram outperformed both the clinical model (including only CEA and N stage as predictors) and the DLRM-RS model across all cohorts. The C-index values and AUCs of 3- and 5-year OS predictions in all cohorts were significantly higher (all p < 0.05; Fig. 5c, d; Supplementary Fig. S4a–d; Table 2). Specifically, C-index values of the nomogram were 0.787 (95% CI: 0.742–0.828), 0.804 (95% CI: 0.736–0.866), 0.758 (95% CI: 0.693–0.822), and 0.803 (95% CI: 0.738–0.862), respectively. For 3-year OS prediction, the corresponding AUCs were 0.786 (95% CI: 0.726–0.847), 0.838 (95% CI: 0.767–0.908), 0.737 (95% CI: 0.651–0.824), and 0.830 (95% CI: 0.748–0.911), respectively (Supplementary Fig. S5a–d). For 5-year OS prediction, the AUCs were 0.822 (95% CI: 0.775–0.869), 0.853 (95% CI: 0.785–0.920), 0.786 (95% CI: 0.719–0.852), and 0.823 (95% CI: 0.754–0.893), respectively (Supplementary Fig. S5e–h).

Table 2 Comparison of C-index and time-dependent AUCs among three prognostic models across training, internal validation, and two external validation cohorts

Furthermore, the calibration plots confirmed the accuracy of the nomogram in predicting 3- and 5-year OS, indicating a good consistency between the predicted and the observed survival probabilities (Fig. 5e–h). Decision curve analysis (DCA) further showed that the nomogram yielded greater clinical net benefits compared to both the clinical model and the DLRM-RS model (Supplementary Fig. S6a–d).

Metabolomic alterations between high- and low-risk groups in CRC

To investigate metabolic differences between high-risk and low-risk CRC patients, 52 patients from Center 3 were stratified into high-risk (n = 17) and low-risk (n = 35) groups based on the DLRM-RS. Metabolomics analysis using 600 MHz NMR spectroscopy was performed on these 52 pairs of CRC tumor tissues and matched adjacent normal tissues (Supplementary Fig. S7). Principal component analysis (PCA) was conducted to visualize global metabolic differences among the high-risk, low-risk, and normal tissue groups. The 3D PCA score plot demonstrated a clear separation among the three groups, indicating distinct metabolic profiles (Fig. 6a). Hierarchical clustering heatmaps further illustrated intergroup metabolic differences, highlighting relative changes in metabolite levels across the normal, high-risk, and low-risk groups (Fig. 6b). Subsequently, orthogonal partial least squares discriminant analysis (OPLS-DA) was employed to enhance group discrimination. OPLS-DA score plots demonstrated clear segregation both between CRC and normal tissues (intercepts of goodness-of-fit R² = 0.885; goodness-of-prediction Q² = 0.817) and between high-risk and low-risk CRC groups (R² = 0.879, Q² = 0.802), supporting the robustness and predictive power of the models (Fig. 6c, d). Moreover, 999-time permutation tests were conducted to assess potential overfitting. The R2 and Q2 suggested that the OPLS-DA models were not apparently overfitted (Fig. 6e, f). Collectively, these findings revealed apparent metabolic differences among the normal, high-risk, and low-risk CRC groups.

Fig. 6: Differential metabolite and metabolic pathway analysis.

a 3D PCA score plots between normal, high-risk and low-risk groups. b Heatmap of differential metabolites along the normal, high-risk and low-risk groups (red: increased concentration; blue: decreased concentration). OPLS-DA score plots comparing CRC group vs. normal group (c), and high-risk group vs. low-risk group (d). e, f Statistical validation of the corresponding model by permutation analysis (999 times). The x-axis represents the permutation retention rate of the permutation test, and the dots in the upper right corner represent the R2 (light blue) and Q2 (dark blue) values of the original model when the permutation retention rate was 1. R2 measured the goodness of fit, while Q2 measured the predictive ability of the model. Light blue dots represented the R2 value obtained from the permutation test, while dark blue dots represented the Q2 value obtained from the permutation test. The two dashed lines represented the regression lines of R2 and Q2, respectively. Metabolic pathway analysis comparing CRC group vs. normal group (g) and high-risk group vs. low-risk group (h). Pathway topology was assessed using relative betweenness centrality. Each bubble represents a pathway, with bubble color and size indicating p-value and impact value, respectively. i Venn diagram showing 17 overlapping differential metabolites between CRC group vs. normal group and high-risk group vs. low-risk group. j Sankey-bubble chart of 11 significantly enriched metabolites and their 6 associated metabolic pathways. CRC colorectal cancer, PCA principal component analysis, OPLS-DA orthogonal partial least squares discriminant analysis.

Biomarkers were selected based on variable importance in projection (VIP > 1) from the OPLS-DA model and a Student’s t-test p-value < 0.05. As a result, significant metabolic alterations included 26 metabolites differentiating CRC from normal tissues and 27 metabolites distinguishing high-risk from low-risk CRC tissues (Supplementary Tables S6, S7). Pathway enrichment analysis using MetaboAnalyst revealed that the 26 metabolites differentiating CRC from normal tissues were enriched in pathways such as arginine biosynthesis, alanine, aspartate and glutamate metabolism, butanoate metabolism, and valine, leucine, and isoleucine biosynthesis (Fig. 6g, Supplementary Table S8). Similarly, the 27 metabolites differentiating high-risk from low-risk groups were enriched in alanine, aspartate, and glutamate metabolism, arginine biosynthesis, butanoate metabolism, glyoxylate and dicarboxylate metabolism, and others (Fig. 6h; Supplementary Table S9). Venn analysis of differential metabolites from CRC-normal and high-low risk comparisons identified 17 shared metabolites, representing key signatures linking tumorigenesis with prognostic risk stratification (Fig. 6i; Supplementary Fig. S8). The enrichment analysis showed that among the 17 overlapping metabolites, 11 metabolites were significantly enriched [false discovery rate (FDR) < 0.05] in six key metabolic pathways, namely arginine biosynthesis, butanoate metabolism, alanine, aspartate and glutamate metabolism, histidine metabolism, nitrogen metabolism, and valine, leucine, and isoleucine biosynthesis (Fig. 6j; Supplementary Table S10).
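The selection rule (VIP > 1 and p < 0.05) and the subsequent Venn overlap amount to a filter followed by a set intersection, sketched below. The metabolite names and statistics are placeholders invented for illustration; they are not values from Supplementary Tables S6 or S7.

```python
def select_biomarkers(stats, vip_cut=1.0, p_cut=0.05):
    """Keep metabolites whose VIP exceeds vip_cut and whose p-value is below p_cut."""
    return {name for name, (vip, p) in stats.items() if vip > vip_cut and p < p_cut}

# Placeholder statistics: name -> (VIP from OPLS-DA, t-test p-value).
crc_vs_normal = {
    "met_a": (1.8, 0.001),
    "met_b": (0.7, 0.010),  # fails the VIP criterion
    "met_c": (1.2, 0.030),
    "met_d": (1.5, 0.200),  # fails the p-value criterion
}
high_vs_low = {
    "met_a": (1.3, 0.020),
    "met_c": (0.9, 0.040),  # fails the VIP criterion
    "met_d": (1.6, 0.010),
}

tumor_markers = select_biomarkers(crc_vs_normal)
risk_markers = select_biomarkers(high_vs_low)
shared = tumor_markers & risk_markers  # the overlap region of the Venn diagram
print(sorted(shared))
```

Only metabolites passing both criteria in both comparisons reach the shared set, which is the group carried forward to pathway enrichment.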

Transcriptomic profiling reveals distinct biological processes between high- and low-risk groups in CRC

To explore transcriptomic alterations underlying risk stratification defined by DLRM-RS in CRC, we analyzed RNA sequencing data from 19 CRC samples in The Cancer Imaging Archive (TCIA) database, comprising 15 low-risk and 4 high-risk samples. Gene set enrichment analysis (GSEA) based on Gene Ontology (GO) terms revealed that significantly upregulated pathways in the low-risk group were predominantly related to antigen presentation, while those enriched in the high-risk group involved cardiac muscle repolarization and branching morphogenesis (Fig. 7a). All enriched pathways were ranked by their normalized enrichment score (NES) to identify the most prominent biological processes in each group. Notably, immunological synapse (NES = −2.60, p < 0.001), respiratory burst (NES = −2.567, p < 0.001), and antigen processing and presentation of peptide antigen (NES = −2.430, p < 0.001) were significantly enriched in low-risk tumors (Fig. 7b). In contrast, we observed significant up-regulation of several extracellular matrix (ECM)-related pathways in high-risk group, such as collagen fibril organization (NES = 2.307, p < 0.001), ECM structural constituent (NES = 2.274, p < 0.001), and ECM constituent conferring tensile strength (NES = 2.184, p < 0.001) (Fig. 7b). These results suggested distinct biological processes between risk groups, with low-risk tumors exhibiting immune activation and high-risk tumors characterized by ECM remodeling.

Fig. 7: Transcriptomic profiles between high- and low-risk groups defined by DLRM-RS.

a Network visualization of GO term-enriched pathways based on GSEA. Each node represented a biological process, and edges indicated potential interactions or shared functions. The color gradient reflected the NES value, with darker colors indicating stronger enrichment. b Bar plot showing the top 15 enriched pathways ranked by the NES in the high-risk group (red) and low-risk group (blue). c Radar plot of the proportion of 22 tumor-infiltrating immune cells between high-risk and low-risk groups. d Violin plot showing significantly different levels of CD8⁺ T cells between the high-risk and low-risk groups. e, f GSEA results for KEGG pathways significantly enriched in the high-risk group. g, h GSEA results for KEGG pathways significantly enriched in the low-risk group. TCGA The Cancer Genome Atlas, TCIA The Cancer Imaging Archive, NES Normalized Enrichment Score, GSEA Gene Set Enrichment Analysis, KEGG Kyoto Encyclopedia of Genes and Genomes.

Given the overall upregulation of immune-related pathways in the low-risk group–particularly antigen processing and presentation of peptide antigen, a key pathway in anti-tumor immune responses–we further assessed differences in the tumor immune microenvironment between risk groups. Using the Cell-type Identification by Estimating Relative Subpopulations of RNA Transcripts (CIBERSORT) algorithm13, we estimated the abundance of 22 immune cell subsets across samples. Radar plots revealed distinct immune cell profiles between groups (Fig. 7c), with CD8+ T cells significantly more abundant in the low-risk group (Fig. 7d).

To validate the metabolomic findings at the transcriptomic level, we conducted pathway enrichment analysis using Kyoto Encyclopedia of Genes and Genomes (KEGG) gene sets between high-risk and low-risk groups. GSEA results revealed several pathways significantly upregulated in the high-risk group, including the cGMP-PKG signaling pathway (NES = 1.882, FDR < 0.001), proteoglycans in cancer (NES = 1.819, FDR < 0.001), calcium signaling pathway (NES = 1.571, FDR = 0.006), and PI3K-Akt signaling pathway (NES = 1.529, FDR = 0.003) (Fig. 7e, f; Supplementary Table S11). Conversely, the valine, leucine and isoleucine degradation pathway (NES = −2.249, FDR < 0.001), butanoate metabolism (NES = −1.914, FDR = 0.019), nitrogen metabolism (NES = −2.080, FDR = 0.001), Th1/Th2 cell differentiation (NES = −1.840, FDR = 0.002), and T cell receptor signaling pathway (NES = −1.558, FDR = 0.019) were significantly downregulated in the high-risk group (Fig. 7g, h; Supplementary Table S11). Importantly, butanoate metabolism and nitrogen metabolism, initially identified through metabolomics analysis, were also significantly enriched at the transcriptomic level, providing convergent evidence for their potential association with prognostic stratification in CRC.

Prognostic significance of butanoate metabolism and nitrogen metabolism in CRC

Based on the above findings, butanoate metabolism and nitrogen metabolism emerged as recurrent metabolic pathways differing between risk groups in both metabolomic and transcriptomic analyses. To further validate their prognostic relevance, we performed K-M analyses using the gene set variation analysis (GSVA) scores of these pathways in The Cancer Genome Atlas colorectal cancer (TCGA-CRC) cohort (n = 417). Patients were divided into high- and low-score groups according to the median GSVA score for each pathway.

The results showed that patients with lower butanoate metabolism scores had significantly poorer OS (p = 0.007; HR = 0.52; 95% CI: 0.33–0.84) (Fig. 8a). Similarly, a lower GSVA score for the nitrogen metabolism pathway was also associated with worse OS (p = 0.018; HR = 0.56; 95% CI: 0.35–0.90) (Fig. 8b). These findings suggested that reduced activity in these two metabolic pathways may contribute to unfavorable prognosis in CRC.
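The median split and the group comparison behind such K-M analyses can be sketched as follows. The code is a generic two-group log-rank statistic on toy data, not the analysis run on the TCGA cohort; the function names, the rule that scores equal to the median fall in the low-score group, and all values are assumptions for illustration.

```python
from statistics import median

def median_split(scores):
    """Label each patient 1 (score above the median) or 0 (at or below it)."""
    cut = median(scores)
    return [1 if s > cut else 0 for s in scores]

def logrank_chi2(times, events, groups):
    """Two-group log-rank chi-square statistic (1 degree of freedom)."""
    obs_minus_exp = 0.0
    variance = 0.0
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t in event_times:
        n = sum(1 for ti in times if ti >= t)  # at risk overall
        n1 = sum(1 for ti, g in zip(times, groups) if ti >= t and g == 1)
        d = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        d1 = sum(1 for ti, e, g in zip(times, events, groups)
                 if ti == t and e == 1 and g == 1)
        obs_minus_exp += d1 - d * n1 / n  # observed minus expected deaths in group 1
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return obs_minus_exp ** 2 / variance

# Toy cohort (hypothetical): pathway score, months of follow-up, event flag.
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
times = [2, 3, 4, 5, 6, 7, 8, 9]
events = [1, 1, 1, 1, 1, 1, 1, 1]
groups = median_split(scores)
chi2 = logrank_chi2(times, events, groups)
print(groups, round(chi2, 2))
```

A statistic above 3.84 corresponds to p < 0.05 at one degree of freedom. In this toy cohort the low-score patients die first, so the statistic exceeds that threshold.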

Fig. 8: The relationship between key pathways and OS in CRC patients.

a K-M analysis for OS probability stratified by different GSVA scores of butanoate metabolism (based on median value) in the TCGA cohort. b K–M analysis for OS probability stratified by different GSVA scores of nitrogen metabolism (based on median value) in the TCGA cohort. HR hazard ratio, OS overall survival.

Discussion

In this multicenter study, we developed a prognostic DLRM based on venous-phase CT images and derived a corresponding DLRM-RS to stratify CRC patients into low- and high-risk groups. The DLRM-RS demonstrated robust performance in predicting OS and maintained its discriminatory ability across diverse clinical subgroups, supporting its generalizability. To enhance clinical applicability, we integrated DLRM-RS with clinical parameters (CEA level and N stage) to construct a nomogram that improved prognostic accuracy. Moreover, integrative multi-omics analysis revealed that high-risk tumors were enriched for ECM-related pathways associated with tumor progression, whereas low-risk tumors exhibited immune-related signatures, including higher CD8⁺ T-cell infiltration. Notably, butanoate metabolism and nitrogen metabolism were consistently enriched in the low-risk group, providing exploratory insights into potential biological patterns associated with the prognostic value of the DLRM-RS.

Given the substantial heterogeneity of CRC outcomes, developing reliable prognostic tools remains a research priority. In recent years, radiomics and DL technologies have shown strong potential in tumor prognostication by extracting high-dimensional imaging features that noninvasively characterize tumor heterogeneity and biological behavior5,14. In addition, integrating these features with machine learning further improves predictive performance15. However, many previous studies relied on limited or subjectively selected algorithms, increasing the risk of overfitting and yielding suboptimal models. In our study, we applied a comprehensive computational framework incorporating ten machine learning algorithms and 117 combinations to build a stable, reproducible prognostic model. This systematic approach allowed us to explore algorithmic interactions, reduce feature dimensionality, and enhance model generalizability. Among all combinations, a combined approach using Lasso and GBM was ultimately selected because it achieved the highest mean C-index (0.768) across cohorts. Compared to traditional survival analysis methods, Lasso can effectively perform variable selection and regularization on high-dimensional datasets, reduce model complexity, and mitigate multicollinearity16. GBM integrates the strengths of the Boosting algorithm with the generalized regression approach, incrementally enhancing the predictive capability of the model17. Notably, the final model included nine radiomics features and six DL features, which showed low inter-correlation, indicating they capture distinct but synergistic aspects of tumor biology. The DLRM-RS derived from the optimal model effectively stratified patients, with those in the high-risk group showing significantly worse OS, a finding validated in both internal and external cohorts. To improve the predictive accuracy and facilitate individualized risk assessment, we integrated clinical variables into the prognostic framework. Multivariate Cox regression identified CEA and N stage as independent predictors of OS. As a key serum biomarker for gastrointestinal malignancies, elevated preoperative CEA levels have been associated with more aggressive tumor behavior and a higher risk of postoperative recurrence55. Likewise, advanced N stage has consistently been associated with poorer prognosis in previous studies18,19,20. Combining DLRM-RS, CEA, and N stage into a nomogram yielded superior predictive performance, offering a clinically applicable tool for personalized risk stratification of CRC.

Despite their excellent predictive performance, data-driven DL and radiomics studies are inherently hypothesis-generating, making it challenging to directly infer the biological significance of their results. However, exploring the potential biological relevance of these imaging features is essential for understanding the disease mechanism, identifying potential diagnostic and therapeutic targets, and facilitating clinical translation. To enhance the interpretability of the DLRM-RS, we conducted a series of biological analyses. Metabolomic profiling revealed significant alterations in several metabolic pathways between the high- and low-risk CRC groups, including butanoate metabolism, nitrogen metabolism, and alanine, aspartate, and glutamate metabolism. Each of these metabolic pathways is related to tumor progression and immune regulation21,22,23. These findings suggest that metabolic reprogramming may be associated with the prognostic stratification defined by the DLRM-RS, providing an exploratory biological context. Furthermore, the transcriptomic analysis using GSEA demonstrated that the tumors in the high-risk group were enriched in ECM-related pathways, such as ECM structural constituent, ECM constituent conferring tensile strength, and collagen fibril organization. The ECM, as a key component of the tumor microenvironment together with various immune cells, plays a critical role in regulating cellular proliferation, migration, diffusion, and differentiation24,25,26. These findings were consistent with previous studies showing that ECM-related pathways were closely associated with tumor progression27,28. In contrast, the tumors in the low-risk group showed significant enrichment in immune-related pathways, including antigen processing and presentation, immunological synapse formation, and respiratory burst, suggesting a more active anti-tumor immune microenvironment. Consistently, immune cell infiltration analysis revealed that low-risk patients exhibited higher levels of CD8⁺ T cell infiltration. CD8⁺ T cells are central to anti-tumor immunity by releasing cytotoxic molecules and cytokines29,30, and their high density has been confirmed to correlate with better survival outcomes31,32. Collectively, these exploratory results suggest that the DLRM-RS may be associated with distinct metabolic and immune-related patterns in CRC, and point toward potential immunotherapy targets for future interventions.

Importantly, integrative metabolomic and transcriptomic analyses consistently identified two altered pathways: butanoate metabolism and nitrogen metabolism. Differential pathway analysis revealed that both pathways were significantly enriched in the low-risk group, suggesting potential protective roles in CRC prognosis. Further corroborating their clinical relevance, GSVA in an independent TCGA-CRC cohort demonstrated that lower activity of these pathways was significantly associated with worse OS, supporting their candidacy for further investigation as biomarkers or therapeutic targets. Butanoate metabolism, closely linked to the production of short-chain fatty acids by commensal gut microbiota, is known for its anti-inflammatory, anti-proliferative, and pro-apoptotic effects in CRC33. Butyrate, its central metabolite, functions as a histone deacetylase inhibitor, suppressing CD8⁺ T cell exhaustion within the tumor immune microenvironment, inhibiting JAK-STAT signaling, and enhancing antitumor immunity23,34,35. The enrichment of this pathway in the low-risk group suggests a more favorable tumor microenvironment, possibly associated with a healthier microbiota composition and enhanced immunosurveillance. Interestingly, nitrogen metabolism also showed higher activity in patients with better prognosis. Although nitrogen metabolism has traditionally been linked to tumor proliferation36,37,38, recent evidence suggests a context-dependent role. In the tumor microenvironment, enhanced nitrogen metabolism may inhibit tumor progression by supporting the function of immune cells, maintaining redox balance and metabolic homeostasis39,40, and promoting the formation of an immunologically active microenvironment. Therefore, the upregulation of nitrogen metabolism in the low-risk group may reflect a metabolic state conducive to immune surveillance, suggesting a potential link between metabolism-immune interactions and CRC prognosis.

The findings of this study carry significant clinical implications, particularly in informing personalized treatment strategies for CRC patients. Notably, the prognostic value of the DLRM-RS was consistently preserved across a wide range of clinically relevant subgroups, including age, sex, tumor location, CEA/CA199 levels, T stage, N stage, differentiation grade, and postoperative chemotherapy status. Importantly, within each of these subgroups, patients classified as high-risk by the DLRM-RS consistently exhibited significantly worse OS than their low-risk counterparts. This indicates that the DLRM-RS provides prognostic information beyond conventional clinicopathological factors, rather than merely recapitulating established predictors such as tumor stage. From a clinical perspective, this additional stratification capability suggests that high-risk patients identified by the DLRM-RS may benefit from intensified treatment strategies or closer postoperative surveillance, even within the same clinical stage or treatment category, whereas low-risk patients may be considered for de-escalated interventions to avoid overtreatment. Integrating radiomics and DL into routine clinical workflow may therefore enhance prognostic precision, support risk-adapted decision-making, and facilitate more individualized patient management. Furthermore, the identification of prognosis-related metabolic pathways, such as butanoate metabolism and nitrogen metabolism, provides biologically plausible contexts or signals that may help interpret risk stratification and inform future hypothesis-driven studies.

While this study provided significant insights, several limitations should be acknowledged. Firstly, as a retrospective multicenter study, potential biases and unmeasured confounders remain. In particular, the transcriptomic and metabolomic analyses were limited by relatively small sample sizes, especially in the high-risk subgroup of the transcriptomic cohort. Such underpowered datasets may reduce the statistical robustness of differential expression and pathway enrichment analyses and increase the risk of overfitting. Therefore, the biological interpretations derived from these exploratory multi-omics analyses should be considered hypothesis-generating rather than definitive, and validation in larger, well-designed prospective multi-omics cohorts is warranted. Secondly, manual ROI segmentation depends heavily on the reader's subjective experience; future work should adopt an unsupervised segmentation method. Thirdly, although the patch-based DL strategy employed in this study captures inter-patch heterogeneity, it may not fully model fine-grained intra-patch heterogeneity. Recent advances, such as multiple-instance learning frameworks, enable instance-level feature learning within patches, thereby preserving detailed morphological and spatial information that may be critical for prognostic modeling. Moreover, graph-based transformer approaches can further model complex contextual relationships among instances or patches, facilitating multi-scale representation of tumor heterogeneity. These advanced strategies may address the limitations of the current approach and will be explored in future work41. Lastly, while this study integrated radiomics, DL, metabolomics, and transcriptomics, other critical dimensions such as methylation, proteomics, and pathomics were not included. Future multi-omics integrative analyses are warranted to deepen understanding of tumor biology and improve precision medicine.

In conclusion, this study proposed and validated a DLRM and its derived DLRM-RS for prognosis prediction in CRC patients based on preoperative CT. Combining CEA and N stage with the DLRM-RS may offer new opportunities for patient management. In addition, we provided new insights into the molecular mechanisms of CRC occurrence and development from the perspectives of metabolomics and transcriptomics. By integrating molecular insights with clinical applications, our findings may pave the way for improved prognosis prediction and the development of personalized treatment strategies.

Methods

Patients

We retrospectively collected data from CRC patients who underwent surgical resection between January 2016 and January 2020 across four centers. A total of 1183 patients with preoperative CT images, clinical data, and follow-up data were included for the development and validation of a risk stratification model. The inclusion criteria were as follows: (a) CRC confirmed by postoperative pathology; (b) radical surgery for CRC performed; (c) contrast-enhanced CT examination within 2 weeks before surgery with good image quality; and (d) a follow-up period of more than 5 years. The exclusion criteria were: (a) receipt of radiotherapy, chemotherapy, or other treatments before surgery; (b) presence of other malignant neoplasms; (c) incomplete clinical or pathological data; and (d) distant metastasis before surgery. The patients were divided into four cohorts: (1) a training cohort (n = 435) and (2) an internal validation cohort (n = 187), both from Center 1 (January 2016 - December 2019) and obtained using a 7:3 split; (3) external validation cohort 1 (n = 313) from Centers 2 and 3 (January 2018 - January 2020); and (4) external validation cohort 2 (n = 248) from Center 4 (January 2018 - January 2020). The two external validation cohorts were intentionally analyzed separately rather than merged, as they originated from distinct geographic regions and clinical settings. This design allows a more rigorous assessment of the model’s generalizability across heterogeneous real-world environments.

In addition, 52 patients from Center 3 with paired CT images, follow-up data, and 1H NMR-based metabolomic data (tumor and adjacent normal tissue) were included from a previous CRC diagnostic study44 and used for the metabolomic analysis. While the previous study primarily focused on diagnostic applications, our current analysis aimed to investigate metabolic alterations underlying distinct prognostic outcomes. Furthermore, CT images, follow-up data, and RNA sequencing data from 19 CRC patients in the TCIA database were used for mechanistic exploration of prognostic differences (transcriptomic analysis), and an additional 417 patients from The Cancer Genome Atlas (TCGA) with RNA sequencing and follow-up data were included for pathway validation. To clarify the data sources and cohort definitions, a summary of patient sources, data types, and analytic purposes is presented in Supplementary Table S12, and the patient enrollment process is illustrated in Supplementary Fig. S9.

Clinical data collection and follow-up

Clinical data recorded from the medical records of participating institutions included age, sex, tumor differentiation, tumor location, CEA, CA199, T stage, N stage, and postoperative chemotherapy.

The clinical endpoint of this study was OS, defined as the time from surgical resection to death or the last follow-up. The final follow-up was conducted in January 2025. Patients were monitored every 6–12 months in the first 2 years and once a year thereafter. All patients were followed up for at least 5 years.

CT image acquisition and image preprocessing process

CT images were obtained at four centers using five CT scanners. All patients underwent contrast-enhanced abdominal CT scans. Detailed acquisition parameters for each scanner are provided in Supplementary Table S13. To reduce noise-related intensity variability across scanners and hospitals and to improve the robustness of feature extraction, voxel intensities were discretized using a fixed bin width of 25 Hounsfield units (HU). The HU is a standardized quantitative scale in CT imaging that reflects tissue attenuation relative to water and air. Fixed bin-width discretization is a commonly adopted preprocessing step in CT-based image analysis, recommended by the Image Biomarker Standardisation Initiative (IBSI), because it reduces high-frequency intensity fluctuations and minor scanner-dependent variations45,46,47. Subsequently, all images were resampled to an isotropic voxel spacing of 1 mm × 1 mm × 1 mm (x, y, z) using linear interpolation to standardize spatial resolution across scanners and institutions.
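
The fixed bin-width discretization described above can be illustrated with a minimal sketch. It assumes the IBSI-style mapping floor((X - X_min) / w) + 1 with bins anchored at the minimum intensity in the region; the function name and the anchoring choice are illustrative assumptions and are not taken from the study's code.

```python
import math

def discretize_fixed_bin_width(hu_values, bin_width=25.0, min_hu=None):
    """Map HU intensities to 1-based bin indices using a fixed bin width.

    Assumption: bins are anchored at the minimum intensity of the region
    unless min_hu is supplied (hypothetical parameter for illustration).
    """
    if min_hu is None:
        min_hu = min(hu_values)
    return [int(math.floor((v - min_hu) / bin_width)) + 1 for v in hu_values]

# Intensities within the same 25 HU interval share one bin index.
bins = discretize_fixed_bin_width([0, 24, 25, 49.9, 50])
```

With a 25 HU bin width, small scanner-dependent fluctuations of a few HU usually fall in the same bin, which is the intended smoothing effect.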

Tumor segmentation and feature extraction

Tumor segmentation was performed on venous-phase CT images. Two radiologists (Z.H.L. and Y.Q., with 5 and 10 years of experience, respectively) manually segmented the tumors using 3D Slicer software (version 5.2.2; https://www.slicer.org/). Mesentery air and pericolonic fat were excluded from the volume of interest. Discrepancies were reviewed and resolved by a third radiologist (Y.L.), who has over 20 years of experience. All radiologists were blinded to patients’ survival outcomes.

Subsequently, radiomic features were extracted from the volume of interest (VOI) in venous-phase CT images using the PyRadiomics Python package (version 3.8)48, including first-order, shape-based, gray-level co-occurrence matrix (GLCM), gray-level size zone matrix (GLSZM), gray-level run-length matrix (GLRLM), gray-level dependence matrix (GLDM), and neighboring gray tone difference matrix (NGTDM) features. The specific classification of radiomics features is shown in Supplementary Table S14. In addition, DL-based feature extraction was performed using a three-dimensional convolutional neural network (3D CNN). Considering the limited availability of pretrained models for medical 3D imaging, the Med3D dataset, a large-scale, multi-modal, multi-organ medical imaging resource, was employed for network pretraining49. A 3D ResNet18 architecture pretrained on Med3D was adopted as the backbone network, selected for its favorable balance between representational capacity and robustness in small- to moderate-sized medical imaging datasets. Manually delineated tumor regions of interest (ROIs) were used to localize the lesion in three-dimensional space. For each patient, a cubic bounding box fully enclosing the ROI was generated. The resulting 3D volumes were then resampled using linear interpolation to a fixed size of 96 × 96 × 96 voxels at an isotropic resolution of 1 × 1 × 1 mm³, ensuring uniform input dimensions across patients while preserving essential peritumoral contextual information. Prior to network input, voxel intensities within each 3D patch were linearly scaled to a fixed range to improve numerical stability during training. Data augmentation strategies, including random affine transformations, random flips, and random 90° rotations along different axes, were applied during training to reduce overfitting and enhance model generalizability.
The 3D ResNet18 backbone was initialized with Med3D-pretrained weights and subsequently fine-tuned using patient outcomes from the training cohort as supervision. Model training was performed for 200 epochs using the Adam optimizer with an initial learning rate of 0.0005, a binary cross-entropy loss with logits, and a cosine annealing learning rate scheduler. Given that the primary objective of the DL model was feature representation learning rather than direct survival time modeling, OS was formulated as a binary outcome during network training to provide supervision for discriminative feature extraction. Specifically, patients were labeled as deceased or alive at 5 years after surgery, and this binary endpoint was used solely for network supervision rather than for final survival prediction. After fine-tuning, DL features were extracted from the penultimate layer of the 3D ResNet18 network for each patient, yielding a 512-dimensional feature vector5,50. In total, 1437 radiomics features and 512 DL features were extracted from each patient. Given the standardized image preprocessing in accordance with IBSI recommendations, ICC-based feature robustness assessment, redundancy reduction, and rigorous independent multi-center external validation, no additional feature-level harmonization methods were applied in this study.
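
The binary supervision label and the learning rate schedule described above can be sketched as follows. This is only an illustration under stated assumptions: survival time is taken to be recorded in months, the schedule is assumed to span all 200 epochs, and the minimum learning rate is assumed to be 0; none of these details are given in the text, and the function names are hypothetical.

```python
import math

def five_year_label(os_months, died, horizon_months=60.0):
    """Return 1 if the patient died within the horizon, otherwise 0.

    Assumption: OS is recorded in months and every patient has at least
    5 years of follow-up, so status at the horizon is always known.
    """
    return 1 if (died and os_months <= horizon_months) else 0

def cosine_annealing_lr(epoch, total_epochs=200, initial_lr=5e-4, min_lr=0.0):
    """Cosine-annealed learning rate for a given epoch (min_lr is assumed)."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (initial_lr - min_lr) * (1.0 + cosine)

# Learning rate at the start, midpoint, and end of training.
schedule = [cosine_annealing_lr(e) for e in (0, 100, 200)]
```

Under these assumptions the learning rate starts at 0.0005, reaches half of that value at epoch 100, and decays to zero by epoch 200.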

To ensure the reproducibility and reliability of radiomic and DL features, 30 CT images were randomly selected for VOI segmentation, and the inter- and intra-observer consistency of feature extraction was assessed by intra-class correlation coefficient (ICC) analysis. First, radiologist 1 (Z.H.L.) and radiologist 2 (Y.Q.) independently delineated the VOIs, from which the inter-observer ICC was calculated. After 2 months, radiologist 1 re-segmented the VOIs to calculate the intra-observer ICC. An ICC value greater than 0.80 was considered to indicate good agreement.
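
As an illustration of the agreement analysis, the sketch below computes a two-way random-effects, absolute-agreement, single-measurement ICC, often written ICC(2,1), for one feature. The text does not state which ICC form was used, so this choice is an assumption, and the function name is hypothetical.

```python
def icc_2_1(ratings):
    """ICC(2,1) for one feature.

    ratings: list of rows, one per patient; each row holds the feature
    value obtained from each segmentation (one column per reader/session).
    Assumption: the ICC form is not specified in the text.
    """
    n = len(ratings)
    k = len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )

# A feature would be retained only if its ICC exceeds 0.80.
keep_feature = icc_2_1([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]) > 0.80
```

Because this form penalizes systematic offsets between readers, a feature that one reader consistently measures higher than the other can fall below the 0.80 threshold even when the two sets of values are perfectly correlated.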

Feature selection and development of the DLRM-RS

Feature selection was confined to the training cohort. Before feature selection, z-score normalization was used to standardize the radiomics and DL features. First, features with both intra- and inter-observer ICC values > 0.80 were retained to ensure reproducibility. Second, Spearman correlation analysis was performed to reduce redundancy and multicollinearity, using a threshold of |r| > 0.8; when collinearity was detected, only one of the correlated features was retained for further analysis. Third, univariate Cox regression was applied as an initial feature screening step to reduce dimensionality. This analysis was not intended for final statistical inference, and therefore, no formal multiple testing correction was applied at this stage. Subsequent model development incorporated ML algorithms with intrinsic regularization or feature selection mechanisms to further control overfitting and identify the most informative features.
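
The redundancy-reduction step can be sketched as a greedy filter on pairwise Spearman correlations. The text does not specify which member of a correlated pair was kept; the sketch assumes the first feature encountered is retained. All function and feature names are hypothetical, and features are assumed to be non-constant.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def drop_correlated(features, threshold=0.8):
    """Keep a feature only if |r| <= threshold against all kept features.

    Assumption: the first feature of a correlated pair is the one kept.
    """
    kept = []
    for name, values in features.items():
        if all(abs(spearman(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

example = {
    "feat_a": [1, 2, 3, 4, 5],
    "feat_b": [2, 4, 6, 8, 11],   # monotonic in feat_a, so dropped
    "feat_c": [5, 1, 4, 2, 3],
}
selected = drop_correlated(example)
```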

To develop a robust DLRM for OS prediction, we utilized the “Mime” R package (version 1.0, https://github.com/l-magnificence/Mime), which integrates ten machine learning algorithms and 117 algorithm combinations51. The integrated algorithms included random survival forest (RSF), elastic net (Enet), stepwise Cox (StepCox), least absolute shrinkage and selection operator (Lasso), partial least squares regression for Cox (plsRcox), supervised principal components (SuperPC), generalized boosted regression models (GBM), survival support vector machine (survival-SVM), ridge regression (Ridge), and Cox model with boosting algorithm (CoxBoost). The precise parameters of the ten machine learning algorithms can be found in the original code and in Supplementary Table S15. Each model was trained using 10-fold cross-validation in the training cohort and evaluated in the internal and two external validation cohorts. To identify a robust and generalizable final model, we summarized predictive performance using the C-index across the training cohort, the internal validation cohort, and the two external validation cohorts. The model achieving the highest mean C-index across these four cohorts was selected as the final DLRM, aiming to balance predictive accuracy and stability across heterogeneous populations. Each patient’s computed risk score from this model was termed the DLRM risk score.
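
The selection rule above (highest mean C-index across the four cohorts) can be sketched as follows, using Harrell's C-index for right-censored data. This is not the Mime implementation; the function names, model names, and C-index values are hypothetical and serve only to illustrate the rule.

```python
def harrell_c_index(times, events, risks):
    """Harrell's C-index: fraction of comparable pairs ranked correctly.

    A pair (i, j) is comparable when patient i had an event and
    times[i] < times[j]; tied risk scores count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable

def select_best_model(c_index_by_model):
    """Return the model name with the highest mean C-index across cohorts."""
    return max(
        c_index_by_model,
        key=lambda m: sum(c_index_by_model[m]) / len(c_index_by_model[m]),
    )

# Hypothetical C-indices (training, internal, external 1, external 2).
candidates = {
    "model_a": [0.80, 0.70, 0.68, 0.66],
    "model_b": [0.75, 0.74, 0.73, 0.72],
}
best = select_best_model(candidates)
```

In this made-up example the second candidate wins despite a lower training C-index, which reflects the stated aim of favoring stability across cohorts over peak performance in any single one.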

To facilitate comparison with conventional prognostic modeling strategies, two baseline models were also constructed. Specifically, radiomics-only and DL-only models were developed following the same feature selection procedures and were subsequently built using a conventional multivariable Cox regression framework. These baseline models were used to benchmark the incremental prognostic value of the proposed integrative DLRM.

In addition, the optimal cutoff value for the DLRM risk score in the training cohort was determined using the ‘surv_cutpoint’ function from the ‘survminer’ package (version 0.5.0). Based on this threshold, a DLRM-RS was established, categorizing patients into low- and high-risk groups.

Evaluation of the DLRM-RS

A time-dependent ROC curve was used to evaluate the prognostic performance of the DLRM-RS. K-M analysis and the log-rank test were conducted to assess survival differences between the two risk groups. To minimize potential confounding and ensure the robustness of the DLRM-RS, subgroup analyses were performed based on sex, age (≥65/<65 years), tumor location (right colon/rectum/left colon), T stage (T1-2/T3-4), N stage (N0/N+), CEA level, CA199 level, and postoperative chemotherapy.
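
For readers less familiar with K-M analysis, the product-limit estimator behind the survival curves can be sketched in a few lines. This is a generic illustration rather than the study's code (the analyses were run in R); the function name is hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] evaluated at each distinct event time.

    times: follow-up time per patient; events: 1 for death, 0 for censored.
    """
    curve = []
    surv = 1.0
    event_times = sorted(set(t for t, e in zip(times, events) if e))
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e)
        surv *= 1.0 - deaths / at_risk
        curve.append((t, surv))
    return curve

# Two deaths (at t=2 and t=4) and two censored patients (t=3 and t=5).
curve = kaplan_meier([2, 3, 4, 5], [1, 0, 1, 0])
```

Computing one such curve per risk group and comparing the two with the log-rank test corresponds to the analysis described above.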

Development and evaluation of the nomogram model

In the training cohort, univariate and multivariate Cox regression analyses were performed to identify independent prognostic factors for OS among clinical variables and the DLRM-RS. A clinical model was developed based on the significant clinical variables from the multivariate analysis. Subsequently, a combined prognostic nomogram was constructed by incorporating the DLRM-RS with these clinical variables using multivariate Cox regression.

The overall discriminative performance of the clinical model, DLRM-RS model, and the nomogram was evaluated using the C-index. Predictive accuracy at 3 and 5 years was assessed using time-dependent ROC curves, with the AUC calculated for each time point. The C-index values and time-dependent AUCs of the models were compared using a nonparametric bootstrap method with 1000 resamples to estimate 95% confidence intervals and test for statistically significant differences in discrimination. Calibration curves were plotted to assess the agreement between predicted and observed survival probabilities at 3 and 5 years. Furthermore, DCA was performed to evaluate the clinical utility of each model across a range of threshold probabilities.
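
The quantity plotted in DCA is the net benefit at each threshold probability. The sketch below uses the standard binary-outcome form, TP/n - (FP/n) × pt/(1 - pt); it is a simplification, since DCA for survival outcomes additionally accounts for censoring, and the function name and example values are hypothetical.

```python
def net_benefit(outcomes, predicted_risks, threshold):
    """Net benefit of treating patients whose predicted risk >= threshold.

    Simplifying assumption: outcomes are fully observed binary events
    (1 = event, 0 = no event); censoring is not handled here.
    """
    n = len(outcomes)
    pairs = list(zip(outcomes, predicted_risks))
    tp = sum(1 for y, p in pairs if p >= threshold and y == 1)
    fp = sum(1 for y, p in pairs if p >= threshold and y == 0)
    return tp / n - (fp / n) * (threshold / (1.0 - threshold))

# Net benefit across a few threshold probabilities for made-up predictions.
y = [1, 1, 0, 0]
p = [0.9, 0.6, 0.4, 0.1]
nb_curve = [(pt, net_benefit(y, p, pt)) for pt in (0.3, 0.5)]
```

A model is considered clinically useful over the range of thresholds where its net benefit exceeds both the treat-all and treat-none strategies.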

Metabolomics sample preparation, metabolite identification, and analysis

We collected metabolomics data from 52 CRC patients in Center 3, each of whom provided paired tumor and adjacent normal tissue samples (the latter collected approximately 5 cm from the tumor margin). Tissue processing and 1H NMR spectroscopy were performed as previously described44. Briefly, dried tissue extracts were reconstituted in 550 μL phosphate-buffered saline (PBS, 0.1 M, pH 7.4, in D₂O) together with 50 μL 3-(trimethylsilyl) propionic-2,2,3,3-d₄ acid sodium salt (TSP) in D₂O stock solution. After centrifugation at 10,000 × g for 5 min, 500 μL of the supernatant was transferred into a 5 mm NMR tube for 1H NMR analysis on a Bruker Avance 600 MHz spectrometer.

The CHENOMX NMR Suite software (version 11, 600 MHz library; Chenomx Inc., Edmonton, Canada) was used to identify and quantify the tissue metabolites detected by 1H NMR by matching each compound’s peaks with the library according to chemical shift. The concentrations of 52 metabolites were quantified, using TSP as the reference compound (shift at 0 ppm). Metabolite concentrations were reported in micromolar (μM).

Median normalization was applied to minimize systematic variations across samples, and log₂ transformation was subsequently used to reduce skewness and approximate normality of metabolite concentration distributions. Pattern recognition analysis, including PCA, metabolite heatmap, and OPLS-DA, was carried out to identify key metabolic features. To confirm the results of the established models and avoid overfitting, we performed 999 iterations of the permutation test and CV-ANOVA, and the quality of the OPLS-DA model was assessed using R2 and Q2. The differentially abundant metabolites were screened (VIP > 1.0 and p < 0.05). Metabolic pathway analysis was conducted via MetaboAnalyst 5.0. Importance and relevance of pathways are indicated by the impact factor (X axis) and -log (p value) (Y axis).
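
The normalization and transformation steps can be illustrated as below. The sketch assumes sample-wise median normalization (each sample's concentrations divided by that sample's median) and strictly positive concentrations; the exact variant used in the study is not stated, and the function name is hypothetical.

```python
import math
from statistics import median

def median_normalize_log2(sample):
    """Divide one sample's concentrations by their median, then take log2.

    Assumptions: normalization is applied per sample, and all
    concentrations are strictly positive so that log2 is defined.
    """
    m = median(sample)
    return [math.log2(v / m) for v in sample]

# Concentrations (in μM) of three metabolites in one hypothetical sample.
normalized = median_normalize_log2([1.0, 2.0, 4.0])
```

After this step a value of 0 marks the sample median, and each unit step corresponds to a two-fold difference in concentration.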

Transcriptomics analysis

To explore the transcriptomic alterations underlying the prognostic stratification defined by the DLRM-RS, we collected CT scans and RNA sequencing TPM data of 19 patients from the TCIA to reveal the molecular mechanisms associated with the different prognosis groups. Patients were grouped into high- and low-risk groups defined by the DLRM-RS. To explore and visualize the biological characteristics of the risk groups, GSEA based on GO was performed using the GSEA software (version 4.0.2, Broad Institute, San Diego, CA, USA)52. A total of 1000 permutations were used, and permutation was performed based on gene sets. A weighted enrichment statistic was applied, using the signal-to-noise ratio as the metric for ranking genes. No additional normalization was applied within GSEA. As in the default mode, gene sets smaller than 15 or larger than 200 genes were excluded from analysis. For visualization of GSEA results, the EnrichmentMap (version 3.5.0)53 application within the Cytoscape environment (version 3.10.3)54 was used. Gene sets were included if they were identified at an FDR < 0.05. For gene-set-similarity filtering, data set edges were set automatically. A combined Jaccard and Overlap metric was used, applying a cutoff of 0.375. For gene set clustering, AutoAnnotate (version 1.5.2)53 was used, which automatically runs clusterMaker2 to generate clusters. The gene-set-similarity coefficient was utilized for edge weighting.
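
The gene-set-similarity filter can be sketched as a weighted mean of the Jaccard and overlap coefficients. The weighting constant is not reported in the text; the sketch assumes an equal weighting (k = 0.5). The function name and gene sets are hypothetical, and both sets are assumed to be non-empty.

```python
def combined_similarity(set_a, set_b, k=0.5):
    """Weighted mean of overlap and Jaccard coefficients of two gene sets.

    Assumption: k = 0.5 (equal weighting); the value used in the study
    is not stated.
    """
    a, b = set(set_a), set(set_b)
    shared = len(a & b)
    jaccard = shared / len(a | b)
    overlap = shared / min(len(a), len(b))
    return k * overlap + (1.0 - k) * jaccard

# Two hypothetical gene sets sharing two genes.
set_1 = {"G1", "G2", "G3", "G4"}
set_2 = {"G3", "G4", "G5", "G6", "G7", "G8"}
similarity = combined_similarity(set_1, set_2)
connected = similarity >= 0.375  # an edge is drawn at or above the cutoff
```

Here the Jaccard coefficient is 2/8 and the overlap coefficient is 2/4, so the combined value sits exactly at the 0.375 cutoff.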

In addition, GSEA was employed to investigate functional enrichment differences between the high- and low-risk groups defined by the DLRM-RS, based on the GO biological process and KEGG gene sets. Pathways meeting the threshold of |NES| > 1 and FDR < 0.05 were considered significantly enriched. The CIBERSORT algorithm was employed to evaluate differences in immune cell infiltration between the high- and low-risk groups4.

To validate the prognostic relevance of the identified metabolic pathways, we obtained RNA sequencing data (TPM values) and clinical follow-up information for 417 patients from the TCGA cohort. GSVA was performed based on KEGG pathway gene sets to calculate enrichment scores for each sample. For each pathway, patients were stratified into high- and low-score groups using the median GSVA score as the cutoff. Kaplan-Meier survival analysis was then conducted to evaluate differences in OS between the groups.

Statistical analysis

All statistical analyses were conducted using R software (version 4.4.2), Python (version 3.8), R-based MetaboAnalyst 5.0 platform (https://www.metaboanalyst.ca), Cytoscape software (version 3.10.3), GSEA software (version 4.0.2, Broad Institute, San Diego, CA, USA), and SIMCA (version 14.1). Continuous variables were expressed as mean ± standard deviation for normally distributed data or median with interquartile range (IQR) for non-normally distributed data. Categorical variables were presented as frequencies and percentages. Group comparisons for continuous variables were performed using unpaired t-tests or Mann-Whitney U tests as appropriate. Categorical variables were compared using the Chi-square test or Fisher’s exact test. All statistical tests were two-sided, with a p-value < 0.05 considered statistically significant.

Ethics statement

This multicenter retrospective study was approved by the ethics committees of the Shaoxing People’s Hospital (2020-K-Y-143-01), the First Affiliated Hospital of Ningbo University (2025-189A), Cancer Hospital of Shantou University Medical College (ChiCTR2300073613), and the Second Affiliated Hospital of Shantou University Medical College (2020-17). This research was conducted in accordance with the guidelines of the Helsinki Declaration and the principles of Good Clinical Practice. Given the retrospective nature of the study and the use of anonymized data, the requirement for informed consent was waived. The study was conducted and reported in accordance with the CheckList for EvaluAtion of Radiomics research (CLEAR)42 and the transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD)43 guidelines to ensure comprehensive and transparent reporting.