Introduction

Colorectal cancer (CRC) is among the most prevalent malignancies worldwide and ranks as the fourth leading cause of cancer-related mortality. By 2030, its global incidence is projected to rise by 60%, with over 2.2 million new cases and approximately 1.1 million deaths annually1. CRC progression is primarily driven by the cumulative acquisition of genetic and epigenetic alterations in colonic epithelial cells2. Over the past decade, significant advances in cancer epigenetics have identified aberrant DNA methylation and dysregulated histone modifications as key contributors to CRC pathogenesis3,4,5. Brenner et al. reported that early-stage CRC diagnosis is associated with a five-year survival rate exceeding 90%6,7. However, the asymptomatic nature of early-stage CRC leads to delayed detection, with nearly half of patients diagnosed after hepatic metastases have developed, reducing the five-year survival rate to 14% in cases with distant metastases8. Current therapeutic strategies include surgical resection, radiotherapy, chemotherapy, and targeted therapies, though surgery remains the only potentially curative approach. In China, a considerable proportion of CRC cases are diagnosed at advanced stages, precluding optimal surgical intervention and significantly compromising survival outcomes9. The urgency of early detection and timely treatment in improving prognosis highlights the need for robust biomarkers. Developing novel, reliable biomarkers for early CRC diagnosis is crucial to enhancing treatment efficacy and alleviating disease burden.

RNA molecules undergo extensive post-transcriptional modifications, with over 100 distinct modifications identified to date10. Epitranscriptomic regulation, encompassing chemical modifications of both coding and non-coding RNAs11, plays a pivotal role in RNA metabolism, cellular homeostasis, and post-transcriptional gene regulation12. Alterations in these modifications are increasingly recognized as critical drivers of tumorigenesis. For instance, Shuibin Lin et al. demonstrated that dysregulated RNA modifications selectively enhance the translation of oncogenic transcripts, contributing to hepatocellular carcinoma (HCC) progression13. Furthermore, RNA modifications and their regulatory factors significantly influence the tumor microenvironment (TME)14. Dongliang Li et al. reported that m6A methylation facilitates non-small cell lung cancer (NSCLC) progression by modulating heterogeneous nuclear ribonucleoprotein A2/B1 (HNRNPA2B1)15. These findings underscore the strong association between aberrant RNA modifications and poor cancer prognosis, highlighting their potential as therapeutic targets. Pseudouridine (Ψ), the most abundant and first-identified RNA modification, arises from the enzymatic isomerization of uridine, in which the carbon-nitrogen glycosidic bond is replaced by a carbon-carbon bond16,17. Unlike many reversible RNA modifications, Ψ is stable in mammals and is excreted in urine as a metabolic byproduct18. Elevated urinary Ψ levels have been linked to multiple malignancies, including prostate, liver, gastric, and colorectal cancers, positioning Ψ as a promising biomarker19. Targeting dysregulated post-transcriptional modifications has also emerged as a potential therapeutic strategy in oncology20,21. Dyskerin pseudouridine synthase 1 (DKC1), a key enzyme in Ψ biosynthesis, plays a pivotal role in tumor cell proliferation, invasion, and metastasis across various cancers, including CRC22. Through its regulation of internal ribosome entry site (IRES)-mediated translation and precursor RNA processing, DKC1 modulates the expression of cancer-related genes, thereby promoting tumor progression and metastasis22,23. Its dysregulation is strongly correlated with poor prognosis24,25. Investigating the role of Ψ in CRC may provide deeper insights into disease pathogenesis, facilitating more precise molecular classification and personalized therapeutic interventions.

Single-cell RNA sequencing (scRNA-seq), an advanced high-throughput technology, enables genome-wide transcriptomic and epigenomic profiling at single-cell resolution26. This approach is essential for identifying clinically relevant tumor subpopulations and has become an indispensable tool for dissecting tumorigenesis and intratumoral heterogeneity27. Since its introduction in 200928, scRNA-seq has significantly advanced cancer research, addressing key challenges in CRC, glioblastoma, HCC, metastatic renal cell carcinoma, and breast and lung adenocarcinomas29,30,31,32,33.

This study integrates bioinformatics approaches to identify and characterize key pseudouridine-related genes (PRGs) in CRC. By leveraging single-cell data, it examines the functional relevance of PRGs, assesses their prognostic significance, and explores their roles in tumor biology and immune infiltration. These findings aim to provide a theoretical framework for understanding the contribution of PRGs to CRC pathogenesis, offering new perspectives for targeted therapeutic strategies and prognostic refinement in CRC management.

Materials and methods

Data source

The training dataset utilized in this study was The Cancer Genome Atlas (TCGA)-CRC cohort, obtained from the TCGA database (http://cancergenome.nih.gov/), comprising RNA sequencing data, clinical characteristics, and survival information from 606 CRC tissue samples and 48 adjacent normal tissue samples. The validation dataset GSE87211 (GPL13497), consisting of RNA expression profiles from 190 CRC tumor tissues with survival data, was retrieved from the Gene Expression Omnibus (GEO) database (https://www.ncbi.nlm.nih.gov/geo/). Detailed clinical characteristics of both cohorts are provided in Tables S1 and S2. Additionally, the scRNA-seq dataset GSE200997 (GPL21697), containing 16 CRC tumor samples and 7 adjacent normal colon tissue samples, was acquired from the GEO database. A total of 18 PRGs were extracted from the Molecular Signatures Database (MSigDB, https://www.gsea-msigdb.org/gsea/msigdb/index.jsp) by querying the term “pseudouridine.”

Single-cell data analysis

For the single-cell dataset, raw data were loaded with the CreateSeuratObject function from the Seurat package (v4.1.0)34. Quality control excluded cells with nCount_RNA > 10,000, nFeature_RNA < 200, nFeature_RNA > 4,000, or mitochondrial gene content exceeding 20%. Normalization was performed using NormalizeData, followed by identification of highly variable genes using the FindVariableFeatures function with selection.method = vst and nfeatures = 2000. The top 2,000 highly variable genes were scaled with ScaleData before principal component analysis (PCA). The significance of each principal component (PC) was assessed using JackStraw and ScoreJackStraw, and the variance explained by each PC was ranked. PCA inflection points and scree plots were generated to determine the optimal number of PCs for downstream analysis. Cells were clustered at a resolution of 0.5 and visualized using the UMAP algorithm, and cell populations were annotated based on established marker genes from the literature35. The predominant cell type within the dataset was designated as the core cell population. Gene expression patterns across clusters were visualized, and cluster-specific annotations were refined accordingly. To elucidate the biological functions associated with each cell type, all single-cell samples underwent gene set variation analysis (GSVA) using the ReactomeGSA package, enabling pathway enrichment analysis. Differential expression analysis across distinct cell types was conducted using the FindMarkers function, comparing tumor and normal samples. Genes with a fold change exceeding 1.2 were retained, and the differential expression threshold was set at mean |log2 fold change (FC)| > 0.5 and p < 0.05.
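The quality-control step above can be illustrated with a minimal Python sketch. It is not the Seurat code used in the study: the `CellQC` record, the `passes_qc` helper, and the four example cells are hypothetical, and only the four cut-offs come from the text.

```python
from dataclasses import dataclass


@dataclass
class CellQC:
    """Per-cell QC metrics (hypothetical container for this sketch)."""
    barcode: str
    n_count: int       # total counts per cell
    n_feature: int     # number of detected genes per cell
    percent_mt: float  # percentage of counts from mitochondrial genes


def passes_qc(cell, max_count=10_000, min_feature=200,
              max_feature=4_000, max_mt=20.0):
    """Return True if the cell triggers none of the four exclusion rules."""
    return (cell.n_count <= max_count
            and min_feature <= cell.n_feature <= max_feature
            and cell.percent_mt <= max_mt)


# Invented example cells, one per outcome.
cells = [
    CellQC("AAAC", 5_200, 1_800, 4.1),    # kept
    CellQC("AAAG", 15_000, 3_900, 3.0),   # excluded: too many counts
    CellQC("AACA", 900, 150, 2.5),        # excluded: too few detected genes
    CellQC("AACC", 7_000, 2_500, 35.0),   # excluded: high mitochondrial content
]

kept = [c.barcode for c in cells if passes_qc(c)]
print(kept)
```

Because the exclusion rules are strict inequalities, a cell sitting exactly on a threshold is retained in this sketch.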

Acquisition of module genes

Weighted gene co-expression network analysis (WGCNA) was performed using the WGCNA package (v1.72-1)36 to identify gene modules most strongly associated with PRG scores. Prior to clustering, genes were pre-filtered by median absolute deviation (MAD), retaining the top 75%. Tumor samples from the training cohort were subjected to hierarchical clustering, and outliers were excluded to ensure analytical robustness. A soft threshold (β) was determined so that gene interactions were consistent with a scale-free network topology. Hierarchical clustering was then applied, and gene modules were segmented and merged using the dynamic tree-cutting algorithm, with a minimum module size of 30 genes and a merging threshold of MEDissThres = 0.25. PRG scores for all samples were computed using the GSVA package37 and compared between tumor and normal tissues in the training cohort via the Wilcoxon test. The correlation between PRG scores and each module was analyzed, and modules significantly correlated with the GSVA-derived PRG scores (|r| > 0.3) were selected as key modules. The genes within these modules were designated as module genes.
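As a rough illustration of the variability pre-filter described above, the following Python sketch ranks genes by median absolute deviation and keeps the top 75%. The helper names and the toy expression values are hypothetical, and the exact filtering code used with WGCNA is not given in the text.

```python
from statistics import median


def mad(values):
    """Median absolute deviation: median of |x - median(x)|."""
    m = median(values)
    return median(abs(v - m) for v in values)


def filter_top_mad(expr, keep_fraction=0.75):
    """Return the genes with the highest MAD.

    expr maps gene name -> list of expression values across samples.
    """
    ranked = sorted(expr, key=lambda g: mad(expr[g]), reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]


# Invented toy matrix: four genes measured in four samples.
expr = {
    "GENE_A": [1.0, 5.0, 9.0, 13.0],  # highly variable
    "GENE_B": [2.0, 2.1, 2.0, 2.1],   # nearly constant
    "GENE_C": [0.0, 3.0, 6.0, 9.0],   # variable
    "GENE_D": [4.0, 4.0, 4.0, 4.0],   # constant, dropped by the filter
}

print(filter_top_mad(expr))
```

MAD is less sensitive to a few extreme samples than the variance, which is one reason it is often preferred for this kind of pre-filter.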

Identification of candidate genes

Differentially expressed genes (DEGs) between tumor and normal samples in the training cohort were identified using the DESeq2 package (v1.40.2), with selection criteria of |log2FC| > 0.5 and adjusted p < 0.05. The expression profiles of DEGs were visualized using volcano plots (ggplot2, v3.3.5) and heatmaps (pheatmap, v1.0.12). To identify candidate genes, the intersection of core cell DEGs, tumor-normal DEGs, and module genes was taken. A protein-protein interaction (PPI) network for the candidate genes was then constructed using the STRING database (http://www.string-db.org/) with a confidence threshold of 0.4.
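The thresholding and intersection logic can be sketched in Python as follows. All gene names and statistics are invented. The text does not state how the T cell and epithelial cell DEG lists were combined, so this sketch assumes they have already been merged into one `core_cell_degs` set.

```python
def select_degs(results, lfc_cutoff=0.5, padj_cutoff=0.05):
    """Keep genes with |log2FC| > lfc_cutoff and adjusted p < padj_cutoff.

    results maps gene name -> (log2 fold change, adjusted p-value).
    """
    return {gene for gene, (lfc, padj) in results.items()
            if abs(lfc) > lfc_cutoff and padj < padj_cutoff}


# Invented bulk tumor-vs-normal statistics.
bulk_results = {
    "G1": (1.2, 0.001),
    "G2": (-0.8, 0.01),
    "G3": (0.3, 0.001),    # fails the fold-change cut-off
    "G4": (2.0, 0.20),     # fails the adjusted p cut-off
    "G5": (-1.5, 0.0001),
}

bulk_degs = select_degs(bulk_results)
core_cell_degs = {"G1", "G2", "G3", "G6"}  # hypothetical single-cell DEGs
module_genes = {"G1", "G2", "G5", "G7"}    # hypothetical key-module genes

# Candidate genes must appear in all three sets.
candidates = bulk_degs & core_cell_degs & module_genes
print(sorted(candidates))
```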

Construction and validation of prognostic risk model

Candidate prognostic genes significantly associated with CRC prognosis (p < 0.05) were initially screened through univariate Cox regression analysis. The least absolute shrinkage and selection operator (LASSO) method was applied via the glmnet package (v4.1-4)38 (family = “cox”) to refine feature gene selection. The genes selected through LASSO were subsequently used as inputs for multivariate Cox regression modeling, which was further optimized using stepwise selection to establish a final prognostic risk model comprising key prognostic genes and a risk score formula:

$$\text{risk score}=\sum_{i=1}^{n}\left(\text{coef}_{i}\times\text{exp}_{i}\right)$$

The proportional hazards (PH) assumption was tested to validate the risk model (p > 0.05), ensuring that the model met the Cox regression assumptions. The model’s predictive performance was evaluated and validated in both the training and validation cohorts, with samples stratified into high-risk and low-risk groups based on an optimal threshold. Receiver operating characteristic (ROC) curves were generated separately for the two datasets using the timeROC package, and Kaplan–Meier (K–M) survival curves were plotted to compare survival differences between risk groups. The area under the ROC curve (AUC), which accounts for the true positive rate (TPR) and false positive rate (FPR), was used to assess model performance across classification thresholds. An AUC ≤ 0.5 indicated no discriminative ability, equivalent to random guessing. An AUC > 0.5 and ≤ 1 suggested varying levels of classification accuracy, with performance improving as AUC approached 1. An AUC > 0.6 indicated reliable predictive accuracy with clinical reference value. PCA was conducted in both cohorts to evaluate the discriminative power of prognostic genes in CRC classification.
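To make the AUC interpretation concrete, the sketch below computes a plain AUC as the probability that a randomly chosen patient with an event has a higher risk score than a randomly chosen patient without one. This is a simplification. The study used time-dependent ROC curves from the timeROC package, which account for censoring, whereas this sketch assumes a fully observed binary outcome at a fixed horizon. The data are invented.

```python
def auc(scores, labels):
    """Rank-based AUC: P(score of a positive > score of a negative).

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Invented risk scores and outcomes (1 = died before the horizon).
risk = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]
died = [1, 1, 0, 1, 0, 0]

print(round(auc(risk, died), 3))
```

An AUC of 0.5 corresponds to scores that carry no ranking information, which matches the "random guessing" reading given above.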

Construction of independent prognostic model

Univariate Cox regression analysis was performed to identify independent prognostic factors, incorporating clinical variables such as age, TNM stage, tumor stage, radiotherapy status, and risk scores. Clinical factors with p < 0.01 were selected for multivariate Cox regression analysis. To validate the multivariate Cox model, scaled Schoenfeld residuals were employed to confirm compliance with the PH assumption. A nomogram was then constructed using the rms package based on independent prognostic factors, providing 1-, 4-, and 7-year survival predictions. Model accuracy was assessed through calibration curves, and Wilcoxon tests were used to analyze risk score differences across clinical subgroups.

Function and gene mutational analysis for two risk cohorts

We conducted KEGG and GO pathway analyses using gene set enrichment analysis (GSEA) on a preranked gene list39,40,41. Specifically, differential expression analysis between high-risk and low-risk groups in the TCGA-CRC cohort was performed using the DESeq2 package42. Genes were ranked by differential expression magnitude and subjected to GSEA against the GO and KEGG databases. To keep the analysis objective, only the most significant GO terms and KEGG pathways, selected on the basis of statistical significance (p < 0.05), are presented. Additionally, gene mutation profiles in the TCGA-CRC dataset were analyzed using the maftools package. The most frequently mutated genes in each risk group were visualized in a heatmap.

Immune infiltration analysis

The ESTIMATE algorithm from the immunedeconv package was applied to score tumor samples in the training cohort, yielding stromal, immune, and ESTIMATE scores. Differences between high- and low-risk groups were evaluated using the Wilcoxon test (p < 0.05), followed by Spearman correlation analysis between each score and the risk score. To further investigate the involvement of prognostic genes in immune infiltration, CIBERSORT analysis was conducted via the GSVA package to estimate the abundance of 22 immune cell types in tumor samples. The distribution of immune cell subsets across risk groups was first visualized, and differences in immune cell abundance between the two risk groups were analyzed using the Wilcoxon test (p < 0.05). Spearman correlation was then used to relate risk scores to immune cell infiltration levels in the TCGA-CRC dataset. Finally, the immune infiltration results were validated through single-cell analysis. Based on the cell annotation results, a risk score was calculated for each cell using the AddModuleScore function from the R package ‘Seurat’. Using the median risk score as a cutoff, cells were divided into high-risk and low-risk groups and displayed in UMAP and violin plots. T cell activation-related genes were identified from the literature and intersected with the genes in the single-cell data, and differential expression analysis of the intersected genes was then performed between the high-risk and low-risk groups.
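The Spearman correlation used above is the Pearson correlation of the ranks, with tied values sharing their average rank. A minimal pure-Python sketch follows. The function names and the five example samples are hypothetical, and the study itself presumably relied on R's built-in routines.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented risk scores and stromal scores for five samples.
risk_scores = [0.2, 0.5, 0.9, 1.4, 2.0]
stromal_scores = [100.0, 300.0, 250.0, 600.0, 800.0]

print(round(spearman(risk_scores, stromal_scores), 3))
```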

Immunotherapy response and drug sensitivity for two risk cohorts

To predict immune checkpoint blockade (ICB) therapy response, the immunophenoscore (IPS) was retrieved from The Cancer Immunome Atlas (TCIA, https://tcia.at/home), and differences in IPS scores between risk groups were analyzed using the Wilcoxon test (p < 0.05). The impact of immunotherapy response was further evaluated by comparing TIDE scores, Dysfunction scores, and Exclusion scores between risk groups in the training set via Wilcoxon tests. For chemotherapy drug sensitivity prediction, drug response data were obtained from the Genomics of Drug Sensitivity in Cancer (GDSC) database, and the half-maximal inhibitory concentration (IC50) of candidate drugs for patients in the TCGA-CRC dataset was estimated using the oncoPredict package. Inter-group IC50 differences were assessed using the Wilcoxon test, and drugs meeting the criteria p < 0.001 and mean IC50 < 0.1 were selected for further demonstration. These therapeutic responses were predicted through computational analysis.

Functions and upstream regulation network for prognostic genes

To identify prognostic gene-associated regulatory networks, the GeneMANIA database was used to construct a gene interaction network centered on prognostic genes. Prognostic gene-associated miRNAs were retrieved from the miRDB database using the multiMiR package, while associated lncRNAs were obtained from the same database based on miRNA interactions. An mRNA-miRNA-lncRNA regulatory network was then constructed and visualized using Cytoscape.

Pseudo-temporal analysis

Pseudo-temporal analysis was conducted using the Monocle package (v2.24.1) to construct single-cell trajectory maps. Based on highly variable genes, cellular pseudo-temporal trajectories were separately established for T cells and epithelial cells using the DDRTree method for dimensionality reduction and cell ordering. The dynamic expression patterns of prognostic genes were visualized along pseudo-time in both cell types.

RNA extraction and quantitative real-time polymerase chain reaction (qRT-PCR)

Tissue samples were collected from 40 patients with CRC at Ningxia Medical University General Hospital during surgical procedures. Total RNA was extracted using TRIzol (Takara, China) and reverse transcribed into cDNA with PrimeScript RTase (Takara, China). Quantitative real-time PCR (qRT-PCR) was performed on a Bio-Rad CFX96 System (Bio-Rad, USA), with primer sequences detailed in Table S3.

Immunohistochemistry (IHC) validation of model genes

To assess protein-level expression differences, paraffin-embedded CRC tissue samples from the Pathology Department of Ningxia Medical University General Hospital were analyzed, including matched tumor and adjacent non-tumorous tissues (Table S4). The expression levels of BCL10, TAF1B, and WWTR1 in CRC and adjacent normal tissues were evaluated using immunohistochemistry (IHC). The primary antibodies used were BCL10 (Abcam, EP606Y), TAF1B (Abcam, ab238759), and WWTR1 (Abcam, CL0371). For each tissue section, at least five high-resolution images of randomly selected fields were captured using consistent microscope settings. These images were processed in ImageJ (v1.53e), with Color Deconvolution applied to separate DAB (brown) and hematoxylin (blue) signals. The DAB channel was thresholded to quantify positively stained areas, and the positive area and integrated optical density (IOD) were measured. The average values from multiple fields were used to derive the final quantitative expression levels. Statistical analyses were conducted in GraphPad Prism (v9.0.0) using paired t-tests or other appropriate statistical methods to compare protein expression levels between tumor and normal tissues.

siRNA transfection

Normal colonic epithelial cells (NCM460) and CRC cell lines (HT29, HCT116, SW480, and SW620) were obtained from the Cell Bank of the Chinese Academy of Sciences (Shanghai, China). Small interfering RNAs (siRNAs) targeting BCL10 and the corresponding negative control were purchased from Sangon Biotech (Shanghai, China). Cells were seeded into 6-well plates and transfected at 60–70% confluence using Lipofectamine 3000 (Thermo Fisher Scientific) according to the manufacturer’s instructions. The final siRNA concentration was 50 nM unless otherwise specified. After 6 h of transfection, the medium was replaced with fresh complete culture medium. Cells were harvested 48 h post-transfection for downstream analyses. Knockdown efficiency of BCL10 was assessed by qRT-PCR and WB.

Western blotting (WB)

Total protein was extracted from cultured cells using RIPA lysis buffer (Beyotime, Shanghai) supplemented with 1% protease and phosphatase inhibitor cocktail. Cell lysates were incubated on ice for 30 min and centrifuged at 12,000 × g for 15 min at 4 °C. The resulting supernatants were collected, and protein concentrations were determined using a BCA Protein Assay Kit (Beyotime). Equal amounts of protein (20–30 µg per lane) were separated by SDS-PAGE and transferred onto PVDF membranes (Millipore). Membranes were blocked with 5% non-fat milk in TBST (0.1% Tween-20) for 1 h at room temperature and then incubated overnight at 4 °C with primary antibodies. After three washes in TBST, membranes were incubated with HRP-conjugated secondary antibodies for 1 h at room temperature. Protein bands were visualized using an ECL chemiluminescence detection system (Thermo Fisher Scientific) and imaged on a Tanon 5200 platform under exposure conditions within the linear detection range. β-Actin was used as the internal loading control. Detailed information on primary and secondary antibodies is provided in Table S5.

Cell Counting Kit-8 (CCK-8) assay

Cells were collected, resuspended in complete culture medium, adjusted to a density of 5 × 10⁴ cells/mL, and seeded into 96-well plates at 100 µL per well, with five technical replicates per group. Plates were incubated overnight at 37 °C in a humidified 5% CO₂ atmosphere to allow cell attachment. At 0, 24, 48, 72 and 96 h, 10% CCK-8 working solution (Dojindo, Japan) was added to each well and incubated for 1 h at 37 °C in the dark. Absorbance at 450 nm was recorded using a microplate reader, and the mean values were used to construct cell viability and proliferation curves.

Wound-healing assay

Cells were seeded into 6-well plates at a density of 5 × 10⁵ cells/mL and cultured until a confluent monolayer formed. After washing with PBS, a linear scratch was generated using a 200 µL pipette tip. Detached cells and debris were removed by gentle PBS rinsing, and images were captured at 0 h. The cultures were then incubated for an additional 24 h before imaging again. Wound width reduction between 0 and 24 h was quantified in ImageJ (v1.53e) based on the scale bar, and the percentage of wound closure was calculated to evaluate cell migratory capacity.
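The text does not give the formula used for wound closure. A commonly used definition, assumed here for illustration only, with $W_{0\,\mathrm{h}}$ and $W_{24\,\mathrm{h}}$ (symbols introduced for this sketch) denoting the wound width at 0 h and 24 h, is:

$$\text{wound closure}\ (\%)=\frac{W_{0\,\mathrm{h}}-W_{24\,\mathrm{h}}}{W_{0\,\mathrm{h}}}\times 100$$

Under this definition, a scratch that narrows from 500 µm to 200 µm (hypothetical values) corresponds to 60% closure.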

Transwell assay

Cells were harvested, enzymatically digested, and resuspended in serum-free medium at a concentration of 5 × 10⁴ cells/100 µL. Migration assays were performed using uncoated Transwell chambers, whereas invasion assays utilized upper chambers precoated with Matrigel (Corning) and incubated at 37 °C for 2 h to allow gel solidification. The lower chambers were filled with 600 µL of complete medium containing 20% FBS as a chemoattractant. After 48 h of incubation, cells remaining on the upper membrane surface were gently removed. Cells that migrated or invaded to the lower surface were fixed with 4% paraformaldehyde for 20 min and stained with 0.1% crystal violet for 15 min. Images were captured from five randomly selected fields, and stained cells were quantified to evaluate migratory and invasive capacities.

Statistical analysis

All statistical analyses were performed using R software (v4.2.2).

Results

Identification of seven cell types in single-cell dataset

First, we performed single-cell analysis on the GSE200997 dataset after quality control (Fig. S1). Following PCA-based dimensionality reduction, clustering identified 22 cell clusters (Fig. S2, S3). Marker gene expression patterns enabled the annotation of seven major cell types: T cells, B cells, epithelial cells, myeloid cells, fibroblasts, endothelial cells, and a subset of unclassified cells (Fig. 1A). The proportional distribution of these cell types across single-cell samples (Fig. 1B) indicated a predominance of T cells, followed by epithelial cells, highlighting their central roles. Epithelial cells and T cells are crucial components of the TME, influencing tumor progression, immune evasion, and treatment outcomes. GSVA was subsequently performed on all single-cell samples, revealing the 30 most significantly enriched pathways (Fig. 1C), among which T cells and epithelial cells were notably associated with TASK channels, hydroxycarboxylic acid-binding receptors, and ATP-sensitive potassium channels. Differential analysis between tumor and normal groups showed that T cells and epithelial cells had 2,748 and 5,880 differentially expressed genes, respectively, of which 2,203 and 3,568 were upregulated and 545 and 2,312 were downregulated (Fig. 1D).

Fig. 1

Single-Cell Dataset Analysis. (A) UMAP representation of cell phenotype distributions. (B) Heatmap showing cell-type proportions across samples. (C) Functional enrichment analysis of cell clusters. (D) Volcano plot of differentially expressed genes in single-cell data.

Recognition of BCL10, TAF1B, and WWTR1 as prognostic genes

To screen genes related to pseudouridine, we performed WGCNA on the large-scale RNA-Seq data. No outlier samples were identified through cluster analysis. The soft-thresholding power (β) was set to 24, the value at which the scale-free topology model fit index (R²) surpassed 0.9 (red line), ensuring that the network conformed to a scale-free topology (Fig. S4A). After hierarchical clustering and module merging, nine gene modules were identified (Fig. S4B). PRG scores were significantly elevated in tumor samples compared to normal tissues in the training set (Fig. 2A). Correlation analysis revealed that five modules (paleturquoise, darkolivegreen, midnightblue, skyblue, and grey60) exhibited significant associations with PRG scores, collectively encompassing 2,990 module genes (Fig. 2B). Differential expression analysis of the large-scale RNA-Seq data identified 9,772 DEGs between tumor and normal groups within the training set, including 5,204 upregulated and 4,568 downregulated genes (Fig. 2C, S5). By intersecting DEGs from core cell populations (4,762 from T cells and 4,525 from epithelial cells), the 9,772 DEGs between tumor and normal samples, and the 2,990 module genes, 116 candidate genes were identified (Fig. 2D).

Fig. 2

Recognition of BCL10, TAF1B, and WWTR1 as prognostic genes. (A) Box plots comparing GSVA scores between normal and tumor samples. (B) Heatmap depicting correlations between gene modules and PRGs. (C) Volcano plot of 9,772 differentially expressed genes. (D) Venn diagram illustrating the intersection of genes. (E) Forest plot from univariate Cox analysis for prognostic gene selection. (F) Selection of prognostic genes using LASSO Cox analysis. (G) Forest plot of the prognostic model.

Univariate Cox regression analysis of these 116 candidate genes identified 14 genes significantly associated with CRC prognosis (p < 0.05) (Fig. 2E). In the LASSO model, with the penalty parameter λ set to 0.0104603 on the basis of the partial-likelihood deviance, seven genes were selected for multivariate Cox regression analysis (Fig. 2F). Ultimately, three prognostic genes (BCL10, TAF1B, and WWTR1) were incorporated into the final multivariate Cox risk model (Fig. 2G).

$$\text{risk score}=\sum_{i=1}^{3}\left(\text{coef}_{i}\times\text{exp}_{i}\right)$$

In this formula, “coef” represents the regression coefficient of each gene, while “exp” denotes that gene’s expression level in the sample. Specifically, the coef values for BCL10, TAF1B, and WWTR1 were −0.25425, −0.55816, and 0.33448, respectively. In the PH assumption test, none of these genes reached statistical significance (p > 0.05), indicating that the risk model satisfied the PH assumption (Fig. S6).
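As a worked illustration of the risk score formula, the Python sketch below applies the three reported coefficients to one sample. The expression values and the cut-off of 0.0 are hypothetical. The study used an optimal threshold whose value is not stated here.

```python
# Regression coefficients reported in the text for the three prognostic genes.
COEFS = {"BCL10": -0.25425, "TAF1B": -0.55816, "WWTR1": 0.33448}


def risk_score(expression):
    """Sum of coefficient * expression over the three prognostic genes."""
    return sum(COEFS[gene] * expression[gene] for gene in COEFS)


def stratify(score, cutoff):
    """Assign a sample to the high- or low-risk group by a score cut-off."""
    return "high" if score > cutoff else "low"


# Hypothetical expression values for a single sample.
sample = {"BCL10": 2.0, "TAF1B": 1.0, "WWTR1": 3.0}

score = risk_score(sample)
group = stratify(score, cutoff=0.0)  # 0.0 is an illustrative cut-off only
print(round(score, 5), group)
```

Because the BCL10 and TAF1B coefficients are negative and the WWTR1 coefficient is positive, higher WWTR1 expression raises the score while higher BCL10 or TAF1B expression lowers it.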

The high-risk group exhibits a worse prognosis

In both the training and validation sets, patients were stratified into high- and low-risk cohorts based on the optimal threshold (Fig. 3A-B). The risk score effectively distinguished CRC prognosis across both datasets (Fig. 3C-D). In the training set, the AUC values for 1-, 4-, and 7-year survival were 0.626, 0.654, and 0.635, respectively, suggesting moderate predictive accuracy with clinical relevance for CRC prognosis (Fig. 3E). Similarly, in the validation set, AUC values for 1-, 4-, and 7-year survival all exceeded 0.6, further supporting the model’s predictive capability (Fig. 3F). KM survival curves demonstrated a significant difference between high- and low-risk cohorts in both datasets (p < 0.05), with low-risk patients exhibiting markedly better survival outcomes (Fig. 3G-H). PCA confirmed distinct clustering patterns between risk groups in both datasets, underscoring the ability of the selected prognostic genes to effectively stratify patients based on risk (Fig. 3I-J).

Fig. 3

Evaluation and Validation of the Prognostic Model. (A-B) Risk score distribution curves in training and validation cohorts. (C-D) Scatter plots showing the distribution of risk scores in training and validation cohorts. (E-F) ROC curve analysis for 1-, 4-, and 7-year survival predictions in training and validation cohorts. (G-H) Kaplan-Meier survival analysis comparing high- and low-risk groups in training and validation cohorts. (I-J) PCA visualization of risk group separation in training and validation cohorts.

Construction of a well-performing nomogram

Following univariate and multivariate Cox regression analyses and PH assumption verification, risk score, age, T stage, N stage, and tumor stage were identified as independent prognostic factors for constructing the final multivariate model. This model demonstrated strong predictive performance, with p < 0.0001 and a C-index of 0.77 (Fig. 4A-C). To enhance clinical applicability, a prognostic nomogram incorporating these independent prognostic factors was developed (Fig. 4D). The calibration curve showed that predicted survival probabilities at 1, 4, and 7 years closely aligned with actual survival outcomes, validating the model’s reliability (Fig. 4E). Furthermore, there was a significant difference in risk scores between subgroups of tumor stage and N stage (Fig. 4F-G).
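The C-index reported above measures how often the model ranks a patient who died earlier as higher risk. The sketch below shows a simplified Harrell-style computation on invented data. It skips pairs with tied survival times and is not the routine used in the study.

```python
def concordance_index(times, events, risks):
    """Simplified Harrell's C.

    A pair (i, j) is comparable when subject i had an observed event and
    a strictly shorter follow-up time than subject j. The pair is
    concordant when subject i also has the higher predicted risk.
    Ties in predicted risk count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored subjects cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Invented follow-up times (years), event flags (1 = death), and risk scores.
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
risks = [0.9, 0.4, 0.6, 0.1]

print(concordance_index(times, events, risks))
```

A value of 0.5 indicates no ranking ability and 1.0 indicates perfect ranking, so the reported 0.77 lies between the two.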

Fig. 4

Construction and Validation of an Independent Prognostic Model. (A) Forest plot of univariate Cox analysis for clinical factors. (B) Forest plot of independent prognostic factors in the multivariate Cox model. (C) PH assumption test for the independent prognostic model. (D) Nomogram for individualized survival prediction based on independent prognostic factors. (E) Calibration curve assessing the accuracy of the nomogram. (F-G) Differences in risk scores across pathological feature subgroups. (*p < 0.05).

Differences between the two cohorts in functional and somatic mutations

In GO analysis, genes upregulated in the high-risk cohort were predominantly enriched in pathways related to collagen trimer formation, whereas those highly expressed in the low-risk cohort were significantly associated with DNA replication initiation (Fig. 5A). In KEGG pathway analysis, genes overexpressed in the high-risk cohort showed significant enrichment in the calcium signaling pathway, while those in the low-risk cohort were primarily enriched in the homologous recombination pathway (Fig. 5B). Mutation analysis in the training set revealed that missense mutations constituted the most frequent alteration type, with SNPs being the predominant variant form. Among SNP mutations, C > T transitions occurred at the highest frequency. On average, each sample carried 88 variants. The ten most frequently mutated genes were TTN, APC, MUC16, SYNE1, TP53, KRAS, FAT4, RYR2, PIK3CA, and CSMD3 (Fig. 5C). In both high-risk (n = 269) and low-risk (n = 307) cohorts, APC exhibited a consistently high mutation frequency, ranking among the six most frequently mutated genes (Fig. 5D). Notably, TP53 mutations were more prevalent in the high-risk cohort, whereas KRAS mutations were more frequent in the low-risk cohort.

Fig. 5

Functional Enrichment and Mutation Analysis in High- and Low-Risk Groups. (A) GO enrichment analysis of DEGs between high- and low-risk groups. (B) KEGG enrichment analysis of DEGs between high- and low-risk groups. (C) Overview of gene mutation profiles from TCGA and GEO datasets. (D) Comparative mutation analysis of the top six most frequently mutated genes in high- and low-risk groups.

Immuno-related analysis and drug sensitivity

Stromal, immune, and ESTIMATE scores were significantly elevated in the high-risk cohort (p < 0.05), indicating a greater degree of immune cell infiltration compared to the low-risk cohort (Fig. 6A). Spearman correlation analysis demonstrated positive associations between risk scores and stromal (r = 0.57), immune (r = 0.37), and ESTIMATE scores (r = 0.51) (Fig. 6B-D). The composition of 22 immune cell types was analyzed across risk groups (Fig. 6E), revealing significant differences (p < 0.05) in 11 immune cell subsets, including resting and activated CD4 memory T cells, as well as regulatory T cells (Fig. 6F). Among these, regulatory T cells exhibited the strongest positive correlation with risk scores, whereas activated CD4 memory T cells showed the strongest negative correlation (Fig. 6G). Immune checkpoint analysis identified significant differences in IPS scores between risk cohorts for CTLA4/PD1 (Fig. 6H). Additionally, TIDE, dysfunction, and exclusion scores were significantly elevated in the high-risk cohort, suggesting a more immunosuppressive TME characterized by increased immune cell exclusion and impaired immune function (Fig. 6I). In the single-cell dataset, UMAP dimensionality reduction showed the distribution of cells with different risk scores (Fig. S7A), and the violin plot demonstrated significant differences in risk scores across immune cell types (Fig. S7B). Additionally, T cell activation-related genes were differentially expressed between the high-risk and low-risk groups (Fig. S7C), suggesting that risk status may be associated with T cell activation. Drug sensitivity analysis revealed that the predicted IC50 values of bortezomib, docetaxel, and daporinad differed significantly (p < 0.001) between risk groups, with mean IC50 values below 0.1, indicating distinct predicted therapeutic responses across cohorts (Fig. 6J).

Fig. 6

Exploration of Tumor Immune Status and Drug Sensitivity. (A) Comparison of ESTIMATE scores between high- and low-risk groups. (B-D) Correlation analysis between stromal, immune, and ESTIMATE scores with risk scores. (E) Heatmap showing the distribution of immune cell abundances across high- and low-risk groups. (F) Box plots illustrating significant differences in immune cell abundance between high- and low-risk groups. (G) Correlation analysis between risk scores and immune cell infiltration levels. (H-I) Comparison of IPS and TIDE scores between high- and low-risk groups. (J) Differential drug sensitivity analysis between high- and low-risk groups. (***p < 0.001, ****p < 0.0001)

The co-expression and mRNA-miRNA-lncRNA networks

To explore the potential regulatory mechanisms of prognostic genes in CRC, the 20 co-expressed genes most strongly associated with the prognostic genes were integrated with the three prognostic genes to construct a co-expression network using the GeneMANIA database (Fig. 7A). A total of 102 miRNAs linked to the prognostic genes were identified, and 57 lncRNAs were retrieved based on these miRNAs. Ultimately, 58 miRNAs were found to form a complete prognostic gene (mRNA)-miRNA-lncRNA regulatory axis. Thus, the mRNA-miRNA-lncRNA network comprised three prognostic genes, 58 miRNAs, and 57 lncRNAs (Fig. 7B), including the axes TAF1B-hsa-miR-1304-3p-LINC00632, BCL10-hsa-miR-4743-5p-LINC01551, and WWTR1-hsa-miR-335-5p-LINC00324. LINC00632 expression has been associated with various cancers43. LINC01551 has been reported to significantly reduce the proliferation, invasion, and metastasis of nasopharyngeal carcinoma (NPC) cells44. hsa-miR-335-5p has been described as a prognostic marker for gastric cancer45. LINC00324 regulates the proliferation, migration, and invasion of CRC cells and may be a potential therapeutic target for CRC46. These observations suggest that the prognostic genes may form complex regulatory axes with miRNAs and lncRNAs in the ceRNA network, affecting the proliferation, migration, and invasion of CRC cells.
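
The assembly of complete mRNA-miRNA-lncRNA axes described above can be sketched as a simple join between two lookup tables. The following Python sketch is illustrative only and does not reproduce the databases or tools used in this study: the function name is hypothetical, the pairings are limited to the three axes named in the text, and `hsa-miR-placeholder` is an invented entry (not a real miRNA) included to show how a miRNA without a lncRNA partner drops out of the network, in the same way that 102 candidate miRNAs were reduced to 58.

```python
def build_cerna_axes(mrna_to_mirnas, mirna_to_lncrnas):
    """Return (mRNA, miRNA, lncRNA) triplets for miRNAs that have a lncRNA partner."""
    axes = []
    for mrna, mirnas in mrna_to_mirnas.items():
        for mirna in mirnas:
            # miRNAs absent from the second table yield no complete axis.
            for lncrna in mirna_to_lncrnas.get(mirna, []):
                axes.append((mrna, mirna, lncrna))
    return axes


# Pairings taken from the axes named in the text, plus one invented placeholder.
mrna_to_mirnas = {
    "TAF1B": ["hsa-miR-1304-3p", "hsa-miR-placeholder"],
    "BCL10": ["hsa-miR-4743-5p"],
    "WWTR1": ["hsa-miR-335-5p"],
}
mirna_to_lncrnas = {
    "hsa-miR-1304-3p": ["LINC00632"],
    "hsa-miR-4743-5p": ["LINC01551"],
    "hsa-miR-335-5p": ["LINC00324"],
}

axes = build_cerna_axes(mrna_to_mirnas, mirna_to_lncrnas)
retained_mirnas = {mirna for _, mirna, _ in axes}

for mrna, mirna, lncrna in axes:
    print(f"{mrna}-{mirna}-{lncrna}")
```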

Fig. 7

Exploration of Prognostic Gene Interactions and Potential Regulatory Relationships. (A) GeneMANIA network diagram. (B) ceRNA regulatory network diagram.

Expression patterns of prognostic genes in cell pseudotime trajectories

In T cells, pseudo-temporal trajectory analysis based on highly variable genes revealed distinct differentiation stages (Fig. 8A-C). All three prognostic genes exhibited relatively high expression at different time points within the trajectory (Fig. 8D), suggesting that BCL10, TAF1B, and WWTR1 may play a role in the development, differentiation, or functional regulation of T cells. Similarly, pseudo-temporal analysis in epithelial cells identified multiple differentiation states (Fig. 8E-G). BCL10 expression was elevated in states 1, 2, 4, 5, and 6 but reduced in states 3 and 7 (Fig. 8H), indicating that BCL10 expression changes dynamically across the differentiation or functional states of epithelial cells. TAF1B expression was notably higher at the terminal differentiation stages (states 5 and 6). WWTR1 displayed relatively low overall expression, which increased towards the end of differentiation. These patterns suggest that TAF1B is associated with the late differentiation or functional maturation of epithelial cells, whereas WWTR1 may play a specific role in the later stages of epithelial cell differentiation.

Fig. 8

Pseudo-Temporal Trajectory Analysis. (A) Highly variable genes in T cells. (B) Pseudo-temporal dynamics of T cells. (C) Pseudo-temporal staging of T cells, with three distinct developmental stages indicated by color-coded dots. (D) Pseudo-temporal expression patterns of prognostic genes in T cells. (E) Highly variable genes in epithelial cells. (F) Pseudo-temporal dynamics of epithelial cells. (G) Pseudo-temporal staging of epithelial cells, with three distinct developmental stages indicated by color-coded dots. (H) Pseudo-temporal expression patterns of prognostic genes in epithelial cells.

Differential expression patterns of BCL10, TAF1B, and WWTR1

To examine the expression levels of BCL10, TAF1B, and WWTR1 in CRC and normal tissues, their transcriptomic profiles were retrieved from the TCGA database and subjected to comparative analysis. qRT-PCR was performed to validate these differential expression patterns. As shown in Fig. 9A-F, BCL10 and WWTR1 were expressed at higher levels in normal colorectal tissues, whereas TAF1B expression was elevated in CRC tumor tissues.

Fig. 9

Expression Profiles of PRGs in CRC and Normal Tissues. (A-C) Expression levels of BCL10, TAF1B, and WWTR1 in CRC tumor and normal tissues from the TCGA database. (D-F) qRT-PCR validation of BCL10, TAF1B, and WWTR1 expression in CRC tumor and normal tissues. (G, I, K) IHC staining for BCL10, TAF1B, and WWTR1 in CRC tumor and normal tissues. (H, J, L) Quantitative analysis of IHC staining intensity for BCL10, TAF1B, and WWTR1 in CRC and normal tissues, with statistical significance assessed using independent t-tests (*p < 0.05, **p < 0.01, and ***p < 0.001).

IHC analysis was conducted to assess the protein expression of BCL10, TAF1B, and WWTR1 (Fig. 9G-L). Consistent with the transcriptomic findings, BCL10 and WWTR1 exhibited significantly higher expression in normal colorectal tissues, as indicated by enhanced staining intensity and a higher proportion of positively stained cells. In contrast, TAF1B expression was markedly elevated in tumor tissues, with intense nuclear staining predominating in cancerous regions.

Knockdown of BCL10 promotes proliferation, migration, and invasion of colorectal cancer cells

To further investigate the functional role of BCL10 in CRC, we first examined its protein expression in a normal human colonic epithelial cell line (NCM460) and four CRC cell lines (SW620, HT29, HCT116, and SW480). WB analysis showed that BCL10 expression was markedly reduced in all four CRC cell lines compared with NCM460 (Fig. 10A–B). SW480 cells were selected for subsequent experiments. Transfection with si-BCL10 efficiently suppressed BCL10 expression at both the mRNA and protein levels relative to the si-NC, as confirmed by qRT-PCR and WB (Fig. 10C–D). CCK-8 assays demonstrated that BCL10 knockdown significantly enhanced the proliferative capacity of SW480 cells over time, particularly at later time points after transfection (Fig. 10E). Wound-healing assays showed that the migration distance at 24 h was significantly greater in the si-BCL10 group than in the si-NC group, indicating accelerated migratory ability following BCL10 silencing (Fig. 10F–G). Consistently, Transwell assays revealed that the numbers of migrated and invaded cells were markedly increased in the si-BCL10 group compared with controls (Fig. 10H–I).

Fig. 10

BCL10 expression in CRC cell lines and its effects on proliferation, migration, and invasion. (A) WB analysis of BCL10 protein levels in the normal human colonic epithelial cell line NCM460 and four CRC cell lines (SW620, HT29, HCT116, and SW480). (B) Quantification of relative BCL10 protein expression in the indicated cell lines. (C) qRT-PCR analysis of BCL10 mRNA expression in SW480 cells transfected with si-NC or si-BCL10. (D) WB confirmation of BCL10 knockdown efficiency in SW480 cells. (E) CCK-8 assay showing the proliferative capacity of SW480 cells at 24, 48, 72, and 96 h after transfection with si-NC or si-BCL10. (F,G) Representative images and quantitative analysis of wound-healing assays in SW480 cells following BCL10 knockdown. (H,I) Representative images and quantitative analysis of Transwell migration and invasion assays in SW480 cells transfected with si-NC or si-BCL10. Data are presented as mean ± SD (n = 3). *p < 0.05, **p < 0.01, ***p < 0.001, ****p < 0.0001.

Discussion

Recent advancements in early detection techniques for CRC have significantly improved patient survival rates47. Despite these developments, CRC remains the second leading cause of cancer-related mortality worldwide48. Identifying effective prognostic biomarkers and therapeutic targets at earlier disease stages is critical for improving clinical outcomes49. Publicly available databases such as TCGA, GEO, and MSigDB have facilitated the development of various risk-scoring models for CRC prognosis prediction. However, no prognostic models incorporating pseudouridine modification as a predictive factor have been established to date.

In this study, bioinformatics approaches were utilized to identify 116 potential PRGs in CRC. Through co-expression network analysis and statistical modeling, including univariate and multivariate Cox regression analyses, three key prognostic genes—BCL10, TAF1B, and WWTR1—were identified. TAF1B plays a pivotal role in the RNA polymerase I preinitiation complex (PIC), which is essential for ribosomal biogenesis and cellular proliferation50. Its knockdown has been shown to regulate the p53-miR-101 axis, underscoring its influence on RNA polymerase I activity and its role in tumor progression, particularly in HCC51. Elevated TAF1B expression has been observed in gastric cancer, where it correlates with increased tumor growth and poor prognosis. Functional studies indicate that TAF1B silencing inhibits tumor proliferation and reduces viability in gastric cancer cells, as well as tumor burden in xenograft models52. In HCC, overexpression of TAF1B is linked to unfavorable clinical outcomes, whereas its deletion induces nucleolar stress, apoptosis, and activation of the p53-miR-101 pathway51. Given its involvement in survival pathway regulation and apoptosis induction, TAF1B is being explored as a potential therapeutic target in CRC, particularly in tumors exhibiting microsatellite instability53. BCL10 is implicated in RNA modifications and post-transcriptional dysregulation, often undergoing RNA-level mutations that evade genomic detection54. Research by Qin et al. highlights its role in ubiquitination and oncogenic signaling, facilitating tumor growth and immune evasion55. BCL10 is a key component of the NF-κB signaling cascade, a critical pathway governing cancer cell survival and proliferation56. In breast cancer, BCL10 enhances NF-κB activation, promoting tumor cell viability and resistance to chemotherapy57. Similarly, in CRC, aberrant BCL10 expression is associated with aggressive tumor phenotypes and poor clinical prognosis58. 
Beyond its direct impact on tumor progression, BCL10 contributes to an inflammatory TME, modulating immune responses and supporting immune evasion. These findings position BCL10 as a promising target for cancer immunotherapy59. WWTR1 (also known as TAZ) has been implicated in cell cycle regulation and tumor progression. In Merkel cell carcinoma (MCC), WWTR1 influences tumor development through TEA domain (TEAD)-dependent transcriptional repression mediated by the Merkel cell polyomavirus (MCPyV) large T antigen (LT)60. In CRC and NSCLC, WWTR1 activation enhances oncogenic pathways that drive tumor cell proliferation, migration, and invasion. Moreover, elevated WWTR1 expression in lung cancer has been associated with chemotherapy resistance, highlighting its potential as a therapeutic target61. Collectively, these findings suggest that BCL10, TAF1B, and WWTR1, genes strongly linked to pseudouridine modification, may contribute to shaping the tumor immune microenvironment, potentially through RNA modification-related mechanisms. Their involvement in immune regulation and tumor progression underscores their potential as therapeutic targets for immunotherapy in CRC, offering promising avenues for improving patient prognosis and treatment outcomes.

A prognostic risk model was developed based on the three differentially expressed genes (BCL10, TAF1B, and WWTR1), which demonstrated high predictive accuracy in both the training and validation datasets. The model effectively stratified patients with CRC into high-risk and low-risk groups based on calculated risk scores, with KM survival analysis confirming its prognostic utility.
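
For illustration, a risk score of this kind is commonly computed as a coefficient-weighted sum of gene expression values, with patients then split at a cut-off. The Python sketch below assumes that form and a median cut-off, neither of which is specified in this section; the coefficients, expression values, and patient identifiers are hypothetical placeholders and are not the values fitted in this study.

```python
from statistics import median

# Hypothetical Cox coefficients; signs and magnitudes are arbitrary placeholders.
HYPOTHETICAL_COEFS = {"BCL10": -0.30, "TAF1B": 0.45, "WWTR1": 0.25}


def risk_score(expression, coefs=HYPOTHETICAL_COEFS):
    """Linear predictor: sum over genes of coefficient * expression."""
    return sum(coefs[gene] * expression[gene] for gene in coefs)


def stratify(patients, coefs=HYPOTHETICAL_COEFS):
    """Score every patient and split at the median score (assumed cut-off)."""
    scores = {pid: risk_score(expr, coefs) for pid, expr in patients.items()}
    cutoff = median(scores.values())
    groups = {pid: ("high" if s > cutoff else "low") for pid, s in scores.items()}
    return scores, cutoff, groups


# Hypothetical expression values for four invented patients.
patients = {
    "P1": {"BCL10": 2.0, "TAF1B": 1.0, "WWTR1": 1.0},
    "P2": {"BCL10": 1.0, "TAF1B": 2.0, "WWTR1": 2.0},
    "P3": {"BCL10": 3.0, "TAF1B": 1.0, "WWTR1": 0.0},
    "P4": {"BCL10": 1.0, "TAF1B": 3.0, "WWTR1": 1.0},
}

scores, cutoff, groups = stratify(patients)
for pid in patients:
    print(pid, round(scores[pid], 2), groups[pid])
```

The resulting group labels are what a KM comparison between high- and low-risk patients would take as input.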

Clinical characteristics not only influence variable selection in prognostic models but also affect their predictive accuracy and interpretability. For instance, Aleix Prat and colleagues successfully integrated clinical data to predict survival and treatment response in early HER2-positive breast cancer, reinforcing the value of incorporating clinical parameters into prognostic models to enhance reliability and applicability62. In the present study, the risk score was evaluated together with clinical parameters and identified as an independent prognostic factor, and a prognostic nomogram was constructed for clinical use. This nomogram offers a practical tool for predicting patient outcomes and may help guide personalized treatment decisions. Calibration analysis further supported the accuracy of the nomogram in predicting patient survival, underscoring its potential value in targeted therapy and its broader clinical applicability.

In ovarian cancer, regulatory T cells (Tregs) suppress antitumor immune activity through the release of inhibitory cytokines such as transforming growth factor-β (TGF-β) and interleukin-10 (IL-10). Additionally, Tregs directly interact with effector T cells and other immune components via surface molecules, including CTLA-4 and PD-1, facilitating immune evasion by tumor cells63. In CRC, dense Treg infiltration within tumor tissues has been significantly correlated with both increased recurrence rates and reduced overall survival, highlighting Treg infiltration levels as a potential prognostic biomarker64. Our findings further support an association between CRC risk scores and immune cell infiltration. Specifically, risk scores correlated positively with Treg infiltration and negatively with activated CD4 memory T cell infiltration. Together with the differences in the infiltration levels of eleven immune cell types between high- and low-risk groups, these results point to Treg infiltration as a potential risk factor in CRC. Moreover, genes related to T-cell activation exhibit significant differences between high-risk and low-risk groups, suggesting that these groups may be associated with distinct T-cell activation characteristics. In the TME, T cells are suppressed through various pathways, particularly within the tumor-associated immunosuppressive milieu, where they are induced to differentiate into different subtypes65. Tregs promote tumor immune evasion by inhibiting the function of effector T cells66. Additionally, T-cell immune responses are modulated by myeloid cells, which exert crucial immunosuppressive functions in the TME through differentiation into myeloid-derived suppressor cells (MDSCs)67. MDSCs suppress anti-tumor immune responses via multiple mechanisms, thereby facilitating tumor growth and metastasis68.
Furthermore, M2 macrophages, as key myeloid cells, significantly impair pro-inflammatory potential by weakening tumor antigen presentation and secreting inhibitory factors (such as IL-10), ultimately suppressing immune responses69. Single-cell analysis, incorporating quality control and dimensionality reduction clustering, identified seven major cell types within CRC samples, with T cells and epithelial cells being the most abundant. Pseudo-temporal trajectory analysis of BCL10, TAF1B, and WWTR1 in T cells and epithelial cells revealed dynamic expression patterns during different stages of differentiation. Notably, BCL10 exhibited significant variation in expression across epithelial cell states70, while TAF1B expression increased towards the terminal differentiation stages, suggesting its involvement in cellular maturation or tumor-induced signaling responses71. The CBM signaling complex, composed of CARD11, BCL10, and MALT1, regulates T-cell receptor-induced gene expression by modulating NF-κB activation and mRNA stability72. Dysregulated BCL10 expression in murine models has been shown to aberrantly activate NF-κB signaling, driving T-cell activation and malignant tumor progression73, positioning BCL10 as a promising immunotherapeutic target. Furthermore, research by Markus Casper et al. identified TAF1B as a regulator of coding microsatellite instability (cMSI) in HCC, implicating it in tumor progression through its effects on stem cell proliferation and apoptosis74. Similarly, elevated WWTR1 expression aligns with its role in the Hippo signaling pathway, where it modulates cellular proliferation and contact inhibition61. These findings provide deeper insights into CRC cell dynamics while highlighting BCL10, TAF1B, and WWTR1 as potential molecular targets for novel therapeutic interventions aimed at improving the outcomes of patients with CRC.

Through an integrative analysis of single-cell and bulk transcriptomic data, this study investigated the role of PRGs in CRC and their prognostic implications. Beyond elucidating the involvement of PRG-associated transcriptional programs in CRC progression, scRNA-seq technology was leveraged to dissect the functional heterogeneity of distinct cell populations within the TME. The PRG-based prognostic risk model incorporating BCL10, TAF1B, and WWTR1 may improve CRC prognosis prediction and provides a theoretical framework for future targeted therapies and precision medicine strategies. Furthermore, the potential mechanisms by which these PRG-associated signatures influence CRC biology were explored through immune infiltration analysis, drug sensitivity assessments, mRNA–miRNA–lncRNA regulatory network construction, and pseudotime trajectory analysis. Collectively, this study advances the molecular understanding of CRC while providing valuable insights for future research and clinical applications. However, certain limitations should be acknowledged. First, our analyses are primarily correlative and do not demonstrate that BCL10, TAF1B, and WWTR1 are direct writers, erasers, readers, or substrates of pseudouridine; rather, they should currently be regarded as genes whose expression is closely linked to pseudouridine-related gene signatures and pathways. Second, the relatively small sample size introduced variability in data partitioning, contributing to performance inconsistencies, including an AUC in the validation set that exceeded that of the training set, and heterogeneity within the experimental cohort may have influenced analytical robustness. Third, the treatment response reported in this study is a computational prediction rather than an outcome derived from in vivo or in vitro experiments, and the precise molecular mechanisms and biological functions of the identified prognostic genes require further validation.
Future studies should address these limitations by increasing cohort sizes, enriching subgroup stratifications, and integrating data from diverse populations to validate prognostic gene expression patterns across distinct tissue samples. In particular, the three-gene signature should be validated in large, prospective, multicenter cohorts with standardized treatment and follow-up, and further in vitro and in vivo functional experiments will be essential to clarify the mechanistic links with pseudouridylation and to corroborate these findings, thereby ensuring their translational applicability in CRC diagnosis, prognosis, and therapeutic decision-making.