Introduction

Bladder cancer (BLCA) is one of the most common malignancies of the urinary system. Recent advances in diagnosis and treatment, driven largely by novel therapeutic strategies, have improved outcomes. However, the overall survival rate for patients with muscle-invasive bladder cancer (MIBC) remains suboptimal. Lymph nodes are the primary site of metastasis in BLCA, and lymph node metastasis (LNM) affects 25% to 30% of patients who undergo radical cystectomy (RC) and pelvic lymphadenectomy (PLND). Furthermore, 80% of patients with LNM experience recurrence after surgery1. For patients with MIBC and LNM, surgery alone often fails to achieve complete tumor resection, resulting in high risks of recurrence and metastasis and limited survival benefit. Therefore, accurate early detection of LNM is crucial for improving patient management and stratification2.

Single-cell RNA sequencing (scRNA-seq) not only measures gene expression but also uncovers tumor heterogeneity at the single-cell level, tracks cancer cell metastasis and dissemination, and reveals genetic variations3,4,5. The identification of copy number variations (CNVs) from scRNA-seq data, alongside the inference of mutation profiles in cancer cell populations across different samples, has significantly advanced both basic research and clinical oncology applications6,7. Despite this progress, few studies have investigated the single-cell transcriptomic features of LNM in BLCA, and the cells and molecular mechanisms driving LNM remain inadequately explored. Bulk RNA sequencing serves as a valuable complementary tool, validating scRNA-seq findings at the broader genetic level and providing insights into how different subtypes respond to various treatments8,9,10. Therefore, the integration of these two approaches holds great potential for developing a prognostic model for BLCA based on molecular subtypes, enhancing both our understanding and management of the disease.

Building on this context, scRNA-seq analysis of clinical samples was employed to rigorously examine the cellular composition and functional characteristics of BLCA, with a focus on identifying potential cell subpopulations associated with LNM. Additionally, by combining bulk transcriptomic data with advanced machine learning algorithms, the study systematically identified prognostic genes and constructed a risk model for BLCA. The immune microenvironment of BLCA was also investigated to gain deeper insights into the molecular immune processes involved in disease progression. This research aims to develop robust predictive tools for clinical diagnosis and treatment while advancing our understanding of BLCA pathogenesis.

Materials and methods

Sample collection

This study included three patients with MIBC who underwent RC at the Department of Urology, Third Hospital of Shanxi Medical University, between December 2021 and December 2022. These patients had not received any neoadjuvant chemotherapy and were predicted to have metastatic lymph nodes based on preoperative abdominal CT scans. Primary tumor (PT) tissues (PT1, PT2, and PT3) and corresponding pelvic lymph nodes (LNM1 and LNM3), along with negative lymph nodes (LNM2), were collected after RC. Additionally, some of the resected lymph nodes underwent frozen section analysis during surgery to determine whether tumor metastasis had occurred. One sample was excluded for not meeting the criteria, and the remaining samples—PT1, PT2, LNM1, and LNM2—were retained for further analysis. The study protocol was approved by the Ethics Committee of the Third Hospital of Shanxi Medical University (SBQKL-2021-024). Sample collection was carried out with written informed consent from the patients, in accordance with all ethical regulations outlined in the Declaration of Helsinki.

Data collection

Expression data and clinical information (age, gender, tumor stage, survival time, and T/N/M stage) for TCGA-BLCA were obtained from The Cancer Genome Atlas (TCGA) database (https://portal.gdc.cancer.gov/). This dataset, serving as the training cohort, comprised 19 normal and 403 BLCA samples, with survival data available for 398 BLCA cases. Validation datasets were sourced from the Gene Expression Omnibus (GEO) database (http://www.ncbi.nlm.nih.gov/geo/), including GSE13507 (GPL6102) and GSE31684 (GPL570), which contained expression data and clinical information (survival time and status) for 165 and 93 BLCA samples, respectively. These datasets provided a sufficient number of validation samples for subsequent analyses.

ScRNA-seq analysis

Tissue preparation and library construction

Tissue samples were cut into small pieces of approximately 5 millimeters in diameter and digested in a freshly prepared enzymatic solution at 37 °C for 1 hour. The resulting cell suspension was filtered twice through a 40-micron cell strainer to remove undigested tissue debris. A red blood cell lysis buffer was added to the suspension, which was kept on ice for 5 minutes. After centrifugation, the supernatant was discarded, and the cells were resuspended in culture medium. The cell concentration and viability were then assessed. Barcoded beads, unique molecular identifiers (UMIs), primers, and enzymes were introduced into the cell solution to create single-cell gel beads with distinct barcodes. These beads were generated using the 10× Genomics Chromium Next GEM Single-Cell 3’ Reagent Kit v3.1. RNA in the cells was reverse-transcribed into complementary DNA (cDNA) and subsequently amplified by PCR to generate a single-cell library. The library was then sequenced on the Illumina NovaSeq 6000 platform in PE150 mode.

ScRNA-seq data processing and analysis

Sequencing data in FASTQ format were aligned and quantified against the human reference genome (GRCh38) using Cell Ranger (v 7.0.1)11. Cell barcodes were assigned, and UMIs were extracted. The Seurat package (v 4.0.0)12 was used for additional quality control based on UMI counts, the number of detected genes, and the percentage of mitochondrial genes. Cells were retained if the number of detected genes and the total UMI count (nUMI) were within the range of mean ± 2 standard deviations (SD) for these metrics, and if the percentage of mitochondrial genes was less than 30%. Doublets (cell populations formed during encapsulation or due to improper separation during sample preparation) were identified and removed using DoubletFinder (v 2.0.3)13. Following filtering, data normalization was performed using the NormalizeData function from Seurat, and the FindVariableFeatures function was applied to identify the top 2000 highly variable genes. Batch effects in single-cell expression data across samples were corrected using mutual nearest neighbors (MNN)14. Principal component analysis was performed on the highly variable genes to reduce dimensionality. T-distributed stochastic neighbor embedding (t-SNE) analysis was then performed to generate a two-dimensional representation of cell states, with clustering conducted at a resolution of 0.4. Marker genes for each cell cluster were identified using the FindAllMarkers function, followed by annotation with the SingleR package15 and classic marker genes for various cell types.
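The cell-level filtering rule described above can be illustrated with a minimal sketch in plain Python. It is not the Seurat implementation; the function name qc_filter and the field names n_genes, n_umi, and pct_mito are hypothetical and chosen only for illustration.

```python
from statistics import mean, stdev

def qc_filter(cells, n_sd=2.0, max_mito_pct=30.0):
    """Keep cells whose gene and UMI counts lie within mean ± n_sd SD
    and whose mitochondrial percentage is below max_mito_pct.

    `cells` is a list of dicts with the (hypothetical) keys
    "n_genes", "n_umi", and "pct_mito".
    """
    genes = [c["n_genes"] for c in cells]
    umis = [c["n_umi"] for c in cells]
    g_mu, g_sd = mean(genes), stdev(genes)  # sample SD
    u_mu, u_sd = mean(umis), stdev(umis)

    kept = []
    for c in cells:
        ok_genes = abs(c["n_genes"] - g_mu) <= n_sd * g_sd
        ok_umi = abs(c["n_umi"] - u_mu) <= n_sd * u_sd
        ok_mito = c["pct_mito"] < max_mito_pct
        if ok_genes and ok_umi and ok_mito:
            kept.append(c)
    return kept
```

Because the thresholds are derived from the data themselves, a few extreme cells can widen the accepted range; this is a property of the mean ± 2 SD rule rather than of this sketch.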

CNV inference

Chromosomal mutations in tumor cells were inferred using the inferCNV package (https://github.com/broadinstitute/), with CNV scores assigned to larger genomic segments. Chromosomal alterations were deduced by comparing gene expression intensities between genomic regions in tumor cells and reference immune cells.
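The idea of comparing expression intensities across genomic regions against a reference can be sketched as follows. This is a simplified illustration under stated assumptions, not the inferCNV algorithm itself: genes are assumed to be already ordered by chromosomal position, expression values are assumed to be log-transformed, and the function names and the window size are hypothetical.

```python
def relative_to_reference(cell_expr, reference_mean):
    """Subtract the mean reference (e.g. immune cell) expression per gene."""
    return [c - r for c, r in zip(cell_expr, reference_mean)]

def moving_average(values, window=3):
    """Smooth values along the genome-ordered gene axis.

    The window is truncated at both ends so the output has the same
    length as the input.
    """
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed

def cnv_profile(cell_expr, reference_mean, window=3):
    """Smoothed relative expression; sustained positive or negative runs
    suggest gains or losses over larger genomic segments."""
    return moving_average(relative_to_reference(cell_expr, reference_mean), window)
```

Smoothing over neighboring genes is what lets segment-level signal emerge from noisy single-gene measurements; real tools use much larger windows and additional denoising steps.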

Pseudotime analysis

A CellDataSet object was created from the RNA analysis data in the Seurat object using the importCDS function. The differentialGeneTest function was used to identify feature genes with a q-value < 0.001. These feature genes were then used for dimensionality reduction and clustering analysis, followed by trajectory inference using the orderCells function under default settings. Temporal variations in gene expression were tracked with the plot_genes_in_pseudotime function, illustrating dynamic gene expression over pseudotime.

Differential expression and enrichment analyses

The FindAllMarkers function within the Seurat package was used to identify differentially expressed genes (DEGs) in cell clusters, applying a fold change (FC) greater than 1.5 and a P-value below 0.05. Hypergeometric distribution tests were conducted for Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG)16,17,18 enrichment analyses of the DEGs.
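For readers unfamiliar with the hypergeometric test used in GO and KEGG enrichment, the following minimal sketch computes the one-sided enrichment P-value for a single term. The function and argument names are hypothetical, and multiple-testing correction is deliberately omitted.

```python
from math import comb

def hypergeom_enrichment_p(n_background, n_in_term, n_degs, n_overlap):
    """P(X >= n_overlap) for X ~ Hypergeometric.

    n_background: number of genes in the background set
    n_in_term:    background genes annotated to the term
    n_degs:       number of DEGs tested
    n_overlap:    DEGs annotated to the term
    """
    total = comb(n_background, n_degs)
    upper = min(n_in_term, n_degs)
    tail = sum(
        comb(n_in_term, k) * comb(n_background - n_in_term, n_degs - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total
```

For example, with a background of 10 genes, 5 of them in the term, and 5 DEGs that all fall in the term, the call hypergeom_enrichment_p(10, 5, 5, 5) evaluates to 1/252.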

Gene set variation analysis (GSVA) was performed on the cell subpopulations derived from clustering. Background gene set files were constructed using GO gene sets (C5) and KEGG gene sets (C2) from the MSigDB database (v7.2) (https://www.gsea-msigdb.org/). The GSVA package (v 1.30.0)19 was employed to score pathway activity for individual cells, while the limma package (v 3.38.3)20 was used to assess variations in pathway activity among different groups. A multi-group comparison approach was applied to examine differences in pathway activity scores between each group and all others, based on GSVA-derived pathway activity ratings.

Gene set scoring algorithms in scRNA-seq

Three distinct methods—AUCell, UCell, and AddModuleScore—were used to evaluate gene sets in the scRNA-seq dataset. The key subpopulation was identified as the cluster with significantly higher scores than the others, and feature genes were selected with the condition avg_logFC > 0.5.

Construction and validation of risk model

DEGs were identified by comparing BLCA samples with control samples using the DESeq2 package (v 1.38.3) (|log2FC| > 0.5 and P < 0.05)21. The results were visualized using a volcano plot [ggplot2 package (v 3.5.1)]22 and a heatmap [pheatmap package (v 1.0.12)]23. The ggvenn package (v 0.1.10)24 was employed to depict the intersection of DEGs and feature genes of the key subpopulation. In the training cohort, univariate Cox regression analysis was performed on these intersecting genes using the survival package (v 3.4-0.4)25, with hazard ratio (HR) ≠ 1 and P < 0.05 as inclusion criteria. The proportional hazards (PH) assumption test was then applied to identify candidate genes (P > 0.05). Using the expression levels of candidate genes in BLCA samples along with clinical features (time and status), 10 machine learning algorithms, namely random survival forest (RSF), elastic network (Enet), least absolute shrinkage and selection operator (LASSO), Ridge, stepCox, CoxBoost, partial least squares Cox regression (plsRcox), SuperPC, generalized boosted regression modeling (GBM), and survival-support vector machine (SVM)26, were employed, along with 100 combinations of these algorithms, to identify the optimal model in both the training and validation sets. The model with the highest average concordance index (C-index) across all cohorts was selected, and the genes included in this model were classified as prognostic genes. The predict function calculated the risk score for each patient based on the expression of prognostic genes in the model, weighted by their regression coefficients. The survminer package (v 0.4.9)27 was used to categorize patients with BLCA into high- and low-risk groups (HRG and LRG) using the surv_cutpoint function with minprop = 0.3. Risk curves were generated, and the pheatmap package was used to display the expression of prognostic genes across different risk categories.
Kaplan-Meier (K-M) curves, generated with the survminer package, were used to compare HRG and LRG, evaluating the efficacy of the risk model. The receiver operating characteristic (ROC) curve, executed via the timeROC package (v 0.4)28, was used to assess the model’s predictive performance. Finally, the prognostic model’s efficacy was validated using the validation sets.
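The concordance index used above to rank candidate models can be illustrated with a minimal sketch of Harrell's C-index. This is a simplified version under stated assumptions: pairs with tied survival times are ignored, status is coded 1 for an event and 0 for censoring, and the function name is hypothetical. It is not the implementation used by any of the R packages cited.

```python
def concordance_index(times, events, risk_scores):
    """Fraction of comparable patient pairs that the risk score orders correctly.

    A pair (i, j) is comparable when patient i had an observed event
    strictly before patient j's follow-up time. The pair is concordant
    when patient i also has the higher risk score; tied scores count 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # censored patients cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is why the average C-index across cohorts is a natural criterion for comparing models.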

Immune infiltration and immunotherapy analysis

The quantities of 22 immune cell types in HRG and LRG of the training cohort were assessed using the CIBERSORT algorithm (v 0.1.0)29 and the LM22 gene set. Bar charts depicting the infiltration abundance percentages of these immune cells in HRG and LRG were generated with the ggplot2 package. The correlation between immune cell populations and prognostic genes was analyzed using Spearman correlation. The estimate package was used to compute StromalScore, ImmuneScore, and EstimateScore for HRG and LRG, and differences between the groups were visualized. Additionally, the expression levels of 38 immune checkpoint genes in HRG and LRG were compared. The Tumor Immune Dysfunction and Exclusion (TIDE) website (http://tide.dfci.harvard.edu) was used to evaluate immune therapy response. Differences in TIDE scores between HRG and LRG were analyzed using the Wilcoxon test, and the correlation between TIDE scores and risk scores was visualized through scatter plots.
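The Spearman correlation used to relate immune cell abundance to prognostic gene expression is a Pearson correlation computed on ranks. The sketch below shows this under the usual average-rank convention for ties; the function names are hypothetical and no P-value is computed.

```python
def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation between two equal-length sequences."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

Because only ranks enter the calculation, the coefficient captures monotonic rather than strictly linear relationships, which suits skewed infiltration estimates.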

Clinical feature analysis and construction of the nomogram

A heatmap was used to display the expression of prognostic genes across various clinical pathological subgroups (age, gender, stage, and M/N/T stage), providing insights into the role of prognostic genes in the progression of BLCA. The chi-square test was applied to compare the proportions of different clinical characteristic subtypes between HRG and LRG. Violin plots were used to illustrate differences in risk scores among subgroups based on clinical features. To identify independent risk factors for BLCA prognosis, six clinical features and risk scores were included in univariate and multivariate Cox regression analyses in the training cohort (HR ≠ 1, P < 0.05). Multivariate analysis was used to construct a nomogram with the rms package (v 6.8-0.8)30 to predict 1-, 4-, and 7-year survival rates. The predictive accuracy of the nomogram was evaluated using ROC curves, decision curve analysis (DCA), and calibration curves.
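The chi-square comparison of clinical subgroup proportions between HRG and LRG can be sketched as follows. This is a minimal illustration that returns only the Pearson statistic and its degrees of freedom; it applies no continuity correction and does not compute a P-value, and the function name is hypothetical.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for an r x c contingency table.

    `table` is a list of rows of observed counts, for example
    rows = risk groups and columns = clinical subgroups.
    Returns (statistic, degrees_of_freedom).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof
```

The statistic grows as the observed counts depart from those expected if risk group and clinical subgroup were independent.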

Drug sensitivity analysis

The GDSC database (https://www.cancerrxgene.org/) was utilized to generate 50% inhibitory concentration (IC50) values for 198 drugs using oncoPredict (v 1.2)31, and drug sensitivity differences between HRG and LRG were compared using the Wilcoxon test.
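The Wilcoxon rank-sum comparison of predicted IC50 values between two groups can be sketched with the equivalent Mann-Whitney U statistic. This is a simplified illustration assuming a normal approximation without tie or continuity correction, so small-sample results may differ from those of statistical packages; the function name is hypothetical.

```python
from math import erfc, sqrt

def mann_whitney_u(x, y):
    """Return (U statistic for x, two-sided P-value by normal approximation).

    U counts the pairs (a, b) with a from x and b from y where a > b;
    ties contribute 0.5.
    """
    n1, n2 = len(x), len(y)
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    mu = n1 * n2 / 2.0
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mu) / sigma
    p_value = erfc(abs(z) / sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return u, p_value
```

A lower predicted IC50 in one group indicates greater predicted sensitivity, so the direction of the difference matters as much as the P-value.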

Functional enrichment analysis of HRG and LRG

Differential expression analysis was conducted between HRG and LRG to identify DEGs with |log2FC| > 0.5 and P < 0.05. GO analysis was employed to annotate gene functions, encompassing molecular functions (MFs), biological processes (BPs), and cellular components (CCs). KEGG analysis was performed to elucidate gene functions. The clusterProfiler package (v 4.6.2)32 was used to analyze the GO and KEGG pathways of DEGs with an adjusted P-value < 0.05, and enrichment maps were generated using the ggplot2 package. Additionally, gene set enrichment analysis (GSEA) was carried out using the logFC values of HRG and LRG as the ranking criterion, with the clusterProfiler program. The reference gene set (category = C2) was obtained from the MSigDB website, with |NES| > 1 and adjusted P-value < 0.05 as the threshold.

Regulatory network

MicroRNAs (miRNAs) regulating prognostic genes were predicted using the miRDB database (https://mirdb.org/) (score > 60). Subsequently, long noncoding RNAs (lncRNAs) interacting with these miRNAs were identified through the starBase database (https://rnasysu.com/encori/) (clipExpNum > 20). An mRNA-miRNA-lncRNA network was constructed using Cytoscape software.

Reverse transcription quantitative real-time polymerase chain reaction (RT-qPCR)

A total of 10 BLCA and 10 normal tissue samples were collected from the Department of Urology, Third Hospital of Shanxi Medical University, for RT-qPCR analysis. Total RNA was extracted from tissue samples using the FastPure Complex Tissue/Cell Total RNA Isolation Kit (RC113-01, Vazyme, Nanjing). The purity of the extracted RNA was assessed with a Nano-500 microspectrophotometer. RNA was then reverse transcribed into cDNA using the ABScript III RT Master Mix for RT-qPCR, which includes gDNA Remover (RK20429, ABclonal, China). RT-qPCR was performed using the Genious 2X SYBR Green Fast RT-qPCR Mix (RK21205, ABclonal, China). The relative mRNA expression levels of the prognostic genes were quantified using the 2^-ΔΔCt method. GAPDH was used as the internal reference gene, and the primer sequences used for the RT-qPCR are provided in Table 1.

Table 1 The primer sequences of prognostic genes.
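The relative quantification used for RT-qPCR can be written out explicitly. The sketch below assumes one Ct value per gene and sample (replicates would normally be averaged first) and uses a hypothetical function name; the reference gene corresponds to GAPDH in this study.

```python
def relative_expression(ct_target_sample, ct_ref_sample,
                        ct_target_control, ct_ref_control):
    """Relative expression by the 2^-ΔΔCt method.

    ΔCt  = Ct(target gene) - Ct(reference gene), per sample
    ΔΔCt = ΔCt(test sample) - ΔCt(control sample)
    """
    delta_sample = ct_target_sample - ct_ref_sample
    delta_control = ct_target_control - ct_ref_control
    delta_delta = delta_sample - delta_control
    return 2.0 ** (-delta_delta)
```

For instance, a target gene with Ct 26 in tumor and 24 in normal tissue, with the reference gene at Ct 18 in both, gives ΔΔCt = 2 and a relative expression of 0.25, that is, a fourfold reduction in the tumor sample.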

Statistical analysis

Analyses were performed using R versions 4.2.1 and 3.6.1, with inter-group differences assessed using the Wilcoxon test or t-test. In the scRNA-seq analysis, Pearson’s χ² test or Fisher’s exact test was applied to evaluate group differences. A significance threshold of P < 0.05 was used.

Results

Cellular heterogeneity in BLCA progression

Strict quality control was applied to PT1, PT2, LNM1, and LNM2 before analysis. Cells were categorized into 18 clusters, with detailed clustering shown in Fig. 1A. Clusters 1 and 2 represented the largest proportions in LNM, while cluster 3 was most prevalent in PT (Fig. 1B). Based on marker gene comparisons, six cell types were identified: endothelial cells, B cells, T cells, epithelial cells, macrophages, and fibroblasts (Fig. 1C). Proportions of each single-cell type were compared between the PT and LNM groups to highlight differences (Fig. 1D). The results revealed that T cells comprised the majority in LNM, while epithelial cells predominated in PT, underscoring cellular heterogeneity.

Fig. 1

Single-cell RNA sequencing (scRNA-seq) analysis. A: Visualization of different samples and 18 cell clusters using t-distributed stochastic neighbor embedding (t-SNE). B: Proportions of the 18 cell clusters in different samples. C: t-SNE plot used to identify each cell type in bladder cancer (BLCA). D: Proportions of 6 cell types in different samples.

Heterogeneity and functional remodeling of epithelial cells

To further explore differences in the tumor microenvironment (TME) during BLCA progression, particular attention was given to the heterogeneous epithelial cells. These epithelial cells were reclustered into seven distinct clusters (Fig. 2A). Subgroup distribution analysis revealed that LNM1 predominantly consisted of clusters 3, 4, and 5, with cluster 3 being the major component (Fig. 2B). In contrast, all seven subclusters were present in PT, with cluster 1 being the dominant one. GSVA indicated that clusters 2, 6, and 7 were enriched in “regulation of the actin cytoskeleton” and “focal adhesion,” while clusters 3, 4, and 5 showed enrichment in “ABC transporters” and “ascorbate and aldarate metabolism.” Additionally, clusters 6 and 7 were enriched in the “TGF-beta signaling pathway” and “pathway in cancer” (Fig. 2C). Tumor cells exhibiting CNV changes were identified and differentiated from normal epithelial cells in scRNA-seq. CNV analysis of epithelial cells revealed that epithelial cells in LNM1, PT1, and PT2 were tumor cells (Fig. 2D). Furthermore, significant CNV differences were observed among these tumor cells (Fig. 2E), highlighting the genetic heterogeneity of tumor cells from different sources, which may influence tumor behavior, treatment response, and patient prognosis.

Fig. 2

Clustering analysis and copy number variations (CNV) of epithelial cells. A: Clustering analysis of epithelial cells to obtain seven subpopulations and their distribution in different samples. B: Proportions of different epithelial cell subpopulations in various samples. C: Gene set variation analysis (GSVA) for each subpopulation [left: Gene Ontology (GO) database; right: Kyoto Encyclopedia of Genes and Genomes (KEGG) database]. D: Feature plot of CNV analysis in epithelial cells and CNV heatmap of epithelial cells. E: CNV levels in different types of samples.

Developmental trajectory of epithelial cells identifies cluster 3 as a key subgroup

Pseudotime analysis was performed to assess the developmental trajectory of epithelial cells in BLCA. The results revealed that epithelial cells differentiated over time, progressing from the middle to both ends, and were divided into five stages (Fig. 3A-B). PT1 cells were distributed across two directions in epithelial cells, while PT2 was primarily located along the left branch (Fig. 3C). LNM1 cells were predominantly found in the later stages of the right branch. Additionally, genes with similar expression profiles were clustered based on the pseudotime process, grouping cell types into states 1-2-5 and state 4 (Fig. 3D). To provide a more intuitive description of this differentiation, the cells on the left branch were termed the “primary branch,” while those on the right branch were designated the “metastatic branch.” Further investigation of key subpopulations during BLCA lymphatic metastasis involved differential analysis between the primary and metastatic branches, followed by enrichment analysis of the resulting DEGs. The analysis revealed that cells in the metastatic branch exhibited distinct characteristics related to energy metabolism (Table S1). Based on this observation, scoring analysis was performed on the energy metabolism gene set of epithelial cells (Fig. 3E). Results indicated that oxidative phosphorylation was particularly prominent in cluster 3, with expression levels significantly higher than those in other subclusters. This suggests that cluster 3 of epithelial cells may represent a critical subpopulation involved in the lymphatic metastasis of BLCA. Therefore, cluster 3, which included 133 feature genes, was selected for further analysis.

Fig. 3

Pseudotime analysis and gene set scoring of epithelial cells. A: Results of cell trajectory analysis of epithelial cells, showing changes of different subpopulations over time. B: Epithelial cell development trajectory divided into 5 stages. C: Trajectories of epithelial cells in different samples. D: Heatmap of genes clustered based on similar expression along the pseudotime process. E: Scoring of gene sets in the scRNA-seq dataset using 3 different algorithms (AddModuleScore, AUCell, and UCell).

Identification of 9 prognostic genes

Differential analysis between BLCA and control samples in the training cohort identified a total of 4,802 DEGs, including 2,721 upregulated and 2,081 downregulated genes (Fig. 4A-B, Table S2). The intersection of these DEGs and the 133 feature genes resulted in 58 overlapping genes (Fig. 4C, Table S3). Following univariate Cox regression and the PH assumption test, 13 candidate genes were identified (Fig. 4D, Table S4). To further refine the selection of prognostic genes, multiple algorithms were employed. Among these, the LASSO + RSF model (average C-index = 0.678) was found to be the most effective, comprising 9 prognostic genes: APOL1, CAST, DSTN, SPINK1, JUN, S100A10, SPTBN1, HES1, and CD2AP (Fig. 4E). Of these, APOL1, CD2AP, HES1, and S100A10 were upregulated in BLCA samples, while the remaining five genes were downregulated (Fig. 4F). A risk prediction model was established based on these 9 prognostic genes. The risk curve indicated a higher number of deceased samples in the HRG compared to the LRG (Fig. 5A). JUN, S100A10, and SPTBN1 exhibited high expression in the HRG, while SPINK1 was more highly expressed in the LRG (Fig. 5B). Survival analysis demonstrated that the risk model accurately predicts patient prognosis, with higher risk scores correlating with reduced survival rates (Fig. 5C). ROC analysis showed AUC values above 0.9 for 1, 4, and 7 years, indicating excellent predictive performance of the model (Fig. 5D). Additionally, the model’s predictive accuracy was validated in the independent validation sets (Fig. 5E-H). In summary, the risk prediction model provides an effective tool for forecasting the prognosis of patients with BLCA.

Fig. 4

Screening of prognostic genes. A: Volcano plot of DEGs obtained by differential analysis between BLCA and control samples in the training set. B: Heatmap of DEGs. C: Intersection of DEGs and characteristic genes of epithelial cell cluster 3, resulting in 58 intersecting genes. D: Univariate Cox analysis of intersecting genes. E: A screening model was constructed using 10 machine learning algorithms and 100 algorithm combinations based on the expression levels of candidate genes in BLCA samples and clinical features (time and status). The model with the highest average concordance index (C-index) in both the training and validation sets was selected, and the genes within it were extracted as prognostic genes. F: Expression levels of prognostic genes in BLCA and normal samples. *, P < 0.05; **, P < 0.01; ***, P < 0.001.

Fig. 5

Construction and validation of the risk model. A: Risk curve in the training set. B: Heatmap of prognostic gene expression in the training set. C: Kaplan-Meier (KM) curve of the risk model in the training set. D: Evaluation of the risk model using the receiver operating characteristic (ROC) curve in the training set. E: Risk curves in the validation sets (GSE13507 and GSE31684). F: Heatmap of prognostic gene expression in the validation sets. G: KM curve of the risk model in the validation sets. H: Evaluation of the risk model using the ROC curves in the validation sets.

Nomogram for predicting BLCA prognosis

The relationship between the expression levels of the nine prognostic genes and clinical pathological characteristics was analyzed to explore their distinct roles in the progression of BLCA. A heatmap was used to clearly depict the expression variations of these genes across different clinical subgroups of patients with BLCA (Figure S1A). Significant differences were observed between the HRG and LRG for age, stage, and T/N/M stage (P < 0.001) (Figure S1B). Furthermore, the risk score showed notable differences across the subgroups of these five clinical variables (P < 0.05) (Figure S1C). Univariate and multivariate Cox regression analyses identified T stage and risk score as independent prognostic factors (Fig. 6A, Table S5). A nomogram (C-index = 0.87) was developed to visually predict the 1-, 4-, and 7-year survival rates for patients with BLCA, incorporating these variables (Fig. 6B). The calibration curve generated from the nomogram showed good alignment, indicating that the predicted probabilities closely matched the actual observed outcomes, confirming the nomogram’s reliability and accuracy (Fig. 6C). ROC curve analysis revealed AUC values greater than 0.88 for predicting 1-, 4-, and 7-year survival rates (Fig. 6D), further supporting the model’s excellent predictive performance. DCA demonstrated that the nomogram provided a significant net benefit, confirming its high accuracy (Fig. 6E). In conclusion, the nomogram offers robust support for the prognostic assessment of patients with BLCA.

Fig. 6

Construction of the nomogram. A: Univariate and multivariate Cox regression analysis of clinical features (age, gender, stage, and T/N/M stage) with risk score. B: Nomogram constructed with T stage and risk score as variables. C: Calibration curve of the nomogram. D: ROC curve of the nomogram. E: DCA curves of the nomogram.

Immune microenvironment characteristics of different risk groups in BLCA

A comprehensive analysis of the immune microenvironment in HRG and LRG was conducted to elucidate the differences in immune profiles during BLCA progression and to identify characteristics associated with varying risk levels. The infiltration percentages of 22 immune cell types in the HRG and LRG were examined and visualized using bar charts (Fig. 7A). Significant differences were observed in the infiltration of 9 immune cell types: plasma cells, CD8+ T cells, naive CD4+ T cells, resting memory CD4+ T cells, regulatory T cells (Tregs), M0 and M1 macrophages, resting dendritic cells, and neutrophils (Fig. 7B). Among these, M0 and M1 macrophages showed higher infiltration levels in HRG, suggesting a more prominent role of these cells in the high-risk state of BLCA, while plasma cells and CD8+ T cells exhibited reduced infiltration. Additionally, an analysis of the relationship between prognostic genes and immune cells revealed a strong positive correlation between SPINK1 and Tregs (cor = 0.482, P < 0.001), and a significant negative correlation between S100A10 and Tregs (cor = −0.526, P < 0.001) (Fig. 7C). Differences in the expression of 38 immune checkpoint genes were also evaluated, revealing 14 significant discrepancies. CD70, CD86, and CD276 were upregulated in HRG, indicating a potential link with immune responses in BLCA (Fig. 7D). Further scoring analysis showed notable variations in the immune landscape between HRG and LRG, with the EstimateScore, StromalScore, and ImmuneScore significantly higher in HRG (Fig. 7E). These results suggest that the immune microenvironment in HRG is more complex and active, marked by increased immune cell infiltration and stromal component activity.

Fig. 7

Immune microenvironment analysis. A: Stacked bar chart of immune cell scores in high-risk and low-risk groups. B: Box plot showing differences in immune cells between high-risk and low-risk groups. C: Correlation plot of prognostic genes with immune cells. D: Differences in immune checkpoints between different risk groups. E: Differences in ImmuneScore, StromalScore, and ESTIMATEScore between different risk groups. ns, no significance; *, P < 0.05; **, P < 0.01; ***, P < 0.001; ****, P < 0.0001.

Assessment of the response to immunotherapy and chemotherapy in risk groups

The TIDE score is a crucial indicator for predicting immunotherapy responses. Notable differences in TIDE scores were observed between the HRG and LRG, with higher TIDE scores in the HRG, suggesting reduced responsiveness to certain immune treatments in high-risk patients (Fig. 8A). Additionally, a drug sensitivity analysis was conducted to explore differences in anti-tumor drug responses between the two groups. Nineteen drugs showed significant sensitivity variations between HRG and LRG (P < 0.001) (Fig. 8B). Specifically, the HRG demonstrated increased sensitivity to 14 drugs, including AZD5363, luminespib, and tozasertib. In contrast, the LRG exhibited greater sensitivity to 5 drugs, such as elephantin, KRAS (G12C) inhibitor-12, and PRT062607. These findings provide a scientific basis for selecting targeted therapies based on tumor risk stratification and drug sensitivity.

Fig. 8

Tumor Immune Dysfunction and Exclusion (TIDE) scores and drug sensitivity. A: TIDE scores in high-risk and low-risk groups. B: Box plot of 50% inhibitory concentration (IC50) values of drugs between high-risk and low-risk groups. ***, P < 0.001; ****, P < 0.0001.

Functional characteristics of risk groups

Differential expression analysis between HRG and LRG identified 1,022 DEGs, with 669 upregulated genes and 353 downregulated genes (Fig. 9A). To explore the underlying mechanisms, GO and KEGG enrichment analyses were performed. A total of 986 GO terms were enriched, with 840 significant terms in BPs, 72 in CCs, and 74 in MFs. Among these, “glycosaminoglycan binding,” “cellular response to transforming growth factor beta stimulus,” “endoplasmic reticulum lumen,” “regulation of epithelial cell proliferation,” and “epithelial to mesenchymal transition (EMT)” were enriched in DEGs. KEGG enrichment analysis revealed 12 significantly enriched pathways, including “cytoskeleton in muscle cells,” “extracellular matrix (ECM)-receptor interaction,” and “proteoglycans in cancer,” reflecting the diverse molecular roles of DEGs (Fig. 9B). GSEA further highlighted the distinct functional profiles of the two risk groups in BLCA. “ECM receptor interaction,” “complement and coagulation cascades,” and “aminoacyl tRNA biosynthesis” were activated in HRG, while “pentose and glucuronate interconversions” and “ascorbate and aldarate metabolism” were activated in LRG (Fig. 9C). The gene sets activated in HRG were primarily associated with tumor invasion, metastasis, and metabolic activity, while those in LRG were linked to cellular metabolic homeostasis and antioxidant defense.

Fig. 9

Differential and enrichment analysis between high- and low-risk groups. A: Volcano plot of DEGs identified by differential analysis between high- and low-risk groups. B: Enrichment analysis of DEGs. BP, Biological Process; CC, Cellular Component; MF, Molecular Function. C: Gene set enrichment analysis (GSEA) between high- and low-risk groups. D: Regulatory network of prognostic genes. Orange represents prognostic genes, green circles represent microRNAs (miRNAs), and blue represents long noncoding RNAs (lncRNAs).
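GO and KEGG over-representation analyses of a DEG list are commonly based on a hypergeometric (one-sided Fisher) test. The sketch below illustrates that calculation for a single term. Only the DEG count (1,022) comes from the text; the background size, term size, overlap, and the function name `enrichment_p` are hypothetical, and this section does not specify the enrichment tool that was actually used.

```python
from math import comb

def enrichment_p(n_background, n_in_term, n_degs, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    X is the number of DEGs annotated to a term when n_degs genes are
    drawn without replacement from n_background genes, n_in_term of
    which carry the annotation.
    """
    upper = min(n_in_term, n_degs)
    total = comb(n_background, n_degs)
    tail = sum(
        comb(n_in_term, k) * comb(n_background - n_in_term, n_degs - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

N_BACKGROUND = 20000  # hypothetical number of annotated background genes
N_TERM = 88           # hypothetical size of one pathway gene set
N_DEGS = 1022         # DEG count reported in the text
N_OVERLAP = 20        # hypothetical number of DEGs in the pathway

p_value = enrichment_p(N_BACKGROUND, N_TERM, N_DEGS, N_OVERLAP)
expected = N_TERM * N_DEGS / N_BACKGROUND  # overlap expected by chance
print(f"expected overlap {expected:.1f}, observed {N_OVERLAP}, p = {p_value:.2e}")
```

Because hundreds of terms are tested, the raw P values would normally be adjusted for multiple testing before a term is called significant.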

Regulatory relationships of prognostic genes in BLCA

To further explore the molecular regulatory mechanisms in BLCA, multiple prediction databases were integrated to systematically analyze the interactions between prognostic genes, miRNAs, and lncRNAs. This integrated analysis predicted 158 miRNAs associated with 9 prognostic genes and identified 97 miRNA-lncRNA interactions. These findings were incorporated into an mRNA-miRNA-lncRNA network consisting of 183 nodes (8 mRNAs, 158 miRNAs, and 17 lncRNAs) and 271 relationships (Fig. 9D). For example, HCG11, MALAT1, and NORAD indirectly regulated CAST and CD2AP via hsa-miR-1297 and hsa-miR-26a-5p, while MALAT1 and MIR29B2CHG indirectly regulated CD2AP through hsa-miR-200a-3p. This regulatory network may have a significant impact on the progression of BLCA.
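The indirect lncRNA-miRNA-mRNA relationships described above can be represented as two edge maps and traversed programmatically. The sketch below uses only the nodes named in the text. The exact pairings are not given there, so the code assumes that every listed lncRNA pairs with every listed miRNA in the same sentence; this is an illustrative assumption, not the published network.

```python
from collections import defaultdict

# Illustrative edges only (assumed full pairing within each sentence).
mirna_to_mrna = {
    "hsa-miR-1297": ["CAST", "CD2AP"],
    "hsa-miR-26a-5p": ["CAST", "CD2AP"],
    "hsa-miR-200a-3p": ["CD2AP"],
}
lncrna_to_mirna = {
    "HCG11": ["hsa-miR-1297", "hsa-miR-26a-5p"],
    "MALAT1": ["hsa-miR-1297", "hsa-miR-26a-5p", "hsa-miR-200a-3p"],
    "NORAD": ["hsa-miR-1297", "hsa-miR-26a-5p"],
    "MIR29B2CHG": ["hsa-miR-200a-3p"],
}

def indirect_targets(lncrna):
    """Map each mRNA reachable from a lncRNA to the bridging miRNAs."""
    targets = defaultdict(set)
    for mirna in lncrna_to_mirna.get(lncrna, []):
        for mrna in mirna_to_mrna.get(mirna, []):
            targets[mrna].add(mirna)
    return {mrna: sorted(mirnas) for mrna, mirnas in targets.items()}

mrnas = {m for targets in mirna_to_mrna.values() for m in targets}
nodes = set(mirna_to_mrna) | set(lncrna_to_mirna) | mrnas
n_edges = (sum(len(v) for v in mirna_to_mrna.values())
           + sum(len(v) for v in lncrna_to_mirna.values()))

print(len(nodes), n_edges)
print(indirect_targets("MIR29B2CHG"))

# Consistency check on the node counts reported in the text.
reported_nodes = 8 + 158 + 17  # mRNAs + miRNAs + lncRNAs
```

The same node bookkeeping applied to the reported network gives 8 + 158 + 17 = 183 nodes, matching the stated total.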

Validation of prognostic genes in clinical samples

To further validate the differential expression of prognostic genes, BLCA and normal tissue samples were collected for RT-qPCR analysis. The results demonstrated that the expression levels of CAST, DSTN, JUN, SPINK1, and SPTBN1 were significantly reduced in BLCA samples (Fig. 10), and these expression patterns aligned with those identified in the bioinformatics analysis.

Fig. 10

Expression levels of prognostic genes in clinical samples. ***, P < 0.001.
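Relative expression from RT-qPCR is often summarized with the 2^-ΔΔCt method. The sketch below assumes that method and a single reference gene; neither is stated in this section, and all Ct values and the function name `relative_expression` are hypothetical.

```python
from statistics import mean

def relative_expression(target_ct_case, ref_ct_case,
                        target_ct_ctrl, ref_ct_ctrl):
    """Fold change of a target gene (case vs. control) by 2^-ddCt.

    Each argument is a list of Ct values; group means are used here
    for simplicity.
    """
    d_ct_case = mean(target_ct_case) - mean(ref_ct_case)
    d_ct_ctrl = mean(target_ct_ctrl) - mean(ref_ct_ctrl)
    dd_ct = d_ct_case - d_ct_ctrl
    return 2 ** (-dd_ct)

# Hypothetical Ct values, NOT study data.
tumor_target = [26.0, 26.5, 25.5]   # prognostic gene in BLCA tissue
tumor_ref = [18.0, 18.5, 17.5]      # reference gene in BLCA tissue
normal_target = [24.0, 24.5, 23.5]  # prognostic gene in normal tissue
normal_ref = [18.0, 18.5, 17.5]     # reference gene in normal tissue

fold = relative_expression(tumor_target, tumor_ref,
                           normal_target, normal_ref)
print(fold)  # a value below 1 means lower expression in tumor tissue
```

With these illustrative numbers, ΔΔCt is 2 cycles, which corresponds to a four-fold reduction in the tumor group.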

Discussion

BLCA is one of the most prevalent malignancies worldwide, and the high heterogeneity of bladder tumors is a major therapeutic challenge. As a critical transcriptional stratification tool, scRNA-seq allows for accurate analysis of cell subset characteristics and the identification of differences in biomarkers and heterogeneity across various cell types in BLCA33. In the present study, scRNA-seq analysis revealed significant cellular heterogeneity between PT and LNM. Special attention was given to epithelial cell subpopulations, with cluster 3 emerging as a major component of LNM. Pseudotime and enrichment analyses indicated that oxidative phosphorylation in epithelial cell cluster 3 was significantly higher than in other subgroups. Petrella et al. used proton nuclear magnetic resonance spectroscopy to characterize the metabolic features of three bladder tumor cell lines, showing that glycolysis and oxidative phosphorylation were crucial for the growth and maintenance of BLCA cell lines and were associated with their genetic characteristics34. Tumor cell lines with low progression risk relied on oxidative phosphorylation, while those with high progression risk were linked to glycolytic metabolism. Taken together, these observations indicate that oxidative phosphorylation is an active component of bladder tumor cell metabolism and highlight epithelial cell cluster 3 as a focal point for understanding the metabolic and metastatic behaviors of BLCA.

By integrating differential expression analysis with various machine learning algorithms, 9 prognostic genes were identified: APOL1, CAST, DSTN, SPINK1, JUN, S100A10, SPTBN1, HES1, and CD2AP. HES1 (hairy and enhancer of split 1) is a core target gene in the Notch signaling pathway35, and low HES1 expression has been associated with EMT and metastatic characteristics in BLCA36. CD2AP (CD2-associated protein), a scaffold protein involved in cell adhesion, signal transduction, and cytoskeletal reorganization37, has been found to be significantly expressed in BLCA and correlated with poor prognosis38. The circCD2AP/ELAVL1/USP21/FOXQ1 axis plays a significant role in regulating EMT and stemness in BLCA. S100A10 (S100 calcium-binding protein A10), a member of the S100 calcium-binding protein family, regulates various biological processes through intracellular and extracellular signaling pathways39. Jianhua Zhu et al. demonstrated that the lncRNA KCNMB2-AS1 promotes BLCA progression by modulating the miR-374a-3p/S100A10 axis40. DSTN (destrin), which encodes an actin-binding protein41, was found by scRNA-seq analysis to be highly expressed in various subtypes of bladder urothelial carcinoma and to influence patient survival by participating in cancer-related pathways42. APOL1 (apolipoprotein L1) plays a key role in lipid exchange and transport, as well as in regulating programmed cell death and autophagy43. Functional alterations in APOL1 can lead to lipid disorders, cancer, and other diseases. CAST (calpastatin) is involved in multiple biological processes, with core functions that include the regulation of intracellular protein metabolism, cell signaling, and tissue development44. Overexpression of CAST has been shown to significantly promote tumor growth in xenograft models45. SPINK1 (serine protease inhibitor Kazal type 1), structurally similar to epidermal growth factor, directly and indirectly affects cell proliferation, metastasis, drug resistance, and cancer stemness in various cancers46.
JUN (Jun proto-oncogene) encodes the c-JUN protein, which is involved in several biological processes47. c-JUN serves as a key molecule in tumor-related signaling pathways, influencing tumor initiation and progression by regulating cell proliferation, differentiation, apoptosis, and cell cycle progression48. Studies of APOL1, CAST, SPINK1, and JUN in BLCA49,50,51,52 largely rely on bioinformatics analyses using the TCGA database, focusing on prognosis, autophagy-related functions, and immunotherapy responses. SPTBN1 (spectrin beta non-erythrocytic 1) functions primarily as a tumor suppressor, inhibiting tumor onset and progression by regulating inflammation, metabolism, stemness, and the immune microenvironment. However, research on SPTBN1 in BLCA is limited, and its mechanisms remain unclear. These findings highlight the diverse roles of these genes in BLCA biology and suggest their potential as prognostic biomarkers.

The mechanisms of action of prognostic genes, such as HES1 and CD2AP, are closely linked to the ECM. Interestingly, in the analysis of the BLCA prognostic model constructed using these genes, DEGs between the HRG and LRG were significantly enriched in the ECM-receptor interaction pathway. This finding further supports the potential role of the ECM in BLCA development, progression, and LNM. In a study on oral cancer, Shaikh et al. found that FN1 and COL5A1 were localized at the stromal edges of tumor margins and observed that their co-occurrence with ECM enrichment and an immunosuppressive microenvironment was associated with LNM53. Moreover, FSTL3, a biomarker associated with ECM remodeling in colorectal cancer, has been shown to promote LNM by inducing macrophage and fibroblast polarization, leading to T-cell exhaustion and the establishment of a suppressive immune microenvironment. These findings underscore the close relationship between the ECM, the tumor immune microenvironment (TIME), and LNM, suggesting that the ECM may be a crucial pathway through which the identified prognostic genes influence BLCA metastasis. This hypothesis is consistent with the study of Adrian Wullweber et al., who reported that ECM-related genes are significantly upregulated in bladder tumors compared to normal urothelial cells54. Therefore, how the ECM mediates the contribution of these prognostic genes to LNM warrants in-depth exploration in future studies.

While the critical role of the TIME in cancer treatment has been well established55,56, and existing studies have constructed prognostic signatures for BLCA by integrating single-cell and bulk transcriptomic data57, the present model incorporates single-cell data related to LNM and therefore offers added value for BLCA prognosis prediction. Compared to existing BLCA models, our risk model, which includes 9 prognostic genes, demonstrated high predictive accuracy in the training cohort, with AUC values of 0.912, 0.939, and 0.923 for 1-, 4-, and 7-year survival predictions, respectively58,59. Furthermore, a nomogram model combining tumor T stage and risk score yielded AUC values of 0.927, 0.936, and 0.885 for 1-, 4-, and 7-year survival predictions, respectively. These results indicate improved predictive efficacy, providing a more reliable tool for forecasting the prognosis of patients with BLCA. The primary goal of neoadjuvant therapy is to increase the likelihood of successful surgical resection by “downstaging” the tumor. A well-developed prognostic model, capable of stratifying patients by risk level, can help prevent unnecessary treatment in low-risk individuals. However, the current model was constructed using post-surgical tumor tissue samples, whereas pre-neoadjuvant therapy samples are typically obtained via small needle biopsies. This difference in sample types could introduce sampling bias, potentially affecting the accuracy of predictions regarding therapeutic responses. Additionally, the efficacy of perioperative treatment is influenced by various clinical variables; failure to include these factors in the model may lead to either overestimating or underestimating patient prognosis. Therefore, further validation using larger and more diverse clinical cohorts is necessary to confirm the clinical utility of these prognostic models.
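The AUC values quoted above summarize how well a risk score separates patients who die before a given time point from those still alive at it. The sketch below is a deliberately simplified illustration of that idea: patients censored before the horizon are dropped, whereas a published analysis would normally use a censoring-aware time-dependent ROC estimator. All data and the function name `auc_at_horizon` are hypothetical.

```python
def auc_at_horizon(risk, time, event, horizon):
    """Naive AUC for 'event by horizon' versus 'event-free at horizon'.

    risk    : risk scores (higher = predicted worse outcome)
    time    : follow-up time in years
    event   : 1 if death was observed, 0 if censored
    horizon : time point of interest in years
    Patients censored before the horizon are excluded (simplification).
    """
    cases = [r for r, t, e in zip(risk, time, event)
             if e == 1 and t <= horizon]
    controls = [r for r, t in zip(risk, time) if t > horizon]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    # Fraction of case-control pairs ranked correctly; ties count as half.
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))

# Hypothetical cohort, NOT study data.
risk = [2.1, 1.8, 0.4, 0.9, 1.9, 0.3, 1.1, 0.2]
time = [0.5, 0.8, 6.0, 3.0, 2.0, 8.0, 0.9, 5.0]
event = [1, 1, 0, 1, 1, 0, 0, 0]

auc_1y = auc_at_horizon(risk, time, event, horizon=1.0)
print(auc_1y)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives a scale for interpreting values such as those reported for the 1-, 4-, and 7-year predictions.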

The differences in immune cell infiltration between the HRG and LRG were analyzed, revealing substantial variations in the proportions of 9 immune cell types between the groups. M0 and M1 macrophages were significantly more prevalent in the HRG, while resting dendritic cells, Tregs, and CD8+ T cells were more abundant in the LRG. Previous studies have shown that myeloid cells infiltrating the TIME, including tumor-associated macrophages (TAMs) and dendritic cells, can contribute to and even drive T-cell exhaustion through their antigen-presenting function60. This suggests that T-cell dysfunction results from cooperative regulation between tumor cells and these myeloid cells, rather than from the action of a single cell type. Additionally, TAMs have been shown to play a role in promoting angiogenesis in BLCA and may be essential for this process61. CD8+ T cells, also known as cytotoxic T cells, are crucial for antitumor immune responses, and their presence in tumor tissues serves as a prognostic marker. Fei Lin et al. developed a BLCA risk assessment model based on genes related to CD8+ T cell infiltration, finding a negative correlation between the risk score and CD8+ T cell infiltration62. Tregs, which are known to infiltrate various cancers in high numbers63,64,65, can suppress antitumor immune responses and are often linked to poor prognosis in patients. However, this contrasts with our results, in which Tregs were more abundant in the LRG. Interestingly, a study similar to ours found that higher Treg infiltration in BLCA was associated with a better prognosis66. This contradiction may stem from not distinguishing between suppressive and non-suppressive Treg cells67. Thus, identifying suppressive Tregs specifically is crucial for assessing the immunosuppressive microenvironment in tumors. Furthermore, the relationship between the BLCA tumor microenvironment and the risk score was evaluated.
The findings revealed that the StromalScore, ImmuneScore, and EstimateScore were significantly higher in the HRG than in the LRG, suggesting that tumor purity was lower in the HRG, which correlates with poorer prognosis68. An analysis of immune escape demonstrated that the TIDE score was significantly elevated in the HRG compared to the LRG, with a strong positive correlation between the TIDE score and the risk score. The TIDE score evaluates a tumor’s immune escape capability based on immune dysfunction and immune exclusion, with higher scores indicating stronger immune escape abilities69, which was also confirmed by our study’s results.
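The link between a higher EstimateScore and lower tumor purity can be made concrete with the cosine fit reported in the original ESTIMATE publication. The sketch below assumes that fit, which was derived from Affymetrix expression data and may not transfer exactly to other platforms; the constants should be verified against the ESTIMATE documentation, and the two input scores are hypothetical.

```python
import math

def estimate_tumor_purity(estimate_score):
    """Approximate tumor purity from an ESTIMATE score.

    Uses the cosine fit from the ESTIMATE publication (assumed here;
    derived on Affymetrix data).
    """
    return math.cos(0.6049872018 + 0.0001467884 * estimate_score)

# Hypothetical group-level scores, NOT study data.
low_risk_score = -1500.0
high_risk_score = 1500.0

purity_low_risk = estimate_tumor_purity(low_risk_score)
purity_high_risk = estimate_tumor_purity(high_risk_score)
print(round(purity_low_risk, 3), round(purity_high_risk, 3))
```

Within the usual score range the function decreases monotonically, so a group with higher stromal and immune content is assigned a lower estimated purity.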

While this study revealed the cellular heterogeneity and immune microenvironment characteristics of BLCA and constructed a high-precision prognostic model, several limitations remain. First, the sample size of scRNA-seq data is relatively small; increasing the sample size could improve the model’s accuracy and reliability. Second, performance degradation was observed during the initial cross-cohort validation, primarily due to significant differences in sample sources, detection platforms, and patients’ baseline clinical characteristics between the training and validation sets. Future studies should include more multicenter cohorts with consistent sample inclusion criteria and standardized experimental protocols for further validation. Third, immune infiltration and immune checkpoint analyses were based solely on transcriptomic data. Immunohistochemistry and flow cytometry will be necessary for further validation of these results. Fourth, although the expression levels of prognostic genes in BLCA and normal samples have been confirmed, targeted investigations in LNM and PT samples have not yet been conducted. To address this gap and validate the generalizability of our findings, we plan to systematically collect a sufficient number of LNM and PT samples for further experiments, which will enhance the clinical relevance and applicability of the results. Finally, while drug sensitivity analysis offers valuable guidance for personalized treatment, factors such as drug selection and individual biological differences may influence the prediction outcomes. In summary, integrating additional transcriptomic data and other omics layers will enhance the biological interpretability of the model and further strengthen the scientific foundation of this study.

Conclusion

The cellular heterogeneity and immune microenvironment characteristics of BLCA were thoroughly examined through scRNA-seq of clinical samples and various bioinformatics analyses. A significant epithelial cell subpopulation (cluster 3) associated with LNM was identified, and a high-precision prognostic model was developed using 9 prognostic genes: APOL1, CAST, DSTN, SPINK1, JUN, S100A10, SPTBN1, HES1, and CD2AP. Additionally, unique immune microenvironment characteristics were uncovered, providing a novel theoretical basis for personalized treatment and prognostic assessment of BLCA.