Main

Cancer development is a multistep process during which cancerous cells acquire the ability to overcome limitations in replicative potential and evade immune destruction1,2. Meanwhile, noncancerous cells, such as stromal cells, are gradually reprogrammed to support tumor growth. The organ in which the tumor arises, with its unique tissue-resident cell types, creates substantial diversity in the tumor microenvironment (TME)3 and shapes distinct clinical properties of tumors, such as molecular subtypes, invasion abilities and response to targeted therapy or immunotherapy4,5,6. Recent advances in single-cell RNA sequencing (scRNA-seq) and spatial transcriptomics (ST) have provided powerful tools for exploring the diversity of the TME. However, it remains unclear whether phenotype-related cell types are universally present in different cancer types and stages. Moreover, the interactions between different cell types that form specific ecotypes within the TME are not well understood. Therefore, investigating the dynamics of TME composition and intercellular interactions from a pan-cancer perspective is critical for elucidating the pathogenesis of cancer and may reveal promising therapeutic targets.

The increasing accumulation of scRNA-seq datasets in the public domain allows for the integration of datasets from a pan-cancer perspective, which helps to identify common or cancer-type-specific mechanisms of the TME. Previous studies have constructed separate landscapes for myeloid cells and T cells, portraying a systematic view of tumor-infiltrating immune cells and revealing distinct patterns of cell-type composition between cancer types7,8,9,10,11. In addition to immune cells, stromal cells play important roles in the TME of solid tumors. Increasing evidence suggests that stromal cells actively regulate tumor progression and metastasis by remodeling the extracellular matrix (ECM), promoting epithelial–mesenchymal transition (EMT) and angiogenesis12,13,14,15,16,17. At present, a few studies have delineated the diversity of fibroblasts in a single or a handful of cancer types. These studies were constrained by limited cell numbers, hindering a complete characterization of the complexity of stromal cell types. Furthermore, they could not fully analyze the interactions between cell types that contributed to the heterogeneity of stromal subtypes18,19.

In this study, we collected the largest set of published solid tumor-associated scRNA-seq data to date, comprising approximately 4 million cells. Using this dataset, we constructed a comprehensive Tabula of TME cells across 36 cancer types and, using an integrated approach, defined 6 major cell lineages and 56 cell subtypes within the TME. The resulting blueprint serves as a roadmap for understanding the complexity of the TME, identifying phenotype-related cell types and developing therapeutic strategies that may apply across multiple cancer types.

Results

Use of the TabulaTIME framework for integrating large-scale tumor scRNA-seq data

To characterize the heterogeneous composition and evolution of the TME during tumor initiation, progression and metastasis across different cancer types, we present the Tabula of the tumor immune microenvironment (TabulaTIME) framework. The framework consists of five major modules: tumor-related scRNA-seq data collection, data preprocessing and MetaCell identification, integration of all lineages, lineage-specific integration and characterization of cell subtypes (Fig. 1a).

Fig. 1: Characterization of scRNA-seq data in the pan-cancer TME.

a, Schematic depicting the TabulaTIME framework and its application. TabulaTIME was applied within a multiphase workflow, encompassing tumor-related scRNA-seq data collection, data preprocessing and MetaCell identification, integration of all lineages, lineage-specific integration and characterization of cell subtypes; Imm-reg, immune-regulatory; AP, antigen-presenting; TCA, tricarboxylic acid; NSCLC, non-small cell lung cancer; HNSC, head and neck squamous cell carcinoma; KIPAN, pan-kidney cohort; OV, ovarian serous cystadenocarcinoma; ESCA, esophageal cancer; CRC, colorectal cancer; SKCM, skin cutaneous melanoma; PRAD, prostate adenocarcinoma; LIHC, liver hepatocellular carcinoma; UVM, uveal melanoma. b, Data collection statistics. The numbers of cells (top) and donors (bottom) collected for each tissue are presented; k indicates ×1,000. c, Uniform manifold approximation and projection (UMAP) visualization of all MetaCells, colored by the cell type (top) and source (bottom), respectively. d, Expression of cell-type-specific markers. Dot size and color represent the percentage of cells with the gene expressed and the average expression value, respectively; Mono, monocytes; Macro, macrophages; DC, dendritic cells; Treg, regulatory T cells; CD4+ Tconv, conventional CD4+ T cells; Tprolif, proliferating T cells.

First, we collected tumor-related scRNA-seq datasets20,21,22,23, consisting of 103 studies covering 36 cancer types, 746 donors and 4,483,367 cells (Fig. 1b and Supplementary Table 1). Second, all collected datasets were preprocessed using the MAESTRO workflow24, which included quality control, doublet and batch effect removal25, cell clustering and cell-type annotation26,27 (Extended Data Fig. 1a–h and Supplementary Tables 2–4). To reduce technical noise and computing resource costs, TabulaTIME grouped cells with similar expression into MetaCells within each dataset, with each MetaCell containing approximately 30 cells (Extended Data Fig. 2a–d and Methods)8,28. The average log-transformed transcripts per million (TPM) gene expression of all cells within each MetaCell was used in downstream analyses. In the following two modules, TabulaTIME first integrated all MetaCells using canonical correlation analysis (CCA) to evaluate the effectiveness of batch removal between different cancer types (Fig. 1a, all lineages integration)25. Compared to the batch effect-corrected integration of single-cell profiles, the integration using MetaCells demonstrated superior performance, significantly reducing batch effects while preserving cell-type-specific biological variation (Extended Data Fig. 2e,f). To improve resolution for cell-type-specific analyses, TabulaTIME also integrated MetaCells from each lineage (Fig. 1a, lineage-specific integration). To mitigate the risks of overclustering and underclustering, we determined the optimal clustering resolution based on average silhouette width (ASW) scores and Clustree and assessed the purity of each annotated subtype using ROGUE scores29,30,31. Finally, TabulaTIME integrated the lineage-specific pan-cancer maps with other types of cancer data, such as ST and bulk tumor profiles from The Cancer Genome Atlas (TCGA). This enabled the analysis of tumor-enriched cell types by quantifying their relative abundance across different sources, cancer types and spatial localization. Furthermore, it facilitated the investigation of cell-type-specific functions and the estimation of their effects on immune cell infiltration and prognosis (Fig. 1a, characterization of cell subtypes). Together, TabulaTIME is a powerful framework for investigating the cellular compositions and functional states of the TME with temporal and spatial resolution.
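
The MetaCell step can be illustrated with a minimal sketch. It assumes that cells have already been assigned to clusters of similar expression and that expression values are already log transformed. The function name `build_metacells`, the strided split within each cluster and the `target_size` parameter are hypothetical choices made for illustration; they are not the grouping procedure described in Methods.

```python
from collections import defaultdict


def build_metacells(log_expr, cluster_labels, target_size=30):
    """Group cells of the same cluster into MetaCells of roughly
    `target_size` cells and average their log-transformed expression.

    log_expr: list of per-cell expression vectors (one float per gene)
    cluster_labels: list of cluster ids, one per cell (assumed precomputed)
    Returns a list of (cluster_id, member_indices, mean_vector) tuples.
    """
    by_cluster = defaultdict(list)
    for idx, label in enumerate(cluster_labels):
        by_cluster[label].append(idx)

    metacells = []
    for label, members in sorted(by_cluster.items()):
        # At least one MetaCell per cluster, about target_size cells each.
        n_groups = max(1, round(len(members) / target_size))
        for g in range(n_groups):
            # Strided split keeps group sizes balanced (illustrative choice).
            group = members[g::n_groups]
            n_genes = len(log_expr[group[0]])
            mean_vec = [
                sum(log_expr[i][j] for i in group) / len(group)
                for j in range(n_genes)
            ]
            metacells.append((label, group, mean_vec))
    return metacells
```

In this sketch, a cluster of 65 cells yields two MetaCells of 33 and 32 cells, and a cluster smaller than the target size is kept as a single MetaCell.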

A pan-cancer single-cell transcriptome blueprint generated by TabulaTIME

We then tested the ability of TabulaTIME to integrate tumor scRNA-seq datasets with millions of cells at scale and diverse sources of batch effects. The integrated pan-cancer landscape includes datasets derived from adjacent normal tissue, precancerous tissue, primary tumor, metastatic tumor and peripheral blood mononuclear cell samples of 22 different tissues (Fig. 1b). Additionally, 16 tumor scRNA-seq datasets were obtained from individuals treated with various strategies, including PD-1/PD-L1-based immunotherapy, chemotherapy, targeted therapy and combined therapies (Extended Data Fig. 2g and Supplementary Table 1). For the pan-cancer datasets, a total of 140,072 MetaCells were generated, and batch effects were further corrected using CCA (Fig. 1c and Extended Data Fig. 2h). Interestingly, immune and stromal cells were separated into distinct clusters, which were annotated as seven major cell lineages, including cytotoxic lymphocytes (CD8+ T, natural killer (NK) and proliferating T cells), conventional and regulatory lymphocytes (CD4+ Tconv and Treg cells, respectively), B lymphocytes (B cells and plasma cells), myeloid cells (monocytes/macrophages, dendritic cells and mast cells), fibroblasts (fibroblasts and myofibroblasts), endothelial cells and epithelial-like cells (epithelial, malignant and tissue-specific; Fig. 1c,d). Considering the diverse molecular features of epithelial, malignant and tissue-specific cells, TabulaTIME mainly analyzed the immune and stromal compartments of the TME in the following analyses. The batches from different sample sources and cancer types were well mixed, and the annotated labels were highly consistent with the original labels, even at the minor cell lineage scale, indicating the efficient integration of millions of cells using our strategy.

TabulaTIME reveals common characteristics of pan-cancer lymphocytes

To understand the heterogeneity of pan-cancer immune cells at a higher resolution, we separately integrated and reanalyzed each lineage. A total of ten cytotoxic lymphocyte subtypes were identified, each labeled by its functional characteristics and marker gene (Fig. 2a). Signature enrichment analyses suggested that three NK subtypes and GZMB+ effector memory CD8+ T cells (CD8Temra_GZMB) had higher cytotoxic scores, whereas exhausted CD8+ T cells (CD8Tex_HAVCR2) showed the highest exhaustion scores and naive CD8+ T cells (CD8Tn_CCR7) displayed the lowest cytotoxic and exhaustion scores, consistent with their phenotypes, respectively (Fig. 2b, Extended Data Fig. 3a and Supplementary Table 5). Interestingly, GZMK+ effector memory CD8+ T cells (CD8Tem_GZMK) were significantly enriched in precancerous tumor samples and were more prevalent than cytotoxic NK cells across different cancer types, indicating stronger T cell-mediated antitumor immunity (Fig. 2c, Extended Data Fig. 3b and Supplementary Table 6). By contrast, blood and normal samples were enriched with naive CD8+ T cells (CD8Tn_CCR7; Fig. 2c). Additionally, TabulaTIME resolved ten subtypes of conventional and regulatory lymphocytes and six B lymphocyte subtypes (Extended Data Fig. 3c–h and Supplementary Table 6). Overall, TabulaTIME revealed shared characteristics of pan-cancer lymphocytes that were highly consistent with their corresponding sources and functional phenotypes8.

Fig. 2: Pan-cancer immune cell heterogeneity.

a, UMAP visualization of the distribution of cytotoxic lymphocytes, colored by cell type; MAIT, mucosal-associated invariant T cell; ISG, interferon-stimulated gene-expressing T cells. b, Scatter plot illustrating the expression patterns of cytotoxic- and exhaustion-associated signature genes in cytotoxic lymphocyte subsets across different sources. c, Box plot showing the proportion of distinct cytotoxic lymphocyte subgroups in each sample type (blood (red), normal tissue (green), precancerous tissue (orange), tumor tissue (purple) and metastatic tissue (blue)), based on 650 treatment-naive samples. Significance labels in the figure were determined using Kruskal–Wallis tests to compare distributions of cytotoxic lymphocyte subgroups across five tissue types. Significance for pairwise source proportions within each cell type, assessed via two-tailed unpaired Wilcoxon tests, is reported in Supplementary Table 6. The open rectangle annotates the comparative scope, with Benjamini–Hochberg (BH) correction for multiple testing. The bottom of the box represents the first quartile (Q1), and the top of the box represents the third quartile (Q3). The height of the box represents the interquartile range (IQR), whereas the horizontal line inside the box indicates the median. The whiskers extend to the positions of Q1 – 1.5 × IQR and Q3 + 1.5 × IQR. d, A similar UMAP plot as in a was applied to myeloid cells; pDC, plasmacytoid dendritic cells; cDC1, type 1 conventional dendritic cells; cDC2, type 2 conventional dendritic cells. e, Heat map showing different expression patterns of function-associated signature genes among monocyte and macrophage subsets. f, Box plot showing the proportions of distinct monocyte and macrophage subgroups across different sources, including blood (red), normal tissue (green), precancerous tissue (orange), tumor tissue (purple) and metastatic tissue (blue), derived from 687 treatment-naive individuals. 
Significance labels in the figure were determined by Kruskal–Wallis tests to compare each subgroup distribution among five tissue types. Significance for pairwise source proportions within each cell type, assessed via two-tailed unpaired Wilcoxon tests, is reported in Supplementary Table 6. The open rectangle annotates the comparative scope, with BH correction for multiple testing. The bottom of the box represents Q1, and the top of the box represents Q3. The height of the box represents the IQR, whereas the horizontal line inside the box indicates the median. The whiskers extend to the positions of Q1 – 1.5 × IQR and Q3 + 1.5 × IQR. g, Heat map showing the proportion of different myeloid cell types in various cancer types or healthy tissues. For rows, a bar plot illustrates the number of MetaCells (in log10 scale) and the origin of cancer cells labeled by the different colors; PBMC, peripheral blood mononuclear cells; SCC, squamous cell carcinoma; UCEC, uterine corpus endometrial carcinoma; STAD, stomach adenocarcinoma; CHOL, cholangiocarcinoma; THCA, thyroid cancer; CESC, cervical squamous cell carcinoma and endocervical adenocarcinoma; BRCA, breast invasive carcinoma; PAAD, pancreatic adenocarcinoma; GIST, gastrointestinal stromal tumors; OS, osteosarcoma; BCC, basal cell carcinoma; SCLC, small cell lung cancer. h, Left, heat map showing Jaccard similarity indices for comparisons among 3,751 robust NMF programs based on the top 50 genes within the monocyte and macrophage cell populations. Programs are ordered by clustering and grouped into families of MPs with related functions (marked by black dashed lines); MP families are numbered and labeled. Right, list of all MP names separated into MP families; IFNγ, interferon-γ. i, Box plot illustrating the signature scores of MPs calculated across distinct monocyte and macrophage subtypes, derived from 797 samples, with MPs color coded according to their functional annotations. 
The bottom of the box represents Q1, and the top of the box represents Q3. The height of the box represents the IQR, whereas the horizontal line inside the box indicates the median. The whiskers extend to the positions of Q1 – 1.5 × IQR and Q3 + 1.5 × IQR. j, Pseudotime-ordered analysis of monocyte and macrophage MetaCells, colored by cell type. k, Kaplan–Meier plots show worse clinical outcomes in 184 individuals with ESCA and 459 individuals with SKCM with higher expression of SLPI+ macrophage signature genes; +, censored observations. Statistical significance was evaluated using a log-rank test, yielding P values of 0.014 for ESCA and 0.0001 for SKCM.

Divergent trajectories of myeloid cells with inflammatory and fibrotic functions

We proceeded to investigate the characteristics of myeloid cells and identified 12 subtypes (Fig. 2d and Extended Data Fig. 4a–d). Monocytes and macrophages were categorized into two groups of monocytes (classical Mono_FCN1 and nonclassical Mono_CD16) and six groups of macrophages. Interestingly, traditional M1/M2 signatures could not clearly distinguish the macrophage subtypes within the TME, indicating a high level of plasticity and heterogeneity among macrophages (Fig. 2e and Supplementary Table 5). Consistent with previous studies, phagocytic Macro_C1QC and anti-inflammatory/angiogenic Macro_THBS1 signatures were highly enriched in precancerous and tumor samples, suggesting that they are potential tumor-associated macrophages (TAMs) regulating tumor immunity7 (Fig. 2e–g and Supplementary Table 6). Intriguingly, TabulaTIME identified a profibrotic TAM subtype (Macro_SLPI) that was enriched in a subset of tumors, such as basal cell carcinoma and cholangiocarcinoma (Fig. 2f,g). Based on the enrichment of known macrophage-associated signatures, we postulate that Macro_SLPI cells have diminished phagocytic and inflammatory capacity but markedly elevated ECM-remodeling capability (Fig. 2e).

To investigate the potential mechanisms of TAM plasticity, we applied non-negative matrix factorization (NMF) to monocytes and macrophages and identified 3,751 robust expression programs (Fig. 2h and Methods). By comparing the NMF programs, we defined the consensus programs as meta-programs (MPs) based on their shared genes. As expected, the profibrotic subtype Macro_SLPI exhibited the highest program scores for EMT and focal adhesion, similar to wound healing and profibrotic macrophages in lung fibrosis and coronavirus disease 2019 (Fig. 2i)32,33,34. Lineage tracing analyses suggested that although they both originate from monocytes, the profibrotic Macro_SLPI signature follows a distinct developmental branch compared to the phagocytic Macro_C1QC or the anti-inflammatory Macro_THBS1 signature (Fig. 2j). Finally, using the TCGA dataset, we examined the correlation between Macro_SLPI signature scores and overall survival. Higher Macro_SLPI signature scores were strongly associated with an increased risk of death in various cancer types, such as esophageal carcinoma (log-rank test, P = 0.014) and skin cutaneous melanoma (SKCM; log-rank test, P = 0.0001; Fig. 2k and Extended Data Fig. 4h,i). Together, our analyses revealed distinct molecular diversities and trajectories of TAMs in the TME.
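
How overlapping gene content could be used to relate NMF programs can be sketched as follows. The example computes the Jaccard similarity between the top genes of each program and groups programs whose similarity reaches a cutoff. The function names, the single-linkage grouping and the `threshold` value are assumptions for illustration, not the clustering procedure used to define the MPs.

```python
def jaccard(a, b):
    """Jaccard index between two gene lists: |A and B| / |A or B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_by_overlap(programs, threshold=0.5):
    """Group programs (dict: name -> list of top genes) whose pairwise
    Jaccard index is at least `threshold` (hypothetical cutoff), using
    single linkage implemented with a small union-find structure."""
    names = sorted(programs)
    parent = list(range(len(names)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if jaccard(programs[names[i]], programs[names[j]]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return sorted(groups.values())
```

Two programs sharing three of five distinct genes have a Jaccard index of 0.6 and would fall into the same group at the cutoff used here, whereas a program with no shared genes remains on its own.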

ECM-associated CTHRC1+ fibroblasts are prevalent in tumor samples across cancer types

Stromal cells can establish a tumor-supportive environment by modifying the ECM and vasculature. Fibroblasts are the most prominent stromal cells, and their phenotypes and origins vary widely. TabulaTIME recognized seven subtypes of fibroblasts based on the expression of specific marker genes (Fig. 3a,b). These fibroblast subtypes were categorized into five major groups, including myofibroblasts (myFibro), ECM-remodeling fibroblasts (eFibro), immunoregulatory fibroblasts (iFibro), antigen-presenting associated fibroblasts (apFibro) and quiescent fibroblasts (qFibro) based on characteristic gene expression patterns (Fig. 3c)15,35. Interestingly, we observed that eFibro_CTHRC1 cells were predominantly derived from tumor samples and eFibro_SFPR1 cells were highly enriched in normal samples, whereas iFibro_IL6 cells were predominantly present in precancerous samples (Fig. 3d, Extended Data Fig. 5a and Supplementary Table 6). Furthermore, eFibro_CTHRC1 cells, which were prevalent in nearly all cancer types but not in healthy samples, also expressed canonical cancer-associated fibroblast (CAF) markers, including FAP36, LRRC15 (ref. 12) and POSTN15, suggesting an important role for this subtype in the TME (Fig. 3b,e). Functional signatures and MP analyses revealed that eFibro_CTHRC1 cells were enriched for EMT and ECM receptor interaction pathways (Fig. 3f and Extended Data Fig. 5b,c). Meanwhile, the metabolic pathway analysis showed that the glycosaminoglycan biosynthesis–chondroitin sulfate/dermatan sulfate pathway, which is involved in generating ECM components, was notably upregulated in eFibro_CTHRC1 cells (Fig. 3g and Extended Data Fig. 5d). Furthermore, we investigated the association between eFibro_CTHRC1 cells and prognosis. Higher expression of eFibro_CTHRC1 signature genes was correlated with worse clinical outcomes in multiple cancer types, including kidney renal clear cell carcinoma (KIRC; log-rank test, P = 0.00523) and bladder urothelial carcinoma (BLCA; log-rank test, P = 0.00568; Fig. 3h), which is consistent with the previously reported clinical impact of FAP+ and LRRC15+ fibroblasts. These findings collectively suggest that fibroblasts in the TME undergo pronounced reprogramming to adopt myofibroblastic phenotypes, which in turn remodel the ECM and potentially contribute to tumor growth.
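
The survival comparison above relies on a log-rank test between individuals with high and low signature scores. A minimal two-group sketch follows. The median split used to form the groups is an assumption (the cutoff is not specified here), and the function names are hypothetical.

```python
import math


def split_by_median(scores):
    """Return (high, low) sample indices; the median cutoff is an assumption."""
    ordered = sorted(scores)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        median = ordered[mid]
    else:
        median = (ordered[mid - 1] + ordered[mid]) / 2
    high = [i for i, s in enumerate(scores) if s > median]
    low = [i for i, s in enumerate(scores) if s <= median]
    return high, low


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test (event = 1 for death, 0 for censored).
    Returns (chi2, p_value) with one degree of freedom."""
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e == 1})

    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for ti, _, g in data if ti >= t and g == 0)  # at risk, group A
        n_b = sum(1 for ti, _, g in data if ti >= t and g == 1)  # at risk, group B
        d_a = sum(1 for ti, e, g in data if ti == t and e == 1 and g == 0)
        d_b = sum(1 for ti, e, g in data if ti == t and e == 1 and g == 1)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n  # observed minus expected deaths in A
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)

    if variance == 0:
        return 0.0, 1.0
    chi2 = obs_minus_exp ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

Passing the survival times and event indicators of the high-score group as the first pair of arguments and those of the low-score group as the second yields the test statistic and its P value; identical groups give a statistic of 0 and a P value of 1.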

Fig. 3: CTHRC1+ fibroblasts are broadly present in tumor datasets and highly express ECM-remodeling-associated genes.

a, UMAP visualization of fibroblast MetaCell distribution, colored by cell type. b, Dot plot depicting the expression of representative signature genes of each fibroblast cell type. c, Scatter plot showing ECM remodeling, immunoregulatory and antigen-presenting signature scores for each fibroblast subset; Imm-Reg, immune-regulatory. d, Box plot showing the proportion of each fibroblast cell type in different source-derived samples (normal, precancerous, tumor and metastatic tissue) from 338 treatment-naive samples. Significance labels in the figure were assessed via Kruskal–Wallis tests to compare each cell type distribution among four tissue types. Significance for pairwise source proportions within each cell type, assessed via two-tailed unpaired Wilcoxon tests, is reported in Supplementary Table 6. The open rectangle annotates the comparative scope, with BH correction for multiple testing. The bottom of the box represents Q1, and the top of the box represents Q3. The height of the box represents the IQR, whereas the horizontal line inside the box indicates the median. The whiskers extend to positions of Q1 – 1.5 × IQR and Q3 + 1.5 × IQR. e, Heat map showing the proportion of different fibroblast cell types in various cancer types or healthy tissues. For rows, a bar plot illustrates the number of miniclusters (in log10 scale) and the origin of cancer cells labeled by the different colors. f, Heat map displaying the enriched pathways for each fibroblast subset. Enrichment was calculated using hypergeometric distribution statistics, with P values adjusted by the BH method; FDR, false discovery rate. g, Violin plot showing the glycosaminoglycan biosynthesis pathway and the average metabolic pathway GSVA score for each fibroblast subset across 379 samples. h, Kaplan–Meier plots demonstrate the clinical impact of eFibro_CTHRC1 cells in 533 individuals with KIRC and 405 individuals with BLCA, comparing low and high signature scores; +, censored observations. 
Statistical significance was assessed via the log-rank test, with P values of 0.00523 for KIRC and 0.00568 for BLCA.

CTHRC1+ fibroblasts are located at the leading edge of the tumor region

Fibroblast migration and pathological matrix redeposition are frequently observed in fibrotic disease37,38. To understand the potential source and spatial localization of eFibros that are dramatically increased in the TME, we collected 62 published ST slides of tumor tissue sections from six cancer types. After filtering out low-quality spots and genes from each slide, we annotated malignant spots by combining markers from original studies and inferred copy number variations (Extended Data Fig. 6a–d)26. For tumor slides with clear tumor boundaries, such as HCC-1L from primary liver cancer39, we observed that both the fraction of eFibro_CTHRC1 cells and the expression of eFibro_CTHRC1 cell signatures were highly enriched at the leading edge from nontumor to tumor regions (Fig. 4a and Extended Data Fig. 6e,f). To investigate the general enrichment of eFibro_CTHRC1 cells at tumor boundaries, we calculated the correlations between the eFibro_CTHRC1 signature score and the distances to the tumor core regions among all ST slides containing both fibroblasts and malignant cells. The eFibro_CTHRC1 fractions showed a negative correlation with the distance to tumors in the majority of ST slides (total ST slides, 32/41, 78%), indicating a high enrichment of eFibro_CTHRC1 cells at tumor boundaries (Fig. 4b,c). Using multiplexed immunohistochemistry (mIHC) on in-house oral cancer samples, we further verified that eFibro_CTHRC1 cells localize specifically at the tumor boundaries (Fig. 4d).
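
The per-slide analysis can be sketched as a Pearson correlation between each spot's distance to malignant cells and its signature score, followed by a count of slides with a negative coefficient. The function names and the input layout (one pair of lists per slide) are assumptions for illustration. The sketch omits the BH-adjusted significance testing reported in Fig. 4b.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fraction_negative(slides):
    """slides: dict mapping slide name -> (distances, signature_scores),
    with one entry per spot. Returns the per-slide coefficients and the
    fraction of slides whose coefficient is negative."""
    coefficients = {
        name: pearson_r(distances, scores)
        for name, (distances, scores) in slides.items()
    }
    n_negative = sum(1 for r in coefficients.values() if r < 0)
    return coefficients, n_negative / len(coefficients)
```

A negative coefficient means that the signature score falls as spots lie farther from malignant cells, which is the pattern expected when the cell type concentrates at the tumor boundary.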

Fig. 4: CTHRC1+ fibroblasts are located at the leading edge from nontumor to tumor regions.

a, Left, hematoxylin and eosin staining of a tissue section from participant HCC-1L. Middle, distribution of normal, tumor and transition regions in participant HCC-1L (images reproduced with permission from ref. 39, AAAS). Right, spatial feature plot of the eFibro_CTHRC1 signature score; scale bars, 100 μm. b, Correlation between the eFibro_CTHRC1 signature score and the distance between spots and malignant cells in all ST samples. Correlations were calculated using Pearson correlation coefficients. Resulting P values were adjusted for multiple comparisons via the BH method. The significant negative correlation represents the CTHRC1+ fibroblasts surrounding malignant cells in the ST samples; PLC, pulmonary lymphangitic carcinomatosis. c, Scatter plots showing the correlation between the distance to malignant cells (x axis) and the signature score of eFibro_CTHRC1 (y axis) in tissue sections. The correlation was calculated using Pearson correlation coefficients. The color represents the proportion of fibroblasts in each spot. The error band indicates the 95% confidence interval, which is calculated based on the standard error using the normal distribution. The center measure of the smooth line corresponds to the predicted values from the linear regression model. d, IHC staining to validate the distribution of eFibro_CTHRC1 cells; scale bars, 100 μm; PanCK, pan-cytokeratin. e, Box plot showing the correlation between inferred inducers of CTHRC1+ fibroblasts and the CTHRC1+ fibroblast signature score. The red box represents data calculated from 62 ST samples, whereas the green box is derived from 9,460 samples across 23 cancer types in the TCGA project. The bottom of each box indicates Q1, and the top represents Q3. The height of the box reflects the IQR, and the horizontal line inside the box indicates the median. The whiskers extend to the positions of Q1 – 1.5 × IQR and Q3 + 1.5 × IQR. 
f, Comparison of the relative intensity (each row shared a color scale, whereas different rows did not) of immune cell-type scores between the normal and tumor regions of ST slides, focusing on the ST samples with eFibro_CTHRC1 cells surrounding tumor cells. g, Bubble heat map showing the interaction strength of gene pairs between fibroblasts and immune cells. Colors in the bubble plot are proportional to the communication probability. Significant interactions are identified on the basis of a statistical test that randomly permutes the group labels of cells and then recalculates the interaction probability. h, Inferred LGALS9 and CD44 interaction between CD4+ T cells and fibroblasts; Tfh, follicular helper T cell.

To explore the impact of malignant cells on eFibro_CTHRC1, we screened for genes upregulated in malignant cells that correlated with the expression of the eFibro_CTHRC1 signature score. Our analysis identified several integrins, such as ITGA6, which encodes a matrix stiffness-regulated mechanosensitive molecule that can induce invasive fibroblast phenotypes and mediate activation of transforming growth factor-β (TGFβ) in lung fibrosis (Fig. 4e and Extended Data Fig. 6g)40,41.

We then investigated whether the boundary formed by eFibro_CTHRC1 could serve as a physical barrier that may affect the infiltration of immune cells into the tumor core. We selected the ST slides with high enrichment of eFibro_CTHRC1 in the boundary regions and divided the slides into normal and tumor regions. As anticipated, immune cells were markedly more abundant on the normal side than on the tumor side (Fig. 4f). Consistently, the estimated infiltration of CD8+ T cells was notably higher in tumor samples with a lower eFibro_CTHRC1 signature score in the TCGA cohort in almost all cancer types (Extended Data Fig. 6h,i). Furthermore, compared to other fibroblasts, eFibro_CTHRC1 fibroblasts were more likely to interact with CD8+ T cells via LGALS9–CD44 and LGALS9–CD45 interactions (Fig. 4g,h). LGALS9 has been reported to induce apoptosis in T cells by binding to HAVCR2 and to bolster the stability and functionality of immunosuppressive Treg cells via interaction with CD44. Anti-Gal-9 therapy has the potential to selectively augment intratumoral HAVCR2+ cytotoxic CD8+ T cells42,43. Together, our analyses indicate that the presence of eFibro_CTHRC1 may reduce the infiltration of immune cells, potentially through both physical barriers formed by the ECM and immunosuppressive functions mediated by cell–cell interactions.

Colocalization of CTHRC1+ fibroblasts and SLPI+ macrophages forms profibrotic ecotypes

Our analyses suggest that eFibro_CTHRC1 cells are prevalent in tumor samples, and profibrotic Macro_SLPI cells are also significantly enriched in several cancer types (Figs. 2f,g and 3d,e). We then investigated whether there are potential connections or cooperation between these two cell types, as they are both enriched in similar ECM functions (Figs. 2e and 3c). Because the ST data we used were not at single-cell resolution but rather a mixture of six to ten cells, we quantified the coexistence of eFibro_CTHRC1 and Macro_SLPI cells by correlating their signature scores. As expected, eFibro_CTHRC1 and Macro_SLPI cells showed a high correlation at the spot level (R > 0.5), and this correlation increased with higher fibroblast fractions (Fig. 5a,b). This observation held true for almost all the evaluated slides (Fig. 5c). Among the other macrophage subtypes, Macro_SPP1 cells, which have also been reported to be associated with fibrosis44, showed a relatively weaker correlation with eFibro_CTHRC1 cells. The two cell types were also observed using individual single-cell datasets, ruling out the possibility of cell doublets (Extended Data Fig. 7a,b). Using TCGA-UVM and TCGA-KICH samples, we confirmed the high concordance between eFibro_CTHRC1 and Macro_SLPI cells (Extended Data Fig. 7c). Importantly, mIHC staining of CTHRC1, SLPI, CD68 and SPP1 in oral cancer and NSCLC samples further verified the colocalization of eFibro_CTHRC1 and Macro_SLPI cells (Fig. 5d and Extended Data Fig. 7d). In summary, the colocalization of eFibro_CTHRC1 and Macro_SLPI cells creates unique profibrotic ecotypes within the tumor region and suggests functional connections between these cell types.

Fig. 5: Colocalization of CTHRC1+ fibroblasts and SLPI+ macrophages across cancer types.

a, Spatial feature plots showing the spatial localization of eFibro_CTHRC1 and Macro_SLPI cells in two pancreatic adenocarcinoma ST datasets. b, Dot plot showing the correlation between the signature scores of eFibro_CTHRC1 (x axis) and Macro_SLPI (y axis) cells in ST data, with color indicating the proportion of fibroblasts in each spot. The correlation was calculated using the Pearson correlation coefficient. The error band denotes the 95% confidence interval, which was calculated based on the standard error using the normal distribution. The center measure of the smooth line corresponds to the predicted values from the linear regression model. c, Scatter plots showing the correlation between the signature score of eFibro_CTHRC1 cells and all monocytes and macrophages in tissue sections, with color representing the proportion of fibroblasts in each spot and error bands indicating the 95% confidence interval. d, Immunofluorescence images showing the localization of Macro_SLPI and Macro_SPP1 cell types (DAPI, CD68, SLPI and SPP1) and eFibro_CTHRC1 cell types (DAPI and CTHRC1) in HNSC and NSCLC samples; scale bars, 50 μm. e, Top 20 ligands inferred by NicheNet to regulate eFibro_CTHRC1 cells and SLPI+ macrophages, analyzed separately. Ligands inferred to regulate eFibro_CTHRC1 cells are ranked by Pearson correlation (left). The heat maps show the expression of these ligands across major cell types (middle) and the top 20 ligands inferred to regulate SLPI+ macrophages (right). f, Cartoon depicting the general distribution and function of eFibro_CTHRC1 and Macro_SLPI populations in tumors. Image created with BioRender.com, with permission.


We speculate that there may be a common regulatory mechanism or cytokine signaling pathway that activates the profibrotic ecotypes. To explore potential upstream signaling, we conducted NicheNet analyses, which indicated a tight connection between the activity of TGFβ1 and interleukin-1β (IL-1β) ligands and the eFibro_CTHRC1 phenotype (Fig. 5e)45. The existing literature suggests that TGFβ family ligands and inflammatory modulators such as IL-1β can activate CAFs by enhancing the activity of transcription factors such as SMAD, NF-κB and STAT41,46. Reassuringly, TGFβ1 and IL-1β could also stimulate the Macro_SLPI phenotype, indicating that these signaling molecules may directly activate CAFs and also promote fibrotic programs in Macro_SLPI cells, which could potentially work in conjunction with eFibro_CTHRC1 cells to remodel the ECM (Fig. 5e). In summary, these analyses suggest that the profibrotic ecotypes were possibly induced by TGFβ or IL-1β signaling (Fig. 5f).

RGCC + capillary endothelial cells are associated with vessel tip generation in tumors

Endothelial cells are not only involved in angiogenesis but also related to immune cell recruitment and semiprofessional antigen presentation47,48. We identified seven endothelial subtypes and annotated them as venous, capillary, lymphatic and vascular smooth muscle cells (VSMCs; Fig. 6a and Extended Data Fig. 8a,b). Consistently, VSMC_ACTA2 cells, with a high ECM signature score, simultaneously expressed endothelial marker genes and ACTA2 (Fig. 6b,c). RGCC+ capillary endothelial cells (capEndo_RGCC) exhibited the highest tip scores, whereas CXCR4+ capillary endothelial cells (capEndo_CXCR4) showed the highest antigen-presenting scores (Fig. 6d). Notably, capEndo_RGCC cells expressed markers associated with vessel tip generation, including APLN and ESM1, and the majority of these cells were found in tumor samples across various cancer types (Fig. 6e,f, Extended Data Fig. 8a and Supplementary Table 6)49. These findings collectively suggest that endothelial cells undergo dynamic reprogramming during cancer initiation and progression.

Fig. 6: Heterogeneity of endothelial cells.

a, UMAP visualization displaying the distribution of endothelial MetaCells, colored by cell type; venEndo, venous endothelial; lymEndo, lymphatic endothelial. b, Box plot depicting the ECM signature score for each endothelial cell type, calculated from 367 samples. The bottom of the box represents Q1, and the top of the box represents Q3. The height of the box represents the IQR, whereas the horizontal line inside the box indicates the median. The whiskers extend to the positions of Q1 – 1.5 × IQR and Q3 + 1.5 × IQR. c, Violin plot showing the expression of representative marker genes for each endothelial cell type. d, Scatter plot illustrating the antigen-presenting and tip signature scores of endothelial subsets. e, Box plot showing the proportions of distinct endothelial cell types in various sample categories, including normal tissue (green), precancerous tissue (orange), tumor tissue (purple) and metastatic tissue (blue), derived from 344 treatment-naive samples. Significance labels in the figure were assessed via Kruskal–Wallis tests to compare each subset distribution among four tissue types. Significance for pairwise source proportions within each cell type, assessed via two-tailed unpaired Wilcoxon tests, is reported in Supplementary Table 6. The open rectangle annotates the comparative scope, with BH correction for multiple testing. The bottom of the box represents Q1, and the top of the box represents Q3. The height of the box represents the IQR, whereas the horizontal line inside the box indicates the median. The whiskers extend to the positions of Q1 – 1.5 × IQR and Q3 + 1.5 × IQR. f, Heat map showing the proportion of different endothelial cell types in cancer types and healthy tissues. For rows, a bar plot to the left of each row illustrates the number of MetaCells (in log10 scale) and origin of cancer cells, color-coded for clarity. g, Genes encoding the top 20 ligands inferred to regulate RGCC+ endothelial cells according to NicheNet. 
Ligands are ranked by Pearson correlation (left), and a dot plot represents the expression percentage (dot size) and value (dot intensity) of the top 20 ligands.


Angiogenesis, which typically initiates in the capillaries, is essential for tumor growth. Identifying the potential ligands that drive the phenotypes of tip-like capEndo_RGCC cells is critical for preventing angiogenesis through therapeutic targeting of endothelial cells50. VEGFA and AGT were predicted to be the most potent inducers of capEndo_RGCC cells, consistent with previous studies showing that VEGFA is a crucial regulator of angiogenesis, and AGT is required for hypoxia-induced vasculogenesis (Fig. 6g)49,51. Interestingly, both AGT and VEGFA are produced by fibroblasts, suggesting that stromal cells may influence one another in regulating angiogenesis within the TME.

Profibrotic ecotypes are associated with higher mortality risk from pan-cancer analyses

Our analyses indicate that cell types that constitute profibrotic ecotypes, including Macro_SLPI and eFibro_CTHRC1 cells, are associated with worse clinical outcomes in various cancer types (Figs. 2k and 3h). To systematically compare the effects of profibrotic ecotypes to other potential risk factors, such as T cell exhaustion, we conducted a comprehensive prognosis analysis of all 56 cell types across 23 cancer types from TCGA. The pan-cancer survival association dichotomized all cell types into favorable and adverse states, highlighting their functional and clinical heterogeneities. T cell/NK cell subsets with higher cytotoxic scores (two NK cell subsets and CD8Temra_GZMB cells) were associated with a decreased risk of death, whereas memory T cells (CD8Tm_TNF) were associated with shorter survival times (Fig. 7a). Interestingly, cell types that could constitute profibrotic-associated ecotypes, including Macro_SLPI, Macro_SPP1, eFibro_CTHRC1 and VSMC_ACTA2 cells, were all significantly associated with an increased risk of mortality at the pan-cancer level (Fig. 7b,c and Extended Data Fig. 9a). These results collectively suggest that profibrotic ecotypes have conserved protumor functions that reduce the survival time of individuals with cancer.

Fig. 7: Identification of coarse stromal subtypes in solid tumors.

a, Survival associations of per-cell-type signature genes of cytotoxic lymphocyte subsets. Top, bar plot illustrating the pan-cancer survival association across 23 cancer types, aggregated across malignancies using Stouffer’s method. Columns are ordered by combined z score. Bottom, cancer-specific survival association determined by the Cox proportional hazards model; LUSC, lung squamous cell carcinoma; COAD, colon adenocarcinoma; READ, rectum adenocarcinoma; LUAD, lung adenocarcinoma. b, Survival associations of per-cell-type signature genes in myeloid subsets; SARC, sarcoma. c, Survival associations of per-cell-type signature genes in fibroblast subsets. d, Heat map of 8,743 individuals from TCGA classified into four distinct TME subtypes based on clustering of the signature of all cell types; Bn, naive B cells; Bm, memory B cells; Bgc, germinal center B cells. e, Box plot illustrating the cytotoxic signature scores of ecotypes, including lymphocytes, calculated from 8,734 samples from the TCGA project. Statistical significance was evaluated using two-tailed unpaired Wilcoxon tests combined with a permutation test (10,000 resamplings) to compare the signature score distribution across pairwise ecotypes; horizontal connectors denote compared groups, with the corresponding P value indicated above the horizontal line. The bottom of the box represents Q1, whereas the top represents Q3. The height of the box indicates the IQR, and the horizontal line inside the box represents the median. The whiskers extend to the positions of Q1 – 1.5 × IQR and Q3 + 1.5 × IQR. f, Box plot comparing tumor purity across different tumor ecosystem groups, calculated from 8,734 samples from the TCGA project. The bottom of the box represents Q1, whereas the top represents Q3. The height of the box indicates the IQR, and the horizontal line inside the box represents the median. The whiskers extend to the positions of Q1 – 1.5 × IQR and Q3 + 1.5 × IQR. 
g, Box plot comparing CD8+ T cell infiltration of individuals stratified by tumor ecosystem groups. CD8+ T cell infiltration was estimated using QUANTISEQ from 8,734 samples from the TCGA project. The bottom of the box represents Q1, whereas the top represents Q3. The height of the box indicates the IQR, and the horizontal line inside the box represents the median. The whiskers extend to the positions of Q1 – 1.5 × IQR and Q3 + 1.5 × IQR. h, Bar graphs depicting the segregation of each carcinoma into the five tumor ecosystem groups. i, Kaplan–Meier curves illustrate the overall survival analysis of five tumor ecosystem groups across 459 individuals with SKCM and 1,091 individuals with BRCA; +, censored observations. Statistical significance was assessed via the log-rank test, with P values of 2.54 × 10−5 for SKCM and 0.0229 for BRCA.


TabulaTIME enables pan-cancer patient stratification using conserved tumor ecotypes

Tumor ecosystems are highly heterogeneous and are composed of diverse cell types. Although risk association analyses can screen individual pro- or antitumor cell types, they cannot reflect the cooperation of different cell types in forming patient-specific tumor ecosystems. TabulaTIME defined 56 cell subtypes across different cancer types using scRNA-seq. With this high-resolution reference, we then investigated whether we could stratify individuals into different tumor ecosystems based on deconvoluted cell subtypes within the TCGA cohort. The cell subtypes were evaluated using gene set variation analysis (GSVA) and clustered into five ecotypes52, including active stromal ecotype (E1), naive immune ecotype (E2), active immune ecotype (E3), profibrotic ecotype (E4) and proliferating ecotype (E5; Fig. 7d,e). Based on the deconvoluted ecotypes, individuals could be classified into five major TME ecosystems, including immune desert with high proliferation (DHP), immune desert with low proliferation (DLP), active immune with high stromal presence (AIHS), active immune with low stromal presence (AILS) and naive immune with high stromal presence (NIHS; Fig. 7d,e and Extended Data Fig. 9b,c). Consistently, participants in the immune desert group (DHP and DLP) showed high levels of tumor purity (Fig. 7f), whereas participants in the active immune group (AIHS and AILS) displayed higher estimated infiltration of CD8+ T cells53,54 (Fig. 7g and Extended Data Fig. 9d).

Although the ecosystem features were in general conserved among different cancer types, the distribution of ecosystem groups differed markedly (Fig. 7h). Among kidney cancers, kidney renal papillary cell carcinoma (KIRP) and kidney chromophobe (KICH) were highly enriched for DLP and DHP, whereas kidney renal clear cell carcinoma (KIRC) was primarily composed of AIHS and NIHS (Fig. 7h). We then investigated whether this stratification was conserved between different cohorts. For breast cancer, participant groups were generally comparable between the TCGA-BRCA classification and the METABRIC classification, even at the subtype level (Extended Data Fig. 9e). Additionally, we benchmarked TabulaTIME against previous similar studies6,55. The TabulaTIME cell-type signatures demonstrated stronger agreement with scRNA-seq data at the cluster level, and participant stratification by TabulaTIME was more consistent with the deconvolution results (Extended Data Fig. 9f–h). Finally, we evaluated the clinical effects of different tumor ecosystem groups. Participants in the DHP and NIHS groups showed an increased risk of mortality in various cancer types, corresponding to cold and unresponsive TMEs, respectively (Fig. 7i). In summary, the cell-type signatures derived from TabulaTIME could facilitate the robust stratification of individuals based on their tumor ecosystem status.

TabulaTIME serves as a reference map for pan-cancer single-cell annotation

Cell-type annotation is vital for interpreting functional phenotypes of cells when analyzing scRNA-seq datasets. A comprehensive and fully annotated dataset is essential for reference-based cell-type annotation methods. We next tested whether TabulaTIME could serve as a reference map for pan-cancer single-cell annotation (Fig. 8a). Using SELINA56, a published deep-learning method, we benchmarked cell-type annotation performance with TabulaTIME as the reference against performance with individual tumor scRNA-seq datasets as the reference. We observed high consistency between the manually curated annotations from the original papers and the cell types predicted using SELINA and TabulaTIME (Fig. 8b and Extended Data Fig. 10). Furthermore, the TabulaTIME reference showed meaningful improvement over using a single tumor scRNA-seq dataset from the same cancer type as the reference (Fig. 8c). This indicates that an integrative tumor reference that includes diverse immune and stromal cell types could markedly increase cell-type annotation performance compared to a single dataset that may lack rare cell types.

Fig. 8: Application of pan-cancer single-cell landscape.

a, Schematic overview of the workflow using pan-cancer single-cell landscapes to analyze the ecotypes of bulk RNA-seq datasets and automatic annotation for scRNA-seq datasets. b, UMAPs showing cell-type annotation by original paper (left) and SELINA prediction (right). c, Bar plots comparing the accuracy of reference-pretrained SELINA-predicted cell types against original annotations in the BRCA_GSE176078 and NSCLC_GSE146100 datasets. SELINA was trained on reference datasets from TabulaTIME, BRCA_EMTAB8107 and NSCLC_GSE131907.


Discussion

In this study, we collected a large amount of tumor scRNA-seq data from 735 donors, including nearly 4 million cells spanning 36 cancer types. Additionally, we combined ST data from 62 individuals across six cancer types to provide spatial information and characterize tissue structures. Overall, we constructed a comprehensive single-cell TME landscape that can be used to characterize the spatial and temporal dynamics of tumor heterogeneity and assess the impact of different cell types and tumor ecotypes on clinical outcomes.

Our analyses revealed that SLPI+ macrophages exhibited a profibrotic phenotype and colocalized with CTHRC1+ fibroblasts across cancer types. Considering that SLPI+ macrophages exhibit strong profibrotic features and that CAFs can originate from both macrophages and mesenchymal stem cells18,33,57,58, we speculate that CTHRC1+ fibroblasts may also derive from SLPI+ macrophages. Further studies are needed to reveal the conditions that stimulate profibrotic macrophage formation or whether they represent an existing macrophage phenotype in healthy individuals.

An important merit of our study is the publicly available tumor landscape at MetaCell resolution, which includes detailed cell-type annotations, their distribution across various sources and cancer types, and their functions and effects on survival. This resource can help researchers infer the most likely phenotypes for query cells. Similarly, the TabulaTIME landscape can support the reconstruction of tumor ecotypes for bulk tissue transcriptomes using signature genes. Overall, we believe that our data can serve as a rich resource that has the potential to contribute to future cancer subtype identification and the optimization of individual cancer therapy.

Methods

Ethics statement

This study adhered to all relevant ethical regulations. Primary tissue sections used in immunofluorescence studies were obtained with written informed consent from participants and were approved by the local medical ethics committees of the hospitals where the samples were collected.

scRNA-seq data collection

We collected published cancer-associated scRNA-seq datasets from 746 donors across 36 cancer types. These datasets were sourced from the Gene Expression Omnibus (GEO) database (https://www.ncbi.nlm.nih.gov/geo/) and from our previous work, the TISCH database (http://tisch.comp-genomics.org/home/). Additionally, we incorporated scRNA-seq datasets derived from healthy donors, including three peripheral blood mononuclear cell datasets and six datasets from normal tissues. These datasets served as a baseline for the exploration of tumor-specific cellular features. To minimize platform-specific biases, we restricted our analysis to 10x Genomics scRNA-seq datasets. For each dataset, we downloaded the raw count or TPM expression matrix, along with relevant sample information. Neither sex nor gender was considered in the study design because the primary focus of this study was unrelated to sex or gender. For samples collected in this study, sex was self-reported. Given that this study focuses on the TME, we excluded samples with over 90% malignant cells from each publicly available solid tumor-associated scRNA-seq dataset. For the remaining datasets, no statistical methods were used to predetermine sample sizes; however, our sample sizes align with those reported in previous publications after accounting for the exclusion of samples with over 90% malignant cells. To ensure consistent gene symbol mapping across different genome assemblies, we converted the gene symbols to GRCh38.p13.

Data quality control and preprocessing

We applied a standardized analysis workflow based on MAESTRO v1.1.0 for processing all collected datasets, including quality control, batch effect removal and cell clustering (Extended Data Fig. 1a,b). As a result, 26,975 cells were excluded, leaving 4,456,392 cells for subsequent analysis (Extended Data Fig. 1c). Then, for each dataset, MAESTRO selected the top 2,000 highly variable genes for downstream analyses using TPM matrices. We further used principal component analysis for dimension reduction, the Louvain algorithm for cluster identification and UMAP for visualization, with parameters optimized by cell count (Supplementary Table 2).

To systematically evaluate the batch effects for each dataset, we used an entropy-based metric to quantify data mixing across batches. Low entropy values indicate that the most similar cells are from the same individual, suggesting a potential batch effect. However, for datasets that mainly contain malignant cells, low entropy could instead arise from the heterogeneity of malignant cell expression between different tumors, in which case the batch effect does not need to be removed. Malignant cells were identified by integrating three approaches: original study annotations, marker expression profiles and copy number variation (Extended Data Fig. 1a,d). Except for datasets consisting mainly of malignant cells and datasets with a median entropy of more than 0.7, batch effects in the remaining datasets were corrected using CCA (Extended Data Fig. 1e,f). All cells were then annotated into 17 common cell types based on the expression of marker genes (Extended Data Fig. 1g,h). The marker genes of each cell type were collected from published resources and were curated manually.
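As an illustration of how an entropy-based mixing metric can work, the following Python sketch computes, for each cell, the Shannon entropy of batch labels among its k nearest neighbors. The exact formulation used in this study is not given in the text, so the neighborhood size, the normalization by the logarithm of the number of batches and the function names are assumptions; only the 0.7 median cutoff comes from the text.

```python
import math
from collections import Counter
from statistics import median

def neighborhood_entropy(points, batches, k=5):
    """Normalized Shannon entropy of batch labels among each cell's k nearest neighbors."""
    n_batches = len(set(batches))
    if n_batches < 2:
        raise ValueError("need at least two batches")
    scores = []
    for i, p in enumerate(points):
        # Rank all other cells by Euclidean distance to cell i.
        order = sorted((j for j in range(len(points)) if j != i),
                       key=lambda j: math.dist(p, points[j]))
        counts = Counter(batches[j] for j in order[:k])
        total = sum(counts.values())
        h = -sum((c / total) * math.log(c / total) for c in counts.values())
        scores.append(h / math.log(n_batches))  # assumed scaling to the range 0 to 1
    return scores

def needs_batch_correction(points, batches, k=5, cutoff=0.7):
    """Flag a dataset whose median neighborhood entropy is not above the cutoff."""
    return median(neighborhood_entropy(points, batches, k)) <= cutoff

# Toy embeddings (hypothetical): one dataset separated by batch, one mixed.
separated = [(0, 0), (0, 1), (1, 0), (10, 0), (10, 1), (11, 0)]
separated_batches = ["A", "A", "A", "B", "B", "B"]
mixed = [(0, 0), (0, 1), (1, 1), (1, 0)]
mixed_batches = ["A", "B", "A", "B"]
```

Under these assumptions, the separated toy dataset would be flagged for correction, whereas the mixed one would not.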

MetaCell identification and batch correction

To address challenges posed by the large cell count, we executed a rigorous three-step data integration process (Extended Data Fig. 2a). First, within each dataset, the expression $E_{i,j}$ of gene $i$ in cell $j$ was quantified as $E_{i,j} = \log_2(\mathrm{TPM}_{i,j}/10 + 1)$, mitigating the effect of gene-specific dropout rate variability. Second, to mitigate technical noise and reduce computational resource requirements, we partitioned single cells into small groups (called MetaCells hereafter) based on transcriptional similarity, as determined by their proximity within the UMAP representation. Our strategy resembles Tanay et al.’s MetaCells28 and Zheng et al.’s metaclusters8 but uniquely assesses similarity within each sample’s cell type and incorporates cell origin.

To determine the optimal number of cells (k) for each MetaCell, we evaluated MetaCell performance in terms of gene coverage and within-MetaCell variation (Extended Data Fig. 2a). To retain a greater number of cells, we allowed for a slight fluctuation in the cell count within MetaCells, rather than strictly adhering to fixed numbers like 10 or 30. Within-MetaCell variation, quantified via the Gini Index, showed increasing average variation and decreasing total variation with cell count. The Elbow method identified 30-cell MetaCells as optimal across datasets (Extended Data Fig. 2b,c). Next, using LISI59, measuring dataset diversity in neighborhoods, we found that these 30-cell MetaCells achieved higher LISI and >50% gene coverage (Extended Data Fig. 2d). For each scRNA-seq dataset, we then systematically partitioned the cells of each cell type within individual samples into MetaCells, with each MetaCell comprising approximately 30 cells. The average log TPM-transformed gene expression of all cells within each MetaCell was used to represent the MetaCell’s expression, and the original gene-by-cell expression matrix was converted to the gene-by-MetaCell expression matrix.
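The partitioning and averaging steps can be sketched in Python as follows. The text does not specify the grouping algorithm, so the greedy nearest-neighbor grouping in UMAP space used here is an assumption, as are the record layout, the function names and the handling of leftover cells (merged into the last group, or kept as one small group when a stratum has fewer than `size` cells).

```python
import math
from collections import defaultdict

def log_tpm(tpm):
    """E = log2(TPM / 10 + 1), the transform stated in the text."""
    return math.log2(tpm / 10 + 1)

def build_metacells(cells, size=30):
    """Group cells per (sample, cell type) into MetaCells of about `size` cells.

    cells: list of dicts with keys sample, cell_type, umap (x, y) and
    expr (list of log-TPM values, one per gene).
    """
    strata = defaultdict(list)
    for cell in cells:
        strata[(cell["sample"], cell["cell_type"])].append(cell)

    metacells = []
    for (sample, cell_type), members in strata.items():
        remaining = list(members)
        groups = []
        while len(remaining) >= size:
            seed = remaining[0]
            # Take the seed and its nearest unassigned neighbors in UMAP space.
            remaining.sort(key=lambda c: math.dist(seed["umap"], c["umap"]))
            groups.append(remaining[:size])
            remaining = remaining[size:]
        if remaining:  # assumed handling of leftovers: allow a slightly larger group
            if groups:
                groups[-1].extend(remaining)
            else:
                groups.append(remaining)
        for group in groups:
            n_genes = len(group[0]["expr"])
            mean_expr = [sum(c["expr"][g] for c in group) / len(group)
                         for g in range(n_genes)]
            metacells.append({"sample": sample, "cell_type": cell_type,
                              "n_cells": len(group), "expr": mean_expr})
    return metacells

# Toy input (hypothetical): 65 cells of one cell type in one sample, one gene.
toy_cells = [{"sample": "S1", "cell_type": "fibroblast", "umap": (float(i), 0.0),
              "expr": [log_tpm(10.0)]} for i in range(65)]
toy_metacells = build_metacells(toy_cells, size=30)
```

Allowing the last group to absorb leftover cells is one way to realize the slight fluctuation in MetaCell size described above without discarding cells.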

Third, we evaluated the integration performance and biological signal preservation of MetaCell + CCA using ten additional randomly selected datasets. We compared LISI and Entropy scores across four strategies and found that CCA-integrated MetaCells outperformed other methods, demonstrating enhanced integration performance and reduced batch effects (Extended Data Fig. 2e). Additionally, we assessed biological signal preservation by calculating the adjusted Rand index against original cell-type labels and using ASW to evaluate within-cluster homogeneity. MetaCell + CCA achieved the highest adjusted Rand index and ASW scores, confirming that cell-type structures were preserved after integration and batch correction. These results indicate that MetaCell + CCA effectively removed batch effects while retaining cell-type distinctions compared to original data (Extended Data Fig. 2f).

Lineage separate integrated analysis

To gain more detailed insights into the MetaCell heterogeneity of specific cell types, we divided all cells into seven lineages for downstream analysis, including cytotoxic lymphocytes, conventional and regulatory lymphocytes, B lymphocytes, myeloid cells, fibroblasts, endothelial cells and epithelial cells (Fig. 1a). For each lineage, we extracted the MetaCell expression profiles and reprocessed them using the MAESTRO pipeline, which includes quality control, batch effect correction, cell clustering and annotation (Fig. 1c and Extended Data Fig. 2h).

Determining the optimal number of cell types

To determine the most effective clustering resolution for each cell lineage, we used both ASW and Clustree metrics to identify clusters that accurately represent biological diversity without excessive clustering. First, to address the curse of dimensionality, we calculated the ASW based on Euclidean distances in the top 30 principal components of the principal component analysis space. A higher ASW indicates a more pronounced separation among distinct clusters. For each lineage, we performed clustering at resolutions ranging from 0.1 to 1.5, calculated the ASW at each resolution and designated the resolution with the highest ASW as the optimal one (Extended Data Fig. 4a).
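The ASW-based selection can be sketched as follows. This is a plain-Python illustration of the standard silhouette definition on Euclidean distances; the function names and the toy coordinates are hypothetical, and in practice the points would be MetaCell coordinates in the top principal components.

```python
import math

def average_silhouette_width(points, labels):
    """Mean silhouette width over all points, using Euclidean distances."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette width needs at least two clusters")
    widths = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            widths.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(points[i], points[j]) for j in other) / len(other)
                for other_label, other in clusters.items() if other_label != label)
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(widths) / len(widths)

def pick_resolution(points, labels_by_resolution):
    """Return the resolution whose clustering has the highest ASW."""
    return max(labels_by_resolution,
               key=lambda r: average_silhouette_width(points, labels_by_resolution[r]))

# Toy example (hypothetical coordinates and cluster labels).
toy_points = [(0, 0), (0, 1), (10, 0), (10, 1)]
labels_by_resolution = {0.1: [0, 1, 0, 1], 0.5: [0, 0, 1, 1]}
best_resolution = pick_resolution(toy_points, labels_by_resolution)
```

The ASW optimum serves only as a starting point here, since the final resolution was chosen together with marker gene inspection in Clustree.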

Clustree was used to visually inspect marker gene expression. If marker gene expression was intertwined with markers of other cell types, this suggested that relying solely on ASW might not be sufficient to effectively distinguish cell subpopulations. For example, in myeloid cells, the ASW suggested a resolution of 0.1. However, CLEC9A+ dendritic cells and C1QC+ macrophages were separated only at a resolution of 0.5, leading us to identify 0.5 as the optimal resolution (Extended Data Fig. 4b). Thus, we combined ASW with prominent marker gene expression observed at varied resolutions using Clustree to determine the optimal resolution.

Based on the clustering results, we annotated cell clusters using marker gene expression. We then used ROGUE to evaluate cellular heterogeneity (Extended Data Fig. 4c). ROGUE scores range from 0 to 1, with 1 representing a completely pure subtype. Therefore, cell types with median ROGUE scores below 0.9 were considered to exhibit significant heterogeneity, prompting further investigation to reannotate them into more homogeneous subtypes exhibiting higher median ROGUE values.

Source preference analysis

To assess dynamic changes in the TME during tumor progression, we analyzed the proportions of sub-cell types across distinct sources within each lineage individually. Considering that various treatments may lead to dynamic changes in the TME, our analysis was limited to treatment-naive datasets. The proportion of each subtype in a sample was determined by dividing the number of MetaCells of that subtype by the total number of MetaCells in the lineage. Consequently, in any given lineage within a single sample, the sum of all subtype proportions equals 1. To investigate the cell-type distribution of source preference, we used the Kruskal–Wallis test to assess each cell type across all tissue types. Additionally, each pairwise source proportion comparison within each cell type was assessed via two-tailed unpaired Wilcoxon tests (Figs. 2c, 3d and 6e, Extended Data Fig. 4e and Supplementary Table 6).
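The proportion calculation can be expressed as a short Python sketch, assuming one record per MetaCell with its sample, lineage and subtype; the record layout, the sample names and the subtype label "other_fibroblast" are hypothetical.

```python
from collections import Counter, defaultdict

def subtype_proportions(metacells):
    """Proportion of each subtype among the MetaCells of a lineage, per sample.

    metacells: iterable of (sample, lineage, subtype) tuples.
    Returns {(sample, lineage): {subtype: proportion}}.
    """
    counts = defaultdict(Counter)
    for sample, lineage, subtype in metacells:
        counts[(sample, lineage)][subtype] += 1
    return {key: {subtype: n / sum(counter.values()) for subtype, n in counter.items()}
            for key, counter in counts.items()}

# Toy records (hypothetical labels and counts).
records = ([("S1", "fibroblast", "eFibro_CTHRC1")] * 3
           + [("S1", "fibroblast", "other_fibroblast")]
           + [("S2", "fibroblast", "eFibro_CTHRC1")])
proportions = subtype_proportions(records)
```

The per-sample proportions of one subtype, grouped by tissue source, would then be the input to the Kruskal–Wallis and Wilcoxon tests.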

Furthermore, to validate subtype source preferences, we calculated odds ratios by constructing 2 × 2 contingency tables for each subtype i and source j within lineages. Tables included (1) cells of subtype i in source j, (2) cells of subtype i in other sources, (3) cells of non-i subtypes in source j and (4) cells of non-i subtypes in other sources. Fisher’s exact test was applied, and P values were adjusted for multiple testing using the BH method.
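For illustration, the odds ratio and a two-sided Fisher's exact test for one such 2 × 2 table can be computed as in the sketch below. It uses the sample odds ratio and the common definition of the two-sided P value (summing all tables no more probable than the observed one); whether the study used these exact conventions is an assumption, and the counts are toy values.

```python
from math import comb

def odds_ratio(a, b, c, d):
    """Sample odds ratio (a*d)/(b*c) for the table [[a, b], [c, d]]."""
    if b * c == 0:
        return float("inf") if a * d > 0 else float("nan")
    return (a * d) / (b * c)

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P value for the table [[a, b], [c, d]].

    a: subtype i in source j        b: subtype i in other sources
    c: other subtypes in source j   d: other subtypes in other sources
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(n - row1, col1 - x) / denom

    lo, hi = max(0, col1 - (n - row1)), min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables that are no more likely than the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(p, 1.0)

# Toy table (hypothetical counts).
example_or = odds_ratio(3, 1, 1, 3)
example_p = fisher_exact_two_sided(3, 1, 1, 3)
```

The resulting P values across all subtype and source pairs would then be adjusted with the BH procedure.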

Distribution of cell types across cancer types

We investigated the distribution of specific cell types across diverse cancer types in treatment-naive datasets. For each cancer type or tissue, we calculated the subtype proportion by dividing the number of MetaCells of that subtype by the total number of MetaCells within the corresponding lineage. Consequently, the sum of all subtype proportions within a specific lineage for a particular cancer type equals 1.

Scoring cell types using function-associated signature genes

Here, we used the signature gene lists obtained from previously published studies to describe the functional diversity of cell types (Supplementary Table 5). The AddModuleScore function in Seurat was applied to calculate the score for individual MetaCells. For T cells/NK cells, cytotoxic, exhausted and regulatory MetaCell scores were computed to assess functional states and validate subtypes (Fig. 2b and Extended Data Fig. 3a). Monocyte/macrophage MetaCell functions were evaluated via M1/M2 polarization, angiogenesis, phagocytosis and pro-/anti-inflammatory activity (Fig. 2e). Additionally, fibroblast subsets were analyzed for their tumorigenic roles through ECM remodeling and immune regulation (Fig. 3c).

Metabolic and cancer hallmark pathways and gene set enrichment analysis

Metabolic pathway activities were estimated using GSVA. A total of 85 metabolic pathways were collected from KEGG60. Differentially activated pathways of each subtype were identified by running the Wilcoxon rank sum test against the other cell types within the same lineage.

To elucidate the functional characteristics of the subtypes, we used gene sets associated with cancer hallmarks and KEGG pathways from MSigDB v6.1 (ref. 61). Enrichment analysis was implemented using the clusterProfiler package62. Pathways with a log (fold change) of greater than 0.05 and an adjusted P value of less than 0.01 were deemed significantly upregulated.

Characterization of intratumoral transcriptional heterogeneity

To capture transcriptional heterogeneity, we performed NMF within each sample. Negative values in each centered expression matrix were set to zero. To minimize the influence of cell lineage-specific expression patterns and to enhance sensitivity and specificity, we conducted NMF analysis separately for each cell lineage. For each sample, we performed NMF (k = 10) for each cell lineage across datasets separately and summarized each NMF program using the top 50 genes based on NMF coefficients. To avoid redundancies, we removed NMF programs that overlapped more than 20% with others within the tumor.

Subsequently, we clustered the NMF programs within each cell lineage based on Jaccard similarity. Given the substantial number of NMF programs, we restricted the clustering to NMF programs that overlapped by at least 20% with any other program observed within a cell lineage. Clustering was performed using hierarchical clustering, and MPs were defined through manual inspection. Each MP included at least five NMF programs. For each MP, we selected the top 50 genes based on their frequency of occurrence. Furthermore, we removed MPs that originated from a single study or exhibited strong enrichment of ribosomal protein genes or mitochondria-encoded genes. Finally, we analyzed each MP’s function using a hypergeometric test to assess enrichment of its signature genes in MSigDB cancer hallmark and KEGG pathways. The MPs were then further grouped based on functional similarity (Fig. 2h, Extended Data Fig. 5b and Supplementary Table 7).
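The similarity and filtering steps can be illustrated with the sketch below. It assumes each NMF program is represented by its set of top genes and that the 20% overlap is measured as the number of shared genes divided by the size of the smaller program; this overlap definition, the function names and the placeholder gene identifiers are assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity between two gene sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def overlap_fraction(a, b):
    """Shared genes as a fraction of the smaller program (assumed definition)."""
    a, b = set(a), set(b)
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0

def recurrent_programs(programs, min_overlap=0.2):
    """Keep programs that overlap by at least min_overlap with any other program."""
    return {name: genes for name, genes in programs.items()
            if any(overlap_fraction(genes, other) >= min_overlap
                   for other_name, other in programs.items() if other_name != name)}

def similarity_matrix(programs):
    """Pairwise Jaccard similarities, with rows and columns in sorted name order."""
    names = sorted(programs)
    return names, [[jaccard(programs[r], programs[c]) for c in names] for r in names]

# Toy programs with placeholder gene identifiers (not real signatures).
toy_programs = {
    "p1": ["G1", "G2", "G3", "G4", "G5"],
    "p2": ["G1", "G2", "G3", "G6", "G7"],
    "p3": ["G8", "G9", "G10", "G11", "G12"],
}
kept = recurrent_programs(toy_programs)
names, matrix = similarity_matrix(kept)
```

The resulting similarity matrix is the kind of input that hierarchical clustering would take before MPs are defined by manual inspection.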

Survival analyses

To assess the prognostic significance of cell types, we obtained clinical and expression data from TCGA (Supplementary Table 8). Per cell type, the top 50 highly expressed genes were selected to calculate cell-type signature scores using GSVA for each individual. These scores were adjusted to sum to 1 per lineage to mitigate multicollinearity (Extended Data Fig. 4h,i). Survival differences between high- and low-scoring groups, stratified by median signature scores, were assessed via Kaplan–Meier analysis (Figs. 2k and 3h).

Additionally, to depict the clinical relevance of cell types within and across cancers, we conducted the following analyses. For each cell type, univariable Cox regression linked the adjusted GSVA score to overall survival per tumor type. A z score higher than 0 indicated increased mortality risk. To account for multiple tests, we adjusted the P values using the BH method. Furthermore, to assess the clinical relevance at the pan-cancer level, we used a meta-z-score, derived from the combined z scores across tumor types using Stouffer’s method63 (Fig. 7a and Extended Data Fig. 9a).
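A minimal sketch of the two statistical steps named above, assuming the unweighted form of Stouffer's method, Z = (z_1 + ... + z_k) / sqrt(k), because the text does not state whether weights were applied. The function names and input values are hypothetical.

```python
import math
from statistics import NormalDist

def stouffer_z(z_scores):
    """Unweighted Stouffer combination of per-cancer-type z scores (assumed form)."""
    return sum(z_scores) / math.sqrt(len(z_scores))

def two_sided_p(z):
    """Two-sided P value of a z score under the standard normal distribution."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy values (hypothetical): Cox z scores of one cell type in four cancer types.
meta_z = stouffer_z([1.0, 1.0, 1.0, 1.0])
adjusted = bh_adjust([0.01, 0.04, 0.03, 0.005])
```

With this sign convention, a positive meta-z-score indicates increased mortality risk at the pan-cancer level, matching the per-cancer interpretation in the text.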

mIHC

Human tissue specimens were provided by Shanghai Pulmonary Hospital (for NSCLC sections), West China Hospital of Stomatology at Sichuan University (for HNSC sections) and Shanghai East Hospital (for CESC sections) under the approval of local medical ethics committees (Supplementary Table 9). The mIHC analysis included seven individuals: three males (NSCLC-1, NSCLC-2 and HNSCC-2) and four females (HNSCC-1, HNSCC-3, CESC-1 and CESC-3). For samples collected in this study, sex was self-reported. Sex was not considered in the study design, and sex-based analyses were not conducted. Tissues were fixed in formalin, embedded in paraffin, sectioned to 4 μm and positioned onto adhesion microscope slides following routine methods. Automated staining was performed with a Leica BOND-MAX autostainer (Leica Microsystems). Sections were deparaffinized and pretreated with Epitope Retrieval Solution 2 (AR9640-CN, Leica Biosystems) at 100 °C for 20 min. Peroxidase blocking was performed for 10 min using Bond Polymer (DS9800-CN, Leica) after rinsing in Tris-buffered saline with Tween-20 (TBST) buffer. Slides were incubated with primary antibody at room temperature (RT) for 1 h. Primary antibodies included rabbit anti-CTHRC1 (1:400; Abcam, ab85739), rabbit anti-SLPI (1:500; Thermo Fisher, PA5-82990), rabbit anti-osteopontin (SPP1; 1:1,500; Abcam, ab214050), rabbit anti-CD68 (1:400; Biolynx, BX50031) and rabbit anti-pan-cytokeratin (1:300; Biolynx, BX50143). All slides were stained with secondary reagents at RT for 10 min and tyramide signal amplification reagents at RT for 10 min. Nuclei were stained with DAPI (Sigma-Aldrich) after all human antigens had been labeled. Stained slides were scanned at ×20 magnification using a Pannoramic MIDI scanner (3DHISTECH), and images were analyzed using Halo software (Indica Labs).

NicheNet analysis

To infer potential ligands that influence transcriptomic changes or phenotypic shifts in target cell types, we used NicheNet for upstream regulatory factor identification. Our analysis used the top 100 upregulated genes in the cell types of interest as input, with all genes expressed in relevant cell lineages serving as the background gene set. Subsequently, we used the predict_ligand_activities function from the R package NicheNet to predict and rank these potential ligands (Fig. 5e).

Cell–cell interaction analysis

To quantify the probability of cell–cell communication, we used CellChat64 to predict potential interactions based on the expression of ligand–receptor pairs across cell types. Receptor–ligand interactions between cell types were identified by the specific expression of a receptor by one cell type and a ligand by another cell type. Significance of specific ligand–receptor interactions between two cell subsets was calculated by a permutation test.
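The logic of a label-permutation test for a single ligand–receptor pair can be sketched as follows. This is a deliberately simplified illustration and not CellChat's communication-probability model: the interaction score used here (mean ligand expression in the sender group multiplied by mean receptor expression in the receiver group) is an assumption, and the names `lr_score` and `permutation_p_value` are hypothetical.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def lr_score(ligand, receptor, labels, sender, receiver):
    """Simplified interaction score (assumption, not CellChat's model):
    mean ligand expression in sender cells times mean receptor
    expression in receiver cells."""
    lig = [x for x, g in zip(ligand, labels) if g == sender]
    rec = [x for x, g in zip(receptor, labels) if g == receiver]
    return _mean(lig) * _mean(rec)


def permutation_p_value(ligand, receptor, labels, sender, receiver,
                        n_perm=1000, seed=0):
    """Shuffle cell-type labels to build a null distribution of scores.

    Returns the observed score and a one-sided empirical P value,
    using the (hits + 1) / (n_perm + 1) convention.
    """
    rng = random.Random(seed)
    observed = lr_score(ligand, receptor, labels, sender, receiver)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        # Shuffling keeps group sizes fixed but breaks the link
        # between expression values and cell-type identity.
        rng.shuffle(shuffled)
        if lr_score(ligand, receptor, shuffled, sender, receiver) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)
```

Shuffling labels rather than expression values preserves each gene's overall distribution, so a small P value reflects cell-type-specific expression of the pair rather than high expression overall.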

ST analysis

From the GEO database, we collected ST data from 62 individuals of 6 cancer types (Supplementary Table 10). Because sex information was missing from many data entries and sex was not considered in the study design, sex-based analyses were not conducted. To ensure data quality, we conducted several standard statistical analyses using the Seurat package (Extended Data Fig. 6a).

In 10x Visium ST data, spots may contain multiple cells, complicating cell-type assignment. We applied STRIDE, a topic-model-based tool trained on single-cell data, to deconvolute cell-type proportions per spot. For ST data with matched scRNA-seq datasets, we used the matched annotated transcriptomes for spot decomposition. For unmatched ST samples, we selected a compatible scRNA-seq dataset from TISCH, prioritizing shared cancer/cell types with the ST data as reported in the original study. For malignant cell identification in particular, we combined information provided by the original studies, STRIDE decomposition results and CopyKat predictions based on copy number variation (Extended Data Fig. 6c). Finally, the ST data were annotated at the major lineage level (Extended Data Fig. 6d).

Spatial localization analysis

To map cell-subtype composition within the ST dataset, we leveraged the top 50 marker genes from scRNA-seq datasets and applied the AddModuleScore function in Seurat to calculate subtype-specific enrichment scores.

Colocalization analysis

Because fibroblasts and macrophages were unevenly distributed, we assessed the correlation between CTHRC1+ fibroblasts and SPP1+ macrophage signature scores only in fibroblast-containing spots. A strong correlation across all cohorts suggests that these two cell types are closely associated within the tissue.
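The restriction to fibroblast-containing spots can be sketched as follows. This is an illustration under stated assumptions: the text does not specify the correlation coefficient, so Pearson's r is used here, and a spot is treated as fibroblast-containing when its deconvolved fibroblast fraction exceeds a hypothetical `min_fraction` threshold.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n < 2 or n != len(y):
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def colocalization(fib_fraction, fib_score, mac_score, min_fraction=0.0):
    """Correlate CTHRC1+ fibroblast and SPP1+ macrophage signature
    scores using only spots that contain fibroblasts.

    Assumption: 'fibroblast-containing' means a deconvolved
    fibroblast fraction strictly greater than min_fraction.
    """
    keep = [i for i, f in enumerate(fib_fraction) if f > min_fraction]
    return pearson([fib_score[i] for i in keep],
                   [mac_score[i] for i in keep])
```

Filtering before correlating prevents spots without fibroblasts, where the fibroblast signature score carries little information, from dominating the estimate.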

Distance measurement between fibroblasts and malignant cells

To characterize the spatial relationship between CTHRC1+ fibroblasts and malignant cells, we calculated the correlation between the CTHRC1+ fibroblast signature score and the distance to the tumor center. A positive correlation indicates that CTHRC1+ fibroblasts are located farther away from malignant cells, whereas a negative correlation suggests their proximity to the tumor center.
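A sketch of this distance-based measurement is given below. The text does not define the tumor center, so the example assumes it is the centroid of spots annotated as malignant. It also assumes Euclidean distance on spot coordinates and Pearson correlation. The function names are hypothetical.

```python
import math


def tumor_center(coords, is_malignant):
    """Centroid of malignant spots (assumed definition of tumor center)."""
    pts = [c for c, m in zip(coords, is_malignant) if m]
    if not pts:
        raise ValueError("no malignant spots")
    return (sum(p[0] for p in pts) / len(pts),
            sum(p[1] for p in pts) / len(pts))


def score_distance_correlation(coords, is_malignant, signature_score):
    """Pearson correlation between a per-spot signature score and the
    Euclidean distance of each spot to the tumor center."""
    cx, cy = tumor_center(coords, is_malignant)
    dist = [math.hypot(x - cx, y - cy) for x, y in coords]
    n = len(dist)
    md, ms = sum(dist) / n, sum(signature_score) / n
    cov = sum((d - md) * (s - ms) for d, s in zip(dist, signature_score))
    var_d = sum((d - md) ** 2 for d in dist)
    var_s = sum((s - ms) ** 2 for s in signature_score)
    if var_d == 0 or var_s == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_d * var_s)
```

With this convention, a score that rises with distance from the malignant centroid yields a positive coefficient, matching the interpretation given above.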

Bulk tumor subtyping

Following the analysis of the TME in pan-cancer scRNA-seq data, we used the signature genes of all 56 cell types to deconvolute the composition and density of the tumors. Our dataset comprised 8,743 individuals across 23 cancer types from the TCGA project (Supplementary Table 8). To make the expression profiles comparable across cancer types, for each cancer type, we normalized the expression profile for each gene by subtracting the average expression value. We used GSVA to score individuals based on the top 50 highly expressed genes of each cell type. Subsequently, hierarchical clustering was used to categorize individuals into five subtypes based on the signature score matrix. These subtypes were labeled according to clustering results, tumor purity, signature scores of major lineage and cell-type functional scores (Fig. 7e,f and Extended Data Fig. 9b,c). Additionally, we collected immune cell infiltration data estimated for each individual in TCGA by different methods from TIMER65 (Fig. 7g and Extended Data Fig. 9d).
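The per-cancer-type normalization step can be illustrated with a short sketch. The data layout (a samples-by-genes list of lists with a parallel list of cancer-type labels) and the function name `center_within_cancer_type` are assumptions made for the example. GSVA scoring and hierarchical clustering are not shown.

```python
from collections import defaultdict


def center_within_cancer_type(expr, cancer_types):
    """Subtract, for every gene, its mean expression within each
    cancer type.

    expr: list of samples, each a list of gene expression values.
    cancer_types: cancer-type label for each sample, in the same order.
    Returns a new samples-by-genes list of centered values.
    """
    groups = defaultdict(list)
    for idx, cancer_type in enumerate(cancer_types):
        groups[cancer_type].append(idx)

    centered = [None] * len(expr)
    for idxs in groups.values():
        n_genes = len(expr[idxs[0]])
        # Gene-wise mean computed only over samples of this cancer type.
        means = [sum(expr[i][g] for i in idxs) / len(idxs)
                 for g in range(n_genes)]
        for i in idxs:
            centered[i] = [expr[i][g] - means[g] for g in range(n_genes)]
    return centered
```

After centering, every gene has mean zero within each cancer type, so downstream scores reflect variation between individuals rather than baseline differences between tissues of origin.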

For the validation dataset, we obtained RNA-seq profiles from METABRIC, comprising 1,992 individuals. To mitigate noise introduced during data processing and sequencing, we normalized expression profiles by subtracting the average expression value within each cancer type.

Benchmarking of different bulk tumor subtyping studies

To comprehensively evaluate the performance of TabulaTIME alongside previous similar studies, we compared the identified cell states, coassociation of cell types and the capacity to categorize bulk RNA-seq samples. First, regarding the identified cell states, we hypothesized that comprehensive signatures would show preferential upregulation in distinct clusters, whereas nonspecific or redundant cell states would co-upregulate within the same clusters. For a more equitable and quantitative comparison, we used the two gene sets from TabulaTIME and Luca as inputs for SCINA66, a marker gene-based cell-type annotation method, to predict cell types in five additional scRNA-seq datasets. We then determined the accuracy of the predicted cell-type annotations relative to the original annotations, as well as the normalized mutual information (NMI) between the predicted annotations and the cluster labels in each additional scRNA-seq dataset (Extended Data Fig. 9g). A higher NMI reflected more precise and dependable annotations.
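The two evaluation metrics can be sketched as follows. NMI has several normalization variants and the text does not state which was used. This example assumes normalization by the arithmetic mean of the two label entropies. The function names are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information (natural log) between two labelings."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum((nij / n) * math.log(nij * n / (count_a[i] * count_b[j]))
               for (i, j), nij in joint.items())


def nmi(a, b):
    """NMI normalized by the arithmetic mean of the entropies
    (assumed variant)."""
    ha, hb = entropy(a), entropy(b)
    if ha == 0 and hb == 0:
        return 1.0  # both labelings consist of a single group
    return 2.0 * mutual_information(a, b) / (ha + hb)


def accuracy(predicted, annotated):
    """Fraction of cells whose predicted label equals the annotation."""
    return sum(p == t for p, t in zip(predicted, annotated)) / len(annotated)
```

NMI is invariant to how labels are named, which makes it suitable for comparing predicted cell types with cluster labels, whereas accuracy requires the predicted and original annotations to share a vocabulary.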

Additionally, we compared the TCGA stratification between TabulaTIME and the other three studies. Participant TME subtypes derived from TabulaTIME were more consistent with the Bagaev et al.6-defined participant TME subtypes, as they both considered immune and stromal cell types (Extended Data Fig. 9i). To quantitatively evaluate accuracy, we classified TCGA participants into immune-hot and immune-cold tumors based on deconvolution results (CIBERSORT, MCP counter67, TIMER and xCell68; Extended Data Fig. 9h). We then compared participant stratification from the different studies to these immune-hot and immune-cold classifications. For studies that explicitly mentioned the TME subtypes characterized by either high or low immune cell infiltration, we classified participants according to the designated TME subtypes. Specifically, in the TabulaTIME and Bagaev et al. studies, we defined desert subtypes, including DHP, DLP and D, as immune-cold tumors. Similarly, in Thorsson et al.55, we classified participants within the C3, C4 and C5 subtypes as immune-cold tumors. For studies that did not describe the immune cell infiltration for each TME subtype, we divided the original TME subtypes based on the reported immune cell infiltration levels. For example, in the Ecotyper study, we categorized CE2-, CE4-, CE5-, CE6-, CE7- and CE8-high tumors as immune-cold, whereas the remaining subtypes were defined as immune-hot.

Reference data for scRNA-seq data annotation methods

To benchmark TabulaTIME, our integrated transcription profiles, as an annotation reference, we used the SELINA framework. We pretrained the SELINA model using TabulaTIME and the additional NSCLC and BRCA datasets. Performance was validated on external datasets by calculating accuracy as the fraction of cells whose predicted label matched the annotated label.

Statistics and reproducibility

This study focuses on individuals with solid tumors, necessitating targeted data selection and making random sampling inappropriate. Samples were defined by the original studies. No statistical method was used to predetermine sample size. Consequently, the experiments were not randomized, and data collection and analysis were not conducted blind to the experimental conditions. All statistical methods, selected as suitable for the data distribution, are reported in the Methods and figure legends. Cell-type distribution across multiple sources was evaluated using Kruskal–Wallis tests. Pairwise source distribution for each cell type, along with LISI, entropy, ASW score, cytotoxic signature score, exhausted signature score, NMI and CIBERSORT-estimated CD8+ T cell infiltration, were assessed using two-tailed unpaired Wilcoxon tests. A Fisher’s exact test validated subtype source preferences, and a log-rank test evaluated differences in survival. Multiple testing correction was applied where appropriate. P values are labeled in the figures, and significance was defined as P < 0.05. The corresponding figures illustrate the distribution of the individual data points.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.