Abstract
Deciphering the features, structure, and functions of the cell niche in tissues remains a major challenge. Here, we present scNiche, a computational framework to identify and characterize cell niches from spatial omics data at single-cell resolution. We benchmark scNiche with both simulated and biological datasets, and demonstrate that scNiche can effectively and robustly identify cell niches while outperforming other existing methods. In spatial proteomics data from human triple-negative breast cancer, scNiche reveals the influence of the microenvironment on cellular phenotypes, and further dissects patient-specific niches with distinct cellular compositions or phenotypic characteristics. By analyzing mouse liver spatial transcriptomics data across normal and early-onset liver failure donors, scNiche uncovers disease-specific liver injury niches, and further delineates the niche remodeling from normal liver to liver failure. Overall, scNiche enables decoding the cellular microenvironment in tissues from single-cell spatial omics data.
Introduction
The cell niche, also referred to as the cellular microenvironment or spatial domain, is defined as the local environment or communities surrounding cells and plays a critical role in determining various biological processes, such as maintaining tissue homeostasis1,2,3 and shaping disease progression4,5,6. Recent advances in spatial omics technologies7,8,9,10,11,12,13,14,15 provide molecular profiles at single-cell resolution, allowing systematic exploration of cellular states, functions, and interactions in the tissue context. However, while these advances have offered extensive spatial atlases, it remains a challenge to decipher the latent cell niche information within these data accurately.
Various computational methods have been developed to identify cell niches by integrating molecular profiles of the cell with spatial information. Early methods such as HMRF16, BayesSpace17, and DR-SC18 employ a Potts model to encourage physically proximal cells to have the same label. This strategy assumes that a cell niche is a region with homogeneous gene expression and models the gene expression of all cells with the same distribution, which cannot accurately capture the gene expression heterogeneity of different cell types within the same niche19. As an improvement, BASS19 introduces additional hierarchical modeling structures on top of the Potts model to explicitly model heterogeneous gene expression of different cell types, thus enabling more flexible and effective modeling of spatial omics data. SCGP20, as another class of method, constructs spatial cellular graphs by computing spatial edges and feature edges between cells separately, enabling traditional graph community detection algorithms to identify cell niches. Most other subsequent methods, on the other hand, tend to combine the molecular profiles of the cell itself with that of its neighbors in different ways to generate new features that are more representative of the cell niche. Specifically, UTAG21 and CellCharter22 integrate the molecular profiles of neighbors into the cell’s own molecular profiles using linear weighting and neighborhood aggregation, respectively. BANKSY23 generates neighbor-augmented features by combining the molecular profiles of the cell itself with that of its neighbors, and provides a specific hyperparameter to tune the contributions of the cells and their local microenvironments. Deep learning-based methods such as SpaGCN24, STAGATE25, GraphST26, and SpaceFlow27, learn better latent features through graph neural networks. 
In addition, there are also some methods primarily used for spatial proteomics data, such as CytoCommunity28 and Spatial-LDA29, which rely on well-annotated cell type information and utilize only the neighborhood composition features of cells to identify cell niches. However, these methods may not reveal some niches located in spatially specific regions, such as the tumor-immune interface, where both tumor and immune cells exhibit altered molecular profiles13. Overall, the effectiveness of these methods suggests that the features of cells and the features of their microenvironments may both help to identify cell niches accurately. However, current methods are generally designed around a fixed architecture of feature combinations, which can be limiting when users want to integrate only the specific combinations of features available to them. In addition, except for a few methods such as BANKSY and CellCharter, most methods are primarily demonstrated on small datasets (such as the spatial transcriptomics data of individual tissue slices). Scaling to large datasets with dozens or hundreds of tissue slices and simultaneously identifying conserved or specific cell niches across these slices remains a prominent challenge in the field.
In this study, we define the features of the cell itself (e.g., the molecular profiles of the cell) and various features of its microenvironment (e.g., the cellular compositions or molecular profiles of neighborhoods of the cell) in a unified way as features from different “views” of the cell, and introduce scNiche, a computational method that leverages these multi-view features of the cell to identify and characterize cell niches in tissue. We highlight the novelty and strengths of scNiche over other existing methods: 1) unlike most previous deep learning-based methods, which typically run graph neural networks on the spatial graph to integrate molecular profiles, scNiche first constructs separate graphs for features from different views of the cell, and then uses graph neural networks to integrate these multi-view features of the cell into a meaningful joint representation of niches. This model framework allows features from other views of the cell to be flexibly replaced or added in practice, and as such can be used as a model paradigm to comprehensively consider and investigate the optimal combination of multi-view features of the cell for niche modeling; 2) through the batch training strategy, scNiche can scale to large datasets containing millions of cells from a series of tissue slices, holding the potential to simultaneously identify conserved or specific cell niches across multiple slices or samples.
We first benchmarked the performance of scNiche with existing methods using simulated and biological datasets. We then applied scNiche to a variety of spatial omics datasets from different tissues, including human triple-negative breast cancer across two archetypical subtypes (mixed and compartmentalized)13 as well as mouse liver under normal and early-onset liver failure states10, to identify patient- or disease-specific cell niches and to further provide comprehensive characterization and interpretation of these niches from both the cellular composition and molecular expression perspectives.
Results
Design concept of scNiche
scNiche is designed to leverage and integrate multi-view features of the cell from both itself and its microenvironment to identify cell niches. By default, scNiche takes single-cell spatial omics data as input and first extracts the following three-view features of each cell within a pre-defined neighborhood range: the molecular profiles of the cell, the molecular profiles of its neighborhoods, and the cellular compositions of its neighborhoods (Fig. 1a). Notably, when applied to spatial transcriptomics datasets containing multiple tissue slices, dimensionality reduction and batch correction on the features of the first two views are usually necessary to balance the dimensionality of different views while eliminating potential batch effects across different slices (Methods). On the other hand, in addition to the default three views, features from other views (such as the histological information of cells or the deconvoluted cellular compositions of spots in the low-resolution spatial transcriptomics data) can also be added or replaced conveniently, allowing for a more flexible investigation of the optimal combination of multi-view features of the cell for niche modeling. Subsequently, scNiche applies a neural network architecture of the multiple graph autoencoder (M-GAE) coupled with a graph fusion network (GFN) to integrate the multi-view features of the cell into a joint representation (z). Specifically, the M-GAE model encodes the complementary information of multi-view data, and the GFN captures the relationships among graphs from different views and generates a consensus graph that contains a global node relationship across all views, which is then input back into the M-GAE model. scNiche also applies a multi-view mutual information maximization (MMIM) module to guide the joint representation (z) to be more clustering-friendly by boosting the similarity between representations of neighboring samples within any view (Fig. 1a and Supplementary Fig. S1). 
The training process is guided by minimizing the combined loss function comprising the M-GAE reconstruction, graph reconstruction, and mutual information loss (Fig. 1a and Methods). Additionally, a batch training strategy is developed to enable scNiche to efficiently handle large datasets (Methods). After model training, the learned joint representation (z) can be clustered using any unsupervised clustering algorithms such as k-means or Leiden30 to identify the cell niches. Finally, scNiche also implements an integrated downstream analytical framework for the comprehensive characterization of identified cell niches (Fig. 1b and Methods).
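The three default views described above can be illustrated with a minimal sketch. It is not the scNiche implementation: the function name `three_view_features`, the use of k-nearest neighbors to define the neighborhood, and the use of a plain mean to summarize neighbor profiles are all assumptions made for illustration, and the brute-force distance matrix is only suitable for small examples.

```python
import numpy as np

def three_view_features(coords, expr, cell_types, k=5):
    """Return three per-cell feature matrices (hypothetical helper):
    the cell's own profile, the mean profile of its k nearest neighbours,
    and the cell-type composition of those neighbours. Requires k < n_cells."""
    coords = np.asarray(coords, dtype=float)
    expr = np.asarray(expr, dtype=float)

    # Pairwise Euclidean distances (brute force, small data only).
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)            # a cell is not its own neighbour
    nbrs = np.argsort(dist, axis=1)[:, :k]    # indices of the k nearest cells

    view_cell = expr                           # view 1: own molecular profile
    view_nbr_expr = expr[nbrs].mean(axis=1)    # view 2: neighbourhood profile

    # View 3: fraction of each cell type among the neighbours (rows sum to 1).
    types, codes = np.unique(np.asarray(cell_types), return_inverse=True)
    onehot = np.eye(len(types))[codes]
    view_nbr_comp = onehot[nbrs].mean(axis=1)
    return view_cell, view_nbr_expr, view_nbr_comp
```

Each returned matrix has one row per cell, so a separate graph could be built on each view before fusion; how scNiche builds those graphs is described in its Methods and is not reproduced here.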
a Schematic workflow of scNiche. Given the spatial omics data, scNiche first extracts the multi-view features of cells within a pre-defined neighborhood range. Subsequently, scNiche combines features from different views into a joint representation (z). The combined loss function comprising the M-GAE reconstruction loss (\({L}_{{rec}}\)), graph reconstruction loss (\({L}_{{gre}}\)), and mutual information loss (\({L}_{{mim}}\)) is applied to guide the training process of the model. scNiche finally performs the unsupervised clustering step on the learned joint representation (z) to identify cell niches. b Downstream analytical framework of scNiche for the comprehensive characterization and interpretation of identified cell niches.
Multi-view feature fusion improves the accuracy of cell niche identifications
We first evaluated the performance of scNiche using the simulated datasets generated by scCube31, where the heterogeneity in both the cellular composition and gene expression of cell niches was considered. Furthermore, the cell niches in each simulated dataset exhibited variations in spatial continuity and compositional complexity, aiming to simulate the cellular microenvironment across different tissues (Supplementary Fig. S2a and Methods). Ten existing methods (DR-SC18, BASS19, UTAG21, CellCharter22, BANKSY23, SpaGCN24, STAGATE25, GraphST26, SpaceFlow27, and CytoCommunity28) were selected for comparison. Two evaluation metrics, the adjusted Rand index (ARI) and the macro-F1 score, were used to assess the accuracy of identifying true cell niches. As shown in Supplementary Fig. S2b, scNiche outperformed other methods in accurately identifying cell niches, with its performance being nearly unaffected by the spatial continuity or compositional complexity of cell niches (Supplementary Fig. S2c–e).
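The two metrics can be computed as in the following sketch. ARI is invariant to how clusters are named, whereas the macro-F1 score needs each cluster to be assigned to a reference niche first. The assignment used here (one-to-one Hungarian matching on the overlap table) is an assumption for illustration and may differ from the procedure in the Methods; `evaluate_niches` is a hypothetical helper name.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score, f1_score

def evaluate_niches(true_labels, cluster_labels):
    """Return (ARI, macro-F1) of a clustering against reference niche labels."""
    true = np.asarray(true_labels).astype(str)
    pred = np.asarray(cluster_labels).astype(str)
    ari = adjusted_rand_score(true, pred)

    niches = np.unique(true)
    clusters = np.unique(pred)
    # Overlap table: rows are clusters, columns are reference niches.
    overlap = np.array([[np.sum((pred == c) & (true == t)) for t in niches]
                        for c in clusters])
    # Assumed mapping step: one-to-one matching that maximises total overlap.
    rows, cols = linear_sum_assignment(-overlap)
    mapping = {clusters[r]: niches[c] for r, c in zip(rows, cols)}
    # Clusters left without a partner get a placeholder that matches no niche.
    mapped = np.array([mapping.get(c, "unmatched") for c in pred])

    macro_f1 = f1_score(true, mapped, labels=list(niches),
                        average="macro", zero_division=0)
    return ari, macro_f1
```

With this matching, a clustering that reproduces the reference partition under different cluster names scores 1 on both metrics.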
We next assessed the performance of scNiche when the data quality degrades through two simulation scenarios. Specifically, in one scenario, we randomly set the expression values of a certain proportion of genes to 0 (the gene expression dropout), and in another scenario, we randomly altered the cell annotation labels of a certain proportion of cells to “ambiguous” (the cell annotation dropout) (Methods). As expected, the accuracy of all methods dropped as the data quality degraded (Supplementary Fig. S3a, b). In the former simulation scenario, scNiche exhibited relatively stable performance at lower dropout rates of gene expression. However, for higher dropout rates of gene expression, the performance of scNiche and all other methods declined dramatically (Supplementary Fig. S3a). In the latter simulation scenario, we found that the performance of scNiche was more robust compared to CytoCommunity as the dropout rates of cell annotation increased, suggesting that scNiche’s strategy of multi-view feature fusion may effectively mitigate the impact of ambiguous cell annotations by considering the cell gene expression information (Supplementary Fig. S3b).
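The two degradation scenarios can be sketched as follows. The text does not specify whether genes are dropped per cell or across all cells, so this sketch assumes that a fixed fraction of genes is zeroed independently in each cell; the function names and the `rate` and `rng` parameters are hypothetical.

```python
import numpy as np

def expression_dropout(expr, rate, rng):
    """Zero a fraction `rate` of genes in every cell (assumed: chosen per cell)."""
    expr = np.asarray(expr, dtype=float)
    n_cells, n_genes = expr.shape
    n_drop = int(round(rate * n_genes))
    out = expr.copy()                          # leave the input untouched
    # A random permutation of gene indices per cell; drop the first n_drop.
    order = np.argsort(rng.random((n_cells, n_genes)), axis=1)
    rows = np.arange(n_cells)[:, None]
    out[rows, order[:, :n_drop]] = 0.0
    return out

def annotation_dropout(labels, rate, rng, placeholder="ambiguous"):
    """Relabel a fraction `rate` of cells as `placeholder`."""
    out = np.asarray(labels, dtype=object).copy()
    n_drop = int(round(rate * out.size))
    idx = rng.choice(out.size, size=n_drop, replace=False)
    out[idx] = placeholder
    return out
```

Passing a seeded generator such as `np.random.default_rng(0)` keeps the corrupted datasets reproducible across the methods being compared.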
We also conducted ablation studies on each view of the three default inputs as well as on each model component of scNiche respectively to assess their individual contributions. For the former, as shown in Supplementary Table 1–2, scNiche outperformed all its derivatives, each of which excludes the fusion of features from a specific view, indicating that features from all three views contribute to the accurate identification of cell niches. Furthermore, scNiche also performed better than using the features from a single view alone, and its model-based feature fusion strategy was superior to the simple concatenation of features from different views. For the latter, as expected, scNiche was unable to effectively encode the complementary information from multiple views of cells when the M-GAE or GFN component was removed, resulting in an inability to accurately model the cellular microenvironment (Supplementary Table 3). In addition, the performance of scNiche-w/o MMIM also declined compared to scNiche, suggesting that the MMIM component contributes to the learning of more discriminative joint representations (Supplementary Table 3).
In summary, the performance evaluation on the simulated datasets demonstrated that scNiche can effectively integrate information from different views of the cell and holds the potential to identify cell niches accurately.
Performance evaluation of scNiche on mouse spleen CODEX dataset
Real spatial omics data are better suited than simulated data for evaluating the performance of scNiche because the cell niches therein are objectively present and biologically meaningful. Therefore, we first applied scNiche to a mouse spleen spatial proteomics dataset generated by the co-detection by indexing (CODEX) technology12. The compartment label of each cell from three wild-type spleen samples (BALBc-1, BALBc-2, and BALBc-3) was provided and can be regarded as the ground truth of niches (Fig. 2a). We first evaluated the performance of scNiche’s batch training strategy on this dataset. As shown in Supplementary Fig. S4, scNiche maintained a relatively stable performance under different batch number settings without requiring additional training epochs. Furthermore, the performance of scNiche in identifying cell niches across multiple slices was comparable to that of using only a single slice, consistent with previous findings32 (Supplementary Fig. S5). Benchmarking results showed that scNiche outperformed other methods on both evaluation metrics, and accurately identified the marginal zone (a unique cell niche located on the periphery of the B cell follicle) in all three samples (Fig. 2b, c).
a The mouse spleen CODEX data from three wild-type spleen samples. Each cell is colored by the cell type labels (left) and the niche labels (right). b The cell niches identified by each method on the mouse spleen CODEX data. c Performance comparison of scNiche with other methods across three samples using the adjusted Rand index (ARI) (left) and the macro-F1 score (right) metrics on mouse spleen CODEX dataset. The ARI and macro-F1 score metrics relative to the niche annotation are calculated based on the result of each method. Data are presented as boxplots (minima, 25th percentile, median, 75th percentile, and maxima). n = 3 samples for each method. d The human UTUC IMC data from one representative sample. Each cell is colored by the cell type labels (top) and the niche labels (bottom). e The cell niches identified by each method on the human UTUC IMC data. f Performance comparison of scNiche with other methods across 16 samples using the adjusted Rand index (ARI) (left) and the macro-F1 score (right) metrics on human UTUC IMC dataset. The ARI and macro-F1 score metrics relative to the niche annotation are calculated based on the result of each method. Data are presented as boxplots (minima, 25th percentile, median, 75th percentile, and maxima). n = 16 samples for each method. Source data are provided as a Source Data file.
Performance evaluation of scNiche on human upper tract urothelial carcinoma IMC dataset
We also applied scNiche to another spatial proteomics dataset from human upper tract urothelial carcinoma (UTUC) generated by the Imaging Mass Cytometry (IMC) technology33 to further evaluate its performance. In this dataset, 16 images had been manually annotated with boundaries of tumor and stroma, which can be regarded as the ground truth of niches21 (Fig. 2d and Supplementary Fig. S6). As shown in Fig. 2e, f and Supplementary Fig. S6-7, although distinguishing tumor and stroma niches in this dataset is a relatively simple task and all methods achieved comparatively good performance in most samples, scNiche still demonstrated the best overall performance across all 16 samples, and successfully resolved the fine structure of the boundaries of tumor and stroma niches in some samples such as PM57_B8-01. Furthermore, scNiche with higher clustering granularity identified more fine-grained niches, including different immune-enriched niches and tumor-enriched niches (Supplementary Fig. S8).
Additionally, since the finer subpopulation annotation of each cell was also provided by the original authors, we thus further evaluated the robustness of scNiche to the granularity of cell population annotation. As illustrated in Supplementary Fig. S9, scNiche continued to accurately identify tumor and stromal niches when using the refined cell population annotation of cells and consistently outperformed CytoCommunity, which also utilized the refined cellular annotation information.
Performance evaluation of scNiche on mouse brain spatial transcriptomics datasets
We further evaluated the performance of scNiche on two additional mouse brain single-cell spatial transcriptomics datasets with more complex niche structures generated by different technologies. Specifically, the STARmap dataset34 contains one tissue slice from the mouse V1 neocortex and the MERFISH dataset35 contains 31 tissue slices from the mouse frontal cortex and striatum. The brain region label of each cell was manually annotated in both datasets and can be regarded as the ground truth of niches (Fig. 3a, d). Consistent with the results on spatial proteomics datasets, scNiche also demonstrated superior overall performance compared with other methods on these two additional spatial transcriptomics datasets (Fig. 3b, c, e, f and Supplementary Fig. S10-11), suggesting the general applicability of scNiche in accurately identifying cell niches from different spatial omics data.
a The mouse V1 neocortex STARmap data from one slice. Each cell is colored by the cell type labels (left) and the niche labels (right). b The cell niches identified by each method on the mouse V1 neocortex STARmap data. c Performance comparison of scNiche with other methods using the adjusted Rand index (ARI) (top) and the macro-F1 score (bottom) metrics on mouse V1 neocortex STARmap dataset. The ARI and macro-F1 score metrics relative to the niche annotation are calculated based on the result of each method. d The mouse frontal cortex and striatum MERFISH data from one representative slice. Each cell is colored by the cell type labels (top) and the niche labels (bottom). e The cell niches identified by each method on the mouse frontal cortex and striatum MERFISH data. f Performance comparison of scNiche with other methods across 31 slices using the adjusted Rand index (ARI) (left) and the macro-F1 score (right) metrics on mouse frontal cortex and striatum MERFISH dataset. The ARI and macro-F1 score metrics relative to the niche annotation are calculated based on the result of each method. Data are presented as boxplots (minima, 25th percentile, median, 75th percentile, and maxima). n = 31 slices for each method. Source data are provided as a Source Data file.
Performance evaluation of scNiche on low-resolution spatial transcriptomics dataset
We explored the potential applicability of scNiche to the spatial transcriptomics data generated by platforms with a lower resolution such as ST36 and 10X Visium on the human DLPFC 10X Visium dataset37. Specifically, we first used the human middle temporal gyrus (MTG) scRNA-seq data by Hodge et al.38, which contains 75 transcriptomically distinct cell types, as the single-cell reference, and deconvoluted the spots of each DLPFC slice data using Cell2location39 (Supplementary Fig. S12a). The deconvolution results were subsequently used to replace the feature of cellular compositions of neighborhoods utilized in the single-cell spatial omics data, and were input into scNiche along with features from the remaining two views (the molecular profiles of the spot and its neighborhoods). Four tissue slices from the same donor were selected, where the cortical layer label of each spot was manually annotated and can be regarded as the ground truth of niches (Supplementary Fig. S12b). As illustrated in Supplementary Fig. S12c, d, although scNiche was not originally designed for spot-based spatial transcriptomics data, its modified version still performed comparably to some state-of-the-art methods on the four DLPFC slices.
Scalability analysis of scNiche to large datasets
In addition to the spatial transcriptomics and spatial proteomics datasets used in the benchmarking studies, we further tested the scalability of scNiche and other methods on a much larger mouse whole brain MERFISH dataset generated by Zhang et al.40. As shown in Supplementary Fig. S13, scNiche, BANKSY, UTAG, and CellCharter are the only four methods that can scale to the dataset with more than 3 million cells. For the mouse whole brain MERFISH dataset, scNiche identified 14 cell niches according to the cluster stability, which were aligned across sequential tissue sections (Supplementary Fig. S14). Moreover, we also selected four representative sections corresponding to different regions of the brain: C57BL6J-1.032, C57BL6J-1.056, C57BL6J-1.081, and C57BL6J-1.136, and compared the cell niches identified by each method with the anatomical regions from the Allen Mouse Brain Reference Atlas41. As shown in Fig. 4a, b, the niches identified by scNiche accurately correspond to different structures in the mouse brain. In contrast, the niches identified by UTAG and the nonspatial method lack clear spatial separation, while BANKSY and CellCharter failed to distinguish certain brain regions, such as the hypothalamus and striatum in the C57BL6J-1.056 section (Fig. 4c). In summary, these results suggest that scNiche can scale to large datasets while maintaining good performance.
a Annotated mouse brain coronal section images from the Allen Mouse Brain Atlas. The cell niches identified by scNiche (b) and other methods (c) on C57BL6J-1.032, C57BL6J-1.056, C57BL6J-1.081, and C57BL6J-1.136 sections.
Robustness analysis of scNiche
We first evaluated the robustness of scNiche to the size of the pre-defined neighborhood range on four different datasets. As shown in Supplementary Fig. S15-16, the performance of scNiche was stable when different numbers of k-nearest neighbors were selected. Indeed, previous studies have shown that moderately changing the number of k-nearest neighbors does not lead to a significant degradation in method accuracy. Nevertheless, given the complexity of tissues, it may still be necessary to empirically determine an appropriate size of the neighborhood range in practical analyses28,42,43,44. Furthermore, we also evaluated the sensitivity of scNiche to different random seed choices, and the results indicated that scNiche was relatively robust to the choice of random seed compared to other methods such as UTAG, SpaGCN, and DR-SC (Supplementary Fig. S17).
scNiche deciphers the cell niches in different subtypes and patients of human triple-negative breast cancer
Mounting evidence and new therapeutic strategies have demonstrated that the tumor microenvironment plays a pivotal role in the initiation and progression of cancer, opening new opportunities for diagnosis and therapy45,46,47,48,49. Here, we applied scNiche to a human triple-negative breast cancer (TNBC) dataset generated by the multiplexed ion beam imaging by time-of-flight (MIBI-TOF) technology13. Studies have shown that TNBC exhibits three archetypical subtypes based on different tumor-immune interactions: cold, mixed, and compartmentalized. Among these, the cold subtype is characterized by low immune infiltration and is easily distinguished, while the mixed and compartmentalized subtypes may contain similar numbers of immune cells. Furthermore, the spatial organization and degree of mixing of tumor and immune cells may differ in the mixed and compartmentalized subtypes13. Therefore, we used scNiche to decipher the tumor microenvironment of these two TNBC subtypes. A total of 173,205 cells from 19 mixed subtype samples and 15 compartmentalized subtype samples were analyzed.
According to the cluster stability proposed by Varrone et al.22 (Supplementary Fig. S18), scNiche identified 13 cell niches, which broadly manifested as tumor-enriched niches (Niche 1, 10, 2, 9, 3, 5, and 11) and other immune-enriched niches (Niche 7, 6, 4, 0, 8, and 12) characterized by distinct combinations of immune and stromal cell types (Fig. 5a and Supplementary Fig. S19, 20). By comparing the enriched cell niches in the two subtypes of TNBC samples, we found that the tumor-enriched niches were predominantly enriched in the mixed subtype samples. In contrast, the immune-enriched niches were more prevalent in the compartmentalized subtype samples (Fig. 5a). This finding is consistent with previous studies13,50,51 showing that immune cells in the mixed subtype demonstrated a higher degree of mixing with tumor cells than those in the compartmentalized subtype and were therefore less likely to form spatially separate regions. Furthermore, different niches tended to be shared only among a small number of specific patients (Supplementary Fig. S21a), reflecting the inter-patient heterogeneity of the tumor microenvironment52.
a The proportion of immune, stromal, and tumor cells in each niche (top) and the enrichment scores of each cell type in each niche (bottom) across 19 mixed subtype samples and 15 compartmentalized subtype samples. The enrichment scores are calculated among all 13 niches, and the p-value is calculated with the one-sided Mann-Whitney U test (*p ≤ 0.05; **p ≤ 0.01; ***p ≤ 0.001; ****p ≤ 0.0001). b The two spatially adjacent cell niches are enriched with immune cells from different lineages respectively, corresponding to different immune microenvironments. Each cell is colored by the niche labels and the cell type labels, and the proportions of CD4 T cells and other immune cells within the neighbors of each cell are displayed. c The proportions of Niche 6 and Niche 12 in each sample. Only samples with a proportion of Niche 6 (or Niche 12) exceeding 5% are shown. The average expression levels of monocyte markers, immunoregulatory proteins, and antigen presentation proteins of macrophages in Niche 6 and Niche 12 in each sample (d) and the proportions of macrophages, immune cells, and tumor cells in Niche 6 and Niche 12 in each sample (e). For Niche 6 (or Niche 12), only samples with a proportion of that niche exceeding 5% are considered. The p-value is calculated with the two-sided Mann-Whitney U test. Data are presented as boxplots (minima, 25th percentile, median, 75th percentile, and maxima). n = 6 (10) samples for Niche 6 (Niche 12). f Niche 6 in Patient 9 (top) and Patient 10 (bottom) is located at the tumor-immune border. g The spatial expression level of CD11b, CD11c, IDO, and PD-L1 in Patient 9 (top) and Patient 10 (bottom). Source data are provided as a Source Data file.
The 6 immune-enriched niches showed differential cellular composition, corresponding to distinct microenvironments. Niche 7 exhibited significant enrichment of B cells, CD4 T cells, CD8 T cells, dendritic cells, and NK cells, which may represent the tertiary lymphoid structure (TLS)53,54; Niche 8 was enriched with endothelial cells and mesenchymal-like cells, potentially representing the stromal microenvironment in the tumor (Supplementary Fig. S21b). The two spatially adjacent cell niches, Niche 4 and Niche 0, co-existed in a specific subset of patients (Patient 3, 4, 5, 9, and 27) and were enriched with lymphoid immune cells (such as CD4 T cells) and other immune cells, respectively (Fig. 5b and Supplementary Fig. S21c). It has been reported that cells from the same lineage tend to be more spatially proximate. Thus, these two cell niches may reflect the diversity of the immune responses to tumors, whereby specific immune cells were recruited to the tumor sites via specific mechanisms or local environments13,45,55,56.
We also noticed that the two macrophage-enriched niches (Niche 6 and Niche 12) resolved by scNiche were mainly present in different TNBC subtypes and consisted of macrophages with distinct phenotypes (Fig. 5c). Specifically, while macrophages within these two niches exhibited consistent expression of classical monocyte markers such as CD68 and CD63, those in Niche 6 also displayed increased expression of CD11b, CD11c, immune regulation proteins, and antigen presentation proteins, suggesting they were myeloid-derived suppressor cells57 (Fig. 5d). This inconsistency in the phenotype exhibited by cells from the same lineage was observed across the entire patient cohort and may be related to the microenvironments in which the cells reside. We found that the macrophages in Niche 6 were likely to exhibit more pronounced spatial proximity to other immune cell types, whereas macrophages in Niche 12 were more likely to co-localize with tumor cells (Fig. 5e). Indeed, Niche 6 may represent a unique niche at the tumor-immune border13, with its altered phenotypes compared with Niche 12 arising from changes in the expression profiles across all cell types, rather than being specific to a particular cell population (Fig. 5f, g and Supplementary Fig. S22).
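The per-sample comparison between two niches (Fig. 5d, e) can be sketched as below. The sketch follows the figure legend in keeping only samples where the niche exceeds 5% of cells and in using a two-sided Mann-Whitney U test; the function names, the argument layout, and the optional `keep` mask (for example, to restrict the average to macrophages) are assumptions for illustration.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def per_sample_means(sample_ids, niche_labels, values, niche,
                     keep=None, min_fraction=0.05):
    """Mean of `values` over cells of `niche` in each sample, restricted to
    samples in which the niche makes up more than `min_fraction` of cells."""
    sample_ids = np.asarray(sample_ids)
    niche_labels = np.asarray(niche_labels)
    values = np.asarray(values, dtype=float)
    keep = np.ones(values.size, bool) if keep is None else np.asarray(keep, bool)

    means = []
    for s in np.unique(sample_ids):
        in_sample = sample_ids == s
        in_niche = in_sample & (niche_labels == niche)
        # The niche proportion is computed over all cells of the sample.
        if in_niche.sum() / in_sample.sum() <= min_fraction:
            continue
        selected = in_niche & keep             # e.g. macrophages only
        if selected.any():
            means.append(values[selected].mean())
    return np.array(means)

def compare_niches(sample_ids, niche_labels, values, niche_a, niche_b, keep=None):
    """Two-sided Mann-Whitney U test on per-sample means of two niches."""
    a = per_sample_means(sample_ids, niche_labels, values, niche_a, keep)
    b = per_sample_means(sample_ids, niche_labels, values, niche_b, keep)
    _, p_value = mannwhitneyu(a, b, alternative="two-sided")
    return a, b, p_value
```

Averaging within each sample before testing treats samples, rather than individual cells, as the unit of comparison, which matches the sample counts reported in the legend.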
Further cell population enrichment analysis of 7 tumor-enriched niches revealed more subtle compositional differences among them. For instance, Niche 5 and Niche 11, characterized by the enrichment of Keratin+ tumor cells, exhibited extremely low immune or stromal cell infiltration. In contrast, other tumor-enriched niches exhibited heterogeneity in the type of cells infiltrated and the degree of infiltration (Fig. 6a, b). Although the cellular compositions of the immune-exhausted tumor niches (Niche 5 and Niche 11) were similar, the tumor cells within these niches exhibited differential expression of the tumor-related proteins, including cytokeratin 6 (CK6) and CK17 (Fig. 6c). Interestingly, Niche 5 and Niche 11 were identified in different patients, potentially representing patient-specific niches composed of cells in distinct cellular states58 (Fig. 6d). Furthermore, survival analysis results on public datasets suggested that the differences between these two niches may also reflect phenotypic differences between patients (Supplementary Fig. S23). On the other hand, the tumor cells within other infiltrating tumor-enriched niches exhibited variation in the expression of other types of proteins, which may be associated with the specific infiltrating cell types. For instance, the tumor cells in Niche 9 showed increased expression of antigen presentation proteins such as HLA1 and HLA-DR, suggesting the localized production of cytokines like IFN-γ induced by the extensive infiltration of immune cells59,60 (Fig. 6e, f). Similarly, we observed that the tumor cells in Niche 3, the tumor-enriched niche infiltrated by stromal cells, displayed high expression of stromal-related proteins such as SMA and vimentin, which may indicate invasion and metastasis, and was often associated with poor prognosis61,62,63 (Fig. 6g, h).
a The enrichment scores of each cell type in each tumor-enriched niche across 19 mixed subtype samples and 15 compartmentalized subtype samples. The enrichment scores are calculated among 7 tumor-enriched niches, and the p-value is calculated with the one-sided Mann-Whitney U test (*p ≤ 0.05; **p ≤ 0.01; ***p ≤ 0.001; ****p ≤ 0.0001). b Three representative tumor-enriched niches exhibit heterogeneity in the type of cells infiltrated and the degree of infiltration. Each cell is colored by the niche labels (left) and the cell type labels (right). c The average expression levels of tumor-related proteins of tumor cells in Niche 5 and Niche 11 in each sample. For Niche 5 (or Niche 11), only samples with a proportion of that niche exceeding 5% are considered. The p-value is calculated with the two-sided Mann-Whitney U test. Data are presented as boxplots (minima, 25th percentile, median, 75th percentile, and maxima). n = 7 (8) samples for Niche 5 (Niche 11). d The proportions of Niche 5 and Niche 11 in each sample. Only samples with a proportion of Niche 5 (or Niche 11) exceeding 5% are shown. The average expression levels of HLA1 and HLA-DR of tumor cells in Niche 9 and other tumor-enriched niches in each sample (e) and the proportions of tumor cells, immune cells, and stromal cells in Niche 9 and other tumor-enriched niches in each sample (f). For Niche 9 (or other tumor-enriched niches), only samples with a proportion of that niche exceeding 5% are considered. The p-value is calculated with the two-sided Mann-Whitney U test. Data are presented as boxplots (minima, 25th percentile, median, 75th percentile, and maxima). n = 8 (32) samples for Niche 9 (other tumor-enriched niches). The average expression levels of SMA and Vimentin of tumor cells in Niche 3 and other tumor-enriched niches in each sample (g) and the proportions of tumor cells, immune cells, and stromal cells in Niche 3 and other tumor-enriched niches in each sample (h). 
For Niche 3 (or other tumor-enriched niches), only samples with a proportion of that niche exceeding 5% are considered. The p-value is calculated with the two-sided Mann-Whitney U test. Data are presented as boxplots (minima, 25th percentile, median, 75th percentile, and maxima). n = 10 (29) samples for Niche 3 (other tumor-enriched niches). Source data are provided as a Source Data file.
Together, these results effectively highlight the important influence of the microenvironment on cellular phenotypes, while also demonstrating the accuracy of scNiche in revealing both compositional and phenotypic heterogeneity of cell niches.
scNiche characterizes the cell niches in normal and disease mouse livers
To further demonstrate the applicability of scNiche to other types of spatial omics data, we next applied scNiche to a mouse liver spatial transcriptomics dataset generated by Cho et al.10 with the Seq-Scope technology. A total of 37,505 cells from 6 normal donors and 4 donors with early-onset liver failure induced by excessive mTORC1 signaling64 (Tsc1Δhep/Depdc5Δhep, or TD model) were analyzed. Considering the significant batch effects present in the high-dimensional spatial transcriptomics data between normal and TD livers, we first used scVI65 for dimensionality reduction and batch effect removal before applying scNiche (Supplementary Fig. S24).
Based on cluster stability (Supplementary Fig. S25), scNiche identified 15 cell niches, with the majority of them showing specific enrichment in either normal or TD livers, potentially revealing distinct physiological states (Fig. 7a and Supplementary Fig. S26, 27a). Specifically, we found that the 7 cell niches (Niche 0, 3, 12, 5, 14, 11, and 1) enriched in normal livers exhibited spatial continuity, encompassing the zonation patterns from the central vein to the portal node (Fig. 7b). For example, Niche 0 (enriched with pericentral hepatocytes) and Niche 1 (enriched with periportal hepatocytes) were located in the pericentral and periportal zones, respectively, whereas the other 5 niches were situated in the transition zones between pericentral and periportal zones, and characterized by various enrichment patterns of different hepatocyte subtypes and other non-parenchymal cell types (Fig. 7c). Moreover, differentially expressed genes (adjusted p-value < 0.05) within these 7 niches also showed a pronounced zonal spatial expression pattern (Fig. 7d). From Niche 0 to Niche 1, we observed a gradual decrease in the expression of the pericentral genes66,67,68,69 such as Cyp2e1, Cyp1a2, Mup17, Gsta3, and Gulo. Conversely, the expression of the periportal genes66,67,68,69,70 such as Ass1, Alb, Cyp2f2, Sds, Hsd17b13, and Mup20 exhibited a gradual increase (Fig. 7e and Supplementary Fig. S27b). Meanwhile, we also found that some specific genes exhibited non-monotonic expression patterns across the 7 consecutive niches. For example, the hepcidin-encoding genes, Hamp and Hamp266, which demonstrated non-monotonic zonation expression patterns that peak at intermediate lobule layers, were highly expressed in Niche 5 and Niche 12, respectively (Fig. 7e). Additional non-monotonic genes such as Cyp8b166 and Apoc166,71 were highly expressed in Niche 3 and Niche 5, respectively (Supplementary Fig. S27b). 
The gene expression signature scores of KEGG pathways for the 7 niches also largely recapitulated previous zonation studies66,67. Cells in Niche 0 had higher scores of drug metabolism, primary bile acid biosynthesis, and metabolism of xenobiotics pathways, while cells in Niche 1 had higher scores of oxidative phosphorylation, gluconeogenesis, and complement and coagulation cascades pathways (Fig. 7f). Overall, these results suggested that scNiche can precisely reveal the zonation profiles of normal livers.
a The proportion of each donor in each niche (top) and the enrichment scores of each cell type in each niche (bottom) across 6 normal donors and 4 TD donors. The enrichment scores are calculated among all 15 niches, and the p-value is calculated with the one-sided Mann-Whitney U test (*p ≤ 0.05; **p ≤ 0.01; ***p ≤ 0.001; ****p ≤ 0.0001). b The cell niches identified by scNiche in tile 2103, tile 2104, and tile 2105. Each cell is colored by the niche labels (top) and the cell type labels (bottom). c The proportion of each cell type in Niche 0, 3, 12, 5, 14, 11, and 1. d The normalized average expression levels of 216 differentially expressed genes (adjusted p-value < 0.05) in Niche 0, 3, 12, 5, 14, 11, and 1. The p-value is calculated with the two-sided Wilcoxon rank-sum test and is adjusted with the Benjamini-Hochberg correction. e The average expression levels of Cyp2e1, Gulo, Hamp, Hamp2, Alb, and Cyp2f2 in Niche 0, 3, 12, 5, 14, 11, and 1. For each niche, only donors with a proportion of that niche exceeding 5% are considered. Data are presented as mean values with a 95% confidence interval. f The gene expression signature scores of Niche 0, 3, 12, 5, 14, 11, and 1 for six different KEGG pathways. Data are presented as mean values with a 95% confidence interval. Source data are provided as a Source Data file.
To further appreciate the difference in niches between normal and TD livers, we also investigated the cell niches enriched in TD livers. The scNiche results revealed three unique niches in TD livers: Niche 4, Niche 9, and Niche 7. These niches were spatially distributed from the core to the periphery of the injury and inflammation sites, and were characterized by the enrichment of a series of emerging cell populations, including inflamed macrophages, hepatic progenitor cells (HPC), activated hepatic stellate cells (HSC-A), and injured hepatocytes (Figs. 7a, 8a). Differential expression analysis showed that these three niches upregulated a range of injury-associated genes that were individually induced by different cell populations, possibly reflecting the unique response of different cell populations to liver injury as previously reported10 (Fig. 8b and Supplementary Fig. S27c). For example, injured hepatocytes and HPC highly expressed serum amyloid protein-encoding genes (Saa1 and Saa2) and Spp1, respectively, which have been reported to be associated with injury response72,73, whereas inflamed macrophages and HSC-A exhibited high expression levels of pro-inflammatory markers (Cd74 and MHC-II components) and fibrosis markers (Acta2 and collagens), respectively. Consistent with these up-regulated markers, the cellular inflammatory infiltration and fibrosis74 signature scores in Niche 4, Niche 9, and Niche 7 were also higher than in other niches (Fig. 8c). Overall, these results suggested that these three niches identified by scNiche reflected the specific microenvironment associated with liver injury.
a Highlighting of the injury-associated cell populations and niches (Niche 4, 9 and 7) in tile 2117, 2118, and 2119. Each cell is colored by the niche labels (top) and the cell type labels (bottom). b The normalized average expression levels of differentially expressed genes in the injury-associated niches (Niche 4, 9, 7) and other niches. The top 20 genes highly expressed in the injury-associated niches are shown. c The gene expression signature scores of the injury-associated niches (Niche 4, 9, 7) and other niches for cellular inflammatory infiltration (top) and fibrosis (bottom). The p-value is calculated with the two-sided Mann-Whitney U test. Data are presented as boxplots (minima, 25th percentile, median, 75th percentile, and maxima). n = 1183, 482, 1612, and 34228 cells for Niche 4, Niche 9, Niche 7, and other niches. d The number of links among niches on the spatial connectivity graph in normal (left) and TD (right) mouse livers. CV, central vein; PN, portal node. e Gene set enrichment analysis (GSEA) results of differentially expressed genes of Niche 10. f Comparisons of the trends in spatial expression of genes across niches from the central vein to the portal node in normal and TD mouse livers. Data are presented as mean values with a 95% confidence interval. Source data are provided as a Source Data file.
Furthermore, scNiche uncovered the partial remodeling of the zonation patterns from the central vein to the portal node in TD livers compared to normal livers. Specifically, we found that niches proximal to the central vein observed in normal livers (Niche 0, 3 and 12) were also present in TD livers, whereas niches proximal to the portal node were altered, despite no significant changes in their cellular composition (Fig. 8d and Supplementary Fig. S27d). These results indicated specific phenotypic changes in the niches located proximal to the portal node during liver injury, which were precisely captured by scNiche. We further performed differential expression analysis between Niche 1 and Niche 10, which were located at the portal node in normal and TD livers, respectively. The genes up-regulated in Niche 10 compared with Niche 1 comprised antioxidant genes such as Gpx3, Gsta1, Gsta2, and Gsto1; cathepsin-encoding genes such as Ctsl and Ctsb, which have been reported to potentially mediate liver fibrosis75,76,77; and several cytochrome P450 genes whose expression was increased during liver injury and fibrosis, including Cyp17a178, Cyp2b979,80, and Cyp2b1064,79. In contrast, major urinary protein-encoding genes, carboxylesterase-encoding genes, and Traf5, a protective factor in liver inflammation and hepatic steatosis81,82, were downregulated in Niche 10 (Supplementary Fig. S27e). Consistent with the differential expression analysis results, gene set enrichment analysis (GSEA) confirmed that the up-regulated genes in Niche 10 were enriched in mTORC1 signaling, interferon response, inflammatory response, fibrosis-related, and apoptosis pathways (Fig. 8e). In particular, the mTORC1 signaling pathway, known as a vital pathway for homeostasis, metabolism, transplantation, and regeneration in the liver83,84,85, has been reported to induce pronounced and extensive hepatocyte damage when activated64,86,87. 
The results of CellChat88 also confirmed the enhanced interactions related to inflammation and fibrosis between cells in Niche 10 compared to Niche 1 (Supplementary Fig. S28). Interestingly, we further found that the differentially expressed genes reflecting phenotypic changes during liver injury between Niche 1 and Niche 10 exhibited spatial expression gradients similar to those of pericentral or periportal markers (Fig. 8f), suggesting that scNiche is also capable of precisely deciphering subtle spatial variation trends among different niches in either normal or disease states.
Discussion
In this study, we have presented a computational framework, scNiche, for identifying and characterizing cell niches from spatial omics data at single-cell resolution. scNiche employs a different approach to utilizing graph neural networks compared to previous deep learning-based methods, which typically run graph neural networks on a spatial graph constructed by integrating molecular profiles of the cell with spatial information. Specifically, scNiche first constructs a separate graph for the features from each view of the cell, and then integrates them by a multiple graph autoencoder model coupled with a graph fusion network. This approach provides greater flexibility in niche modeling while more comprehensively considering the common and complementary information from multiple views of cells. Additionally, scNiche applies a multi-view mutual information maximization module to guide the learning of more discriminative and clustering-friendly joint representations. Benchmarking studies demonstrated the superior performance of scNiche compared to other existing methods on different spatial omics datasets, including spatial transcriptomics and spatial proteomics.
The batch training strategy of scNiche enables its scalability to large-scale spatial omics datasets containing multiple samples under different conditions to identify homogeneous or heterogeneous cell niches across multiple samples, without compromising on accuracy. Our results on the mouse whole brain dataset containing over 3 million cells effectively indicate the potential of scNiche in this regard. Furthermore, the results on the human TNBC dataset and the mouse liver dataset also convincingly demonstrate the universality of scNiche in identifying refined patient- or disease-specific cell niches. In the former analysis, we deciphered the heterogeneity of cell niches within different TNBC subtypes, and identified patient-specific niches that exhibited distinct phenotypic characteristics. In the latter analysis, we discovered three liver injury-associated niches characterized by the enrichment and co-localization of inflamed macrophages, HSC-A, HPC, and injured hepatocytes. Furthermore, we also revealed the specific remodeling of niches located proximal to the portal node during liver injury.
scNiche also implements an integrated downstream analytical framework for the comprehensive characterization and interpretation of identified cell niches. The enrichment analysis framework of scNiche allows for the comprehensive characterization of identified cell niches from various perspectives (including cellular compositions, conditions, samples, etc.). The multi-sample analysis framework of scNiche allows for differential analyses at the sample scale, such as the comparison of specific niches across different conditions, or the comparison of specific cell populations across different niches, which holds the promise of identifying clinically relevant key niches or cell populations from large-scale datasets while avoiding the influence of individual outliers. On the other hand, benefiting from its modular architecture, scNiche can be conveniently compatible and integrated with other computational tools. For example, in the analysis of the mouse liver dataset in this study, we first applied scVI65 to remove batch effects before employing scNiche to identify cell niches. Similarly, in subsequent downstream analysis, we also performed the spatial connectivity analysis among different niches facilitated by Squidpy89 apart from the workflow provided by scNiche itself.
We also note several considerations regarding the “cellular compositions of neighborhoods” view. First, the ablation results of each view on the simulated datasets indicated that this view seemed to contribute less to the accurate identification of cell niches compared to the other two views (Supplementary Table 1). This is expected, as cell types are typically inferred from the molecular profiles of cells; therefore, the “cellular compositions of neighborhoods” view may be just a coarser version of the “molecular profiles of neighborhoods” view. Nevertheless, we found that the performance of scNiche consistently declined across the simulated datasets as well as other biological datasets when this view was removed (Supplementary Fig. S29), suggesting that this view, as an expert-based feature, may help to identify niches more accurately to a certain extent by reducing the potential noise that exists in the original molecular profiles. Second, although our results on both the simulated datasets and the human UTUC IMC dataset showed that scNiche was relatively robust to the dropout and granularity of cell types (Supplementary Fig. S3b, 9b), the quality of the cell type labels still needs to be assessed to avoid introducing additional noise during the subsequent integration of multi-view features. For example, accurate expert-annotated or expert-verified cell type labels typically provide more useful information compared with annotations that are just derived from a clustering algorithm. Finally, cell type labels are usually unavailable for the spatial transcriptomics data generated by platforms with a lower resolution such as ST36 and 10X Visium. To address this issue, features from other views can be used as substitutes, such as the cell type deconvolution results of spots (which can be inferred through a series of spot deconvolution methods39,90,91) or the histological information extracted from H&E staining images. 
Benchmarking studies on the human DLPFC 10X Visium dataset suggested that scNiche, using the deconvolution results of spots inferred by Cell2location39 as alternative inputs, performed comparably to other state-of-the-art methods (Supplementary Fig. S12). However, users are advised to test different deconvolution methods to obtain optimal niche identification results in practice. Additionally, it is worth noting that scNiche may still be limited in accurately resolving sufficiently fine-grained cellular microenvironments on low-resolution spatial transcriptomics data due to the resolution constraints of the spot arrays. Another alternative strategy is to first employ single-cell spatial mapping92,93,94 or reconstruction95,96 methods to generate spatial coordinates for cells, and subsequently apply scNiche to the reconstructed spatially resolved single-cell data, which may effectively overcome the inherent limitations of technical platforms.
Finally, the strategy of scNiche for modeling features from different views of the cell offers more possible avenues for expansion, such as application to spatial multi-omics data. We tested this on a postnatal day (P)22 mouse brain coronal section dataset generated by Zhang et al.97, which includes RNA-seq and CUT&Tag (acetylated histone H3 Lys27 (H3K27ac) histone modification) modalities. As shown in Supplementary Fig. S30, scNiche achieved clearer brain region identification compared to the single-modality results provided by the original authors. In summary, scNiche offers an accurate and scalable approach to identify and characterize the cell niches in tissues, with great potential for expanding to larger and more complex datasets.
Methods
Data collection and preprocessing
Two multi-condition scRNA-seq datasets used for constructing simulated data by scCube: The human PBMCs dataset98 (eight control vs. eight IFN-β treated samples) and the mouse cortex dataset99 (four vehicle vs. four LPS treated samples) were downloaded using the ExperimentHub100 R package. We performed normalization and principal components analysis (PCA) dimensionality reduction steps on the data using the scanpy101 Python package (version 1.9.1) before running scNiche.
Mouse spleen CODEX dataset12: Raw data were downloaded from https://data.mendeley.com/datasets/zjnpwh8m5b/1. The compartment labels of all cells from three wild-type spleen samples (BALBc-1, BALBc-2, and BALBc-3) were downloaded from https://github.com/huBioinfo/CytoCommunity. We did not perform the dimensionality reduction step and retained all proteins for running scNiche.
Human upper tract urothelial carcinoma (UTUC) IMC dataset33: Processed h5ad files that contain raw data were downloaded from https://doi.org/10.5281/zenodo.6376766. A total of 16 images with manually annotated topological domain labels were utilized. We did not perform the dimensionality reduction step and retained all proteins for running scNiche.
Mouse V1 neocortex STARmap dataset34: Raw data were downloaded from https://zenodo.org/record/7830764#.ZDpObi-1HUI. One slice replicate was utilized. We performed normalization and PCA dimensionality reduction steps on the data using the scanpy101 Python package (version 1.9.1) before running scNiche.
Mouse frontal cortex and striatum MERFISH dataset35: Processed h5ad files were downloaded from CELLxGENE (https://cellxgene.cziscience.com/collections/31937775-0602-4e52-a799-b6acdd2bac2e). A total of 31 tissue slices were utilized. We retained 7 major niches shared across all slices: striatum, cortical layer VI, cortical layer V, corpus callosum, cortical layer II/III, olfactory region, and pia mater. In addition, since the data have been normalized by the original authors, we directly performed the PCA dimensionality reduction step using the scanpy101 Python package (version 1.9.1) before running scNiche.
Human middle temporal gyrus (MTG) snRNA-seq dataset38: Raw data were downloaded from https://portal.brain-map.org/atlases-and-data/rnaseq/human-mtg-smart-seq, containing 15,928 cells of 75 transcriptomically distinct cell types.
Human dorsolateral prefrontal cortex (DLPFC) 10X Visium dataset37: Raw data were downloaded from http://spatial.libd.org/spatialLIBD/. A total of 4 tissue slices from the same donor (slices 151673, 151674, 151675, and 151676) were utilized. Before running scNiche, we performed the normalization step using the scanpy101 Python package (version 1.9.1) and subset the data to the top 2000 highly variable genes (HVGs). Subsequently, we performed the dimensionality reduction and batch effect removal steps on the data using the scvi65 Python package (version 1.1.2).
Mouse whole brain (ABCA-1) MERFISH dataset40: Processed h5ad files were downloaded from CELLxGENE (https://cellxgene.cziscience.com/collections/0cca8620-8dee-45d0-aef5-23f032a5cf09). A total of 129 coronal sections were utilized after removing the sections that were not registered to the Allen CCFv341. Since the data have been normalized by the original authors, we directly performed the PCA dimensionality reduction step using the scanpy101 Python package (version 1.9.1) before running scNiche.
Human triple-negative breast cancer (TNBC) MIBI-TOF dataset13: Raw data were downloaded from Spatial Omics DataBase102 (https://gene.ai.tencent.com/SpatialOmics/dataset?datasetID=47). A total of 19 mixed subtype samples and 15 compartmentalized subtype samples were utilized. We did not perform the dimensionality reduction step and retained all proteins for running scNiche.
Mouse liver Seq-Scope dataset10: Processed RDS files that contain the raw gene expression matrix and cell type annotation information were downloaded from Deep Blue Data (https://doi.org/10.7302/cjfe-wa35). A total of 6 normal donors and 4 early-onset liver failure donors were utilized. Before running scNiche, we performed the normalization step using the scanpy101 Python package (version 1.9.1) and subset the data to the top 2000 highly variable genes (HVGs). Subsequently, we performed the dimensionality reduction and batch effect removal steps on the data using the scvi65 Python package (version 1.1.2).
Mouse brain spatial CUT&Tag–RNA-seq dataset97: Processed h5ad files were downloaded from https://zenodo.org/records/10362607. Since the data have been processed by the authors, we directly used the low-dimensional representations of RNA (reduced by PCA) and CUT&Tag (reduced by latent semantic indexing) modalities.
The detailed description of each dataset is summarized in Supplementary Data 1.
Design of scNiche
Model overview
The scNiche model is developed based on the multi-view clustering method proposed by Wang et al.103 and consists of three components: M-GAE, GFN, and MMIM. Importantly, we extend the original framework in the following ways to better adapt it to spatial omics data: (1) we expand the M-GAE model to allow the creation of graph convolutional layers for each view in an extensible manner, replacing the design of a fixed number of views in the original framework. This optimization greatly enhances the flexibility of niche modeling with different numbers or combinations of views; (2) unlike the sequential architecture of M-GAE and GFN in the original framework, we develop an improved model architecture that couples these two components, and optimize the corresponding training process so that the parameters of both M-GAE and GFN can be updated simultaneously during training, enhancing the synergy between them; and (3) we develop a subgraph-based batch training strategy to adapt to the increasing size of spatial omics data, enabling scNiche to scale to large datasets.
scNiche initially extracts the multi-view features of cells within a pre-defined neighborhood range from the given spatial omics data and constructs graphs corresponding to each view. Notably, the extracted features can be processed with dimensionality reduction and batch effect removal steps as needed before graph construction. Subsequently, scNiche applies the coupled neural network architecture of M-GAE and GFN to integrate information from multiple views and learn a joint representation. Furthermore, the MMIM module is introduced to guide the learning of a more clustering-friendly joint representation. Finally, scNiche uses the k-means algorithm on the learned joint representation to identify cell niches, although other unsupervised clustering methods such as the Leiden30 algorithm are also provided. Below we describe each step of the workflow of scNiche in detail.
Cellular neighborhoods determination
scNiche applies the k-nearest neighbors algorithm to determine the size of cellular neighborhoods. We evaluated the robustness of scNiche to the different values of k on the simulated and biological datasets (Supplementary Fig. S15-16).
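As an illustration of this step, the following is a minimal sketch (not the scNiche implementation) of how a k-nearest-neighbor neighborhood can be derived from cell coordinates and summarized as a per-cell cell-type composition. The function names `knn_neighborhoods` and `neighborhood_composition`, the brute-force distance search, and the toy data are assumptions made purely for illustration.

```python
import math
from collections import Counter

def knn_neighborhoods(coords, k):
    """Return, for each cell, the indices of its k nearest cells (excluding itself)."""
    neighborhoods = []
    for i, p in enumerate(coords):
        # Brute-force distances to every other cell; adequate for a toy example only.
        dists = [(math.dist(p, q), j) for j, q in enumerate(coords) if j != i]
        dists.sort()
        neighborhoods.append([j for _, j in dists[:k]])
    return neighborhoods

def neighborhood_composition(neighborhoods, cell_types):
    """Fraction of each cell type among the neighbors of each cell."""
    labels = sorted(set(cell_types))
    comps = []
    for nbrs in neighborhoods:
        counts = Counter(cell_types[j] for j in nbrs)
        comps.append([counts[t] / len(nbrs) for t in labels])
    return labels, comps

# Hypothetical toy data: two spatially separated groups of three cells each.
coords = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
cell_types = ["tumor", "tumor", "immune", "stroma", "stroma", "immune"]

neighborhoods = knn_neighborhoods(coords, k=2)
labels, comps = neighborhood_composition(neighborhoods, cell_types)
```

A larger k widens each neighborhood and smooths the composition vectors, which is the trade-off the robustness analysis over k examines. For datasets of realistic size, a spatial index would replace the quadratic search used here.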
Dimensionality reduction and batch effect removal
In contrast to spatial proteomics data, which usually contain only a few dozen proteins, spatial transcriptomics data can often measure hundreds to thousands of genes, with potential batch effects commonly present across tissue slices from different samples. Therefore, dimensionality reduction and batch effect removal need to be performed on the molecular profiles of the cells and their neighborhoods before multi-view feature fusion. Additionally, considering that the number of genes measured in spatial transcriptomics data usually far exceeds the number of cell types that exist, this preprocessing step also helps balance the dimensionality of features across different views, allowing for more accurate niche identification (Supplementary Fig. S31). We use scVI65 by default to perform dimensionality reduction and batch effect removal. However, simple PCA dimensionality reduction or other deep learning-based integration methods like scArches104 are also applicable.
Multiple graph auto-encoder
To learn the joint representation that combines the common and complementary information from multiple views, scNiche applies a multiple graph autoencoder (M-GAE) model consisting of a multi-graph attention fusion encoder based on the GCN105 and view-specific decoders.
In the multi-graph attention fusion encoder, we use V view-specific GCN layers as the first layer. Given the multi-view features \(\mathscr{X}=\{X^{(v)}\}_{v=1}^{V}\) and the corresponding graphs \(\mathscr{A}=\{A^{(v)}\}_{v=1}^{V}\), where \(X^{(v)}\in\mathbb{R}^{N\times F}\) is the feature matrix of the v-th view with N nodes and F features, and \(A^{(v)}\in\mathbb{R}^{N\times N}\) is the graph of the v-th view, the v-th view-specific representations \(Z_{(1)}^{(v)}\) learned by the first layer can be obtained as follows:
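Assuming the standard GCN propagation rule, which is consistent with the symbols defined immediately afterwards, the first-layer update of the v-th view would take the form:

\[
Z_{(1)}^{(v)}=\delta\left(\left(\widetilde{D}^{(v)}\right)^{-\frac{1}{2}}\widetilde{A}^{(v)}\left(\widetilde{D}^{(v)}\right)^{-\frac{1}{2}}X^{(v)}W_{(1)}^{(v)}\right)
\]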
where δ(·) is the activation function, \(\widetilde{A}=A+I\) with I the identity matrix, and \(\widetilde{D}=\mathrm{diag}(\sum_{j}\widetilde{A}_{ij})\) is the degree matrix of \(\widetilde{A}\). \(W_{(1)}^{(v)}\in\mathbb{R}^{F\times d_{1}}\) is the parameter matrix of the v-th view learned by the GCN layers, with \(d_{1}\) being the output dimension of the GCN layers.
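The quantities just defined can be made concrete with a small dependency-free sketch of one graph convolution on a dense adjacency matrix. It assumes the standard symmetric normalization and a ReLU activation for δ; the helper names (`matmul`, `normalize_adjacency`, `gcn_layer`) and the toy graph are hypothetical and not taken from the scNiche code base.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def normalize_adjacency(adj):
    """Compute D~^{-1/2} (A + I) D~^{-1/2} for a dense adjacency matrix."""
    n = len(adj)
    # Add self-loops: A~ = A + I.
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    # Row sums of A~ give the diagonal of the degree matrix D~.
    deg = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def gcn_layer(adj, features, weights):
    """One GCN layer with ReLU: relu(A_hat X W)."""
    a_hat = normalize_adjacency(adj)
    pre_activation = matmul(matmul(a_hat, features), weights)
    return [[max(0.0, v) for v in row] for row in pre_activation]

# Toy example: a 3-node path graph, F = 2 input features, d1 = 2 outputs.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [[1.0, -1.0], [0.5, 2.0]]

a_hat = normalize_adjacency(adj)
z1 = gcn_layer(adj, features, weights)
```

In a multi-view setting, one such layer with its own weight matrix would be applied per view, each on that view's graph and feature matrix.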
To adaptively fuse the representations of a sample across different views, we introduce an attention coefficient matrix \({W}_{a}^{(v)}\) to learn the importance of each view. This allows for a weighted combination of view-specific representations, leading to a more informative common representation. The operation to compute this joint representation, denoted as \({Z}_{(2)}\), is defined by the following equation:
where \(W_{(2)}^{(v)}\in\mathbb{R}^{d_{1}\times d_{2}}\) is the parameter matrix learned by the GCN layers and \(W_{a}^{(v)}\in\mathbb{R}^{d_{2}\times d_{2}}\) is the attention coefficient matrix, with \(d_{2}\) being the output dimension of the GCN layers.
We then continue to use the GCN layers to apply convolution over the obtained joint representation \(Z_{(2)}\) and the consensus graph \(A^{*}\) learned by the graph fusion network, and the final joint representation Z can be obtained as follows:
where \(W_{(3)}\in\mathbb{R}^{d_{2}\times d_{3}}\) is the parameter matrix learned by GCN layers, with \(d_{3}\) being the output dimension for GCN layers.
In the view-specific decoders, we use the inner-product as the decoder to reconstruct the multi-view graphs from the joint representation Z:
where \(W^{(v)}\in\mathbb{R}^{d_{3}\times d_{3}}\) is the parameter matrix learned by the v-th view-specific decoder.
In order to minimize the reconstruction error between the original graph \({A}^{\left(v\right)}\) and the reconstruction graph \({\hat{A}}^{(v)}\) of each view, the loss of the multiple graph autoencoder is defined as:
The loss function to be optimized is binary cross entropy (BCE) loss.
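To make the reconstruction objective concrete, the sketch below assumes a sigmoid-activated bilinear inner-product decoder, Â = sigmoid(Z W Zᵀ). This form matches the d3 × d3 shape of the decoder parameter matrix but is an assumption rather than a quotation of the original equation. The reconstruction loss is taken as the sum of per-view BCE losses. All function names and toy values are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_graph(z, w):
    """Reconstruct a graph as sigmoid(Z W Z^T) (assumed decoder form)."""
    z_t = [list(col) for col in zip(*z)]
    logits = matmul(matmul(z, w), z_t)
    return [[sigmoid(v) for v in row] for row in logits]

def bce_loss(adj, adj_hat, eps=1e-12):
    """Mean binary cross-entropy between a 0/1 graph and its reconstruction."""
    total, count = 0.0, 0
    for row, row_hat in zip(adj, adj_hat):
        for a, p in zip(row, row_hat):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total -= a * math.log(p) + (1.0 - a) * math.log(1.0 - p)
            count += 1
    return total / count

def reconstruction_loss(graphs, z, decoder_weights):
    """Sum of per-view BCE losses, one decoder weight matrix per view."""
    return sum(bce_loss(a, decode_graph(z, w))
               for a, w in zip(graphs, decoder_weights))

# Toy joint representation: cells 0 and 1 are similar, cell 2 is different.
z = [[2.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
w = [[1.0, -1.0], [-1.0, 1.0]]
adj = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]

adj_hat = decode_graph(z, w)
loss = bce_loss(adj, adj_hat)
two_view_loss = reconstruction_loss([adj, adj], z, [w, w])
```

In this toy case the embedding separates the two groups, so the reconstructed edge probabilities agree with the input graph and the loss is small; a graph with the opposite edge pattern would incur a much larger loss.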
Graph fusion network
In the scNiche framework, we introduce an additional graph fusion network (GFN) in the M-GAE model to learn the consensus graph \(A^{*}\), which contains the global adjacency relations of graphs from different views (i.e., the global node relationships). Notably, the information is shared between M-GAE and GFN during training. The GFN is a two-layer fully connected model, where the first layer is followed by a ReLU activation. The consensus graph learned by the l-th layer can be described as:
where δ(·) is the activation function, \(W_{{GFN}(l)}\in\mathbb{R}^{d_{l}\times d_{l-1}}\) and \(b_{{GFN}(l)}\in\mathbb{R}^{d_{l}}\) are the weight matrix and bias of the l-th layer, respectively, with \(d_{l}\) being the output dimension for layer l. The initial input \(G_{0}\) to the GFN is defined as:
where \(A^{(v)}\) is the graph from the v-th view, and V is the number of views.
The GFN’s goal is to ensure that the final consensus graph \({A}^{*}\) integrates the information from each individual graph \({A}^{(v)}\) comprehensively. The optimization of the GFN involves a loss function that minimizes the discrepancy between the individual graphs from each view \({A}^{(v)}\) and the consensus graph \({A}^{*}\):
This loss function is typically a mean squared error (MSE) loss, focusing on reducing errors between the graphs from each view and the synthesized consensus graph.
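The GFN forward pass and its loss can be illustrated with a small pure-Python sketch. Several details are assumptions made only for this example: the view graphs are combined into \({G}_{0}\) by an element-wise mean, each layer has the form δ(W G + b), the MSE terms are summed over views, and the weights are fixed to the identity instead of being learned. All function names are hypothetical.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def relu(x):
    return max(0.0, x)


def gfn_layer(G_prev, W, b, activation=None):
    """One fully connected layer: G_l = activation(W @ G_prev + b)."""
    out = matmul(W, G_prev)
    out = [[v + b[i] for v in row] for i, row in enumerate(out)]
    if activation is not None:
        out = [[activation(v) for v in row] for row in out]
    return out


def mean_graph(graphs):
    """ASSUMPTION: form the initial input G_0 as the element-wise mean of the views."""
    V = len(graphs)
    n_rows, n_cols = len(graphs[0]), len(graphs[0][0])
    return [[sum(g[i][j] for g in graphs) / V for j in range(n_cols)]
            for i in range(n_rows)]


def mse(A, B):
    """Mean squared error between two equally shaped matrices."""
    n = sum(len(row) for row in A)
    return sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb)) / n


def gfn_loss(graphs, A_star):
    """Sum over views of the MSE between each view graph and the consensus graph."""
    return sum(mse(A, A_star) for A in graphs)


# Two toy views over two cells; identity weights stand in for learned parameters.
A1 = [[1.0, 1.0], [1.0, 1.0]]
A2 = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
zero_bias = [0.0, 0.0]

G0 = mean_graph([A1, A2])
G1 = gfn_layer(G0, identity, zero_bias, activation=relu)  # first layer + ReLU
A_star = gfn_layer(G1, identity, zero_bias)               # second layer
loss = gfn_loss([A1, A2], A_star)
```

With identity weights the consensus graph equals the mean of the two views, which sits halfway between them on the edges where they disagree.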
Multi-view mutual information maximization
Mutual information is a Shannon entropy-based measure of dependence between random variables106. Recent studies have revealed that maximizing the mutual information between input samples and learned latent representations contributes to the learning of useful representations by the models (such as the encoder)107. Given the input samples \(X={\left\{{x}^{\left(i\right)}\right\}}_{i=1}^{N}\) and the corresponding representations \(Y={\left\{{y}^{\left(i\right)}\right\}}_{i=1}^{N}\), the mutual information between X and Y can be expressed as:
Based on the assumption by Wang et al.103 that if two samples x and \({x}^{{\prime} }\) are close in any view, their corresponding representations z and \({z}^{{\prime} }\) should also be close in the common latent view, we applied their multi-view mutual information maximization (MMIM) module to guide the learning of clustering-friendly joint representations. Specifically, the MMIM module guides the coupled M-GAE and GFN model to generate more useful joint representations for each cell by increasing the similarity of the joint representations of two cells that are similar to each other in any view, which in turn makes subsequent cell niche identification more accurate. Because larger mutual information indicates that the representations are more similar, the optimization objective of the MMIM module can be expressed as:
where Z and \({Z}^{{\prime} }\) are the corresponding representations of the samples X and their nearest neighbors \({X}^{{\prime} }\), respectively.
According to Eqs. (9) and (10), the loss of the MMIM module can be written as:
Since the Kullback–Leibler (KL) divergence is unbounded, we use the Jensen–Shannon (JS) divergence instead of the KL divergence in the mutual information, and Eq. (11) can be converted to:
The JS divergence, as a specific case of the f-divergences108, is challenging to compute directly in practice. We thus utilize the variational lower bound on the f-divergences \({{{\mathcal{D}}}}_{f}\left({P||Q}\right)\)108 to estimate a generative model Q given the true distribution P. In this approach, we adopt the generative-adversarial network methodology, employing two neural networks: Q and T. Here, Q is our generative model that outputs a sample of interest from a random vector input, and T is the variational function that evaluates these samples. The variational estimation of f-divergences is defined as:
where \(p(x)\) and \(q(x)\) are the probability density functions of the true distribution P and the estimated distribution Q, respectively, with Q being parameterized by the generative model in the GAN framework. The function \(f(u)\) and its conjugate \(g(t)\) dictate the specific form of divergence being measured108:
This framework facilitates the calculation of the JS divergence as follows:
Let \(T\left(x\right)=\log [2D(x)]\), where \(D(x)\) is a variational function related to \(T(x)\) through this simple transformation. The transformation simplifies the optimization by yielding more tractable gradients and does not alter the intrinsic nature of \(T\left(x\right)\). Eq. (15) can then be converted to:
In our loss function, \(p\left(z^{\prime} {|z}\right)p\left(z\right)\) and \(p\left(z^{\prime} \right)p\left(z\right)\) are used to replace \(p\left(x\right)\) and \(q\left(x\right)\), and the loss of the MMIM module can be rewritten as:
The problem in Eq. (17) can be solved using negative sample estimation107. \(D\left(z,{z}^{{\prime} }\right)\) is a discriminator that distinguishes negative sample pairs from positive sample pairs in order to estimate the distribution of positive samples. Positive sample pairs are composed of the latent representations of a sample x and its nearest neighbor \({x}^{{\prime} }\) in any view, and negative sample pairs are composed of the latent representations of the sample x and random samples outside its nearest neighbors. The nearest neighbors of each sample are identified by the k-nearest neighbors algorithm.
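The positive/negative pair construction can be made concrete with a small sketch. It assumes the GAN-style form of the JS-based estimator, i.e., a loss of −E_pos[log D(z, z′)] − E_neg[log(1 − D(z, z′))], uses a single view to find neighbors, and replaces the learned discriminator with a hypothetical `dot_discriminator` (sigmoid of a dot product). None of these choices is confirmed by the text; they only illustrate the sampling scheme.

```python
import math
import random


def sigmoid(x):
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def knn_indices(X, k):
    """Indices of the k nearest neighbours (Euclidean) of each row, excluding itself."""
    out = []
    for i, xi in enumerate(X):
        dists = sorted((math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i)
        out.append([j for _, j in dists[:k]])
    return out


def dot_discriminator(z, z_other):
    """HYPOTHETICAL discriminator: sigmoid of the dot product of two embeddings."""
    return sigmoid(sum(a * b for a, b in zip(z, z_other)))


def mmim_loss(Z, neighbours, rng, D=dot_discriminator, eps=1e-12):
    """-E_pos[log D] - E_neg[log(1 - D)] with one random negative per positive pair."""
    n = len(Z)
    pos_terms, neg_terms = [], []
    for i in range(n):
        # Negatives are drawn from samples outside the neighbourhood of cell i.
        outside = [j for j in range(n) if j != i and j not in neighbours[i]]
        for j in neighbours[i]:
            pos_terms.append(math.log(D(Z[i], Z[j]) + eps))
            if outside:
                m = rng.choice(outside)
                neg_terms.append(math.log(1.0 - D(Z[i], Z[m]) + eps))
    pos_mean = sum(pos_terms) / len(pos_terms)
    neg_mean = sum(neg_terms) / len(neg_terms) if neg_terms else 0.0
    return -(pos_mean + neg_mean)


# One toy view with two well-separated groups of cells.
X = [[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]]
neighbours = knn_indices(X, k=1)

# Embeddings that respect the neighbourhoods versus embeddings that do not.
Z_good = [[2.0, 0.0], [2.0, 0.2], [-2.0, 0.0], [-2.0, 0.2]]
Z_bad = [[2.0, 0.0], [-2.0, 0.0], [2.0, 0.2], [-2.0, 0.2]]
loss_good = mmim_loss(Z_good, neighbours, random.Random(0))
loss_bad = mmim_loss(Z_bad, neighbours, random.Random(0))
```

Embeddings that keep neighboring cells close and non-neighbors apart give a much lower loss, which is the behavior the MMIM module is meant to encourage.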
Loss function of scNiche
The total loss function of scNiche is defined as:
where \({\lambda }_{1}\) and \({\lambda }_{2}\) are hyperparameters that balance the three parts of the loss function. By default, \({\lambda }_{1}={\lambda }_{2}=1\).
Batch training strategy
We develop a subgraph-based batch training strategy that enables scNiche to scale to large datasets and multiple slices. Specifically, after extracting multi-view features of cells, we do not directly construct the corresponding graphs with all cells, which would result in insufficient memory due to the excessive number of nodes and edges on the entire graph. As an alternative, we initially divide the entire dataset into several non-overlapping subsets using a random sampling strategy, and subsequently construct corresponding graphs for each subset, which are referred to as subgraphs. Next, we employ the batched graph data loader facilitated by DGL109 for batch-iterating over this set of subgraphs to generate the batched graph of each batch for model training. Considering the sharp reduction in the number of nodes and edges on each subgraph compared to the entire graph, this batch training strategy effectively avoids the out-of-memory limitation. We evaluated the robustness of scNiche’s batch training strategy to the different batch number settings on the mouse spleen CODEX dataset (Supplementary Fig. S4).
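The partitioning step of this strategy can be sketched as follows. The example is a simplified stand-in: it splits cell indices into non-overlapping random subsets and builds a k-nearest-neighbor graph on toy spatial coordinates within each subset, whereas scNiche builds its graphs from multi-view features and iterates over them with DGL's batched graph data loader, which is not reproduced here. The function names are hypothetical.

```python
import math
import random


def split_into_subsets(n_cells, n_batches, seed=0):
    """Randomly partition cell indices into non-overlapping subsets."""
    rng = random.Random(seed)
    order = list(range(n_cells))
    rng.shuffle(order)
    # Striding through the shuffled order yields disjoint, near-equal subsets.
    return [sorted(order[b::n_batches]) for b in range(n_batches)]


def knn_subgraph(coords, subset, k):
    """Directed k-nearest-neighbour edges among the cells of one subset only."""
    edges = []
    for i in subset:
        dists = sorted((math.dist(coords[i], coords[j]), j) for j in subset if j != i)
        edges.extend((i, j) for _, j in dists[:k])
    return edges


# Twenty toy cells on a 5 x 4 grid, split into four subgraphs.
coords = [(float(i % 5), float(i // 5)) for i in range(20)]
subsets = split_into_subsets(len(coords), n_batches=4, seed=0)
subgraphs = [knn_subgraph(coords, s, k=2) for s in subsets]
```

Each subgraph contains only edges between cells of the same subset, so the node and edge counts held in memory at any time scale with the subset size and not with the full dataset.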
Clustering
We employ the unsupervised clustering algorithm k-means by default on the learned joint representation to identify the cell niches. If the target number of clusters is not provided, we identify the optimal candidates for K based on the clustering stability approach proposed by Varrone et al.22. In brief, for each K within the specified range, we execute a single clustering run with K clusters. Subsequently, we calculate the average Fowlkes–Mallows Index110 (FMI) between the clusterings at K-1 and K, and between those at K and K+1. A higher average FMI indicates a higher similarity between clustering solutions with consecutive cluster numbers, i.e., the clustering results are more stable.
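The stability criterion can be illustrated with a short sketch. The FMI below follows its standard pair-counting definition; the label vectors are made-up stand-ins for k-means results at each K, and the function names are hypothetical.

```python
import math
from collections import Counter


def fowlkes_mallows(labels_a, labels_b):
    """Fowlkes-Mallows index between two labelings of the same cells."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))
    # Each quantity below equals twice a number of co-clustered pairs.
    tk = sum(c * c for c in joint.values()) - n
    pk = sum(c * c for c in Counter(labels_a).values()) - n
    qk = sum(c * c for c in Counter(labels_b).values()) - n
    if pk == 0 or qk == 0:
        return 0.0
    return tk / math.sqrt(pk * qk)


def stability_scores(clusterings):
    """Mean FMI of the clustering at K with those at K-1 and K+1.

    `clusterings` maps K to a label list; only K with both neighbours is scored.
    """
    scores = {}
    for k in sorted(clusterings):
        if k - 1 in clusterings and k + 1 in clusterings:
            left = fowlkes_mallows(clusterings[k - 1], clusterings[k])
            right = fowlkes_mallows(clusterings[k], clusterings[k + 1])
            scores[k] = 0.5 * (left + right)
    return scores


# Made-up labelings of six cells for K = 2..5 (stand-ins for k-means output).
clusterings = {
    2: [0, 0, 0, 1, 1, 1],
    3: [0, 0, 0, 1, 1, 2],
    4: [0, 0, 1, 2, 2, 3],
    5: [0, 1, 2, 3, 3, 4],
}
scores = stability_scores(clusterings)
best_k = max(scores, key=scores.get)
```

Values of K at the ends of the tested range have only one neighboring solution and are therefore left unscored in this sketch.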
Enrichment analysis of cell niches
We apply a general enrichment analysis framework that can characterize the identified cell niches from various perspectives (including cellular compositions, conditions, and samples, etc.) and compute the corresponding enrichment scores. Taking cellular composition as an example, given cells belonging to S samples, N identified cell niches, and M cell populations \(C=\left\{{c}^{(s,n,m)}|1\le s\le S,1\le n\le N,1\le m\le M\right\}\), we first compute the observed value of the proportion of the cell population m within the cell niche n in each sample s:
and the expected value of the proportion of the cell population m within the cell niche n in each sample s is defined as:
We then define the enrichment score of cell population m within the cell niche n across S samples as follows:
The P-value of \({{ES}}^{(n,m)}\) can be computed with the one-sided Mann-Whitney U test if requested.
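The observed-versus-expected comparison can be sketched as below. The exact definitions are those given by the equations of this section and are not reproduced here; this example assumes that the observed value is the fraction of cells in niche n of sample s that belong to population m, that the expected value is the overall fraction of population m in sample s, and that the score is the mean log2 ratio across samples. These three choices, the pseudocount, and the function name are assumptions for illustration only, and the Mann-Whitney U test is omitted.

```python
import math
from collections import Counter


def enrichment_scores(cells, pseudo=1e-9):
    """Illustrative enrichment of each population within each niche.

    `cells` is a list of (sample, niche, population) triples, one per cell.
    """
    cells = list(cells)
    n_snm = Counter(cells)
    n_sn = Counter((s, n) for s, n, _ in cells)
    n_sm = Counter((s, m) for s, _, m in cells)
    n_s = Counter(s for s, _, _ in cells)
    niches = sorted({n for _, n, _ in cells})
    populations = sorted({m for _, _, m in cells})

    scores = {}
    for n in niches:
        for m in populations:
            ratios = []
            for s in n_s:
                if n_sn[(s, n)] == 0 or n_sm[(s, m)] == 0:
                    continue  # niche or population absent from this sample
                observed = n_snm[(s, n, m)] / n_sn[(s, n)]  # ASSUMED definition
                expected = n_sm[(s, m)] / n_s[s]            # ASSUMED definition
                ratios.append(math.log2((observed + pseudo) / (expected + pseudo)))
            if ratios:
                scores[(n, m)] = sum(ratios) / len(ratios)
    return scores


# One toy sample: niche "A" is T-cell rich, niche "B" is B-cell rich.
cells = (
    [("s1", "A", "T")] * 3 + [("s1", "A", "B")] * 1
    + [("s1", "B", "T")] * 1 + [("s1", "B", "B")] * 3
)
scores = enrichment_scores(cells)
```

Under these assumptions a positive score marks a population that is over-represented in a niche relative to its sample-wide frequency, and a negative score marks under-representation.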
Multi-sample analysis framework
For large-scale datasets containing multiple samples under different conditions, scNiche provides a multi-sample analysis framework that enables niche comparisons at the sample scale. Specifically, for each niche within each sample, we compute its cell composition and phenotypic characteristics (defined as the average expression value of all cells belonging to this niche), as well as the proportion and phenotypic characteristics (defined as the average expression value of all cells belonging to this cell population) of specific cell populations within this niche. Subsequently, we can perform differential analyses across the entire sample series, including the comparison of compositions or phenotypes for specific niches between conditions, as well as the comparison of proportions or phenotypes for specific cell populations between niches. The p-value is calculated with the two-sided Mann-Whitney U test. Furthermore, to avoid the effect of outliers, for each niche, only samples with a proportion of that niche exceeding a set threshold (5% by default) are considered.
Simulation experiment setup
We simulated the situation in which the cell niches exhibit heterogeneity in both gene expression and cellular composition among each other (Supplementary Fig. S2a). To achieve this, we generated the simulated data from the multi-condition scRNA-seq datasets following the simulation framework used in scCube31. Specifically, we first generated the proportion and cellular composition of each niche in a randomized manner. If two cell niches were designated to exhibit heterogeneity in gene expression, the cells within these two cell niches were derived from different conditions of scRNA-seq data with similar composition proportions. Conversely, if two cell niches were designated to exhibit heterogeneity in cellular composition, the cells within these two cell niches were derived from the same condition of scRNA-seq data but with different composition proportions. Subsequently, we generated the random spatial patterns for each cell niche with the reference-free strategy of scCube.
We considered four variabilities in the simulated data: continuity of spatial patterns, complexity of niche composition, gene expression dropout, and cell annotation dropout. For the first variability, we generated cell niches with different levels of spatial pattern continuity by setting the parameter δ in scCube to 10, 20, 30, and 50. For the second variability, we generated cell niches with different composition complexity by randomly selecting 2, 3, and 4 cell populations for each cell niche. The last two variabilities corresponded to scenarios of degraded data quality. For the dropout of gene expression, we randomly set the expression values of genes to 0 with the proportions of 0.1, 0.2, 0.4, and 0.8. For the dropout of cell annotation, we randomly altered the cell annotation labels to “ambiguous” with the proportions of 0.1, 0.2, 0.4, and 0.8.
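The two data-degradation scenarios can be sketched as follows. The text does not state whether the dropout proportion is applied exactly or per entry with a given probability; this example assumes an exact proportion of randomly chosen entries, and the function names are hypothetical.

```python
import random


def dropout_expression(matrix, rate, seed=0):
    """Set a fraction `rate` of the entries of a cells x genes matrix to 0."""
    rng = random.Random(seed)
    n_rows, n_cols = len(matrix), len(matrix[0])
    n_drop = round(rate * n_rows * n_cols)
    # ASSUMPTION: an exact number of entries is chosen uniformly at random.
    dropped = set(rng.sample(range(n_rows * n_cols), n_drop))
    return [[0 if (i * n_cols + j) in dropped else v for j, v in enumerate(row)]
            for i, row in enumerate(matrix)]


def dropout_annotation(labels, rate, seed=0):
    """Relabel a fraction `rate` of the cells as "ambiguous"."""
    rng = random.Random(seed)
    n_drop = round(rate * len(labels))
    dropped = set(rng.sample(range(len(labels)), n_drop))
    return ["ambiguous" if i in dropped else lab for i, lab in enumerate(labels)]


# Toy data: 4 cells x 5 genes of non-zero counts, and 10 annotated cells.
expression = [[1, 2, 3, 4, 5] for _ in range(4)]
labels = ["T cell"] * 5 + ["B cell"] * 5

noisy_expression = dropout_expression(expression, rate=0.4)
noisy_labels = dropout_annotation(labels, rate=0.2)
```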
Benchmarking analysis
We compared the performance of scNiche with other existing methods using simulated and biological datasets. The target number of clusters was kept consistent across all methods for each benchmarking dataset, as determined by the ground truth. Furthermore, for methods that require specifying the range of neighborhoods based on the k-nearest neighbors algorithm first, such as scNiche, BANKSY, STAGATE, GraphST, SpaceFlow, and CytoCommunity, we set a consistent value of k for each method (20 for the simulated datasets, 30 for the single cell spatial omics datasets, and 6 for the human DLPFC 10X Visium dataset). Below we describe the application of each method.
scNiche
The preprocessing step of each dataset is described in the “Data collection and preprocessing” section above. For the mouse spleen CODEX dataset, we applied the batch training strategy to run scNiche on both single and multiple slices with the number of batches set to 30 and 100, respectively. For the human UTUC IMC dataset, mouse frontal cortex and striatum MERFISH dataset, and the human DLPFC 10X Visium dataset, we applied the batch training strategy to run scNiche on multiple slices with the number of batches set to 100, 20, and 2, respectively. For other datasets, we directly ran scNiche with the full-graph-based training. The parameter ‘epochs’ was set to 200 for the mouse spleen CODEX dataset and 100 for other datasets; the parameter ‘lr’ was set to 0.01 for all datasets.
DR-SC
DR-SC18 is implemented in the R package DR.SC (version 3.3). The parameter ‘platform’ was set to ‘Visium’ for the human DLPFC 10X Visium dataset and ‘Other_SRT’ for other datasets. All other parameters were set with default values.
BASS
BASS19 is implemented in the R package BASS (version 1.1.0.016). We ran BASS with default parameters except for the parameter ‘C’, which was determined by the number of cell types in each benchmarking dataset.
CellCharter
CellCharter22 is implemented in the Python package cellcharter (version 0.1.2). We followed the instructions in the original article for dimensionality reduction and batch effect removal. Specifically, for the simulated datasets and spatial proteomics datasets, we applied the TRVAE model implemented in the Python package scArches104 (version 0.5.9). For the human DLPFC 10X Visium dataset, we applied the scVI model implemented in the Python package scvi65 (version 1.1.2). For the mouse V1 neocortex STARmap dataset, we applied the PCA dimensionality reduction directly since this dataset only contains one tissue slice. For the mouse frontal cortex and striatum MERFISH dataset, we also applied the PCA dimensionality reduction since this dataset has been normalized by the original authors. All other parameters were set with default values.
SpaGCN
SpaGCN24 is implemented in the Python package SpaGCN (version 1.2.7). We ran SpaGCN with default parameters. The refinement step was also performed.
UTAG
UTAG21 is implemented in the Python package utag (version 0.1.1). We ran UTAG with default parameters except for the parameter ‘max_dist’, which was set to be consistent with the value of k in methods that require specifying the range of neighborhoods based on the k-nearest neighbors algorithm first (20 for the simulated datasets, 30 for the single cell spatial omics datasets, and 6 for the human DLPFC 10X Visium dataset). Additionally, since the Leiden clustering algorithm employed in UTAG does not directly allow for setting the number of clusters, we modified the clustering process by introducing the ‘search_res’ function from SpaGCN24 to determine the clustering resolution first, and subsequently performed Leiden clustering with this resolution.
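The idea of fixing the number of clusters by first searching for a resolution can be illustrated with a simple bisection. This is not SpaGCN's ‘search_res’ implementation; it is a generic sketch that assumes the number of clusters does not decrease as the resolution increases, and `n_clusters_at` is a hypothetical callable standing in for a Leiden or Louvain run at a given resolution.

```python
def find_resolution(n_clusters_at, target, low=0.01, high=5.0, max_iter=50):
    """Bisection search for a resolution that yields `target` clusters.

    Returns the resolution, or None if no match is found within `max_iter` steps.
    """
    for _ in range(max_iter):
        mid = 0.5 * (low + high)
        k = n_clusters_at(mid)
        if k == target:
            return mid
        if k < target:
            low = mid   # too few clusters: increase the resolution
        else:
            high = mid  # too many clusters: decrease the resolution
    return None


def toy_clustering(resolution):
    """Toy stand-in for a community-detection run: more clusters at higher resolution."""
    return int(resolution * 4) + 1


resolution = find_resolution(toy_clustering, target=7)
```

In practice the callable would run the clustering algorithm and return the number of distinct labels, after which the final clustering is performed once at the returned resolution.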
BANKSY
BANKSY23 is implemented in the R package Banksy (version 0.99.13). We followed the tutorials provided by the original authors to set recommended values of the parameter ‘lambda’. Specifically, for the human DLPFC 10X Visium dataset, the parameter ‘lambda’ was set to 0.2. For other datasets, the parameter ‘lambda’ was set to 0.8. All other parameters were set with default values. Additionally, for the human DLPFC 10X Visium dataset, we applied the multi-sample analysis following the tutorial. The clustering process was also modified as described above to determine the clustering resolution first.
STAGATE
STAGATE25 is implemented in the Python package stagate-pyg (version 1.0.0). For the mouse spleen CODEX dataset, we ran STAGATE using the batch training strategy with the parameters ‘num_batch_x’ = 4, ‘num_batch_y’ = 6, and ‘n_epochs’ = 500, and all other parameters were set with default values. For other datasets, we ran STAGATE with default parameters. Additionally, for the mouse V1 neocortex STARmap dataset and the human DLPFC 10X Visium dataset, we applied the ‘mclust’ algorithm in the clustering step as recommended, and for other datasets, we applied the Louvain algorithm with the clustering resolution determined by the modified clustering process described above.
GraphST
GraphST26 is implemented in the Python package GraphST (version 1.1.1). We ran GraphST with default parameters. The refinement step was also performed. Notably, GraphST raised a “CUDA out of memory” error on the mouse spleen CODEX dataset.
SpaceFlow
SpaceFlow27 is implemented in the Python package SpaceFlow (version 1.0.3). We ran SpaceFlow with default parameters. The clustering process was also modified as described above to determine the clustering resolution first.
CytoCommunity
CytoCommunity28 is implemented in the Python package CytoCommunity (version 1.1.0). We applied the unsupervised version of CytoCommunity. For the simulated datasets and the mouse V1 neocortex STARmap dataset, we ran CytoCommunity with default parameters. For the mouse spleen CODEX dataset, the human UTUC IMC dataset, and the mouse frontal cortex and striatum MERFISH dataset, we set the parameter ‘Num_Epoch’ to 100, 500, and 500, respectively, to reduce the training time. For the mouse spleen CODEX dataset, we further set the parameter ‘Loss_Cutoff’ to -0.3 to reduce the training time. All other parameters were set with default values. In addition, we did not run CytoCommunity on the human DLPFC 10X Visium dataset because the current version of CytoCommunity does not support data at non-single-cell resolution.
Evaluation metrics
We used the adjusted Rand index (ARI) and the macro-F1 score to evaluate the performance of each method. The benchmarking was conducted on a computing cluster with 2 AMD EPYC 7K62 CPUs (48 cores each), with approximately 503.65 GB of usable system memory. For GPU-compatible methods, an NVIDIA A10 GPU with 24 GB total memory (approximately 22.5 GB usable) was used.
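For reference, both metrics can be computed from label vectors as sketched below. The ARI follows its standard contingency-table definition. The macro-F1 here assumes that predicted cluster labels have already been matched to ground-truth labels; how that matching was done is not specified in this section, so it is left out. The function names are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index computed from the contingency table of two labelings."""
    n = len(labels_true)
    if n < 2:
        return 1.0
    joint = Counter(zip(labels_true, labels_pred))
    sum_ij = sum(comb(c, 2) for c in joint.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings are a single cluster
    return (sum_ij - expected) / (max_index - expected)


def macro_f1(labels_true, labels_pred):
    """Unweighted mean of per-class F1 scores (labels assumed already matched)."""
    classes = sorted(set(labels_true) | set(labels_pred))
    f1_values = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(labels_true, labels_pred))
        fp = sum(t != c and p == c for t, p in zip(labels_true, labels_pred))
        fn = sum(t == c and p != c for t, p in zip(labels_true, labels_pred))
        denom = 2 * tp + fp + fn
        f1_values.append(2 * tp / denom if denom else 0.0)
    return sum(f1_values) / len(f1_values)


ari_same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])   # same partition
ari_cross = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])  # crossed partition
f1 = macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```

The ARI is invariant to how clusters are named, which is why the first comparison counts as a perfect match, whereas the macro-F1 depends on the label matching.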
Scalability analysis of each method on the mouse whole brain MERFISH dataset
We tested the scalability of scNiche and other methods on the processed mouse whole brain MERFISH dataset containing 3,698,530 cells from 129 coronal sections. For scNiche, we performed the PCA dimensionality reduction on the data first and then applied the batch training strategy on multiple sections with the number of batches set to 500. We set other parameters ‘k_cutoff’ = 30, ‘epochs’ = 25, and ‘lr’ = 0.01. The target number of clusters was set to 14 based on the cluster stability result. For CellCharter, we performed the PCA dimensionality reduction on the data first, and ran it with default parameters. For UTAG, we applied the batch mode provided in the tutorial with the parameter ‘max_dist’ = 30. For BANKSY, we applied the multi-sample analysis with the parameters ‘k_geom’ = 30 and ‘lambda’ = 0.8. The number of clusters was also set to 14 to be consistent with scNiche. In addition, we performed the k-means algorithm on the principal components of the data directly as the nonspatial clustering for comparison. The annotated mouse brain coronal section images were downloaded from the Allen Mouse Brain Atlas41 [https://mouse.brain-map.org/experiment/thumbnails/100048576?image_type=atlas]. The scalability benchmarking was conducted on the same computing cluster as the other benchmarking studies.
Analysis of the human TNBC MIBI-TOF dataset and the mouse liver Seq-Scope dataset
For both the human TNBC MIBI-TOF dataset and the mouse liver Seq-Scope dataset, we applied the batch training strategy to run scNiche on multiple slices with the number of batches set to 20. Notably, for the Seq-Scope data, despite its ability to achieve subcellular resolution, the UMI information for each high-definition map coordinate identifier (HDMI) needs to be aggregated to produce interpretable results10. We therefore used the data binning with 10 μm grids provided by the original authors for our analysis, as the resolution of this data was already close to the single-cell level and there was little noise in cell type identification10 (Supplementary Fig. S32). We set the parameters ‘k_cutoff’ = 30, ‘epochs’ = 100, and ‘lr’ = 0.01 for both datasets. The target number of clusters was determined by the cluster stability results.
Spatial connectivity analysis of cell niches
Given cells belonging to N identified cell niches \(C={\left\{{c}^{\left(n\right)}\right\}}_{n=1}^{N}\), we first computed the spatial graph of cells based on Delaunay triangulation using the ‘spatial_neighbors’ function in the squidpy89 Python package (version 1.2.3) to obtain the adjacency matrix A. The number of spatial links between cell niches i and j can then be defined as:
Larger values indicate a stronger spatial connectivity between cell niches.
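Counting links between niches from an adjacency matrix can be sketched as follows. The example assumes a symmetric 0/1 adjacency matrix in which each undirected edge is counted once and no normalization is applied; whether the definition above normalizes the count is not assumed here. The Delaunay graph itself is replaced by a hand-written toy matrix, and the function name is hypothetical.

```python
from collections import Counter


def niche_links(A, niche_labels):
    """Count spatial edges between every pair of niches.

    A is a symmetric 0/1 adjacency matrix (list of rows); `niche_labels[p]` is
    the niche of cell p. Each undirected edge is counted once.
    """
    links = Counter()
    n = len(A)
    for p in range(n):
        for q in range(p + 1, n):
            if A[p][q]:
                pair = tuple(sorted((niche_labels[p], niche_labels[q])))
                links[pair] += 1
    return links


# Toy spatial graph: four cells on a line, 0 - 1 - 2 - 3.
A = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
niche_labels = ["niche1", "niche1", "niche2", "niche2"]
links = niche_links(A, niche_labels)
```

Here the single edge between cells 1 and 2 is the only link connecting the two niches, while each niche also has one internal edge.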
Differential gene expression analysis
The differentially expressed genes were calculated using the ‘rank_genes_groups’ function in the scanpy101 Python package (version 1.9.1) with the Wilcoxon rank sum test (adjusted p-value < 0.05).
Gene expression signature score calculation
The gene expression signature scores were calculated using the ‘score_genes’ function in the scanpy101 Python package (version 1.9.1) with default parameters. The signature gene sets of KEGG111 pathways were downloaded using the ‘get_library’ function in the gseapy112 Python package (version 1.1.2). The signature gene sets of the cellular inflammatory infiltration and fibrosis were obtained from the original publication by Te et al.74.
Pathway enrichment analysis
Gene set enrichment analysis113 (GSEA) was performed using the gseapy112 Python package (version 1.1.2) with default parameters. The hallmark gene sets were downloaded from the Molecular Signatures Database114,115 using the ‘get_library’ function in the same package.
Statistics
Python (version 3.9.19) and R (versions 4.2.1 and 4.3.1) were used for the statistical analysis.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
No new data was generated for this study. All data used in this study is publicly available and can be accessed through the following links: (1) the multi-condition human PBMCs dataset [downloaded using the ExperimentHub R package (muscData, EH2259)]98; (2) the multi-condition mouse cortex dataset [downloaded using the ExperimentHub R package (muscData, EH3297)]99; (3) the mouse spleen CODEX dataset [https://data.mendeley.com/datasets/zjnpwh8m5b/1]12; (4) the human UTUC IMC dataset [https://doi.org/10.5281/zenodo.6376766]33; (5) the mouse V1 neocortex STARmap dataset [https://zenodo.org/record/7830764#.ZDpObi-1HUI]34; (6) the mouse frontal cortex and striatum MERFISH dataset [https://cellxgene.cziscience.com/collections/31937775-0602-4e52-a799-b6acdd2bac2e]35; (7) the human MTG snRNA-seq dataset [https://portal.brain-map.org/atlases-and-data/rnaseq/human-mtg-smart-seq]38; (8) the human DLPFC 10X Visium dataset [http://spatial.libd.org/spatialLIBD/]37; (9) the mouse whole brain (ABCA-1) MERFISH dataset [https://cellxgene.cziscience.com/collections/0cca8620-8dee-45d0-aef5-23f032a5cf09]40; (10) the human TNBC MIBI-TOF dataset [https://gene.ai.tencent.com/SpatialOmics/dataset?datasetID=47]13; (11) the mouse liver Seq-Scope dataset [https://doi.org/10.7302/cjfe-wa35]10; (12) the mouse brain spatial CUT&Tag–RNA-seq dataset [https://zenodo.org/records/10362607]97. Source data are provided with this paper.
Code availability
scNiche is an open-access Python package available in the GitHub repository (https://github.com/ZJUFanLab/scNiche), under the GPL-3.0 license. The relevant code is also accessible via Zenodo (https://zenodo.org/records/14195486)116. Source code from CMGEC was used, with written permission from the authors.
References
Rojas-Ríos, P. & González-Reyes, A. Concise review: The plasticity of stem cell niches: a general property behind tissue homeostasis and repair. Stem Cells 32, 852–859 (2014).
Mendelson, A. & Frenette, P. S. Hematopoietic stem cell niche maintenance during homeostasis and regeneration. Nat. Med. 20, 833–846 (2014).
Kanemaru, K. et al. Spatially resolved multiomics of human cardiac niches. Nature 619, 801–810 (2023).
Ren, Y. et al. Spatial transcriptomics reveals niche-specific enrichment and vulnerabilities of radial glial stem-like cells in malignant gliomas. Nat. Commun. 14, 1028 (2023).
Guilliams, M. et al. Spatial proteogenomics reveals distinct and evolutionarily conserved hepatic macrophage niches. Cell 185, 379–396.e38 (2022).
Lake, B. B. et al. An atlas of healthy and injured cell states and niches in the human kidney. Nature 619, 585–594 (2023).
Moffitt, J. R. et al. Molecular, spatial, and functional single-cell profiling of the hypothalamic preoptic region. Science 362, eaau5324 (2018).
Stickels, R. R. et al. Highly sensitive spatial transcriptomics at near-cellular resolution with Slide-seqV2. Nat. Biotechnol. 39, 313–319 (2021).
Shi, H. et al. Spatial atlas of the mouse central nervous system at molecular resolution. Nature 622, 552–561 (2023).
Cho, C.-S. et al. Microscopic examination of spatial transcriptome using Seq-Scope. Cell 184, 3559–3572.e22 (2021).
Chen, A. et al. Spatiotemporal transcriptomic atlas of mouse organogenesis using DNA nanoball-patterned arrays. Cell 185, 1777–1792.e21 (2022).
Goltsev, Y. et al. Deep Profiling of Mouse Splenic Architecture with CODEX Multiplexed Imaging. Cell 174, 968–981.e15 (2018).
Keren, L. et al. A Structured Tumor-Immune Microenvironment in Triple Negative Breast Cancer Revealed by Multiplexed Ion Beam Imaging. Cell 174, 1373–1387.e19 (2018).
Jackson, H. W. et al. The single-cell pathology landscape of breast cancer. Nature 578, 615–620 (2020).
He, S. et al. High-plex imaging of RNA and proteins at subcellular resolution in fixed tissue by spatial molecular imaging. Nat. Biotechnol. 40, 1794–1806 (2022).
Zhu, Q., Shah, S., Dries, R., Cai, L. & Yuan, G.-C. Identification of spatially associated subpopulations by combining scRNAseq and sequential fluorescence in situ hybridization data. Nat. Biotechnol. https://doi.org/10.1038/nbt.4260 (2018).
Zhao, E. et al. Spatial transcriptomics at subspot resolution with BayesSpace. Nat. Biotechnol. 39, 1375–1384 (2021).
Liu, W. et al. Joint dimension reduction and clustering analysis of single-cell RNA-seq and spatial transcriptomics data. Nucleic Acids Res. 50, e72 (2022).
Li, Z. & Zhou, X. BASS: multi-scale and multi-sample analysis enables accurate cell type clustering and spatial domain detection in spatial transcriptomic studies. Genome Biol. 23, 168 (2022).
Wu, Z. et al. Discovery and generalization of tissue structures from spatial omics data. Cell Rep. Methods 4, 100838 (2024).
Kim, J. et al. Unsupervised discovery of tissue architecture in multiplexed imaging. Nat. Methods 19, 1653–1661 (2022).
Varrone, M., Tavernari, D., Santamaria-Martínez, A., Walsh, L. A. & Ciriello, G. CellCharter reveals spatial cell niches associated with tissue remodeling and cell plasticity. Nat. Genet. 56, 74–84 (2024).
Singhal, V. et al. BANKSY unifies cell typing and tissue domain segmentation for scalable spatial omics data analysis. Nat. Genet. 56, 431–441 (2024).
Hu, J. et al. SpaGCN: Integrating gene expression, spatial location and histology to identify spatial domains and spatially variable genes by graph convolutional network. Nat. Methods 18, 1342–1351 (2021).
Dong, K. & Zhang, S. Deciphering spatial domains from spatially resolved transcriptomics with an adaptive graph attention auto-encoder. Nat. Commun. 13, 1739 (2022).
Long, Y. et al. Spatially informed clustering, integration, and deconvolution of spatial transcriptomics with GraphST. Nat. Commun. 14, 1155 (2023).
Ren, H., Walker, B. L., Cang, Z. & Nie, Q. Identifying multicellular spatiotemporal organization of cells with SpaceFlow. Nat. Commun. 13, 4076 (2022).
Hu, Y. et al. Unsupervised and supervised discovery of tissue cellular neighborhoods from cell phenotypes. Nat. Methods 21, 267–278 (2024).
Chen, Z., Soifer, I., Hilton, H., Keren, L. & Jojic, V. Modeling Multiplexed Images with Spatial-LDA Reveals Novel Tissue Microenvironments. J. Comput. Biol. 27, 1204–1218 (2020).
Traag, V. A., Waltman, L. & van Eck, N. J. From Louvain to Leiden: guaranteeing well-connected communities. Sci. Rep. 9, 5233 (2019).
Qian, J. et al. Simulating multiple variability in spatially resolved transcriptomics with scCube. Nat. Commun. 15, 5021 (2024).
Yuan, Z. MENDER: fast and scalable tissue structure identification in spatial omics data. Nat. Commun. 15, 207 (2024).
Ohara, K. et al. The evolution of metastatic upper tract urothelial carcinoma through genomic-transcriptomic and single-cell protein markers analysis. Nat. Commun. 15, 2009 (2024).
Wang, X. et al. Three-dimensional intact-tissue sequencing of single-cell transcriptional states. Science 361, eaat5691 (2018).
Allen, W. E., Blosser, T. R., Sullivan, Z. A., Dulac, C. & Zhuang, X. Molecular and spatial signatures of mouse brain aging at single-cell resolution. Cell 186, 194–208.e18 (2023).
Ståhl, P. L. et al. Visualization and analysis of gene expression in tissue sections by spatial transcriptomics. Science 353, 78–82 (2016).
Maynard, K. R. et al. Transcriptome-scale spatial gene expression in the human dorsolateral prefrontal cortex. Nat. Neurosci. 24, 425–436 (2021).
Hodge, R. D. et al. Conserved cell types with divergent features in human versus mouse cortex. Nature 573, 61–68 (2019).
Kleshchevnikov, V. et al. Cell2location maps fine-grained cell types in spatial transcriptomics. Nat. Biotechnol. 40, 661–671 (2022).
Zhang, M. et al. Molecularly defined and spatially resolved cell atlas of the whole mouse brain. Nature 624, 343–354 (2023).
Wang, Q. et al. The Allen Mouse Brain Common Coordinate Framework: A 3D Reference Atlas. Cell 181, 936–953.e20 (2020).
Xu, H. et al. Unsupervised spatially embedded deep representation of spatial transcriptomics. Genome Med. 16, 12 (2024).
Duan, B., Chen, S., Cheng, X. & Liu, Q. Multi-slice spatial transcriptome domain analysis with SpaDo. Genome Biol. 25, 73 (2024).
Haviv, D. et al. The covariance environment defines cellular niches for spatial inference. Nat. Biotechnol. https://doi.org/10.1038/s41587-024-02193-4 (2024).
Quail, D. F. & Joyce, J. A. Microenvironmental regulation of tumor progression and metastasis. Nat. Med. 19, 1423–1437 (2013).
Anderson, N. M. & Simon, M. C. The tumor microenvironment. Curr. Biol. 30, R921–R925 (2020).
Binnewies, M. et al. Understanding the tumor immune microenvironment (TIME) for effective therapy. Nat. Med. 24, 541–550 (2018).
Junttila, M. R. & de Sauvage, F. J. Influence of tumour micro-environment heterogeneity on therapeutic response. Nature 501, 346–354 (2013).
Bejarano, L., Jordāo, M. J. C. & Joyce, J. A. Therapeutic Targeting of the Tumor Microenvironment. Cancer Discov. 11, 933–959 (2021).
Ptacek, J. et al. Multiplexed ion beam imaging (MIBI) for characterization of the tumor microenvironment across tumor types. Lab. Invest. 100, 1111–1123 (2020).
Hsieh, W.-C. et al. Spatial multi-omics analyses of the tumor immune microenvironment. J. Biomed. Sci. 29, 96 (2022).
Kudelova, E. et al. Genetic Heterogeneity, Tumor Microenvironment and Immunotherapy in Triple-Negative Breast Cancer. Int. J. Mol. Sci. 23, 14937 (2022).
Fridman, W. H. et al. B cells and tertiary lymphoid structures as determinants of tumour immune contexture and clinical outcome. Nat. Rev. Clin. Oncol. 19, 441–457 (2022).
Sautès-Fridman, C. et al. Tertiary Lymphoid Structures in Cancers: Prognostic Value, Regulation, and Manipulation for Therapeutic Intervention. Front. Immunol. 7, 407 (2016).
Mao, X. et al. Crosstalk between cancer-associated fibroblasts and immune cells in the tumor microenvironment: new findings and future perspectives. Mol. Cancer 20, 131 (2021).
Nishikawa, H. & Koyama, S. Mechanisms of regulatory T cell infiltration in tumors: implications for innovative immune precision therapies. J. Immunother. Cancer 9, e002591 (2021).
Gabrilovich, D. I. & Nagaraj, S. Myeloid-derived suppressor cells as regulators of the immune system. Nat. Rev. Immunol. 9, 162–174 (2009).
Keren, L. et al. MIBI-TOF: A multiplexed imaging platform relates cellular phenotypes and tissue structure. Sci. Adv. 5, eaax5851 (2019).
Carrel, S., Schmidt-Kessen, A. & Giuffrè, L. Recombinant interferon-gamma can induce the expression of HLA-DR and -DC on DR-negative melanoma cells and enhance the expression of HLA-ABC and tumor-associated antigens. Eur. J. Immunol. 15, 118–123 (1985).
Hadrup, S., Donia, M. & Thor Straten, P. Effector CD4 and CD8 T cells and their role in the tumor microenvironment. Cancer Microenviron. 6, 123–133 (2013).
Lee, E. S. et al. Calretinin, CD34, and alpha-smooth muscle actin in the identification of peritoneal invasive implants of serous borderline tumors of the ovary. Mod. Pathol. 19, 364–372 (2006).
Anggorowati, N. et al. Histochemical and Immunohistochemical Study of α-SMA, Collagen, and PCNA in Epithelial Ovarian Neoplasm. Asian Pac. J. Cancer Prev. 18, 667–671 (2017).
Sarrió, D. et al. Epithelial-mesenchymal transition in breast cancer relates to the basal-like phenotype. Cancer Res. 68, 989–997 (2008).
Cho, C.-S. et al. Concurrent activation of growth factor and nutrient arms of mTORC1 induces oxidative liver injury. Cell Discov. 5, 60 (2019).
Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018).
Halpern, K. B. et al. Single-cell spatial reconstruction reveals global division of labour in the mammalian liver. Nature 542, 352–356 (2017).
Ben-Moshe, S. et al. Spatial sorting enables comprehensive characterization of liver zonation. Nat. Metab. 1, 899–911 (2019).
Hildebrandt, F. et al. Spatial Transcriptomics to define transcriptional patterns of zonation and structural components in the mouse liver. Nat. Commun. 12, 7046 (2021).
Richter, M. L. et al. Single-nucleus RNA-seq2 reveals functional crosstalk between liver zonation and ploidy. Nat. Commun. 12, 4264 (2021).
Wang, S. et al. Region-specific cellular and molecular basis of liver regeneration after acute pericentral injury. Cell Stem Cell 31, 341–358.e7 (2024).
Paris, J. & Henderson, N. C. Liver zonation, revisited. Hepatology 76, 1219–1230 (2022).
Sack, G. H. Serum Amyloid A (SAA) Proteins. Subcell. Biochem. 94, 421–436 (2020).
Strazzabosco, M., Fabris, L. & Albano, E. Osteopontin: a new player in regulating hepatic ductular reaction and hepatic progenitor cell responses during chronic liver injury. Gut 63, 1693–1694 (2014).
Te, J. A., AbdulHameed, M. D. M. & Wallqvist, A. Systems toxicology of chemically induced liver and kidney injuries: histopathology-associated gene co-expression modules. J. Appl Toxicol. 36, 1137–1149 (2016).
Manchanda, M. et al. Cathepsin L and B as Potential Markers for Liver Fibrosis: Insights From Patients and Experimental Models. Clin. Transl. Gastroenterol. 8, e99 (2017).
Yu, C., Wan, Y., Piao, L. & Wu Cheng, X. Can cysteinyl cathepsin activity control diet-induced NAFLD? Int J. Cardiol. Heart Vasc. 28, 100516 (2020).
Ruiz-Blázquez, P., Pistorio, V., Fernández-Fernández, M. & Moles, A. The multifaceted role of cathepsins in liver disease. J. Hepatol. 75, 1192–1202 (2021).
Anakk, S. et al. Combined deletion of Fxr and Shp in mice induces Cyp17a1 and results in juvenile onset cholestasis. J. Clin. Invest. 121, 86–95 (2011).
Gant, T. W. et al. Gene expression profiles associated with inflammation, fibrosis, and cholestasis in mouse liver after griseofulvin. EHP Toxicog. 111, 37–43 (2003).
Li, L. & Falany, C. N. Elevated hepatic SULT1E1 activity in mouse models of cystic fibrosis alters the regulation of estrogen responsive proteins. J. Cyst. Fibros. 6, 23–30 (2007).
Lalani, A. I., Zhu, S., Gokhale, S., Jin, J. & Xie, P. TRAF molecules in inflammation and inflammatory diseases. Curr. Pharm. Rep. 4, 64–90 (2018).
Gao, L. et al. Tumor necrosis factor receptor-associated factor 5 (Traf5) acts as an essential negative regulator of hepatic steatosis. J. Hepatol. 65, 125–136 (2016).
Fang, J. et al. Scientometric analysis of mTOR signaling pathway in liver disease. Ann. Transl. Med. 8, 93 (2020).
He, J. et al. Mammalian Target of Rapamycin Complex 1 Signaling Is Required for the Dedifferentiation From Biliary Cell to Bipotential Progenitor Cell in Zebrafish Liver Regeneration. Hepatology 70, 2092–2106 (2019).
Matter, M. S., Decaens, T., Andersen, J. B. & Thorgeirsson, S. S. Targeting the mTOR pathway in hepatocellular carcinoma: current state and future trends. J. Hepatol. 60, 855–865 (2014).
Cho, C.-S., Kowalsky, A. H. & Lee, J. H. Pathological Consequences of Hepatic mTORC1 Dysregulation. Genes (Basel) 11, 896 (2020).
Chen, F. et al. Loss of Ufl1/Ufbp1 in hepatocytes promotes liver pathological damage and carcinogenesis through activating mTOR signaling. J. Exp. Clin. Cancer Res 42, 110 (2023).
Jin, S. et al. Inference and analysis of cell-cell communication using CellChat. Nat. Commun. 12, 1088 (2021).
Palla, G. et al. Squidpy: a scalable framework for spatial omics analysis. Nat. Methods 19, 171–178 (2022).
Cable, D. M. et al. Robust decomposition of cell type mixtures in spatial transcriptomics. Nat. Biotechnol. 40, 517–526 (2022).
Elosua-Bayes, M., Nieto, P., Mereu, E., Gut, I. & Heyn, H. SPOTlight: seeded NMF regression to deconvolute spatial transcriptomics spots with single-cell transcriptomes. Nucleic Acids Res 49, e50 (2021).
Wei, R. et al. Spatial charting of single-cell transcriptomes in tissues. Nat. Biotechnol. 40, 1190–1199 (2022).
Biancalani, T. et al. Deep learning and alignment of spatially resolved single-cell transcriptomes with Tangram. Nat. Methods 18, 1352–1362 (2021).
Vahid, M. R. et al. High-resolution alignment of single-cell and spatial transcriptomes with CytoSPACE. Nat. Biotechnol. 41, 1543–1548 (2023).
Qian, J. et al. Reconstruction of the cell pseudo-space from single-cell RNA sequencing data with scSpace. Nat. Commun. 14, 2484 (2023).
Zhang, Q. et al. Leveraging spatial transcriptomics data to recover cell locations in single-cell RNA-seq with CeLEry. Nat. Commun. 14, 4050 (2023).
Zhang, D. et al. Spatial epigenome-transcriptome co-profiling of mammalian tissues. Nature 616, 113–122 (2023).
Kang, H. M. et al. Multiplexed droplet single-cell RNA-sequencing using natural genetic variation. Nat. Biotechnol. 36, 89–94 (2018).
Crowell, H. L. et al. muscat detects subpopulation-specific state transitions from multi-sample multi-condition single-cell transcriptomics data. Nat. Commun. 11, 6077 (2020).
Morgan, M. & Shepherd, L. ExperimentHub: Client to Access ExperimentHub Resources. (2022).
Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).
Yuan, Z. et al. SODB facilitates comprehensive exploration of spatial omics data. Nat. Methods 20, 387–399 (2023).
Wang, Y., Chang, D., Fu, Z. & Zhao, Y. Consistent Multiple Graph Embedding for Multi-View Clustering. IEEE Trans. Multimed. 25, 1008–1018 (2021).
Lotfollahi, M. et al. Mapping single-cell data to reference atlases by transfer learning. Nat. Biotechnol. 40, 121–130 (2022).
Kipf, T. N. & Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. In International Conference on Learning Representations (ICLR) (2017).
Belghazi, I., Rajeswar, S., Baratin, A., Hjelm, R. D. & Courville, A. C. MINE: Mutual Information Neural Estimation. In Proceedings of the 35th International Conference on Machine Learning (PMLR), 531–540 (2018).
Hjelm, R. D. et al. Learning deep representations by mutual information estimation and maximization. In International Conference on Learning Representations (ICLR) (2019).
Nowozin, S., Cseke, B. & Tomioka, R. f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization. Advances in Neural Information Processing Systems 29, 271–279 (2016).
Wang, M. et al. Deep Graph Library: A Graph-Centric, Highly-Performant Package for Graph Neural Networks. arXiv abs/1909.01315 (2019).
Fowlkes, E. B. & Mallows, C. L. A Method for Comparing Two Hierarchical Clusterings. J. Am. Stat. Assoc. 78, 553–569 (1983).
Kanehisa, M. & Goto, S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 28, 27–30 (2000).
Fang, Z., Liu, X. & Peltz, G. GSEApy: a comprehensive package for performing gene set enrichment analysis in Python. Bioinformatics 39, btac757 (2023).
Subramanian, A. et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA 102, 15545–15550 (2005).
Liberzon, A. et al. Molecular signatures database (MSigDB) 3.0. Bioinformatics 27, 1739–1740 (2011).
Liberzon, A. et al. The Molecular Signatures Database (MSigDB) hallmark gene set collection. Cell Syst. 1, 417–425 (2015).
Dr. Fan, X. & Jingyang, Q. ZJUFanLab/scNiche: scNiche v1.1.0. Zenodo https://doi.org/10.5281/zenodo.14195486 (2024).
Acknowledgements
This work is supported by the National Natural Science Foundation of China (U23A20513 to X.F.; 82274213 to H.H.), the “Pioneer” and “Leading Goose” R&D Program of Zhejiang (2024C03106 to X.F.), and the Fundamental Research Funds for the Central Universities (226-2024-00001 to X.F.). The authors thank the High-Performance Computing Cluster of Zhejiang University Innovation Center of Yangtze River Delta for technical support, and thank Dr. Yao Zhao and his team for developing the multi-view clustering framework CMGEC, which inspired the development of this project.
Author information
Contributions
X.F. and H.H. conceived the study. J.Q. and X.S. implemented the scNiche model. J.Q., X.S., H.B., Y.F., W.G., C.L., and A.L. collected datasets involved in this article, performed benchmarking experiments, and conducted experimental analyses on biological datasets. J.Q. wrote the manuscript, and all authors edited and revised the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Nigel Chou, who co-reviewed with Vipul Singhal and Xinrui Zhou; Zhenqin Wu and Nancy Zhang for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access: This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Qian, J., Shao, X., Bao, H. et al. Identification and characterization of cell niches in tissue from spatial omics data at single-cell resolution. Nat Commun 16, 1693 (2025). https://doi.org/10.1038/s41467-025-57029-9
DOI: https://doi.org/10.1038/s41467-025-57029-9