Scalable topic modelling decodes spatial tissue architecture for large-scale multiplexed imaging analysis

Peng, Xiyu; Smithy, James W.; Yosofvand, Mohammad; Kostrzewa, Caroline E.; Bleile, MaryLena; Ehrich, Fiona D.; Lee, Jasme; Postow, Michael A.; Callahan, Margaret K.; Panageas, Katherine S.; Shen, Ronglai

doi:10.1038/s41467-025-61821-y


Article
Open access
Published: 18 July 2025


Nature Communications volume 16, Article number: 6619 (2025)


Abstract

Recent progress in multiplexed tissue imaging is deepening our understanding of tumor microenvironments related to treatment response and disease progression. However, analyzing whole-slide images with millions of cells remains computationally challenging, and few methods provide a principled approach for integrative analysis across images. Here, we introduce SpatialTopic, a spatial topic model designed to decode high-level spatial tissue architecture from multiplexed images. By integrating both cell type and spatial information, SpatialTopic identifies recurrent spatial patterns, or “topics,” that reflect biologically meaningful tissue structures. We benchmarked SpatialTopic across diverse single-cell spatial transcriptomic and proteomic imaging platforms spanning multiple tissue types. We show that SpatialTopic is highly scalable to large-scale images, along with high precision and interpretability. It consistently identifies biologically and clinically significant spatial topics, such as tertiary lymphoid structures, and tracks spatial changes over disease progression. Its computational efficiency and broad applicability will enhance the analysis of large-scale imaging datasets.


Introduction

Recent advancements in multiplexed tissue imaging allow the profiling of RNA and protein expression in situ across thousands to millions of single cells within a whole-slide tissue context^1,2,3,4,5. These technologies generate high-dimensional molecular imaging data, offering significant opportunities for a spatially resolved understanding of cellular heterogeneity and organization within tissues. Compared to other single-cell technologies (such as single-cell RNA-seq and flow cytometry), multiplexed imaging provides unique opportunities to examine spatial patterns of diverse cell types and characterize the tissue microenvironment of interest, which may play an essential role in understanding disease progression, tissue development, and mechanisms of treatment response^1,2,4,5,6,7. One recent discovery in cancer, partly enabled by multiplexed spatially resolved omics data, is the presence of tertiary lymphoid structures (TLSs) in tumor tissues and their role in the adaptive antitumor immune response^8,9,10,11. TLSs have been identified in a wide range of human cancers⁹ and have demonstrated a promising positive association with improved outcomes in cancer patients who underwent immunotherapy⁸.

While promising, the complex cellular architecture revealed by whole-slide multiplexed tissue imaging presents significant analytical challenges. Pathology images of tissue samples affected by certain diseases, such as cancer, are particularly complex, displaying abnormal cellular structures and significant variation between tumor samples. Currently, most analyses focus on individual images, examining elements such as cell densities and inter-cellular distances^1,2,6, or conducting basic spatial domain analyses that primarily focus on binarized tissue compartments, such as tumor versus stroma^12,13,14. Associating these features with outcomes requires manual and heuristic aggregation across images. A significant hurdle in spatial pattern analysis therefore remains deciphering biologically and clinically relevant patterns from the complex tissue architecture across many slides.

In recent literature, cell neighborhood (niche) analysis has emerged as a popular approach. This analysis pipeline typically consists of two primary steps: first, identifying neighborhood features for each single cell using either a K-nearest-neighbor (KNN) graph or a defined radius, and then applying a clustering algorithm, such as k-means, Louvain, or Latent Dirichlet Allocation (LDA)^2,6,7,15,16,17. Seurat v5¹⁸, for instance, clusters cells using k-means based on similar cell type compositions, offering a straightforward niche analysis method. Variants of the approach differ in how they incorporate spatial information into the clustering process. UTAG¹⁵ averages marker expression within the neighborhood for clustering, while BankSY¹⁹ further refines this by combining local mean expression with individual cell expression. Spatial-LDA¹⁶ incorporates spatial priors into clustering so that spatially proximal cells share similar cell neighborhoods. More recently, graph neural networks, such as CytoCommunity²⁰, have been employed to discern cell neighborhood patterns. However, deep learning methods like CytoCommunity require significant computational resources, posing challenges for individual labs, particularly for large-scale image analysis. Other studies adapt computational methods designed for spatial transcriptomics, such as those intended for 10× Visium, to analyze tissue imaging data^21,22,23; these face limitations due to high computational costs¹⁹ and are generally restricted to single tissue sections with fewer spots^15,21. These methods struggle with large-scale images, like whole-slide multiplexed images containing millions of cells, and are challenging to adapt for modern imaging platforms like Nanostring CosMx and 10× Xenium.
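The generic two-step niche pipeline described above (neighborhood features, then clustering) can be sketched as follows. This is a minimal illustration, not the implementation of Seurat v5 or any other cited tool: the function names `neighborhood_composition` and `kmeans`, the brute-force neighbor search, and the random center initialization are all assumptions chosen for brevity.

```python
import math
import random


def neighborhood_composition(coords, types, n_types, k):
    """Step 1: cell-type fractions among each cell's k nearest cells (brute force)."""
    feats = []
    for p in coords:
        order = sorted(range(len(coords)), key=lambda j: math.dist(p, coords[j]))
        counts = [0.0] * n_types
        for j in order[:k]:  # includes the cell itself (distance 0)
            counts[types[j]] += 1.0 / k
        feats.append(counts)
    return feats


def kmeans(feats, n_clusters, n_iter=20, seed=0):
    """Step 2: plain Lloyd's k-means on the composition vectors."""
    rng = random.Random(seed)
    centers = [list(f) for f in rng.sample(feats, n_clusters)]
    labels = [0] * len(feats)
    for _ in range(n_iter):
        # Assign each cell to its closest center.
        labels = [min(range(n_clusters), key=lambda c: math.dist(f, centers[c]))
                  for f in feats]
        # Move each non-empty center to the mean of its members.
        for c in range(n_clusters):
            members = [f for f, lab in zip(feats, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy image: a patch of type-0 cells and a distant patch of type-1 cells.
coords = [(float(i), 0.0) for i in range(10)] + [(100.0 + i, 100.0) for i in range(10)]
types = [0] * 10 + [1] * 10
niches = kmeans(neighborhood_composition(coords, types, n_types=2, k=3), n_clusters=2)
```

Note that step 1 touches every cell's neighborhood explicitly, which is the per-cell cost that becomes expensive on images with millions of cells.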

Highly interpretable and scalable machine learning methods are greatly needed for analyzing molecular tissue imaging data. In this work, we propose SpatialTopic, a Bayesian topic model designed to identify and interpret spatial tissue architecture across various multiplexed images by considering both the cell types and their spatial arrangement (Fig. 1A). We adapt an approach originally developed for image segmentation in computer vision²⁴, incorporating spatial information into the flexible design of regions (image partitions, analogous to documents in language modeling). Unlike standard image pixels, the basic units of analysis in multiplexed tissue images are cells, which are not uniformly distributed due to the complexity of human tissue samples, posing a unique challenge. To address this challenge, we refined the original model used for image segmentation by using a nearest-neighbor kernel function to boost computational efficiency, as well as a unique initialization strategy to increase robustness. We also provide an efficient C++ implementation of the spatial topic model in our R package SpaTopic.

**Fig. 1: *SpatialTopic* unsupervisedly identifies distinct tissue microenvironments across images, utilizing topic model concepts in computer vision.**

SpatialTopic offers a scalable solution for cell neighborhood and domain analysis for large-scale, multi-image datasets, efficiently handling millions of cells without the need to extract cell neighborhood information for each individual cell—a process that becomes computationally demanding and inefficient at scale. In contrast, SpatialTopic completes analysis of an image with 100,000 cells within 1 min on a laptop. Moreover, unlike the rigid clustering strategies of other methods, SpatialTopic identifies “topics”—tissue microenvironment features—through a probabilistic distribution over cell types and across diverse tissue images using a generative model. We demonstrate our method can accurately identify and quantify interpretable and biologically meaningful topics from imaging data without human intervention. We also present multiple case studies encompassing tissue images from mouse spleen, non-small cell lung cancer (NSCLC), healthy lung, and melanoma tissue samples. Finally, we highlight an example of a TLS-like topic and its correlation with outcomes from SpatialTopic analysis across different platforms, as well as a multi-stage example showing dynamic changes in spatial tissue architecture across varying disease stages. With computational efficiency and broad applicability, SpatialTopic provides a scalable framework that will enhance the analysis of large-scale tissue imaging studies.

Results

Overview of SpatialTopic, a Bayesian probabilistic model for highly scalable and interpretable spatial topic analysis across multiplexed tissue images

SpatialTopic is designed as a flexible spatial analysis module within the current imaging analysis workflow (Fig. 1B). Its main objective is to identify biologically meaningful topics across multiplexed images using unsupervised learning. Here, “topics” refer to latent spatial features defined by distinct cell type compositions within tissue microenvironment neighborhoods. SpatialTopic incorporates spatial data into an LDA model, assuming that each cell in an image arises from a mixture of spatially resolved topics, with each topic being a distribution over distinct cell types. Combining cell-type information with their spatial layout, this method enables the automated and simultaneous detection of immunological patterns across multiple images. Subsequent analyses can further link these topics with patient data, such as treatment response and survival.
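As a reading aid, the LDA-style assumption described above can be written as a generative sketch. The notation is an assumption patterned on standard LDA rather than the exact specification from the Methods section: $c_{gi}$ and $\boldsymbol{x}_{gi}$ are the type and location of cell $i$ in image $g$, $D_{gi}$ and $Z_{gi}$ are its latent region and topic, $\boldsymbol{x}_{d}$ is the center of region $d$, $\boldsymbol{\theta}_d$ is the topic mixture of region $d$, $\boldsymbol{\beta}_k$ is the cell-type distribution of topic $k$, $\boldsymbol{\alpha}$ and $\boldsymbol{\eta}$ are Dirichlet hyperparameters, and $\kappa$ is an unspecified spatial kernel that decays with distance.

$$
\begin{aligned}
\boldsymbol{\beta}_k &\sim \mathrm{Dirichlet}(\boldsymbol{\eta}), \qquad k = 1,\dots,K,\\
\boldsymbol{\theta}_d &\sim \mathrm{Dirichlet}(\boldsymbol{\alpha}), \qquad d = 1,\dots,M,\\
P(D_{gi} = d \mid \boldsymbol{x}_{gi}) &\propto \kappa(\boldsymbol{x}_{gi}, \boldsymbol{x}_{d}),\\
Z_{gi} \mid D_{gi} = d &\sim \mathrm{Categorical}(\boldsymbol{\theta}_d),\\
c_{gi} \mid Z_{gi} = k &\sim \mathrm{Categorical}(\boldsymbol{\beta}_k).
\end{aligned}
$$

Under this sketch, regions play the role that documents play in text LDA, except that a cell's region membership is itself latent and informed by its location.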

We adopt a Bayesian approach for model inference, considering the uncertainties inherent in tissue spatial patterns. SpatialTopic requires cell types and their locations as input, with the cell types determined by the users’ preferred phenotyping algorithm tailored to the specific marker panel of the dataset. The algorithm generates two key statistics for further analysis: (1) topic content, a spatially resolved topic distribution over cell types, and (2) topic assignment for each cell within the images. After Gibbs sampling, the topic assignment of each cell is determined by the topic with the highest posterior probability. Cell types enriched in the same topic tend to be spatially correlated across images, leading to the identification of recurrent patterns of cell–cell interactions.

We developed an R package SpaTopic to efficiently implement the SpatialTopic algorithm as outlined in Fig. 1A, which details the primary steps of the algorithm (see the “Methods” section). Figure 1C displays a graphical representation of the spatial topic model. The key inputs for SpatialTopic are the cell type annotations $\boldsymbol{\mathcal{C}}$ and their locations $\boldsymbol{\mathcal{X}}$ across all images. Here, $Z_{gi}$ denotes the topic assignment, and $D_{gi}$ indicates the region assignment of cell $i$ in image $g$. Analogous to how computer vision algorithms segment images by spatially co-occurring pixel patterns with similar color, intensity or texture for object detection, SpatialTopic identifies topics as clusters of spatially co-occurring cell types (shown in Fig. 1D), potentially corresponding to biologically meaningful cellular structures (e.g., TLSs). The process involves the following steps:

Initialization: anchor cells are chosen as regional centers via spatially stratified sampling. For each image, a KNN graph is constructed between anchor cells and all other cells: for each cell, we retrieve its top m closest anchor cells. The initial region assignments of cells are made based on proximity to region centers.
Collapsed Gibbs Sampling: each cell undergoes two main steps per iteration:
- Sample topic assignment $Z_{gi}$ conditional on its region assignment $D_{gi}$ and cell type $c_{gi}$, as well as the topic composition of the region $D_{gi}$ and the cell type composition of the topic $Z_{gi}$.
- Sample region assignment $D_{gi}$ conditional on current topic assignment $Z_{gi}$, distance of the cell $\boldsymbol{x}_{gi}^{c}$ to the region center $\boldsymbol{x}_{D_{gi}}^{d}$, and the topic composition of the region $D_{gi}$. The spatial information is weakly incorporated with a kernel function.
After Gibbs sampling, the output includes the posterior probabilities of $Z_{gi}$ of each cell and the per-topic cell type distribution $\{\hat{\boldsymbol{\beta}}_{k}\}$. Each cell in the image is assigned to a topic with the highest posterior probability $P(Z_{gi} \mid \boldsymbol{\mathcal{C}}, \boldsymbol{\mathcal{X}})$.
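The steps above can be sketched for a single image in plain Python. This is a toy sketch under stated assumptions, not the SpaTopic C++ implementation: the function names, the grid-based stratified sampling, the Gaussian form of the kernel, the symmetric Dirichlet hyperparameters `alpha` and `eta`, and all default values are hypothetical choices made for illustration.

```python
import math
import random
from collections import defaultdict


def pick_anchors(coords, square, rng):
    """Spatially stratified sampling: one random anchor cell per occupied grid square."""
    squares = defaultdict(list)
    for i, (x, y) in enumerate(coords):
        squares[(int(x // square), int(y // square))].append(i)
    return [rng.choice(members) for _, members in sorted(squares.items())]


def nearest_anchors(coords, anchors, m):
    """For each cell, its m closest anchors as (region index, distance), nearest first."""
    out = []
    for p in coords:
        ranked = sorted((math.dist(p, coords[a]), r) for r, a in enumerate(anchors))
        out.append([(r, dist) for dist, r in ranked[:m]])
    return out


def spatial_topic_gibbs(coords, types, n_types, n_topics, square=50.0, m=3,
                        sigma=25.0, alpha=0.1, eta=0.1, n_iter=60, burn_in=20, seed=0):
    rng = random.Random(seed)
    anchors = pick_anchors(coords, square, rng)
    cand = nearest_anchors(coords, anchors, m)
    n = len(coords)
    D = [cand[i][0][0] for i in range(n)]            # start in the nearest region
    Z = [rng.randrange(n_topics) for _ in range(n)]  # random initial topics
    n_dk = [[0] * n_topics for _ in anchors]         # region-by-topic counts
    n_kc = [[0] * n_types for _ in range(n_topics)]  # topic-by-cell-type counts

    def bump(d, k, c, delta):
        n_dk[d][k] += delta
        n_kc[k][c] += delta

    for i in range(n):
        bump(D[i], Z[i], types[i], +1)
    visits = [[0] * n_topics for _ in range(n)]      # post-burn-in topic tallies
    for it in range(n_iter):
        for i in range(n):
            c = types[i]
            bump(D[i], Z[i], c, -1)                  # remove cell i from the counts
            # Step 1: topic given the current region and the cell type.
            w = [(n_dk[D[i]][k] + alpha) * (n_kc[k][c] + eta)
                 / (sum(n_kc[k]) + n_types * eta) for k in range(n_topics)]
            Z[i] = rng.choices(range(n_topics), weights=w)[0]
            # Step 2: region given the topic and the distance to candidate centers.
            d0 = cand[i][0][1]                       # rescale so the nearest gets kernel 1
            w = [math.exp(-(dist ** 2 - d0 ** 2) / sigma ** 2)
                 * (n_dk[r][Z[i]] + alpha) / (sum(n_dk[r]) + n_topics * alpha)
                 for r, dist in cand[i]]
            D[i] = cand[i][rng.choices(range(len(w)), weights=w)[0]][0]
            bump(D[i], Z[i], c, +1)
            if it >= burn_in:
                visits[i][Z[i]] += 1
    kept = n_iter - burn_in
    posterior = [[v / kept for v in row] for row in visits]
    assignment = [max(range(n_topics), key=row.__getitem__) for row in posterior]
    beta = [[(x + eta) / (sum(row) + n_types * eta) for x in row] for row in n_kc]
    return posterior, assignment, beta
```

Because each cell only considers its `m` nearest anchors as candidate regions, the per-iteration cost grows linearly with the number of cells, which is one plausible reading of why a nearest-neighbor kernel helps at scale. A call such as `spatial_topic_gibbs(coords, types, n_types=4, n_topics=2)` returns per-cell posterior topic frequencies, the highest-probability topic per cell, and an estimate of each topic's cell-type distribution.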

We applied SpatialTopic to multiple datasets from diverse imaging platforms, including spatial proteomics data from Co-detection by Indexing (CODEX), Multiplexed ImmunoFluorescence (mIF), and Imaging Mass Cytometry (IMC) platforms, as well as spatial transcriptomics data from the Nanostring CosMx platform (Supplementary Table 1). In the next few sections, we apply SpatialTopic to analyze tissue imaging data from a variety of spatial molecular profiling platforms and benchmark SpatialTopic against other popular algorithms for spatial domain/niche analysis, including Seurat v5¹⁸, Spatial-LDA¹⁶, CytoCommunity²⁰, UTAG¹⁵, and BankSY¹⁹ (Supplementary Table 2). The benchmark datasets contain between 0.1 and 1 million cells, making it challenging to apply methods with high computational costs. In contrast, SpatialTopic processes these large-scale images within just a few minutes.

SpatialTopic identifies global and local spatial features of human lung cancer tissue with higher precision and interpretability

We applied our method to a single NSCLC tissue image generated using a 960-plex CosMx RNA panel on the Nanostring CosMx Spatial Molecular Imager platform, which is publicly available on the Nanostring website²⁵. We selected a Lung5-1 sample containing ~100,000 cells, with 38 cell types annotated using Azimuth²⁶ based on the human lung reference v1.0 (Fig. 2A).

**Fig. 2: *SpatialTopic* better detects tumor microenvironments in a Nanostring human non-small cell lung cancer tissue image.**

To illustrate the general tissue architecture, Fig. 2A displays the distribution of the top 10 main cell types and the expression patterns of key genes, including KRT17, C1QA, IL7R, TAGLN, and MS4A1. These genes serve as markers for tumor cells, macrophages, CD4 T cells, stroma cells, and B cells, respectively (Fig. 2B). Our results demonstrate that SpatialTopic identified seven distinct topics from the complex image (Fig. 2A), with each topic representing a unique spatial niche characterized by a specific cell-type composition, as detailed in Fig. 2C. For example, Topic 2 is predominantly composed of tumor cells, indicating the tumor region in the image. Topic 4 represents a stromal region with a high proportion of macrophages, as well as dendritic cells. Topic 6 consists of a mixture of myofibroblast and smooth muscle cells spatially located tightly around the tumor, with dense cellular structure limiting the interactions between tumor and immune cells. Topics 1, 5, and 7 are fibroblast-concentrated stroma regions, each enriched with a different immune cell type: macrophages, dendritic cells, and plasma cells, respectively.

Notably, Topic 3 captures lymphoid aggregate structures in this lung cancer tissue sample, consisting of B cells, CD4 T cells, and smaller proportions of dendritic cells and CD8 T cells. This composition aligns with the current understanding of cell types in TLSs, recognized as a promising biomarker for cancer immunotherapy across various cancer types, including NSCLC^8,9,10,11,27. TLSs are generally characterized as aggregates of B cells and other types of immune cells found in nonlymphoid tissue, and the presence of TLSs in tumor biopsies has been shown to be highly correlated with a better prognosis and clinical outcome upon immunotherapy²⁸. In a recent publication, we reported B-cell aggregates strongly associated with progression-free survival in patients with unresectable melanoma treated with immune checkpoint inhibitors¹⁴. Despite the importance of TLSs, reliably detecting them for clinical applications remains challenging because of their complex cell type composition and their variation in size and spatial location. SpatialTopic provides a flexible and efficient computational tool to address this need.

Cancer-associated fibroblasts (CAFs) are also an important component of the tumor microenvironment. A recent publication by Liu et al.²⁹ found four spatially distinct CAF subtypes, each exhibiting different transcriptomic profiles as a result of cellular interactions with their unique neighbors. On the Nanostring CosMx lung cancer dataset, the topics identified by SpatialTopic are highly correlated with the four distinct CAF spatial subtypes reported in that paper²⁹ (Supplementary Fig. 1). Specifically, Topic 6 is associated with s1-CAFs adjacent to the tumor; Topics 5 and 7 correspond to s2-CAFs within the stromal niche; Topic 1 is linked to s3-CAFs in the myeloid niche; and Topic 3, primarily comprising B cells and CD4 T cells along with some myofibroblasts, corresponds to s4-CAFs in the TLS niche. In addition, SpatialTopic offers a more comprehensive analysis by encompassing all cell types, not just CAFs, and is scalable to larger datasets. This allows for further exploration, such as testing for differential gene expression among cell types, like CAFs, across various tumor microenvironments (topics).

Moreover, we benchmarked the results from SpatialTopic against Seurat v5, Spatial-LDA, CytoCommunity, BankSY, and UTAG. BankSY and UTAG directly use cell-level gene expression as input, whereas the other four methods, including SpatialTopic, rely on cell-type annotations (Fig. 2A, D, E). All methods can detect the global structure of the lung cancer tissue and classify tumor and stromal regions. However, BankSY and UTAG appear to miss the TLSs, likely because they did not take advantage of the cell-type annotations. Moreover, both methods assume homogeneous gene expression patterns in the neighborhoods, while TLSs are a mixture of B cell and T cell subsets with highly mixed and variable transcriptomic profiles, which makes them difficult to detect for methods that rely on homogeneous gene expression assumptions. Reference-based cell type annotation typically offers more detailed information and can be more robust for noisy data when matched with a single-cell reference^30,31. SpatialTopic distinctly identified the lymphoid structure as Topic 3, comprising a mix of CD4 T cells and B cells (Fig. 2F). Additionally, when we focused on two local tumor tissue regions (Fig. 2D, E), SpatialTopic identified the tumor region with higher precision (Topic 2), more consistently matching the expression pattern of KRT17, a lung cancer marker gene. SpatialTopic and UTAG are the only two methods showing consistency (between tumor domain and KRT17 expression) higher than 0.8 across the entire image (Fig. 2G), which aligns with the visual impression in Fig. 2D, E. Additionally, we can evaluate the posterior probabilities of each cell’s assignment to various topics (Supplementary Fig. 2), as SpatialTopic employs a soft-clustering approach. Unlike hard clustering methods such as k-means, which categorically assign cells to clusters/domains, SpatialTopic quantifies uncertainty, providing a probabilistic measure of the statistical confidence for each cell’s topic assignment, and enriching the interpretation of the cell-topic relationships by revealing nuanced details that might otherwise be overlooked.
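To make the soft-clustering point concrete, the sketch below summarizes one cell's posterior topic probabilities as a hard label plus two uncertainty scores. The function name `assignment_uncertainty` and the use of normalized entropy are illustrative assumptions; the paper itself reports posterior probabilities (Supplementary Fig. 2) and does not prescribe these summaries.

```python
import math


def assignment_uncertainty(posterior):
    """Return (best topic, its probability, normalized entropy in [0, 1]).

    `posterior` is one cell's topic probabilities and must cover at least two topics.
    """
    best = max(range(len(posterior)), key=posterior.__getitem__)
    entropy = -sum(p * math.log(p) for p in posterior if p > 0)
    return best, posterior[best], entropy / math.log(len(posterior))


confident = assignment_uncertainty([1.0, 0.0, 0.0])  # deep inside one niche
ambiguous = assignment_uncertainty([0.5, 0.5])       # on a boundary between two niches
```

A hard-clustering method would report only the first element of each tuple, so the two cells above would look equally certain.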

SpatialTopic identifies tertiary lymphoid structures from whole-slide melanoma tissue imaging

We applied SpatialTopic to a whole-slide melanoma tissue image obtained from our previously published mIF imaging dataset, with a 12-plex marker panel¹⁴. This analysis covered a whole-slide soft tissue image containing 0.4 million cells, annotated into seven major cell types (CD4 T cells, Tumor/Epithelial, B cells, CD8 T cells, Macrophages, Regulatory T (Treg) cells, and Others). The categorization was based on the expression of six lineage markers: PanCK/SOX10, CD3, CD8, CD20, CD68, and FoxP3. Cells were annotated as “Other” if they showed negative expression for all six markers.

Despite fewer identified cell types (due to fewer markers) compared to the Nanostring CosMx platform, SpatialTopic identified five distinct topics (Fig. 3A): Topic 1 (tumor), Topic 2 (CD4 immune zone), Topic 3 (stroma), Topic 4 (immune-enriched tumor-stroma boundary), and Topic 5 (TLSs). The tissue structures revealed by these topics visually correspond to the histological pattern seen in the co-registered Hematoxylin and Eosin (H&E) image (Fig. 3B) and the merged raw mIF images (Fig. 3C) with three key markers: CD3 (T cells), CD20 (B cells), and PanCK/SOX10 (tumor cells). Figure 3D demonstrates that topic 5 (TLSs) mainly consists of B cells, CD4 T cells, a few CD8 T cells, and Treg cells, consistent with the TLS-like pattern identified in the Nanostring dataset discussed earlier. Due to the lack of a dendritic cell marker in the mIF dataset, dendritic cells could not be identified and included in topic 5. This analysis demonstrates that SpatialTopic can consistently detect the same biologically relevant patterns across various tumor tissues and imaging platforms, which may be clinically significant, as TLS has been recognized as a promising biomarker for cancer immunotherapy²⁸.

**Fig. 3: *SpatialTopic* identifies tertiary lymphoid structures from a whole-slide melanoma tissue sample.**

SpatialTopic recovers spatial domain from cell type spatial organization in healthy lung tissue

We further demonstrate that SpatialTopic can effectively distill signals from noisy cell type annotations and identify clear tissue architecture based solely on the spatial arrangement of cells. To illustrate this, we applied SpatialTopic to the IMC dataset from the UTAG paper¹⁵, which includes 26 small region-of-interest (ROI) images from healthy lung tissue. For comparison, we used the UTAG result provided in the paper¹⁵ without rerunning UTAG.

Our analysis shows that SpatialTopic can recover tissue architectures directly from the spatial distribution of cell type annotations, yielding results consistent with manual annotations (Fig. 4A). SpatialTopic performs comparably to UTAG using only cell type annotations (Fig. 4B), as indicated by similar adjusted Rand index values. Additionally, Fig. 4C illustrates the topic content and the cell type composition for each topic identified by SpatialTopic. This demonstrates SpatialTopic’s capability to perform domain analysis without discarding existing cell-type annotations, offering valuable flexibility for datasets with cell-type annotations or for incorporating any existing cell-type annotation method. Unlike UTAG, which learns spatial tissue architecture directly from cell features due to noisy cell type annotations, we show that SpatialTopic can effectively identify tissue architecture from these annotations. Thus, SpatialTopic is a robust alternative that leverages existing data without the need for additional cell-level features. We note that this dataset is challenging, and we had to increase the number of initializations to find a good starting point for SpatialTopic.
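For readers unfamiliar with the adjusted Rand index (ARI) used in this comparison, the following is a small self-contained sketch of the standard pair-counting formula. The function name and toy labels are illustrative, and the snippet is not the evaluation code used in the paper; it assumes at least two cells.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """ARI between two labelings of the same cells (1 = identical partitions)."""
    n = len(labels_a)
    # Pairs of cells grouped together in both labelings, in A only, and in B only.
    together = sum(comb(v, 2) for v in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(v, 2) for v in Counter(labels_a).values())
    pairs_b = sum(comb(v, 2) for v in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)  # agreement expected by chance
    maximum = (pairs_a + pairs_b) / 2
    if maximum == expected:                    # degenerate case, e.g. one cluster each
        return 1.0
    return (together - expected) / (maximum - expected)


manual = ["airway", "airway", "vessel", "vessel"]
same_partition = adjusted_rand_index(manual, [2, 2, 1, 1])  # label names do not matter
crossed = adjusted_rand_index(manual, [0, 1, 0, 1])
```

The index depends only on which cells are grouped together, so topic numbers never need to be matched to manual domain names before scoring.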

**Fig. 4: *SpatialTopic* recovers spatial domain architecture from cell-type spatial layout in healthy lung.**

SpatialTopic identifies disease-specific topics and tracks topic evolution in mouse spleen over disease progression

We also applied SpatialTopic to a CODEX mouse spleen dataset² to demonstrate its proficiency in identifying spatial topics across multiple images. This dataset includes nine images: three control normal BALBc spleens (BALBc 1–3) and six MRL/lpr spleens (MRL/lpr 4–9) at varying disease stages–early (MRL/lpr 4–6), intermediate (MRL/lpr 7–8), and late (MRL/lpr 9) of systemic autoimmune disease (Fig. 5A). Using a 30-plex protein marker panel, the study identified 27 major splenic-resident cell types across the nine tissue images. We use the cell type annotation provided in the original paper².

**Fig. 5: *SpatialTopic* captures the main dynamics in tissue architecture of normal and diseased mouse spleen.**

SpatialTopic identified six topics from ~0.7 million cells across the nine images, highlighting the dramatic changes in spatial tissue structures associated with disease progression from normal spleen to spleen tissue at different disease stages (Fig. 5A). Figure 5B, C highlights per-topic cell type compositions aiding in labeling each topic. The normal spleen tissue samples predominantly comprised three topics: Topic 1 (red pulp), Topic 2 (periarteriolar lymphoid sheath, PALS), and Topic 3 (B-follicle). Supplementary Figs. 3 and 4 show the cell type distribution and domain annotations from the original paper, demonstrating SpatialTopic’s ability to capture the main structures consistent with these annotations, as compared to other methods (Supplementary Figs. 3 and 5). With an increasing number of topics, SpatialTopic also successfully delineated the marginal zone from the B-follicle (Supplementary Fig. 3).

One key advantage of SpatialTopic is its ability to identify topics jointly across all images, ensuring that topics are comparable across normal and diseased spleens. This allows us to identify condition-specific topics and quantify changes in topic proportions as the disease progresses. Diseased tissues often become disorganized, posing additional challenges to delineate spatial features compared to normal tissues. For instance, in contrast to normal spleens, the red pulp region in lupus-affected spleens (MRL/lpr) shows early signs of reorganization. These spleens exhibit an increase in granulocytes and erythroblasts, indicative of lupus-related splenic hematopoiesis and potentially leading to splenomegaly, in contrast to the B cells and F4/80(+) macrophages typically observed in normal spleens^2,32,33. With SpatialTopic, this alteration is marked by the dominance of Topic 6 (Erythromyeloid niche) in lupus spleens, which supersedes Topic 1, the prevalent topic in normal spleen tissue. Other methods, including Spatial-LDA, UTAG, and BankSY, did not detect this change in the original red pulp region (Supplementary Fig. 5), partly due to overcorrection of batch effects. Topic 4 (CD106+ stroma niche), emerging in the lupus spleen, is characterized by a high abundance of CD106+ stroma cells, which attract immune cells to the inflamed areas^2,34, and is enriched with CD4 and CD8 T cells. Notably, plasma cells, a disease-related B-cell subtype, are also uniquely enriched in Topic 4 and appear consistently across different disease stages. Plasma cells are a representative cell type for this topic, as identified by both lift and FREX metrics (Supplementary Fig. 6). A high abundance of plasma cells is often observed in lupus-affected tissue, such as the spleen. Therapeutic strategies aimed at eliminating plasma cells have demonstrated efficacy in patients with refractory systemic lupus erythematosus³⁵. Topic 5 (Double Negative T cell niche), also unique to lupus spleens, features an enrichment of B220+ double negative (DN) T cells, as well as conventional CD4 T cells, and is more likely to be seen in the advanced stages of the disease, as compared to Topics 4 and 6. The expansion of DN T cells is associated with disease progression, not only in MRL/lpr mice but also in patients with systemic lupus erythematosus^36,37. These dynamic changes in the spleen tissue architecture indicate a significant reorganization of the immune landscape, reflecting immune dysregulation as systemic lupus erythematosus progresses².
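Because topics are fitted jointly, per-image topic proportions can be computed directly from the per-cell topic assignments and compared across conditions. The sketch below shows one simple way to do this; the function name, image labels, and toy assignments are hypothetical and are not taken from the dataset.

```python
from collections import Counter, defaultdict


def topic_proportions(image_ids, topic_assignments, n_topics):
    """Fraction of cells assigned to each topic, computed separately per image."""
    counts = defaultdict(Counter)
    for image, topic in zip(image_ids, topic_assignments):
        counts[image][topic] += 1
    return {image: [c[k] / sum(c.values()) for k in range(n_topics)]
            for image, c in counts.items()}


# Toy input: four cells from a control image and four from a diseased image.
images = ["control"] * 4 + ["diseased"] * 4
topics = [0, 0, 0, 1, 2, 2, 1, 2]
proportions = topic_proportions(images, topics, n_topics=3)
```

Vectors like these, one per image, are the kind of summary that can then be tracked across disease stages or associated with outcomes.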

Furthermore, SpatialTopic identifies topics based on the spatial proximity of cell types, implying that cell types grouped within the same topic are likely co-localized and prone to interaction. Figure 5D illustrates the changes in topic proportions throughout the course of the disease progression. The distinct contributions of cell types to each topic are highlighted in Fig. 5E, mainly selected based on the lift and FREX metrics^38,39 (Supplementary Fig. 6), as well as cell type composition. Cell types are clustered into topics that exhibit similar dynamics across different slides, which provides further insights into cell–cell interactions in both normal and diseased tissues.
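As an illustration of how representative cell types can be ranked, the sketch below computes lift in its common form: a topic's probability for a cell type divided by that cell type's overall frequency. FREX, which additionally balances frequency against exclusivity, is not shown. The function name and the toy numbers are assumptions for illustration, and the exact definitions used in the paper are those of the cited references^38,39.

```python
def lift(beta, type_frequency):
    """Lift of each cell type in each topic: beta[k][c] / overall frequency of type c."""
    return [[b / f for b, f in zip(row, type_frequency)] for row in beta]


# Toy example: two topics over three cell types.
beta = [[0.8, 0.1, 0.1],
        [0.1, 0.1, 0.8]]
type_frequency = [0.5, 0.25, 0.25]  # share of each cell type among all cells
scores = lift(beta, type_frequency)
most_representative = [max(range(len(row)), key=row.__getitem__) for row in scores]
```

Lift up-weights rare cell types that concentrate in one topic, which is why a cell type can be representative of a topic without being its most abundant member.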

SpatialTopic is highly scalable on large-scale modern images

To benchmark the scalability of SpatialTopic as the number of cells in images increases, we conducted tests using simulated datasets of varying scales. Figure 6A shows that SpatialTopic significantly outperforms Seurat v5 in terms of scalability with an increasing number of cells within a single image. Moreover, as demonstrated in Fig. 6B, SpatialTopic shows high efficiency on real large-scale imaging datasets, requiring less user time compared to other methods. For example, on the Nanostring CosMx NSCLC image with ~0.1 million cells, SpatialTopic runs within 1 min on a standard MacBook Air, a speed that none of the other benchmarked methods matched.

**Fig. 6: *SpatialTopic* is scalable to large-scale images and can be run on a regular laptop within minutes.**

Across all datasets, SpatialTopic ranks in the highest tier with Seurat v5, while BankSY and UTAG fall into a second tier due to their reliance on similar but less optimized strategies. CytoCommunity, limited by its dependency on GPU support and memory demands, was run with reduced epochs and CPU-only for the NSCLC dataset, which compromised its performance and underscored its impracticality for large-scale imaging analysis in labs without extensive computing resources. Additionally, on images with more than 0.1 million cells, both UTAG and CytoCommunity had to be run on high-performance computing servers due to their high memory demands. In contrast, SpatialTopic is highly scalable for large-scale imaging analysis, with all analyses done within minutes on a standard laptop.

Discussion

In summary, we introduced SpatialTopic, a spatial topic model designed to identify and quantify biologically relevant topics across multiple multiplexed tissue images. This unique computational approach leverages language modeling techniques to decipher the tissue microenvironment from tissue imaging data. SpatialTopic stands out as one of the few unsupervised learning methods capable of discerning clinically relevant spatial patterns^15,19,21. Unlike other methods that rely on hard clustering strategies for analyzing samples, SpatialTopic is a probabilistic model-based approach using Bayesian inference to identify complex tissue architectures. The model generates two key outputs. The first, the topic content, maps the cell type composition in spatial niches, allowing direct interpretation of the corresponding topic (e.g., TLSs). The second, the topic assignment for each single cell, allows the quantification of each topic in individual tissue samples for subsequent association analysis with patient outcome. Application to multiple datasets along with benchmark analysis shows that SpatialTopic achieves higher precision in defining global and local spatial niches and higher sensitivity at capturing complex structures such as TLSs. Notably, our method is highly scalable to large-scale imaging data with efficient runtime, handling millions of cells on a standard laptop.

SpatialTopic is designed as a flexible spatial analysis module within the current imaging analysis workflow. A standard image analysis pipeline includes cell segmentation, data normalization/batch correction, cell phenotyping/clustering, and the analysis of cell type content and spatial relationships. Downstream statistical analysis typically starts with cell-level metadata derived from image analysis. Due to varied marker panels and molecular imaging platforms, a one-size-fits-all solution for cell phenotyping across diverse platforms seems unlikely. In practice, we find that reference-based cell annotation works better than unsupervised clustering on single-cell imaging data because the data are highly noisy. SpatialTopic does not specify any upstream method and thus can be seamlessly integrated with other cell phenotyping modules tailored for datasets from different platforms. This design offers users adaptability, accommodating datasets from different panel designs.

In our proposed analysis pipeline for imaging data, we separate cell phenotyping from cell neighborhood/domain analysis for image-based spatial data, with SpatialTopic directly taking cell types as input. This key difference sets SpatialTopic apart from UTAG and BankSY, which use protein/gene expression as input for niche/domain analysis. UTAG performs dimension reduction before message passing, while BankSY engineers new spatial features for each cell before dimension reduction. We propose that treating cell phenotyping and neighborhood/domain analysis as distinct steps is a better analysis strategy for datasets generated by image-based technology with selected marker panels. Using cell type annotations as input for cell neighborhood analysis enhances the interpretability of different tissue microenvironments and undoubtedly increases the computational efficiency when analyzing large-scale images. The performance of SpatialTopic may rely on the accuracy of cell phenotyping. A better strategy for cell phenotyping is to annotate cells directly from cell images instead of using summary statistics, such as mean marker expression or gene count data. As part of the analysis pipeline, we are developing an image-based deep learning method for cell phenotyping, incorporating subcellular information, as well as domain knowledge⁴⁰.

For multi-sample analysis, addressing the batch effect is a key challenge. Our proposed analysis pipeline seeks to mitigate the batch effect during cell phenotyping using a reference-based cell phenotyping method. For spatial transcriptomics data, a supervised classification method with a reliable single-cell reference can mitigate batch effects and inherent noise in the imaging data. The batch effect is more critical for algorithms that directly take gene expression data as input. When analyzing the mouse spleen dataset, we used Combat⁴¹ for batch correction across multiple images before applying UTAG and BankSY. However, Combat appears to over-correct for batch effects (Supplementary Fig. 5), thus failing to distinguish between normal and diseased red pulp tissue and ignoring key players in diseased tissues. This might stem from the substantial differences between normal and diseased tissues.

Modern datasets from platforms such as 10× Xenium and Nanostring CosMx require scalable computational methods to handle their size and complexity. Existing spatial domain analysis methods, originally designed for 10× Visium spatial transcriptomics data and optimized for datasets with thousands of cells or spots per slide, struggle to handle these more advanced datasets with millions of cells per image. SpatialTopic meets this need by efficiently managing neighborhood calculations and constructing the KNN graph only among m anchor cells instead of all n cells in the image. This reduces the time complexity of constructing KNN graphs from $O(n\log n)$ to $O(m\log m)$, and the time complexity of finding the closest anchor cell for each cell from $O(\log n)$ to $O(\log m)$, where m ≪ n. Furthermore, SpatialTopic maintains linear time complexity relative to the number of cells and iterations with collapsed Gibbs sampling and adopts an efficient approach for K-nearest-neighbor searching. These optimizations ensure SpatialTopic’s computational efficiency, making it usable on standard laptops and practical for analyzing large-scale imaging data from platforms such as 10× Xenium and Nanostring CosMx.

Moreover, advances in technology now enable the quantification of immune cell spatial diversity and the characterization of tumor microenvironments in three-dimensional (3D) tissues⁴². While SpatialTopic can be adapted to infer immunological topics from 3D tissue, a refined strategy is needed to select anchor cells in 3D space, as the spatial information obtained by SpatialTopic primarily relies on the relationships between anchor cells and other cells. In Supplementary Fig. 7, we demonstrate the applicability of SpatialTopic to a 3D spleen tissue image reconstructed from multiple tissue sections. To further improve the performance and applicability of SpatialTopic, several strategies could be pursued in the future. For instance, incorporating a hierarchical Dirichlet prior on topic distributions across regions would allow regions within the same image to share priors while allowing variability across different images. Furthermore, optimizing the initialization strategy is essential for applying SpatialTopic to population-scale datasets comprising hundreds or even thousands of images. These improvements would broaden the applicability and robustness of SpatialTopic.

Methods

SpatialTopic

Notations

We assume there are a total of V cell types that contribute to K different tissue microenvironments (topics) across G multiplexed images. Let c_gi be the ith cell at the location ${{\boldsymbol{x}}}_{gi}^{c}=({x}_{gi1}^{c},{x}_{gi2}^{c}),g=1,2,\ldots,G,i=1,2,\ldots,{n}_{g}$, on the gth image, which contains a total of n_g cells. Let c_gi = v if the cell has been classified to the vth cell type. Let ${\boldsymbol{\mathcal{C}}}={\{{c}_{gi}\}}_{i=1,2,\ldots,{n}_{g}}^{g=1,2,\ldots,G}$ and ${\boldsymbol{\mathcal{X}}}^{c}={\{{{\boldsymbol{x}}}^c_{gi}\}}_{i=1,2,\ldots,{n}_{g}}^{g=1,2,\ldots,G}$ denote all observed cell types and cell locations across all G images.

Model

In a conventional LDA model, each image is treated as an individual document, employing a bag-of-words approach without accounting for spatial information. This approach is similar to our prior work on longitudinal flow cytometry data analysis³⁸. Here, in order to incorporate spatial information within images, we introduce a spatial topic model, SpatialTopic, integrating spatial data into the foundational LDA framework. This spatial topic framework was originally proposed for image segmentation²⁴. Instead of viewing each image as a single document, we treat each image as a collection of densely placed, overlapping regions (documents). Unlike the conventional LDA model, where relationships between documents and words are known and fixed, the word-document relationship here is unknown: each cell (word) may be assigned to any of the possible regions (documents). This flexible region (document) design allows us to identify spatial structures with irregular shapes.

For SpatialTopic, we introduce an additional hidden variable, D_gi, to denote cell region (document) assignment. Thus, each cell is associated with two hidden variables: the latent topic assignment Z_gi ∈ {1, 2, …, K } and the latent region assignment D_gi ∈ {1, 2, …, M}, M = ∑_gM_g, where M_g denotes the number of regions on the image g. During the initialization, we pre-selected anchor cells as region centers. Let ${{\boldsymbol{\mathcal{X}}}}^{d}={\{{{\boldsymbol{x}}}_{d}^{d}\}}_{d=1,2,\cdots,M}$ be the set of all M region centers across all images. Let θ_d be the proportion of region d over K topics and β_k be the proportion of topic k over V cell types. Hyperparameters ψ and α specify the nature of the Dirichlet priors of {β_k} and {θ_d}, respectively.

Then we are ready to describe our generative model:

For each topic k, sample β_k (topic weights over V cell types) from a Dirichlet prior β_k ~ Dir(ψ).
For each image region d (centered at ${{\boldsymbol{x}}}_{d}^{d}$), sample topic proportion θ_d ~ Dir(α)
For each cell, the ith cell in the image g:
- Sample its region assignment D_gi from a uniform prior over possible documents (regions) in the image g.
- Sample the location ${{\boldsymbol{x}}}_{gi}^{c}$ conditional on its region assignment D_gi with a kernel function based on the distance between the cell location ${{\boldsymbol{x}}}_{gi}^{c}$ and the region center ${{\boldsymbol{x}}}_{d}^{d}$.
  $${{\boldsymbol{x}}}_{gi}^{c}| {D}_{gi}=d\propto K({{\boldsymbol{x}}}_{gi}^{c},{{\boldsymbol{x}}}_{d}^{d}).$$
- Sample topic assignment Z_gi∣D_gi = d ~ Multi(θ_d, 1).
- Sample cell type c_gi∣Z_gi = k ~ Multi(β_k, 1).

Hyperparameters α and ψ should be chosen based on the belief on {θ_d} and {β_k} in a Bayesian perspective. In our application, both α and ψ are set very small by default (default: α_k = 0.01, ∀k; ψ_v = 0.05, ∀v) to encourage the sparsity in region-topic distributions {θ_d} and topic-celltype distributions {β_k}.
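The generative process described above can be sketched as a small forward simulation. This is an illustrative sketch only, not the SpaTopic implementation: the function names, the toy region centers, and the value of sigma are assumptions, all regions are assumed to lie on a single image, and symmetric priors are used. For the location step, the sketch assumes the exponential kernel without the nearest-neighbor restriction, for which the distance from the region center follows a Gamma(2, σ) distribution in two dimensions.

```python
import math
import random

def sample_dirichlet(alpha, rng):
    """Draw from Dir(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    if total == 0.0:  # underflow guard for very small concentration values
        draws[rng.randrange(len(draws))] = 1.0
        total = 1.0
    return [x / total for x in draws]

def simulate_cells(centers, n_cells, K, V, sigma, alpha=0.01, psi=0.05, seed=0):
    """Forward-simulate cells from the generative model (toy, single image)."""
    rng = random.Random(seed)
    beta = [sample_dirichlet([psi] * V, rng) for _ in range(K)]     # topic -> cell types
    theta = [sample_dirichlet([alpha] * K, rng) for _ in centers]   # region -> topics
    cells = []
    for _ in range(n_cells):
        d = rng.randrange(len(centers))            # uniform prior over regions
        radius = rng.gammavariate(2.0, sigma)      # density proportional to r * exp(-r / sigma)
        angle = rng.uniform(0.0, 2.0 * math.pi)
        x = (centers[d][0] + radius * math.cos(angle),
             centers[d][1] + radius * math.sin(angle))
        k = rng.choices(range(K), weights=theta[d])[0]   # Z | D = d ~ Multi(theta_d, 1)
        v = rng.choices(range(V), weights=beta[k])[0]    # c | Z = k ~ Multi(beta_k, 1)
        cells.append({"x": x, "region": d, "topic": k, "cell_type": v})
    return beta, theta, cells

beta, theta, cells = simulate_cells([(0.0, 0.0), (500.0, 0.0)],
                                    n_cells=50, K=3, V=4, sigma=20.0)
```

With the small default values of α and ψ, most simulated regions are dominated by a single topic and most topics by a few cell types, which is the sparsity the priors are meant to encourage.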

Nearest-neighbor exponential kernel

The flexible relationships between regions and cells in SpatialTopic allow each cell to be assigned to any one of its proximate regions. We employ a nearest-neighbor kernel to capture the spatial correlation between cells and their respective regions, as previously used in the nearest-neighbor Gaussian process⁴³. For computational efficiency, especially with large-scale images, we restrict our consideration to the top nearest-neighbor regions for each cell. Let ${\mathcal{N}}({{\boldsymbol{x}}}^c_{gi})\subset {{\boldsymbol{\mathcal{X}}}}^{d}$ be the collection of the m closest region centers to the cell ${{\boldsymbol{x}}}_{gi}^{c}$ (default: m = 5), and let σ be the length scale that controls how fast correlation decays with distance in the kernel function. In practice, the commonly used squared exponential (Gaussian) kernel function decays too rapidly, which often results in cells being linked predominantly to their closest region, irrespective of their cell types. Thus, drawing inspiration from ref. ⁴⁴, we used the following exponential kernel instead of the squared exponential kernel,

$$K\left({{\boldsymbol{x}}}_{gi}^{c},{{\boldsymbol{x}}}_{d}^{d}\right)\propto {\mathbb{1}}\left\{{{\boldsymbol{x}}}_{d}^{d}\in {\mathcal{N}}\left({{\boldsymbol{x}}}_{gi}^{c}\right)\right\}\exp \left\{-| | {{\boldsymbol{x}}}_{gi}^{c}-{{\boldsymbol{x}}}_{d}^{d}| {| }_{2}/\sigma \right\},$$

(1)

where $| | {{\boldsymbol{x}}}_{gi}^{c}-{{\boldsymbol{x}}}_{d}^{d}| {| }_{2}$ represents the Euclidean distance between the cell location ${{\boldsymbol{x}}}_{gi}^{c}$ and the region center ${{\boldsymbol{x}}}_{d}^{d}$. We fix σ for computational efficiency, but it may be sampled during the Gibbs sampling. Increasing σ would reduce the strength of the spatial correlation, resulting in a diminished spatial effect when assigning cells to regions.
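As a minimal sketch of Eq. (1), the function below turns the distances from one cell to candidate region centers into normalized kernel weights over the m nearest centers. The function name and the input layout (a dictionary from region index to distance) are assumptions made for illustration; normalizing the weights is a convenience that the proportionality in Eq. (1) permits.

```python
import math

def kernel_weights(anchor_dists, sigma, m=5):
    """Normalized Eq. (1) weights over the m nearest region centers.

    anchor_dists maps a region index to its Euclidean distance from the cell.
    """
    nearest = sorted(anchor_dists.items(), key=lambda kv: kv[1])[:m]
    d_min = nearest[0][1]
    # Subtracting d_min rescales every weight by the same factor, which the
    # normalization cancels, and avoids underflow when distances are large
    # relative to sigma.
    raw = {j: math.exp(-(d - d_min) / sigma) for j, d in nearest}
    total = sum(raw.values())
    return {j: w / total for j, w in raw.items()}

dists = {0: 10.0, 1: 20.0, 2: 1000.0}
near = kernel_weights(dists, sigma=10.0, m=2)   # short length scale
flat = kernel_weights(dists, sigma=100.0, m=2)  # long length scale
```

In this toy case the nearest center receives a larger share of the weight under the shorter length scale, which illustrates the statement that increasing σ weakens the spatial effect.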

Collapsed Gibbs Sampling

We use collapsed Gibbs sampling for model inference. The collapsed Gibbs sampling algorithm was originally introduced as a Bayesian approach to LDA⁴⁵. A comprehensive derivation and implementation of this method can be found in ref. ⁴⁶. Similar to ref. ²⁴, we further adapted and extended the algorithm for our proposed spatial topic model. Notably, during the collapsed Gibbs sampling process, the parameters β_k and θ_d are integrated out and are not explicitly sampled. Instead, our focus is on the two hidden variables associated with each cell: the topic assignment Z_gi and the region (or document) assignment D_gi. These variables undergo iterative sampling using the collapsed Gibbs sampler:

1. Sample topic assignment Z_gi conditional on region assignment D_gi with⁴⁵
$$P({Z}_{gi}= k\;|\; {D}_{gi}=d,{c}_{gi}=v,{{\boldsymbol{\mathcal{D}}}}_{-gi},{{\boldsymbol{\mathcal{Z}}}}_{-gi},{{\boldsymbol{\mathcal{C}}}}_{-gi},{\boldsymbol{\psi }},{\boldsymbol{\alpha }})\\ \propto \frac{{n}_{k,-gi}^{(v)}+{\psi }_{v}}{\mathop{\sum }\nolimits_{t=1}^{V}{n}_{k,-gi}^{(t)}+{\psi }_{t}}\frac{{n}_{d,-gi}^{(k)}+{\alpha }_{k}}{\mathop{\sum }\nolimits_{{k}^{{\prime} }=1}^{K}{n}_{d,-gi}^{({k}^{{\prime} })}+{\alpha }_{{k}^{{\prime} }}}$$
(2)
where ${n}_{k,-gi}^{(v)}$ refers to the number of times that cell type v has been observed with topic k and ${n}_{d,-gi}^{(k)}$ refers to the number of times that topic k has been observed in region d, both excluding the current cell gi, the ith cell on the gth image. The first ratio expresses the probability of cell type v under topic k, and the second ratio expresses the probability of topic k in region d. ${{\boldsymbol{\mathcal{D}}}}_{-gi}$, ${{\boldsymbol{\mathcal{Z}}}}_{-gi}$, and ${{\boldsymbol{\mathcal{C}}}}_{-gi}$ denote collections of ${\boldsymbol{\mathcal{D}}}$, ${\boldsymbol{\mathcal{Z}}}$, and ${\boldsymbol{\mathcal{C}}}$ excluding cell gi.
2. Sample D_gi conditional on Z_gi with
$$P({D}_{gi}=\, d\;|\; {Z}_{gi}=k,{{\boldsymbol{\mathcal{D}}}}_{-gi},{{\boldsymbol{\mathcal{Z}}}}_{-gi},{{\boldsymbol{x}}}_{gi}^{c},{{\boldsymbol{x}}}_{d}^{d},{\boldsymbol{\alpha }},\sigma )\\ \propto \, P({Z}_{gi}=k\;|\; {{\boldsymbol{\mathcal{Z}}}}_{-gi},{D}_{gi}=d,{{\boldsymbol{\mathcal{D}}}}_{-gi},{\boldsymbol{\alpha }})P({{\boldsymbol{x}}}_{gi}^{c}\;|\; {D}_{gi}=d,{{\boldsymbol{x}}}_{d}^{d},\sigma )P({D}_{gi}=d)$$
According to ref. ⁴⁶, $P({Z}_{gi}=k\;|\; {{\boldsymbol{\mathcal{Z}}}}_{-gi},{D}_{gi}=d,{{\boldsymbol{\mathcal{D}}}}_{-gi},{\boldsymbol{\alpha }})$ can be obtained by integrating out θ_d:
$$P({Z}_{gi}=k\;|\; {{\boldsymbol{\mathcal{Z}}}}_{-gi},{D}_{gi}=d,{{\boldsymbol{\mathcal{D}}}}_{-gi},{\boldsymbol{\alpha }})=\frac{{n}_{d,-gi}^{(k)}+{\alpha }_{k}}{\mathop{\sum }\nolimits_{{k}^{{\prime} }=1}^{K}{n}_{d,-gi}^{({k}^{{\prime} })}+{\alpha }_{{k}^{{\prime} }}}.$$
We can further omit P(D_gi = d) due to uniform prior. Thus D_gi can be sampled based on the following conditional distribution:
$$P({D}_{gi}=\, d\;|\; {Z}_{gi}=k,{{\boldsymbol{\mathcal{D}}}}_{-gi},{{\boldsymbol{\mathcal{Z}}}}_{-gi},{{\boldsymbol{x}}}_{gi}^{c},{{\boldsymbol{x}}}_{d}^{d},{\boldsymbol{\alpha }},\sigma )\\ \propto \, K({{\boldsymbol{x}}}_{gi}^{c},{{\boldsymbol{x}}}_{d}^{d})\,\frac{{n}_{d,-gi}^{(k)}+{\alpha }_{k}}{\mathop{\sum }\nolimits_{{k}^{{\prime} }=1}^{K}{n}_{d,-gi}^{({k}^{{\prime} })}+{\alpha }_{{k}^{{\prime} }}}$$
(3)

Initialization

During initialization, we employ a spatially stratified sampling approach to randomly select anchor cells from each image, which serve as region centers. The number of anchor cells per image is determined by a predefined region radius r (default: r = 400) and the image size. The choice of r should take into account both image resolution and tissue complexity. To ensure an accurate estimation of the topic distribution θ_d, each region should contain a sufficient number of cells. In practice, for whole-slide imaging, we expect at least 100 cells per region on average, which guides our selection of region radius r for each dataset.
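One simple way to realize spatially stratified anchor selection is sketched below. This is an assumption for illustration, not necessarily the scheme used in SpaTopic: the image is tiled with squares whose side equals the region radius r, and one cell is drawn at random from every non-empty tile. The function name and the tiling rule are hypothetical.

```python
import random
from collections import defaultdict

def select_anchor_cells(coords, region_radius=400.0, seed=0):
    """Pick one random anchor cell per non-empty square tile of side region_radius."""
    rng = random.Random(seed)
    tiles = defaultdict(list)
    for i, (x, y) in enumerate(coords):
        tile = (int(x // region_radius), int(y // region_radius))
        tiles[tile].append(i)
    # One anchor per tile keeps the anchors spread evenly over the tissue.
    return sorted(rng.choice(members) for members in tiles.values())

# Toy image: a regular 20 x 20 lattice of cells with spacing 50.
coords = [(float(x), float(y)) for x in range(0, 1000, 50) for y in range(0, 1000, 50)]
anchors = select_anchor_cells(coords, region_radius=400.0, seed=1)
cells_per_region = len(coords) / len(anchors)
```

Comparing `cells_per_region` with the guideline of at least 100 cells per region gives a quick check on whether the chosen radius is large enough for a given dataset; in this toy lattice it is not, so a larger radius would be indicated.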

Since different imaging platforms may report spatial coordinates in either pixels or microns, users are advised to adjust parameters accordingly. In addition, the length scale σ, another critical parameter controlling the strength of the spatial effect, should be tuned in conjunction with the region radius r. Empirically, we have found that setting $\sigma \approx \sqrt{r}$ often works well. However, the optimal value may vary depending on the structural complexity of the imaging data. A smaller σ benefits the identification of local structure, while a larger σ supports global structure. We recommend that users tune the two parameters on a small subset of images before applying them to the whole dataset. For each image, we construct an m-nearest-neighbor graph linking all cells to the selected anchor cells. Specifically, for each cell, its m closest anchor cells are identified. For computational efficiency, distances between each cell and its top m nearest anchor cells are pre-computed before Gibbs sampling.
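The pre-computed m-nearest-neighbor graph between cells and anchor cells can be sketched as follows. For clarity this sketch uses a brute-force search over all anchors for every cell; a practical implementation would use a spatial index such as a k-d tree. The function name and the output layout are assumptions.

```python
import heapq
import math

def nearest_anchor_graph(cells, anchors, m=5):
    """For each cell, list its m nearest anchors as (anchor_index, distance) pairs."""
    graph = []
    for c in cells:
        nearest = heapq.nsmallest(
            m, ((math.dist(c, a), j) for j, a in enumerate(anchors)))
        graph.append([(j, d) for d, j in nearest])
    return graph

cells = [(0.0, 0.0), (10.0, 0.0)]
anchors = [(0.0, 0.0), (10.0, 0.0), (100.0, 100.0)]
graph = nearest_anchor_graph(cells, anchors, m=2)
```

Because the distances are stored once, the sampler only needs to evaluate the kernel on m candidates per cell at each iteration.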

The performance of SpatialTopic may be sensitive to the initialization of anchor cells, especially on images with highly complex spatial organization. To address this, we take a warm start strategy instead of starting Gibbs sampling from a single random initialization. This involves running multiple short Gibbs sampling chains during initialization (default: ninit = 10), each with a unique set of randomly-selected anchor cells. After a few iterations (default: niter_init = 100), only the one with the highest log-likelihood is retained and continued.
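The warm start strategy amounts to a best-of-several selection over short chains, which can be sketched as below. Here `run_short_chain` is a hypothetical callable standing in for one short Gibbs run with its own random anchor set; it is assumed to return a sampler state and that state's log-likelihood. The defaults mirror the values quoted above.

```python
import random

def warm_start(run_short_chain, n_init=10, niter_init=100, seed=0):
    """Run n_init short chains and keep the one with the highest log-likelihood."""
    rng = random.Random(seed)
    best_state, best_ll = None, float("-inf")
    for _ in range(n_init):
        # Each chain gets its own seed, hence its own random anchor cells.
        state, ll = run_short_chain(seed=rng.randrange(2**31), n_iter=niter_init)
        if ll > best_ll:
            best_state, best_ll = state, ll
    return best_state, best_ll
```

The retained state would then be passed to the main sampler, which continues from it rather than from a fresh random initialization.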

Implementation

We implemented SpatialTopic in Rcpp and released it as the R package SpaTopic (officially available on CRAN since Jan 17, 2024). The complete algorithm is shown in Box 1. For the Gibbs sampling, we have set the default parameters as follows: iter = 200, burnin = 1000, thin = 20 (200 Gibbs sampling draws are made with the first 1000 iterations discarded and then every 20th iteration kept). We can infer topic distributions across all images using the posterior samples drawn from the Gibbs sampling. For each of these posterior samples, both parameters {β_k} and {θ_d} can be estimated as follows:

$${\hat{\beta }}_{kv}=\frac{{n}_{k}^{(v)}+{\psi }_{v}}{\mathop{\sum }\nolimits_{t=1}^{V}{n}_{k}^{(t)}+{\psi }_{t}},$$

(4)

$${\hat{\theta }}_{dk}=\frac{{n}_{d}^{(k)}+{\alpha }_{k}}{\mathop{\sum }\nolimits_{{k}^{{\prime} }=1}^{K}{n}_{d}^{({k}^{{\prime} })}+{\alpha }_{{k}^{{\prime} }}}.$$

(5)

Moreover, we also keep the posterior distribution of Z_gi from all posterior samples for each individual cell. Notably, D_gi has been marginalized during this process and each cell in the end is assigned to the topic with the highest posterior probability. Thus we are also able to visualize the spatial distribution of cell topics in the images.
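The estimators in Eqs. (4) and (5) and the final per-cell topic call can be sketched as follows, assuming symmetric priors (a single scalar ψ and α) and count matrices stored as nested lists. The function names and data layout are illustrative assumptions.

```python
def estimate_beta(n_kv, psi):
    """Eq. (4): rows are topics, columns are cell types."""
    V = len(n_kv[0])
    return [[(n + psi) / (sum(row) + V * psi) for n in row] for row in n_kv]

def estimate_theta(n_dk, alpha):
    """Eq. (5): rows are regions, columns are topics."""
    K = len(n_dk[0])
    return [[(n + alpha) / (sum(row) + K * alpha) for n in row] for row in n_dk]

def most_probable_topics(z_samples, K):
    """Assign each cell to its most frequent topic across posterior samples.

    z_samples[s][i] is the topic of cell i in posterior sample s.
    """
    assigned = []
    for i in range(len(z_samples[0])):
        counts = [0] * K
        for sample in z_samples:
            counts[sample[i]] += 1
        assigned.append(max(range(K), key=counts.__getitem__))
    return assigned

beta_hat = estimate_beta([[5, 0], [0, 5]], psi=0.05)
theta_hat = estimate_theta([[4, 1], [1, 4]], alpha=0.01)
topics = most_probable_topics([[0, 1], [0, 1], [1, 1]], K=2)
```

Counting topic labels across samples uses only Z_gi, so the region assignment D_gi drops out of the final call, in line with the marginalization described above.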

Box 1 Collapsed Gibbs sampling algorithm

1. Identify M anchor cells (located at ${\{{{\boldsymbol{x}}}_{d}^{d}\}}_{d=1,2,\ldots,M}$) as the region centers across images.
2. For each image, pre-compute a KNN graph between all cells and the selected region centers.
3. Initialize topic assignment Z_gi and region assignment D_gi for each cell. Compute region-topic counts ${n}_{d}^{(k)}$ and topic-celltype counts ${n}_{k}^{(v)}$.
4. Gibbs sampling over burn-in and sampling period. For each cell, do
   (a) Update counts ${n}_{d}^{(k)}$ and ${n}_{k}^{(v)}$ excluding the current Z_gi and D_gi.
   (b) Sample topic assignment Z_gi conditional on region assignment D_gi based on Eq. (2).
   (c) Sample region assignment D_gi conditional on topic assignment Z_gi based on Eq. (3).
   (d) Update counts ${n}_{d}^{(k)}$ and ${n}_{k}^{(v)}$ with the updated Z_gi and D_gi.
5. Check convergence. If converged during burn-in and L posterior samples drawn, output posterior samples and parameters estimated based on Eqs. (4) and (5). If not, increase the number of iterations for burn-in.
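Step 4 of Box 1 can be sketched as a single sweep over all cells. This is a simplified illustration under stated assumptions rather than the Rcpp implementation: priors are symmetric scalars, counts are nested lists, and `neighbors[i]` is assumed to hold the pre-computed (region, kernel value) pairs for cell i. Row totals are recomputed on the fly for readability, whereas an efficient implementation would cache them.

```python
import random

def gibbs_sweep(cell_types, neighbors, Z, D, n_kv, n_dk, psi, alpha, rng):
    """One pass of step 4 in Box 1; updates Z, D and the counts in place."""
    K, V = len(n_kv), len(n_kv[0])
    for i, v in enumerate(cell_types):
        # (a) remove the current cell from both count tables
        n_kv[Z[i]][v] -= 1
        n_dk[D[i]][Z[i]] -= 1
        # (b) sample the topic given the region, Eq. (2); the region
        # denominator does not depend on k and is dropped
        d = D[i]
        w = [(n_kv[k][v] + psi) / (sum(n_kv[k]) + V * psi) * (n_dk[d][k] + alpha)
             for k in range(K)]
        k = rng.choices(range(K), weights=w)[0]
        # (c) sample the region given the topic, Eq. (3)
        regions = [r for r, _ in neighbors[i]]
        w = [kern * (n_dk[r][k] + alpha) / (sum(n_dk[r]) + K * alpha)
             for r, kern in neighbors[i]]
        d = rng.choices(regions, weights=w)[0]
        # (d) add the cell back with its new assignments
        Z[i], D[i] = k, d
        n_kv[k][v] += 1
        n_dk[d][k] += 1

rng = random.Random(0)
cell_types = [0, 0, 1, 1, 0, 1]
neighbors = [[(0, 1.0), (1, 0.5)] for _ in cell_types]
Z, D = [0, 1, 0, 1, 0, 1], [0, 0, 0, 1, 1, 1]
n_kv, n_dk = [[0, 0], [0, 0]], [[0, 0], [0, 0]]
for v, z, d in zip(cell_types, Z, D):
    n_kv[z][v] += 1
    n_dk[d][z] += 1
for _ in range(20):
    gibbs_sweep(cell_types, neighbors, Z, D, n_kv, n_dk, 0.05, 0.01, rng)
```

Each sweep touches every cell once and evaluates K topic weights plus m region weights per cell, which is consistent with the linear cost in the number of cells and iterations noted in the Discussion.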

Model selection

The likelihood of the topic model is intractable to compute in general, but we can approximate the model log-likelihood in terms of model parameters {β_k} and {θ_d}⁴⁷. Using the law of total probability, we take into account uncertainties in both the region and topic assignments of cells. The log-likelihood of the spatial topic model can then be written as

$$ll({\boldsymbol{\mathcal{C}}},{\boldsymbol{\mathcal{X}}})=\sum _{g}\mathop{\sum }\limits_{i=1}^{{n}_{g}}\log \left[\mathop{\sum }\limits_{k=1}^{K}\mathop{\sum }\limits_{d=1}^{M}\mathop{\sum }\limits_{v=1}^{V}{\mathbb{I}}({c}_{gi}=v){\theta }_{dk}{\beta }_{kv}{\eta }_{gi}^{d}\right],$$

(6)

where ${\eta }_{gi}^{d}=P({{\boldsymbol{x}}}_{gi}^{c}\;|\; {D}_{gi}=d,{{\boldsymbol{x}}}_{d}^{d})P({D}_{gi}=d)\propto K({{\boldsymbol{x}}}_{gi}^{c},{{\boldsymbol{x}}}_{d}^{d})$.
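A minimal sketch of Eq. (6) is given below. Because the source defines η only up to proportionality, this sketch assumes that η is normalized to sum to one over each cell's candidate regions and is zero elsewhere; the function name and input layout are also assumptions.

```python
import math

def log_likelihood(cell_types, neighbors, theta, beta):
    """Eq. (6) with eta normalized over each cell's candidate regions.

    neighbors[i] lists (region, kernel_value) pairs for cell i;
    theta[d][k] and beta[k][v] are the estimated model parameters.
    """
    K = len(beta)
    total = 0.0
    for v, nbrs in zip(cell_types, neighbors):
        norm = sum(kern for _, kern in nbrs)
        # Marginalize over regions d and topics k for the observed cell type v.
        p = sum((kern / norm) * sum(theta[d][k] * beta[k][v] for k in range(K))
                for d, kern in nbrs)
        total += math.log(p)
    return total

ll = log_likelihood(cell_types=[0], neighbors=[[(0, 2.0)]],
                    theta=[[1.0, 0.0]], beta=[[0.5, 0.5], [1.0, 0.0]])
```

The indicator over cell types in Eq. (6) simply selects the column of β for the observed type, which is why the sketch indexes `beta[k][v]` directly.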

We use the Deviance Information Criterion (DIC)⁴⁸, a generalization of the Akaike Information Criterion (AIC) for Bayesian model selection, to select the number of topics:

$$DIC={p}_{D}+\overline{D({\boldsymbol{\mathcal{C}}},{\boldsymbol{\mathcal{X}}})},$$

(7)

where the Deviance is defined as $D({\boldsymbol{\mathcal{C}}},{\boldsymbol{\mathcal{X}}})=-2ll({\boldsymbol{\mathcal{C}}},{\boldsymbol{\mathcal{X}}})$ and ${p}_{D}=\frac{1}{2}\overline{Var(D({\boldsymbol{\mathcal{C}}},{\boldsymbol{\mathcal{X}}}))}$.

DIC requires calculating the log-likelihood for every posterior sample, which is time-consuming. To determine the optimal number of topics, we run SpatialTopic with a varied number of topics (2–9 in practice) and collect a few posterior samples (such as the first 20 posterior samples) after convergence (with trace=TRUE). The number of topics is then selected based on the DIC in Eq. (7). Otherwise, we only output the deviance and the log-likelihood of the final posterior sample (default: trace=FALSE). In Supplementary Fig. 8, we show the convergence of the Gibbs sampling algorithm on the Nanostring NSCLC dataset, for which the number of topics was selected as seven based on DIC.
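The DIC-based selection can be sketched from a handful of per-sample log-likelihoods, as below. The function names are assumptions, and the sketch uses the sample variance (denominator n − 1) of the deviances for p_D; whether the package uses the sample or population variance is not stated here. The log-likelihood values in the example are made up for illustration.

```python
from statistics import mean, variance

def dic(log_likelihoods):
    """Eq. (7): DIC = p_D + mean deviance, with p_D = Var(deviance) / 2."""
    deviances = [-2.0 * ll for ll in log_likelihoods]
    p_d = 0.5 * variance(deviances)  # needs at least two posterior samples
    return p_d + mean(deviances)

def select_num_topics(ll_by_k):
    """Return the number of topics with the smallest DIC, plus all scores."""
    scores = {k: dic(lls) for k, lls in ll_by_k.items()}
    return min(scores, key=scores.get), scores

# Hypothetical log-likelihoods from three posterior samples per setting.
best_k, scores = select_num_topics({2: [-100.0, -102.0, -98.0],
                                    3: [-90.0, -91.0, -89.0]})
```

Lower DIC is better, so a model is favored when its posterior deviance is both low on average and stable across samples.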

Comparing to other methods

We compared the performance of SpatialTopic with five other niche analysis methods: spatial-LDA, Seurat-v5, UTAG, CytoCommunity, and BankSY. BankSY and UTAG used protein or gene expression data and cell spatial coordinates as inputs, while the other methods used existing cell-type annotations and cell spatial coordinates. We followed the pre-processing procedures and parameters described in the original papers and tutorials for each method, with some hyperparameters slightly adjusted for computational efficiency on large datasets or when clear guidelines for tuning parameters were available. Details of these adjustments and the rationale for not using the default settings are described in this section.

All methods were initially run using RStudio (for R-based methods) or JupyterLab (for Python-based methods) on a standard MacBook Air (M2, 2022). If a method could not be run on a standard Mac due to memory constraints, we used our high-performance computing server with a single-core CPU and 200 GB of assigned memory. For the Nanostring CosMx NSCLC dataset, both CytoCommunity and UTAG were run on the server due to high memory usage. Additionally, for the CODEX mouse spleen dataset, UTAG could be run on the Mac only with the default parallel mode disabled, due to memory constraints.

SpatialTopic (SpaTopic R package v1.1.0)

We ran SpatialTopic with region_radius = 400, 150, 300 for the Nanostring CosMx NSCLC, the CODEX mouse spleen, and the mIF melanoma datasets, respectively, allowing around 100 cells per region on average during initialization, which is necessary for accurately estimating the topic-region distribution. We chose length-scale sigma = 20 for the mouse spleen dataset while using the default parameters for the NSCLC and the melanoma datasets. Posterior samples were collected after the convergence of the Gibbs sampling chain, with a burn-in period of 2000 iterations for the NSCLC and the melanoma datasets, and 1500 iterations for the mouse spleen dataset. For the healthy lung dataset with 26 small ROIs, SpatialTopic was run with sigma = 5 and region_radius = 60 to identify the complex local structures. For this dataset only, we increased the number of initializations to 200 to improve the robustness of identifying consensus patterns across ROIs, at the cost of a longer running time.

Seurat-v5 (v5.0.2)

We used the default niche analysis in Seurat v5, specifically the BuildNicheAssay() function in the Seurat R package. Seurat v5 employs k-means clustering to group cell neighborhood features, which are derived from the shared-nearest-neighbor graph (default neighbors.k = 30), a variant of the KNN graph, as part of its image-based spatial data analysis pipeline. We ran BuildNicheAssay() with all default parameters except for the Nanostring CosMx NSCLC dataset, for which we set neighbors.k = 100 because we found that increasing neighbors.k from 10 to 100 (testing neighbors.k = 10, 30, 50, 100) significantly improved the algorithm’s performance on this dataset (Supplementary Fig. 9).

Spatial-LDA (v0.1.3)

For the CODEX mouse spleen datasets, we used the same parameters as the authors used in the original methodology paper, though we now use neighborhoods of all cells as the input, not only B cells. For the Nanostring CosMx NSCLC datasets, we also use neighborhoods of all cells as the input but set radius = 400 to extract neighborhood cell type compositions, consistent with region_radius r = 400 in SpatialTopic on this dataset. To reduce the computational complexity for both datasets, we set the threshold = 0.01 for ADMM Primal-Dual optimizer. Finally, we output the topic weights for every cell and assign every cell to a topic with the maximal weight.

CytoCommunity (v1.1.0)

CytoCommunity (unsupervised mode) was run on a CPU with 200 GB of assigned memory and evaluated only on the Nanostring CosMx NSCLC dataset due to its demand for large memory and its unsupervised mode’s inability to learn Tissue Cell Neighborhoods across multiple images. We set KNN-K = 300 for 0.1M cells, as suggested in the original paper. For large-scale image data, the second step of CytoCommunity is time-consuming when trained on a CPU. Therefore, we greatly reduced num_RUN to 10 and Num_Epoch to 100 per run while ensuring the final loss was less than −0.2 for each run. Other parameters were set to the default.

UTAG (v0.1.1)

UTAG was primarily developed for protein expression data with limited marker channels. For the Nanostring CosMx NSCLC dataset with 960 genes, we used typical pre-processing steps suggested by Scanpy (v1.9.8) for analyzing scRNA-seq datasets. These steps included filtering low-prevalence genes, log transformation, and retaining only highly variable genes. We then performed z-score normalization, truncated at 10 standard deviations, followed by PCA. Only the top 50 principal components were used as input for UTAG. UTAG was run under multiple clustering resolutions [0.05, 0.1, 0.3, 0.5] and mix_dist = 60, which at an image resolution of 0.18 microns per pixel corresponds to about 10.8 µm, since the authors suggested setting mix_dist between 10 and 20 µm in the user manual. For the CODEX mouse spleen dataset (with intensity values already transformed), we performed z-score normalization truncated at 10 standard deviations, followed by Combat batch correction⁴¹ and a second z-score normalization truncated at 10 standard deviations, a procedure similar to that introduced in the UTAG paper for preprocessing IMC data¹⁵. We also set mix_dist = 60, with an imaging resolution of 0.188 microns per pixel.

BankSY (v0.99.9)

In contrast to UTAG, BankSY is specifically designed to analyze spatial transcriptomics datasets. We ran BankSY with lambda = 0.8 to identify spatial domains, as recommended, with other parameters set to default, as described in the Github tutorial. For the Nanostring CosMx NSCLC dataset, we followed the same pre-processing procedures outlined in the domain analysis tutorial, using k_geom = 30, npcs = 50, and clustering resolutions of 0.1, 0.2, 0.3, and 0.5. For the CODEX mouse spleen datasets, we used the same input as UTAG, after batch correction and normalization. We followed the tutorial for multi-sample analysis, running the results under npcs = 30 since the dataset has only 30 markers.

Data preprocessing

Nanostring CosMx human NSCLC

The Nanostring CosMx NSCLC dataset is available on the Nanostring Website (https://nanostring.com/products/cosmx-spatial-molecular-imager/ffpe-dataset/nsclc-ffpe-dataset/). For our analysis, we selected the Lung5-1 sample and annotated about 0.1M cells into 38 cell types using Azimuth²⁶ with a human lung reference v1.0 (https://azimuth.hubmapconsortium.org/references/). We used the same cell type annotations as the Seurat image analysis pipeline tutorial (https://satijalab.org/seurat/articles/seurat5_spatial_vignette_2.html). Since healthy lung tissue was used as the reference, the “basal” cells were re-labeled as tumor cells because they are the closest cell type in the reference. We checked that the tumor locations indicated by the reference-based cell annotations are consistent with the tumor region labeled by Nanostring.

CODEX mouse spleen

We used the cell type annotation, marker expression level, and imaging coordinates from the original paper². The image dataset can be downloaded from https://data.mendeley.com/datasets/zjnpwh8m5b/1. For cell coordinates, we only use the X and Y axes of the samples, ignoring the Z axis. However, the result is similar when considering all three dimensions.

IMC healthy lung

We used the cell type annotation, marker expression level, cell imaging coordinates, and cell UTAG domain labels in the original paper¹⁵. This image dataset can be downloaded from https://zenodo.org/records/6376767.

mIF melanoma

The 12-plex whole-slide mIF image of a melanoma tissue sample is one of the imaging datasets published by our group¹⁴. The dataset contains the following 12 markers: CD8, PD-1, PD-L1, CD68, CD3, CD20, FoxP3, pancytokeratin+SOX10 (panCK/SOX10), TCF1/7, TOX, Ki67, LAG-3. The processed dataset can be downloaded from Mendeley Data at https://data.mendeley.com/datasets/syfmgsv3d9/1. Here, we only used cell phenotypes (classified based on marker expression of CD8, panCK/SOX10, CD68, CD3, CD20, FoxP3) and cell locations as the input of SpatialTopic.

Simulation

We tested methods on simulated datasets of different scales to benchmark the scalability of SpatialTopic with an increasing number of cells in images. We randomly sampled 10 k, 40 k, 90 k, 160 k, and 250 k pixels from an image, similar to the simulation method described in ref. ²³, to represent cell locations. We did not simulate gene expression levels for every individual cell. Instead, for each domain, we randomly sampled cells with domain-specific cell type distributions, with parameters simulated from Dirichlet(1, 1, 1, 1, 1), anticipating five distinct cell types per domain (a simulated example is shown in Supplementary Fig. 10A). Five unique datasets were generated for each simulation scenario. We also scaled the X and Y axes to maintain consistent cell densities across all simulation scenarios.
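The cell type simulation step can be sketched as follows, assuming each simulated cell already carries a domain label (for example, taken from the pixel it was sampled from). The function name and the toy labels are assumptions; the Dirichlet(1, 1, 1, 1, 1) draw is obtained by normalizing five independent Gamma(1, 1) variables.

```python
import random

def simulate_domain_cell_types(domain_labels, n_types=5, seed=0):
    """Draw a cell type for every cell from its domain's Dirichlet(1,...,1) profile."""
    rng = random.Random(seed)
    probs = {}
    for dom in sorted(set(domain_labels)):
        g = [rng.gammavariate(1.0, 1.0) for _ in range(n_types)]
        total = sum(g)
        probs[dom] = [x / total for x in g]   # domain-specific cell type distribution
    types = [rng.choices(range(n_types), weights=probs[d])[0] for d in domain_labels]
    return probs, types

labels = [0] * 100 + [1] * 100   # two toy domains of 100 cells each
probs, types = simulate_domain_cell_types(labels)
```

Because only cell type labels are drawn, the simulated data can be passed directly to methods that take cell types and coordinates as input, without simulating gene expression.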

In addition, we also assessed other aspects of SpatialTopic, including the recovery of the number of topics and topic distributions in our simulation studies. In Supplementary Fig. 10, we evaluated SpatialTopic across 20 simulated datasets (each with 10 k cells) and showed that SpatialTopic not only achieves high clustering accuracy (evaluated using Adjusted Rand Index) but also effectively recovers the topic distributions (evaluated by sum of residual squared error between the true topic content matrix {β_k} and the estimated $\{{\hat{{\boldsymbol{\beta }}}}_{k}\}$), which highlights our model’s capability to recover the underlying topic structure.

Statistics and reproducibility

All statistical calculations were implemented in R (v4.3.3). No statistical method was used to predetermine sample size. The experiments were not randomized. This study does not involve group allocation that requires blinding.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

All datasets we used in the study are publicly available and can be downloaded online, with analysis details described in the Methods section. The Nanostring CosMx NSCLC dataset is available on the Nanostring Website (https://nanostring.com/products/cosmx-spatial-molecular-imager/ffpe-dataset/nsclc-ffpe-dataset/). The CODEX Mouse spleen dataset can be downloaded from Mendeley Data at https://data.mendeley.com/datasets/zjnpwh8m5b/1. The IMC Healthy Lung dataset can be downloaded from Zenodo at https://zenodo.org/records/6376767. The mIF Melanoma image dataset¹⁴ presented in the manuscript can be downloaded from Mendeley Data at https://data.mendeley.com/datasets/syfmgsv3d9/1. Source data for each individual figure are provided with this paper.

Code availability

The R package is available on GitHub (https://github.com/xiyupeng/SpaTopic/) with a tutorial (https://xiyupeng.github.io/SpaTopic/). The R package is also available on CRAN (https://cloud.r-project.org/package=SpaTopic) with https://doi.org/10.32614/CRAN.package.SpaTopic. The first version of the R package was officially released on CRAN on Jan 17, 2024. Analysis code is available on GitHub (https://github.com/xiyupeng/SpatialTopic_Analysis_codes) with https://doi.org/10.5281/zenodo.15588176.

References

1. Keren, L. et al. A structured tumor-immune microenvironment in triple negative breast cancer revealed by multiplexed ion beam imaging. Cell 174, 1373–1387.e19 (2018).
2. Goltsev, Y. et al. Deep profiling of mouse splenic architecture with CODEX multiplexed imaging. Cell 174, 968–981.e15 (2018).
3. Ko, J. et al. Spatiotemporal multiplexed immunofluorescence imaging of living cells and tissues with bioorthogonal cycling of fluorescent probes. Nat. Biotechnol. 40, 1654–1662 (2022).
4. Hoch, T. et al. Multiplexed imaging mass cytometry of the chemokine milieus in melanoma characterizes features of the response to immunotherapy. Sci. Immunol. 7, eabk1692 (2022).
5. Moldoveanu, D. et al. Spatially mapping the immune landscape of melanoma using imaging mass cytometry. Sci. Immunol. 7, eabi5072 (2022).
6. Nirmal, A. J. et al. The spatial landscape of progression and immunoediting in primary melanoma at single-cell resolution. Cancer Discov. 12, 1518–1541 (2022).
7. McCaffrey, E. F. et al. The immunoregulatory landscape of human tuberculosis granulomas. Nat. Immunol. 23, 318–329 (2022).
8. Helmink, B. A. et al. B cells and tertiary lymphoid structures promote immunotherapy response. Nature 577, 549–555 (2020).
9. Vanhersecke, L. et al. Mature tertiary lymphoid structures predict immune checkpoint inhibitor efficacy in solid tumors independently of PD-L1 expression. Nat. Cancer 2, 794–802 (2021).
10. Cabrita, R. et al. Tertiary lymphoid structures improve immunotherapy and survival in melanoma. Nature 577, 561–565 (2020).
11. Fridman, W. H. et al. B cells and tertiary lymphoid structures as determinants of tumour immune contexture and clinical outcome. Nat. Rev. Clin. Oncol. 19, 441–457 (2022).
12. Feng, Y. et al. Spatial analysis with SPIAT and spaSim to characterize and simulate tissue microenvironments. Nat. Commun. 14, 2697 (2023).
13. Vanguri, R. S. et al. Integration of peripheral blood- and tissue-based biomarkers of response to immune checkpoint blockade in urothelial carcinoma. J. Pathol. 261, 349–360 (2023).
14. Smithy, J. W. et al. Quantitatively defined stromal B cell aggregates are associated with response to checkpoint inhibitors in unresectable melanoma. Cell Rep. 44, 115554 (2025).
15. Kim, J. et al. Unsupervised discovery of tissue architecture in multiplexed imaging. Nat. Methods 19, 1653–1661 (2022).
16. Chen, Z., Soifer, I., Hilton, H., Keren, L. & Jojic, V. Modeling multiplexed images with spatial-LDA reveals novel tissue microenvironments. J. Comput. Biol. 27, 1204–1218 (2020).
17. Patrick, E. et al. Spatial analysis for highly multiplexed imaging data to identify tissue microenvironments. Cytometry A 103, 593–599 (2023).
18. Hao, Y. et al. Dictionary learning for integrative, multimodal and scalable single-cell analysis. Nat. Biotechnol. 42, 293–304 (2024).
19. Singhal, V. et al. BANKSY unifies cell typing and tissue domain segmentation for scalable spatial omics data analysis. Nat. Genet. 56, 431–441 (2024).
20. Hu, Y. et al. Unsupervised and supervised discovery of tissue cellular neighborhoods from cell phenotypes. Nat. Methods 21, 267–278 (2024).
21. Li, Z. & Zhou, X. BASS: multi-scale and multi-sample analysis enables accurate cell type clustering and spatial domain detection in spatial transcriptomic studies. Genome Biol. 23, 1–35 (2022).
22. Chidester, B., Zhou, T., Alam, S. & Ma, J. SpiceMix enables integrative single-cell spatial modeling of cell identity. Nat. Genet. 55, 78–88 (2023).
23. Shang, L. & Zhou, X. Spatially aware dimension reduction for spatial transcriptomics. Nat. Commun. 13, 1–22 (2022).
24. Wang, X. & Grimson, E. Spatial latent Dirichlet allocation. In Proc. Advances in Neural Information Processing Systems 20 (NIPS, 2007).
25. He, S. et al. High-plex imaging of RNA and proteins at subcellular resolution in fixed tissue by spatial molecular imaging. Nat. Biotechnol. 40, 1794–1806 (2022).
26. Hao, Y. et al. Integrated analysis of multimodal single-cell data. Cell 184, 3573–3587 (2021).
27. Rakaee, M. et al. Tertiary lymphoid structure score: a promising approach to refine the TNM staging in resected non-small cell lung cancer. Br. J. Cancer 124, 1680–1689 (2021).
28. Schumacher, T. N. & Thommen, D. S. Tertiary lymphoid structures in cancer. Science 375, eabf9419 (2022).
29. Liu, Y. et al. Conserved spatial subtypes and cellular neighborhoods of cancer-associated fibroblasts revealed by single-cell spatial multi-omics. Cancer Cell 43, 905–924 (2025).
30. Aran, D. et al. Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage. Nat. Immunol. 20, 163–172 (2019).
31. Meng, G. et al. imply: improving cell-type deconvolution accuracy using personalized reference profiles. Genome Med. 16, 65 (2024).
32. Golub, R., Tan, J., Watanabe, T. & Brendolan, A. Origin and immunological functions of spleen stromal cells. Trends Immunol. 39, 503–514 (2018).
33. Zervopoulou, E. et al. Enhanced medullary and extramedullary granulopoiesis sustain the inflammatory response in lupus nephritis. Lupus Sci. Med. 11, e001110 (2024).
34. El-Jawhari, J. J., El-Sherbiny, Y., McGonagle, D. & Jones, E. Multipotent mesenchymal stromal cells in rheumatoid arthritis and systemic lupus erythematosus; from a leading role in pathogenesis to potential therapeutic saviors? Front. Immunol. 12, 643170 (2021).
35. Schett, G., Mackensen, A. & Mougiakakos, D. CAR T-cell therapy in autoimmune diseases. Lancet 402, 2034–2044 (2023).
36. Li, H. et al. Systemic lupus erythematosus favors the generation of IL-17 producing double negative T cells. Nat. Commun. 11, 2859 (2020).
37. Crispín, J. C. et al. Expanded double negative T cells in patients with systemic lupus erythematosus produce IL-17 and infiltrate the kidneys. J. Immunol. 181, 8761–8766 (2008).
38. Peng, X. et al. A topic modeling approach reveals the dynamic T cell composition of peripheral blood during cancer immunotherapy. Cell Rep. Methods 3, 100546 (2023).
39. Roberts, M. E., Stewart, B. M. & Tingley, D. stm: An R package for structural topic models. J. Stat. Softw. 91, 1–40 (2019).
40. Yosofvand, M. et al. Spatial immunophenotyping from whole-slide multiplexed tissue imaging using convolutional neural networks. Preprint at bioRxiv https://doi.org/10.1101/2024.08.16.608247 (2024).
41. Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–127 (2006).
42. Kuett, L. et al. Three-dimensional imaging mass cytometry for highly multiplexed molecular and cellular mapping of tissues and the tumor microenvironment. Nat. Cancer 3, 122–133 (2022).
43. Datta, A., Banerjee, S., Finley, A. O. & Gelfand, A. E. Hierarchical nearest-neighbor Gaussian process models for large geostatistical datasets. J. Am. Stat. Assoc. 111, 800–812 (2016).
44. Weber, L. M., Saha, A., Datta, A., Hansen, K. D. & Hicks, S. C. nnSVG for the scalable identification of spatially variable genes using nearest-neighbor Gaussian processes. Nat. Commun. 14, 4059 (2023).
45. Griffiths, T. L. & Steyvers, M. Finding scientific topics. Proc. Natl. Acad. Sci. USA 101, 5228–5235 (2004).
46. Heinrich, G. Parameter estimation for text analysis. Technical report (Darmstadt, Germany, 2005).
47. Newman, D., Asuncion, A., Smyth, P. & Welling, M. Distributed algorithms for topic models. J. Mach. Learn. Res. 10, 1801–1828 (2009).
48. Gelman, A., Carlin, J. B., Stern, H. S. & Rubin, D. B. Bayesian Data Analysis 2nd edn. in Texts in Statistical Science (CRC Press, 2004).
49. Peng, X. et al. 1303 Spatial topic modeling of tumor microenvironment with multiplexed imaging. J. Immunother. Cancer 11, A1448–A1448 (2023).


Acknowledgements

We thank MSK-MIND for computational support. This work was supported in part by the MSKCC Society and NIH/NCI Cancer Center Support Grant NIH P30 CA008748 for MSKCC (to X.P., J.W.S., C.E.K., F.E., J.L., M.B., M.Y., M.A.P., M.K.C., R.S., and K.S.P.), the V Foundation (to M.K.C.), the Parker Institute for Cancer Immunotherapy (to X.P.), NIH/NCI grant R01 CA276286 (to R.S., K.S.P., M.K.C., and J.L.), and the MSK-MIND consortium (to R.S. and M.K.C.).

Author information

Authors and Affiliations

Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, NY, USA
Xiyu Peng, Mohammad Yosofvand, Caroline E. Kostrzewa, MaryLena Bleile, Fiona D. Ehrich, Jasme Lee, Katherine S. Panageas & Ronglai Shen
Department of Statistics, Texas A&M University, College Station, TX, USA
Xiyu Peng
Department of Medicine, Memorial Sloan Kettering Cancer Center, New York, NY, USA
James W. Smithy & Michael A. Postow
Weill Cornell Medical Center, New York, NY, USA
Michael A. Postow
Neag Comprehensive Cancer Center, UConn Health, Farmington, CT, USA
Margaret K. Callahan

Authors

Xiyu Peng
James W. Smithy
Mohammad Yosofvand
Caroline E. Kostrzewa
MaryLena Bleile
Fiona D. Ehrich
Jasme Lee
Michael A. Postow
Margaret K. Callahan
Katherine S. Panageas
Ronglai Shen

Contributions

X.P. contributed to the original draft, developed the statistical model, and wrote the software. X.P., J.W.S., R.S., and K.S.P. developed the initial study concept. X.P., R.S., K.S.P., and J.L. developed the algorithm. X.P. and C.E.K. contributed to the R package. X.P., J.W.S., C.E.K., F.E., M.Y., and M.B. analyzed the data. R.S., K.S.P., M.A.P., and M.K.C. oversaw all data generation and analysis. X.P., J.W.S., F.E., J.L., M.B., R.S., and K.S.P. edited the manuscript. All authors reviewed and approved the final manuscript.

Corresponding authors

Correspondence to Xiyu Peng, Katherine S. Panageas or Ronglai Shen.

Ethics declarations

Competing interests

J.W.S.: research funding from IO Biotech (Inst), Regeneron (Inst), and Daiichi Sankyo (Inst); consulting or advisory role for IO Biotech, Iovance, Bristol Myers Squibb, and Daiichi Sankyo; travel (Immatics). M.A.P.: consulting or advisory role for Bristol-Myers Squibb, Cancer Expert Now, Chugai Pharma, Eisai, Erasca, Inc, Intellisphere, Merck, MJH Associates, Nektar, Novartis, Pfizer, and WebMD; research funding from Array BioPharma (Inst), Bristol-Myers Squibb (Inst), Infinity Pharmaceuticals (Inst), Merck (Inst), Novartis (Inst), and Rgenix (Inst). K.S.P.: stock ownership in 23andMe, Vincerx, Eyepoint, and Kyverna. C.E.K.: stock ownership in Johnson & Johnson. M.K.C.: research support (Inst) and advisory role/consulting for BMS; advisory role/consulting for Medimmune and Immunocore; family member employed by Merus. M.B.: collaborator with Regeneron; direct contractor for Sanofi. X.P., J.L., M.Y., R.S., and F.D.E.: no disclosures.

Peer review

Peer review information

Nature Communications thanks Ellis Patrick, who co-reviewed with Elijah Willie, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Reporting Summary

Transparent Peer Review file

Source data

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


About this article

Cite this article

Peng, X., Smithy, J.W., Yosofvand, M. et al. Scalable topic modelling decodes spatial tissue architecture for large-scale multiplexed imaging analysis. Nat Commun 16, 6619 (2025). https://doi.org/10.1038/s41467-025-61821-y


Received: 24 October 2024
Accepted: 03 July 2025
Published: 18 July 2025
DOI: https://doi.org/10.1038/s41467-025-61821-y
