Detecting anomalous anatomic regions in spatial transcriptomics with STANDS

Xu, Kaichen; Lu, Yan; Hou, Suyang; Liu, Kainan; Du, Yihang; Huang, Mengqian; Feng, Hao; Wu, Hao; Sun, Xiaobo

doi:10.1038/s41467-024-52445-9

Article
Open access
Published: 19 September 2024

Nature Communications volume 15, Article number: 8223 (2024)

Abstract

Detection and Dissection of Anomalous Tissue Domains (DDATD) from multi-sample spatial transcriptomics (ST) data provides unprecedented opportunities to characterize anomalous tissue domains (ATDs), revealing both population-level and individual-specific pathogenic factors for understanding pathogenic heterogeneities behind diseases. However, no current methods can perform de novo DDATD from ST data, especially in the multi-sample context. Here, we introduce STANDS, an innovative framework based on Generative Adversarial Networks which integrates three core tasks in multi-sample DDATD: detecting, aligning, and subtyping ATDs. STANDS incorporates multimodal-learning, transfer-learning, and style-transfer techniques to effectively address major challenges in multi-sample DDATD, including complications caused by unalignable ATDs, under-utilization of multimodal information, and scarcity of normal ST datasets necessary for comparative analysis. Extensive benchmarks from diverse datasets demonstrate STANDS’ superiority in identifying both common and individual-specific ATDs and further dissecting them into biologically distinct subdomains. STANDS also provides clues to developing ATDs visually indistinguishable from surrounding normal tissues.

Introduction

Detecting and characterizing biologically heterogeneous anomalous tissue domains (ATDs) from multiple tissue samples is of paramount importance in clinical diagnostics, targeted therapies and biomedical research^1,2,3 (Fig. 1a). This procedure, which we refer to as Detection and Dissection of Anomalous Tissue Domains (DDATD), serves as the first and foremost step in a comprehensive analysis of tissues from affected individuals for revealing population-level and individual-specific factors (e.g., pathogenic cell types) associated with disease development.

**Fig. 1: Overview of STANDS and DDATD from multi-sample ST data.**

Traditionally, DDATD relies on either meticulous expert visual inspection or computer vision algorithms applied to histology images^4,5. Spatial Transcriptomics (ST) provides an unprecedented opportunity to enhance DDATD by integrating spatial gene expression information across tissues.

Unlike the spatial clustering task frequently encountered in ST, which focuses on clustering spatial spots into spatial domains^6,7,8,9,10, DDATD necessitates further isolation of anomalous clusters from normal ones. Existing methods developed for spatial clustering either fail to identify anomalous clusters (e.g., SpaGCN⁶ and GraphST¹¹) or rely on “expert-defined” anomaly markers (e.g., cancer gene signatures in iStar⁹, TESLA¹², and cell2location¹³). However, such markers are frequently unavailable, especially for under-researched or novel domain types, severely constraining the applicability of spatial clustering methods in de novo DDATD. Currently, Spatial-ID¹⁴ is the sole available marker-free, spatial clustering method capable of distinguishing anomalous spatial spots from normal ones. However, it presents significant shortcomings, including its reliance on annotated reference scRNA-seq data of the target tissue, which is not always available, a high false-positive rate stemming from confusing normal spots of uncertain assignments with bona fide anomalies¹⁵, and challenges in handling multiple target datasets due to the lack of cross-sample ST alignment mechanisms. To our knowledge, no existing computational methods can perform de novo, annotated-reference-free DDATD from ST data, especially in the context of multiple samples.

The de novo DDATD workflow for multi-sample ST comprises three sequential core tasks: detecting ATDs, aligning them across multiple samples, and classifying them into shared or sample-specific types. Several computational challenges impede these tasks. First, batch effects can compromise the accuracy of anomaly detection^16,17, thus necessitating a preliminary multi-sample alignment for batch correction. However, anomalies themselves can complicate this alignment process^15,18, reducing the effectiveness of batch correction. Additionally, existing methods fall short of aligning multiple ST samples within a unified space that maintains the data’s original scale and semantic integrity¹⁹, thus hampering both the interpretation of aligned data and its utilization in subsequent tasks, such as anomaly subtyping. Furthermore, the scarcity of “normal” reference ST datasets hampers DDATD, as the definition of “anomaly” relies on the comparison with “normal”. Although ST datasets derived from healthy human tissues are being continually added to public databases such as HubMap²⁰ and STOmicsDB²¹, they are limited in quantity and tissue types, given that acquiring healthy tissues presents significant challenges, particularly for tissues of vital organs. Moreover, the lack of effective methods to integrate multimodal signals from both spatial gene expression data and histology images severely restricts the potential of DDATD analysis. Finally, integrating the three sequential tasks into a methodologically coherent framework, rather than simply combining individual methods specifically designed for each task, is vital to ensure their collective effectiveness.

To address these challenges, we introduce Spatial Transcriptomics ANomaly Detection and Subtyping (STANDS), an innovative framework built on a suite of specialized Generative Adversarial Networks (GANs) for seamlessly integrating the three core tasks of DDATD. The framework consists of three components, each dedicated to one of the core tasks of DDATD (Methods and Fig. 1b). Component I (C1) trains a GAN model on the reference dataset, learning to reconstruct normal spots from their multimodal representations of both spatial transcriptomics data and the associated histology image patches²². Subsequently, the model is applied on the target datasets to identify anomalous spots as those with unexpectedly large reconstruction deviances, namely anomaly scores²³. As nonbiological variations (e.g. batch effects) potentially complicate anomaly subtyping, Component II (C2) employs two cooperative GAN models to diminish them via “style-transferring” target datasets into a common space, while preserving the data’s original scale and semantic integrity²⁴. Component III (C3) fuses the embeddings and reconstruction residuals of aligned anomalous spots to serve as inputs to a self-paced clustering algorithm²⁵ which groups anomalies into biologically distinct subtypes. STANDS is extensively benchmarked in a series of experiments that are deliberately designed for the three core tasks of DDATD under various scenarios (Table 1). These experiments involve datasets spanning a diverse range of platforms, tissues, and species (Supplementary Table 1). 
Compared to benchmark methods (Supplementary Table 2), our method demonstrates unparalleled performance across these experiments in identifying both nascent and developed ATDs from single and multiple ST samples, aligning ST samples, whether adjacent or non-adjacent, to diminish nonbiological variations among ATDs, and dissecting ATDs into biologically distinct subdomains that can be either shared across samples or unique to individual samples. Notably, STANDS remains effective even when using scRNA-seq datasets as surrogate references, and its performance is not compromised when dealing with multiple samples with unalignable sample-specific ATDs. In addition to STANDS, we also introduce Spatial Grouping Discrepancy (SGD) as a novel metric to evaluate the spatial organization of detected ATD subdomains (see “Evaluation metrics” section in Methods). In contrast to conventional clustering metrics like accuracy and Adjusted Rand Index (ARI), which treat observations as equal-weighted and independent, SGD incorporates spatial relationships among spots into the evaluation and is thus particularly suited for assessing spatial clustering results in ST analyses.

Table 1 Overview of the experimental settings

Results

Overview of STANDS

STANDS is a suite of GAN modules specialized for integrating and fulfilling three core tasks of DDATD from multi-sample ST: detecting, aligning, and subtyping of ATDs. For ease of discussion, we will refer to observations in ST datasets as “spots” regardless of their resolutions, and to spots within reference and target datasets as reference and target spots, respectively. Moreover, the term “dataset” is used interchangeably with “sample”. The phrase “adjacent datasets” refers to datasets obtained from vertically sliced tissue samples, while “non-adjacent datasets” are derived from noncontiguous samples of the same tissue type but different donors.

The fundamental idea behind STANDS is that a GAN model can learn either the “content” (biological) variations between normal and anomalous regions or the “style” (batch) variations across ST datasets. The workflow of STANDS is structured into three components, each corresponding to one of the aforementioned three tasks (Fig. 1b). Component I (C1) is dedicated to detecting ATDs in target datasets (see “Anomalous tissue region detection” section in Methods). It involves a GAN module (module I), which is trained on the reference dataset to reconstruct normal spots accurately. The trained module I is then applied to target datasets to identify anomalous spots as those exhibiting substantial reconstruction deviances (or anomaly scores), given that module I has been trained exclusively on normal data. Crucially, this approach remains effective despite potential batch variations, as anomalies consistently exhibit higher anomaly scores relative to normal spots. The reconstruction is based on spots’ latent multimodal embeddings from module I’s encoder, which employs a graph attention (GAT) network and a ResNet-GAT hybrid network to learn the latent representations of gene expression and histology images for each spot. These two embeddings are then fused using a transformer fusion (TF) block²² to generate a multimodal embedding (see “Multimodal learning of spatial gene expression data and histology image” section in Methods). In scenarios where a normal ST dataset is unavailable, the encoder employs a multi-layer perceptron (MLP) network to utilize normal scRNA-seq data as a surrogate reference and treats spatial spots in the target dataset as single cells to generate their embeddings.
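The scoring idea in Component I can be illustrated with a minimal sketch. It assumes a hypothetical `reconstruct` callable standing in for the trained module I generator, uses squared Euclidean reconstruction error as the deviance, and min-max normalizes the scores before applying a fixed cutoff. None of these choices is taken from the STANDS implementation; they only show why a model trained on normal spots alone assigns high scores to anomalies.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def anomaly_scores(spots: List[Vector],
                   reconstruct: Callable[[Vector], Vector]) -> List[float]:
    """Min-max normalized squared reconstruction error per spot (illustrative)."""
    raw = []
    for x in spots:
        x_hat = reconstruct(x)
        raw.append(sum((a - b) ** 2 for a, b in zip(x, x_hat)))
    lo, hi = min(raw), max(raw)
    span = hi - lo
    if span == 0:
        # All spots are reconstructed equally well: nothing stands out.
        return [0.0 for _ in raw]
    return [(r - lo) / span for r in raw]


def flag_anomalies(scores: List[float], threshold: float = 0.5) -> List[bool]:
    """Hypothetical fixed cutoff on the normalized score."""
    return [s > threshold for s in scores]


# Toy stand-in for a generator trained only on normal spots: it always
# returns the "normal" profile, so deviating spots get large errors.
NORMAL_PROFILE = [1.0, 1.0, 1.0]


def toy_reconstruct(x: Vector) -> Vector:
    return NORMAL_PROFILE


spots = [[1.0, 1.0, 1.0], [1.1, 0.9, 1.0], [4.0, 0.0, 3.0]]
scores = anomaly_scores(spots, toy_reconstruct)
flags = flag_anomalies(scores)
print(flags)
```

In this toy setting only the third spot, which departs strongly from the normal profile, is flagged.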

Component II (C2) utilizes the spot embeddings generated by C1 to align target datasets in the reference data space (see “Multi-sample ST data alignment” section in Methods). Initially, anomalous spots identified by Component I are excluded from target datasets to eliminate their confounding effects. Two GAN modules, modules II and III, work in tandem in this component. Module II creates “kin” pairs between reference and target spots by learning a non-negative mapping matrix ${{\bf{M}}}\in {{\mathbb{R}}}^{{N}_{T}\times {N}_{R}}$, where ${N}_{T}$ and ${N}_{R}$ denote the number of target and reference spots, respectively. Subsequently, module III utilizes these “kin” pairs to learn a “style”-divergence matrix ${{\bf{S}}}\in {{\mathbb{R}}}^{{N}_{{batch}}\times d}$, based on which the target datasets are “style”-transferred to the reference dataset’s “style” space, thereby diminishing their batch variations.
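A toy sketch may help make the roles of the two matrices concrete. It is a deliberate simplification under stated assumptions: the mapping matrix `M` is given rather than learned adversarially, each target spot's “kin” is taken to be the reference spot with the largest weight in its row of `M`, and each row of `S` is estimated as the mean offset between a batch's target spots and their kin. The function names are hypothetical and the closed-form averaging replaces the GAN training described in Methods.

```python
from typing import List

Matrix = List[List[float]]


def kin_indices(M: Matrix) -> List[int]:
    """For each target spot (row of M), pick the reference spot with the largest weight."""
    return [max(range(len(row)), key=row.__getitem__) for row in M]


def style_divergence(target: Matrix, reference: Matrix, kin: List[int],
                     batch: List[int], n_batch: int) -> Matrix:
    """Estimate S (n_batch x d) as the mean target-minus-kin offset per batch."""
    d = len(target[0])
    sums = [[0.0] * d for _ in range(n_batch)]
    counts = [0] * n_batch
    for t, (x, b) in enumerate(zip(target, batch)):
        ref = reference[kin[t]]
        for j in range(d):
            sums[b][j] += x[j] - ref[j]
        counts[b] += 1
    return [[(s / c) if c else 0.0 for s in row] for row, c in zip(sums, counts)]


def style_transfer(target: Matrix, batch: List[int], S: Matrix) -> Matrix:
    """Remove each batch's offset; values stay on the original data scale."""
    return [[x_j - s_j for x_j, s_j in zip(x, S[b])] for x, b in zip(target, batch)]


reference = [[0.0, 0.0], [10.0, 10.0]]
target = [[1.0, 2.0], [11.0, 12.0], [-1.0, 0.5], [9.0, 10.5]]
batch = [0, 0, 1, 1]                       # batch label of each target spot
M = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.1, 0.9]]  # N_T x N_R, non-negative

kin = kin_indices(M)
S = style_divergence(target, reference, kin, batch, n_batch=2)
aligned = style_transfer(target, batch, S)
print(S)
print(aligned)
```

Because the correction is a per-batch shift in the original feature space, the aligned values remain directly comparable to the reference, which is the property the text emphasizes for downstream subtyping.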

Component III (C3) aims to group anomalous spots into biologically distinct tissue subdomains through a clustering process (see “Subtyping anomalous tissue domains across multiple datasets” section in Methods). Specifically, the C1-generated embeddings and reconstruction residuals of aligned anomalous spots are fused by a TF block to serve as inputs for the discriminatively boosted clustering algorithm, DEC. This algorithm iteratively refines anomalies’ embeddings and cluster centroids, incrementally transforming harder-to-cluster anomalies into easier ones until either a predetermined number of iterations is reached, or the cluster assignment change between two successive iterations falls below a prespecified threshold.
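The two stopping conditions described above can be sketched as follows. This is an illustration only: it assumes Student's t kernel soft assignments of the kind used in deep embedded clustering, and it replaces the gradient-based refinement of embeddings with a simple centroid update weighted by squared soft assignments, so that confidently assigned spots dominate. The names `soft_assign` and `cluster`, and the default `max_iter` and `tol` values, are hypothetical.

```python
from typing import List, Tuple

Point = List[float]


def soft_assign(z: List[Point], centroids: List[Point]) -> List[List[float]]:
    """Student's t kernel soft assignment of each point to each centroid."""
    q = []
    for x in z:
        w = [1.0 / (1.0 + sum((a - b) ** 2 for a, b in zip(x, c)))
             for c in centroids]
        total = sum(w)
        q.append([v / total for v in w])
    return q


def cluster(z: List[Point], centroids: List[Point],
            max_iter: int = 100, tol: float = 1e-3) -> Tuple[List[int], int]:
    """Iterate until max_iter or until the fraction of changed labels < tol."""
    centroids = [list(c) for c in centroids]
    prev = None
    labels: List[int] = []
    for it in range(1, max_iter + 1):
        q = soft_assign(z, centroids)
        labels = [max(range(len(row)), key=row.__getitem__) for row in q]
        if prev is not None:
            changed = sum(a != b for a, b in zip(labels, prev)) / len(labels)
            if changed < tol:
                return labels, it
        prev = labels
        # Simplified refinement: squared weights emphasize confident spots.
        for j in range(len(centroids)):
            w = [row[j] ** 2 for row in q]
            total = sum(w)
            centroids[j] = [sum(wi * x[k] for wi, x in zip(w, z)) / total
                            for k in range(len(z[0]))]
    return labels, max_iter


z = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.8]]
labels, n_iter = cluster(z, centroids=[[1.0, 1.0], [4.0, 4.0]])
print(labels, n_iter)
```

On this well-separated toy input the labels do not change between the first and second iterations, so the loop exits through the assignment-change criterion rather than the iteration cap.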

STANDS identifies cancerous domains in single human breast cancer ST dataset

Initially, we leverage STANDS to identify carcinogenic domains within human breast cancer tissues. Specifically, this experiment (Exp ID = 1 in Table 1) involves a 10x Visium reference dataset (10x-hNB-v05), which is prepared from healthy human breast tissues and encompasses four normal domain types, and a 10x Visium target dataset (10x-hBC-G2), which is derived from human breast cancer tissues and encompasses two additional carcinogenic domain types including the cancer in situ (CIS) and the invasive cancer (IC) domains. As benchmarks, we include Spatial-ID, the sole available marker-free method for ATD detection in ST, along with four methods (CAMLU¹⁵, scPred²⁶, CHETAH²⁷, and scmap²⁸) originally developed for anomaly detection in scRNA-seq, all of which treat spatial spots as single cells.

STANDS significantly outperforms the competing methods in correctly identifying spots within the carcinogenic domains in terms of accuracy, F1-score, Precision-Recall (PR) curve, and SGD scores (Fig. 2a, b). Here, a lower SGD score indicates the corresponding result is more spatially consistent with the ground truth (see “Evaluation metrics” section in Methods). Furthermore, the legitimacy of using STANDS’ anomaly scores for identifying ATDs is evident from the bimodal distribution of the normalized anomaly scores in the target dataset, which clearly separates anomalous and normal spots (Fig. 2c, lower panel). Intriguingly, normal spots in closer proximity to anomalous regions tend to have higher anomaly scores (Fig. 2c, top panel). We postulate that these normal-looking spots may be pathologically relevant. To verify this, we examine normal spots with high anomaly scores, which STANDS falsely identifies as anomalies. Many of these spots are adjacent to carcinogenic regions and may represent emerging cancerous areas (Fig. 2c, middle panel). Such areas are visually indistinguishable from the surrounding “normal” tissues in the histology image (Fig. 2d) but exhibit substantially elevated expression levels of breast cancer marker genes (e.g., ACTB²⁹ and TMSB10³⁰, Fig. 2e). Therefore, these false positive anomalies are clinically significant as they provide insights into the tumor developmental trajectories. Finally, we repeat this experiment using Slide-seqV2 datasets (Exp ID = 4), where two mouse embryo datasets are selected as the reference (ssq-mEmb-32) and target (ssq-mEmb-33) datasets, respectively. The domains of cardiac valve (CV) cells and neuroectodermal (NED) cells are designated as ATDs in the target dataset and excluded from the reference dataset. Not surprisingly, STANDS surpasses the benchmark methods in accurately distinguishing CV and NED domains from others (Supplementary Fig. 1).

**Fig. 2: Intra- and cross-modality detection of anomalous tissue domains (ATDs) in single 10x Visium datasets.**

STANDS concurrently identifies anomalous tissue domains across multiple ST datasets

We then evaluate STANDS’ performance in concurrently detecting ATDs across multiple ST samples derived from different donors. This evaluation (Exp ID = 3) involves the 10x-hNB-v05 dataset as the reference and two 10x Visium human breast cancer datasets (10x-hBC-H1 and 10x-hBC-G2) as the targets. We challenge STANDS by curating the target datasets to encompass both shared and dataset-specific ATDs: the IC domain is shared across datasets, while the CIS and adipose tissue (AT) domains are unique to 10x-hBC-G2 and 10x-hBC-H1, respectively. Additionally, we continue with the same five benchmark methods employed in the preceding section. As shown in Fig. 3, STANDS is the only method that accurately delineates ATDs in spatial maps of both target datasets. In contrast, Spatial-ID can only demarcate the AT domain, CAMLU almost entirely fails to identify any anomalies, scPred and CHETAH tend to identify anomalies dispersed throughout the tissues, and scmap mistakenly categorizes most spots as anomalous. STANDS’ superiority is further quantitatively substantiated by its better accuracy, F1-score, and Precision-Recall (PR) curve results, compared to those of the benchmark methods. Furthermore, as detailed in Table 1 (Exp ID = 5-8) and Supplementary Note 1.1, this evaluation is extensively repeated across datasets of diverse platforms, tissue types, and health states, including the mouse embryo Slide-seqV2 (ssq-mEmb) and Stereo-seq (stereo-mEmb) datasets, the human primary sclerosing cholangitis 10x Visium datasets (10x-hPSC), as well as the human renal cell cancer 10x Visium datasets (10x-hRCC). Supplementary Figs. 1–5 consistently demonstrate that, compared to the benchmarks, STANDS excels in concurrently identifying biologically heterogeneous ATDs from multiple target datasets.

**Fig. 3: Detection of anomalous tissue domains (ATDs) across multiple human breast cancer 10x Visium datasets.**

STANDS identifies pancreatic cancerous domains via cross-referencing pancreatic scRNA-seq data

The scarcity of normal ST datasets from healthy individuals, which serve as comparative baselines, presents a primary challenge in detecting ATDs from diseased ST datasets. STANDS innovatively circumvents this limitation by transferring knowledge from more accessible normal scRNA-seq datasets of the same tissue type. To assess STANDS’ performance in this cross-modal anomaly detection, we use a normal scRNA-seq dataset derived from healthy human pancreatic ductal tissues as a surrogate reference. The target dataset is a 10x Visium dataset (10x-hPDAC) prepared from pancreatic ductal adenocarcinomas (PDAC), with cancerous domains designated as anomalous. Recognizing that researchers often resort to using unsupervised anomaly detection methods in the absence of suitable reference datasets, we expand our benchmarking to include SCEVAN³¹ and CopyKAT³², two unsupervised methods developed for detecting tumor cells from scRNA-seq data, alongside the five supervised benchmark methods in earlier sections. As shown in Fig. 2f, STANDS successfully delineates the primary carcinogenic region on the spatial map, outperforming all benchmark methods that either mark only sporadic cancer spots or erroneously label most spots as anomalous. This superiority is corroborated by STANDS’ highest accuracy and F1-scores as well as its lowest SGD scores (Fig. 2g).

STANDS effectively aligns multiple ST datasets sharing identical domain types

This section evaluates STANDS’ effectiveness in aligning multiple ST datasets without the interference of unalignable, dataset-specific domain types. This evaluation begins with a simple scenario (Exp ID = 13) involving aligning multiple vertical human PSC 10x Visium datasets (10x-hPSC-A1, C1, and D1). The second scenario (Exp ID = 9) is more involved as we aim to align three non-adjacent 10x Visium healthy human breast tissue datasets (10x-hNB-v05, 10x-hNB-v06, and 10x-hNB-v07). The benchmark methods include Harmony³³ and ComBat³⁴, originally designed for scRNA-seq data integration, as well as GraphST¹¹ and STAligner³⁵, designed for aligning ST datasets.

The results of the first and second scenarios are displayed in Fig. 4 and Supplementary Fig. 6, respectively. UMAPs in Fig. 4a and Supplementary Fig. 6a demonstrate that STANDS surpasses most benchmark methods in intermingling data and maintaining accurate domain type alignment across datasets. This superiority is supported by metrics that measure the batch mixing effects (iLISI, BatchKL, and ASW_batch, see Methods) and the accuracy of cross-batch domain type alignment (ASW_type), as shown in Fig. 4b and Supplementary Fig. 6b. To further assess the effectiveness of multi-sample alignment, we utilize GraphST for joint spatial domain clustering, which is a natural and important downstream task, on the aligned datasets in both scenarios. For establishing baseline results, we also perform joint spatial clustering on the unaligned datasets. The annotated spatial maps in Fig. 4c and Supplementary Fig. 6c illustrate that, although GraphST achieves improved spatial domain clustering results for all post-alignment datasets relative to the baseline results, the most accurate results are observed for STANDS-aligned datasets, as evidenced by STANDS’ highest joint ARI score shown in Fig. 4d and Supplementary Fig. 6d.

**Fig. 4: Alignment of multiple vertical human primary sclerosing cholangitis (PSC) 10x Visium datasets sharing identical domain types.**

We repeat this experiment using multiple vertical human breast cancer 10x Visium datasets (Exp ID = 13, Supplementary Fig. 7), non-adjacent mouse embryo Slide-seqV2 (Exp ID = 12, Supplementary Fig. 8) and Stereo-seq (Exp ID = 14, Supplementary Fig. 9) datasets. It is worth mentioning that, to increase the challenge of the first scenario (Exp ID = 13), we select three vertical slices (10x-hBC-A1, 10x-hBC-A4, and 10x-hBC-A6) from a collection of six 10x Visium human breast cancer datasets (10x-hBC-A1 to A6) so that their inter-slice z-axis distance is maximized. Given that adjacent 10x-hBC datasets are 32 μm apart along the z-axis, our selection achieves a minimum z-axis distance of 32 × 2 = 64 μm between A4 and A6 and a maximum distance of 32 × 5 = 160 μm between A1 and A6. Generally, these experiments demonstrate results similar to those in Fig. 4 and Supplementary Fig. 6, except that, in Supplementary Fig. 7a, Harmony appears to better separate immune infiltrates from other domain types in the post-alignment data compared to STANDS, contradicting the metric results (Supplementary Fig. 7b and d). This observation can be explained by the small proportion of immune infiltrates, whose clean separation has limited impact on overall type alignment. STANDS, on the other hand, excels in separating major domain types such as breast glands, adipose tissues, and connective tissues. This is evidenced by the ASW_type metric in the right panel of Supplementary Fig. 7b, which assesses cross-batch type alignment. STANDS has a higher ASW_type score (0.52) compared to Harmony (0.5), indicating a slight advantage in separating cell types in mixed batches. Additionally, Supplementary Fig. 7d shows that data aligned by STANDS (ARI = 0.24) achieves better spatial clustering performance than data aligned by Harmony (ARI = 0.23). 
Finally, these findings altogether highlight STANDS’ remarkable ability not only in effectively integrating multiple ST datasets but also in ensuring their accurate domain alignment, irrespective of whether the datasets are adjacent or non-adjacent.
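The z-axis distances quoted for the selected breast cancer slices can be checked with a short calculation. The sketch assumes that the six slices A1 to A6 are consecutive and evenly spaced at the stated 32 μm, so that the distance between two slices is the difference of their indices times the spacing.

```python
from itertools import combinations

SLICE_SPACING_UM = 32  # spacing between adjacent 10x-hBC slices, from the text

# Position of each selected slice in the A1..A6 stack (assumed consecutive).
selected = {"10x-hBC-A1": 1, "10x-hBC-A4": 4, "10x-hBC-A6": 6}

distances = {
    (a, b): abs(selected[a] - selected[b]) * SLICE_SPACING_UM
    for a, b in combinations(selected, 2)
}

for pair, dist in distances.items():
    print(pair, dist, "um")

print("min:", min(distances.values()), "um")
print("max:", max(distances.values()), "um")
```

Under this assumption the three pairwise distances are 96 μm (A1 to A4), 64 μm (A4 to A6), and 160 μm (A1 to A6), matching the minimum and maximum given above.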

STANDS effectively aligns multiple ST datasets with non-overlapping domain types

In this section, we aim to align datasets with both shared and unique domain types. This scenario is not only more realistic for ATD analysis in multi-sample ST, but also presents greater challenges since the biological variations arising from inconsistent domain types can be confounded with nonbiological variations, thus complicating the data alignment. Our experiment (Exp ID = 10) involves a normal breast tissue dataset (10x-hNB-v07) and two breast cancer datasets (10x-hBC-G2 and 10x-hBC-H1) that have been curated to share all domain types, except for the CIS and IC domains that are unique to 10x-hBC-G2 and 10x-hBC-H1, respectively. The benchmark methods include Harmony, ComBat, STAligner, and GraphST. As shown in Fig. 5a, STANDS is the sole method that not only achieves superior batch integration but also successfully separates the two dataset-specific domains (CIS and IC) from regions common to all datasets and avoids their intermingling with each other. To systematically evaluate the impacts of dataset-specific domains, we compare the effectiveness of multi-sample alignment in the presence versus absence of dataset-specific domains. The latter condition is created by excluding both CIS and IC domains from the two cancerous datasets. As shown in Fig. 5b, STANDS not only significantly outperforms all benchmark methods in both batch mixing and domain type alignment, as evidenced by its superior iLISI, BatchKL, ASW_batch, and ASW_type scores, but also stands out as the only method whose effectiveness remains robust to the dataset-specific domain types. Moreover, as shown in Fig. 5c, d, GraphST yields the most accurate joint spatial clustering results using datasets aligned by STANDS, further demonstrating that STANDS excels in diminishing non-biological variations across datasets. Such an observation can be attributed to both the effective mitigation of batch effects and the enhancement of domain-relevant gene expression patterns in STANDS-aligned datasets. 
For instance, IRF6 and SLC40A1 genes, known to be implicated in AT function and immune metabolism respectively^36,37, exhibit elevated expression levels within the AT and immune infiltrates domains in the normal 10x-hNB-v07 dataset. Although such expression patterns are subdued in the original 10x-hBC-G2 and 10x-hBC-H1 datasets, they become more pronounced following STANDS’ alignment (Fig. 6).

**Fig. 5: Alignment of multiple non-adjacent human breast cancer 10x Visium datasets with dataset-specific domain types.**

**Fig. 6: Enhanced expression patterns of domain-relevant genes in STANDS-aligned datasets.**

STANDS discerns biologically distinct anomalous tissue subdomains in single ST datasets

Subtyping ATDs in single target datasets represents the most fundamental task for discerning biologically distinct subdomains, as the ATDs are devoid of batch variations. We assess the performance of STANDS in this context using the 10x-hBC-G2 (Exp ID = 18) and 10x-hBC-H1 (Exp ID = 19) datasets, respectively. The CIS and IC domains in 10x-hBC-G2, and the AT and IC domains in 10x-hBC-H1 have been curated as anomalous. Given that the accuracy of anomaly subtyping is contingent upon the outcomes of anomaly detection, the benchmark methods are composite methods, which are simple combinations of individual methods specialized for anomaly detection and subtyping. These include the five supervised benchmark methods for anomaly detection (Spatial-ID, CAMLU, scPred, CHETAH, and scmap), coupled with a spatial clustering method (GraphST, STAGATE, or iStar) for grouping identified ATDs into distinct subdomains, yielding six composite benchmark methods: scmap-GraphST, CAMLU-GraphST, CHETAH-GraphST, scPred-GraphST, SpatialID-STAGATE, and SpatialID-iStar. As shown in Supplementary Fig. 10, STANDS consistently outperforms the benchmark methods in generating ATD subtype labels that closely align with the ground truth in annotated spatial maps, as quantitatively evidenced by its highest Macro-F1 and NMI scores. Note that NMI scores are computed on true positive anomalous spots identified by each method, while Macro-F1 scores are computed on true positives, false positives, and false negatives. Supplementary Fig. 11 shows that STANDS’ superiority persists when this experiment is extended to single Slide-seqV2 datasets (Exp ID = 20 and 21).
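The distinction between the two evaluation subsets can be made concrete with a small sketch. It assumes that normal spots and undetected spots both carry the label `"normal"`, that predicted subtype labels have already been matched to the ground-truth names, that Macro-F1 is averaged over anomalous subtypes only, and that NMI uses arithmetic-mean normalization. These conventions are assumptions made for illustration; the paper's exact definitions are in its Methods.

```python
import math
from collections import Counter
from typing import List

NORMAL = "normal"


def macro_f1(truth: List[str], pred: List[str]) -> float:
    """Average F1 over anomalous subtypes; normal/normal spots never count."""
    classes = sorted({c for c in truth + pred if c != NORMAL})
    if not classes:
        return 0.0
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(truth, pred))
        fp = sum(t != c and p == c for t, p in zip(truth, pred))
        fn = sum(t == c and p != c for t, p in zip(truth, pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)


def entropy(labels: List[str]) -> float:
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(u: List[str], v: List[str]) -> float:
    """Normalized mutual information with arithmetic-mean normalization."""
    n = len(u)
    cu, cv, cuv = Counter(u), Counter(v), Counter(zip(u, v))
    mi = sum((c / n) * math.log(c * n / (cu[a] * cv[b]))
             for (a, b), c in cuv.items())
    denom = entropy(u) + entropy(v)
    return 2 * mi / denom if denom else 1.0


truth = ["CIS", "CIS", "IC", "IC", NORMAL, NORMAL]
pred = ["CIS", "IC", "IC", NORMAL, "IC", NORMAL]

# Macro-F1 sees detection errors (false positives and false negatives).
f1 = macro_f1(truth, pred)

# NMI is restricted to spots that are truly anomalous AND detected as such.
tp_idx = [i for i, (t, p) in enumerate(zip(truth, pred))
          if t != NORMAL and p != NORMAL]
nmi_tp = nmi([truth[i] for i in tp_idx], [pred[i] for i in tp_idx])
print(round(f1, 3), round(nmi_tp, 3))
```

The sketch shows why the two scores are complementary: a method that detects few anomalies but subtypes them well can score high on NMI while being penalized on Macro-F1 for the missed spots.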

STANDS discerns biologically distinct anomalous tissue subdomains across multiple ST datasets

In the final assessment, we examine STANDS’ ability to concurrently discern biologically distinct subdomains within ATDs across multiple datasets. The benchmark methods include the same six composite methods as in the preceding section, with the addition of STAligner to SpatialID-STAGATE, ComBat to SpatialID-iStar, and GraphST to the other four methods for aligning cross-dataset ATDs (see “Benchmark implementations” in Methods). This experiment comprises two scenarios involving two 10x Visium target datasets (10x-hBC-G2 and 10x-hBC-H1). In the first scenario (Exp ID = 16, Supplementary Fig. 12), both datasets share identical ATD subdomains (IC and CIS), while in the second scenario (Exp ID = 15, Fig. 7), they have been curated to include both overlapping (CIS) and unique (IC in 10x-hBC-G2; AT in 10x-hBC-H1) ATD subdomains. In both scenarios, ATD subdomains delineated by STANDS closely align with the ground truth in the annotated spatial maps and Sankey plots (Supplementary Fig. 12a, b and Fig. 7a, b). Conversely, the benchmark methods struggle with correctly distinguishing ATD subdomains, especially those shared between datasets. For example, spots within the CIS subdomain are either overlooked (e.g., CAMLU-GraphST) or misclassified as distinct subdomains (e.g., scmap-GraphST). Supplementary Fig. 12c and Fig. 7c quantitatively demonstrate STANDS’ superiority over the benchmark methods in terms of Macro-F1, NMI, and multi-SGD scores. It is worth mentioning that NMI scores are computed on true positive anomalous spots identified by each method, while Macro-F1 and multi-SGD scores are computed on true positives, false positives, and false negatives. Additionally, the cells along the leading diagonals of heatmaps of Supplementary Fig. 12d and Fig. 7d represent the spatially optimal matches between the predicted subdomain labels and the ground truth, determined using the multi-SGD scores (see “Evaluation metrics” section in Methods). 
The cell color intensity indicates the degree of spatial consistency between the matched label pairs. STANDS exhibits the most pronounced patterns along these diagonals, signifying that its subtyping results consistently achieve the best spatial matching with the ground truth. We also test the performance of STANDS and the benchmark methods on Stereo-seq datasets (Exp ID = 23, Fig. 8), which share identical ATD subdomains (branchial arch and primitive gut tube). The spatial plots (Fig. 8a) and metrics (Fig. 8c, d) show the advantages of STANDS compared with the benchmark methods on Stereo-seq data. To assess STANDS’ generalizability, we further conduct a series of experiments (Exp ID = 17, 22, 24) using the ATDs identified from the three experiments (Exp ID = 5, 6, 8) in Supplementary Note 1.1, where ST datasets spanning diverse platforms, tissue types, and health states are involved. The experimental settings and results of these experiments are detailed in Supplementary Note 1.2. Briefly, compared to the benchmarks, STANDS not only excels in accurately detecting ATDs and dissecting them into biologically distinct subdomains, but also in recognizing them as either shared by target datasets or unique to specific ones.

**Fig. 7: Subtyping anomalous tissue domains (ATDs) across multiple human breast cancer 10x Visium datasets with both shared and dataset-specific domain subtypes.**

**Fig. 8: Subtyping anomalous tissue domains (ATDs) across multiple mouse embryo Stereo-seq datasets sharing identical domain subtypes.**

The impacts of reference diversity and size on STANDS’ performance

Recognizing the important roles of diversity and quantity of normal spots in the reference dataset for training STANDS, we systematically evaluate these factors using a healthy human breast 10x Visium dataset (i.e., 10x-hNB-v05) as the reference and a breast cancer 10x Visium dataset (i.e., 10x-hBC-G2) as the target. For the quantity evaluation, we conduct three experiments where we incrementally exclude one-third of the connective tissue spots from the reference dataset, as shown in Fig. 9a. We observe a decline in ATD detection accuracy during inference with fewer reference spots, as evidenced by lower AUC scores and F1-scores, and an increased number of false positives (the left two bar groups in Fig. 9c). Notably, many new false positives appear in the connective tissue domains: 14 spots (no removal), 20 spots (one-third removal), and 35 spots (two-thirds removal), as shown by the rightmost bar group in Fig. 9c. These findings demonstrate that reducing the number of reference spots of a specific domain type most significantly impacts the detection accuracy of the same type in the target dataset.
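The reported counts can be turned into fold increases relative to the full reference, which makes the trend easier to compare across settings. The snippet below only uses the three connective tissue false positive counts stated above; the dictionary keys are illustrative labels.

```python
# False positives in connective tissue domains, as reported in the text.
false_positives = {
    "no removal": 14,
    "one-third removal": 20,
    "two-thirds removal": 35,
}

baseline = false_positives["no removal"]
fold_increase = {k: v / baseline for k, v in false_positives.items()}

for setting, fold in fold_increase.items():
    print(f"{setting}: {fold:.2f}x")
```

Relative to the full reference, false positives rise by roughly 1.43-fold after removing one-third of the connective tissue spots and by 2.5-fold after removing two-thirds.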

**Fig. 9: Impact of reference data size and diversity on STANDS’ performance in ATD detection.**

For the diversity evaluation, we perform two experiments where spots within the immune infiltrate domains or an equal number of spots within the breast gland domains are excluded from the reference dataset. Compared to the breast glands, immune infiltrates are more similar to other normal domains in terms of average cosine similarities in gene expressions (Fig. 9d). Thus, the removal of breast glands decreases the level of reference diversity more significantly than the removal of immune infiltrates. Figure 9e shows that the removal of either domain type reduces STANDS’ overall ATD detection accuracy. In particular, Fig. 9f shows that the removal of breast glands leads to a larger increase (~3.30-fold) in false positives for breast gland domains than the removal of immune infiltrates does for immune infiltrate domains (~2.94-fold). This can be explained by the fact that the information loss from removing immune infiltrates is better compensated by the remaining domain types during training. Consequently, STANDS learns a less effective reconstruction of breast glands than of immune infiltrates, leading to larger reconstruction errors and a higher false positive rate during inference. In summary, our findings indicate that a larger quantity of spots for a normal domain type during training reduces STANDS’ misclassification likelihood for that domain type during inference. Meanwhile, decreased reference diversity increases the misclassification likelihood for normal domain types missing in the reference dataset.

Ablation study

Here, we conduct a series of ablation studies on STANDS’ key model components to evaluate their impacts on performance for each subtask. The 10x-hNB-v05 dataset is used as the reference and the 10x-hBC-G2 dataset as the target for analyzing Components I (i.e., ATD detection) and III (i.e., ATD subtyping), while the 10x-hBC-A1, -A4, and -A6 datasets are used for Component II (i.e., multi-sample alignment). Model components involved in the ablation studies include the memory bank within the generator of GAN module I, the utilization of histology data in ATD detection and subtyping, the TF blocks in STANDS’ Components I and III, and the non-negative mapping matrix for multi-sample alignment.

Removing the memory bank leads to lower ATD detection accuracy (Supplementary Fig. 16a) and reduced variability in the training losses of both the generator and discriminator of GAN module I, especially in the later training phase where the discriminator’s loss becomes lower while the generator’s loss becomes higher without further improvement (Supplementary Fig. 16b). This loss pattern indicates a tendency towards mode collapse, a frequent cause of GANs’ failure to produce diverse outputs. A memory bank with a size significantly larger than the batch size helps prevent this by storing various previously seen spots to increase the diversity during the reconstruction process.

The removal of histology data in Components I and III reduces STANDS’ performance in DDATD, as indicated by the lower accuracy and F1-score in ATD detection, and the lower Macro-F1 × NMI score in ATD subtyping (Supplementary Fig. 16a). We also notice an increase in false positives, particularly in normal domains that are histologically similar to surrounding normal tissues but exhibit high sparsity in ST data (as shown by the encircled area in the two rightmost panels in the top row of Supplementary Fig. 16c). We postulate that, when only using ST data, such normal domains tend to have a poor reconstruction due to their high sparsity, which distinguishes them from normal tissues in the absence of histological data.

As also shown in Supplementary Fig. 16a, omitting the TF blocks in Components I and III decreases STANDS’ performance in ATD detection and subtyping, respectively. This aligns with our expectation given that the TF block in Component I enhances ATD detection by facilitating complementary information exchange between data modalities, while the TF block in Component III integrates anomalous spot embeddings and reconstruction errors to inform anomaly clustering.

Finally, replacing the non-negative mapping matrix with a mutual-nearest neighbor (MNN) method³⁸ in Component II worsens the batch-mixing effect and domain type alignment, as indicated by the lower 1-ASW_batch and ASW_type scores in the middle panel of Supplementary Fig. 16a. This finding suggests that our approach, which is more oriented towards reconstruction-based tasks, is superior to MNN for multi-sample alignment in our context.

Discussion

Understanding the biological heterogeneity of pathogenic tissue domains across affected individuals serves as the cornerstone of personalized medicine. DDATD from multi-sample ST greatly facilitates such studies by providing critical insights into the molecular mechanisms and transcriptomic variations that drive the development of diseases. However, to the best of our knowledge, there are currently no computational methods that can utilize ST data for this purpose. In this study, we have introduced an innovative computational framework named STANDS to bridge this gap.

STANDS integrates and fulfills the three core tasks of DDATD: detecting, aligning, and subtyping ATDs across multiple ST samples. STANDS’ notable features include formulating the ATD detection problem under a generative model-based framework to significantly mitigate the batch variation interference, utilizing multimodal information from both histology images and spatial gene expression to enhance DDATD’s sensitivity and precision, transferring knowledge from scRNA-seq data when normal ST datasets are unavailable as references, and utilizing “style-transfer” techniques to diminish nonbiological variations among ATDs across samples so that common and individual-specific ATDs can be discerned. These features effectively tackle the primary challenges in DDATD from multi-sample ST and collectively contribute to STANDS’ unparalleled performance in fulfilling the three tasks under various scenarios, as evidenced by our extensive benchmark studies using diverse real datasets from different platforms (10x Visium, Slide-seqV2), species (human, mouse) and tissues (breast, embryo, pancreas). Despite these advantages, there is still room for future improvement of STANDS. For instance, the current GAN model can be replaced with an adapted denoising diffusion probabilistic model. This next-generation generative model could be more accurate in reconstructing normal observations, thereby further reducing the false positives in ATD detection.

In summary, the main contributions of this study include introducing STANDS as the first dedicated method for detecting ATDs from ST data. STANDS distinguishes itself from existing ST alignment methods (e.g., STAligner¹⁹) in preserving the original data scale and semantic integrity of aligned data, as well as being robust to unalignable non-overlapping domain types and ATDs. Additionally, STANDS is currently the only method that can further dissect ATDs into biologically distinct subdomains and discern them as either shared across samples or specific to individual samples. The study also introduces SGD and multi-SGD as the first metrics that incorporate spatial relationships among observations for a more effective evaluation of spatial clustering results. Moreover, STANDS is not restricted to anomaly detection in ST. It can also be used to detect, align, and classify anomalous single cells across scRNA-seq datasets, given its capability of treating target spatial spots as single cells, e.g., when cross-referencing scRNA-seq data. Furthermore, the core concept of STANDS’ integrated framework for detecting and subtyping anomalies can potentially be applied to other omics data modalities, and its transfer-learning-based reconstruction strategy offers a promising means for utilizing more accessible data modalities to inform anomaly detection in emerging or rarer data modalities, such as scATAC-seq. Therefore, we anticipate STANDS will prove useful for the diagnosis of pathogenic tissue regions, the discovery of novel or individual-specific pathogenic sources, and the development of targeted therapies.

Methods

Anomalous tissue domain detection

As illustrated in the C1 part of Fig. 1b, module I consists of a generator and a discriminator. The generator itself is composed of four subcomponents: an encoder, a decoder, a transformer fusion (TF) block, and a memory bank. The encoder employs a GAT and a ResNet-GAT hybrid network to generate spot embeddings based on spatial gene expression data and the associated histology image, respectively:

$${z}_{i}^{g}=\, {f}_{{GAT}}\left({x}_{i}^{g},\, G\,,\,{{{\bf{W}}}}_{1}\right),$$

(1)

$${z}_{i}^{p}=\,{f}_{{rn}-{GAT}}\left({x}_{i}^{p},\, G\,,\, {{{\bf{W}}}}_{2}\right),$$

(2)

where ${x}_{i}^{g}$ denotes the gene expression vector of spot $i$, ${x}_{i}^{p}$ the segmented patch of the histology image, $G=\left(V,E\right)$ the graph representation of all spatial spots, and ${z}_{i}^{g}$ and ${z}_{i}^{p}{{\boldsymbol{\in }}}{{\mathbb{R}}}^{d}$ the transcriptomic and image patch embeddings for spot $i$, respectively. These embeddings are then fused using a TF block:

$${z}_{i}{{\boldsymbol{\in }}}{{\mathbb{R}}}^{2d}={TF}\left(\left[{z}_{i}^{g}{{\rm{||}}}{z}_{i}^{p}\right],{{{\bf{W}}}}_{3}\right).$$

(3)

The memory bank is essentially an embedding queue ${{\bf{Q}}}{{\boldsymbol{\in }}}{{\mathbb{R}}}^{{N}_{{mem}}{{\boldsymbol{\times }}}2d}$ filled with $z$, where ${N}_{{mem}}$ denotes the number of in-memory embeddings. It provides an attention-based means to reconstruct $z$ as $\widetilde{z}{{\boldsymbol{\in }}}{{\mathbb{R}}}^{2d}$:

$${\widetilde{z}}_{i}={{{\bf{Q}}}}^{T}{{\rm{softmax}}}\left(\frac{{{\bf{Q}}}{z}_{i}}{\tau }\right)=\left[{\widetilde{z}}_{i}^{g}\parallel {\widetilde{z}}_{i}^{p}\right]{{\boldsymbol{.}}}$$

(4)

where $\tau$ is a temperature hyperparameter. ${{\bf{Q}}}$ is continuously updated during training by enqueuing recent $\widetilde{z}$ and dequeuing the oldest to maintain a balance between preserving previously learnt features and adapting to new spots, thereby mitigating the mode collapse risk. The decoder consists of a Multi-Layer Perceptron (MLP) network and a ResNet-based deconvolution network for reconstructing the gene expression vector ${\hat{x}}_{i}^{g}$ and image patch ${\hat{x}}_{i}^{p}$ from their respective ${\widetilde{z}}_{i}^{g}$ and ${\widetilde{z}}_{i}^{p}$.
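The memory-bank read in Eq. (4) and the queue update described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `memory_reconstruct` and `update_bank`, the toy dimensions, and the value of $\tau$ are assumptions made purely for illustration.

```python
import numpy as np

def memory_reconstruct(z, Q, tau=0.1):
    """Eq. (4): z_tilde = Q^T softmax(Q z / tau) for one embedding z."""
    logits = Q @ z / tau                  # similarity of z to each stored row
    logits = logits - logits.max()        # stabilise the exponentials
    weights = np.exp(logits)
    weights /= weights.sum()              # attention weights over the bank
    return Q.T @ weights                  # convex combination of stored rows

def update_bank(Q, new_rows):
    """FIFO update: drop the oldest rows and append the newest ones."""
    return np.vstack([Q[len(new_rows):], new_rows])

# Toy usage with hypothetical sizes (N_mem = 8, embedding width = 4).
rng = np.random.default_rng(0)
Q = rng.normal(size=(8, 4))
z = rng.normal(size=4)
z_tilde = memory_reconstruct(z, Q, tau=0.5)
Q = update_bank(Q, rng.normal(size=(2, 4)))
```

Because the softmax weights are non-negative and sum to one, each reconstructed embedding is a convex combination of stored embeddings, which is what ties reconstructions to previously seen normal spots.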

The discriminator $D$ comprises an encoder, similar to the generator’s encoder, and an MLP-based classifier. $D$ is trained to distinguish between $x=({x}^{g},\,{x}^{p})$ and $\hat{x}=({\hat{x}}^{g},{\hat{x}}^{p})$. The total loss functions for the generator (${{{\mathcal{L}}}}_{{Gen}}$) and the discriminator (${{{\mathcal{L}}}}_{D}$) are defined as:

$${{{{\mathcal{L}}}}_{{Gen}}={{\rm{\alpha }}}{{{\mathcal{L}}}}_{{rec}}+{{\rm{\beta }}}{{{\mathcal{L}}}}_{{adv}}=\alpha {\mathbb{E}}\left(\left(1-{{\rm{\gamma }}}\right){\|{x}^{g}-{\hat{x}}^{g}\|}_{1}+{{\rm{\gamma }}}{\|{x}^{p}-{\hat{x}}^{p}\|}_{1}\right)-{{\rm{\beta }}}\left({\mathbb{E}}\left[D\left(\hat{x}\right)\right]\right),}$$

(5)

$${{{\mathcal{L}}}}_{D}={\mathbb{E}}\left[D\left(\hat{x}\right)\right]{\mathbb{-}}{\mathbb{E}}\left[D\left(x\right)\right]+\lambda {\mathbb{E}}\left[{\left({\|\nabla D\left({{\rm{\xi }}}\right)\|}_{2}-1\right)}^{2}\right].$$

(6)

where ${{\rm{\xi }}}=\epsilon \hat{x}+\left(1-\epsilon \right)x,\,\epsilon \in \left({\mathrm{0,1}}\right)$. Here, ${{{\mathcal{L}}}}_{{rec}}$ denotes the data reconstruction loss and ${{{\mathcal{L}}}}_{{adv}}$ the adversarial loss. $\alpha$, $\beta$, and $\lambda \ge 0$ represent the weights of the loss terms, and ${{\rm{\gamma }}}\in \left[{\mathrm{0,1}}\right]$ represents the relative importance between the gene expression and imagery data. $D\left({\hat{x}}_{i}\right){{\boldsymbol{\in }}}{{\mathbb{R}}}^{h}$ is the discriminator’s output for ${\hat{x}}_{i}$, and ${\mathbb{E}}\left[ D\left({\hat{x}}_{i}\right)\right]\in \left[0,\,1\right]$ represents the probability that ${\hat{x}}_{i}$ is classified as real by ${D}$. Additionally, a gradient penalty term applied to ${{\rm{\xi }}}$ ensures the Lipschitz continuity of the discriminator and is critical for maintaining the stability of the adversarial training process³⁹.
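To make the structure of Eqs. (5) and (6) concrete, the following toy sketch evaluates both losses for a batch. It deliberately replaces the real discriminator with a hypothetical scalar critic $D(x)=\tanh(w^{T}x)$ whose input gradient is available in closed form; the critic, the function names, and the default weights are assumptions, not the STANDS implementation.

```python
import numpy as np

def critic(x, w):
    """Toy scalar critic D(x) = tanh(w . x), a stand-in for the discriminator."""
    return np.tanh(x @ w)

def critic_grad(x, w):
    """Closed-form gradient of the toy critic with respect to its input x."""
    return (1.0 - np.tanh(x @ w) ** 2)[:, None] * w[None, :]

def generator_loss(xg, xg_hat, xp, xp_hat, d_fake, alpha=1.0, beta=1.0, gamma=0.5):
    """Eq. (5): weighted L1 reconstruction loss minus the critic's score on fakes."""
    rec = ((1.0 - gamma) * np.abs(xg - xg_hat).sum(axis=1)
           + gamma * np.abs(xp - xp_hat).sum(axis=1))
    return alpha * rec.mean() - beta * d_fake.mean()

def discriminator_loss(x, x_hat, w, lam=10.0, rng=None):
    """Eq. (6): Wasserstein critic loss plus gradient penalty at xi."""
    if rng is None:
        rng = np.random.default_rng(0)
    eps = rng.uniform(size=(x.shape[0], 1))
    xi = eps * x_hat + (1.0 - eps) * x            # random interpolates
    grad_norm = np.linalg.norm(critic_grad(xi, w), axis=1)
    penalty = ((grad_norm - 1.0) ** 2).mean()
    return critic(x_hat, w).mean() - critic(x, w).mean() + lam * penalty
```

In a real training loop the gradient at $\xi$ would come from automatic differentiation rather than a hand-written formula.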

When only transcriptomic data (scRNA-seq or ST) is available for referencing, GAN module I undergoes specific modifications. Specifically, in the case of cross-referencing scRNA-seq, the GAT-based encoder of the generator is replaced by a two-layer MLP to generate ${z}_{i}^{g}$:

$${z}_{i}^{g}\in {{\mathbb{R}}}^{d} \,=\, {f}_{{MLP}}\left({x}_{i}^{g},{{{\bf{W}}}}_{1}^{*}\right){{\boldsymbol{.}}}$$

(7)

Similarly, the GAT-based encoder of the discriminator is replaced with the two-layer MLP. Moreover, without the image modality, the ResNet-based image encoder/decoder and the TF block are omitted so that ${z}_{i}={z}_{i}^{g}$ and the memory bank ${{\bf{Q}}}$ has a size of ${{\mathbb{R}}}^{{N}_{{mem}}{{\boldsymbol{\times }}}d}$. All other components remain unchanged.

Upon completing the training, STANDS is utilized to reconstruct spots in the target datasets. The reconstruction fidelity for a given spot $j$ is quantified using an anomaly score (${{{\mathcal{d}}}}_{j}$), computed as the cosine dissimilarity between $D\left({\hat{x}}_{j}\right)$ and $D\left({x}_{j}\right)$:

$${{{\mathcal{d}}}}_{j} \,=\, 1-\frac{{D\left({x}_{j}\right)}^{T}D\left({\hat{x}}_{j}\right)}{\|D\left({x}_{j}\right)\|\|D\left({\hat{x}}_{j}\right)\|}.$$

(8)
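Eq. (8) is a plain cosine dissimilarity between two discriminator output vectors. A minimal sketch in standard-library Python follows; the function name `anomaly_score` is illustrative, and the inputs are assumed to be non-zero vectors.

```python
import math

def anomaly_score(d_real, d_fake):
    """Eq. (8): one minus the cosine similarity of D(x_j) and D(x_hat_j)."""
    dot = sum(a * b for a, b in zip(d_real, d_fake))
    norm_real = math.sqrt(sum(a * a for a in d_real))
    norm_fake = math.sqrt(sum(b * b for b in d_fake))
    return 1.0 - dot / (norm_real * norm_fake)

# Identical directions score 0, orthogonal directions 1, opposite directions 2.
scores = [
    anomaly_score([1.0, 0.0], [2.0, 0.0]),
    anomaly_score([1.0, 0.0], [0.0, 3.0]),
    anomaly_score([1.0, 0.0], [-1.0, 0.0]),
]
```

The score therefore ranges over $[0, 2]$ and depends only on the directions of the two vectors, not on their magnitudes.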

A higher value of ${{{\mathcal{d}}}}_{j}$ indicates a less accurate reconstruction, implying spot $j$ is more likely to be an anomaly. As such, we model the anomaly scores’ distribution as a univariate Gaussian Mixture Model (GMM) with two components: one for anomalous spots (component 1) and the other for normal ones (component 2). We specify the prior for anomaly abundance as a beta distribution and the priors for the mean and variance of the two Gaussian components as a Normal Inverse Chi-squared (NIX) distribution. Utilizing the Maximum A Posteriori (MAP)-EM algorithm, we infer the parameters for both Gaussian components and then assign spots into either normal or anomalous groups based on their probabilities within each component. Specifically, let $\Theta=\left\{\pi,{\mu }_{k},{\sigma }_{k}^{2},\forall k\in \left\{{\mathrm{1,2}}\right\}\right\}$ represent the GMM parameters, where $\pi \in \left[{\mathrm{0,1}}\right]$ represents the proportion of anomalies, and ${\mu }_{k},{\sigma }_{k}^{2}$ represent the mean and variance for the $k$-th component, respectively, with the constraint that ${\mu }_{1} > {\mu }_{2}$. Then, the probability density function of ${{{\mathcal{d}}}}_{i}$ can be formulated as:

$${{\rm{P}}}\left({{{\mathcal{d}}}}_{i}|\varTheta \right) \,=\, \pi {{\mathcal{N}}}\left({{{\mathcal{d}}}}_{i}|{\mu }_{1},{\sigma }_{1}^{2}\right)+\left(1-\pi \right){{\mathcal{N}}}\left({{{\mathcal{d}}}}_{i}|{\mu }_{2},{\sigma }_{2}^{2}\right),$$

(9)

$$\pi \sim {{\rm{Beta}}}\left(\pi |a,b\right),$$

(10)

$${\mu }_{k},{\sigma }_{k}^{2} \sim {{\rm{NIX}}}\left({\mu }_{k},{\sigma }_{k}^{2}|{m}_{0},{\kappa }_{0},{s}_{0}^{2},{\nu }_{0}\right).$$

(11)

Parameters for the priors in the GMM are empirically set based on the reference dataset’s anomaly scores ${{{\rm{\delta }}}}_{i},\forall i\in \left[1,\, {N}_{{ref}}\right]$:

$${m}_{0}=\frac{{\sum }_{i=1}^{{N}_{{ref}}}\, {\delta }_{i}}{{N}_{{ref}}},\,{\kappa }_{0}=\,0.01,\, {\nu }_{0}\,=\, 3,\, {s}_{0}^{2}=\frac{{\sum }_{i=1}^{{N}_{{ref}}}{\left({\delta }_{i}-{m}_{0}\right)}^{2}}{{N}_{{ref}}},$$

(12)

$$a=1,\, b=10.$$

(13)
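As a small worked illustration of Eqs. (12) and (13), the hyperparameters can be computed directly from the reference anomaly scores. The sketch assumes that ${s}_{0}^{2}$ is the usual population variance (mean squared deviation from ${m}_{0}$); the function name and dictionary keys are hypothetical.

```python
from statistics import fmean

def prior_hyperparameters(ref_scores):
    """Empirical NIX and Beta hyperparameters from reference anomaly scores."""
    ref_scores = list(ref_scores)
    m0 = fmean(ref_scores)                                   # prior mean
    s0_sq = fmean([(s - m0) ** 2 for s in ref_scores])       # prior scale (variance)
    return {"m0": m0, "kappa0": 0.01, "nu0": 3, "s0_sq": s0_sq, "a": 1, "b": 10}

params = prior_hyperparameters([0.1, 0.2, 0.3])
```

With $a=1$ and $b=10$, the Beta prior has mean $a/(a+b)=1/11$, i.e., it encodes a prior expectation that roughly 9% of spots are anomalous.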

The values of $a$ and $b$ can be adjusted if prior knowledge about anomaly abundance is available. The complete data log likelihood for the posterior, denoted as ${{\ell}}_{{{\mathcal{c}}}}\left(\Theta \right)$, is expressed as:

$$\begin{aligned}{{\ell}}_{{{\mathcal{c}}}}\left(\Theta \right)=\log {{\rm{P}}}\left({{\mathcal{D}}}|\Theta \right)=\,&\sum_{i}\left[{\mathbb{I}}\left({z}_{i}=1\right)\left(\log \pi+\log {{\mathcal{N}}}\left({{{\mathcal{d}}}}_{i}|{\mu }_{1},{\sigma }_{1}^{2}\right)\right)\right.\\ &+\left.{\mathbb{I}}\left({z}_{i}=2\right)\left(\log \left(1-\pi \right)+\log {{\mathcal{N}}}\left({{{\mathcal{d}}}}_{i}|{\mu }_{2},{\sigma }_{2}^{2}\right)\right)\right]\\ &+\log {{\rm{Beta}}}\left(\pi |a,b\right)+\sum_{k=1}^{2}\log {{\rm{NIX}}}\left({\mu }_{k},{\sigma }_{k}^{2}|{m}_{0},{\kappa }_{0},{s}_{0}^{2},{\nu }_{0}\right),\end{aligned}$$

(14)

where ${z}_{i}$ denotes the component membership of spot $i$. In the E-step of the $t$-th iteration, the expected sufficient statistic ${\bar{{z}_{i}}}^{(t)}$ is derived from ${\Theta }^{(t-1)}$. In the subsequent M-step, ${\Theta }^{(t-1)}$ is updated to ${\Theta }^{(t)}$ by maximizing the auxiliary function $Q\big(\Theta,{\Theta }^{(t-1)}\big)={\mathbb{E}}\big[{{\ell}}_{{{\mathcal{c}}}}\big(\Theta \big)\mid {\Theta }^{(t-1)}\big]$. Refer to Supplementary Note 1.4 for details about the model inference.
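The exact update equations are given in Supplementary Note 1.4 and are not reproduced here. As a hedged sketch only, the following NumPy code runs a MAP-EM loop for the two-component model of Eqs. (9) to (11), using M-step updates obtained by setting the derivatives of the log posterior to zero under the standard NIX parameterization $\mathcal{N}(\mu |{m}_{0},{\sigma }^{2}/{\kappa }_{0})\,{\chi }^{-2}({\sigma }^{2}|{\nu }_{0},{s}_{0}^{2})$. The function names, the quantile-based initialization, and the stopping rule are assumptions; the constraint ${\mu }_{1} > {\mu }_{2}$ is only encouraged through initialization.

```python
import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def map_em_gmm(d, m0, s0_sq, kappa0=0.01, nu0=3.0, a=1.0, b=10.0,
               n_iter=200, tol=1e-8):
    """MAP-EM for a 2-component 1D GMM (component 0 = anomalous, 1 = normal)."""
    d = np.asarray(d, dtype=float)
    n = d.size
    pi = a / (a + b)                                   # start at the prior mean
    mu = np.array([np.quantile(d, 0.95), np.quantile(d, 0.25)])
    var = np.full(2, d.var() + 1e-8)
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each spot.
        p1 = pi * normal_pdf(d, mu[0], var[0])
        p2 = (1.0 - pi) * normal_pdf(d, mu[1], var[1])
        r1 = p1 / (p1 + p2 + 1e-300)
        resp = np.stack([r1, 1.0 - r1])                # shape (2, n)
        # M-step: posterior modes under the Beta and NIX priors.
        rk = resp.sum(axis=1)
        new_pi = (rk[0] + a - 1.0) / (n + a + b - 2.0)
        new_mu = (kappa0 * m0 + resp @ d) / (kappa0 + rk)
        ss = (resp * (d[None, :] - new_mu[:, None]) ** 2).sum(axis=1)
        new_var = (nu0 * s0_sq + kappa0 * (new_mu - m0) ** 2 + ss) / (nu0 + 3.0 + rk)
        converged = abs(new_pi - pi) + np.abs(new_mu - mu).sum() < tol
        pi, mu, var = new_pi, new_mu, new_var
        if converged:
            break
    return pi, mu, var, resp[0] > 0.5                  # True marks anomalous spots

# Toy usage: 180 "normal" scores near 0.1 and 20 "anomalous" scores near 0.8.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0.1, 0.02, 180), rng.normal(0.8, 0.05, 20)])
pi_hat, mu_hat, var_hat, is_anomalous = map_em_gmm(scores, m0=0.1, s0_sq=0.0004)
```

A production version would typically work in log space to avoid underflow for scores far from both component means.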

Multimodal learning of spatial gene expression data and histology image

To effectively integrate and harness spatial gene expression data and the associated histology images, STANDS generates spot embeddings from both data types, which are subsequently fused into multimodal embeddings for each spot using a TF block. A histology image is first segmented into patches centered around each spatial spot, adhering to the methodology outlined by Pang et al.⁴⁰. The image patch for spot $i$, represented as ${x}_{i}^{p}\in {{\mathbb{R}}}^{3\times W\times H}$, is processed through a pre-trained ResNet to yield initial embeddings ${\bar{x}}_{i}^{p}\in {{\mathbb{R}}}^{v}$, where “3” indicates the number of channels (RGB), and $W$ and $H$ denote the patch width and height, respectively. Meanwhile, we convert the locations of spatial spots into an undirected neighborhood graph $G=\left(V,E\right)$ with a pre-defined neighbor number $k$, where $V$ and $E$ represent the spot and edge sets, respectively. In our implementation, $k$ is set to 6. The adjacency matrix ${{\bf{A}}}\in {{\mathbb{R}}}^{{N}_{{spot}}\times {N}_{{spot}}}$ of $G$ is defined as:

$${{{\rm{A}}}}_{i,j}=\left\{\begin{array}{c}1,{{\rm{if}}}j\in {N}_{k}\left(i\right)\\ 0,\, {{\rm{otherwise}}}\end{array}\right.,$$

(15)

$$\widetilde{{{\bf{A}}}}={{\bf{I}}}+{{\bf{A}}},$$

(16)

where ${N}_{k}\left(i\right)$ represents the set of proximity-based $k$-nearest neighbors of spot $i$. Here, $\widetilde{{{\bf{A}}}}$ extends ${{\bf{A}}}$ by adding self-loops and is utilized by the GATv2⁴¹, a two-layer graph attention auto-encoder, in the generation of transcriptomic and image embeddings. For instance, we calculate an attention score ${{{\rm{\alpha }}}}_{i,j}^{\left(l\right)}$ between a given spot $i$ and its neighbor $j$ on the $l$-th encoder layer using the formula:

$${{{\rm{\alpha }}}}_{i,j}^{\left(l\right)}=\frac{\exp \left({{{\bf{W}}}}_{{att}}^{\left(l\right)}{{\rm{LeakyReLU}}}\, \left({{{\bf{W}}}}^{\left(l\right)}\left[{z}_{i}^{\left(l-1\right)}{{\rm{||}}}{z}_{j}^{\left(l-1\right)}\right]\right)\right)}{{\sum }_{{j}^{{\prime} }\in N\left(i\right)}\exp \left({{{\bf{W}}}}_{{att}}^{\left(l\right)}{{\rm{LeakyReLU}}}\,\left({{{\bf{W}}}}^{\left(l\right)}\left[{z}_{i}^{\left(l-1\right)}{{\rm{||}}}{z}_{{j}^{{\prime} }}^{\left(l-1\right)}\right]\right)\right)},$$

(17)

where, ${z}_{i}^{\left(l\right)}\in {{\mathbb{R}}}^{d}$ is spot $i$’s embedding generated by the $l$-th encoder layer. The matrix ${{{\bf{Z}}}}^{\left(l\right)}\in {{\mathbb{R}}}^{{N}_{{spot}}\times d}$, which compiles all ${z}_{i}^{\left(l\right)},\forall i\in \left[1,{N}_{{spot}}\right]$, is formulated as:

$${{{\bf{Z}}}}^{\left(l\right)}={{\rm{\sigma }}}\left({\widetilde{{{\bf{A}}}}}^{T}{{{\bf{W}}}}_{a}^{\left(l\right)}{{{\bf{Z}}}}^{\left(l-1\right)}{\left({{{\bf{W}}}}^{\left(l\right)}\right)}^{T}\right),\, l\in \left\{1,\, 2\right\},$$

(18)

where, ${{{\bf{W}}}}_{a}^{\left(l\right)}$ stores all ${{{\rm{\alpha }}}}_{i,j}^{\left(l\right)}$, and ${{\rm{\sigma }}}$ denotes a nonlinear activation function. Note that spot $i$’s initial embedding for the $0$-th layer, ${z}_{i}^{\left(0\right)}$, depends on the data type: ${x}_{i}^{g}$ for transcriptomic data and ${\bar{x}}_{i}^{p}$ for histology image. The transcriptomic and image embeddings outputted by the GATv2 encoder, denoted as ${{{\bf{Z}}}}^{g}\in {{\mathbb{R}}}^{{N}_{{spot}}\times d}$ and ${{{\bf{Z}}}}^{p}\in {{\mathbb{R}}}^{{N}_{{spot}}\times d}$ respectively, are concatenated into ${{{\bf{Z}}}}^{{concat}}\in {{\mathbb{R}}}^{{N}_{{spot}}\times 2d}$. This concatenated matrix serves as the input to a multi-head transformer block for data fusion. To elaborate, let $m$ denote the number of attention heads such that $2d$ is divisible by $m$. ${{{\bf{Z}}}}^{{concat}}$ is split into $m$ sub-embeddings ${{{\bf{Z}}}}_{1},\cdots,{{{\bf{Z}}}}_{m}\in {{\mathbb{R}}}^{{N}_{{spot}}\times 2d/m}$ followed by their mappings to the corresponding query, key and value matrices (${{{\bf{Q}}}}_{t},{{{\bf{K}}}}_{t},{{{\bf{V}}}}_{t}{{\boldsymbol{\in }}}{{\mathbb{R}}}^{{N}_{{spot}}\times d}$) as:

$${{{\bf{Q}}}}_{t} \,=\,{{{{\bf{Z}}}}_{t}{{\bf{W}}}}_{t}^{Q},\, {{{\bf{K}}}}_{t}=\,{{{{\bf{Z}}}}_{t}{{\bf{W}}}}_{t}^{K},\, {{{\bf{V}}}}_{t} \,=\,{{{{\bf{Z}}}}_{t}{{\bf{W}}}}_{t}^{V},\, \forall t\in \left[1,\,m\right],$$

(19)

where, ${{{\bf{W}}}}_{t}^{Q},\,{{{\bf{W}}}}_{t}^{K},\,{{{\bf{W}}}}_{t}^{V}\in {{\mathbb{R}}}^{\left(2d/m\right)\times d}$ are trainable weight matrices. The output embeddings ${{{\bf{H}}}}_{t}\in {{\mathbb{R}}}^{{N}_{{spot}}\times d}$ from attention head $t$ are calculated as:

$${{{\bf{H}}}}_{t}={{\rm{softmax}}}\left(\frac{{{{\bf{Q}}}}_{t}{{{\bf{K}}}}_{t}^{T}}{\sqrt{d}}\right){{{\bf{V}}}}_{t},$$

(20)

which essentially is an enhanced representation of a subspace of ${{{\bf{Z}}}}^{{concat}}$. Finally, the output embeddings from all attention heads are fused into the final spot embeddings ${{\bf{Z}}}\in {{\mathbb{R}}}^{{N}_{{spot}}\times 2d}$ as:

$${{\bf{Z}}}=\left[{{{\bf{H}}}}_{1}\parallel {{{\bf{H}}}}_{2}\parallel \cdots \parallel {{{\bf{H}}}}_{m}\right]{{{\bf{W}}}}^{o},$$

(21)

where, ${{{\bf{W}}}}^{o}{{\boldsymbol{\in }}}{{\mathbb{R}}}^{{md}\times 2d}$ represents the trainable weight matrix for fusing ${{{\bf{H}}}}_{t},\forall t\in \left[1,\, m\right]$.
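The shapes in Eqs. (19) to (21) can be checked with a short NumPy sketch of the multi-head fusion step. It covers only a single attention layer with randomly initialized weights; the names `softmax_rows` and `tf_fuse` and the toy dimensions are illustrative assumptions, and residual connections, normalization, and feed-forward sublayers of a full transformer block are omitted.

```python
import numpy as np

def softmax_rows(x):
    x = x - x.max(axis=1, keepdims=True)        # stabilise the exponentials
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)

def tf_fuse(Zg, Zp, Wq, Wk, Wv, Wo):
    """Fuse transcriptomic (Zg) and image (Zp) embeddings, Eqs. (19)-(21)."""
    Z_concat = np.concatenate([Zg, Zp], axis=1)            # (N, 2d)
    m, d = len(Wq), Zg.shape[1]
    heads = []
    # Split the 2d columns into m sub-embeddings of width 2d/m.
    for Zt, wq, wk, wv in zip(np.split(Z_concat, m, axis=1), Wq, Wk, Wv):
        Q, K, V = Zt @ wq, Zt @ wk, Zt @ wv                # each (N, d)
        heads.append(softmax_rows(Q @ K.T / np.sqrt(d)) @ V)
    return np.concatenate(heads, axis=1) @ Wo              # (N, md) @ (md, 2d)

# Toy usage: N = 5 spots, d = 4, m = 2 heads (so 2d/m = 4).
rng = np.random.default_rng(0)
N, d, m = 5, 4, 2
Zg, Zp = rng.normal(size=(N, d)), rng.normal(size=(N, d))
Wq = [rng.normal(size=(2 * d // m, d)) for _ in range(m)]
Wk = [rng.normal(size=(2 * d // m, d)) for _ in range(m)]
Wv = [rng.normal(size=(2 * d // m, d)) for _ in range(m)]
Wo = rng.normal(size=(m * d, 2 * d))
Z = tf_fuse(Zg, Zp, Wq, Wk, Wv, Wo)
```

Note that `np.split` requires $2d$ to be divisible by $m$, mirroring the divisibility condition stated above.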

Multi-sample ST data alignment

As illustrated in the C2 part of Fig. 1b, this task proceeds in two steps: Initially, each normal spot in target datasets is paired with its most similar spot in the reference dataset, forming a “kin” pair indicative of shared biological contents. Subsequently, based on these “kin” pairs, STANDS learns a “style-divergence” matrix which encodes the batch divergences between the target and reference datasets in its rows. This matrix allows the mapping of target datasets to the reference data space in a “style-transfer” manner.

In the first step, the generator of module II learns to reconstruct the embeddings of target spots using those of reference spots, while the discriminator learns to distinguish between the authentic and generated spots. Specifically, let ${{{\bf{Z}}}}_{T}\in {{\mathbb{R}}}^{{N}_{T}\times d},{{{\bf{Z}}}}_{R}\in {{\mathbb{R}}}^{{N}_{R}\times d}$ denote the transcriptomic embeddings of target and reference spots, respectively. ${N}_{T}$ and ${N}_{R}$ denote the number of target and reference spots, respectively. ${{\bf{M}}}\in {{\mathbb{R}}}^{{N}_{T}\times {N}_{R}}$ denotes a trainable non-negative mapping matrix. The generator reconstructs ${\hat{{{\bf{Z}}}}}_{T}$ as:

$${\hat{{{\bf{Z}}}}}_{T} \,=\, {{\rm{ReLU}}}\left({{\bf{M}}}{\odot}\bar{{{\bf{A}}}}\right){{{\bf{Z}}}}_{R},$$

(22)

$${\bar{{{\bf{A}}}}}_{i,j} \,=\left\{\begin{array}{c}1,\, {{\rm{if}}}j\in {N}_{k}\left(i\right)\\ 0,\, {{\rm{otherwise}}}\end{array}\right.,$$

(23)

where the ReLU function imposes a non-negative constraint on ${{\bf{M}}}$. $\bar{{{\bf{A}}}}{{\boldsymbol{\in }}}{{\mathbb{R}}}^{{N}_{T}\times {N}_{R}}$ is a kNN adjacency matrix that integrates spatial neighborhood information into the calculation. ${N}_{k}\left(i\right)$ represents the set of spot $i$’s k-nearest neighboring reference spots. The loss functions of the generator and discriminator in module II are given by:

$${{{\mathcal{L}}}}_{{{\rm{G}}}}=\alpha {\mathbb{E}}{\|{{{\bf{Z}}}}_{T}-{\hat{{{\bf{Z}}}}}_{T}\|}_{1}-\beta {\mathbb{E}}\left[D\left({\hat{{{\bf{Z}}}}}_{T}\right)\right],$$

(24)

$${{{\mathcal{L}}}}_{D}={\mathbb{E}}\left[D\left({\hat{{{\bf{Z}}}}}_{T}\right)\right]{\mathbb{-}}{\mathbb{E}}\left[D\left({{{\bf{Z}}}}_{T}\right)\right]+\lambda {\mathbb{E}}\left[{\left({\|\nabla D\left({\widetilde{{{\bf{Z}}}}}_{T}\right)\|}_{2}-1\right)}^{2}\right],$$

(25)

where $\widetilde{{{\bf{Z}}}}= \epsilon \hat{{{\bf{Z}}}}+\left(1-\epsilon \right){{\bf{Z}}},\epsilon \in \left({\mathrm{0,1}}\right)$, and $\alpha,\beta,\lambda \ge 0$ represent the weights of the loss terms. After training, the column index of the maximum value in the $i$-th row of ${{\bf{M}}}$ points to the reference spot that is “kin” to the $i$-th target spot.
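A minimal sketch of the reconstruction in Eqs. (22) and (23) and of the subsequent “kin” lookup is given below. Two points are assumptions rather than details from the text: the k nearest reference spots are found here by Euclidean distance between embeddings, and the kin index is read from the masked, rectified matrix rather than from the raw ${{\bf{M}}}$. Function names are hypothetical, and ${{\bf{M}}}$ is supplied directly instead of being learned adversarially.

```python
import numpy as np

def knn_mask(Z_T, Z_R, k):
    """A_bar[i, j] = 1 if reference spot j is among target spot i's k nearest."""
    dist = np.linalg.norm(Z_T[:, None, :] - Z_R[None, :, :], axis=2)
    idx = np.argsort(dist, axis=1)[:, :k]
    A_bar = np.zeros_like(dist)
    np.put_along_axis(A_bar, idx, 1.0, axis=1)
    return A_bar

def reconstruct_targets(M, A_bar, Z_R):
    """Eq. (22): Z_T_hat = ReLU(M * A_bar) @ Z_R."""
    return np.maximum(M * A_bar, 0.0) @ Z_R

def kin_indices(M, A_bar):
    """Column of the largest (masked, rectified) weight in each row."""
    return np.argmax(np.maximum(M * A_bar, 0.0), axis=1)

# Toy usage: three reference spots, two target spots near references 2 and 0.
Z_R = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
Z_T = Z_R[[2, 0]] + 0.01
A_bar = knn_mask(Z_T, Z_R, k=1)
M = np.ones((2, 3))                      # placeholder for the trained matrix
kin = kin_indices(M, A_bar)
Z_T_hat = reconstruct_targets(M, A_bar, Z_R)
```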

The two spots of a “kin” pair are presumed to share similar biological contents so that the reference spot can be approximated by removing the “style-divergence” (batch variations) from the target spot. Therefore, in the second step, a “style”-transfer GAN (module III) is employed to learn the “style-divergences” between target and reference datasets as a matrix ${{\bf{S}}}\in {{\mathbb{R}}}^{{N}_{{batch}}\times d}$. Specifically, for each target spot $i$, the encoder within the generator of module III maps the gene expression vector ${x}_{i}$ to a latent embedding ${z}_{i}\in {{\mathbb{R}}}^{d}$. This encoder and the one within module I share the same network architecture but are trained independently, with the former initialized using the latter’s trained weights. The style-removed embedding ${\hat{z}}_{i}$ then approximates the embedding of its “kin” reference spot $j$ as follows:

$${z}_{i} \,=\, {f}_{{GAT}}\left({x}_{i}^{g},{G}_{i},\,{{\bf{W}}}\right),$$

(26)

$${\hat{z}}_{i} \,=\, {z}_{i}-{{{\bf{S}}}}^{T}{b}_{i},$$

(27)

where ${b}_{i}\in {{\mathbb{R}}}^{{N}_{{batch}}}$ denotes spot $i$’s one-hot batch identity vector, and ${G}_{i}$ is the graph representation of the dataset containing spot $i$. The generator’s decoder then reconstructs ${\hat{x}}_{i}^{g}$ from ${\hat{z}}_{i}$, while the discriminator of module III learns to distinguish between ${x}_{i}^{g}$ and ${\hat{x}}_{i}^{g}$. The loss functions for the generator and discriminator are:

$${{{\mathcal{L}}}}_{G} \,=\, \alpha {\mathbb{E}}{\|{x}_{R}^{g}-{\hat{x}}_{R}^{g}\|}_{1}-\beta {\mathbb{E}}\left[D\left({\hat{x}}_{R}^{g}\right)\right],$$

(28)

$${{{\mathcal{L}}}}_{D} \,=\,{\mathbb{E}}\left[D\left({\hat{x}}_{R}^{g}\right)\right]{\mathbb{-}}{\mathbb{E}}\left[D\left({x}_{R}^{g}\right)\right]+\lambda {\mathbb{E}}\left[{\left({\|\nabla D\left({\widetilde{x}}_{R}^{g}\right)\|}_{2}-1\right)}^{2}\right],$$

(29)

where $\alpha,\beta,\lambda$ and ${\widetilde{x}}_{R}$ mirror their counterparts in module II. By passing through the trained generator of module III, spots across multiple target datasets are allowed to be collectively aligned in the common reference data space.
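The style-removal step of Eq. (27) amounts to subtracting one row of ${{\bf{S}}}$ from each embedding, selected by the spot's batch label. A minimal NumPy sketch follows, with a hand-written ${{\bf{S}}}$ standing in for the learned matrix and a hypothetical function name.

```python
import numpy as np

def remove_style(Z, batch_ids, S):
    """Eq. (27) for all spots at once: row i becomes z_i - S^T b_i."""
    B = np.eye(S.shape[0])[batch_ids]     # one-hot batch identities, (N, N_batch)
    return Z - B @ S

# Toy usage: three spots from two batches, embedding width 2.
Z = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
S = np.array([[1.0, 1.0], [0.0, 2.0]])   # placeholder style embeddings
Z_hat = remove_style(Z, np.array([0, 1, 0]), S)
```

Because the correction is a per-batch offset in the embedding space, spots from the same dataset are shifted identically, which leaves their relative geometry unchanged.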

Subtyping anomalous tissue domains across multiple datasets

Initially, identified anomalous spots across multiple target datasets are aligned by module III in the common reference space, effectively reducing the confounding batch variations in anomaly subtyping. Then, as illustrated in the C3 part of Fig. 1b, the embedding and reconstruction residual of each aligned anomalous spot are fused into a comprehensive embedding that is informative on anomaly subtypes. Specifically, for a given anomalous spot $i$, let ${x}_{i}^{g}$ and ${x}_{i}^{p}$ denote its aligned gene expression and image patch vectors, respectively; ${\hat{x}}_{i}^{g}$ and ${\hat{x}}_{i}^{p}$ denote the reconstructed vectors from ${x}_{i}^{g}$ and ${x}_{i}^{p}$, respectively; ${r}_{i}^{g}$ and ${r}_{i}^{p}$ denote the reconstruction residuals of gene expression and image patch vectors, respectively; ${z}_{i}$ and ${{{\rm{\zeta }}}}_{i}$ represent the module I-generated embeddings of ${x}_{i}$ and ${r}_{i}$, respectively. Then, we have:

$${r}_{i} \,=\, \left[{r}_{i}^{g}{{\rm{||}}}{r}_{i}^{p}\right]=\left[\left({x}_{i}^{g}-{\hat{x}}_{i}^{g}\right){{\rm{||}}}\left({x}_{i}^{p}-{\hat{x}}_{i}^{p}\right)\right],$$

(30)

$${z}_{i} \,=\, \left[{f}_{{GAT}}\left({x}_{i}^{g},G,\,{{{\bf{W}}}}_{1}\right){{\rm{||}}}{f}_{{rn}-{GAT}}\left({x}_{i}^{p},G,\,{{{\bf{W}}}}_{2}\right)\right],$$

(31)

$${{{\rm{\zeta }}}}_{i} \,=\, \left[{f}_{{GAT}}\left({r}_{i}^{g},\, G,\,{{{\bf{W}}}}_{1}\right){{\rm{||}}}{f}_{{rn}-{GAT}}\left({r}_{i}^{p},\, G,\,{{{\bf{W}}}}_{2}\right)\right],$$

(32)

$${z}_{i}^{*} \,=\,{TF}\left(\left[{z}_{i}{{\rm{||}}}{{{\rm{\zeta }}}}_{i}\right],\,{{{\bf{W}}}}_{{tf}}\right),$$

(33)

where ${z}_{i}^{*}$ represents the fused embedding of anomaly $i$. DEC⁴², a discriminatively boosted clustering algorithm, groups anomalies into clusters based on their ${z}^{*}$. It applies a Cauchy kernel to ${z}_{i}^{*}$ to calculate the soft assignment score (${q}_{i,j}$) of anomaly $i$ to a cluster $j$ as:

$${q}_{i,j} \,=\frac{{\left(1+\frac{{\|{z}_{i}^{*}-{\mu }_{j}\|}^{2}}{v}\right)}^{-1}}{{\sum }_{{j}^{{\prime} }}{\left(1+\frac{{\|{z}_{i}^{*}-{\mu }_{{j}^{{\prime} }}\|}^{2}}{v}\right)}^{-1}},$$

(34)

where, ${\mu }_{j}$ denotes the centroid of cluster $j$, $v$ the degree of freedom of the Cauchy kernel. The clustering loss function ${{\mathcal{L}}}$ is based on the KL-divergence between $q$ and an auxiliary target distribution $p$, defined as:

$${p}_{i,j} \,=\frac{\frac{{q}_{i,j}^{2}}{{\sum }_{{i}^{{\prime} }}{q}_{{i}^{{\prime} },j}}}{{\sum }_{{j}^{{\prime} }}\left(\frac{{q}_{i,{j}^{{\prime} }}^{2}}{{\sum }_{{i}^{{\prime} }}{q}_{{i}^{{\prime} },{j}^{{\prime} }}}\right)},$$

(35)

$${{\mathcal{L}}}={\sum}_{i}{\sum}_{j}{p}_{i,j}\log \left(\frac{{p}_{i,j}}{{q}_{i,j}}\right).$$

(36)

Essentially, anomalies with high-confidence assignments are up-weighted in the distribution $p$. In practice, the iterative updating of ${{{\bf{W}}}}_{{tf}}$ and $\mu$, aiming to minimize ${{\mathcal{L}}}$, nudges $q$ toward $p$ and incrementally transforms harder-to-cluster embeddings ${z}^{*}$ into easier ones. This self-paced clustering continues until the changes in anomalies’ hard assignments fall below a threshold or a predetermined number of iterations is reached. The resultant hard cluster assignments of anomalous spots correspond to their subtype labels. The number of clusters is assumed to be known or automatically inferred as described in Supplementary Note 1.5.
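The quantities in Eqs. (34) to (36) can be computed in a few lines of NumPy. The sketch below evaluates them once for fixed embeddings and centroids; it does not perform the gradient updates of ${{{\bf{W}}}}_{{tf}}$ and $\mu$, and the function names and toy data are illustrative assumptions.

```python
import numpy as np

def soft_assign(Z, centroids, v=1.0):
    """Eq. (34): kernel-based soft assignment q[i, j] of anomaly i to cluster j."""
    sq_dist = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    q = 1.0 / (1.0 + sq_dist / v)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Eq. (35): square q, normalise by cluster frequency, renormalise per row."""
    w = q ** 2 / q.sum(axis=0, keepdims=True)
    return w / w.sum(axis=1, keepdims=True)

def kl_loss(p, q):
    """Eq. (36): KL(p || q) summed over anomalies and clusters."""
    return float((p * np.log(p / q)).sum())

# Toy usage: four embeddings forming two tight groups around two centroids.
Z = np.array([[0.0, 0.0], [0.2, 0.0], [3.0, 3.0], [3.2, 3.0]])
centroids = np.array([[0.0, 0.0], [3.0, 3.0]])
q = soft_assign(Z, centroids)
p = target_distribution(q)
loss = kl_loss(p, q)
```

In this toy case each row of `p` is more peaked than the corresponding row of `q`, which is the sharpening effect that drives the self-paced clustering.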

Model architecture and training

Anomalous tissue domain detection

Here, GAN module I is first trained on the reference dataset and then applied to the target data, generating reconstruction errors as anomaly scores for each target spot. GAN module I comprises a generator and a discriminator. During the training of GAN module I, we set a mini batch size of 128 and utilize the Adam optimizer with a learning rate of 3e-4.

The generator is further divided into an encoder, a memory bank, and a decoder. When using an ST dataset as reference, the encoder is a two-layer GAT with an architecture of 3000-512-256, with four 128-dimensional attention heads in the first layer and a single 256-dimensional attention head in the second layer. When histology data are available (e.g., for 10x Visium datasets), a pretrained ResNet-34⁴³ is used to extract 256-dimensional visual features from 112 × 112 pixel image patches that are segmented from the histology image and centered on each spatial spot. These visual features are further encoded by another two-layer GAT, with the same architecture as that used for encoding ST data, to capture the spatial relationships among neighboring patches. The 256-dimensional image and gene expression embeddings are then fused using a TF block comprising three transformer encoder layers, each with four 128-dimensional attention heads, to output 512-dimensional fused embeddings. Note that the encoder branch for visual features and the TF block are omitted in the absence of histology images. When cross-referencing an scRNA-seq dataset, a two-layer MLP network with an architecture of 3000-512-256 replaces the GAT in the encoder to generate gene expression embeddings at each spot. The batch (128) of embeddings output from the encoder is subsequently enqueued into the memory bank, which has a size of 512 × 512 for multimodal embeddings and 512 × 256 for single-modal embeddings, while an equal number of the oldest embeddings in the bank are dequeued. After memory bank-mediated embedding reconstruction, 256-dimensional single-modal gene expression embeddings are input to the decoder, while 512-dimensional multimodal embeddings are split into 256-dimensional image and gene expression embeddings before being fed into their respective decoders. The decoder for ST data is a two-layer MLP with an architecture of 256-512-3000, and the decoder for image data is a ResNet-34 decoder symmetric to the ResNet encoder, comprising transposed convolutional layers.
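The first-in, first-out behavior of the memory bank can be illustrated with a minimal sketch. It only mirrors the enqueue/dequeue bookkeeping described above (512 slots, batches of 128); the class name `MemoryBank` and the placeholder tuples standing in for 256-dimensional embeddings are assumptions made for illustration, not the STANDS implementation.

```python
from collections import deque


class MemoryBank:
    """Fixed-size FIFO queue of embeddings (illustrative only)."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest items on append.
        self.items = deque(maxlen=capacity)

    def enqueue(self, batch):
        """Add a batch; once the bank is full, an equal number of the
        oldest embeddings is evicted."""
        for embedding in batch:
            self.items.append(embedding)

    def __len__(self):
        return len(self.items)


bank = MemoryBank(capacity=512)
for step in range(5):
    # Placeholder "embeddings": (training step, index within batch).
    batch = [(step, i) for i in range(128)]
    bank.enqueue(batch)

oldest_step = bank.items[0][0]   # step of the oldest stored embedding
newest_step = bank.items[-1][0]  # step of the newest stored embedding
```

After five batches, the bank holds only the four most recent ones, so the first batch has been dequeued entirely.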

The discriminator, comprising an encoder and a four-layer MLP-based classifier, accepts pairs of original and reconstructed data. The architecture of its encoder mirrors that of the generator’s encoder, and the classifier has an architecture of 512-256(x3)−16 in the presence of image data or 256(x4)−16 otherwise. The discriminator is trained to maximize the L1-norm difference between the 16-dimensional output embeddings of the original and reconstructed data.

Multi-sample ST data alignment

Initially, putative anomalous spots identified by GAN module I are excluded from the target datasets to minimize their confounding effects during alignment. GAN module II’s generator processes gene expression embeddings of both reference and target spots generated by GAN module I’s encoder, training a non-negative mapping matrix to reconstruct the target embeddings from the reference embeddings. The discriminator is a four-layer MLP with an architecture of 512-256(x3)−16, which aims to maximize the L1-norm difference between the original and reconstructed target embeddings. Once trained, the non-negative mapping matrix is used to identify kin pairs of reference and target spots, whose raw data are then input into GAN module III.

The encoders and decoders of the generators in GAN module III and GAN module I share the same architectures but are trained independently, with the former initialized with the latter’s trained weights. GAN module III trains a matrix $S\in {{\mathbb{R}}}^{{n}_{b}\times 256}$ that comprises ${n}_{b}$ style embeddings representing various batch effects. These style embeddings are subtracted from their corresponding target spots’ embeddings to map the target spots to the common embedded reference space, thereby transferring the target datasets’ styles to that of the reference dataset. For each target spot, the generator’s decoder uses its “style-transferred” embedding to reconstruct its kin reference spot, which is then paired with the original data as input to the discriminator, whose architecture mirrors that of the discriminator in GAN module I. Lastly, the training of this module adopts a batch size of 128 and the Adam optimizer with a learning rate of 3e-4.
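The style-subtraction step can be sketched as follows. This is only a toy illustration of subtracting a per-batch style embedding from each target spot embedding; the function name `remove_style`, the 3-dimensional vectors (instead of 256), and the hand-set style matrix (which STANDS learns during training) are assumptions.

```python
def remove_style(target_embeddings, batch_ids, style_matrix):
    """Subtract each spot's batch style embedding from its embedding."""
    aligned = []
    for z, b in zip(target_embeddings, batch_ids):
        style = style_matrix[b]  # row of S for this spot's batch
        aligned.append([zi - si for zi, si in zip(z, style)])
    return aligned


# Toy setting: n_b = 2 batches, 3-d embeddings for readability.
S = [[1.0, 0.0, -1.0],
     [0.5, 0.5, 0.5]]
Z = [[2.0, 1.0, 0.0],
     [1.5, 1.5, 1.5],
     [1.0, 0.0, -1.0]]
batches = [0, 1, 0]

aligned = remove_style(Z, batches, S)
```

In this toy case the first two spots, which come from different batches, land on the same point once their batch styles are removed.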

Anomaly subtyping

Anomalous spots identified by GAN module I are aligned in the reference data space using the trained GAN module III. Then, the frozen encoder and decoder from GAN module I are used to generate post-alignment embeddings of the identified anomalous spots and their reconstruction errors, respectively. Using a dedicated encoder that mirrors the encoder of GAN module I’s generator but is trained independently, reconstruction errors are further converted into embeddings with the same dimensions as the spot embeddings. The spot and reconstruction error embeddings are fused into 128-dimensional embeddings using a trainable TF block consisting of three transformer layers, each with multiple 128-dimensional attention heads. Specifically, there are eight attention heads in a transformer layer when using multimodal data or four heads otherwise. The outputs are subsequently subjected to self-paced discriminatively boosted clustering. The training process iterates between clustering and fused embedding generation until the change in the anomalies’ hard assignments falls below a threshold (0.001) or a maximum number of iterations (2e4) is reached.

Data preprocessing

In this study, we follow the standard data preprocessing pipeline provided by the Scanpy⁴⁴, SpatialDE⁴⁵ and GeneClust⁴⁶ packages. Specifically, mitochondrial and External RNA Controls Consortium (ERCC) spike-in genes are removed. Genes detected in fewer than 10 spots are excluded. We do not filter spatial spots, so as to maintain spatial data integrity. Gene expression count matrices are normalized by library size and then log-transformed. Finally, we use SpatialDE to select the top 3000 spatially variable genes (SVGs) as inputs to STANDS.
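The filtering and normalization steps can be sketched in plain Python as below. This is not the authors' pipeline, which relies on Scanpy and SpatialDE; the function name `preprocess`, the gene-name prefixes `MT-` and `ERCC-`, the `target_sum` of 1e4, and computing library size after gene filtering are all assumptions for illustration. SVG selection with SpatialDE is omitted.

```python
import math


def preprocess(counts, gene_names, min_spots=10, target_sum=1e4):
    """counts: spots x genes raw counts given as a list of lists."""
    n_genes = len(gene_names)
    # Drop mitochondrial and ERCC spike-in genes (assumed name prefixes).
    keep = [g for g in range(n_genes)
            if not gene_names[g].upper().startswith(("MT-", "ERCC-"))]
    # Keep genes detected in at least `min_spots` spots.
    keep = [g for g in keep
            if sum(1 for row in counts if row[g] > 0) >= min_spots]
    processed = []
    for row in counts:
        library_size = sum(row[g] for g in keep)
        scale = target_sum / library_size if library_size > 0 else 0.0
        # Library-size normalization followed by log(1 + x).
        processed.append([math.log1p(row[g] * scale) for g in keep])
    return processed, [gene_names[g] for g in keep]


# Toy data: 12 spots, 4 genes; GENE_B is detected in only 3 spots.
genes = ["MT-CO1", "GENE_A", "GENE_B", "ERCC-00002"]
raw = [[5, 2, 1 if i < 3 else 0, 1] for i in range(12)]
X, kept = preprocess(raw, genes)
```

No spots are dropped, in line with the choice above to keep the spatial layout intact.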

Evaluation metrics

Anomalous tissue domain detection

Spatial grouping discrepancy (SGD)

We propose the SGD, a novel metric to assess both the accuracy of labels and the consistency of spatial structures. Specifically, spatial locations are represented as nodes in an undirected graph. Normal spots are isolated, while anomalous spots are connected to their k-nearest anomalous neighbors. Note that in the anomaly detection results, spots incorrectly identified as anomalies (false positives) become connected, and false negatives become isolated, which leads to a deviation from the local structures of the ground truth graph. Spots are divided into two regions: one comprising true-positive and false-positive (TP + FP) anomalies, and the other comprising true-positive and false-negative (TP + FN) anomalies. We perform a bootstrap sampling of $m$ sets of spots from these two regions, generating a collection $S=\left\{{s}_{i}:\left\{{s}_{i}^{\left(1\right)},{s}_{i}^{\left(2\right)}\right\},\forall i\in \left[1,m\right]\right\}$. Subsequently, both cluster coefficients and degrees for spots within ${s}_{i}$ are calculated as follows:

$$c{c}_{i,j}^{\left(r,\, l\right)}=\frac{2{E}_{j}}{{k}_{j}\left({k}_{j}-1\right)},$$

(37)

$${d}_{j}^{\left(r,\, l\right)}={\sum }_{n\in {N}_{k}\left(j\right)}{\mathbb{I}}\left({e}_{j,n}=1\right),$$

(38)

for all $i\in \left[1,m\right]$, every spot $j$ in ${s}_{i}^{\left(r\right)}$, region $r\in \left\{1:{TP}+{FP},2:{TP}+{FN}\right\}$, and label type $l\in \left\{1:{{\rm{ground\; truth}}},2:{{\rm{anomaly\; detection\; outcomes}}}\right\}$. Here, $c{c}_{i,j}^{\left(r,l\right)}$ and ${d}_{j}^{\left(r,l\right)}$ represent the cluster coefficient and degree of spot $j$ within region $r$ from bootstrap sample $i$, based on either the ground truth ($l=1$) or anomaly detection outcomes ($l=2$). ${k}_{j}$ denotes the number of neighbors connected to spot $j$, ${E}_{j}$ the number of edges among these neighbors, ${e}_{j,n}$ the edge between spots $j$ and $n$, and ${N}_{k}\left(j\right)$ the set of k-nearest neighbors of spot $j$. We adopt both degree and cluster coefficient metrics because they reflect the centrality and neighborhood connectivity of spots, respectively.
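As a rough illustration of Eqs. 37 and 38, the sketch below builds the anomaly graph for a handful of spots and computes the degree and cluster coefficient of a spot. The function names, the toy coordinates, and the choice to symmetrize the k-nearest-neighbor edges (so a degree can exceed $k$) are assumptions; bootstrap sampling is left out.

```python
import math
from itertools import combinations


def build_anomaly_graph(coords, is_anomaly, k):
    """Undirected graph in which each anomalous spot is linked to its
    k nearest anomalous spots; normal spots stay isolated."""
    n = len(coords)
    adj = {i: set() for i in range(n)}
    anomalous = [i for i in range(n) if is_anomaly[i]]
    for i in anomalous:
        ranked = sorted(
            (j for j in anomalous if j != i),
            key=lambda j: math.hypot(coords[i][0] - coords[j][0],
                                     coords[i][1] - coords[j][1]))
        for j in ranked[:k]:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def degree(adj, j):
    """Number of edges attached to spot j (Eq. 38)."""
    return len(adj[j])


def cluster_coefficient(adj, j):
    """2 * E_j / (k_j * (k_j - 1)), set to 0 when k_j < 2 (Eq. 37)."""
    k_j = len(adj[j])
    if k_j < 2:
        return 0.0
    e_j = sum(1 for a, b in combinations(adj[j], 2) if b in adj[a])
    return 2.0 * e_j / (k_j * (k_j - 1))


coords = [(0, 0), (1, 0), (0, 1), (5, 5)]
labels = [True, True, True, False]  # last spot is normal
graph = build_anomaly_graph(coords, labels, k=2)
```

Running the same functions once with ground-truth labels and once with detected labels gives the two sets of values ($l=1$ and $l=2$) that the metric compares.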

Next, for any two bootstrap samples $i$ and $j$, we quantify the discrepancy in the distribution of their cluster coefficients or degrees using the Wasserstein distance, $W({p}_{i,t},{p}_{j,t})$, defined as:

$$W\left({p}_{i,t}^{\left(r\right)},\, {p}_{j,t}^{\left(r\right)}\right)={\inf }_{{{\rm{\gamma }}}\in \Pi \left({p}_{i,t}^{\left(r\right)},\, {p}_{j,t}^{\left(r\right)}\right)}{{\mathbb{E}}}_{\left(x,y\right) \sim {{\rm{\gamma }}}}\left[\|x-y\|\right],$$

(39)

$$W\left({p}_{i,t},\, {p}_{j,t}\right)={TPR}\times W\left({p}_{i,t}^{\left(1\right)},\, {p}_{j,t}^{\left(1\right)}\right)+\left(1-{TPR}\right)\times W\left({p}_{i,t}^{\left(2\right)},\, {p}_{j,t}^{\left(2\right)}\right),$$

(40)

$${TPR}=\frac{{TP}}{{TP}+{FN}},$$

(41)

for every $t\in \left\{1:{{\rm{degree}}},\, 2:{{\rm{cluster\; coefficient}}}\right\}$ and $r\in \left\{1:{TP}+{FP},2:{TP}+{FN}\right\}$. Here, ${p}_{i,t}^{\left(r\right)}$ denotes the distribution of $t$ for region $r$ in the $i$-th bootstrap sample, $\Pi \left(p,q\right)$ represents the set of all joint distributions with marginals $p$ and $q$, and $\gamma$ denotes a valid transport plan between these distributions. To capture high-order moments of distributional discrepancy, we apply a Gaussian-like kernel to the Wasserstein distance:

$${{\mathcal{S}}}\left({p}_{i,t},\, {p}_{j,t}\right) \,=\, \exp \left(-\frac{W\left({p}_{i,t},\, {p}_{j,t}\right)}{\tau }\right),$$

(42)

where $\tau$ is a positive temperature hyperparameter. The Moore-Aronszajn theorem guarantees that this symmetric and positive-definite kernel induces a unique Reproducing Kernel Hilbert Space (RKHS)⁴⁷. Finally, we define the SGD for metric $t$ as the Maximum Mean Discrepancy (MMD) score in this RKHS:

$${{\rm{SGD}}}\left({p}_{{true},t}{{\rm{||}}}{p}_{{detect},t}\right)= \, {{\mathbb{E}}}_{{p}_{i,t},{p}_{{i}^{{\prime} },t} \sim {p}_{{true},t}}\left[{{\mathcal{S}}}\left({p}_{i,t},{p}_{{i}^{{\prime} },t}\right)\right]\\ +{{\mathbb{E}}}_{{p}_{j,t},{p}_{{j}^{{\prime} },t} \sim {p}_{{detect},t}}\left[{{\mathcal{S}}}\left({p}_{j,t},{p}_{{j}^{{\prime} },t}\right)\right] \\ -2{{\mathbb{E}}}_{{p}_{i,t} \sim {p}_{{true},t},{p}_{j,t} \sim {p}_{{detect},t}}\left[{{\mathcal{S}}}\left({p}_{i,t},{p}_{j,t}\right)\right],$$

(43)

where ${p}_{{true},t}$ and ${p}_{{detect},t}$ represent the sets of distributions for metric $t$ (degree or cluster coefficient) derived from bootstrap samples in the contexts of ground truth and anomaly detection outcomes, respectively. Based on SGD, we further propose multi-SGD to measure the spatial discrepancy between spatial clustering results and ground truth that involves multiple domain types (see “Multi-type spatial grouping discrepancy” section below).
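A minimal pure-Python sketch of Eqs. 39 to 43 may make the computation concrete. It assumes equal-sized one-dimensional samples of degrees (or cluster coefficients), for which the Wasserstein-1 distance reduces to the mean absolute difference of sorted values, and it uses a biased estimate that includes self-pairs. The function names, the toy values, and $\tau = 1$ are illustrative assumptions rather than the STANDS implementation.

```python
import math


def wasserstein_1d(xs, ys):
    """W1 between two equal-sized 1-D samples (sorted-value matching)."""
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def region_weighted_distance(sample_i, sample_j, tp, fn):
    """Eq. 40: TPR-weighted sum of the two per-region distances.
    Each sample is a pair (region 1 values, region 2 values)."""
    tpr = tp / (tp + fn)  # Eq. 41
    w1 = wasserstein_1d(sample_i[0], sample_j[0])
    w2 = wasserstein_1d(sample_i[1], sample_j[1])
    return tpr * w1 + (1.0 - tpr) * w2


def sgd_score(true_samples, detect_samples, tp, fn, tau=1.0):
    """Eq. 43 with the kernel exp(-W / tau) of Eq. 42."""
    def mean_kernel(set_a, set_b):
        vals = [math.exp(-region_weighted_distance(a, b, tp, fn) / tau)
                for a in set_a for b in set_b]
        return sum(vals) / len(vals)

    return (mean_kernel(true_samples, true_samples)
            + mean_kernel(detect_samples, detect_samples)
            - 2.0 * mean_kernel(true_samples, detect_samples))


# Toy degree samples from two bootstrap draws per label type.
truth = [([2, 2, 1], [2, 1, 2]), ([2, 1, 1], [2, 2, 2])]
good = truth  # detection that reproduces the ground-truth structure
bad = [([0, 0, 0], [0, 0, 1]), ([0, 1, 0], [0, 0, 0])]

score_good = sgd_score(truth, good, tp=3, fn=1)
score_bad = sgd_score(truth, bad, tp=3, fn=1)
```

A detection result whose graph structure matches the ground truth scores zero, while a fragmented one scores higher, so lower SGD is better.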

Multi-sample ST data alignment

The performance of multi-sample alignment is assessed using multiple metrics calculated on reduced t-SNE embeddings of aligned datasets. These metrics include integration local inverse Simpson’s index (iLISI)³³, BatchKL²⁵, and ASW_batch⁴⁸ for evaluating batch mixing effects, and ASW_type⁴⁸ for evaluating cross-batch domain (or spot) type alignment. Additionally, ARI is used to evaluate spatial clustering performed on aligned datasets.

iLISI. This metric measures the effective number of batches present in the local neighborhoods of spots across aligned datasets by calculating a score that represents the degree of local batch mixing. The score value ranges from 1 to ${N}_{{batch}}$, with a higher value indicating more effective batch mixing. To elaborate, a neighboring spot probability matrix is calculated as:

$${p}_{i,j}=\begin{cases}0, & {{\rm{if}}}\; j\notin {N}_{k}\left(i\right)\\ \dfrac{\exp \left(-\beta {\|{x}_{i}-{x}_{j}\|}_{2}^{2}\right)}{{\sum }_{{j}^{{\prime} }\in {N}_{k}\left(i\right)}\exp \left(-\beta {\|{x}_{i}-{x}_{{j}^{{\prime} }}\|}_{2}^{2}\right)}, & {{\rm{if}}}\; j\in {N}_{k}\left(i\right)\end{cases}$$

(44)

where $i,j\in [1,{N}_{{spot}}]$, and ${x}_{i}$ denotes the reduced t-SNE embeddings of spot $i$, ${N}_{k}\left(i\right)$ the set of proximity-based $k$-nearest neighbors of spot $i$, and ${p}_{i,j}$ the probability that spot $i$ is aligned to spot $j$. Then, the iLISI score is calculated as:

$${{\rm{iLISI}}} \,=\, \frac{1}{{N}_{{spot}}}{\sum }_{i=1}^{{N}_{{spot}}}{\left({p}_{i}^{T}{{\bf{B}}}{{{\bf{B}}}}^{T}{p}_{i}\right)}^{-1},$$

(45)

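where ${p}_{i}={\left({p}_{i,1},\cdots,\, {p}_{i,{N}_{{spot}}}\right)}^{T}$ collects the neighbor probabilities of spot $i$, ${{\bf{B}}}={\left({b}_{1},\, {b}_{2},\cdots,\, {b}_{{N}_{{spot}}}\right)}^{T}\in {{\mathbb{R}}}^{{N}_{{spot}}\times {N}_{{batch}}}$ represents a batch-identity matrix, and ${b}_{i}$ represents the one-hot batch-identity vector of spot $i$.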
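The iLISI computation in Eqs. 44 and 45 can be sketched as follows. The sketch uses a fixed $\beta$ and excludes each spot from its own neighborhood; both choices, like the function name `ilisi` and the toy coordinates, are assumptions made for illustration.

```python
import math


def ilisi(embeddings, batch_ids, k, beta=1.0):
    """Mean inverse Simpson's index of batch labels over kNN neighborhoods."""
    n = len(embeddings)
    batches = sorted(set(batch_ids))
    scores = []
    for i in range(n):
        # Squared Euclidean distances to all other spots.
        d2 = {j: sum((a - b) ** 2
                     for a, b in zip(embeddings[i], embeddings[j]))
              for j in range(n) if j != i}
        neighbors = sorted(d2, key=d2.get)[:k]
        weights = {j: math.exp(-beta * d2[j]) for j in neighbors}
        z = sum(weights.values())
        # Probability mass per batch in the neighborhood, i.e. B^T p_i.
        mass = {b: 0.0 for b in batches}
        for j in neighbors:
            mass[batch_ids[j]] += weights[j] / z
        scores.append(1.0 / sum(m * m for m in mass.values()))
    return sum(scores) / n


# Four spots on a unit square; every spot has one neighbor per batch.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
mixed = ilisi(square, [0, 0, 1, 1], k=2)

# Two well-separated groups, each made of a single batch.
blobs = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
separated = ilisi(blobs, [0, 0, 0, 1, 1, 1], k=2)
```

With two batches, the score approaches 2 under perfect local mixing and 1 when neighborhoods contain a single batch.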

BatchKL. This metric assesses the effectiveness of batch correction by calculating mixing Kullback-Leibler (KL) divergences. It reflects the batch diversity across aligned datasets, with a lower value indicating more effective batch mixing. Initially, 100 spots are randomly sampled from all batches, followed by the calculation of the regional mixing KL divergence as:

$${{\rm{BatchKL}}} \,={\sum }_{b=1}^{B}{p}_{b}\log \frac{{p}_{b}}{{q}_{b}},$$

(46)

where ${q}_{b}$ represents the proportion of spots from batch $b$ in the entire sample, while ${p}_{b}$ represents the average proportion of spots from batch $b$ within the $k$-nearest neighborhood of each sampled spot in the reduced t-SNE space.
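One way to read Eq. 46 is sketched below: for each sampled spot, the batch proportions in its $k$-nearest neighborhood are compared with the global batch proportions, and the resulting regional KL divergences are averaged over the sampled spots. Averaging per-spot divergences (rather than averaging proportions first), the fixed random seed, and the function name `batch_kl` are assumptions of this sketch.

```python
import math
import random


def batch_kl(embeddings, batch_ids, k, n_samples=100, seed=0):
    """Average regional mixing KL divergence over randomly sampled spots."""
    n = len(embeddings)
    batches = sorted(set(batch_ids))
    q = {b: batch_ids.count(b) / n for b in batches}  # global proportions
    rng = random.Random(seed)
    sampled = rng.sample(range(n), min(n_samples, n))
    total = 0.0
    for i in sampled:
        ranked = sorted(
            (sum((a - c) ** 2 for a, c in zip(embeddings[i], embeddings[j])), j)
            for j in range(n) if j != i)
        neighbors = [j for _, j in ranked[:k]]
        for b in batches:
            p_b = sum(1 for j in neighbors if batch_ids[j] == b) / k
            if p_b > 0:  # terms with p_b = 0 contribute nothing
                total += p_b * math.log(p_b / q[b])
    return total / len(sampled)


square = [(0, 0), (1, 0), (1, 1), (0, 1)]
kl_mixed = batch_kl(square, [0, 0, 1, 1], k=2)

blobs = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
kl_separated = batch_kl(blobs, [0, 0, 0, 1, 1, 1], k=2)
```

Well-mixed neighborhoods give a value near zero, and fully separated batches give a larger value ($\log 2$ in the two-batch toy case).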

ASW_batch & ASW_type. The two metrics represent the average silhouette width of aligned spots based on their batch identities (ASW_batch) and domain types (ASW_type). A higher silhouette coefficient implies that observations within identical groups form compact clusters, while those belonging to different groups are well-separated. Therefore, a lower ASW_batch score indicates more effective batch mixing, while a higher ASW_type score indicates more accurate cross-batch domain (or spot) type alignment.
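The contrast between the two silhouette-based scores can be seen in a small sketch that applies the standard silhouette definition to the same embeddings with two different labelings. The function names and toy coordinates are assumptions; Euclidean distance is used, and spots in singleton groups are given a width of 0.

```python
import math


def euclid(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def average_silhouette_width(embeddings, labels):
    """Mean of (b - a) / max(a, b) over all spots."""
    n = len(embeddings)
    groups = sorted(set(labels))
    widths = []
    for i in range(n):
        mean_dist = {}
        for g in groups:
            members = [j for j in range(n) if labels[j] == g and j != i]
            if members:
                mean_dist[g] = sum(euclid(embeddings[i], embeddings[j])
                                   for j in members) / len(members)
        own = labels[i]
        others = [d for g, d in mean_dist.items() if g != own]
        if own not in mean_dist or not others:
            widths.append(0.0)  # singleton group or only one group present
            continue
        a, b = mean_dist[own], min(others)
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(widths) / n


# Two domain types far apart; each contains one spot from each batch.
emb = [(0, 0), (0, 1), (10, 0), (10, 1)]
asw_type = average_silhouette_width(emb, ["A", "A", "B", "B"])
asw_batch = average_silhouette_width(emb, [0, 1, 0, 1])
```

Here the domain labeling yields a high width and the batch labeling a negative one, which is the pattern expected after a successful alignment.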

Adjusted Rand Index (ARI). This metric assesses the spatial clustering results, with a higher value indicating more consistent clustering with the ground truth. Let $n$ represent the total number of spots, ${n}_{{ij}}$ the number of spots of type $i$ within cluster $j$, ${a}_{i}$ the total number of spots of type $i$, and ${b}_{j}$ the total number of spots within cluster $j$. Then ARI is calculated as:

$${{\rm{ARI}}} \,=\, \frac{{\sum }_{{ij}}\binom{{n}_{{ij}}}{2}-\left[{\sum }_{i}\binom{{a}_{i}}{2}{\sum }_{j}\binom{{b}_{j}}{2}\right]/\binom{n}{2}}{\frac{1}{2}\left[{\sum }_{i}\binom{{a}_{i}}{2}+{\sum }_{j}\binom{{b}_{j}}{2}\right]-\left[{\sum }_{i}\binom{{a}_{i}}{2}{\sum }_{j}\binom{{b}_{j}}{2}\right]/\binom{n}{2}}.$$

(47)
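The standard ARI definition can be checked on small label vectors with the sketch below. The function name and the toy labels are assumptions; the degenerate case in which the denominator vanishes is mapped to 1.0 by convention.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, cluster_labels):
    n = len(true_labels)
    n_ij = Counter(zip(true_labels, cluster_labels))  # contingency counts
    a = Counter(true_labels)     # spots per true type
    b = Counter(cluster_labels)  # spots per cluster
    sum_ij = sum(comb(v, 2) for v in n_ij.values())
    sum_a = sum(comb(v, 2) for v in a.values())
    sum_b = sum(comb(v, 2) for v in b.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single group on both sides
    return (sum_ij - expected) / (max_index - expected)


ari_perfect = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
ari_crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The score is invariant to relabeling of clusters and can become negative when the agreement is worse than expected by chance.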

Anomaly subtyping

Multi-type spatial grouping discrepancy (multi-SGD)

This metric assesses the consistency between the anomaly subtyping outcomes and the ground truth subdomain labels, taking into account the spatial relationships among spots. For $\kappa > 2$ subtypes, we adopt the One-vs-Rest methodology to calculate an SGD score for each subtype, as detailed in the “Anomalous tissue domain detection” section. This involves mapping annotations generated by the subtyping method to the ground truth annotations using the COIN-OR Branch and Cut solver⁴⁹ to solve the following mixed-integer programming problem:

$${J}_{t}\left({y}_{i,j}\right) \,=\,{\min }_{{y}_{i,j}}{\sum }_{i=1}^{\kappa }{\sum }_{j=1}^{\kappa }{y}_{i,j}\frac{{N}_{i}}{{N}_{{total}}}{{\rm{SGD}}}\left({p}_{i,{true} \,,\,t}{{\rm{||}}}{p}_{j,{sub},\, t}\right),$$

(48)

$${{\rm{s.t.}}}\;\;{\sum }_{i=1}^{\kappa }{y}_{i,j}=1,\;{\sum }_{j=1}^{\kappa }{y}_{i,j}=1,\;{y}_{i,j}\in \left\{0,1\right\},$$

(49)

where $t\in \left\{{{\rm{degree}}},{{\rm{cluster\; coefficient}}}\right\}.$ Here, ${y}_{i,j}=1$ indicates that the $i$-th anomalous subtype is mapped to the $j$-th subtyping annotation, ${N}_{i}$ denotes the number of spots belonging to subtype $i$, and ${N}_{{total}}$ the total number of anomalous spots. ${{\rm{SGD}}}\big({p}_{i,{true},t}{||}{p}_{j,{sub},t}\big)$ represents the subtype-specific SGD score for the $i$-th true subtype when mapped to the $j$-th generated annotation (refer to equation 43). Finally, the multi-SGD for metric $t$ is determined as:

$${{{\rm{SGD}}}}_{{multi}}\left({p}_{{true},t}{{\rm{||}}}{p}_{{sub},t}\right)={J}_{t}\left({y}_{i,j}\right).$$

(50)
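For a small number of subtypes, the one-to-one mapping in Eqs. 48 to 50 can be found by enumerating all permutations. The sketch below does exactly that in place of the mixed-integer solver used in the paper; the function name `multi_sgd` and the toy SGD matrix are assumptions.

```python
from itertools import permutations


def multi_sgd(sgd_matrix, subtype_sizes):
    """sgd_matrix[i][j]: SGD of true subtype i against predicted cluster j.
    Returns the minimal size-weighted cost and the mapping that attains it."""
    kappa = len(sgd_matrix)
    n_total = sum(subtype_sizes)
    best_cost, best_map = float("inf"), None
    for perm in permutations(range(kappa)):
        # perm[i] is the predicted cluster assigned to true subtype i.
        cost = sum(subtype_sizes[i] / n_total * sgd_matrix[i][perm[i]]
                   for i in range(kappa))
        if cost < best_cost:
            best_cost, best_map = cost, perm
    return best_cost, best_map


sgd = [[0.9, 0.1, 0.8],
       [0.2, 0.7, 0.9],
       [0.8, 0.9, 0.3]]
sizes = [10, 20, 10]
cost, mapping = multi_sgd(sgd, sizes)
```

Enumeration scales factorially with $\kappa$, so a dedicated assignment or integer-programming solver is preferable beyond a handful of subtypes.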

Normalized Mutual Information (NMI). This nonnegative metric evaluates the consistency between the clustering results and the ground truth, with a higher value indicating more accurate clustering. NMI is defined as:

$${{\rm{NMI}}}=\frac{2{\sum }_{{ij}}\frac{{n}_{{ij}}}{n}\log \left(\frac{n\times {n}_{{ij}}}{{a}_{i}\times {b}_{j}}\right)}{{\sum }_{i}\frac{{a}_{i}}{n}\log \left(\frac{n}{{a}_{i}}\right)+{\sum }_{j}\frac{{b}_{j}}{n}\log \left(\frac{n}{{b}_{j}}\right)},$$

(51)

where ${n}_{{ij}}$ represents the number of true positives of anomaly type $i$ within cluster $j$, ${a}_{i}$ the total number of true positives of anomaly type $i$, ${b}_{j}$ the total number of true positives within cluster $j$, and $n$ the total number of true positives.
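Equation 51 can be evaluated directly from label vectors, as in the sketch below. The function name and toy labels are assumptions, and the case of zero total entropy (a single type and a single cluster) is mapped to 1.0 by convention.

```python
import math
from collections import Counter


def normalized_mutual_information(true_labels, cluster_labels):
    n = len(true_labels)
    n_ij = Counter(zip(true_labels, cluster_labels))
    a = Counter(true_labels)
    b = Counter(cluster_labels)
    mutual = sum(v / n * math.log(n * v / (a[i] * b[j]))
                 for (i, j), v in n_ij.items())
    h_true = sum(v / n * math.log(n / v) for v in a.values())
    h_clust = sum(v / n * math.log(n / v) for v in b.values())
    if h_true + h_clust == 0:
        return 1.0  # only one type and one cluster
    return 2.0 * mutual / (h_true + h_clust)


nmi_perfect = normalized_mutual_information([0, 0, 1, 1], [1, 1, 0, 0])
nmi_independent = normalized_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

The score equals 1 when clusters reproduce the true subtypes up to relabeling and 0 when the two labelings are statistically independent.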

Benchmark methods

Benchmark overview

Anomalous tissue domain detection

For benchmarks of the ATD detection subtask, we select five supervised methods (Spatial-ID, scPred, CHETAH, scmap, and CAMLU) and two unsupervised methods (SCEVAN and CopyCAT). All methods except Spatial-ID were originally designed for detecting anomalous single cells in scRNA-seq data. Each supervised method trains a classifier on an annotated reference scRNA-seq dataset, which is then applied to classify target spots into known types. Specifically, Spatial-ID employs a deep neural network (DNN) pretrained on the reference scRNA-seq dataset. During inference, it uses a variational graph autoencoder (VGAE) to yield spot embeddings, which are then fed into the DNN classifier to generate spot pseudo-labels. Meanwhile, it trains another self-supervised DNN classifier to predict spot types against their pseudo-labels, identifying spots with a maximum type assignment probability below a threshold as anomalies. scPred trains a support vector machine (SVM) classifier on the annotated reference using the most informative principal features selected via a Wilcoxon signed-rank test. This classifier is used to classify target cells into known types and identify those with a maximum assignment probability below a threshold as anomalies. CHETAH builds a hierarchical classification tree from reference data, computing cell-type-specific gene expression profiles at each tree node. During inference, target cells are classified by traversing the tree, with the traversal path determined by their correlations (i.e., confidence scores) with gene expression profiles at intermediate tree nodes. Target cells with a confidence score below a threshold at the root node are deemed anomalous. scmap calculates gene expression profile similarities between target cells and cell type centroids in the reference dataset, assigning each cell to the type with the highest similarity. Target cells whose highest similarity score falls below a threshold are identified as anomalous.
The aforementioned supervised methods identify target cells with low assignment confidence as anomalies, which increases the risk of false positives because normal cells with uncertain assignments can be confused with genuine anomalies. Conversely, CAMLU is a reconstruction-based method that sidesteps the requirement for an annotated reference. It trains an autoencoder to reconstruct genes in the reference dataset, which is then applied to the target dataset, selecting genes that show the largest discrepancies between the reconstruction errors of the reference and target datasets as discriminative features. These genes are used in hierarchical clustering to categorize target cells as normal or anomalous.

Unlike the supervised benchmarks, the two unsupervised benchmarks, SCEVAN and CopyCAT, are directly applied to the target ST datasets. Both first identify a set of highly confident normal cells to serve as a gene copy number baseline. Next, the gene copy number profiles of target cells are estimated from the baseline using a joint segmentation algorithm in SCEVAN and a Poisson-Gamma model in CopyCAT. Using these profiles, both methods perform hierarchical clustering to group cells into clusters, identifying those significantly enriched in predefined normal cells as normal and the others as anomalous.

Multi-sample ST data alignment

Benchmarks for the ATD alignment subtask include two well-established batch correction methods for scRNA-seq, ComBat and Harmony, alongside two recent methods for ST, GraphST and STAligner. ComBat utilizes a Bayesian framework with empirical priors to estimate and correct for both additive and multiplicative batch effects across samples. Harmony clusters cells in a low-dimensional embedded space, maximizing intra-cluster batch diversity, and then applies linear batch correction using the cluster centroids. GraphST is designed for aligning spatially adjacent ST datasets as it relies on PASTE⁵⁰ to align histological images of adjacent datasets to acquire consensus spatial coordinates. It corrects batch effects by constructing a shared neighborhood graph that connects spatially adjacent spots across samples, reducing cross-sample batch variations through node feature smoothing in a self-supervised contrastive graph learning. STAligner integrates multiple ST datasets, whether spatially adjacent or not, into a single graph to yield spot embeddings using GAT, based on which positive and negative spot pairs are identified. Then batch variations are corrected by iteratively optimizing between spot embeddings and a contrastive learning triplet loss computed on the positive and negative pairs.

Anomaly subtyping

Three widely used spatial clustering methods, GraphST, STAGATE, and iStar, are chosen to benchmark ATD subtyping. GraphST models spatial gene expression using a graph, which is further augmented with a locally corrupted graph. Next, it employs a GCN to conduct self-supervised contrastive learning between the two graphs, yielding spatial spot embeddings for spatial clustering. iStar uses a hierarchical vision transformer (HViT) pretrained on public histology images to yield both local and global image features for predicting super-resolution gene expression using a weakly supervised feed-forward neural network (FFN). K-means clustering is then performed using gene embeddings encoded by the penultimate layer of the FFN to cluster spots into different subtypes. STAGATE also models spatial gene expression using a graph and generates low-dimensional spot embeddings using a reconstruction-based GAT, which serve as input to an off-the-shelf clustering algorithm for subtyping.

Benchmark implementations

Anomalous tissue domain detection

This series of experiments spans three scenarios: detecting ATDs from a single target dataset, from multiple target datasets, and cross-referencing scRNA-seq data. In the first (Exp ID = 1 and 4), second (Exp ID = 3, 5, 6, 7, and 8) and third (Exp ID = 2) scenarios, five supervised methods, including Spatial-ID, scPred, CHETAH, scmap, and CAMLU, are trained on the reference ST dataset, treating spatial spots as single cells and domain types as cell types. These methods are then applied to the target datasets: scPred, CHETAH, scmap, and Spatial-ID compute an assignment confidence score for each target spot, labeling those below an implicitly specified threshold as anomalous. CAMLU, on the other hand, reconstructs the target datasets, selecting the top 500 feature genes exhibiting significant discrepancies in reconstruction errors compared to the reference dataset. Using these feature genes, target spots are clustered into normal and anomalous groups via hierarchical clustering. During inference, all methods except Spatial-ID treat target spots as single cells. Spatial-ID accounts for spatial relationships among target spots by incorporating their adjacency matrix during spot embedding generation. Additionally, in the third scenario, two unsupervised methods, i.e., SCEVAN and CopyCAT, are directly applied to target ST datasets, treating target spots as single cells. Both methods perform hierarchical clustering to group spots into clusters based on their estimated gene copy number profiles. Clusters highly enriched in predefined highly confident normal spots in the enrichment analysis (P-value $\le$ 0.05) are identified as normal and the others as anomalous. In all experiments, benchmark methods adopt the default hyperparameter values, such as anomaly score thresholds, clustering algorithm parameters, and significance levels for statistical tests.

Multi-sample ST data alignment

Experiments for this task involve three ST datasets, either vertical or nonadjacent. The benchmarks include two methods for aligning scRNA-seq data, ComBat and Harmony, and two for aligning ST datasets, GraphST and STAligner. ComBat and Harmony are directly applied to the ST datasets, treating spatial spots as single cells and disregarding their spatial relationships. In experiments involving vertical datasets, GraphST first uses PASTE to obtain consensus spatial coordinates, with which spatial spots across datasets are positioned in a common tissue space for alignment. In experiments involving nonadjacent datasets (Exp ID = 9, 10, 12, and 14), whose spatial coordinates are unalignable, GraphST uses consensus original spatial coordinates in the alignment process as a compromise. Conversely, STAligner can handle both vertical and nonadjacent ST datasets through contrastive learning with positive and negative pairs of anchor spots. All benchmarks are evaluated with parameter settings recommended by the original studies.

Anomaly subtyping

Given that the performance of ATD subtyping heavily depends on the quality of detected and aligned anomalous spots, we use composite methods, each comprising methods specifically designed for the individual tasks, as benchmarks. Experiments for this subtask involve either single or multiple target datasets.

In the first scenario (Exp ID = 18-21), six composite benchmarks are constructed, including CAMLU-GraphST, scPred-GraphST, CHETAH-GraphST, scmap-GraphST, SpatialID-iStar, and SpatialID-STAGATE. The first method in each composite name identifies ATDs, while the second clusters them into subtypes. Since this scenario involves only a single target dataset, a data alignment method is unnecessary. The subtyping methods vary in implementation: GraphST and STAGATE convert target spots into embeddings and employ their built-in clustering algorithms to group the embedded anomalous spots identified by the first method into subtypes. iStar is excluded from experiments (Exp ID = 17 and 20) involving Slide-seqV2 and Stereo-seq datasets, both of which lack histology images. In other experiments, image patches covering the identified anomalous spots are segmented and converted into visual features using a pretrained visual feature extractor. iStar is trained to predict the gene expression profile at each target spot using a DNN, with the penultimate layer’s outputs serving as the input spots’ embeddings. iStar’s k-means clustering algorithm then uses the embeddings of anomalous spots to group them into subtypes.

The second scenario (Exp ID = 15-17 and 22-24) involves multiple target datasets, which necessitates data alignment methods. To meet this requirement, GraphST is added to CAMLU-GraphST, scPred-GraphST, CHETAH-GraphST, and scmap-GraphST; ComBat to SpatialID-iStar; and STAligner to SpatialID-STAGATE. GraphST and STAligner align target datasets, generating post-alignment spot embeddings. The built-in clustering algorithms of GraphST and STAGATE use post-alignment embeddings of identified anomalous spots to cluster them into subtype groups. Since iStar trains the gene expression predictor using original gene expression data, ComBat, which preserves the original data scale post-alignment, is combined with SpatialID-iStar. This composite method is only used in experiments wherein datasets are associated with histology data (Exp ID = 15, 16, 22, and 24). Specifically, Spatial-ID identifies anomalous target spots, iStar extracts visual features from the histology image associated with each target dataset, and ComBat aligns target datasets. Subsequently, iStar is trained on the target spots to predict their post-alignment gene expression using the extracted visual features, while generating spot embeddings from the penultimate layer of the predictor. Finally, iStar’s built-in clustering algorithm clusters the identified anomalous spots into subtypes using their spot embeddings. All component methods in the benchmarks adopt their default hyperparameter settings, and the true number of clusters is assumed to be known.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

All experimental datasets involved in this paper are available from their respective original sources: The 10x-Visium datasets of healthy human breast tissues (10x-hNB datasets) are available at GSE195665; The 10x-Visium datasets of human breast cancer tissues (10x-hBC datasets) are available at GitHub [https://github.com/almaan/her2st/tree/master]; The 10x-Visium datasets of human primary sclerosing cholangitis tissue (10x-hPSC datasets) and human liver caudate lobe tissue are available at GSE243977; The 10x-Visium datasets of human renal cell cancer tissue (10x-hRCC datasets) are available at GSE175540; The scRNA-seq dataset of human pancreatic ductal (sc-hPD) and 10x-Visium datasets of the human pancreatic ductal adenocarcinomas (10x-hPDAC) are available at GSE111672; The slide-seqV2 datasets of mouse embryo tissues (ssq-mEmb datasets) are available at GSE197353; The Stereo-seq datasets of mouse embryo tissue (Stereo-mEmb datasets) are available at CNP0001543. Source data are provided with this paper.

Code availability

An open-source Python implementation of the STANDS package, along with the documentation and a tutorial for running it, is available at GitHub [https://github.com/Catchxu/STANDS]. We have obtained a DOI for our GitHub repository by linking it to Zenodo⁵¹. The Zenodo DOI is: https://doi.org/10.5281/zenodo.13375297.

References

Maus, R. L. et al. Resolving the heterogeneous tumor-centric cellular neighborhood through multiplexed, spatial paracrine interactions in the setting of immune checkpoint blockade. Cancer Res. Commun. 2, 78–89 (2022).
Arora, R. et al. Spatial transcriptomics reveals distinct and conserved tumor core and edge architectures that predict survival and targeted therapy response. Nat. Commun. 14, 5029 (2023).
Shinagare, A. B., Krajewski, K. M., Braschi-Amirfarzan, M. & Ramaiya, N. H. Advanced renal cell carcinoma: role of the radiologist in the era of precision medicine. Radiology 284, 333–351 (2017).
Yang, J., Chen, H., Yan, J., Chen, X. & Yao, J. in International Conference on Learning Representations (2021).
Stepec, D. & Skocaj, D. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 3785-3792 (2021).
Hu, J. et al. SpaGCN: Integrating gene expression, spatial location and histology to identify spatial domains and spatially variable genes by graph convolutional network. Nat. Methods 18, 1342–1351 (2021).
Dries, R. et al. Giotto: a toolbox for integrative analysis and visualization of spatial expression data. Genome Biol. 22, 78 (2021).
Zhao, E. et al. Spatial transcriptomics at subspot resolution with BayesSpace. Nat. Biotechnol. 39, 1375–1384 (2021).
Zhang, D. et al. Inferring super-resolution tissue architecture by integrating spatial transcriptomics with histology. Nat. Biotechnol. (2024).
Dong, K. & Zhang, S. Deciphering spatial domains from spatially resolved transcriptomics with an adaptive graph attention auto-encoder. Nat. Commun. 13, 1739 (2022).
Long, Y. et al. Spatially informed clustering, integration, and deconvolution of spatial transcriptomics with GraphST. Nat. Commun. 14, 1155 (2023).
Hu, J. et al. Deciphering tumor ecosystems at super resolution from spatial transcriptomics with TESLA. Cell Syst. 14, 404–417. e404 (2023).
Kleshchevnikov, V. et al. Cell2location maps fine-grained cell types in spatial transcriptomics. Nat. Biotechnol. 40, 661–671 (2022).
Shen, R. et al. Spatial-ID: a cell typing method for spatially resolved transcriptomics via transfer learning and spatial embedding. Nat. Commun. 13, 7640 (2022).
Li, Z., Wang, Y., Ganan-Gomez, I., Colla, S. & Do, K. A. A machine learning-based method for automatically identifying novel cells in annotating single-cell RNA-seq data. Bioinformatics 38, 4885–4892 (2022).
Sprang, M., Andrade-Navarro, M. A. & Fontaine, J.-F. Batch effect detection and correction in RNA-seq data using machine-learning-based automated assessment of quality. BMC Bioinforma. 23, 1–15 (2022).
Xu, K. et al. Domain adaptive and fine-grained anomaly detection for single-cell sequencing data and beyond. In Proceedings of the International Joint Conference on Artificial Intelligence. 6125–6133 (2024).
Yang, Y. et al. SMNN: batch effect correction for single-cell RNA-seq data via supervised mutual nearest neighbor detection. Brief. Bioinforma. 22, bbaa097 (2021).
Zhou, X., Dong, K. & Zhang, S. Integrating spatial transcriptomics data across different conditions, technologies and developmental stages. Nat. Comput. Sci. 1–13 (2023).
HuBMAP Consortium. The human body at cellular resolution: the NIH Human Biomolecular Atlas Program. Nature 574, 187–192 (2019).
Xu, Z. et al. STOmicsDB: a comprehensive database for spatial transcriptomics data sharing, analysis and visualization. Nucleic Acids Res. 52, D1053–D1061 (2024).
Bachmann, R., Mizrahi, D., Atanov, A. & Zamir, A. in European Conference on Computer Vision 348-367 (Springer, 2022).
Zenati, H., Foo, C. S., Lecouat, B., Manek, G. & Chandrasekhar, V. R. Efficient gan-based anomaly detection. arXiv preprint arXiv:1802.06222 (2018).
Zhu, J.-Y., Park, T., Isola, P. & Efros, A. A. in Proceedings of the IEEE International Conference on Computer Vision 2223-2232 (2017).
Li, X. et al. Deep learning enables accurate clustering with batch effect removal in single-cell RNA-seq analysis. Nat. Commun. 11, 2338 (2020).
Alquicira-Hernandez, J., Sathe, A., Ji, H. P., Nguyen, Q. & Powell, J. E. scPred: accurate supervised method for cell-type classification from single-cell RNA-seq data. Genome Biol. 20, 1–17 (2019).
De Kanter, J. K., Lijnzaad, P., Candelli, T., Margaritis, T. & Holstege, F. C. CHETAH: a selective, hierarchical cell type identification method for single-cell RNA sequencing. Nucleic Acids Res. 47, e95–e95 (2019).
Kiselev, V. Y., Yiu, A. & Hemberg, M. scmap: projection of single-cell RNA-seq data across data sets. Nat. Methods 15, 359–362 (2018).
Majidzadeh-A, K., Esmaeili, R. & Abdoli, N. TFRC and ACTB as the best reference genes to quantify Urokinase Plasminogen Activator in breast cancer. BMC Res. Notes 4, 1–7 (2011).
Zhang, X. et al. Thymosin beta 10 is a key regulator of tumorigenesis and metastasis and a novel serum marker in breast cancer. Breast Cancer Res. 19, 1–15 (2017).
De Falco, A., Caruso, F., Su, X.-D., Iavarone, A. & Ceccarelli, M. A variational algorithm to detect the clonal copy number substructure of tumors from scRNA-seq data. Nat. Commun. 14, 1074 (2023).
Gao, R. et al. Delineating copy number and clonal substructure in human tumors from single-cell transcriptomes. Nat. Biotechnol. 39, 599–608 (2021).
Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods 16, 1289–1296 (2019).
Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–127 (2007).
STAligner enables the integration and alignment of multiple spatial transcriptomics datasets. Nat. Comput. Sci. 3, 831–832 (2023).
Eguchi, J. et al. Interferon regulatory factors are transcriptional regulators of adipogenesis. Cell Metab. 7, 86–94 (2008).
Cronin, S. J., Woolf, C. J., Weiss, G. & Penninger, J. M. The role of iron regulation in immunometabolism and immune-related disease. Front. Mol. Biosci. 6, 116 (2019).
Haghverdi, L., Lun, A. T. L., Morgan, M. D. & Marioni, J. C. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36, 421–427 (2018).
Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V. & Courville, A. C. Improved training of wasserstein gans. Advances in Neural Information Processing Systems. 30, 5769–5779 (2017).
Pang, M., Su, K. & Li, M. Leveraging information in spatial transcriptomics to predict super-resolution gene expression from histology images in tumors. bioRxiv 2021.11.28.470212 (2021).
Brody, S., Alon, U. & Yahav, E. in International Conference on Learning Representations (2021).
Xie, J., Girshick, R. & Farhadi, A. in International Conference on Machine Learning 478-487 (PMLR, 2016).
He, K., Zhang, X., Ren, S. & Sun, J. in Proceedings of the IEEE conference on computer vision and pattern recognition 770–778 (2016).
Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 1–5 (2018).
Svensson, V., Teichmann, S. A. & Stegle, O. SpatialDE: identification of spatially variable genes. Nat. Methods 15, 343–346 (2018).
Deng, T. et al. A cofunctional grouping-based approach for non-redundant feature gene selection in unannotated single-cell RNA-seq analysis. Brief. Bioinforma. 24, bbad042 (2023).
You, J., Ying, R., Ren, X., Hamilton, W. & Leskovec, J. in International Conference on Machine Learning 5708-5717 (PMLR, 2018).
Yu, X., Xu, X., Zhang, J. & Li, X. Batch alignment of single-cell transcriptomics data using deep metric learning. Nat. Commun. 14, 960 (2023).
Forrest, J. & Lougee-Heimer, R. in Emerging theory, methods, and applications 257-277 (INFORMS, 2005).
Zeira, R., Land, M., Strzalkowski, A. & Raphael, B. J. Alignment and integration of spatial transcriptomics data. Nat. Methods 19, 567–575 (2022).
Xu, K. et al. Detecting anomalous anatomic regions in spatial transcriptomics with STANDS. Zenodo, https://doi.org/10.5281/zenodo.13375297 (2024).


Acknowledgements

This project was funded by the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDB38050100) to H.W. X.S. was supported by a startup grant from Zhongnan University of Economics and Law.

Author information

These authors contributed equally: Kaichen Xu, Xiaobo Sun.

Authors and Affiliations

School of Statistics and Mathematics, Zhongnan University of Economics and Law, Wuhan, 430073, China
Kaichen Xu, Yan Lu, Yihang Du, Mengqian Huang & Xiaobo Sun
School of Information Engineering, Zhongnan University of Economics and Law, Wuhan, 430073, China
Suyang Hou
Information Hub, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, 510000, China
Kainan Liu
Department of Population and Quantitative Health Sciences, Case Western Reserve University, Cleveland, OH, 44106, USA
Hao Feng
Faculty of Computer Science and Control Engineering, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China
Hao Wu
Key Laboratory of Biomedical Imaging Science and System, Chinese Academy of Sciences, Shenzhen, 518055, China
Hao Wu

Authors

Kaichen Xu
Yan Lu
Suyang Hou
Kainan Liu
Yihang Du
Mengqian Huang
Hao Feng
Hao Wu
Xiaobo Sun

Contributions

X.S. conceived the idea and supervised the study. X.S. and K.X. designed the methodology and wrote the manuscript. X.S., K.X., Y.L., K.L., S.H., and Y.D. implemented the methods and conducted the experiments. K.X., X.S., and Y.L. summarized the results. K.X., M.H., and Y.L. collected the datasets required for the experiments. H.W. and H.F. helped revise the manuscript. All authors have read and approved the final manuscript. Additionally, X.S. and K.X. contributed equally; Y.L. and S.H. contributed equally.

Corresponding author

Correspondence to Xiaobo Sun.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks Juexin Wang and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Peer Review File

Reporting Summary

Source data

Source Data

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Xu, K., Lu, Y., Hou, S. et al. Detecting anomalous anatomic regions in spatial transcriptomics with STANDS. Nat Commun 15, 8223 (2024). https://doi.org/10.1038/s41467-024-52445-9

Received: 04 January 2024
Accepted: 06 September 2024
Published: 19 September 2024
DOI: https://doi.org/10.1038/s41467-024-52445-9
