Introduction

In recent years, single-cell spatial transcriptomics technologies have advanced significantly, providing powerful tools for studying transcript localization and cellular processes at high resolution and scale1,2. Despite their vast potential, analyzing the resulting data presents several complex computational challenges. One of the most significant yet unsolved obstacles is cell segmentation, which is critical for accurately delineating individual cells within a tissue sample in spatial data. It remains challenging for several reasons. A primary difficulty is that some tissues contain densely packed cells with unclear boundaries, making accurate segmentation difficult. For instance, some cells in tumor tissues and the ileum3 have almost no gaps between them, a very different scenario from the cortex, where cells are relatively sparsely distributed. Traditional segmentation approaches, such as Watershed4, have struggled to detect precise cell boundaries due to the complexity of the tissue microenvironment and of spatial transcriptomics data. Pioneering deep learning approaches, such as Cellpose5 and JSTA6, have leveraged advanced statistical strategies and deep learning techniques that have proven effective for cell segmentation using nuclei staining. However, cell staining imaging, including DAPI, poly-A, and membrane staining, exhibits imbalanced image signals among different cell types7. Additionally, this strategy cannot capture transcript spatial patterns or their colocalization, missing out on valuable insights into subcellular compartments and structures8. In light of this, some methods, such as ClusterMap9 and Baysor3, have sought to leverage the spatial distributions of transcripts for cell segmentation in spatial transcriptomics data. However, it remains challenging for these statistical methods to efficiently learn the latent representation of transcript colocalization relationships within such high-dimensional spatial data. An innovative approach taken by SCS applies transformer-based deep learning models to integrated imaging and transcript data to enhance cell segmentation accuracy. Nevertheless, one of the core steps of SCS relies on identifying cell nuclei from nuclei staining, which can result in incomplete coverage of cells in certain tissues10. Consequently, this strategy may lead to relatively under-sampled segmentation outcomes.

Graphical models, such as graph neural networks (GNNs) and Markov random fields, have been applied in spatial transcriptomics data for various tasks, such as deconvolution and cell-cell interaction inference3,11,12,13. These models excel at capturing the local neighborhood relationships between cells, preserving both spatial and structural context, alongside gene expression information within tissues11. Spatial transcriptomics data is inherently multi-modal, encompassing gene expression, spatial localization, and morphological information. Consequently, approaches that leverage joint learning of shared embeddings across modalities, or employ transfer learning, have gained significant traction in offering a more holistic understanding of this complex data. These techniques have proven valuable in a variety of applications, such as cell type deconvolution and the identification of disease biomarkers11,14.

Given these challenges and recent advances in computational methods, we embarked on a comprehensive exploration of the multi-modal data in single-cell spatial transcriptomics. We found significant loss of transcripts during segmentation in strategies that rely solely on staining images, and highlighted gene patterns from transcript colocalization data that are indicative of cell types and boundaries. To tackle these segmentation challenges and build upon our observations of transcript colocalization, we introduce a computational approach, named Bering, that utilizes a GNN to harness transcript colocalization relationships for cell-type annotation. Notably, the learned transcript representations are transferred to the segmentation task as a component of the multi-modal learning input, circumventing the limitations of single-modal learning. Innovatively, we formulated segmentation as an edge prediction task to fully leverage transcript colocalization relationships and achieve a finer level of segmentation than conventional pixel-level methods, such as Watershed4 and Cellpose5. We have successfully applied this method to various tissue types and technologies, whether image-free or image-dependent, and demonstrated its superior performance in accurately identifying cells in 2D and 3D thick tissues. Additionally, we demonstrate the potential for broader application of our approach by successfully transferring the pre-trained model to a new dataset, achieving accurate cell segmentation through self-distillation. Taken together, Bering is a highly modular and versatile tool for joint cell segmentation and annotation in spatial transcriptomics across tissues and platforms.

Results

Spatial transcriptomics data encode spatial and subcellular distribution of transcripts for segmentation analysis

Multiple types of staining images, such as DAPI, poly-A, and membrane staining, have been generated across spatial datasets and technologies for cellular morphological detection and cell segmentation (Fig. 1a and Supplementary Fig. 1). Details of datasets used in this paper are provided in Supplementary Data 1. Among them, DAPI is the most widely used staining image for cell segmentation. However, spatially detected transcripts do not always perfectly overlap with DAPI signals, with coverage varying between 30% and 70% across different samples and datasets (Supplementary Fig. 2), which can potentially result in major information loss during segmentation. While membrane staining can provide rich information for segmentation3, its inadequate and imbalanced imaging signals across different cell types could cause biased segmentation and loss of information (Supplementary Fig. 3).

Fig. 1: Overview of Bering model for cell segmentation.

a A schematic illustrating image-based spatial transcriptomics approaches. Multiple slices in the z-axis are generated and spots from nuclei and cytoplasm are detected in each slice. Additionally, staining images, including the nuclei image, are captured. The microscope image was generated by BioRender. b The concept of Neighborhood Gene Components (NGCs) is introduced for transcripts in image-based spatial transcriptomics data. NGCs are defined as count matrices, where each value in the matrix corresponds to the number of detected transcripts for each gene in the spatial neighborhood of the query transcript. c Overview of the Bering model. The workflow of the Bering model progresses through input, feature extraction, representation learning, training, and output. The input consists of spatial coordinates and gene identities of transcript spots, along with coarse cell-type labels and cell assignments; aligned stain images can optionally enhance the input. During feature extraction, spot features, including neighborhood gene components (NGCs), and edge (meaning spot pair) features, including pairwise distance matrices and cropped images of spot pairs (edges), are derived. In the representation learning stage, graph convolutional neural networks (GCNs) process NGC features to generate spot representations, while distance matrices and cropped edge images are used to create edge representations via radial basis function (RBF) kernels and convolutional neural networks (CNNs), respectively. For each spot pair, spot representations from GCNs are concatenated and combined with distance- and image-based edge representations (via tensor concatenation) to form unified edge representations. In training, spot classifiers use fully connected layers with cross-entropy loss to predict cell classes for individual spots, while edge classifiers distinguish intracellular (positive) from intercellular edges using binary cross-entropy (BCE) loss. Once trained, the model’s weights are fixed, enabling predictions on new data. The output integrates edge labels via Leiden clustering to infer cell assignments, while node classifier predictions provide cell class labels, assigning cell classes and IDs to individual spots in an ensemble approach (“Methods”). The abbreviation “Repr.” means “Representations”. Illustration in (a) created in BioRender (Jin, K. (2025) https://BioRender.com/yd4shj8).

To gain a holistic understanding of the single-cell spatial omics data, we delved into transcript profiles of non-small cell lung cancer (NSCLC) CosMx data15 and revealed their patterns of cell compartments and subcellular structures16 (Supplementary Figs. 4 and 5). We utilized a factor analysis model17 in NSCLC and identified three distinct subcellular gene patterns in tumor cells, including nuclear genes (factor 2) and peripheral genes (factor 3) (Supplementary Fig. 4a–c). Nucleus-specific genes, such as MALAT1 and NEAT1, exhibit a high enrichment within the nuclear region. In contrast, genes involved in kinase phosphatase activity, such as DUSP5, exhibit a notable enrichment within the cytoplasm of cells (Supplementary Fig. 4d, e), providing compelling evidence of a subcellular pattern as indicated by the spatial distribution of transcripts.

To gain deeper insights into transcript physical colocalization, we constructed neighborhood gene components (NGCs) by grouping nearby transcripts (Fig. 1b) in subcellular-resolution NSCLC spatial transcriptomics data. Analyzing the latent representation of NGCs revealed different distributions among various cell types on Uniform Manifold Approximation and Projection (UMAP) (Supplementary Fig. 5a). In addition, NGCs from various subcellular compartments, such as the nucleus, cytosol, and membrane, within the same cell type, also exhibited different distributions on UMAP (Supplementary Fig. 5b).

Bering overview

To effectively tackle the challenges mentioned above and fully capitalize on the information embedded in transcript distributions within spatial transcriptomics data, we have developed Bering, an approach that combines GNN and transfer learning for joint cell segmentation and annotation (Fig. 1c and Supplementary Fig. 6).

Firstly, in the cell classification task, we constructed colocalization graphs using NGCs as the input features for spots (transcripts) (Fig. 1c and Supplementary Fig. 6b). A Graph Convolutional Network (GCN) was then applied to learn latent spot representations, followed by fully connected layers (FCN) to predict the cell types of individual transcripts (Fig. 1c and Supplementary Fig. 6b–d, “Methods”).

For the cell segmentation task, we proposed a strategy to predict two types of edges (meaning spot pairs) between physically close transcripts: intracellular and intercellular edges (Fig. 1c and Supplementary Fig. 6j, k). Using Leiden clustering, we distinguished individual cells, with intracellular edges linking transcripts within the same cell and intercellular edges delineating transcripts across different cells (Fig. 1c). To fully capture the edge features from spatial transcriptomics data, we incorporated multiple inputs for edge representation learning, including stain images, spot-pair distances, and concatenated spot representations (Fig. 1c and Supplementary Fig. 6e–g). These edge representations have the potential to capture essential information, including various cell boundaries and sizes (Supplementary Fig. 7), spatial distances between transcripts, and transcript composition across different cell types and subcellular compartments (“Methods”).

For the imaging-based latent feature representation learning, we cropped the rectangular region along each edge and used it as input to a convolutional neural network (CNN) to learn image-based patterns of the edges. To handle varying image shapes, we applied spatial pyramid pooling (SPP), ensuring the embeddings were transformed into a consistent format (Fig. 1c and Supplementary Fig. 6g, i). In addition, we calculated the Euclidean distances between transcripts and utilized learnable Radial Basis Function (RBF) kernels to model the distribution patterns of these distances (Fig. 1c and Supplementary Fig. 6f, h). Furthermore, the latent spot representations learned from neighboring gene components captured valuable information of subcellular colocalization patterns in various cell types. To enhance edge representation, we concatenated the representations of the two connected spots from the cell classification model (Fig. 1c and Supplementary Fig. 6e).

Finally, we combined three types of edge representations—image patterns, distance distributions, and node embeddings—through tensor concatenation and then trained FCN to predict binary edge classes (Fig. 1c and Supplementary Fig. 6j, k). For cell segmentation, we used Leiden clustering on the graph constructed by intracellular edges, which produced the segmentation output. We then integrated the cell type classification results with the segmentation, delivering annotated single cells for downstream analysis (Fig. 1c and Supplementary Fig. 6l, “Methods”).

Ablation study of the Bering model

The Bering model for cell classification and segmentation incorporates various components, including graph models, RBF kernels, and image embeddings learned from CNNs. To assess the contribution of each module and understand the model’s capabilities, we conducted ablation studies (Supplementary Fig. 8a) and evaluated cell classification and segmentation using quantitative metrics: accuracy, macro F1 score, macro precision, and macro recall for the classification task, and adjusted mutual information (AMI) for the segmentation task. The results revealed that the GCN achieved better cell classification performance than fully connected layers (FCNs), with significant increases in accuracy, F1 score, and precision of approximately 10%, 5%, and 5%, respectively (Supplementary Fig. 8b). Importantly, these enhancements were achieved without compromising cell segmentation performance, which remained comparable between the two models (Supplementary Fig. 8c). Additionally, the inclusion of RBF distance kernels led to significant improvements in segmentation performance (Supplementary Fig. 8e), and the additional imaging representation further improved segmentation performance. Similarly, these enhancements in segmentation did not compromise cell classification performance (Supplementary Fig. 8d). Consequently, in practice, we implemented Bering with graph models and RBF kernels, and incorporated image embeddings when cell staining images were available.

Validating Bering performance of background noise and cell type prediction

Background noise poses a substantial challenge in some spatial technologies, as it lacks distinct boundaries from real signals, as exemplified in MERFISH and STARmap18,19. Bering addresses this issue by leveraging its GNN model to predict both background noise and real signals with cell-type annotations. While defining the ground truth for background signals is challenging, previous studies have shown that background noise tends to be farther from its neighboring transcripts than true foreground signals3. Following the approach of ref. 3, we measured the distances to neighboring transcripts for both background noise and true signals using Bering (Fig. 2a–c and Supplementary Fig. 9, “Methods”). Notably, Bering achieved more distinct distance distributions (p = 0.018, one-sided t-test) between real signals and background noise compared to the original paper (Fig. 2b, c and Supplementary Fig. 9).

Fig. 2: Performance of noise and cell type predictions on transcripts.

a Background noise prediction in the MERFISH cortex data using Bering for a specific field of view (FOV). The background noise annotated in the original paper of the data is shown on the left. b Distance distributions of molecules to their 16th nearest neighbor (x-axis) for the spots in (a) are shown. Fitted lines represent these distance distributions for spots predicted as background and foreground. c Jensen–Shannon divergence scores were computed to compare the distance distributions of background and foreground regions shown in (b) for individual FOVs (n = 15 biological replicates), as presented in the original paper and in the Bering prediction results. A one-sided t-test was performed, and the significance level is indicated at the top. The interpretation of box plots follows the same convention as in (e). d Cell type prediction in the 10× Xenium data of Ductal Carcinoma In Situ (DCIS) using different transcript-level annotation methods, including TACCO and Bering with and without graph models (top). The zoomed-in visualization of a particular section of the tissue is presented below. e We evaluated the performance of TACCO and Bering quantitatively on cell type classification across FOVs (n = 15 biological replicates) using four key metrics: accuracy, macro F1 score, macro precision, and macro recall. Statistical significance was determined using one-sided Wilcoxon rank-sum tests, with p-values corrected using the False Discovery Rate (FDR) method (Benjamini/Hochberg). Boxplots represent the distribution of each metric, with the box spanning from the first to the third quartile and the median indicated by a horizontal line. Whiskers extend to the most extreme values within 1.5 times the interquartile range (IQR) from the quartiles. Statistical significance between models is indicated above each comparison: p < 0.05 (*), p < 0.01 (**), p < 0.001 (***), and p < 0.0001 (****). Corrected p-values are shown on the top if p < 0.05. Source data are provided as a Source Data file.

Furthermore, we conducted a benchmark comparison of transcript-level cell type annotation using the state-of-the-art approach TACCO20. In the case of ductal carcinoma in situ (DCIS) Xenium data21, Bering’s predictions accurately identified cell labels and preserved detailed cell components in the microenvironment (Fig. 2d, Supplementary Fig. 10). Specifically, Bering successfully distinguished proliferative invasive tumor cells from other tumor cells in the niche, whereas TACCO failed to differentiate between these two types of tumor cells (Fig. 2d). Additionally, Bering captured more comprehensive immune cell distributions within the tumor microenvironment in NSCLC (Supplementary Fig. 10c). Importantly, Bering with graph models demonstrated fewer sporadic predictions and more consistent cell predictions compared to Bering without graph models, highlighting the advantages of information sharing in the neighborhood facilitated by graph models, which aligns with our initial hypothesis during model construction (Fig. 2d and Supplementary Fig. 10a). Additionally, we quantitatively compared the performance of TACCO and Bering, observing a significant accuracy improvement of 30–40% on average in the Cortex and DCIS datasets. Prediction precision increased by 10–25% on average in the NSCLC and Cortex datasets, while recall showed substantial improvement, rising by 15–35% on average in the NSCLC, ileum, and DCIS datasets (Fig. 2e).

Validating Bering performance on cell segmentation

Prior to conducting comprehensive benchmark studies, we performed a hyperparameter search for both Bering and the benchmark methods (Supplementary Figs. 11–14). In Bering, we thoroughly compared hyperparameters such as the number of GNN layers, the number of training cells, and the structures of the RBF distance kernels to determine the optimal settings (Supplementary Fig. 11). Details can be found in the “Methods” section. The cell segmentation process of Bering involves unsupervised clustering, whose clustering resolution hyperparameter can be set manually. We observed that stable cell segmentation results were achieved when the edge prediction accuracy was high (Supplementary Fig. 12). This implies that stable segmentation can be obtained by focusing on improving the accuracy of edge prediction, rather than purely adjusting the clustering resolution hyperparameter. Additionally, we searched hyperparameters for the benchmark methods to achieve their best segmentation performance in the benchmark studies (Supplementary Figs. 13 and 14).

We then implemented the benchmark methods on the NSCLC CosMx data and observed that Bering preserved reasonable cell boundaries and sizes (Fig. 3a). In contrast, Watershed and Cellpose exhibited a relatively conservative segmentation approach, while ClusterMap and Baysor predicted a certain number of cells with abnormal sizes (Fig. 3a). Similar observations were made in other tissues, including ileum, cortex, and DCIS (Supplementary Fig. 15).

Fig. 3: Performance of cell segmentation.

a Zoomed-in sections of CosMx NSCLC data illustrate the cell segmentation results obtained using various segmentation approaches. Different cells are depicted in distinct colors, while background noises are visualized as gray dots. The segmentation result from the original paper is displayed in the top-left corner. CM: ClusterMap. CM (img): ClusterMap with DAPI image input. Baysor (prior): Baysor with another segmentation mask input. b Adjusted Mutual Information (AMI) was used to quantify cell segmentation performance. Image-dependent methods, such as Watershed and Cellpose, were excluded from analysis when aligned nuclei staining images were not publicly available. Statistical significance was assessed using one-sided Wilcoxon rank-sum tests (n = 15 biological replicates), comparing Bering to other benchmark methods within each dataset. p-values were corrected for multiple comparisons using the False Discovery Rate (FDR) method (Benjamini/Hochberg). The interpretation of box plots and significance levels (denoted by asterisks) follows the same convention as in Fig. 2e. Corrected p values are shown on the top if p < 0.05. c Expression levels of tumor and non-tumor genes across cell types from Bering segmentation results. More details can be found in Supplementary Fig. 17. Source data are provided as a Source Data file.

Then, we quantitatively measured segmentation performance using AMI across six datasets from various tissues and technologies (Fig. 3b). Bering consistently outperformed Watershed and Cellpose in all datasets with imaging input, achieving an average improvement of 0.1–0.5 in AMI scores. While Baysor and ClusterMap performed comparably to Bering in some datasets, including NSCLC CosMx, hippocampus pciSeq, cortex MERFISH, and ileum MERFISH data, Bering achieved significantly higher AMI scores in embryo seqFISH data and DCIS Xenium data, with improvements of 0.05 to 0.5 on average (Fig. 3b). We also evaluated segmentation performance using the Intersection over Union (IoU) metric from the Cellpose paper5 (“Methods”) and observed that Bering outperformed other approaches in three out of four test datasets (Supplementary Fig. 16a–d).

Bering segmented cells with increased transcript numbers and clear gene patterns

We observed that Bering tended to detect a higher number of transcripts per cell (an increase of about 20% to 70% on average) with larger areas than reported in the original studies (Supplementary Fig. 16e, f). This increase, which remained within 100% of the original sizes, suggests that Bering can capture more transcripts while preserving normal cell morphology. Other methods, by contrast, sometimes predicted cell sizes more than double or less than half of the originals, indicating possible segmentation errors. Such errors included excessively large segmentation masks, as observed with Baysor in ileum MERFISH data and DCIS Xenium data (Supplementary Fig. 15), and excessively small segmentation masks, as seen with ClusterMap in NSCLC CosMx data and ileum MERFISH data (Fig. 3a and Supplementary Fig. 15).

To gain insights into the quality of single cells derived from different segmentation methods, we conducted benchmark comparisons at the single-cell level using the NSCLC CosMx data, where cell labels obtained from model predictions or label transfers were displayed within reduced dimensions (Supplementary Fig. 17a, “Methods”). We measured the correlations between cell types. Remarkably, we observed that Bering exhibited low correlations between tumor and non-tumor cells and closely mirrored the correlation patterns observed in the original paper (Supplementary Fig. 17b). In contrast, other methods demonstrated strong cross-correlations between tumor and non-tumor cells, alongside diminished correlation within non-tumor cells (Supplementary Fig. 17b). Furthermore, the expression of marker genes clearly indicates the separation of tumor and non-tumor cells (Fig. 3c). These findings suggest that Bering produced more distinct cell type identifications, facilitating easier and more accurate downstream analysis.

Bering is applicable across spatial technologies and 3D thick tissues

A diverse range of technologies now exists for generating single-cell spatial transcriptomics data, offering distinct data qualities and gene throughput capabilities. Beyond the quantitative measurements above, we also visualize Bering’s classification and segmentation performance across technologies and gene panel sizes, such as osmFISH with 35 genes22 and STARmap with over 1000 genes18 (Fig. 4a–h and Supplementary Data 1). Additionally, a holy grail of spatial transcriptomics is to generate spatially resolved gene expression in 3D tissues and organs. We therefore applied Bering to the latest 100-µm thick-tissue MERFISH cortex data23 and successfully predicted cell types and conducted 3D segmentation (Fig. 4i). We segmented 562 cells from a subset of the dataset, achieving strong concordance with the cell assignments reported in the original study (AMI score = 0.795). In summary, Bering successfully segments cell boundaries and accurately predicts their corresponding cell types, demonstrating its efficacy in simultaneous cell segmentation and annotation across diverse datasets and technologies, in both 2D and 3D settings.

Fig. 4: Bering applications across technologies and tissues.

a–h Bering was applied to various single-cell image-based spatial datasets, with predicted cell types and boundaries depicted in different colors and hulls in zoomed-in regions. Predicted background noise is colored in light gray. i Bering was applied to the thick-tissue MERFISH mouse cortex dataset, resulting in the prediction of diverse cell types (left) and the successful segmentation of individual cells. A cross-section at 16 µm depth was magnified and presented on the right, highlighting the segmentation outcome for a specific plane.

Validating pre-trained Bering in new samples using transfer learning and self-distillation

Bering, as a deep learning approach, provides the distinct advantage of cross-dataset portability. For example, we successfully applied a pre-trained model developed from one slice of the mouse cortex to another slice, yielding highly satisfactory segmentation performance. This led to unambiguous cell type annotations (Supplementary Fig. 18a) while achieving comparable performance to the fine-tuned model (Supplementary Fig. 18b).

However, the portability of the Bering pre-trained model can be hindered by batch effects across datasets, where the throughput of genes, which serve as features in the node classification task, can vary dramatically. This presents substantial challenges when applying pre-trained models to new data. To overcome this obstacle, we employed transfer learning together with self-distillation to enhance prediction results on the new data (see “Methods”). In the specific case of the cortex MERFISH data, we acquired a pre-trained Bering model from ref. 19 and utilized it to predict cells and annotations in the new cortex VISp data from ref. 24. Initially, less than 20% of molecules were assigned cell labels by the pre-trained model (Fig. 5a–c). To better capture the latent representation of the new data, we improved the pre-trained model through two rounds of self-distillation, leveraging the coarse prediction labels in the new data (see “Methods”). As a result, a larger number of transcripts were successfully labeled and segmented, with over 80% of transcripts assigned labels and more than 2000 cells segmented (Fig. 5c). Notably, the different layers of neurons (L2–L6) were accurately predicted, with intermittent distributions of interneurons and supporting cells (Fig. 5a, b). Furthermore, the predicted single cells from various cell types exhibited distinct distributions on the UMAP, highlighting a more pronounced separation between cell types compared to the predictions prior to self-distillation.

Fig. 5: Generalizability of the Bering model using transfer learning and self-distillation.

a–d Transfer learning of the Bering pre-trained model on a new mouse cortex MERFISH dataset. a Application of a pre-trained model from mouse cortex MERFISH data (Zhang et al.) to new mouse cortex data (Biancalani et al.), with (w.) and without (wo.) fine-tuning. Fine-tuning labels were derived from the prediction results of the pre-trained model, shown in the leftmost figure. Two rounds of fine-tuning were conducted through distillation of the prediction results (see “Methods”). A specific region is highlighted for further investigation. b Enlarged view of the highlighted region in (a). Cell boundaries are depicted by hulls. c Quantitative metrics depicting the increasing percentages of assigned molecules (left) and the number of segmented cells (right). d UMAPs displaying the distributions of cells from prediction results, with and without fine-tuning. e–g Similarly, transfer learning of Bering on tumor spatial transcriptomics data. e Application of a pre-trained model from CosMx NSCLC to Xenium DCIS data, resulting in prediction results without and with fine-tuning, respectively. Two specific regions are highlighted for further investigation. f Enlarged views of the highlighted regions in (e) showing predicted cell types and cell boundaries in the results with and without fine-tuning. g The heatmap shows the expression levels of marker genes in the segmented cells from the tumor microenvironment. Source data are provided as a Source Data file.

We further applied this strategy to tumor datasets, where the pre-trained model was derived from NSCLC CosMx data comprising 980 genes, and the validation data were obtained from the DCIS dataset, consisting of 313 genes. Without any fine-tuning, approximately 40% of transcripts in DCIS were successfully segmented and assigned cell labels (Supplementary Fig. 18c). However, the resulting transcript distribution landscape on the tumor slice lacked sufficient details for individual cells (Fig. 5e, f). Through self-distillation, we achieved significant improvements: more than 80% of transcripts were labeled, and over 3000 cells were successfully segmented, approximately three times more than before self-distillation (Supplementary Fig. 18c). This enhancement allowed us to reveal finer details of tumor niches, such as the colocalization of immune cells and tumor cells, as well as the precise boundaries of tumor regions (Fig. 5e, f). Notably, marker genes demonstrated specific expression across different cell types, confirming the accuracy of our cell segmentation and annotation (Fig. 5g).

Additionally, we evaluated the generalizability and limitations of the transfer learning strategy using the two pre-trained models mentioned above across a broader range of scenarios, including diverse tissues, technologies, and gene throughputs (Supplementary Figs. 19 and 20). When using the pre-trained model from NSCLC CosMx data (Supplementary Fig. 19a), we found that Bering could be successfully transferred to another sample from the same dataset or to other tissues, such as a tumor slice from DCIS, with reasonably high AMI scores (0.7–0.8). This transfer was effective even when using different spatial transcriptomics technologies (Supplementary Fig. 19b) and as few as ~100 overlapping features between DCIS and NSCLC data (Supplementary Fig. 19a). Although cell classification performance in DCIS was not as good as in transfer learning within the same tissue (e.g., NSCLC Sample 2), as indicated by lower accuracy and F1 scores (Supplementary Fig. 19b), it was demonstrated that self-distillation can effectively enhance cell classification in such cases (Fig. 5e, f). Similar findings were observed in another example (Supplementary Fig. 20b, c, f, g). However, the transfer learning strategy did not generalize well to tissues with vastly different cell compositions or gene features (Supplementary Figs. 19e, h and 20d, e, h, i). In these cases, significantly poorer cell classification and segmentation performance was observed compared to successful transfer learning examples (e.g., Supplementary Fig. 19b), likely due to overfitting of the pre-trained model to the original dataset.

Computational performance of Bering

We assessed the runtime and memory usage of Bering across various spatial transcriptomics datasets with differing throughput and sizes (Supplementary Data 2). Model training was carried out on an NVIDIA A100 GPU, while predictions were executed on 8 nodes of Intel Xeon Sapphire Rapids CPUs. In most cases, both training and inference required less than 2 GB of memory. For datasets without image inputs, training for 200 epochs took less than 20 min. However, for datasets that included image inputs, training times ranged from 1 to 5 h.

Discussion

The field of subcellular transcriptomics has gained popularity because of the rapid advancement of spatial omics technologies8,16,17,25,26. These cutting-edge technologies have enabled researchers to obtain more detailed information about cellular processes than ever before. Additionally, the number of features measured by image-based spatial transcriptomics technologies has increased from around 30 to 10,000 genes27, making it increasingly feasible to use deep learning models. However, the primary obstacle to implementing such models remains the shortage of labels for specific tasks. Consequently, previous methods for analyzing subcellular data, such as Bento and FISHFactor, have primarily relied on statistical models for unsupervised learning17,25. Cell segmentation, however, is a task for which large numbers of labels can be obtained relatively easily using traditional methods, opening the possibility of applying deep learning models at subcellular resolution. Benchmark results in our study demonstrate the superior performance of deep learning models over traditional methods in inferring the complicated underlying relationships of gene colocalization.

In our paper, we explore the possibility of detecting cell borders using subcellular transcript distributions and demonstrate the successful application of the graph model. Bering demonstrated better performance in cell classification, achieving an average accuracy improvement of 15–35% compared to TACCO. It also outperformed existing methods in segmentation across datasets, with AMI scores 0.1 to 0.5 higher, while achieving a reasonable increase in transcript coverage. Although Bering achieved good performance in cell annotation and segmentation, we expect it can be further improved in future studies. First, graphs in the model were built using k-nearest neighbors, which mainly capture relationships among transcripts within a local region. However, cells are usually spherical in their 3D tissue environments, and the relationships between transcripts on the membrane at opposite sides of a cell may not be efficiently captured by the graph. Second, the model requires labels for training. If the labels come from Watershed, which may over-rely on nuclei positions and produce conservative cell masks, the Bering model may likewise make conservative predictions. Third, we observed degraded annotation and segmentation performance in tissues where individual cells contained a low number of detected molecules; a more effective strategy is needed to handle datasets of relatively low quality. Lastly, training the image-based encoder is relatively slow compared to image-free segmentation, where we only use gene colocalization information as input. Technical performance can still be improved for faster training and prediction. For instance, image representations from models pre-trained on cell staining data can effectively capture underlying image patterns, offering a potential efficiency boost28.

Obtaining pre-trained labels can be a difficult and labor-intensive task. Consequently, it would be advantageous to obtain pre-trained models for specific tissues based on large-scale data. The cortex18, with its accessible subcellular transcriptomics data, is an example of such a tissue. We have demonstrated that a pre-trained model can be readily applied to a new dataset or a new technology for the same tissue and enhanced through fine-tuning on the data itself. However, the main hurdle in obtaining practical pre-trained models across technologies is the variability in the features being measured and the lack of large amounts of data. For instance, the overlap of measured genes between the MERFISH cortex data and the osmFISH cortex data is limited to fewer than 15 genes. This presents a significant challenge for transfer learning in such cases. The selection of genes to be probed and measured is heavily influenced by the authors’ interests, which can result in significant heterogeneity across datasets. The varying choices and depth of features create challenges for portable models. Nevertheless, as more genes are probed in new versions of these technologies (5000 to 10,000 genes) and more datasets from various tissues are generated, this issue will become less problematic. Transformer models have proven effective in learning spatial patterns from such large-scale data29,30, making them a promising avenue for future enhancements to the Bering model.

In summary, we tackled one of the often ignored yet unresolved challenges, namely cell segmentation for spatial data, using a graph model and transfer learning, incorporating multi-modality information during training. We demonstrated the model’s portability across various datasets and technologies. Bering lays a path for the future development of a foundation model trained on a massive amount of spatial transcriptomics data.

Methods

Bering model framework

The Bering model, illustrated in Fig. 1 and Supplementary Fig. 6, consists of two main components. The first component involves node classification to distinguish between noise and real signals, with predictions made for cell types. In the second component, cells are segmented through an edge prediction model and the Leiden clustering algorithm. Initially, we construct a gene colocalization graph, which serves as the input for a GNN used in the node classification task. The learned node representations, together with representations derived from image staining and distance kernels, are combined via tensor concatenation to form edge embeddings. These combined edge embeddings are used to predict edge labels, where intercellular and intracellular edges are binarized as negative and positive labels, respectively. The predicted intracellular edges are then used to build a molecular connectivity graph, on which Leiden clustering is applied to achieve cell segmentation. Finally, the outcomes of the node classification and cell segmentation tasks are merged to obtain annotated single-cell data.

Gene colocalization graph

In this model, transcripts are depicted as nodes on the slice, and we utilize their 2-dimensional (2D) or 3-dimensional (3D) spatial coordinates to construct k-nearest neighbor graphs that capture gene colocalization information. By default, we consider the 20 nearest neighbors. Edges of the graph depict equal-weight neighborhood relationships between nodes. The graph is described as below:

$$G_{gc}=\left(V,E_{gc}\right),\quad E_{gc}=\{\left(u,v\right) \mid v\in \mathrm{adj}_{k}(u)\}$$
(1)

where \(V\) represents the node set, \(E_{gc}\) represents the edge set, and \(\mathrm{adj}_{k}(u)\) represents the k-nearest neighboring nodes of node \(u\) in Euclidean space.
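For illustration, a minimal sketch of this graph construction is shown below, assuming transcript coordinates are stored in a NumPy array; the function and variable names are illustrative, not taken from the Bering codebase.

```python
# Minimal sketch of the k-nearest-neighbor colocalization graph (Eq. 1).
# Assumes `coords` is an (n_spots, 2) or (n_spots, 3) array of spot coordinates.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def build_knn_edges(coords: np.ndarray, k: int = 20) -> np.ndarray:
    """Return a (2, n_spots * k) array of directed edges (u, v) with v in adj_k(u)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(coords)  # +1: each point is its own nearest neighbor
    _, idx = nn.kneighbors(coords)                        # (n_spots, k + 1) neighbor indices
    src = np.repeat(np.arange(len(coords)), k)            # query node u, repeated k times
    dst = idx[:, 1:].reshape(-1)                          # drop the self-neighbor in column 0
    return np.stack([src, dst])                           # edge_index in PyTorch Geometric layout
```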

Node features

To capture gene colocalization information in 2D or 3D spatial coordinates more effectively, we utilize NGCs as node features. NGCs consist of gene components within the k-nearest neighborhoods, resulting in sparse count matrices where genes serve as features. This enables us to incorporate spatial relationships and uncover insights about gene colocalization patterns. Below is the definition of NGC:

$$\mathrm{NGC}_{i,q}=\left|\{u \mid u\in \mathrm{adj}_{k}(i),\ \mathrm{gene}_{u}=q\}\right|$$
(2)

where \(i\) represents molecule \(i\) in the node set \(V\) and \(q\) denotes gene \(q\) in the gene set. The matrix value \(\mathrm{NGC}_{i,q}\) indicates the number of transcripts of gene \(q\) detected in the neighborhood of node \(i\). The total number of genes in the dataset, \(N_{genes}\), defines the number of columns of the NGC matrix and can vary across technologies and datasets. For instance, osmFISH has been reported to detect 35 genes22, whereas SeqFISH+ has the capacity to detect up to 10,000 genes in a single experiment27.
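A hedged sketch of assembling the NGC matrix from the kNN indices follows; the sparse construction relies on SciPy summing duplicate entries, and all names are illustrative.

```python
# Sketch of the NGC count matrix (Eq. 2): count, for each spot, how many of its
# k neighbors carry each gene. `gene_ids` are integer-encoded gene identities.
import numpy as np
from scipy.sparse import csr_matrix

def build_ngc(neighbor_idx: np.ndarray, gene_ids: np.ndarray, n_genes: int) -> csr_matrix:
    """neighbor_idx: (n_spots, k) kNN indices; returns a sparse (n_spots, n_genes) NGC matrix."""
    n_spots, k = neighbor_idx.shape
    rows = np.repeat(np.arange(n_spots), k)       # spot i
    cols = gene_ids[neighbor_idx.reshape(-1)]     # gene identity of each neighbor
    vals = np.ones(n_spots * k, dtype=np.float32)
    # duplicate (row, col) pairs are summed on construction, yielding per-gene neighbor counts
    return csr_matrix((vals, (rows, cols)), shape=(n_spots, n_genes))
```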

Graph convolutional networks

GNNs propose that the representation of a node in a graph should consider not only its own features but also the characteristics of its neighboring nodes, which aligns with our assumption that spatially proximal nodes exhibit similar embeddings. In our model, we employed a GCN to analyze the gene colocalization graph. The GCN layer is defined as follows:

$$H_{1}=\sigma \left(\hat{D}^{-\frac{1}{2}}\hat{A}\hat{D}^{-\frac{1}{2}}X\varTheta_{1}\right)$$
(3)
$$H_{i+1}=\sigma \left(\hat{D}^{-\frac{1}{2}}\hat{A}\hat{D}^{-\frac{1}{2}}H_{i}\varTheta_{i}\right)$$
(4)

where \(H\) represents the matrix of node representations \(h_{u}\), and \(X\) represents the matrix of NGC node features \(x_{u}\). \(\sigma(\cdot)\) denotes the activation function (ReLU in our case). \(\hat{A}\) is the graph adjacency matrix augmented with self-loops, \(\hat{D}\) is the corresponding degree matrix, and \(\varTheta\) is a matrix of trainable parameters.
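A minimal PyTorch Geometric sketch of such a node encoder is shown below; the two-layer architecture and hidden size are illustrative assumptions rather than the published configuration.

```python
# Sketch of the GCN node encoder (Eqs. 3-4) on NGC features.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class NodeEncoder(torch.nn.Module):
    def __init__(self, n_genes: int, hidden: int = 128):
        super().__init__()
        # GCNConv applies the symmetric normalization D^{-1/2} A D^{-1/2} with self-loops
        self.conv1 = GCNConv(n_genes, hidden)
        self.conv2 = GCNConv(hidden, hidden)

    def forward(self, x, edge_index):
        h = F.relu(self.conv1(x, edge_index))   # x: NGC features, shape (n_spots, n_genes)
        h = F.dropout(h, p=0.2, training=self.training)
        return F.relu(self.conv2(h, edge_index))
```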

Fully connected neural networks

After obtaining the node representation from GCNs, Fully Connected Networks (FCNs) are employed. The FCN is defined as

$$\mathrm{FC}(h_{j})=\sigma (Wh_{j}+b)$$
(5)

Here \(h_{j}\) denotes the representation of node \(j\). The weight matrix \(W\) and bias term \(b\) are learned for each layer. \(\sigma(\cdot)\) denotes the activation function (ReLU in our case). Three FCN layers are stacked to obtain the final prediction. Similar networks are employed for edge prediction in the segmentation task, as described below.

Node classification

Node label prediction is accomplished by leveraging the node representations acquired through GCNs and FCNs. In the training phase, the objective function is determined by computing the cross-entropy loss between the ground truth labels and the predicted labels.

$$L_{nodeclf}=-\frac{1}{N}\sum_{n=1}^{N}\left(\sum_{c=1}^{C}\omega_{c}\,\mathrm{softmax}\left(\hat{y}_{n,c}\right)\cdot y_{n,c}+\omega_{b}\,\mathrm{softmax}\left(\hat{y}_{n,b}\right)\cdot y_{n,b}\right)$$
(6)
$$\mathrm{softmax}(x_{n,c})=\log \frac{\exp (x_{n,c})}{\sum_{i=1}^{C}\exp (x_{n,i})}$$
(7)

where the loss of node classification is denoted as \({L}_{{nodeclf}}\). \(\hat{y}\in {R}^{\left|V\right|\times (C+1)}\) is the output from FCNs and \(y\) is the ground truth labels. \(C\) is the number of classes and \(\omega\) is the weight. \({\omega }_{c}\) and \({\omega }_{b}\) represent the weight of real signal nodes with various cell types and background noises, respectively. These weights can be adjusted by users according to the background noise prevalence to effectively identify real signals and noises.
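As a sketch, the weighted objective of Eq. (6) corresponds to PyTorch's class-weighted cross-entropy, which combines the log-softmax of Eq. (7) with a negative log-likelihood; treating background as class index 0 and the specific weight values are assumptions for illustration.

```python
# Sketch of the weighted node-classification objective (Eq. 6).
import torch

n_cell_types = 8                               # C real-signal classes (illustrative)
class_weights = torch.ones(n_cell_types + 1)   # [background, type 1, ..., type C]
class_weights[0] = 0.5                         # omega_b: down-weight abundant background noise
criterion = torch.nn.CrossEntropyLoss(weight=class_weights)

# logits: (n_spots, C + 1) output of the FCN head; labels: (n_spots,) class indices
# loss = criterion(logits, labels)
```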

Transfer node representation for segmentation

NGCs are employed to learn the node representations of molecules, primarily for the node classification task. To enrich the edge representations with subcellular colocalization patterns of transcripts, we transfer the learned node representations from the GCN and concatenate the representation matrices of two nodes connected by an edge via simple tensor concatenation (Fig. 1c and Supplementary Fig. 6e). This forms a new matrix, which becomes a part of the edge representation. This process is performed after the node classification model is trained, during which the parameters of GCNs and FCNs learned for the node classification task remain frozen.

$$E(e)_{N}=\left\{h_{i},h_{j}\right\},\quad \text{where } e=(i,j)$$
(8)

where \(E(e)_{N}\) represents the edge embedding of edge \(e\), obtained by concatenating the node representations \(h_{i}\) and \(h_{j}\) of nodes \(i\) and \(j\) from layer \(l\) of the node classification model. By default, we select the output of the first fully connected layer after the GCNs as the node representation.

Distance kernels

The RBF distance kernel is a kernel function that measures the similarity between two vectors based on the distance. Since the spatial distance between two nodes is highly correlated with their intercellular relationship, we utilize distance kernels as a part of the edge embeddings to effectively learn the appropriate cell sizes.

$$E(e)_{D}=\left\{K_{d}(p_{i},p_{j})\right\}_{d=1}^{D}$$
(9)
$$K_{d}(p_{i},p_{j})=\exp \left(-\frac{\left(\|p_{i}-p_{j}\|-\mu_{d}\right)^{2}}{2\sigma_{d}^{2}}\right)$$
(10)

where \(E(e)_{D}\) represents the edge embedding derived from distance kernels \(K_{d}\). The total number of kernels is denoted as \(D\). \(p_{i}\) and \(p_{j}\) represent the coordinates of nodes \(i\) and \(j\). \(\mu_{d}\) and \(\sigma_{d}\) are the mean and standard deviation of each kernel, which can be learnable.
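A minimal sketch of such learnable kernels in plain PyTorch follows; the kernel count and initialization ranges are illustrative assumptions.

```python
# Sketch of learnable RBF distance kernels (Eqs. 9-10).
import torch

class RBFDistanceKernels(torch.nn.Module):
    def __init__(self, n_kernels: int = 64, max_dist: float = 50.0):
        super().__init__()
        # mu and sigma are trainable parameters, mirroring the learnable kernels described above
        self.mu = torch.nn.Parameter(torch.linspace(0.0, max_dist, n_kernels))
        self.sigma = torch.nn.Parameter(torch.full((n_kernels,), max_dist / n_kernels))

    def forward(self, p_i: torch.Tensor, p_j: torch.Tensor) -> torch.Tensor:
        d = (p_i - p_j).norm(dim=-1, keepdim=True)                        # (n_edges, 1) distances
        return torch.exp(-((d - self.mu) ** 2) / (2 * self.sigma ** 2))   # (n_edges, n_kernels)
```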

Image representation

Cell boundaries are highly indistinct without image staining in densely packed tissues, such as tumor and ileum. Latent image representations prove useful in these cases and serve as an additional component of the edge representation. In this model, CNNs followed by SPP are applied to learn the embeddings of input images of different sizes. FCNs are further utilized to learn the edge representation from the output of SPP.

$$E(e)_{Img}=\mathrm{FCN}\left(\mathrm{SPP}\left(\mathrm{CNN}\left(img_{e}\right)\right)\right)$$
(11)

where \(E(e)_{Img}\) represents the image representation in the edge embedding, and \(img_{e}\) is the image of edge \(e\). The image of an edge is defined as the rectangular region it covers, with the edge itself forming the diagonal line. While the size of these images falls within a specific range, the exact dimensions can vary significantly (such as 7 × 12 or 14 × 19 pixels), depending on the length of the edge. To optimize computational resources, we categorize edge sizes into distinct groups (such as 5 × 10 or 15 × 20 pixels). The actual images are then adjusted by expanding or cropping to fit within the designated bin size, which serves as the training input.
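The sketch below illustrates this image branch with a small CNN and SPP implemented via AdaptiveMaxPool2d (pool sizes 4, 2, and 1, as noted under "Model implementation"); the channel counts are assumptions.

```python
# Sketch of the image edge encoder (Eq. 11): CNN + spatial pyramid pooling + FCN.
import torch

class EdgeImageEncoder(torch.nn.Module):
    def __init__(self, in_channels: int = 1, out_dim: int = 64):
        super().__init__()
        self.cnn = torch.nn.Sequential(
            torch.nn.Conv2d(in_channels, 16, kernel_size=3, padding=1), torch.nn.ReLU(),
            torch.nn.Conv2d(16, 32, kernel_size=3, padding=1), torch.nn.ReLU(),
        )
        # SPP: fixed-length output regardless of the crop size
        self.pools = torch.nn.ModuleList([torch.nn.AdaptiveMaxPool2d(s) for s in (4, 2, 1)])
        spp_dim = 32 * (16 + 4 + 1)
        self.fc = torch.nn.Linear(spp_dim, out_dim)

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        h = self.cnn(img)                                             # img: (batch, C, H, W) edge crops
        spp = torch.cat([p(h).flatten(1) for p in self.pools], dim=1)
        return torch.relu(self.fc(spp))                               # E(e)_Img
```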

Edge representation

The edge representation serves as the input for the edge classification and segmentation tasks, comprising the three aforementioned key components: node representations, distance kernels, and image representations. They are learned by different models: GCN for node representations, learnable RBF kernels for distance distributions, and CNN for image representations. We combine these three distinct types of edge representations via simple tensor concatenation (Fig. 1c and Supplementary Fig. 6j), forming a new matrix as the input for the edge classification model:

$$E(e)=\left\{E(e)_{N},E(e)_{D},E(e)_{Img}\right\}$$
(12)

These three components play a crucial role in efficiently learning the underlying knowledge of gene colocalization relationships, cell sizes, and image-informed cell boundaries, respectively. Moreover, their combination holds the potential to learn cell type-specific cell sizes. Node representations and distance kernels prove to be adequate for sparsely populated tissues with clear cell boundaries, such as the cortex. However, in densely packed tissues like tumors, ileum, and liver, where cell boundaries are often challenging to discern, image representation becomes a vital component. The incorporation of image channels, such as DAPI staining, provides valuable information about cell boundaries in these densely packed tissues.
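Assembling the unified edge representation then reduces to tensor concatenation, as in the sketch below; `rbf` and `image_enc` refer to the illustrative modules sketched above, and `h` denotes the frozen node embeddings of Eq. (8).

```python
# Sketch of building the unified edge representation (Eq. 12).
import torch

def edge_representation(h, src, dst, rbf, image_enc, coords, edge_crops):
    e_node = torch.cat([h[src], h[dst]], dim=1)        # E(e)_N: concatenated endpoint embeddings
    e_dist = rbf(coords[src], coords[dst])             # E(e)_D: RBF distance features
    e_img = image_enc(edge_crops)                      # E(e)_Img: CNN + SPP image features
    return torch.cat([e_node, e_dist, e_img], dim=1)   # E(e), input to the edge classifier
```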

Edge classification

The edge classification task provides the input for the community detection step of cell segmentation. It is formulated as a binary classification problem, training a fully connected neural network to discern intracellular from intercellular molecular colocalization. Predicted edge labels are obtained by applying the sigmoid function to the neural network output. Binary cross-entropy is employed as the objective function.

$$L_{Edgeclf}=-\omega \left[y\cdot \log \hat{y}+(1-y)\cdot \log (1-\hat{y})\right]$$
(13)

where \(L_{Edgeclf}\) represents the loss of the edge classification model, \(\omega\) is the weight, and \(\hat{y}\) and \(y\) are the predicted probability of a positive (intracellular) edge and the ground truth label, respectively.

In the benchmark studies, edge labels were derived from the segmented cells in the original paper. In practice, however, edge labels were generated from coarsely segmented cells using methods like Watershed or others. For each node, edges were generated with its nearest neighbors and labeled as positive or negative, depending on whether they represented intracellular or intercellular connections within the coarsely segmented cells. These labels served as input for training the edge classification model. Once trained, the model can predict new intracellular edges, allowing for the segmentation of cells that may not have been identified in the original dataset or the coarsely segmented results.
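A hedged sketch of this label derivation is given below, assuming unassigned or background spots carry a cell ID of -1; the resulting labels would feed a binary cross-entropy objective such as torch.nn.BCEWithLogitsLoss.

```python
# Sketch of deriving binary edge labels from a coarse segmentation:
# an edge is positive (intracellular) when both endpoints share a cell ID.
import numpy as np

def edge_labels(edge_index: np.ndarray, cell_ids: np.ndarray) -> np.ndarray:
    src, dst = edge_index                              # (2, n_edges) kNN edges
    same_cell = cell_ids[src] == cell_ids[dst]
    assigned = (cell_ids[src] >= 0) & (cell_ids[dst] >= 0)  # -1 marks unassigned spots (assumption)
    return (same_cell & assigned).astype(np.float32)   # 1 = intracellular, 0 = intercellular
```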

Molecular connectivity graph

After the edge prediction, we create the molecular connectivity graph using predicted edge labels \({e}_{i,j}\). Positive labels (\({e}_{i,j}=1\), intracellular edges) indicate a connection between two nodes, where they belong to the same cell according to the model. Negative labels (\({e}_{i,j}=0\), intercellular edges) imply no connection between nodes, indicating they belong to separate cells.

$$G_{mc}=(V,E_{mc})$$
(14)
$$(i,j)\in E_{mc}\quad \text{if } e_{i,j}=1$$
(15)

Due to the large number of nodes and edges involved (potentially in the millions and billions, respectively), it is infeasible to predict labels for all pairwise edges. Current technologies typically detect fewer than 300 transcripts per cell, suggesting that positive edges for a node primarily come from its 300 closest neighbors. Consequently, we only predict edge labels for nearest neighbors.

Cell segmentation with community detection algorithms

Once the molecular connectivity graph is constructed, we apply community detection algorithms, such as Louvain and Leiden, to identify clusters in the graph. These clusters correspond to individual cells in the tissue.
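A minimal sketch using python-igraph and leidenalg is shown below; any Leiden implementation exposing a resolution parameter would serve equally well.

```python
# Sketch of cell segmentation via Leiden community detection on predicted intracellular edges.
import igraph as ig
import leidenalg

def segment_cells(n_spots: int, intra_edges, resolution: float = 1.0):
    g = ig.Graph(n=n_spots, edges=[tuple(map(int, e)) for e in intra_edges])
    part = leidenalg.find_partition(
        g, leidenalg.RBConfigurationVertexPartition, resolution_parameter=resolution
    )
    return part.membership   # cluster label per spot = predicted cell ID
```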

Single-cell annotation

After obtaining node classes and cell boundaries, we utilize both outcomes to generate single-cell data with annotations. We select cells that meet specific criteria, including (1) a minimum number of total transcripts per cell, (2) a minimum number of transcripts for the dominant cell type per cell, and (3) a minimum ratio of transcripts of the dominant cell type per cell. We introduce a threshold for the transcripts of the dominant cell type as we posit that transcripts within a segmented cell should belong to the same cell type. If transcripts within a segmented cell are annotated as many different cell types, we lack confidence in the accuracy of the cell annotation. These values are determined based on the user’s interpretation of the data. Setting a higher threshold for the minimum number of total transcripts will exclude segmented cells with fewer transcripts. Segmented cells that satisfy the criteria are annotated with the dominant cell class of transcripts within each cell.
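The sketch below illustrates these filtering criteria; the threshold values are illustrative defaults, to be tuned per dataset as described above.

```python
# Sketch of post-hoc cell filtering and annotation by dominant cell type.
import pandas as pd

def filter_cells(spots: pd.DataFrame, min_total=10, min_dominant=5, min_ratio=0.5):
    """spots: DataFrame with 'cell_id' and predicted 'cell_type' per transcript."""
    kept = {}
    for cell_id, grp in spots.groupby("cell_id"):
        counts = grp["cell_type"].value_counts()
        dominant, n_dom = counts.index[0], counts.iloc[0]
        if len(grp) >= min_total and n_dom >= min_dominant and n_dom / len(grp) >= min_ratio:
            kept[cell_id] = dominant   # annotate with the dominant cell type
    return kept
```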

Model implementation

The GCNs are implemented using PyTorch Geometric31, with batch normalization and dropout layers (dropout rate = 0.2) applied to each graph convolutional layer during training. The FCNs and CNNs are implemented using PyTorch, with batch normalization incorporated into each FCN layer. To handle the heterogeneous shapes of input images, SPP is employed. This is achieved through the PyTorch AdaptiveMaxPool2d function, utilizing three pooling sizes (4, 2, and 1) to ensure consistent sizes of output embeddings. The resulting embeddings from the SPP layers are passed through two layers of FCNs to generate the image representation, which becomes a part of the edge representation. The RBF distance kernel, referred to as the GaussianSmearing function, is implemented using the TorchDrug package32. The parameters μ and σ in the distance kernels are learnable during the training phase.

In both the node classification and edge prediction tasks, a learning rate of 1 × 10−3 and weight decay of 5 × 10−4 are set. Early stopping can be triggered when the loss fails to decrease.
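For illustration, these optimization settings translate to the following sketch; the choice of the Adam optimizer is an assumption, and `NodeEncoder` refers to the illustrative module sketched earlier.

```python
# Sketch of the optimizer configuration described above (lr = 1e-3, weight decay = 5e-4).
import torch

model = NodeEncoder(n_genes=980)  # e.g., the NSCLC CosMx feature space
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=5e-4)
```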

Generalized model using transfer learning

In our transfer learning experiments, two major pre-trained models were developed. The first model was based on tissue slice 21 of the mouse cortex MERFISH dataset (Zhang et al.). Three derivative models were then created with the same architecture but varying feature spaces: model I with 255 features from the pre-trained dataset itself; model II with 288 features combining data from two mouse cortex datasets (Zhang et al., Codeluppi et al.); and model III with 485 features from a combination of mouse cortex and mouse ileum data (Zhang et al., Petukhov et al.).

The second model was built using NSCLC CosMx data (He et al.). Similarly, three derivatives were created: model I with 980 features from the pre-trained dataset; model II with 1395 features from a combination of NSCLC CosMx and DCIS Xenium data (Janesick et al.); and model III with 1154 features combining NSCLC CosMx data and mouse ileum data (Petukhov et al.).

Once the models were pre-trained, we transferred both the node classification and edge prediction models with the weights frozen. In the instances where fine-tuning was applied (Fig. 5a, e), the newly predicted cell types and assignments were used as inputs, and the pre-trained model’s weights were unfrozen, while the model structure remained unchanged. The fine-tuned model was then used to predict cell labels again, resulting in more predicted cells, and this process could be repeated. We performed fine-tuning for up to two rounds (Fig. 5). Throughout fine-tuning, the model architecture remained consistent, while the weights were updated with new predictions from each round.
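The rounds of self-distillation described above can be summarized by the following pseudocode-style sketch; `predict`, `unfreeze`, and `fit` are illustrative placeholders, not the Bering API.

```python
# Sketch of the self-distillation loop: predictions from the pre-trained model
# become training labels for fine-tuning, repeated for up to two rounds.
def self_distill(pretrained_model, new_data, n_rounds: int = 2):
    model = pretrained_model
    labels = model.predict(new_data)       # coarse initial cell types and assignments
    for _ in range(n_rounds):
        model.unfreeze()                   # same architecture, weights now trainable
        model.fit(new_data, labels)        # retrain on the model's own predictions
        labels = model.predict(new_data)   # refined labels seed the next round
    return model, labels
```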

Analysis of Neighborhood Gene Component

To unravel the underlying information regarding gene colocalization, we constructed atlases of NGCs using the CosMx NSCLC dataset. In each cell, two molecules were randomly selected, and their k-nearest neighbors (k = 30 in our case) were identified, resulting in the formation of two NGCs. These NGCs were derived from various cell types and compartments, including nuclei, cytoplasm, and membrane. The information about cell type and compartment was derived from the original papers. The NGCs from different cell types and compartments were then concatenated to form a matrix, similar to single-cell matrices where genes serve as features. The distinction lies in the fact that NGCs treat neighborhoods, rather than individual cells, as the observations within the matrix. These NGC matrices were stored as Scanpy objects33, and an analysis pipeline resembling single-cell analysis was applied to the data. By employing UMAPs and Leiden clustering, we obtained reduced dimensions and clustering patterns from these matrices. For further details, please refer to the single-cell analysis section.
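A minimal Scanpy sketch of this pipeline follows, treating neighborhoods as observations in an AnnData object; the preprocessing steps are illustrative choices.

```python
# Sketch of the NGC analysis pipeline with Scanpy; `ngc_matrix` is the sparse
# (n_neighborhoods, n_genes) count matrix, e.g., from the build_ngc sketch above.
import scanpy as sc
from anndata import AnnData

adata = AnnData(X=ngc_matrix)
sc.pp.normalize_total(adata)   # library-size normalization per neighborhood
sc.pp.log1p(adata)
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)              # reduced dimensions, as in Supplementary Fig. 5
sc.tl.leiden(adata)            # clustering patterns of NGCs
```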

Generation of validation data

For each benchmark dataset, we divided the entire dataset evenly into individual fields of view (FOVs), with each FOV covering an average of 20 cells based on segmentation results from the original studies. In the ablation studies, we evaluated classification and segmentation performance on 50 FOVs from the NSCLC CosMx dataset34. For other benchmark studies involving different segmentation algorithms, we used 15 FOVs and applied the same metrics. The scripts used to generate the validation data are available in the public GitHub repository, as detailed in the Code Availability section.

Ablation study

We performed ablation studies on three key components of the Bering model: variations in node representation learning models, the usage of trainable RBF kernels, and the inclusion of image representations (Supplementary Fig. 8a). For the node representation learning, we compared two neural network architectures with the same number of layers: Graph Convolutional Networks (GCNs) and FCNs. We also assessed the impact of using trainable RBF kernels for distance representation learning and the effect of including imaging representations learned through CNNs and SPP layers.

Metrics for both cell classification and cell segmentation were evaluated across 50 FOVs from the NSCLC CosMx data to compare different model structures (Supplementary Fig. 8b–e). Statistical significance was determined using one-sided Wilcoxon rank-sum tests, with p-values adjusted using the False Discovery Rate (FDR) method (Benjamini/Hochberg).

Benchmark metrics

Node classification task

In this project, we employed multiple metrics to evaluate the performance of the model from various angles. For the node classification task, we focused on two aspects for comparison: background noise prediction and cell type prediction. In in-situ hybridization methods like MERFISH, significant noise is observed due to RNA diffusion during the staining rounds. To assess the effectiveness of background noise classification, we utilized the distance distribution as the evaluation metric. We selected spots from either the original paper’s annotations or Bering’s prediction results and calculated the distance to the 16th nearest neighbor for each spot. Using Gaussian kernels, we fitted curves to represent the distances for background noise and real signals. Finally, we calculated the Jensen–Shannon divergence between these curves for both the original paper and Bering’s predictions to compare their distributions.
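A sketch of this evaluation, assuming spot coordinates and a boolean background annotation as inputs (the function and argument names are illustrative):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from scipy.stats import gaussian_kde
from scipy.spatial.distance import jensenshannon

def noise_signal_divergence(coords, is_background, k=16, grid_size=200):
    """JS divergence between k-th-neighbor distance distributions of
    background noise spots and real signal spots."""
    knn = NearestNeighbors(n_neighbors=k + 1).fit(coords)
    d = knn.kneighbors(coords)[0][:, k]          # distance to the 16th neighbor
    grid = np.linspace(d.min(), d.max(), grid_size)
    p = gaussian_kde(d[is_background])(grid)     # Gaussian-kernel fitted curves
    q = gaussian_kde(d[~is_background])(grid)
    return jensenshannon(p, q) ** 2              # scipy returns the JS distance
```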

Apart from background accuracy, we also evaluated the performance of cell type classification, using the accuracy of cell type predictions for the comparison:

$${{{\rm{Acc}}}}_{{{{\rm{cell}}}}{\mbox{-}}{{{\rm{type}}}}}=\frac{{\sum }_{i=1}^{N}{\mathbb{1}}({\hat{y}}_{i}={y}_{i})}{N},\,{{{\rm{where}}}}\;{\hat{y}}_{i},{y}_{i}\in \{{{{\rm{type}}}}\,1,\ldots,{{{\rm{type}}}}\,C\}$$
(16)

where \({\hat{y}}_{i}\) and \({y}_{i}\) are the predicted label and the ground truth cell type for molecule i.

Additionally, we calculated the Macro-F1 score, Macro Precision, and Macro Recall for the cell type classification task using functions from the sklearn.metrics module. We selected these macro metrics over their micro counterparts because the distribution of cell types is usually imbalanced; macro averaging weights abundant and rare cell types equally.
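For reference, these metrics correspond to the following scikit-learn calls, where `y_true` and `y_pred` are assumed to be per-molecule cell type labels:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# y_true, y_pred: per-molecule cell type labels (arrays of equal length)
macro_f1 = f1_score(y_true, y_pred, average="macro")
macro_precision = precision_score(y_true, y_pred, average="macro")
macro_recall = recall_score(y_true, y_pred, average="macro")
```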

Cell segmentation task

The cell segmentation output can be viewed as a clustering result, where the similarity between two clusterings (our predicted cells and the cells in the ground truth) needs to be assessed. To measure this similarity, we employed the widely used adjusted mutual information (AMI) as the quantification metric for cell segmentation performance.

$${{{\rm{AMI}}}}({{cells}}_{{true}},{{cells}}_{{pred}})=\frac{{{{\rm{MI}}}}({{cells}}_{{true}},{{cells}}_{{pred}})-E[{{{\rm{MI}}}}({{cells}}_{{true}},{{cells}}_{{pred}})]}{{{{\rm{avg}}}}(H({{cells}}_{{true}}),H({{cells}}_{{pred}}))-E[{{{\rm{MI}}}}({{cells}}_{{true}},{{cells}}_{{pred}})]}$$
(17)

where \({{cells}}_{{true}}\) and \({{cells}}_{{pred}}\) represent cell ids from the ground truth and the predictions, \(H(\cdot )\) represents the entropy of a vector, and \({{{\rm{MI}}}}(\cdot )\) represents mutual information. AMI corrects for the fact that mutual information is generally higher between clusterings with many clusters; since the number of cells in a dataset is usually large, AMI is the more appropriate choice here.
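AMI can be computed directly with scikit-learn, given per-transcript cell ids:

```python
from sklearn.metrics import adjusted_mutual_info_score

# cells_true, cells_pred: per-transcript cell ids from ground truth and prediction
ami = adjusted_mutual_info_score(cells_true, cells_pred)
```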

Additionally, we measured the average precision curve across a range of intersection-over-union (IoU) thresholds, a benchmark metric used in the Cellpose paper5. In image-based object detection, IoU, also referred to as the Jaccard index, is defined as \(J(A,B)=\frac{|A\cap B|}{|A\cup B|}\), where \(|A\cap B|\) represents the overlapping area between objects A and B, and \(|A\cup B|\) represents the area covered by either A or B.

In this paper, we customized IoU as follows: \(|A\cap B|\) refers to the number of overlapping transcripts assigned to both cell A and cell B, while \(|A\cup B|\) represents the total transcripts assigned to either cell A or cell B in the prediction. This approach is more tailored than using pixel-based areas, since some datasets may contain multiple transcripts within a single pixel. As a simple example, if both cell 1 and cell 2 have 100 transcripts each, with 50 overlapping transcripts, the IoU would be calculated as \(J\left({cell}1,{cell}2\right)=50\,/\,\left(100+100-50\right)=\,0.33\).
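A minimal sketch of this transcript-level IoU, reproducing the worked example above:

```python
def transcript_iou(cell_a, cell_b):
    """IoU over sets of transcript ids rather than pixel areas."""
    a, b = set(cell_a), set(cell_b)
    return len(a & b) / len(a | b)

# Worked example: two cells with 100 transcripts each, 50 of them shared
print(transcript_iou(range(100), range(50, 150)))   # 0.333...
```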

We measured the number of matched cells between the prediction results and the original study under various IoU thresholds; higher thresholds retain only predicted cells whose transcript composition closely matches the original, thereby filtering out more cells. Following the Cellpose paper5, we also measured the average precision (AP) curve along the IoU thresholds (Supplementary Fig. 16a–d). AP is defined as:

$${{{\rm{AP}}}}=\frac{{{{\rm{TP}}}}}{{{{\rm{TP}}}}+{{{\rm{FP}}}}+{{{\rm{FN}}}}}$$
(18)

Fraction of assigned molecules

The fraction of assigned molecules is used as a metric to compare the effectiveness of segmentation. We hypothesize that more conservative methods that rely solely on nuclei staining images may score lower on this metric. It is defined as follows:

$${{{\rm{Frac}}}}=\frac{{N}_{{{{\rm{molecules}}}}\,\notin \,{{{\rm{background}}}}}}{{N}_{{{{\rm{total}}}}\,{{{\rm{molecules}}}}}}$$
(19)

Number of cells and cell areas

The number of cells is used as an indicator of the capability to identify individual cells. Nevertheless, a higher number of cells does not necessarily indicate superior segmentation performance, as some methods may take an aggressive approach that yields many cells with small areas. Therefore, we also evaluated the sizes of the segmented cells to determine whether they exhibit appropriate shapes. The median cell area in each field of view was calculated as a reference, and the sizes of predicted cells relative to this reference were used for comparison.

Number of transcripts per cell

In the benchmark experiments, we calculated the median number of transcripts per segmented cell after excluding unsegmented transcripts. In some benchmark experiments, this value was determined for each field of view as a reference, and the transcript counts of predicted cells relative to this reference were reported.

Correlation of single cell expressions

To assess the correlation between clusters, we initially identified representative cell markers by conducting differential expression analysis within the single-cell clusters from the original paper34. Subsequently, we calculated the Spearman correlation by averaging the gene expression values across clusters. This allowed us to evaluate the degree of correlation between the clusters.

Significance test

In the benchmark and ablation experiments, we conducted one-sided Wilcoxon rank-sum tests for comparisons between methods, with p-values corrected by FDR (Benjamini/Hochberg). For the distance distribution experiment in Fig. 2c, we conducted a one-sided t-test.

Benchmark methods

TACCO classification20

TACCO, an optimal transport method, enables the transfer of annotations from single-cell data to spatial data. In our benchmark studies, we utilized the benchmark datasets with annotations obtained from the original papers to create the reference single-cell data. The cell type annotations for molecules were generated by projecting the single-cell labels using the “annotate_single_molecules” function.

Watershed segmentation4

The watershed method, implemented using the spateo package35, was utilized for segmentation based on nuclei staining images with some modifications. Firstly, the masks of nuclei were identified using both global and local adaptive thresholding techniques. Subsequently, peak detections were performed based on the results obtained from the distance transform algorithm. The connected peaks were merged to form individual markers. Finally, the masks and markers were used as input for the watershed algorithm. Molecules were assigned to the nearest pixels, and cell IDs were determined based on the corresponding pixels.
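A condensed sketch of this pipeline using scikit-image (the actual implementation uses spateo; the way the two thresholds are combined and the default parameter values here are illustrative):

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.filters import threshold_otsu, threshold_local
from skimage.feature import peak_local_max
from skimage.segmentation import watershed

def watershed_segment(dapi, block_size=33, min_distance=5):
    # Nuclei masks from global and local adaptive thresholding (union here)
    mask = (dapi > threshold_otsu(dapi)) | (dapi > threshold_local(dapi, block_size))
    dist = ndi.distance_transform_edt(mask)           # distance transform
    peaks = peak_local_max(dist, min_distance=min_distance,
                           labels=ndi.label(mask)[0])
    peak_mask = np.zeros_like(mask, dtype=bool)
    peak_mask[tuple(peaks.T)] = True
    markers, _ = ndi.label(peak_mask)                 # connected peaks merge
    return watershed(-dist, markers, mask=mask)       # labeled cell masks
```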

Cellpose segmentation5

Cellpose, a U-Net based deep CNN approach, was used as a benchmark method in the paper. We used the pretrained model ‘cyto’ and nuclei staining for segmentation.
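For reference, a minimal invocation of the pretrained model (`dapi` is assumed to be a 2D nuclei staining array; the flow threshold shown matches the value selected in the hyperparameter search described below):

```python
from cellpose import models

model = models.Cellpose(model_type="cyto")
masks, flows, styles, diams = model.eval(
    dapi, channels=[0, 0], flow_threshold=0.8   # grayscale input, tuned threshold
)
```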

ClusterMap segmentation9

ClusterMap is a cell identification method that utilizes density peak clustering of spots in spatial data, with the option to incorporate staining images as an auxiliary input. In our benchmarking, we evaluated both modes of ClusterMap, with and without aligned nuclei images, using the 2D segmentation mode for this task. Where publicly available, DAPI staining images were used during preprocessing. For noisy data, such as MERFISH19 and pciSeq7, the noise ratio was estimated from the percentage of unsegmented transcripts reported in the original paper; this parameter, termed “pct_filter,” was then set during preprocessing. Furthermore, for these noisy datasets, the local noise rejection mechanism (“LOF”) was activated during the preprocessing phase.

Baysor segmentation3

Baysor is a Bayesian model-based method designed for segmenting spots in spatial data. It has the capability to incorporate prior segmentation masks, such as those obtained from Watershed or other segmentation methods. In our benchmark study, we evaluated the performance of Baysor both with and without prior segmentation. For the tests with prior information, we performed Watershed segmentation on the nuclei staining images and utilized the resulting masks as input for Baysor.

Hyperparameter search

To ensure a fair benchmark comparison, we carefully fine-tuned the hyperparameters of Bering and the other benchmark methods, selecting the best parameter combinations for the final evaluation. For the Watershed algorithm, we considered three important parameters: the minimal distance for peak detection, the kernel size for morphological open and close operations, and the block size for adaptive thresholding. For Cellpose, we conducted benchmarking experiments to determine optimal diameters and flow thresholds. For ClusterMap, we separately benchmarked the two modes, with and without auxiliary images; in both modes, we evaluated three hyperparameters: the threshold for cell numbers, the estimated radius in the x-y plane, and the sampling interval in the DAPI image. Similarly, for Baysor, we evaluated both modes, with and without a prior segmentation mask, benchmarking hyperparameters including the scale, the standard deviation of the scale, and the minimal number of molecules.

We first performed a hyperparameter search for Bering and evaluated cell segmentation performance with AMI (Supplementary Fig. 11). While some variation was observed between hyperparameters in each experiment (Supplementary Fig. 11a–f), none of the differences was significant based on p-values from two-sided Wilcoxon rank-sum tests with FDR correction (Benjamini/Hochberg), indicating that segmentation performance is robust to hyperparameter selection in the Bering model. Based on these observations, we selected the hyperparameters that yielded the highest average AMI. Where multiple hyperparameter settings produced similar average AMI, we opted for the simpler architecture; for instance, node classification models with 3 and 5 FCN layers demonstrated comparable AMI (Supplementary Fig. 11d), so we selected the less complex model with 3 FCN layers. The final chosen hyperparameters are listed below:

Number of nearest neighbors in the graph: 20

Number of GCN layers in the node classification: 3

Number of FCN layers in the node classification: 3

Number of FCN layers in the edge prediction: 3

Number of RBF kernels: 64

Maximal μ in RBF kernel (relative to cell diameter): 0.5

In our hyperparameter search experiments for other methods, we observed that certain parameters, such as the scale-std of Baysor and the sample interval of DAPI images in image-free mode, did not significantly influence the results within the tested range. As a result, these parameters were excluded from further optimization. We identified the best hyperparameters for each method using the experimental dataset. The optimal hyperparameters are listed below:

Watershed: the minimal distance for peak detection = 5; the kernel size for morphological open and close operations = 11; the block size for adaptive thresholding = 33% (relative to cell diameter)

Cellpose: flow_threshold = 0.8

ClusterMap (without DAPI input): cell_num_threshold = 0.0001

ClusterMap (with DAPI input): DAPI grid interval = 3

Baysor (without prior segmentation): min_molecule = 10; min_molecule = 3 if not enough cells pass the min_molecule = 10 threshold

Baysor (with prior segmentation): prior_segmentation_confidence = 0.8.

Thick tissue analysis

The thick tissue MERFISH cortex dataset23 was used to assess the 3D segmentation performance of Bering. The original cell boundaries and labels served as references during the model’s training phase. Using Euclidean distances in the 3D context, we computed k-nearest neighbors and built the colocalization graphs. The trained model was then evaluated in a designated 200 × 200 × 100 µm region.
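A sketch of the 3D graph construction, using random placeholder coordinates for illustration (the neighbor count matches the value chosen in the hyperparameter search above):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

# Placeholder transcript coordinates (x, y, z) in µm within the test region
coords_3d = np.random.rand(1000, 3) * [200, 200, 100]
adjacency = kneighbors_graph(coords_3d, n_neighbors=20, mode="distance")
```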

Subcellular pattern identification

Using FISHFactor, we identified distinct subcellular gene patterns within tumor cells from the NSCLC CosMx dataset. A cohort of 200 tumor cells was randomly selected, and three distinct factors were computed for all genes. The gene weights within these factors, along with the spatial factor scores, were then plotted to reveal subcellular gene patterns. Factors 2 and 3 were denoted as “nucleus-pattern” and “peripheral-pattern,” respectively. The genes with the highest weights within these two factors were identified, and their subcellular distributions were visualized. To portray these distributions, boundaries for cell nuclei and cell bodies were established through Cellpose segmentation of DAPI imaging and the convex hulls of all transcripts, respectively. Furthermore, we computed the normalized distance to the nuclei centroid: the distance between each transcript and the nuclei centroid, divided by the maximal distance between any transcript and the nuclei centroid. Finally, we used kernel density estimation plots to visualize the smoothed distributions of these distances.
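A minimal sketch of the normalized-distance computation described above:

```python
import numpy as np

def normalized_nuclear_distance(transcripts_xy, nucleus_centroid):
    """Distance of each transcript to the nuclei centroid, divided by
    the maximal such distance across all transcripts in the cell."""
    d = np.linalg.norm(transcripts_xy - nucleus_centroid, axis=1)
    return d / d.max()
```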

Single cell analysis

Single cell analysis was performed using Scanpy33. We started by extracting count matrices from the segmented cells, which were then normalized to a total count of 1000 per cell; cells with fewer than 10 counts were removed due to low coverage. Log transformation and scaling were then applied to the normalized counts, and Principal Component Analysis (PCA) was employed to reduce the dimensionality of the data. Using the resulting PCA components, a neighbor graph was constructed from the 10 nearest neighbors of each cell, and the UMAP algorithm was applied to obtain a reduced-dimensional representation for visualization and exploration. To assign cell identities in the results, we used the “ingest” function in Scanpy to map labels from the reference data (here, the annotations in the original paper) onto the single cell data generated from the benchmark methods, including Watershed, Cellpose, ClusterMap, and Baysor. Predicted labels from Bering were used directly for the comparison with the other benchmark methods.
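A minimal Scanpy sketch of this pipeline (the AnnData objects and label column name are illustrative; low-coverage cells are filtered on raw counts before normalization, and `sc.tl.ingest` assumes the reference object already has PCA and UMAP computed):

```python
import scanpy as sc

# `adata` holds the count matrix extracted from segmented cells
sc.pp.filter_cells(adata, min_counts=10)          # drop low-coverage cells
sc.pp.normalize_total(adata, target_sum=1000)     # 1000 counts per cell
sc.pp.log1p(adata)
sc.pp.scale(adata)
sc.tl.pca(adata)
sc.pp.neighbors(adata, n_neighbors=10)
sc.tl.umap(adata)
sc.tl.ingest(adata, adata_ref, obs="cell_type")   # map reference annotations
```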

Alignment of image signals and spot information

DAPI staining images are usually available for image-based spatial technologies. We obtained the coordinates of transcript spots and projected them onto the DAPI images based on the closest pixels; the DAPI intensity of the corresponding pixel was used as the DAPI staining strength of each spot. DAPI intensities range from 0 to 255, and 25 was used as the threshold for low-signal pixels.
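A sketch of this projection, using placeholder arrays for the image and spot coordinates:

```python
import numpy as np

# Illustrative placeholders: a DAPI image and spot coordinates in pixel units
dapi = np.random.randint(0, 256, size=(512, 512), dtype=np.uint8)
spots_xy = np.random.rand(10000, 2) * 512

cols = np.clip(np.round(spots_xy[:, 0]).astype(int), 0, dapi.shape[1] - 1)
rows = np.clip(np.round(spots_xy[:, 1]).astype(int), 0, dapi.shape[0] - 1)
spot_intensity = dapi[rows, cols]        # intensity at the nearest pixel, 0-255
low_signal = spot_intensity < 25         # threshold for low-signal pixels
```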

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.