Introduction

Recent advances in single-cell technologies, which enable the measurement of transcriptomic1, epigenomic2,3 and proteomic4,5,6 profiles at single-cell resolution, have greatly enhanced our ability to comprehensively characterize cellular states. Data resources generated by these technologies have provided significant insights into the functions of various cell types7,8 and deeper understanding of pathology9,10 from multiple omics perspectives. As high-throughput single-cell technologies continue to develop rapidly and data resources accumulate, there is an increasing need for computational methods that can integrate information from different modalities to perform joint analysis of single-cell multi-omics data and gain a more comprehensive understanding of cellular states and functions.

However, integrating single-cell omics datasets presents unique challenges. First, cross-modality integration, also known as “diagonal integration”11, aims to align different single-cell modalities with distinct features. For the integration of single-cell RNA-sequencing (scRNA-seq) and single-cell sequencing assay for transposase-accessible chromatin (scATAC-seq) datasets, the cross-modality features exhibit strong connections as gene expression levels can usually be accurately imputed using single-cell chromatin accessibility12,13. Nevertheless, the features across some modalities, such as surface protein abundance in proteomic assays and its coding gene expression in scRNA-seq data, show weaker relationships which are often not robust enough to reliably guide integration, as mRNA levels do not always correlate with protein abundance due to post-transcriptional regulation, degradation, and protein modifications14,15,16,17. Furthermore, many cross-omics features are involved in regulatory circuits that are not well understood, making it difficult to achieve integration when known information about feature relationships is limited. Second, compared to scRNA-seq which provides whole-transcriptome profiling for tens of thousands of genes, some technologies detect only a limited number of features, such as dozens to hundreds of protein targets in antibody-based single-cell proteomics5,18 and 100 to 1000 genes in imaging-based spatial transcriptomics19. This limitation further constrains the signal available for high-quality integration, making cross-modality integration more challenging.

Many computational methods have been developed for the integration of single-cell datasets20,21,22,23,24,25,26,27. However, most of these methods were developed primarily for correcting batch effects in scRNA-seq datasets, or for integrating omics layers with strong connections such as scRNA-seq and scATAC-seq data, and as a result they often fail to address the aforementioned challenges. Among the existing methods, bindSC26 and MaxFuse27 were recently developed for single-cell multi-modal integration, demonstrating particular efficacy in integrating modalities with weak relationships, such as protein abundances and gene expression levels. Both methods utilize canonical correlation analysis (CCA) to learn linear projections that map features from each modality to a common space, ensuring that the projected vectors are maximally correlated. However, the inherent structure of unwanted variation across single-cell datasets is often complex and nonlinear28,29,30. Meanwhile, the relationships between cross-modality features can be intricate and cell type-specific, regulated by multiple biological factors15,16. Thus, linear projections may lack the flexibility needed to adequately correct unwanted variation and accurately model feature correspondence.

Here, we present scMODAL, a general deep learning framework for single-cell multi-omics data alignment with feature links. scMODAL is designed to integrate unpaired datasets with limited numbers of known positively correlated features, which are also referred to as “linked” features in the literature27. To capture complex relationships between modalities, we build neural networks to project different single-cell datasets into a common low-dimensional latent space and apply generative adversarial networks (GANs)31 to align cell embeddings. To accurately find cell population correspondence across datasets, scMODAL utilizes prior information from known linked features to identify anchor cell pairs that can guide integration, while preserving the topological structure of all input features. Through comprehensive real data experiments, we demonstrate scMODAL’s performance in preserving biological variation across modalities and finding correct correspondences among them, using scRNA-seq, single-cell proteomics and scATAC-seq datasets. In particular, scMODAL shows state-of-the-art performance in both unwanted variation removal and biological information preservation even when there are very few linked features. With the integration results, scMODAL can identify cell subpopulations that were not distinguishable with the original modality features. We further showcase scMODAL’s capabilities in downstream tasks, such as imputation of cross-modality features and inference of feature relationships. We have made scMODAL publicly available as a Python package at https://github.com/gefeiwang/scMODAL.

Results

Method overview

scMODAL is a deep generative framework that learns integrated cell representations from single-cell multi-omics features. The input to scMODAL comprises cell-by-feature data matrices. For simplicity, we consider the scenario involving two datasets with different numbers of cells and features, denoted by \({{{{\bf{X}}}}}_{1}\in {{\mathbb{R}}}^{{n}_{1}\times {p}_{1}}\) and \({{{{\bf{X}}}}}_{2}\in {{\mathbb{R}}}^{{n}_{2}\times {p}_{2}}\). Using prior knowledge about the cross-modality feature relationships, we compile linked features from these datasets into another pair of matrices \({\widetilde{{{{\bf{X}}}}}}_{1}\in {{\mathbb{R}}}^{{n}_{1}\times s}\) and \({\widetilde{{{{\bf{X}}}}}}_{2}\in {{\mathbb{R}}}^{{n}_{2}\times s}\), where s represents the number of feature pairs (Fig. 1a). The columns of these matrices pair cell features likely to be positively related, such as gene expression levels from scRNA-seq data and gene activity scores computed based on scATAC-seq data, or protein abundance levels paired with their corresponding protein-coding gene expression levels.
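
As an illustrative sketch of this step, the paired matrices can be assembled from a protein-to-coding-gene mapping. The AnnData-based workflow and the names adata_rna, adata_adt and protein_to_gene below are assumptions for illustration, not part of the scMODAL API.

```python
import numpy as np
import anndata as ad

def build_linked_features(adata_rna: ad.AnnData, adata_adt: ad.AnnData,
                          protein_to_gene: dict):
    """Return X1_tilde (n1 x s) and X2_tilde (n2 x s) with matched columns."""
    pairs = [(p, g) for p, g in protein_to_gene.items()
             if p in adata_adt.var_names and g in adata_rna.var_names]
    genes = [g for _, g in pairs]        # linked gene expression columns
    proteins = [p for p, _ in pairs]     # linked protein abundance columns
    dense = lambda X: X.toarray() if hasattr(X, "toarray") else np.asarray(X)
    return dense(adata_rna[:, genes].X), dense(adata_adt[:, proteins].X)
```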

Fig. 1: Overview of scMODAL.
figure 1

a scMODAL takes single-cell feature matrices from different modalities, together with feature links, as input. b scMODAL utilizes generative adversarial learning to mix the distributions of cell embeddings from different datasets. To find the correct correspondence between modalities as well as preserve biological variation within each modality, regularizations that narrow the distance between anchors based on prior information and preserve the geometric representation of cells are applied during the training of scMODAL. c scMODAL outputs integrated cell representations for further analyses, and the composition of trained networks enables imputation of features and inference of feature relationships across single-cell modalities. The results can also be used for multiple downstream analyses, including label transfer for revealing cell identities and cell-cell communication inference using imputed features.

To address complex unwanted variations between modalities, we use nonlinear neural networks as encoders, denoted as E1 and E2, to map cells to a shared latent space Z (Fig. 1b). Unlike most integration methods that rely solely on shared features, our approach inputs the full feature matrices X1 and X2 into the encoders to preserve biological information. Decoders G1 and G2 are employed to generate cell features from the latent embeddings and trained together with the encoders for autoencoding consistency. Once the cells are encoded in Z, we apply the generative adversarial learning mechanism in GANs to minimize the Jensen-Shannon divergence between the latent distributions of the datasets using an auxiliary discriminator network.

However, using generative adversarial learning to align distributions without guidance can result in incorrect integration by mismatching distinct cell populations. In practice, there are often no cells measured with both modalities available to serve as integration anchors. Therefore, we use cell similarity information in positively related features \({\widetilde{{{{\bf{X}}}}}}_{1}\) and \({\widetilde{{{{\bf{X}}}}}}_{2}\) to establish connections between datasets. Specifically, during training, we calculate mutual nearest neighbor (MNN) pairs between minibatches of samples as anchors to guide integration. After identifying these MNN pairs, we regularize the neural network optimization by keeping the embeddings of MNN pairs close to each other using an L2 penalty on the Euclidean distance. While using MNN pairs for batch-effect correction in scRNA-seq datasets has yielded promising results28, simply minimizing the distances between MNN pairs may not effectively align all cell populations in a multi-omics setting, as the shared information between cross-modality features could be limited. Nevertheless, these MNN pairs can serve as valuable prior information, enhancing the accuracy of integration when combined with the generative adversarial learning mechanism. Additionally, to prevent the networks from becoming too flexible, which could result in loss of information and destruction of dataset-unique structures, we preserve the geometric structure of each dataset by regularizing the geometric representations of cells. Specifically, for each cell, we calculate its Gaussian kernel distance from other cells in the sampled minibatch as a B-dimensional geometric representation, where B is the batch size. During training, the encoders are encouraged to preserve the geometric representations, maintaining relative similarities and distinctions among cell populations.

After training the neural networks, aligned cell representations can facilitate cross-modality integrative analysis (Fig. 1c). The network compositions G2(E1( )) and G1(E2( )) can be used to map cells from one modality to another, serving as a bridge for cross-modality feature imputation. Using imputed features, we can also infer correlation networks among different modalities to reveal potential regulatory relationships. More details are provided in the Methods section.

Benchmarking on integration of gene expression and protein abundance with multimodal datasets

We first evaluated scMODAL’s performance using a human cellular indexing of transcriptomes and epitopes by sequencing (CITE-seq) peripheral blood mononuclear cells (PBMCs) dataset32, which simultaneously quantified transcriptome-wide gene expressions and 228 surface protein markers using antibody-derived tags (ADTs) in the same cells. We applied scMODAL and other recently developed integration methods, including MaxFuse27, bindSC26, GLUE24, Monae25, Portal23 and Seurat20, to integrate the RNA and ADT modalities, treating these cells as unmatched during the integration process. The matched RNA and ADT profiles in this dataset serve as the ground truth for a systematic comparison.

Before integration, we investigated the cell population structures in unintegrated datasets. As shown in the UMAP33 plot and correlation heatmap based on the ADT data, CD4 T cells and CD8 T cells exhibit distinct protein abundance levels, indicated by their separate clusters and distinct correlation blocks (Fig. 2a, c). However, they show higher similarity when comparing the expression levels of highly variable genes (Fig. 2d).

Fig. 2: Benchmarking on integrating transcriptome and protein data produced by CITE-seq.
figure 2

a UMAP plots of unintegrated datasets, colored by cell types. b UMAP plots of integrated embeddings produced by scMODAL, colored by modalities (left) and cell types (right). Correlation heatmaps of ADT (c) and RNA features (d), computed using all proteins and 3000 highly variable genes for each pair of cells. Cells are grouped based on cell types. e Confusion matrix of scMODAL's label transfer result. Each entry represents the proportion of cells of the cell type on the y-axis that are assigned to the cell type on the x-axis. Quantitative evaluations in terms of unwanted variation removal, biological information preservation and cell-state matching accuracy, using the full protein panel (f) and 30 proteins (g). h UMAP plots of integrated embeddings produced by MaxFuse, which ranked second in preserving cell population structures, colored by modalities (left) and cell types (right). i Comparison of the imputation results of scMODAL, MaxFuse and bindSC with the coding genes of proteins CD102 and CD244. j Comparison of scMODAL, MaxFuse and bindSC's protein imputation results using the Pearson correlation coefficient. k Correlation box plots for the imputation results of scMODAL, MaxFuse and bindSC and for protein coding genes, with the central lines marking the median values, the boxes showing the quartiles and the whiskers showing the rest of the distribution. 206 overlapping proteins filtered by all methods were used for comparison. l Gene-protein correlation network inferred using scMODAL's imputation result, centered at five monocyte-enriched proteins. Red and green lines indicate strong correlations, with red representing positive correlations ( > 0.8) and green representing negative correlations ( < − 0.8). Source data are provided as a Source Data file.

Among all compared methods, scMODAL, MaxFuse, bindSC, GLUE and Monae were specifically developed for integrating different single-cell modalities and can utilize dataset-unique features for integration, while Seurat and Portal only use linked features between modalities. We assessed the integration performance of these methods from three main aspects. First, a good integration method should mix the cell distributions well in its output. We used the mixing metric20 and k-nearest-neighbor batch-effect test (kBET)34 scores to indicate how well the datasets are mixed after integration. Second, distinct cell types should be kept separated after integration. Using two levels of cell-type annotations, we quantified how well distinct cell states remain separated, rather than being incorrectly mixed, using the average silhouette width (ASW). Third, we measured how accurately corresponding cell states are matched between modalities using label transfer accuracy with labels transferred from RNA cells to ADT cells, relative distance between ground truth paired cells (pair distance), and fraction of samples closer than true match (FOSCTTM)35. More details about the metrics are provided in the Methods section.

We first inspected the methods’ ability to mix cell distributions. As shown by the results (Fig. 2f and Supplementary Fig. 1), scMODAL achieved alignment performance comparable to cross-modality integration methods including MaxFuse, bindSC, GLUE and Monae, indicating its ability to remove strong cross-modality unwanted variation. Among the compared methods, scMODAL achieved the best integration accuracy. Notably, scMODAL had the highest label transfer accuracy scores among all methods, approximately 98% for level 1 annotation and 86% for level 2 annotation (Fig. 2e, f). Higher label transfer accuracy scores indicate that scMODAL is better at finding the correct correspondence between cell states across the RNA and ADT modalities. Meanwhile, scMODAL’s lower pair distance and FOSCTTM scores indicate that ground truth cell pairs have closer relative distances in its integrated embeddings compared to other methods. More importantly, scMODAL achieved significantly improved ASW scores compared to other methods, indicating its capability to preserve fine-grained cell populations. This result is consistent with our observation in the UMAP plots. As shown in the UMAP plots, only scMODAL successfully maintained natural killer (NK) cells, CD4 T cells and CD8 T cells as clearly separated clusters, while in the results of other methods, NK cells were often mixed with effector memory CD8 T (CD8 TEM) cells due to their similarity in the RNA modality (Supplementary Fig. 2). Among all compared methods, MaxFuse ranked second in preserving level 1 cell population structures but failed to preserve the difference between NK cells and CD8 TEM cells in protein abundance levels (Fig. 2h). The other methods also produced less satisfactory integration results. For example, bindSC did not preserve the distinction between NK cells and CD8 TEM cells, Monae mixed a cluster of monocytes with T cells, and GLUE inaccurately matched these NK cells, CD4 T cells, and CD8 T cells. Portal and Seurat did not integrate CD4 T cells well across modalities (Supplementary Fig. 1).

We also evaluated all methods using a reduced protein panel consisting of the 30 most informative proteins, a typical scenario in single-cell proteomic datasets. Even with this reduced feature set, scMODAL consistently demonstrated superior performance compared to other methods, highlighting its effectiveness in leveraging a limited number of linked features for precise cross-modality integration (Fig. 2g).

Using this dataset, we further assessed scMODAL’s capability to predict protein abundance levels for individual cells based on gene expression. We included MaxFuse and bindSC in this comparison, as they also support cross-modality imputation following integration of RNA and ADT modalities. Comparing predicted protein abundance levels with ground truth data, scMODAL consistently outperformed these two methods, showing higher correlations (Fig. 2j). On average, scMODAL achieved a correlation of 0.53, whereas MaxFuse and bindSC achieved 0.42 and 0.40, representing a relative improvement of 29% and 34% with scMODAL, respectively. In contrast, predictions based solely on protein-coding genes yielded an average correlation of 0.24 with ground truth protein abundances (Fig. 2k). Notably, bindSC’s predictions of relative protein abundances were not on the same scale as the ground truth measurements, whereas scMODAL reliably recovered protein abundances at a scale close to the ground truth with improved correlations, as evidenced by comparisons with bindSC predictions and corresponding protein-coding genes (Fig. 2i). These results underscore scMODAL’s proficiency in cross-modality feature imputation even in the absence of paired cells.

Using scMODAL’s predicted features, we were able to computationally generate cells simultaneously measured with features from different modalities, facilitating the identification of feature correlation networks. For instance, we examined the gene-protein correlation network inferred using cells from the RNA modality with imputed proteins. We used several monocyte-enriched proteins as an illustrative example to demonstrate this utility of scMODAL’s imputation. Several genes showed strong correlations with monocyte-enriched proteins, with correlation coefficients exceeding 0.8 or falling below -0.8. These genes are also found to be monocyte-enriched or -depleted genes (Supplementary Fig. 3). As an example, CD64 only exhibited a 0.54 correlation with its coding gene FCGR1A, but displayed strong correlations with many other genes (Fig. 2l). These findings illustrate scMODAL’s capability to suggest potential gene-protein interactions, thereby elucidating the intricate molecular dynamics within cells through the integration of scRNA-seq and proteomics datasets.
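
As an illustrative sketch of how such a network can be assembled from the imputation output (assumed workflow with placeholder names, not the exact code used here), one can correlate each gene with each imputed protein across RNA-modality cells and keep links with |r| > 0.8:

```python
import numpy as np

def gene_protein_links(gene_expr, imputed_protein, gene_names, protein_names, threshold=0.8):
    """gene_expr: cells x genes; imputed_protein: cells x proteins (same cells)."""
    g = (gene_expr - gene_expr.mean(0)) / (gene_expr.std(0) + 1e-8)   # column-standardize
    p = (imputed_protein - imputed_protein.mean(0)) / (imputed_protein.std(0) + 1e-8)
    corr = g.T @ p / g.shape[0]                                       # genes x proteins Pearson r
    strong = np.argwhere(np.abs(corr) > threshold)
    return [(gene_names[i], protein_names[j], float(corr[i, j])) for i, j in strong]
```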

In addition to the CITE-seq PBMC dataset, we utilized a human bone marrow dataset containing transcriptome-wide gene expression profiles and 97 surface protein markers measured via Ab-seq36 for benchmarking the integration performance of compared methods (Supplementary Fig. 4). Among the evaluated approaches, scMODAL achieved the highest performance metrics (Supplementary Fig. 5), highlighting its capability to integrate modalities with weakly connected features, such as transcriptomics and proteomics. For cell type matching accuracy, evaluated using label transfer accuracy, pair distance, and FOSCTTM, MaxFuse and bindSC demonstrated the second and third best performance, respectively. Although GLUE and Monae effectively mixed RNA and ADT cells, they misaligned cell types, indicating suboptimal performance in integrating RNA and ADT assays.

Benchmarking integration of proteomics datasets with limited shared features and tri-modality integration

To further demonstrate scMODAL’s effectiveness, we benchmarked scMODAL against other integration methods in two additional challenging scenarios: one where there are very few shared features, and another where datasets from multiple modalities with varying degrees of shared information are integrated.

In the first scenario, we benchmarked all methods using two human bone marrow single-cell proteomic datasets produced by two different technologies: a sequencing-based CITE-seq dataset20 and a mass cytometry-based cytometry by time of flight (CyTOF) dataset37. In addition to the technical variation between technologies, integrating proteomics datasets from different studies is further complicated by the use of different antibody panels with only a few overlapping markers, providing limited shared information. For instance, for the two datasets we used for benchmarking, the CITE-seq dataset includes 29 protein markers, while the CyTOF dataset includes 32 protein markers, with only 12 markers shared between them.

After applying scMODAL, the datasets were well-mixed in the integrated latent embeddings, as indicated by the UMAP plot and scMODAL’s strong scores in mixing performance (Fig. 3a, b and Supplementary Fig. 6). More importantly, the highest label transfer accuracy demonstrated scMODAL’s accuracy in finding correct cell state correspondences across datasets even with limited shared features. Bar plots of cell-type silhouette coefficients revealed that scMODAL produced the best grouping of cell types among all methods, showing its superior performance in preserving biological variations (Fig. 3d). We also conducted an ablation study using these datasets to investigate the functionality of each component in scMODAL. This study demonstrated that the adversarial learning objective significantly improves dataset mixing, MNN anchor regularization greatly aids in finding cell state correspondences, and dataset geometric structure regularization helps preserve biological variations by preventing over-correction of cell clusters. More details can be found in the Methods section and Supplementary Fig. 7.

Fig. 3: Benchmarking on integration with limited shared features using CITE-seq and CyTOF data and on tri-modality integration using TEA-seq data.
figure 3

a UMAP plots of integrated embeddings produced by scMODAL, colored by modalities (left) and cell types (right), for the integration of CITE-seq and CyTOF data. b Quantitative evaluations for the integration of CITE-seq and CyTOF data. c Quantitative evaluations for the tri-modality integration of TEA-seq data. d Bar plots of cell-type silhouette coefficients for individual cells, colored by cell type, in the integration of CITE-seq and CyTOF data. Higher values indicate better separation of different cell types. e UMAP plots of integrated embeddings produced by scMODAL, colored by modalities (left) and cell types (right), for the tri-modality integration of TEA-seq data. Source data are provided as a Source Data file.

In the second scenario, we used a human PBMCs dataset profiled by transcription, epitopes, and accessibility sequencing (TEA-seq)38, which includes 46 protein markers, to evaluate all methods. Specifically, TEA-seq simultaneously measures transcriptomics, epitopes, and chromatin accessibility from cells. This allows us to assess whether an integration method can achieve high-quality tri-modality integration. The challenge in this scenario lies in the higher degree of information sharing between RNA and ATAC modalities compared to RNA and protein39, requiring integration methods to be flexible and adaptive to handle the heterogeneity in cross-modality gaps.

As shown in the UMAP plots, scMODAL effectively integrated these modalities (Fig. 3e and Supplementary Fig. 8), successfully preserving distinct clusters for B cells, T cells, monocytes, and NK cells. scMODAL outperformed or matched the best evaluation metrics among all compared methods, indicating its superior overall integration performance (Fig. 3c). Notably, it achieved an RNA-to-ADT label transfer accuracy of 87% and an RNA-to-ATAC label transfer accuracy of 83%, making it the only method to achieve both accuracy scores higher than 70% (Supplementary Fig. 9). However, not all methods produced satisfactory results in this tri-modality integration task. The other integration methods had various shortcomings: MaxFuse failed to align different T cell subtypes well, leading to poor scores in mixing performance and cell-state matching accuracy. bindSC, Monae, Portal, and Seurat did not adequately maintain separation between cell types, resulting in a loss of biological information and low matching accuracy. Although GLUE showed good alignment of the RNA and ATAC modalities, it struggled with ADT modality integration, improperly aligning B cells with monocytes and mismatching different T cell subtypes. These results highlight scMODAL’s reliability in handling cross-modality integration tasks with varying degrees of shared variation across datasets.

Accurate integration of mouse brain scRNA-seq and scATAC-seq datasets enabling peak-gene regulatory inference

As a complex organ, the brain contains diverse cell types, including glial cells, endothelial cells, and numerous neuron subtypes. Integrating different single-cell modalities that measure brain cells is crucial for revealing intricate cell type-specific regulatory networks and pathways, as well as for studying disease mechanisms such as Alzheimer’s disease10. However, integrating single-cell brain datasets is challenging because it is difficult to preserve the nuanced differences between neuron subclusters. After validating scMODAL’s effectiveness in cross-modality integration through benchmarking studies, we applied scMODAL to integrate a scRNA-seq dataset40 and a scATAC-seq dataset obtained from the cortex of mouse brains, demonstrating its ability to achieve accurate integration and facilitate multimodal single-cell analysis in complex organs.

Before integration, different brain cell types in the datasets were annotated according to marker genes (Fig. 4a). After scMODAL’s integration, cells of the same cell type were correctly aligned in the latent space (Fig. 4b). To validate the integration accuracy in detail, we used the Louvain method41 to find fine-grained cell type clusters in the integrated cell embedding space (Fig. 4d). We identified 15 clusters in total, nine of which corresponded to different neuron subtypes. The clusters were then relabeled according to cell types. For each cluster, there were both RNA modality cells and ATAC modality cells. By comparing the similarity of these clusters using gene expression levels for RNA cells and gene activity scores for ATAC cells, we found that clusters from different modalities assigned the same Louvain cluster label tended to have higher similarity, as shown by the 2 × 2 blocks on the diagonal of the correlation matrix (Fig. 4c). This indicates that scMODAL correctly matched corresponding neuron subtypes after integration. We closely examined a major excitatory neuron population formed by excitatory neuron clusters 1, 3, and 7. As shown in Fig. 4e, f, many marker genes of mouse brain cortical layers, such as layer 2/3 enriched genes Lamp5, Otof and Stard8, layer 4 enriched genes Rorb, Rspo1 and Scnn1a, and layer 5/6 enriched genes Pcp4, Rxfp1 and Grik342,43,44,45,46,47,48, exhibit consistent differential expression patterns across these three clusters in the scRNA-seq and scATAC-seq data. Specifically, excitatory neuron clusters 1, 3 and 7 exhibited high layer 4, layer 2/3 and layer 5/6 marker gene expression, respectively. This result demonstrates that scMODAL correctly aligned detailed cortical neuron cell cluster structures across different modalities. Additionally, by mapping the Louvain cluster labels back to the original space, we found that scMODAL successfully preserved the neuron subtype cluster structures contained in the original datasets. For example, excitatory neuron clusters 2, 4, 5 and 6 form isolated clusters in the unintegrated datasets, unconnected from the major excitatory neuron population formed by clusters 1, 3 and 7 (Supplementary Fig. 10). This pattern is well preserved after scMODAL integration, demonstrating its ability to maintain the subtle similarities and differences among neuronal subtypes.

Fig. 4: Integration of mouse brain scRNA-seq and scATAC-seq datasets.
figure 4

a UMAP plots of unintegrated scRNA-seq and scATAC-seq datasets, colored by cell types. b UMAP plots of integrated embeddings produced by scMODAL, colored by modalities (left) and cell types (right). c Correlation heatmap of Louvain clusters computed based on the mean gene expression or gene activity profiles of the clusters. The dendrogram was computed with Scanpy80 to show the hierarchical clustering result of the Louvain clusters. d UMAP plots of integrated embeddings produced by scMODAL, colored by Louvain cluster labels. Mouse brain cortical layer marker gene expression levels in the RNA modality (e) and gene activity scores in the ATAC modality (f) in Louvain clusters 1, 3 and 7. Fam107a expression levels in the RNA modality, activity scores in the ATAC modality and imputed expression levels in the ATAC modality shown in UMAP (g) and heatmap (h). i Gene-peak links inferred using scMODAL's imputation result. Source data are provided as a Source Data file.

For comparison, we also applied MaxFuse, which ranked second in integration accuracy in our benchmarking studies, and GLUE, a state-of-the-art method for integrating scRNA-seq and scATAC-seq data, to integrate these two datasets. Compared to scMODAL, these two methods produced less accurate integration results in terms of cell-state matching (Supplementary Fig. 11). For example, in MaxFuse’s integration, a cluster of excitatory neurons and a cluster of inhibitory neurons were incorrectly aligned with each other.

Using scMODAL’s integration result, we performed gene expression imputation for the scATAC-seq data to generate virtual cells with simultaneous measurements of gene expression and chromatin accessibility. Interestingly, we found that the imputation results for some genes align with the gene expression patterns observed in the scRNA-seq data but differ from the gene activity scores in the scATAC-seq data. This discrepancy arises because gene score prediction methods using scATAC-seq data often assume that chromatin accessibility within the gene locus or nearby regions consistently contributes positively to the gene expression level, which may not reflect the true regulatory mechanisms. For example, consider the astrocyte marker gene Fam107a49. In the scRNA-seq dataset, Fam107a shows high expression exclusively in astrocytes and is depleted in other cell types (Fig. 4g, h). However, the gene activity scores produced by Signac12 infer Fam107a expression in oligodendrocytes, oligodendrocyte progenitor cells (OPC) and endothelial cells, likely due to chromatin accessibility peaks detected near the Fam107a gene (Supplementary Fig. 12). In contrast, scMODAL’s gene imputation results show Fam107a expression patterns that more closely resemble the scRNA-seq data, with clear enrichment only in astrocytes. We further investigated potential cis-regulatory interactions by calculating the correlation coefficients between the imputed gene expression and the accessibility of each peak within a 10 kb distance from Fam107a50. As shown in Fig. 4i, based on scMODAL’s imputation, only peaks highly accessible in astrocytes were inferred to be associated with Fam107a, providing a reduced candidate set of peaks that could potentially regulate the expression level of Fam107a in the brain. This analysis demonstrates scMODAL’s ability to provide insights into regulatory signatures using unpaired multi-omics single-cell datasets.

Integration of CODEX, scRNA-seq and scATAC-seq datasets facilitating spatial structure identification of B cell follicles in tonsil

As a recently developed technology, co-detection by indexing (CODEX) enables highly multiplexed and spatially resolved profiling of proteins within tissue sections at single-cell resolution6. This technique has been widely applied to study diverse immune microenvironments, such as those in lymph nodes51 and tumors52. However, to accurately characterize a specific immune microenvironment using single-cell spatial proteomics, it is essential to have a well-designed protein panel targeting specific cell types of interest. In this section, we show how integrating single-cell spatial proteomics data with other single-cell modalities using scMODAL can improve the spatial characterization of the microenvironment, even when the protein panel is not fully comprehensive. This integration is illustrated using a human tonsil CODEX dataset including 44 protein markers53, a tonsil scRNA-seq dataset54 and a tonsil scATAC-seq dataset55.

Using the CODEX tonsil section with the original cell-type annotation, we identified B cell follicle structures organized around interfollicular regions rich in T cells (Fig. 5c). In these interfollicular regions, T cells interact with B cells, facilitating the formation of germinal centers within the B cell follicles56. Within these germinal centers, mature B cells undergo activation, proliferation, differentiation, and diversify their antibody genes through somatic hypermutation57,58. However, the CODEX data protein panel did not include proliferation-associated markers such as Ki67, which is crucial for identifying proliferating germinal center B cells59, or marginal zone B cell markers like CD22 and CD4060. This limitation makes it challenging to identify B cell subtypes and fully characterize the structure of B cell follicles in the tonsil section (Supplementary Fig. 13).

Fig. 5: Integration of human tonsil CODEX, scRNA-seq and scATAC-seq datasets.
figure 5

a UMAP plots of unintegrated CODEX, scRNA-seq and scATAC-seq datasets, colored by cell types. b UMAP plots of integrated embeddings produced by scMODAL, colored by modalities (left) and cell types (right). c The CODEX section with the original cell-type annotation. d The CODEX section with transferred cell-type annotation. Dashed circles indicate regions showing clear B cell follicle structures. Spatial distributions of B-CD22-CD40 (e) and B-Ki67 cells (f). g Measured CCL4, SLC7A1, DHCR24 and RORA expression levels in the scRNA-seq dataset. Identified spatial distributions of B-CD22-CD40 cells (h) and B-Ki67 cells (i) in the 10x Visium tonsil sample using STitch3D cell-type deconvolution. j Distributions of spatial distances between cells of different cell types and the centers of gravity of germinal centers. k Predicted MKI67 spatial expression pattern in the CODEX dataset. l Measured MKI67 spatial expression pattern in the Visium dataset. The spatial signaling directions of the CCL4-SLC7A1 and DHCR24-RORA pathways inferred by COMMOT in the CODEX sample (m, n) and the Visium sample (o, p). Dashed circles in (o, p) highlight B cell follicles exhibiting signaling directions consistent with those in the CODEX sample. Source data are provided as a Source Data file.

Unlike the CODEX dataset, the scRNA-seq dataset clearly distinguishes between germinal center B cells (B-Ki67) and marginal zone B cells (B-CD22-CD40), as well as between CD4 and CD8 T cells (Fig. 5b). By integrating the CODEX, scRNA-seq and scATAC-seq tonsil datasets using scMODAL (Fig. 5a), we successfully transferred cell-type labels from the scRNA-seq data to the CODEX data and the scATAC-seq data. After this label transfer, we validated the results using the available protein panel and gene activity scores (Supplementary Fig. 14). The protein abundance confirmed that CD4 and CD8 T cells identified by scMODAL in the CODEX data were specifically enriched for CD4 and CD8, respectively, while all expressed the T cell marker CD3. Additionally, following the label transfer, B cells, dendritic cells (DCs), and plasma cells were enriched for their corresponding markers—CD20, CD11c, and CD138, respectively. Similarly, the gene activity scores of the corresponding genes in the scATAC-seq data exhibited consistent patterns across the transferred cell types. Importantly, the cell types identified with the transferred annotation displayed distinct spatial distribution patterns (Fig. 5d). For the B-CD22-CD40 subtype, we observed that these cells formed several hollow circles in the outer regions of B cell follicles, indicating the presence of marginal zones (Fig. 5e). Additionally, B-Ki67 cells were concentrated within the circles formed by marginal zone B cells, marking the spatial locations of germinal centers (Fig. 5f). Together, these two B cell subtypes, identified through transferred cell-type labels, revealed the spatial organization of B cell follicle structures. Using DBSCAN61, we identified six germinal centers using the spatial distribution of B-Ki67 cells (Supplementary Fig. 15) and calculated the minimum distance between each cell and the center of gravity of any germinal center. The distance distributions confirmed expected spatial patterns, with B-Ki67, B-CD22-CD40, and other cells showing increasing distances from the germinal centers (Fig. 5j). The identified B cell follicle structures align with the cell-type deconvolution results of a 10x Visium tonsil section55, which used the same scRNA-seq reference data and was analyzed with STitch3D62 (Fig. 5h, i, and Supplementary Fig. 16). This further supports the accuracy of scMODAL’s label transfer.

The gene expression levels predicted by scMODAL further enhance the spatial characterization of the tonsil section. For example, we imputed the expression level of MKI67, the gene encoding Ki67 (Fig. 5k). Although Ki67 abundance was not measured in the original CODEX dataset, the imputed MKI67 expression accurately captured B cell dynamics. Specifically, imputed MKI67 showed high expression in germinal centers with a decreasing gradient from the inner to outer B cell follicles, reflecting the spatial specificity of B cell proliferation. The imputed spatial gene expression pattern is consistent with the measured MKI67 expression levels in the Visium sample, where MKI67 is concentrated in B-Ki67-enriched regions (Fig. 5l).

Leveraging scMODAL’s imputed gene expression levels in the CODEX tonsil section, we further applied COMMOT63 to analyze cell-cell communication within the tonsil immune microenvironment, using ligand-receptor information from the CellPhoneDB database64. For instance, the CCL4-SLC7A1 interaction, which has been used to study immune cell communication pathways65, was explored. In the tonsil scRNA-seq dataset, CCL4 was enriched in CD8 T cells, while SLC7A1 was enriched in B-Ki67 cells (Fig. 5g). This interaction was identified between germinal center B cells and interfollicular T cells, suggesting a potential B cell-T cell communication pathway in the immune response (Fig. 5m). Additionally, we identified other spatial cell-cell communication pathways, such as DHCR24-RORA signaling between B cells and T cells (Fig. 5n). We further validated the inferred signaling directions using the Visium sample. Due to the high sparsity of gene expression levels in this dataset, we applied STitch3D to denoise the data, leveraging information from the scRNA-seq reference to infer spatial cell-cell communication (Supplementary Fig. 17). As shown in Fig. 5o, p, near multiple B cell follicles (highlighted by circles), the inferred CCL4-SLC7A1 and DHCR24-RORA signaling pathways exhibited directional consistency with our findings from the CODEX section. These findings demonstrate scMODAL’s capability in facilitating spatial multi-omics analysis.

Discussion

In this study, we introduced scMODAL, a novel deep learning framework designed for the integration of single-cell multi-omics data, specifically addressing the challenges associated with datasets that have limited numbers of known correlated features. Our results demonstrate that scMODAL effectively aligns cell embeddings across different modalities, preserves the biological variation, and accurately identifies cell subpopulations. Moreover, scMODAL excels in tasks such as cross-modality feature imputation and inferring feature relationships, which are critical for understanding the underlying cellular processes.

Compared to existing integration methods, scMODAL offers distinct advantages. While methods like MaxFuse and bindSC have shown efficacy in integrating modalities with weak relationships, they rely on linear projections that may not fully capture the complex, nonlinear nature of unwanted variations present in multi-omics datasets. scMODAL addresses this limitation by employing nonlinear neural networks and GANs to align cell embeddings, ensuring that the integration process retains the intrinsic biological structure of the data. scMODAL also has advantages over current deep learning-based integration tools. Many deep learning-based methods integrate datasets based only on shared features, such as scVI22, scANVI66, VIPCCA67, SCALEX68, iMAP69 and Portal, disregarding valuable unshared features. Recent efforts have incorporated unshared features into single-cell multi-omics integration. Methods such as totalVI70, MultiVI71, CLUE72, MIDAS73, scButterfly74, and SpatialGLUE75 attempt to integrate multi-omics data by leveraging autoencoders on joint-profiled datasets but require (partially) paired cells across modalities. To address cross-modality integration using unpaired single-cell data, alternative approaches leveraging autoencoders and GANs have been proposed. For instance, scMMGAN76 employs GANs to learn multi-modal mappings. However, it integrates datasets in a paired manner and does not provide a shared latent space that captures a unified view across all modalities. Other methods, including GLUE, CoVEL77, and Monae, incorporate prior knowledge of cross-modality interactions, mainly between genes and epigenomic profiles such as ATAC peaks, by linking them in a knowledge-based graph. However, our benchmarking experiments indicate that the graph-variational autoencoder-based approach for incorporating prior knowledge of cross-modality interactions in these methods performs less effectively when leveraging prior interaction information between weakly linked modalities, such as RNA and protein.

Unlike these methods, scMODAL has unique designs for robustly integrating single-cell datasets across modalities with varying feature connections. Instead of relying on knowledge graph-based approaches like those used in GLUE, CoVEL, and Monae to incorporate prior information on feature relationships, scMODAL leverages linked features from prior knowledge to construct MNN pairs, which serve as potential anchors for GAN alignment. By minimizing the latent distance between these MNN pairs as a regularizer during GAN training, scMODAL enables a more flexible utilization of prior feature link information. This approach reduces dependence on the strength of cross-modality feature links while ensuring effective latent space mixing, enabling scMODAL to integrate not only strongly linked modalities such as RNA and ATAC, but also weakly linked modalities such as RNA and protein. Additionally, scMODAL incorporates an extra regularizer to preserve the structures of the original datasets in the joint latent space, maintaining relative distances between cells. The benchmarking results on a collection of datasets in different scenarios highlight scMODAL’s superiority in mixing cell distributions, maintaining cell-type separations, and accurately matching corresponding cell states across modalities.

The ability of scMODAL to preserve biological variation while integrating multi-omics data has significant implications for the study of complex cellular processes. For instance, its capacity to accurately identify cell subpopulations that were not distinguishable with individual modalities suggests that scMODAL could be instrumental in uncovering new cell types or states. Additionally, the feature imputation capabilities of scMODAL could facilitate the discovery of novel gene regulatory networks and pathways that are otherwise obscured in single-modality analyses. The gene-protein and gene-peak link inference, along with the discovery of spatial cell-cell communication patterns using scMODAL’s imputation results, exemplify the practical utility of this functionality.

Despite its advantages, scMODAL has certain limitations. For example, the reliance on known linked features for integration, although effective, may limit its applicability to scenarios where such features are not well-characterized or absent. Future work could explore the incorporation of unsupervised learning techniques to identify potential links between modalities, thereby broadening the applicability of scMODAL.

In conclusion, scMODAL represents a significant advancement in the field of single-cell multi-omics data integration. By leveraging deep learning techniques, it addresses the critical challenges of cross-modality integration, offering a robust tool for researchers to explore the complex interplay between different cellular components. As single-cell technologies continue to evolve, frameworks like scMODAL will be indispensable in translating multi-omics data into actionable biological insights.

Methods

The model of scMODAL

Let \({{{{\bf{X}}}}}_{1}\in {{\mathbb{R}}}^{{n}_{1}\times {p}_{1}}\) and \({{{{\bf{X}}}}}_{2}\in {{\mathbb{R}}}^{{n}_{2}\times {p}_{2}}\) be the matrices representing single-cell features from two different modalities. As features in the two modalities are usually not shared, prior knowledge about the cross-modality feature relationships is required for finding correspondence between modalities. We construct a new pair of matrices \({\widetilde{{{{\bf{X}}}}}}_{1}\in {{\mathbb{R}}}^{{n}_{1}\times s}\) and \({\widetilde{{{{\bf{X}}}}}}_{2}\in {{\mathbb{R}}}^{{n}_{2}\times s}\), with s pairs of features likely to positively correlate with each other based on prior information. For the integration of proteomic and scRNA-seq data, we let each pair be the abundance of a protein and the expression level of its coding gene. When integrating scRNA-seq and scATAC-seq data, we used gene expression levels and gene activity scores of shared highly variable genes as feature pairs.

Aligning different modalities using generative adversarial learning

To integrate datasets while preserving the biological information contained in all highly variable features, we introduce a shared latent space Z and encode information into Z with neural networks. Denoting the encoders as E1( ) and E2( ), our goal is to integrate the cell embedding distributions E1(x1) and E2(x2) in Z, where x1 and x2 represent cells from X1 and X2, respectively. We apply generative adversarial learning to align the empirical distributions of E1(x1) and E2(x2) in Z, borrowing the idea from generative adversarial networks (GANs)31. Specifically, we use an auxiliary network D( ): Z → (0, 1) as the discriminator to distinguish cell embeddings from the two datasets by maximizing the following objective:

$${{{{\mathcal{L}}}}}_{{{{\rm{GAN}}}}}={\mathbb{E}}\left[\log D({E}_{1}({{{{\bf{x}}}}}_{1}))+\log (1-D({E}_{2}({{{{\bf{x}}}}}_{2})))\right].$$
(1)

The encoders are trained against the discriminator by minimizing \({{{{\mathcal{L}}}}}_{{{{\rm{GAN}}}}}\), which is equivalent to minimizing the Jensen-Shannon (JS) divergence between the distributions of E1(x1) and E2(x2)31. This process is represented by the minimax optimization formula \({\min }_{{E}_{1},{E}_{2}}{\max }_{D}{{{{\mathcal{L}}}}}_{{{{\rm{GAN}}}}}\).
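
To make the adversarial objective concrete, a minimal PyTorch sketch is shown below. The multilayer-perceptron architecture, hidden width, and example feature dimensions are assumptions for illustration and are not the exact scMODAL implementation.

```python
import torch
import torch.nn as nn

def mlp(d_in, d_out, hidden=256):
    # simple two-layer MLP used throughout these sketches (assumed architecture)
    return nn.Sequential(nn.Linear(d_in, hidden), nn.ReLU(), nn.Linear(hidden, d_out))

p1, p2, q = 3000, 228, 20                   # example feature dimensions and latent size
E1, E2 = mlp(p1, q), mlp(p2, q)             # encoders mapping each modality into Z
D = nn.Sequential(mlp(q, 1), nn.Sigmoid())  # discriminator D: Z -> (0, 1)

def gan_loss(x1, x2):
    # L_GAN = E[log D(E1(x1)) + log(1 - D(E2(x2)))], Eq. (1)
    z1, z2 = E1(x1), E2(x2)
    return torch.log(D(z1) + 1e-8).mean() + torch.log(1 - D(z2) + 1e-8).mean()

# D is updated to maximize gan_loss; E1 and E2 are updated to minimize it.
```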

Regularization for within- and cross-domain autoencoding consistency

Two decoders, denoted as G1( ) and G2( ), are introduced and trained together to ensure within- and cross-domain autoencoding consistency by minimizing the autoencoder loss:

$${{{{\mathcal{L}}}}}_{{{{\rm{AE}}}}}={\sum}_{s,t=1,2,s\ne t}{\mathbb{E}}\left[\frac{1}{{p}_{s}}\parallel {{{{\bf{x}}}}}_{s}-{G}_{s}({E}_{s}({{{{\bf{x}}}}}_{s})){\parallel }^{2}+\frac{1}{q}\parallel {E}_{s}({{{{\bf{x}}}}}_{s})-{E}_{t}({G}_{t}({E}_{s}({{{{\bf{x}}}}}_{s}))){\parallel }^{2}\right],$$
(2)

where q is the dimensionality of Z.
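
A corresponding sketch of this autoencoding-consistency loss, reusing the encoders from the sketch above together with assumed MLP decoders, is given below; it mirrors Eq. (2) term by term but is not the released implementation.

```python
G1, G2 = mlp(q, p1), mlp(q, p2)   # decoders mapping Z back to each modality's features

def ae_loss(x1, x2):
    z1, z2 = E1(x1), E2(x2)
    # within-domain reconstruction: (1/p_s) ||x_s - G_s(E_s(x_s))||^2
    within = ((x1 - G1(z1)) ** 2).mean(dim=1).mean() + ((x2 - G2(z2)) ** 2).mean(dim=1).mean()
    # cross-domain consistency: (1/q) ||E_s(x_s) - E_t(G_t(E_s(x_s)))||^2
    cross = ((z1 - E2(G2(z1))) ** 2).mean(dim=1).mean() + ((z2 - E1(G1(z2))) ** 2).mean(dim=1).mean()
    return within + cross
```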

Regularization for aligning anchors with prior feature linkage information

Using generative adversarial learning to align distributions without constraints can result in incorrect matching of cell populations. To learn accurate integration results, we utilize similarity information in linked features to guide integration. Specifically, during minibatch training with two minibatches from the two modalities, for each cell in one minibatch, we find its k nearest neighbors among the cells in the other minibatch by comparing the angular distance between the corresponding linked features in \({\widetilde{{{{\bf{X}}}}}}_{1}\) and \({\widetilde{{{{\bf{X}}}}}}_{2}\), and vice versa. This procedure gives us mutual nearest neighbor pairs \({\{({{{{\bf{x}}}}}_{1}^{(m)},{{{{\bf{x}}}}}_{2}^{(m)})\}}_{m=1}^{M}\), serving as anchors for integration. For these pairs, we encourage their embeddings to be close to each other by minimizing the objective:

$${{{{\mathcal{L}}}}}_{{{{\rm{Anchor}}}}}=\frac{1}{q}{\sum}_{m=1}^{M}{\left\Vert {E}_{1}\left({{{{\bf{x}}}}}_{1}^{(m)}\right)-{E}_{2}\left({{{{\bf{x}}}}}_{2}^{(m)}\right)\right\Vert }^{2}.$$
(3)
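
The sketch below illustrates one way to compute this anchor regularizer within a minibatch, assuming cosine similarity on the linked features and the encoders defined above; the pair-finding details of the actual scMODAL code may differ.

```python
import torch
import torch.nn.functional as F

def mnn_anchor_loss(x1, x2, x1_link, x2_link, k=30):
    """x1_link, x2_link: minibatch rows of the linked-feature matrices (B x s)."""
    sim = F.normalize(x1_link, dim=1) @ F.normalize(x2_link, dim=1).T  # cosine similarities
    nn12 = sim.topk(k, dim=1).indices      # k nearest batch-2 cells for each batch-1 cell
    nn21 = sim.topk(k, dim=0).indices.T    # k nearest batch-1 cells for each batch-2 cell
    pairs = [(i, int(j)) for i in range(sim.shape[0]) for j in nn12[i]
             if i in nn21[int(j)].tolist()]  # keep mutual nearest neighbors only
    if not pairs:
        return x1.new_zeros(())
    i_idx = torch.tensor([i for i, _ in pairs])
    j_idx = torch.tensor([j for _, j in pairs])
    z1, z2 = E1(x1), E2(x2)
    return ((z1[i_idx] - z2[j_idx]) ** 2).mean()  # squared distance between anchor embeddings
```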

Regularization for data structure preservation

To avoid loss of information contained in dataset-unique features, we propose to preserve the geometric structure of each dataset by regularizing the geometric representations of cells. To be specific, for a cell \({{{{\bf{x}}}}}_{1}^{b}\) from X1 in a minibatch, we calculate the Gaussian kernel distance with all cells in the batch as its geometric representation:

$${{{{\bf{k}}}}}_{1}^{b}=\left[\exp \left(-{\left\Vert {{{{\bf{x}}}}}_{1}^{1}-{{{{\bf{x}}}}}_{1}^{b}\right\Vert }^{2}/2{p}_{1}\right),\cdots \,,\exp \left(-{\left\Vert {{{{\bf{x}}}}}_{1}^{B}-{{{{\bf{x}}}}}_{1}^{b}\right\Vert }^{2}/2{p}_{1}\right)\right]\in {{\mathbb{R}}}^{B}.$$
(4)

The geometric representation is also calculated for the cell representation of \({{{{\bf{x}}}}}_{1}^{b}\) in Z as

$${\hat{{{{\bf{k}}}}}}_{1}^{b}=\left[\exp \left(-{\left\Vert {E}_{1}\left({{{{\bf{x}}}}}_{1}^{1}\right)-{E}_{1}\left({{{{\bf{x}}}}}_{1}^{b}\right)\right\Vert }^{2}/2q\right),\cdots \,,\exp \left(-{\left\Vert {E}_{1}\left({{{{\bf{x}}}}}_{1}^{B}\right)-{E}_{1}\left({{{{\bf{x}}}}}_{1}^{b}\right)\right\Vert }^{2}/2q\right)\right]\in {{\mathbb{R}}}^{B}.$$
(5)

Similarly, we define the geometric representations of cells from X2 in the other minibatch. The geometric representation of a cell indicates its relative distances from other cells, computed with all variable features. We use a geometric structure regularization to preserve this information by minimizing the objective:

$${{{{\mathcal{L}}}}}_{{{{\rm{Geo}}}}}=-\frac{1}{B}{\sum }_{b=1}^{B}\left[\min \left({{{\rm{Cosine}}}}\left({{{{\bf{k}}}}}_{1}^{b},{\hat{{{{\bf{k}}}}}}_{1}^{b}\right),{k}_{{{{\rm{th}}}}}\right)+\min \left({{{\rm{Cosine}}}}\left({{{{\bf{k}}}}}_{2}^{b},{\hat{{{{\bf{k}}}}}}_{2}^{b}\right),{k}_{{{{\rm{th}}}}}\right)\right],$$
(6)

where kth = 0.975 is a fixed threshold.
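
A minimal sketch of this regularizer, combining Eqs. (4)-(6) for one dataset's minibatch, is shown below; it follows the conventions of the earlier PyTorch sketches and is illustrative rather than the exact scMODAL code.

```python
import torch
import torch.nn.functional as F

def geometric_loss(x, z, k_th=0.975):
    p, q = x.shape[1], z.shape[1]
    k_x = torch.exp(-torch.cdist(x, x) ** 2 / (2 * p))  # B x B kernel in feature space, Eq. (4)
    k_z = torch.exp(-torch.cdist(z, z) ** 2 / (2 * q))  # B x B kernel in latent space, Eq. (5)
    cos = F.cosine_similarity(k_x, k_z, dim=1)          # per-cell similarity of the two representations
    return -torch.clamp(cos, max=k_th).mean()           # Eq. (6): push cosine up to the threshold
```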

Training procedure

To integrate cross-modality datasets with correctly matched cell states while preserving important biological variation, we train the networks by considering the generative adversarial learning objective and other regularizers jointly in the following mini-max optimization formula

$${\min }_{{E}_{k},{G}_{k}}{\max }_{D}{{{{\mathcal{L}}}}}_{{{{\rm{GAN}}}}}+{\lambda }_{{{{\rm{AE}}}}}{{{{\mathcal{L}}}}}_{{{{\rm{AE}}}}}+{\lambda }_{{{{\rm{Anchor}}}}}{{{{\mathcal{L}}}}}_{{{{\rm{Anchor}}}}}+{\lambda }_{{{{\rm{Geo}}}}}{{{{\mathcal{L}}}}}_{{{{\rm{Geo}}}}},$$
(7)

where λAE, λAnchor and λGeo are coefficients for the regularizers. During training, the neural networks in scMODAL are updated iteratively to solve the mini-max problem. Once training is finished, cell embeddings in Z serve as integrated representations for further downstream tasks. In addition, G2(E1( )) and G1(E2( )) can be used to predict unmeasured features across modalities.
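
Putting the pieces together, one training iteration of the mini-max problem in Eq. (7) might look like the sketch below, using the loss functions sketched above; optimizer settings follow the training details reported in the next subsection, while the overall structure is an illustrative assumption rather than the exact scMODAL code.

```python
opt_EG = torch.optim.Adam(
    list(E1.parameters()) + list(E2.parameters()) +
    list(G1.parameters()) + list(G2.parameters()),
    lr=1e-3, betas=(0.9, 0.999), weight_decay=1e-3)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3, betas=(0.9, 0.999), weight_decay=1e-3)

def train_step(x1, x2, x1_link, x2_link, lam_ae=10.0, lam_anchor=1.0, lam_geo=1.0):
    # (i) discriminator step: ascend L_GAN
    opt_D.zero_grad()
    (-gan_loss(x1, x2)).backward()
    opt_D.step()
    # (ii) encoder/decoder step: descend the full objective of Eq. (7)
    opt_EG.zero_grad()
    loss = (gan_loss(x1, x2)
            + lam_ae * ae_loss(x1, x2)
            + lam_anchor * mnn_anchor_loss(x1, x2, x1_link, x2_link)
            + lam_geo * (geometric_loss(x1, E1(x1)) + geometric_loss(x2, E2(x2))))
    loss.backward()
    opt_EG.step()
    return float(loss)

# After training, G2(E1(x1)) imputes modality-2 features for modality-1 cells, and vice versa.
```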

Analysis details

Integration of multiple datasets

Benefiting from scalable neural network training, scMODAL can also be used for integrating multiple multi-omics datasets. When there are more than two datasets to be integrated (denoted as \({{{{\bf{X}}}}}_{l}\in {{\mathbb{R}}}^{{n}_{l}\times {p}_{l}},l=1,2,\cdots \,,L\)), scMODAL handles the integration task by introducing L − 1 discriminators to align dataset pairs (Xl, Xl+1), l = 1, 2, …, L − 1 in the latent space Z. The regularizers for cross-domain autoencoding consistency and for aligning anchors with prior feature linkage information are also extended accordingly for the dataset pairs (Xl, Xl+1).

For integrating multiple modalities, we found that scMODAL is robust to the order in which modalities are processed. For instance, in the tri-modality integration task using the TEA-seq dataset, we explored the relationships between ADT and RNA, as well as between RNA and ATAC, setting the integration order as (X1, X2, X3) = (XADT, XRNA, XATAC). We also evaluated scMODAL’s performance with the orders (X1, X2, X3) = (XADT, XATAC, XRNA) and (X1, X2, X3) = (XRNA, XADT, XATAC), with different modalities serving as the bridge between the other two. Overall, scMODAL demonstrated consistently strong integration performance regardless of the modality order (Supplementary Fig. 18).

Model training details

scMODAL employs the Adam optimizer78 for stochastic optimization during model training. By default, we trained scMODAL for a fixed 10,000 steps. With a batch size of B = 500 per dataset, this training schedule allows sampling of five million cells from each dataset, ensuring comprehensive coverage of the data distribution. This approach is comparable to the heuristic used by scVI22, in which the authors noted that “bigger datasets require fewer epochs”. In scVI, the maximum number of training epochs is set as \(\min [{{{\rm{round}}}}(20000/{n}_{{{{\rm{cells}}}}}\times 400),400]\) where ncells is the total number of cells, resulting in a nearly fixed number of training iterations for datasets with more than 20,000 cells. We train scMODAL with a learning rate of lr = 0.001, coefficients for running averages (β1, β2) = (0.9, 0.999), and a weight decay parameter of λ = 0.001 across all networks. The latent space dimensionality is set to q = 20 and the neighborhood size is set to k = 30 for identifying MNNs. The regularization parameters are λAE = 10.0, λAnchor = 1.0, and λGeo = 1.0.
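
For reference, the scVI-style heuristic quoted above and scMODAL's fixed-step schedule can be contrasted in a small illustrative snippet (not library code):

```python
def scvi_max_epochs(n_cells: int) -> int:
    # min[round(20000 / n_cells * 400), 400]: roughly constant iteration budget beyond 20,000 cells
    return min(round(20000 / n_cells * 400), 400)

# scMODAL instead fixes 10,000 training steps with B = 500 cells per dataset per step,
# i.e., 10,000 * 500 = 5,000,000 cell samples drawn from each dataset.
```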

Computational time and memory usage

We evaluated the computational time and memory usage of all methods using the CITE-seq PBMC dataset32 with different sample sizes. For the benchmarking of computational time and memory usage, we applied all methods on the same Linux server with Intel Xeon Gold 5222 CPUs. For methods that require GPUs, including scMODAL, GLUE, Monae and Portal, a single NVIDIA RTX 5000 GPU was used in all the experiments. To focus only on the integration algorithms, we recorded the running time and memory usage after standard data preprocessing such as normalization, scaling and dimension reduction. As illustrated in Supplementary Fig. 19, Monae, bindSC and Seurat were unable to complete the integration of datasets with 100,000, 200,000 and 300,000 cells, respectively, due to their peak memory usage exceeding the 160 GB limit. Unlike these methods, scMODAL demonstrates efficient memory usage, allowing it to handle large datasets without exceeding memory limits. Moreover, as dataset size increases, scMODAL demonstrates faster running times compared to MaxFuse and GLUE, highlighting its training efficiency. As shown in Supplementary Fig. 20, scMODAL demonstrates effective training and achieves strong integration of the RNA and ADT modalities across a total of 322,318 cells with a short computational time.

Ablation study

We investigated the functionalities of different components in scMODAL’s model using the CITE-seq and the CyTOF human bone marrow datasets. As shown in Supplementary Fig. 7, we observed that removing the GAN objective from the loss function led to less well-mixed cell distributions, as evidenced by a higher mixing metric and a lower kBET metric. Additionally, removing the regularization for autoencoding consistency resulted in less accurate cell-state matching, reflected in a decrease in label transfer accuracy. This also led to poorer preservation of biological variation, as indicated by a lower ASW score. When the regularization for aligning MNN anchors was removed, nearly all cell states were incorrectly matched, with label transfer accuracy approaching 0, indicating that cells were aligned with others of different cell type labels. Furthermore, removing the regularization for data structure preservation caused a decrease in the ASW score, suggesting a decline in the preservation of cell-type cluster information.

Evaluation metrics

We used the mixing metric20 and k-nearest-neighbor batch-effect test (kBET)34 to assess the ability of unwanted variation removal. Besides, we used the average silhouette width (ASW) to evaluate the preservation of biological variation, and label transfer accuracy, pair distance, and fraction of samples closer than true match (FOSCTTM)35 to measure the correctness of cell-state matching in the integration results.

Mixing metric

For each cell, within its k = 300 nearest neighbors, the rank at which the fifth neighbor from each dataset appears is calculated. The mixing metric is then obtained by taking the median of these ranks over all datasets and then averaging over all cells. A lower mixing metric indicates better mixing of the datasets.
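
A minimal sketch of this metric, an assumed re-implementation of the description above (analogous to Seurat's MixingMetric), is given below; variable names are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mixing_metric(emb, batch_labels, k=300, n_within=5):
    """emb: cells x q integrated embeddings; batch_labels: dataset label per cell."""
    _, idx = NearestNeighbors(n_neighbors=k).fit(emb).kneighbors(emb)
    datasets = np.unique(batch_labels)
    scores = []
    for neighbors in idx:                               # k nearest neighbors of one cell
        ranks = []
        for d in datasets:
            hits = np.where(batch_labels[neighbors] == d)[0]
            # rank of the 5th neighbor from dataset d; default to k if fewer than 5 are found
            ranks.append(hits[n_within - 1] + 1 if len(hits) >= n_within else k)
        scores.append(np.median(ranks))
    return float(np.mean(scores))
```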

kBET

kBET uses a Pearson’s χ2-based test to evaluate whether the distribution of batch labels in the neighborhood of a cell matches the overall distribution of batch labels. In our experiments, we ran 100 replicates, each with 1000 randomly selected samples, and we used the median of the average acceptance rates as the final output. A higher kBET indicates better mixing of the datasets.

ASW

For each cell, its silhouette width is defined as \((b-a)/\max (a,b)\), where a represents the mean distance between the cell and other cells within the same cluster, and b represents the mean distance between the cell and other cells from the nearest cluster that the cell does not belong to. The ASW score is then the average of silhouette widths over all cells. A higher ASW indicates a better preservation of clustering structures in the integration results.
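
In practice this can be computed with scikit-learn's silhouette_score on the integrated embeddings and cell-type labels (assumed usage; emb and cell_types are placeholder names).

```python
from sklearn.metrics import silhouette_score

asw = silhouette_score(emb, cell_types)  # higher = cell-type clusters better preserved
```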

Label transfer accuracy

In the integrated cell embedding space, we transfer labels from one dataset to another based on the nearest neighbor using the Euclidean distance. We then evaluate the proportion of correctly transferred labels as the label transfer accuracy. A higher label transfer accuracy indicates a more accurate matching of corresponding cell states.
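
A short sketch of this procedure using a 1-nearest-neighbor classifier (assumed helper with placeholder names):

```python
from sklearn.neighbors import KNeighborsClassifier

def label_transfer_accuracy(ref_emb, ref_labels, query_emb, query_labels):
    clf = KNeighborsClassifier(n_neighbors=1).fit(ref_emb, ref_labels)
    return float((clf.predict(query_emb) == query_labels).mean())
```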

Pair distance

This metric is evaluated with single-cell multi-omics datasets with simultaneously measured features from different modalities. Given the ground truth pairs of embeddings from different modalities, say \({{{{\bf{z}}}}}_{1}^{i}={E}_{1}({{{{\bf{x}}}}}_{1}^{i})\) and \({{{{\bf{z}}}}}_{2}^{i}={E}_{2}({{{{\bf{x}}}}}_{2}^{i})\), the pair distance is a relative distance defined as \(\frac{n}{2}(\parallel {{{{\bf{z}}}}}_{1}^{i}-{{{{\bf{z}}}}}_{2}^{i}\parallel /\mathop{\sum }_{j=1}^{n}\parallel {{{{\bf{z}}}}}_{1}^{i}-{{{{\bf{z}}}}}_{2}^{\,j}\parallel+\parallel {{{{\bf{z}}}}}_{1}^{i}-{{{{\bf{z}}}}}_{2}^{i}\parallel /\mathop{\sum }_{j=1}^{n}\parallel {{{{\bf{z}}}}}_{1}^{\,j}-{{{{\bf{z}}}}}_{2}^{i}\parallel )\), where n is the total number of cells. The final score is then the average of pair distances over all cells. A lower pair distance indicates the ground truth pairs from different modalities are better matched after integration.
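
The same quantity can be sketched directly from the pairwise distance matrix of the matched embeddings (assumed helper; z1 and z2 are n x q arrays whose rows are matched):

```python
import numpy as np
from scipy.spatial.distance import cdist

def pair_distance(z1, z2):
    d = cdist(z1, z2)                     # d[i, j] = ||z1_i - z2_j||
    true = np.diag(d)                     # distances between ground-truth pairs
    n = d.shape[0]
    per_cell = 0.5 * n * (true / d.sum(axis=1) + true / d.sum(axis=0))
    return float(per_cell.mean())
```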

FOSCTTM

Given the ground truth pairs of embeddings \({{{{\bf{z}}}}}_{1}^{i}\) and \({{{{\bf{z}}}}}_{2}^{i}\), FOSCTTM is computed as \(\frac{1}{2{n}^{2}}\mathop{\sum }_{i=1}^{n}\mathop{\sum }_{j=1}^{n}[{\mathbb{1}}(\parallel {{{{\bf{z}}}}}_{1}^{i}-{{{{\bf{z}}}}}_{2}^{\,j}\parallel < \parallel {{{{\bf{z}}}}}_{1}^{i}-{{{{\bf{z}}}}}_{2}^{i}\parallel )+{\mathbb{1}}(\parallel {{{{\bf{z}}}}}_{1}^{\,j}-{{{{\bf{z}}}}}_{2}^{i}\parallel < \parallel {{{{\bf{z}}}}}_{1}^{i}-{{{{\bf{z}}}}}_{2}^{i}\parallel )]\). A lower FOSCTTM means that the ground truth pairs have closer distances, indicating a better integration result.
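
An equivalent sketch in NumPy, following the same conventions as the pair-distance sketch above (assumed helper):

```python
def foscttm(z1, z2):
    d = cdist(z1, z2)
    true = np.diag(d)
    frac1 = (d < true[:, None]).mean(axis=1)  # fraction of z2 cells closer to z1_i than its true match
    frac2 = (d < true[None, :]).mean(axis=0)  # fraction of z1 cells closer to z2_i than its true match
    return float((frac1.mean() + frac2.mean()) / 2)
```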

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.