Abstract
Self-supervised learning (SSL) has emerged as a powerful method for extracting meaningful representations from vast, unlabelled datasets, transforming computer vision and natural language processing. In single-cell genomics (SCG), representation learning offers insights into the complex biological data, especially with emerging foundation models. However, identifying scenarios in SCG where SSL outperforms traditional learning methods remains a nuanced challenge. Furthermore, selecting the most effective pretext tasks within the SSL framework for SCG is a critical yet unresolved question. Here we address this gap by adapting and benchmarking SSL methods in SCG, including masked autoencoders with multiple masking strategies and contrastive learning methods. Models trained on over 20 million cells were examined across multiple downstream tasks, including cell-type prediction, gene-expression reconstruction, cross-modality prediction and data integration. Our empirical analyses underscore the nuanced role of SSL, namely, in transfer learning scenarios leveraging auxiliary data or analysing unseen datasets. Masked autoencoders excel over contrastive methods in SCG, diverging from computer vision trends. Moreover, our findings reveal the notable capabilities of SSL in zero-shot settings and its potential in cross-modality prediction and data integration. In summary, we study SSL methods in SCG on fully connected networks and benchmark their utility across key representation learning scenarios.
Main
Single-cell genomics (SCG) has rapidly expanded into a big-data domain, primarily driven by advancements in single-cell RNA-sequencing technologies1. This expansion has shifted the focus from analysing data in isolated studies to using machine learning models for interpreting data within the context of existing datasets2. Recent efforts towards comprehensive atlases, such as the Human Cell Atlas3, underscore this development. However, larger datasets introduce additional methodological challenges, such as technical batch effects across studies and the variability in labelling quality4,5. Large-scale models have emerged and gained interest for their potential to address these issues6. Yet a gap remains in understanding their use cases and how to effectively leverage the emerging datasets comprising millions of cells7. The SCG field now requires not only computational power but also the strategic use of methods that handle the complexities of big data. In this context, self-supervised learning (SSL) is a promising approach. SSL leverages pairwise relationships within data X for training, setting it apart from supervised learning, which relies on data X with labels Y to guide the loss, and unsupervised learning, which depends solely on data X (refs. 8,9,10). It has proven powerful in other data-intensive domains, such as computer vision11,12 and natural language processing13,14, leveraging large unlabelled datasets. It is thus often the basis for foundation models15.
SSL has already begun to impact SCG on small and large scales. On small scales, specialized SSL methods have deployed contrastive losses, tailored with techniques such as multimodal learning16, graph-based strategies17 and clustering-based approaches18,19,20 to embed cells. The contrastive methods address unique data challenges in SCG, including batch effects and data sparsity18,19,21,22,23,24,25,26,27. Other specialized SSL methods predict blood cell traits28, identify subpopulations of T cells29, boost active learning30 and classify cell types on the whole mouse brain31, indicating the method’s versatility. However, a common limitation among these approaches is their application to relatively small datasets or specific problems, resulting in limited generalizability across downstream tasks. On large scales, foundation models are trained on large datasets and applied to a broad range of tasks. In SCG, they often deploy transformers trained in a supervised32,33 and self-supervised34,35,36,37 fashion. While foundation models have demonstrated improvements through self-supervised pre-training34,35, disentangling the contributions of SSL, scaling laws or the transformer architecture remains difficult. This ongoing debate underscores the relevance of investigating SSL in non-transformer contexts, which are prevalent in SCG38,39. Recent studies in computer vision40,41 also suggest a nuanced perspective on the dominance of transformer architectures, indicating the value of exploring diverse architectural approaches for model development. The ambiguity mainly arises when comparing the performance of models with and without self-supervised pre-training14,42, suggesting a need for a more in-depth exploration of the role of SSL in SCG. Similar to SSL, semi-supervised learning combines unsupervised pre-training with supervised fine-tuning30, as opposed to self-supervised pre-training with optional fine-tuning in SSL. Both learning techniques are useful in transfer learning settings that are popular in SCG as reference mapping methods for single-cell datasets2,43.
To guide the effective usage of SSL in SCG, we need to address these ambiguities through systematic empirical validation. Such a study helps to determine the scenarios in which SSL can effectively contribute to SCG. First, this requires developing SSL methods based on first principles and tailoring them for single-cell applications. These SSL methods learn representations from data and differing pairwise relationships. To assess their impact on downstream performance, we benchmark our SSL methods and compare their performance with their supervised and unsupervised counterparts. Second, this study requires validation across downstream applications, addressing the method’s objective to learn data representations that are helpful across multiple tasks.
Our study aims to identify specific scenarios in SCG where SSL is helpful and to thoroughly analyse and evaluate SSL approaches in SCG. Utilizing the CELLxGENE44 census of scTab5 (scTab dataset), which comprises over 20 million cells, our study assesses the effectiveness of SSL across multiple downstream tasks. On the basis of well-defined benchmark metrics for SSL in SCG, our empirical analysis primarily focuses on the cell-type prediction application, with validation in gene-expression reconstruction, cross-modality prediction and data integration. We find that SSL improves downstream performance in transfer learning settings, that is, when analysing smaller datasets informed by insights from a larger auxiliary dataset and in scenarios involving unseen datasets. This improvement is especially notable in class-imbalance-sensitive metrics, indicating robustness improvements. However, our findings also reveal that self-supervised pre-training on the same dataset as the fine-tuning does not yield improvement compared with only supervised or unsupervised training. In summary, our study clarifies the roles and benefits of SSL in SCG, demonstrating its strengths in specific contexts while identifying its applicability limits. This research contributes to a more informed and strategic use of SSL in SCG, particularly in advancing our understanding of complex biological datasets.
Results
SSL framework for SCG
We present an SSL framework to develop self-supervision methods and study different use cases in SCG. Central to our framework is the use of fully connected autoencoder architectures, selected for their ubiquitous application in SCG tasks38,39 and for minimizing architectural influences on our study, yet still large enough to capture underlying biological variations. In this framework, we integrate key SSL pretext tasks based on masked autoencoders45 and contrastive learning46,47 to benchmark their performance. The framework operates in two stages. The first stage is pre-training, also called pretext task, where the model learns from unlabelled data. We call the resulting model ‘zero-shot SSL’ for its zero-shot evaluation. The second stage is the optional fine-tuning. We call the resulting model the ‘SSL’ model, which is further trained to specific downstream tasks such as cell-type annotation (Fig. 1a). The pretext task builds a rich data representation based on a comprehensive dataset. We chose the scTab dataset5 because of its extent and diversity. We used all 19,331 human protein-encoding genes from scTab to maximize generalizability, ensuring gene coverage for analyses of unseen datasets, regardless of their feature selections. Our SSL framework leverages masked autoencoder with random masking and gene programme (GP) masking strategies, along with our isolated masked autoencoder approaches gene programme to gene programme (GP to GP) and gene programme to transcription factor (GP to TF) masking, considering isolated sets of genes (Fig. 1b). The strategies entail leveraging different degrees of biological insight, from random masking with a minimal inductive bias to isolated masking that intensively utilizes known gene functions, emphasizing targeted biological relationships. For contrastive learning, we incorporate the negative-pair-free methods bootstrap your own latent (BYOL)46 and Barlow twins47, known for their effectiveness in computer vision (Fig. 1c), with negative binomial noise and masking as data augmentations. We benchmarked these strategies for their efficacy in improving downstream performance. Our SSL framework, including these strategies, is depicted in Fig. 1a, outlining its architecture and pivotal components. Detailed descriptions of the specific implementations and adaptations of these SSL methods for SCG are further elaborated in Methods.
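To make the setup concrete, the sketch below shows a fully connected autoencoder of the kind described above, operating on the 19,331-gene input space. The layer widths and latent dimension are illustrative assumptions rather than the architecture used in the study, and PyTorch is assumed as the framework.

```python
import torch
import torch.nn as nn

N_GENES = 19331  # all human protein-coding genes used in scTab

class FCAutoencoder(nn.Module):
    """Minimal fully connected autoencoder; hidden sizes are illustrative, not the paper's."""
    def __init__(self, n_genes: int = N_GENES, latent_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_genes, 512), nn.ReLU(),
            nn.Linear(512, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, n_genes),
        )

    def forward(self, x):
        z = self.encoder(x)          # cell embedding used for zero-shot evaluation
        return self.decoder(z), z    # reconstruction used by the masking pretext task

# Pre-training yields the 'zero-shot SSL' encoder; fine-tuning attaches a task head
# (for example, a cell-type classifier) on top of the same encoder.
```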
a, Overview of the SSL framework (Methods). The zero-shot SSL model is trained on scTab RNA-sequencing data using masked autoencoders (MAEs) and contrastive learning (CL). Its weights initialize the SSL model, which is fine-tuned for downstream tasks (for example, cell-type prediction, gene-expression reconstruction). The non-SSL model is initialized randomly and fine-tuned only for downstream tasks. b, Masking strategies. Input features are either zeroed out (black) or left unchanged. The autoencoder (grey) predicts the masked features, and the loss is computed only on those. GP and TF masking is also shown (Methods). c, Contrastive learning. Input is augmented to create views. BYOL and Barlow twins are contrastive methods for data representation (Methods). d, Results from individual datasets. (1) Random model, (2) non-SSL model (for example, supervised for cell-type prediction, unsupervised for gene expression), (3) zero-shot SSL model, and (4) SSL model. Models are tested on PBMC, Tabula Sapiens and HLCA (Methods). Cell-type prediction is evaluated with macro/micro F1 scores (higher is better; see Supplementary Fig. 5 for loss curves); gene-expression reconstruction is evaluated with weighted explained variance (EV W, higher is better). The best performance is in bold. e, Relative cell prediction accuracy for the SSL and supervised models for cell types with the largest performance differences (see Supplementary Fig. 1 for other datasets). f, Macro F1 score differences between SSL and supervised models plotted against cell-type abundance, with the number of cell types for each abundance shown above (see Supplementary Fig. 1 for other datasets). g, Cell-type prediction performance of SSL models pre-trained on random scTab donor subsets and fine-tuned on PBMC, compared with the supervised model trained on only PBMC. Shaded error bands represent 95% confidence intervals (mean ± s.e. × t-value at 95% confidence). Results are from five random seeds (see Supplementary Fig. 6 for other datasets).
Pre-training on auxiliary data boosts cell-type prediction
As a first use case for self-supervision in SCG, we asked whether analyses on cell atlases or smaller datasets can benefit from self-supervised pre-training on auxiliary data. We answered this using three datasets: the Human Lung Cell Atlas (HLCA)4 (2,282,447 cells, 51 cell types), peripheral blood mononuclear cells (PBMCs) after severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection48 (422,220 cells, 30 cell types), and the Tabula Sapiens Atlas (483,152 cells, 161 cell types)49. These datasets vary in size, biological context and complexity, providing a robust test bed for our models. We evaluated cell-type prediction with the macro F1 score, supplemented by the micro F1 score, to compare robustness against class imbalances. We evaluated gene-expression reconstruction with the weighted explained variance. For the PBMC and Tabula Sapiens datasets, the self-supervised pre-training on additional scTab data significantly improved cell-type prediction and gene-expression reconstruction (Fig. 1d and Supplementary Fig. 2): from [0.7013 ± 0.0077] to [0.7466 ± 0.0057] macro F1 in the PBMC dataset and from [0.2722 ± 0.0123] to [0.3085 ± 0.0040] macro F1 in the Tabula Sapiens dataset. In the Tabula Sapiens dataset, this improvement is driven by strongly enhancing the classification of specific cell types, correctly classifying 6,881 of 7,717 type II pneumocytes instead of 2,441 (Fig. 1e; for other datasets, see Supplementary Fig. 1). For the PBMC dataset, this improvement is pronounced for underrepresented cell types (Fig. 1f), also indicated by the stronger macro F1 improvement versus micro F1 improvement. In contrast, the HLCA dataset presented a marginal performance improvement through self-supervised pre-training. Notably, SSL outperforms supervised learning if pre-trained on a large number of donors, highlighting the necessity of a rich pre-training dataset (Fig. 1g and Supplementary Fig. 6).
Tailored pre-training yields strong zero-shot performance
The scenario in which SSL is typically evaluated in computer vision is the zero-shot setting, where the model’s ability to represent and distinguish unobserved classes is assessed using data representations obtained solely through self-supervised pre-training. The labels are predicted, for example, with k-nearest-neighbours (kNN) classification or by training a prediction head while freezing the encoder weights. This perspective is noteworthy in SCG, where datasets’ increasing volume and complexity often come with challenges in obtaining accurate and comprehensive labels4. The ability of zero-shot SSL to achieve up to a 0.6725 macro F1 score on the scTab test set stands out as a strong performance (Fig. 2a). Likewise, in the test cases of HLCA, PBMC and Tabula Sapiens, zero-shot SSL comes close to their fine-tuned counterparts (Fig. 1d and Supplementary Fig. 1). The embedding from the zero-shot model illustrates this implicitly learned distinction of cell types (Fig. 2b). These findings highlight SSL’s potential in SCG to reduce the reliance on curated labels50 and propose adding self-supervised pre-trained model embeddings to biological analyses alongside principal component analysis (PCA), a practice exemplified by platforms such as CELLxGENE44. However, our benchmarking of SSL methods revealed the sensitivity to the choice of pre-training strategy. Contrastive learning has proven effective in domains such as language or vision modelling10,46,47. It has further proved effective on smaller scales18,19,21,22,23,24,25,26,27 in SCG and worked in principle on large scales, as shown in ref. 51 and this benchmark. Still, our study finds that masking outperforms contrastive learning in large SCG tasks. This result highlights the challenges of applying these methods as generalizable pretext tasks for single-cell data. Conversely, masked autoencoders performed better: the random masking strategy consistently ranked among the top performers across different tasks (Fig. 2a). Notably, in the specific context of gene-expression reconstruction, the GP to TF isolated masking showed superior performance compared with other methods (Supplementary Figs. 1 and 2). This finding highlights the potential of tailored masking strategies in capturing the nuanced biological variations inherent in SCG data.
a, Benchmark result of cell-type prediction on the scTab holdout test set using kNN classification (Methods). We compare: (1) baseline methods of kNN classification on a randomly initialized model, on the PCA embeddings, and deploying GeneFormer34 in a zero-shot setting; (2) our zero-shot SSL methods, pre-trained on the scTab training data; (3) our SSL methods; and (4) the supervised model. b, t-distributed stochastic neighbor embedding (t-SNE) visualization of the baseline PCA embedding, and the embedding obtained from the zero-shot SSL model and the supervised model. c, Classification performance on unseen datasets: (I, Human Brain Atlas, tail of hippocampus (HiT) - caudal hippocampus - CA4-DGC52; II, Human Brain Atlas, all non-neuronal cells52; III, single-cell analysis of prenatal and postnatal human cortical development53; IV, circulating immune cells after CV19 infection, vaccination and HC54; V, human, great apes study55) measuring the macro F1 score of a random baseline, a zero-shot SSL model, the supervised model and an SSL model, all trained on scTab without exposure to any unseen dataset. The box plots show the median (centre), 25th and 75th percentiles (box bounds) and whiskers extend to the minima and maxima within 1.5 times the interquartile range (seaborn default).
The efficacy of SSL depends on its context
While the previous evaluations focused on carefully curated and widely used benchmarks, we also set out to investigate SSL's nuanced behaviour when analysing in-distribution versus unseen data. If the supervised and SSL models are given access to the same data, their performance is remarkably similar (Fig. 2a). This finding holds across cell-type annotation and gene-expression reconstruction. Extending to unseen datasets, we evaluated the supervised and SSL models on five datasets52,53,54,55 published after the CELLxGENE44 census of scTab (Methods). In this setting, self-supervised pre-training improves performance (Fig. 2c and Supplementary Fig. 2), for example, from [0.0877 ± 0.0215] to [0.1797 ± 0.0450] macro F1 for cell-type prediction in the great apes study55. So, while supervised and self-supervised learning perform similarly within the training distribution (Fig. 2a), SSL shows its advantages when analysing unseen datasets, where generalization is crucial.
SSL helps cross-modality prediction
Having benchmarked the utility of SSL on transcriptomics, we extended our study to multiomics56, asking whether SSL can leverage auxiliary data from one modality to enhance multimodal downstream tasks, here focusing on cross-modality prediction (Fig. 3a). The NeurIPS multiomics dataset57, a rich multi-donor, multi-site and multimodal bone marrow dataset containing coupled gene expression and proteomics counts from CITE-seq58 experiments, provided a suitable test bed. The models obtain RNA-sequencing counts as input and predict protein counts. The SSL models are additionally pre-trained on RNA-sequencing data from the auxiliary scTab and the NeurIPS multiome dataset. When pre-trained on scTab, SSL significantly outperforms its supervised counterpart (P < 0.01) and the baseline method totalVI59 (P < 0.01; Fig. 3b,d). The Pearson correlation between predicted and true protein counts improved from [0.8809 ± 0.0013] for the unsupervised model to [0.8943 ± 0.011] for the self-supervised model. Notably, the improvement is smaller if pre-trained on the same data, to a Pearson correlation of [0.8824 ± 0.0037]. This finding highlights the advantage of self-supervision in cases where one modality is more abundant. This effect is reproducible on other modalities, as verified by predicting the assay for transposase-accessible chromatin with sequencing (ATAC-Seq) from RNA counts in the NeurIPS multiome dataset57 (Fig. 3c), proving the robust advantage of self-supervision on auxiliary data.
a, Scheme of cross-modality prediction training. The SSL models are pre-trained with masked autoencoders on RNA-sequencing data (1) of the downstream NeurIPS multiome dataset or (2) of the auxiliary scTab dataset. The SSL model, initialized with the pre-trained weights, and the unsupervised model, randomly initialized, predict the protein counts from the RNA counts. The baseline totalVI model learns a joint distribution of RNA and protein counts. In inference, all models predict the protein counts from a holdout test set. b,c, Cross-modality prediction performance for two tasks: predicting the normalized counts of 134 proteins (b) and the TF-IDF transformed ATAC-seq counts of 116,490 genes (c); both given coupled RNA counts. Shown is the Pearson correlation between predicted and true counts. The box plots show the median (centre), 25th and 75th percentiles (box bounds), and whiskers extend to the minima and maxima within 1.5 times the interquartile range (seaborn default). Results are from five experiments at random seeds. d, Scatter plot of predicted log counts against true log counts for exemplary proteins with correlation.
Self-supervised pre-training enhances data integration
Integrating single-cell datasets for joint analysis is difficult due to batch effects, for example, experimental conditions or confounding factors, posing unique challenges to atlasing efforts4. Large-scale models in SCG have already been deployed to address this challenge32,34. To clarify the role of SSL in these efforts, we set out to integrate three datasets included in scTab: the molecular cell atlas of the human lung (65,662 cells, 45 cell types)60, the molecular atlas of lung development of LungMap (46,500 cells, 28 cell types)61 and the molecular single-cell lung atlas of lethal coronavirus disease 2019 (COVID-19; 116,313 cells, 30 cell types)62. The datasets vary in cell-type composition and donor health or disease states, providing a challenging environment for this task. The single-cell integration benchmarking (scIB) metrics63 evaluate the data integration performance, indicating how well batch effects are corrected while conserving biological variability (Fig. 4a). The score aggregates five batch correction metrics (PCR batch, batch ASW, graph iLISI, graph connectivity and kBET) and nine biological conservation metrics (NMI cluster/label, ARI cluster/label, cell-type ASW, isolated label F1, isolated label silhouette, graph cLISI, cell cycle conservation, HVG conservation and trajectory conservation) that cover cell identity labels and variance beyond that. We fine-tuned the unsupervised and SSL models using gene-expression reconstruction on these three datasets. To improve data integration performance and model comparison, we added batch covariates to all models56. This led to the SSL-shallow model, which fine-tunes the last encoder layer of the zero-shot SSL model with batch covariates. PCA and scVI38 embeddings serve as baselines for data integration. The scIB metrics indicate that self-supervised pre-training improves the data integration performance (Fig. 4b) with a total scIB score of [0.5638 ± 0.0089] (SSL shallow) and [0.5571 ± 0.0080] (SSL) compared with [0.5354 ± 0.0110] (unsupervised). The SSL-shallow model performed best, hinting at a meaningful data representation learned through the self-supervision algorithms, underscored by the comparable performance of the specialized data integration method scVI38. This finding supports the advantage of leveraging auxiliary data through SSL and showcases the effectiveness of minimal fine-tuning compared with unsupervised learning.
a, Scheme of data integration fine-tuning for SSL and SSL-shallow models. The zero-shot SSL model is fine-tuned, along with the batch covariates available for the data integration benchmarking. A complete fine-tuning results in an SSL model, and a shallow fine-tuning of the last encoder layer results in an SSL-shallow model for the data integration task. Uniform manifold approximation and projection (UMAP) plot for good and bad integration. b, The scIB data integration benchmarking results63 of five runs. The benchmarking analysis comprises two sets of bio conservation and batch correction scores aggregated into a total score with a weighted mean. The box plots show the median (centre), 25th and 75th percentiles (box bounds), and whiskers extend to the minima and maxima within 1.5 times the interquartile range (seaborn default). Results are from five experiments at random seeds.
Discussion
We analysed the application of SSL in SCG to guide its effective usage, leading us to adapt and benchmark several SSL techniques tailored for SCG. Our empirical study illuminates the contexts in which SSL can excel, especially when leveraging insights from vast auxiliary datasets for tasks on smaller datasets and in unseen-dataset scenarios. We also demonstrated that SSL shows parity with supervised methods where both access the same data and that the zero-shot SSL model comes close to that performance. Our insights contribute to a more nuanced understanding of SSL's applications in SCG. By rigorously testing these methods on an expansive dataset encompassing over 20 million cells, we offer a robust, empirically grounded perspective on SSL in SCG, paving the way for more informed, data-driven approaches to studying complex biological systems. In the context of large-scale and foundation models34,35,36, this understanding could help design pre-training and select pretext tasks. For broad applicability within SCG, we address diverse, meaningful tasks, including cell-type prediction, gene-expression reconstruction, cross-modality prediction and data integration. By demonstrating that SSL's advantages emerge predominantly in scenarios involving transfer learning through auxiliary data or distributional shifts, we offer a pragmatic lens through which the SCG community can view SSL: not as a universal solution but as a strategic tool tailored for specific challenges. This insight is particularly relevant as SCG moves towards larger data analyses, analysing cell atlases and leveraging consortia of millions of cells in foundation models. The adaptability and robustness of SSL, as evidenced in our empirical analysis, are crucial in this context to leverage large datasets. Our approach thus provides an example in SCG of the contextual application of SSL, guiding researchers to leverage this methodology where it most effectively addresses the field's unique data challenges.
The benchmark of SSL methods provides a clear recommendation for practitioners regarding which approach is advantageous in the aforementioned settings. As a primary approach, we recommend masked pre-training with a random masking strategy due to its robustness and versatility across various tasks, which is central to foundation models. However, when focusing on specific problems, more tailored techniques might be beneficial. For instance, cell-type-specific tasks such as zero-shot cell-type prediction may benefit from masking gene programmes associated with cell types. Tasks prioritizing cell–cell interactions over subcellular resolutions may prefer contrastive methods, such as Barlow twins, which also showed strong zero-shot performance. These recommendations provide a strategic framework for applying SSL methods in SCG, ensuring researchers can select the most appropriate SSL method.
Future work on SSL in SCG may follow up on our findings. First, we identified several scenarios where SSL can improve performance across downstream tasks. These scenarios can serve as a baseline for future work, such as adding further downstream applications or developing SSL methods. Second, the remarkable performance improvement through SSL pre-training on auxiliary data promises further applications in which data or a data modality is scarce. Our solution can potentially improve analysis performance in applications such as dynamics modelling, where datasets with temporal resolution are limited in size and availability, or in applications with very small datasets. Third, the findings of this work are in the context of fully connected neural networks. Some conclusions may not generalize to other architectures, such as transformers. Extending the investigation to another base architecture is an interesting direction. Still, practitioners might consider pre-training their chosen model with SSL on auxiliary data, in particular with masked autoencoders, as our benchmark suggests. Models natively equipped with self-supervised pre-training on auxiliary datasets, such as scGPT35 or Nicheformer37, can be a good starting point for practical purposes.
Finally, our study clarifies the scenarios in which SSL pre-training can improve performance in SCG. Namely, SSL excels in transfer learning tasks by leveraging auxiliary data and distributional shift scenarios. In the context of foundation models, we illuminate methodological innovations stemming from the SSL pre-training. For the broader computational biology community, we have shown that self-supervised pre-training on atlas-level data can help to improve performance on smaller datasets of biological or medical relevance that are commonly more difficult to scale.
Methods
Data curation
Preprocessing
All datasets used in this study underwent a commonly used preprocessing pipeline in SCG. This involved normalization to 10,000 counts per cell and log1p transformation to mitigate technical variations and facilitate more meaningful biological comparisons. This uniform preprocessing approach ensured that our models were trained and evaluated on data closely reflecting the underlying biological realities while minimizing technical noise.
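For reference, the preprocessing described above corresponds to the standard Scanpy calls sketched below; the file path is a placeholder.

```python
import scanpy as sc

adata = sc.read_h5ad("dataset.h5ad")          # placeholder path to any dataset used in the study
sc.pp.normalize_total(adata, target_sum=1e4)  # normalize each cell to 10,000 counts
sc.pp.log1p(adata)                            # log1p transformation
```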
scTab dataset
The core dataset for our study stems from scTab5 and is derived from the CELLxGENE44 census version 2023-05-15, a long-term supported release hosted by CELLxGENE. This dataset represents a substantial collection of human single-cell RNA-sequencing data, encompassing 22.2 million cells spanning 164 unique cell types, 5,052 unique donors and 56 different tissues. To ensure the reproducibility of dataset creation, scTab applied stringent criteria for inclusion, focusing on primary data from 10x-based sequencing protocols and ensuring a broad representation across cell types and donors. The scTab data are divided into training, validation and test sets based on donors, avoiding label leakage and ensuring each set contains unique donors. This donor-based splitting approach allowed us to maintain a proportional representation of cells across the sets. It ensured that each cell type was represented in the training and testing phases. It further presented a challenging test split with unseen donors. The final split resulted in 15.2 million cells for training, 3.5 million for validation and 3.4 million for testing.
Single-cell atlases
We further considered smaller, focused datasets to test whether access to the auxiliary data gives an advantage. These datasets are subsets of the CELLxGENE44 census of scTab5 (scTab dataset), tailored to specific applications, and consist of the Human Lung Cell Atlas (HLCA)4 (available at cellxgene.cziscience.com/e/9f222629-9e39-47d0-b83f-e08d610c7479.cxg; 775,790 cells after filtering, 51 cell types, 540,732 training, 117,541 validation, 117,517 test samples), peripheral blood mononuclear cells (PBMCs) after SARS-CoV-2 infection48 (available at cellxgene.cziscience.com/e/2a498ace-872a-4935-984b-1afa70fd9886.cxg; 78,354 cells after filtering, 30 cell types, 78,354 training, 33,761 validation, 189,756 test samples), and the Tabula Sapiens Atlas (available at cellxgene.cziscience.com/e/53d208b0-2cfd-4366-9866-c3c6114081bc.cxg; 335,861 cells after filtering, 161 cell types, 223,337 training, 54,908 validation, 57,616 test samples)49. The division into training, validation and test sets is derived from their allocation within the scTab dataset to prevent data leakage. Note that the training, validation and test sets of the PBMC, Tabula Sapiens and HLCA datasets are also part of the corresponding splits of the full scTab dataset.
Unseen datasets
To evaluate our models' performance in unseen data analysis scenarios, we incorporated five unseen datasets published after the CELLxGENE census version of scTab: (1) all non-neuronal cells from the Human Brain Atlas52 (available at cellxgene.cziscience.com/e/b165f033-9dec-468a-9248-802fc6902a74.cxg), (2) dissection, tail of hippocampus (HiT) - caudal hippocampus - CA4-DGC from the Human Brain Atlas52 (available at cellxgene.cziscience.com/e/9f499d32-400d-4c42-ac9a-fb1481844fee.cxg), (3) the single-cell analysis of prenatal and postnatal human cortical development53 (available at cellxgene.cziscience.com/e/1a38e762-2465-418f-b81c-6a4bce261c34.cxg), (4) circulating immune cells—CV19 infection, vaccination and HC54 (available at cellxgene.cziscience.com/e/242c6e7f-9016-4048-af70-d631f5eea188.cxg), and (5) human, great apes study55 (available at cellxgene.cziscience.com/e/2bdd3a2c-2ff4-4314-adf3-8a06b797a33a.cxg). The unseen datasets were filtered for the genes used in scTab; missing genes were zero-padded. The datasets were then normalized to 10,000 counts per cell and log1p transformed. The full datasets were used as the test split, that is, no samples were used for training.
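One way to implement the gene filtering and zero-padding described above is sketched here; the helper name and the use of dense arrays are illustrative assumptions, not the study's exact implementation.

```python
import numpy as np
import anndata as ad

def align_to_reference(adata: ad.AnnData, ref_genes: list) -> ad.AnnData:
    """Reorder a query dataset to the scTab gene space, zero-padding genes it lacks."""
    X = np.zeros((adata.n_obs, len(ref_genes)), dtype=np.float32)
    col = {gene: i for i, gene in enumerate(ref_genes)}
    shared = [g for g in ref_genes if g in set(adata.var_names)]
    sub = adata[:, shared].X
    X[:, [col[g] for g in shared]] = sub.toarray() if hasattr(sub, "toarray") else np.asarray(sub)
    aligned = ad.AnnData(X=X, obs=adata.obs.copy())
    aligned.var_names = ref_genes
    return aligned

# Afterwards, normalize to 10,000 counts per cell and log1p transform as above.
```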
NeurIPS multiome dataset
Our study included the NeurIPS multiome dataset57, a multimodal bone marrow dataset that integrates gene-expression counts with proteomics data. While distinct in its multi-omic nature, this dataset underwent similar preprocessing steps to our other datasets, ensuring consistency across all analyses. We split the dataset into training, validation and test sets using an 80/10/10 random split. We chose 2,000 highly variable genes using Scanpy64 as a standard preprocessing step for this dataset.
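A minimal sketch of the highly variable gene selection and the 80/10/10 random split follows, assuming Scanpy defaults on the RNA counts; the file path and random seed are placeholders.

```python
import numpy as np
import scanpy as sc

adata = sc.read_h5ad("neurips_multiome.h5ad")            # placeholder path (RNA modality)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)     # 2,000 highly variable genes
adata = adata[:, adata.var["highly_variable"]].copy()

idx = np.random.default_rng(0).permutation(adata.n_obs)  # 80/10/10 random split
n_train, n_val = int(0.8 * adata.n_obs), int(0.1 * adata.n_obs)
train_idx, val_idx, test_idx = idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
```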
Self-supervision methods
Overview
SSL is the concept that data, along with their inherent pairwise relationships, are sufficient for learning meaningful data representations, even in the absence of explicit labels. While supervised learning relies on paired observations and labels (X, Y), SSL depends only on the input X and an inter-sample relationship (X, G), where G is constructed through a data augmentation that sustains the semantic information of X8. Thereby, the method distils signal from noise65, a crucial aspect for managing challenges such as class imbalances in large, real-world datasets66. In single-cell data, this means distilling the signal of the cellular omics and removing noise sources such as batch effects or inconsistent labelling.
In the context of SCG, SSL harnesses these capabilities to navigate the complexities of vast, unlabelled datasets replete with intricate biological interdependencies. The framework is structured into two distinct phases: pre-training and fine-tuning. During the pre-training phase, the model employs contrastive learning or denoising methods to learn a data representation. This representation, characterized by its broad applicability, is then utilized in one of two ways. First, as a zero-shot SSL model, it can be directly applied to a downstream task without further label-dependent training. Alternatively, as an SSL model, it undergoes fine-tuning to enhance performance on specific tasks. This fine-tuning capitalizes on the rich data representation acquired during pre-training, adjusting and optimizing it for the desired application. The fine-tuning phase of SSL, therefore, is not only about refining the pre-training but also about strategically leveraging the pre-established data mappings for task-specific optimizations.
Core principles and strategies
The choice of self-supervised pre-training, that is, learning the inter-sample relationship, is critical to obtaining a meaningful data representation as it gives rise to the signal-to-noise distinction in the dataset. Our SSL framework is designed around two primary pre-training strategies: masked autoencoders and contrastive learning, both adapted to meet the unique demands of SCG.
Masked autoencoders
This approach follows the concept of self-prediction, where a significant portion of input features (genes in SCG) are masked (that is, set to zero), and the model is trained to reconstruct these missing parts9,45,67. It thus sets focus on inter-feature dependencies. We implemented various masking strategies. (1) In random masking, 50% of genes are randomly chosen and masked with different choices in each iteration. (2) In GP masking, sets of genes known for biological functions are masked such that n% of genes are masked and reconstructed. The C8 cell-type signature gene sets from the Human MSigDB Collections68,69,70 were used. Next, we introduce isolated masked autoencoders, in which all genes but a defined set are masked, and only this set is reconstructed. (3) For this, we present a GP to TF isolated masking. This masking predicts the expression value of the transcription factor known to correspond to a gene programme. This connection is given in the TFT transcription factor targets subset of C3 regulatory target gene sets from the Human MSigDB Collections71,72. (4) Last, we present a GP to GP isolated masking. In this strategy, a gene programme is kept unmasked and used to predict only itself. The gene programmes for this strategy also stem from the C8 cell-type signature gene sets from the Human MSigDB Collections. These strategies are tailored to capture specific gene interactions and relationships, making them particularly suited for the intricate nature of single-cell data.
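A minimal sketch of the random-masking pretext task is given below, assuming a model that returns a reconstruction of its input (such as the autoencoder sketch above); the 50% masking ratio follows the text, while the mean-squared-error reconstruction loss is a placeholder choice. The GP and isolated variants swap the random mask for boolean masks derived from the MSigDB gene sets.

```python
import torch
import torch.nn.functional as F

def random_masking_step(model, x, mask_frac=0.5):
    """One pretext-task step: zero out a random 50% of genes per cell and
    compute the reconstruction loss only on the masked positions."""
    mask = torch.rand_like(x) < mask_frac          # True marks a masked gene
    recon, _ = model(x.masked_fill(mask, 0.0))
    return F.mse_loss(recon[mask], x[mask])

# GP masking: replace `mask` with membership in an MSigDB C8 gene set.
# Isolated GP to GP / GP to TF masking: mask everything except the chosen set
# and reconstruct only that set (or its associated transcription factor).
```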
Contrastive learning
Unlike self-prediction, contrastive learning focuses on understanding relationships between different samples, thus focusing on inter-sample dependencies. This method minimizes distances between similar samples and maximizes distances between dissimilar ones in the embedded space. Contrastive methods are typically distinguished by their strategy to avoid representation collapse, the trivial solution to contrastive losses of constant representations9,10. BYOL is an example of architectural regularization through its teacher–student network. Barlow twins is an example of an information maximization method that avoids collapse by maximizing the information content of the embedding. We incorporated BYOL and Barlow twins in our framework to benchmark two schools of thought. We used a combination of negative binomial noise and masking as data augmentation, simulating the expected noise profiles in SCG data.
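The data augmentation used to create views for BYOL and Barlow twins can be sketched as follows; the noise and masking parameters are illustrative assumptions, not the tuned values used in the study.

```python
import torch

def augment(x, mask_frac=0.2, nb_total_count=5.0, nb_probs=0.3):
    """Create one augmented view of a batch of cells (cells x genes):
    add negative binomial noise, then randomly mask a fraction of genes."""
    noise = torch.distributions.NegativeBinomial(
        total_count=torch.full_like(x, nb_total_count), probs=nb_probs
    ).sample()
    return (x + noise).masked_fill(torch.rand_like(x) < mask_frac, 0.0)

# Two independent augmentations of the same cells form the positive pair:
# view_a, view_b = augment(x), augment(x)
```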
Zero-shot SSL concept
A key concept in our study is the differentiation between the zero-shot SSL and SSL models. The zero-shot SSL model represents the initial phase of pre-training, where the model learns from the data without any label guidance through self-supervision algorithms. This model, even without fine-tuning, can provide meaningful insights into data, as demonstrated in various downstream tasks. The SSL model, in contrast, undergoes an additional fine-tuning phase tailored to specific downstream applications. This distinction allows us to explore the full spectrum of SSL’s capabilities, from a generalized understanding of data to specialized, task-specific optimizations.
In summary, our self-supervision methods in SCG are defined by a nuanced application of masked autoencoders and contrastive learning adapted to the field’s specific challenges. The zero-shot SSL concept plays a central role in our approach, highlighting the potential of SSL to derive meaningful insights from large-scale, unlabelled datasets. This methodological framework sets the stage for a detailed exploration and benchmarking of SSL’s impact on various SCG tasks, as detailed in the following sections of our study.
Downstream applications in SCG
Cell-type annotation
Cell-type annotation in SCG is a classification task where data samples, represented as vectors of RNA-sequencing counts, are assigned to distinct cellular identities. Although seemingly straightforward, this task is complicated by the noise and heterogeneity inherent in large-scale datasets. We utilize the scTab dataset as the primary basis for our cell-type annotation analysis. We employ various SSL methods and compare their effectiveness against supervised approaches. We train the classifier using a cross-entropy loss. We evaluate cell-type annotation performance by kNN (k = 5) classification, using the scTab validation set as the neighbour pool for the test samples. The validation set is sufficiently large and diverse, making it a simple and scalable alternative to the training set for this purpose. This choice ensures a consistent evaluation across models, including the zero-shot SSL model, which does not have a prediction head. Our evaluation metrics focus on the macro F1 score, reflecting the models' ability to handle class imbalances, supplemented by the micro F1 score, which offers an additional comparative perspective on class imbalance. Exemplary loss curves for this training are shown in Supplementary Fig. 5 and a list of hyperparameters is shown in Supplementary Table 1.
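A sketch of this evaluation follows, assuming scikit-learn for the kNN classifier and F1 scores (the paper does not name a specific library): embeddings of the validation set serve as the neighbour pool for the test embeddings.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

def knn_cell_type_eval(val_emb, val_labels, test_emb, test_labels, k=5):
    """kNN (k = 5) cell-type prediction on model embeddings, scored with macro and micro F1."""
    knn = KNeighborsClassifier(n_neighbors=k).fit(val_emb, val_labels)
    pred = knn.predict(test_emb)
    return (f1_score(test_labels, pred, average="macro"),
            f1_score(test_labels, pred, average="micro"))
```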
Gene-expression reconstruction
Gene-expression reconstruction, the process of reconstructing counts from the transcriptome, still presents challenges due to the inherent noise and dispersion in RNA-sequencing data. The popular scVI model38 inspires our approach and diverges in its use of input data. While scVI uses raw counts as input and models them as a negative binomial distribution, our method employs normalized data for consistency with other downstream tasks. Nonetheless, similar to scVI, we predict the parameters of the negative binomial distribution. This strategy of modelling distribution parameters rather than direct RNA-sequencing count prediction enhanced reconstruction accuracy in our experiments. We opt for a non-variational, fully connected autoencoder framework consistent with our cell-type prediction approach. Performance evaluation encompasses MSE and uniform and weighted explained variance. We reported the weighted explained variance to best reflect the actual reconstruction efficacy, accounting for class imbalances. We include the MSE and uniform explained variance in our framework as supplementary evaluation, and they were used in our experiments. The hyperparameters used are shown in Supplementary Table 1.
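To illustrate the parameterization, the sketch below gives a negative binomial negative log-likelihood in the mean/inverse-dispersion form popularized by scVI; it is an assumption-level sketch rather than the study's exact loss code, and the decoder is assumed to output both parameters through a softplus.

```python
import torch

def nb_nll(x, mu, theta, eps=1e-8):
    """Negative log-likelihood of a negative binomial with mean `mu` and
    inverse dispersion `theta`, summed over genes and averaged over cells."""
    log_theta_mu = torch.log(theta + mu + eps)
    ll = (theta * (torch.log(theta + eps) - log_theta_mu)
          + x * (torch.log(mu + eps) - log_theta_mu)
          + torch.lgamma(x + theta)
          - torch.lgamma(theta)
          - torch.lgamma(x + 1.0))
    return -ll.sum(dim=-1).mean()

# Assumed decoder heads: mu = softplus(mu_head(z)), theta = softplus(theta_head(z)).
```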
Cross-modality prediction
Cross-modality prediction is the task of predicting one modality from another. Such a task could potentially augment cellular data with a different modality, offering another perspective. For pre-training, we used masking (1) on the auxiliary scTab dataset and (2) on the downstream task dataset. For fine-tuning, we included two studies, both using normalized and log1p transformed RNA-sequencing counts as the originating modality. First, we predicted all 134 normalized and log1p transformed protein counts (proteomics) available in the NeurIPS CITE-seq dataset57. We trained the models in a random training, validation and test split using coupled RNA and proteomics counts. Second, we predicted all 116,490 TF-IDF (term frequency-inverse document frequency)73-normalized ATAC counts available in the NeurIPS multiome dataset57. Again, we trained the models in a random training, validation and test split using the coupled RNA and ATAC counts. Hyperparameters are shown in Supplementary Table 1.
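For reference, one common TF-IDF formulation for single-cell ATAC matrices is sketched below; the exact variant used in the study is not specified here, so treat this as an illustrative assumption (dense arrays, cells in rows).

```python
import numpy as np

def tfidf(X):
    """TF-IDF transform of a cells x features count matrix (one common variant)."""
    tf = X / (X.sum(axis=1, keepdims=True) + 1e-8)           # term frequency per cell
    idf = np.log1p(X.shape[0] / ((X > 0).sum(axis=0) + 1))    # inverse document frequency per feature
    return tf * idf

# Prediction quality is then scored as the Pearson correlation between predicted
# and true counts, e.g. np.corrcoef(pred.ravel(), true.ravel())[0, 1].
```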
Data integration
Data integration is the effort to jointly analyse a set of related SCG datasets, often curated from various donors with different pipelines and under different conditions, which introduces batch effects and technical artefacts. The scIB63 integration benchmarking is a well-established analysis that determines how well relevant and meaningful biological signals are preserved in a model's data representation while unwanted batch effects are removed, yielding a well-mixed representation of the datasets. Accordingly, the scIB pipeline aggregates two groups of metrics, bio conservation and batch correction, each consisting of several evaluations computed with different methods. The hyperparameters for data integration are shown in Supplementary Table 1.
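To make two of the aggregated components tangible, the sketch below computes simplified versions of the cell-type ASW (bio conservation) and batch ASW (batch correction) scores with scikit-learn; these are illustrative proxies of individual scIB components, not the full scIB pipeline, whose batch ASW is additionally computed per cell type.

```python
from sklearn.metrics import silhouette_score

def asw_components(embedding, cell_types, batches):
    """Simplified ASW-style scores on a cell embedding (both in [0, 1], higher is better)."""
    cell_type_asw = (silhouette_score(embedding, cell_types) + 1) / 2  # rescaled silhouette on cell types
    batch_asw = 1 - abs(silhouette_score(embedding, batches))          # 1 means batches are well mixed
    return cell_type_asw, batch_asw
```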
Contrastive method choice
For this benchmark, we developed contrastive methods based on BYOL and Barlow twins, two well-performing negative-pair-free methods. This choice is motivated by their reliance solely on data augmentations rather than sampling negative pairs in a large and heterogeneous dataset, and by their proven performance46,47. Other reasonable choices include simple Siamese networks74, which were excluded due to repeatedly observed training instability in our setting, and SimCLR12, which was not pursued further as BYOL and Barlow twins showed superior performance in previous benchmarks. While VICReg11 is promising by design, we focused on BYOL and Barlow twins due to their robustness. As contrastive learning methods generally performed worse than masking approaches, we prioritized the masking approaches for thorough investigation.
Batch effect
Batch effects were not explicitly corrected when working with large datasets, such as scTab, covering 249 datasets. Including many datasets seems to reduce the relative impact of such effects on the overall variation. When working with fewer datasets, such as in the data integration experiments covering three datasets, a batch covariate needs to be included to avoid strong batch effects.
Computational resources
The experiments for this work were conducted on a graphics processing unit (GPU) server with the following specifications:
- GPU: 16x Tesla V100 GPUs with 32 GB random access memory (RAM) per card
- GPU: 2x Tesla V100 GPUs with 16 GB RAM per card
- GPU: 8x A100-SXM4 GPUs with 40 GB RAM per card
All pre-training methods were trained on a single GPU for 2 days with early stopping, using up to 160 GB of system memory at a batch size of 8,192. For practitioners with limited GPU memory, smaller batch sizes can reduce memory usage. For example, a batch size of 2 uses under 1 GB of VRAM but greatly increases training time (>200 h per epoch on scTab). All fine-tuning methods were trained on a single GPU for 1 day with early stopping. All models were checked for convergence in the validation metrics.
Terminology
In this paper, we distinguish between architecture, method and model. This distinction clarifies how different methods impact models that share similar architectures. For example, scGPT trains a transformer architecture using SSL.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
The scTab data are available with instructions in the corresponding publication5. The smaller datasets are publicly available on CELLxGENE44 and are subsets of the scTab dataset (HLCA, Dataset ID 148; PBMC, Dataset ID 87; Tabula Sapiens, Dataset ID 41). The unseen datasets are sourced from CELLxGENE44 with instructions in the corresponding publications52,53,54,55. The NeurIPS multiome dataset is publicly available from NCBI GEO under accession GSE194122 with instructions in the corresponding publication57.
Code availability
The code is available at github.com/theislab/ssl_in_scg and on Zenodo at https://doi.org/10.5281/zenodo.13358872 (ref. 75). A lean version for masked pre-training on adatas fitting into memory is available at github.com/theislab/sc_mae.
References
Angerer, P. et al. Single cells make big data: new challenges and opportunities in transcriptomics. Curr. Opin. Syst. Biol. 4, 85–91 (2017).
Lotfollahi, M. et al. Mapping single-cell data to reference atlases by transfer learning. Nat. Biotechnol. 40, 121–130 (2022).
Regev, A. et al. The human cell atlas. eLife 6, e27041 (2017).
Sikkema, L. et al. An integrated cell atlas of the lung in health and disease. Nat. Med. 29, 1563–1577 (2023).
Fischer, F. et al. scTab: scaling cross-tissue single-cell annotation models. Nat. Commun. 15, 6611 (2024).
Consens, M. E. et al. To transformers and beyond: large language models for the genome. Preprint at https://arxiv.org/abs/2311.07621 (2023).
Boiarsky, R., Singh, N., Buendia, A., Getz, G. & Sontag, D. A deep dive into single-cell RNA sequencing foundation models. Preprint at bioRxiv https://doi.org/10.1101/2023.10.19.563100 (2023).
Balestriero, R. et al. Contrastive and non-contrastive self-supervised learning recover global and local spectral embedding methods. Adv. Neural Inf. Process. Syst. 35, 26671–26685 (2022).
Weng, L. et al. Self-supervised learning: self-prediction and contrastive learning. Adv. Neural Inf. Process. Syst. https://nips.cc/media/neurips-2021/Slides/21895.pdf (2021).
Uelwer, T. et al. A survey on self-supervised representation learning. Preprint at https://arxiv.org/abs/2308.11455 (2023).
Bardes, A., Ponce, J. & LeCun, Y. VICReg: Variance-Invariance-Covariance regularization for self-supervised learning. Int. Conf. Learn. Represent. https://openreview.net/forum?id=xm6YD62D1Ub (2022).
Chen, T., Kornblith, S., Norouzi, M. & Hinton, G. A simple framework for contrastive learning of visual representations. In Proc. 37th International Conference on Machine Learning Vol. 119 (eds Iii, H. D. & Singh, A.) 1597–1607 (PMLR, 2020).
Radford, A., Narasimhan, K., Salimans, T. & Sutskever, I. Improving language understanding by generative pre-training (2018); https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
Devlin, J. et al. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (eds Burstein, J. et al.) 4171–4186 (Association for Computational Linguistics, 2019).
Bommasani, R. et al. On the opportunities and risks of foundation models. Preprint at https://arxiv.org/abs/2108.07258 (2021).
Yang, M. et al. Contrastive learning enables rapid mapping to multimodal single-cell atlas of multimillion scale. Nat. Mach. Intell. 4, 696–709 (2022).
Xiong, Z. et al. scGCL: an imputation method for scRNA-seq data based on graph contrastive learning. Bioinformatics 39, btad098 (2023).
Yan, X., Zheng, R., Wu, F. & Li, M. CLAIRE: contrastive learning-based batch correction framework for better balance between batch mixing and preservation of cellular heterogeneity. Bioinformatics 39, btad099 (2023).
Chen, L., Zhai, Y., He, Q., Wang, W. & Deng, M. Integrating deep supervised, self-supervised and unsupervised learning for single-cell RNA-seq clustering and annotation. Genes 11, 792 (2020).
Zhang, R., Luo, Y., Ma, J., Zhang, M. & Wang, S. scPretrain: multi-task self-supervised learning for cell-type classification. Bioinformatics 38, 1607–1614 (2022).
Shen, H. et al. Miscell: an efficient self-supervised learning approach for dissecting single-cell transcriptome. iScience 24, 103200 (2021).
Wan, H., Chen, L. & Deng, M. scNAME: neighborhood contrastive clustering with ancillary mask estimation for scRNA-seq data. Bioinformatics 38, 1575–1583 (2022).
Ciortan, M. & Defrance, M. Contrastive self-supervised clustering of scRNA-seq data. BMC Bioinform. 22, 280 (2021).
Han, W. et al. Self-supervised contrastive learning for integrative single cell RNA-seq data analysis. Brief. Bioinform. 23, bbac377 (2022).
Du, L., Han, R., Liu, B., Wang, Y. & Li, J. ScCCL: single-cell data clustering based on self-supervised contrastive learning. IEEE/ACM Trans. Comput. Biol. Bioinform. 20, 2233–2241 (2023).
Peng, W. et al. Multi-network graph contrastive learning for cancer driver gene identification. IEEE Trans. Netw. Sci. Eng. 11, 3430–3440 (2024).
Zhang, W., Jiang, R., Chen, S. & Wang, Y. scIBD: a self-supervised iterative-optimizing model for boosting the detection of heterotypic doublets in single-cell chromatin accessibility data. Genome Biol. 24, 225 (2023).
Yoon, J., Zhang, Y., Jordon, J. & van der Schaar, M. VIME: extending the success of self- and semi-supervised learning to tabular domain. In Advances in Neural Information Processing Systems 33 https://proceedings.neurips.cc/paper/2020/hash/7d97667a3e056acab9aaf653807b4a03-Abstract.html (2020).
Lee, C. et al. Self-supervision enhanced feature selection with correlated gates. In Proc. 10th International Conference on Learning Representations https://openreview.net/forum?id=oDFvtxzPOx (OpenReview.net, 2022).
Geuenich, M. J., Gong, D.-W. & Campbell, K. R. The impacts of active and self-supervised learning on efficient annotation of single-cell expression data. Nat. Commun. 15, 1014 (2024).
Richter, T. et al. SpatialSSL: whole-brain spatial transcriptomics in the mouse brain with self-supervised learning (2023).
Chen, J. et al. Transformer for one stop interpretable cell type annotation. Nat. Commun. 14, 223 (2023).
Tang, W. et al. Single-cell multimodal prediction via transformers. In Proc. 32nd ACM International Conference on Information and Knowledge Management 2422–2431 (CIKM, 2023).
Theodoris, C. V. et al. Transfer learning enables predictions in network biology. Nature 618, 616–624 (2023).
Cui, H. et al. scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nat. Methods 21, 1470–1480 (2024).
Yang, F. et al. scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data. Nat. Mach. Intell. 4, 852–866 (2022).
Schaar, A. C. et al. Nicheformer: a foundation model for single-cell and spatial omics. Preprint at bioRxiv https://doi.org/10.1101/2024.04.15.589472 (2024).
Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018).
Lotfollahi, M. et al. Predicting cellular responses to complex perturbations in high-throughput screens. Mol. Syst. Biol. 19, e11517 (2023).
Goldblum, M. et al. Battle of the backbones: a large-scale comparison of pretrained models across computer vision tasks. In Proc. 37th Conference on Neural Information Processing Systems, Datasets and Benchmarks Track https://openreview.net/forum?id=1yOnfDpkVe (NeurIPS, 2023).
Smith, S. L., Brock, A., Berrada, L. & De, S. ConvNets match vision transformers at scale. Preprint at https://arxiv.org/abs/2310.19909 (2023).
Radford, A. et al. Robust speech recognition via large-scale weak supervision. In Proc. 40th International Conference on Machine Learning Vol. 202 (eds Krause, A. et al.) 28492–28518 (PMLR, 2023).
Dann, E. et al. Precise identification of cell states altered in disease using healthy single-cell references. Nat. Genet. 55, 1998–2008 (2023).
CZI Single-Cell Biology Program et al. CZ CELL×GENE Discover: a single-cell data platform for scalable exploration, analysis and modeling of aggregated data. Preprint at bioRxiv https://doi.org/10.1101/2023.10.30.563174 (2023).
He, K. et al. Masked autoencoders are scalable vision learners. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition 15979–15988 (IEEE, 2022).
Grill, J.-B. et al. Bootstrap your own latent—a new approach to self-supervised learning. In Advances in Neural Information Processing Systems 21271–21284 (Curran Associates, 2020).
Zbontar, J., Jing, L., Misra, I., LeCun, Y. & Deny, S. Barlow twins: self-supervised learning via redundancy reduction. In Proc. 38th International Conference on Machine Learning 12310–12320 (PMLR, 2021).
Yoshida, M. et al. Local and systemic responses to SARS-CoV-2 infection in children and adults. Nature 602, 321–327 (2022).
Tabula Sapiens Consortium et al. The Tabula Sapiens: a multiple-organ, single-cell transcriptomic atlas of humans. Science 376, eabl4896 (2022).
Fleck, J. S., Camp, J. G. & Treutlein, B. What is a cell type? Science 381, 733–734 (2023).
Heimberg, G. et al. Scalable querying of human cell atlases via a foundational model reveals commonalities across fibrosis-associated macrophages. Preprint at bioRxiv https://doi.org/10.1101/2023.07.18.549537 (2023).
Siletti, K. et al. Transcriptomic diversity of cell types across the adult human brain. Science 382, eadd7046 (2023).
Velmeshev, D. et al. Single-cell analysis of prenatal and postnatal human cortical development. Science 382, eadf0834 (2023).
Ivanova, E. et al. mRNA COVID-19 vaccine elicits potent adaptive immune response without the acute inflammation of SARS-CoV-2 infection. iScience 26, 108572 (2023).
Jorstad, N. L. et al. Comparative transcriptomics reveals human-specific cortical features. Science 382, eade9516 (2023).
Heumos, L. et al. Best practices for single-cell analysis across modalities. Nat. Rev. Genet. 24, 550–572 (2023).
Luecken, M. et al. A sandbox for prediction and integration of DNA, RNA, and proteins in single cells. In Proc. 35th Conference on Neural Information Processing Systems Track on Datasets and Benchmarks (eds Vanschoren, J. & Yeung, S.) https://datasets-benchmarks-proceedings.neurips.cc/paper_files/paper/2021/file/158f3069a435b314a80bdcb024f8e422-Paper-round2.pdf (2021).
Stoeckius, M. et al. Simultaneous epitope and transcriptome measurement in single cells. Nat. Methods 14, 865–868 (2017).
Gayoso, A. et al. Joint probabilistic modeling of single-cell multi-omic data with totalVI. Nat. Methods 18, 272–282 (2021).
Travaglini, K. J. et al. A molecular cell atlas of the human lung from single-cell RNA sequencing. Nature 587, 619–625 (2020).
Wang, A. et al. Single-cell multiomic profiling of human lungs reveals cell-type-specific and age-dynamic control of SARS-CoV2 host genes. eLife 9, e62522 (2020).
Melms, J. C. et al. A molecular single-cell lung atlas of lethal COVID-19. Nature 595, 114–119 (2021).
Luecken, M. D. et al. Benchmarking atlas-level data integration in single-cell genomics. Nat. Methods 19, 41–50 (2022).
Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).
von Kügelgen, J. et al. Self-supervised learning with data augmentations provably isolates content from style. In Advances in Neural Information Processing Systems 16451–16467 (Curran Associates, 2021).
Liu, H., et al. Self-supervised learning is more robust to dataset imbalance. In NeurIPS 2021 Workshop on Distribution Shifts: Connecting Methods and Applications https://openreview.net/forum?id=vUz4JPRLpGx (2021).
Cao, S., Xu, P. & Clifton, D. A. How to understand masked autoencoders. Preprint at https://arxiv.org/abs/2202.03670 (2022).
Subramanian, A. et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA 102, 15545–15550 (2005).
Liberzon, A. et al. Molecular signatures database (MSigDB) 3.0. Bioinformatics 27, 1739–1740 (2011).
Liberzon, A. et al. The Molecular Signatures Database (MSigDB) hallmark gene set collection. Cell Syst. 1, 417–425 (2015).
Kolmykov, S. et al. GTRD: an integrated view of transcription regulation. Nucleic Acids Res. 49, D104–D111 (2021).
Xie, X. et al. Systematic discovery of regulatory motifs in human promoters and 3′ UTRs by comparison of several mammals. Nature 434, 338–345 (2005).
Bredikhin, D., Kats, I. & Stegle, O. MUON: multimodal omics analysis framework. Genome Biol. 23, 42 (2022).
Chen, X. & He, K. Exploring simple Siamese representation learning. Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. 15745–15753 (2020).
Richter, T. & Bahrami, M. Theislab/ssl_in_scg: first release. Zenodo https://doi.org/10.5281/zenodo.13358873 (2024).
Acknowledgements
We thank F. Fischer for his valuable assistance with scTab and his constructive comments, which improved our work’s narrative. For the cross-modality prediction task, we thank A. Litinetskaya for her valuable feedback. We also thank M. Stahl, X. and A. Chernysheva for their contributions during their master practical course, which laid the groundwork for the multiomics task (together with Y. Xia, who continued afterwards). We are particularly grateful to A. Palma for his feedback on the paper’s storyline and to A. Palma, A. Szałata and E. Roellin for their valuable input on the paper, greatly enhancing its quality. We thank F. Curion for her input, sparking our exploration of isolated masked autoencoders, and her feedback on the multiomics application. T.R. and M.B. are supported by the Helmholtz Association under the joint research school ‘Munich School For Data Science’. T.R. and F.J.T. acknowledge support by the Helmholtz Association’s Initiative and Networking Fund through CausalCellDynamics (grant number Interlabs-0029), F.J.T. acknowledges support by the European Union (ERC, DeepCell - 101054957). Views and opinions expressed are, however, those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them. The language of this paper was refined using ChatGPT by OpenAI and Grammarly by Grammarly Inc.
Funding
Open access funding provided by Helmholtz Zentrum München - Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH).
Author information
Authors and Affiliations
Contributions
T.R., D.S.F. and F.J.T. conceptualized the project. T.R. led pilot analyses, method development and implementation. T.R. and F.J.T. outlined the downstream analyses. T.R. performed the cell-type prediction and gene-expression reconstruction studies. T.R. and Y.X. undertook the cross-modality prediction analysis, and M.B. performed the data integration study. The paper was written by T.R., M.B., D.S.F. and F.J.T., with all authors contributing to discussions and providing comments on the paper.
Corresponding author
Ethics declarations
Competing interests
F.J.T. consults for Immunai, CytoReason, Cellarity and Omniscope and has an ownership interest in Dermagnostix GmbH and Cellarity. The other authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks Qi Liu, Qing Nie and Zhiyuan Yuan for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Figs. 1–6 and Table 1.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Richter, T., Bahrami, M., Xia, Y. et al. Delineating the effective use of self-supervised learning in single-cell genomics. Nat Mach Intell 7, 68–78 (2025). https://doi.org/10.1038/s42256-024-00934-3
Received:
Accepted:
Published:
Issue date:
DOI: https://doi.org/10.1038/s42256-024-00934-3