Abstract
Single-cell perturbation technologies enable systematic investigation of gene functions and regulatory networks with single-cell resolution. However, performing large-scale and combinatorial perturbation screens poses notable challenges due to their exponentially increased complexity. Computational methods, including foundation models, have been developed to predict perturbation effects. Yet despite claims of promising performance, concerns remain about their true efficacy, particularly when evaluated across diverse and previously unseen cellular contexts and perturbation scenarios. Here, we present a comprehensive benchmark of 27 methods for single-cell perturbation response prediction, evaluated across 29 datasets using 6 complementary performance metrics. By evaluating them under multiple scenarios, we systematically assess their generalizability, including that of emerging foundation models. Our results provide practical guidance for method selection and underscore the need for cellular context embedding approaches to enhance the generalizability of perturbation effect prediction in single-cell research.
Data availability
We have uploaded all the processed datasets used in our benchmark study to Figshare at https://doi.org/10.6084/m9.figshare.28143422 (ref. 63) and https://doi.org/10.6084/m9.figshare.28147883 (ref. 64) and Zenodo via https://doi.org/10.5281/zenodo.14607156 (ref. 65) and https://doi.org/10.5281/zenodo.14638779 (ref. 66).
The kangCrossCell (ref. 33), kangCrossPatient (ref. 33) and Haber (ref. 37) datasets consist of preprocessed data from Lotfollahi et al., which can be downloaded directly from Google Drive via https://drive.google.com/drive/folders/1n1SLbXha4OH7j7zZ0zZAxrj_-2kczgl8. The Parekh (ref. 58), CrossPatient (ref. 35) and sciPlex3 (ref. 1) datasets were obtained from the PerturBase database at http://www.perturbase.cn/. KaggleCrossPatient and KaggleCrossCell can be downloaded from the Kaggle competition webpage via https://www.kaggle.com/competitions/open-problems-single-cell-perturbations/data/. The McFarland (ref. 32) dataset was downloaded from the scPerturb database (version 1.3), which is available at Zenodo via https://doi.org/10.5281/zenodo.10044268 (ref. 59). CrossSpecies (ref. 34) is available at https://github.com/theislab/scgen-reproducibility/blob/master/code/DataDownloader.py/. The Afriat (ref. 60) perturbation dataset was downloaded from the biolord GitHub tutorial site via https://biolord.readthedocs.io/en/latest/tutorials/biolord_pipeline.html. The sciPlex3_comb dataset (ref. 1) was downloaded from the CPA tutorial website and can be accessed at https://drive.google.com/uc?export=download&id=1RRV0_qYKGTvD3oCklKfoZQFYqKJy4l6t. The remaining 16 datasets were downloaded from the PerturBase database via http://www.perturbase.cn/: Adamson (ref. 3), Frangieh (ref. 45), TianActivation (ref. 44), TianInhibition (ref. 44), Replogle_exp6 (ref. 43), Replogle_exp7 (ref. 43), Replogle_exp8 (ref. 43), Papalexi (ref. 42), Replogle_RPE1essential (ref. 41), Replogle_K562essential (ref. 41), Norman (ref. 40), Wessels (ref. 39), Schmidt (ref. 38), sciPlex_A549 (ref. 1), sciPlex3_K562 (ref. 1) and sciPlex3_MCF7 (ref. 1). Source data are provided with this paper.
Code availability
The scripts used in this study are available via GitHub at https://github.com/bm2-lab/scPerturBench/ and Zenodo at https://doi.org/10.5281/zenodo.15904698 (ref. 65). To promote transparency and reproducibility, we provide a Podman image that contains all major scripts used in our benchmark, along with the complete set of preconfigured conda environments (https://github.com/bm2-lab/scPerturBench/). We have created an online platform that hosts the benchmarking results of all evaluated tools across all datasets and performance metrics included in our study (https://bm2-lab.github.io/scPerturBench-reproducibility/).
References
Srivatsan, S. R. et al. Massively multiplex chemical transcriptomics at single-cell resolution. Science 367, 45–51 (2020).
Dixit, A. et al. Perturb-seq: dissecting molecular circuits with scalable single-cell RNA profiling of pooled genetic screens. Cell 167, 1853–1866 (2016).
Adamson, B. et al. A multiplexed single-cell CRISPR screening platform enables systematic dissection of the unfolded protein response. Cell 167, 1867–1882 (2016).
Lotfollahi, M., Wolf, F. A. & Theis, F. J. scGen predicts single-cell perturbation responses. Nat. Methods 16, 715–721 (2019).
Roohani, Y., Huang, K. & Leskovec, J. Predicting transcriptional outcomes of novel multigene perturbations with GEARS. Nat. Biotechnol. 42, 927–935 (2024).
Hao, M. et al. Large-scale foundation model on single-cell transcriptomics. Nat. Methods 21, 1481–1491 (2024).
Cui, H. et al. scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nat. Methods 21, 1470–1480 (2024).
Yang, F. et al. scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data. Nat. Mach. Intell. 4, 852–866 (2022).
Theodoris, C. V. et al. Transfer learning enables predictions in network biology. Nature 618, 616–624 (2023).
Wu, Y. et al. PerturBench: benchmarking machine learning models for cellular perturbation analysis. Preprint at https://arxiv.org/abs/2408.10609 (2024).
Bendidi, I. et al. Benchmarking transcriptomics foundation models for perturbation analysis: one PCA still rules them all. Preprint at https://arxiv.org/abs/2410.13956 (2024).
Ahlmann-Eltze, C., Huber, W. & Anders, S. Deep-learning-based gene perturbation effect prediction does not yet outperform simple linear baselines. Nat. Methods 22, 1657–1661 (2025).
Peidli, S. et al. scPerturb: harmonized single-cell perturbation data. Nat. Methods 21, 531–540 (2024).
Bunne, C. et al. Learning single-cell perturbation responses using neural optimal transport. Nat. Methods 20, 1759–1768 (2023).
Jiang, Q., Chen, S., Chen, X. & Jiang, R. scPRAM accurately predicts single-cell gene expression perturbation response based on attention mechanism. Bioinformatics 40, btae265 (2024).
Piran, Z., Cohen, N., Hoshen, Y. & Nitzan, M. Disentanglement of single-cell data with biolord. Nat. Biotechnol. 42, 1678–1683 (2024).
Yeh, C. -H., Chen, Z. -G., Liou, C. -Y. & Chen, M. -J. Homogeneous space construction and projection for single-cell expression prediction based on deep learning. Bioengineering 10, 996 (2023).
Zhang, Z., Zhao, X., Bindra, M., Qiu, P. & Zhang, X. scDisInFact: disentangled learning for integration and prediction of multi-batch multi-condition single-cell RNA-sequencing data. Nat. Commun. 15, 912 (2024).
Wei, X., Dong, J. & Wang, F. scPreGAN, a deep generative model for predicting the response of single-cell expression to perturbation. Bioinformatics 38, 3377–3384 (2022).
Wang, H., Wang, Y., Jiang, Q., Zhang, Y. & Chen, S. SCREEN: predicting single-cell gene expression perturbation responses via optimal transport. Front. Comput. Sci. 18, 2095–2228 (2024).
Kana, O. et al. Generative modeling of single-cell gene expression for dose-dependent chemical perturbations. Patterns 4, 100817 (2023).
Lotfollahi, M., Naghipourfar, M., Theis, F. J. & Wolf, F. A. Conditional out-of-distribution generation for unpaired data using transfer VAE. Bioinformatics 36, i610–i617 (2020).
Bai, D., Ellington, C. N., Mo, S., Song, L. & Xing, E. P. AttentionPert: accurately modeling multiplexed genetic perturbations with multi-scale effects. Bioinformatics 40, i453–i461 (2024).
Lotfollahi, M. et al. Predicting cellular responses to complex perturbations in high-throughput screens. Mol. Syst. Biol. 19, e11517 (2023).
Chen, Y. & Zou, J. Simple and effective embedding model for single-cell biology built from ChatGPT. Nat. Biomed. Eng. 9, 483–493 (2025).
Hetzel, L., Boehm, S., Kilbertus, N., Günnemann, S. & Theis, F. Predicting cellular responses to novel drug perturbations at a single-cell resolution. Adv. Neural Inf. Process. Syst. 35, 26711–26722 (2022).
Zhu, O. & Li, J. Scouter: predicting transcriptional responses to genetic perturbations with LLM embeddings. Preprint at bioRxiv https://doi.org/10.1101/2024.12.06.627290 (2024).
Liu, T., Chen, T., Zheng, W., Luo, X. & Zhao, H. scELMo: embeddings from language models are good learners for single-cell data analysis. Preprint at bioRxiv https://doi.org/10.1101/2023.12.07.569910 (2023).
Yang, X. et al. GeneCompass: deciphering universal gene regulatory mechanisms with a knowledge-informed cross-species foundation model. Cell Res. 34, 830–845 (2024).
Qi, X. et al. Predicting transcriptional responses to novel chemical perturbations using deep generative model for drug discovery. Nat. Commun. 15, 9256 (2024).
Huang, W. & Liu, H. Predicting single-cell cellular responses to perturbations using cycle consistency learning. Bioinformatics 40, i462–i470 (2024).
McFarland, J. M. et al. Multiplexed single-cell transcriptional response profiling to define cancer vulnerabilities and therapeutic mechanism of action. Nat. Commun. 11, 4296 (2020).
Kang, H. M. et al. Multiplexed droplet single-cell RNA-sequencing using natural genetic variation. Nat. Biotechnol. 36, 89–94 (2018).
Hagai, T. et al. Gene expression variability across cells and species shapes innate immunity. Nature 563, 197–202 (2018).
Zhao, W. et al. Deconvolution of cell type-specific drug responses in human tumor tissue with single-cell RNA-seq. Genome Med. 13, 82 (2021).
Nault, R., Fader, K. A., Bhattacharya, S. & Zacharewski, T. R. Single-nuclei RNA sequencing assessment of the hepatic effects of 2,3,7,8-tetrachlorodibenzo-p-dioxin. Cell Mol. Gastroenterol. Hepatol. 11, 147–159 (2021).
Haber, A. L. et al. A single-cell survey of the small intestinal epithelium. Nature 551, 333–339 (2017).
Schmidt, R. et al. CRISPR activation and interference screens decode stimulation responses in primary human T cells. Science 375, eabj4008 (2022).
Wessels, H. H. et al. Efficient combinatorial targeting of RNA transcripts in single cells with Cas13 RNA Perturb-seq. Nat. Methods 20, 86–94 (2023).
Norman, T. M. et al. Exploring genetic interaction manifolds constructed from rich single-cell phenotypes. Science 365, 786–793 (2019).
Replogle, J. M. et al. Mapping information-rich genotype-phenotype landscapes with genome-scale Perturb-seq. Cell 185, 2559–2575 (2022).
Papalexi, E. et al. Characterizing the molecular regulation of inhibitory immune checkpoints with multimodal single-cell screens. Nat. Genet. 53, 322–331 (2021).
Replogle, J. M. et al. Combinatorial single-cell CRISPR screens by direct guide RNA capture and targeted sequencing. Nat. Biotechnol. 38, 954–961 (2020).
Tian, R. et al. Genome-wide CRISPRi/a screens in human neurons link lysosomal failure to ferroptosis. Nat. Neurosci. 24, 1020–1034 (2021).
Frangieh, C. J. et al. Multimodal pooled Perturb-CITE-seq screens in patient models define mechanisms of cancer immune evasion. Nat. Genet. 53, 332–341 (2021).
Wei, Z. et al. PerturBase: a comprehensive database for single-cell perturbation data analysis and visualization. Nucleic Acids Res. 53, D1099–D1111 (2025).
Ji, Y. et al. Optimal distance metrics for single-cell RNA-seq populations. Preprint at bioRxiv https://doi.org/10.1101/2023.12.26.572833 (2023).
Gaudelet, T. et al. Season combinatorial intervention predictions with Salt & Peper. Preprint at https://arxiv.org/html/2404.16907v1 (2024).
Luecken, M. D. et al. Benchmarking atlas-level data integration in single-cell genomics. Nat. Methods 19, 41–50 (2022).
Yuan, Z. et al. Benchmarking spatial clustering methods with spatially resolved transcriptomics data. Nat. Methods 21, 712–722 (2024).
Shahapure, K. R. & Nicholas, C. Cluster quality analysis using silhouette score. In 2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA) 747–748 (IEEE, 2020).
Gao, Y. et al. Toward subtask-decomposition-based learning and benchmarking for predicting genetic perturbation outcomes and beyond. Nat. Comput. Sci. 4, 773–785 (2024).
Subramanian, A. et al. A next generation connectivity map: L1000 platform and the first 1,000,000 profiles. Cell 171, 1437–1452 (2017).
Song, D. et al. scDesign3 generates realistic in silico data for multimodal single-cell and spatial omics. Nat. Biotechnol. 42, 247–252 (2024).
Rood, J. E., Hupalowska, A. & Regev, A. Toward a foundation model of causal cell and tissue biology with a perturbation cell and tissue atlas. Cell 187, 4520–4545 (2024).
Zhang, J. et al. Tahoe-100M: a giga-scale single-cell perturbation atlas for context-dependent gene function and cellular modeling. Preprint at bioRxiv https://doi.org/10.1101/2025.02.20.639398 (2025).
Huang, A. C. et al. X-Atlas/Orion: genome-wide Perturb-seq datasets via a scalable fix-cryopreserve platform for training dose-dependent biological foundation models. Preprint at bioRxiv https://doi.org/10.1101/2025.06.11.659105 (2025).
Parekh, U. et al. Mapping cellular reprogramming via pooled overexpression screens with paired fitness and single-cell RNA-sequencing readout. Cell Syst. 7, 548–555 (2018).
Peidli, S. et al. scPerturb single-cell perturbation data: RNA and protein h5ad files (1.3). Zenodo https://doi.org/10.5281/zenodo.10044268 (2022).
Afriat, A. et al. A spatiotemporally resolved single-cell atlas of the Plasmodium liver stage. Nature 611, 563–569 (2022).
Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).
Heumos, L. et al. Pertpy: an end-to-end framework for perturbation analysis. Preprint at bioRxiv https://doi.org/10.1101/2024.08.04.606516 (2024).
Zhiting, W. Cellular context generalization datasets. Figshare https://doi.org/10.6084/m9.figshare.28143422 (2025).
Zhiting, W. Perturbation generalization datasets. Figshare https://doi.org/10.6084/m9.figshare.28147883 (2025).
Wei, Z. et al. Benchmarking algorithms for generalizable single-cell perturbation response prediction. Zenodo https://doi.org/10.5281/zenodo.14607156 (2025).
Zhiting, W. Perturbation generalization H5ad datasets. Zenodo https://doi.org/10.5281/zenodo.14638780 (2025).
Acknowledgements
We gratefully acknowledge all single-cell perturbation dataset owners for generously sharing their data. Q.L. was supported by the National Key Research and Development Program of China (grant no. 2025YFC3409300), the National Natural Science Foundation of China (grant nos. T2425019 and 32341008), the Shanghai Pilot Program for Basic Research, the Shanghai Science and Technology Innovation Action Plan (Key Specialization in Computational Biology), the Shanghai Shuguang Scholars Project, the Shanghai Excellent Academic Leader Project, the Shanghai Municipal Science and Technology Major Project (2021SHZDZX0100) and the Fundamental Research Funds for the Central Universities. Funding for the open access charge: National Natural Science Foundation of China. Z.W. was supported by the National Natural Science Foundation of China (grant no. 32500555).
Author information
Contributions
Z.W., Y.W., Y.C.G., A.L., G.C. and Q.L. conceived and designed the study. Z.W. designed the metrics, established the benchmarking pipeline and collected the methods and datasets. Z.W. and Y.W. implemented the benchmarking pipeline. Y.C.G. developed the cell-line embedding strategy. Z.W. and Y.W. analyzed the results and generated the figures with help from P.L., D.S., Y.L.G., S.Q.W., D.L., K.D., X.Y., C.T., S.F., X.C., W.L., Y.Y. and C.Z. Z.W., Y.W. and Q.L. wrote the manuscript. Z.W. and S.G.W. designed the website. Q.L., G.C. and A.L. supervised the entire project. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Methods thanks Yi Zhao and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Lin Tang, in collaboration with the Nature Methods team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Correlation of model performance across 13 commonly used evaluation metrics.
(a–d) Pairwise correlations among the 13 metrics used to assess model performance on the KangCrossCell, Haber, Replogle-RPE1essential, and Replogle-K562essential datasets, respectively. Each panel displays the Spearman correlation coefficients between different metrics, computed across all prediction methods evaluated in each dataset.
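The pairwise comparison described in this caption can be illustrated with a minimal, standard-library sketch. It assumes one score per method for each metric, listed in the same method order; the metric name `third_metric`, the helper names and all numbers are hypothetical and are not taken from the study.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def pairwise_spearman(scores):
    """scores maps metric name -> per-method scores (same method order)."""
    return {
        (a, b): spearman(scores[a], scores[b])
        for a, b in combinations(sorted(scores), 2)
    }


# Hypothetical scores for four methods (illustrative numbers only).
toy_scores = {
    "MSE": [0.10, 0.20, 0.30, 0.40],
    "PCC_delta": [0.90, 0.70, 0.80, 0.50],
    "third_metric": [1.0, 2.0, 3.0, 4.0],
}
correlations = pairwise_spearman(toy_scores)
```

In this toy input, the two error-like metrics rank the methods identically, whereas the correlation-like metric ranks them in nearly the opposite order, which is the kind of relationship such a correlation matrix makes visible.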
Extended Data Fig. 2 Effects of covariates on model performance across datasets.
(a) For each of the 12 benchmark datasets, we evaluated the impact of covariates—including cellular context, perturbation, and model (method)—on prediction performance using ANOVA based on ordinary least squares (OLS) regression. In datasets containing only a single perturbation condition (KangCrossCell, CrossSpecies, KangCrossPatient and TCDD), only cellular context and model were included as covariates. For the remaining datasets, perturbation identity was additionally included as a third covariate.
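As a rough illustration of how variance in a performance score can be attributed to covariates, the sketch below computes a main-effects ANOVA table for a balanced, fully crossed toy design. This is a simplification made for the example: in the balanced case the between-level sums of squares coincide with those of an OLS fit without interaction terms, but the study's own model specification may differ. The function name, field names and values are hypothetical, and p-values (which would come from the F distribution) are omitted to keep the code dependency-free.

```python
from collections import defaultdict


def main_effects_anova(records, factors, response):
    """Main-effects ANOVA for a balanced, fully crossed design.

    records: list of dicts, one observation per factor-level combination.
    Returns {factor: (sum_sq, df, F)} plus a "residual" entry (sum_sq, df, None).
    """
    n = len(records)
    grand_mean = sum(r[response] for r in records) / n
    ss_total = sum((r[response] - grand_mean) ** 2 for r in records)

    table, ss_explained, df_explained = {}, 0.0, 0
    for factor in factors:
        groups = defaultdict(list)
        for r in records:
            groups[r[factor]].append(r[response])
        # Between-level sum of squares. With a balanced design the factors
        # are orthogonal, so these terms add up as in an OLS decomposition.
        ss = sum(len(v) * (sum(v) / len(v) - grand_mean) ** 2 for v in groups.values())
        df = len(groups) - 1
        table[factor] = (ss, df)
        ss_explained += ss
        df_explained += df

    ss_resid = ss_total - ss_explained
    df_resid = n - 1 - df_explained
    ms_resid = ss_resid / df_resid
    result = {f: (ss, df, (ss / df) / ms_resid) for f, (ss, df) in table.items()}
    result["residual"] = (ss_resid, df_resid, None)
    return result


# Hypothetical MSE values for two cellular contexts and two methods.
toy_records = [
    {"context": "A", "model": "m1", "mse": 1.0},
    {"context": "A", "model": "m2", "mse": 2.0},
    {"context": "B", "model": "m1", "mse": 3.0},
    {"context": "B", "model": "m2", "mse": 5.0},
]
anova = main_effects_anova(toy_records, ["context", "model"], "mse")
```

In this toy table the cellular context accounts for a larger sum of squares than the model. For datasets with several perturbations, a third factor could be passed in the same way, provided the design stays balanced.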
Extended Data Fig. 3 Effects of covariates on model performance across multicondition datasets.
(a–e) For each of the 5 multicondition benchmark datasets, we evaluated the impact of covariates, including cellular context, perturbation, model (method) and time-point/dosage, on prediction performance using ANOVA based on ordinary least squares (OLS) regression. Only scDisInFact, biolord and scVIDR explicitly incorporate time-point/dosage information. Consequently, only these three methods were evaluated on the 5 multicondition datasets. In datasets containing only a single perturbation condition (CrossSpecies and TCDD), only cellular context, model and time-point/dosage were included as covariates. For the remaining datasets, perturbation identity was additionally included as a covariate.
Extended Data Fig. 4 Impact of inter- and intra-heterogeneity on model performance.
(a–b) Correlation between model performance and inter-heterogeneity, as measured by MSE and PCC-delta. A higher degree of inter-heterogeneity indicates greater variation across cellular contexts, which typically increases the difficulty of generalizing perturbation effects. A linear regression line with 95% confidence interval is shown. Pearson correlation coefficients were calculated, and statistical significance was assessed using two-sided t-tests. Adjustments were not made for multiple comparisons. (c–d) Correlation between model performance and intra-heterogeneity, as measured by MSE and PCC-delta. In both cases, results from test contexts across 12 datasets were aggregated, with each point representing a test context within a specific dataset. MSE and PCC-delta were chosen as representative performance metrics (see Methods). A linear regression line with 95% confidence interval is shown. Pearson correlation coefficients were calculated, and statistical significance was assessed using two-sided t-tests. Adjustments were not made for multiple comparisons. (e) Two-way ANOVA assessing the effects of inter- and intra-heterogeneity on model performance, as measured by MSE. (f) Two-way ANOVA assessing the effects of inter- and intra-heterogeneity on model performance, as measured by PCC-delta.
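For readers unfamiliar with the two representative metrics, the following is a minimal sketch under stated assumptions: MSE is taken as the mean squared difference between predicted and observed mean expression, and PCC-delta as the Pearson correlation between predicted and observed expression changes relative to control. These are commonly used forms; the exact definitions used in the study are given in its Methods and may differ. All function names and numbers are hypothetical.

```python
from math import sqrt


def mean_squared_error(pred, true):
    """Mean squared difference between two equal-length expression vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def pcc_delta(pred, true, control):
    """Pearson correlation of predicted vs. observed change from control.

    Assumed definition: each vector holds per-gene mean expression.
    """
    pred_delta = [p - c for p, c in zip(pred, control)]
    true_delta = [t - c for t, c in zip(true, control)]
    return pearson(pred_delta, true_delta)


# Hypothetical per-gene mean expression for four genes.
control = [1.0, 1.0, 1.0, 1.0]
observed = [2.0, 1.0, 3.0, 1.0]   # perturbed cells
predicted = [1.5, 1.0, 2.0, 1.0]  # model output: right direction, half the size

toy_mse = mean_squared_error(predicted, observed)
toy_pcc_delta = pcc_delta(predicted, observed, control)
```

The toy prediction recovers the direction of every change but underestimates its magnitude, so PCC-delta is perfect while MSE is not zero. This illustrates why the two metrics are complementary.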
Extended Data Fig. 5 Impact of fine-tuning set size on model performance.
(a) Impact of fine-tuning set size on model performance in the Replogle-K562essential dataset. (b) Impact of fine-tuning set size on model performance in the Replogle-RPE1essential dataset.
Supplementary information
Supplementary Information
Supplementary Notes 1–18 and Figs. 1–45
Supplementary Table 1
Metrics used in single-cell perturbation prediction methods.
Supplementary Table 2
Effects of covariates on model performance across multi-condition datasets.
Supplementary Table 3
The detailed information of datasets in the cellular context and perturbation generalization scenario.
Supplementary Table 4
The detailed information of simulated datasets for robustness experiments in the cellular context and perturbation generalization scenario.
Supplementary Table 5
Comparison of method performance between our study and prior studies.
Source data
Source Data Fig. 1
Datasets for benchmarking the single-cell perturbation effect prediction.
Source Data Fig. 2
Benchmarking results for the o.o.d. setting in the cellular context generalization scenario.
Source Data Fig. 3
Limitation of current methods in the cellular context generalization scenario.
Source Data Fig. 4
Benchmarking results for genetic perturbation in the perturbation generalization scenario.
Source Data Fig. 5
Benchmarking results in the perturbation generalization scenario.
Source Data Extended Data Fig./Table 1
Correlation of model performance across 13 commonly used evaluation metrics.
Source Data Extended Data Fig./Table 2
Effects of covariates on model performance across datasets.
Source Data Extended Data Fig./Table 3
Effects of covariates on model performance across multi-condition datasets.
Source Data Extended Data Fig./Table 4
Impact of inter-heterogeneity and intra-heterogeneity on model performance.
Source Data Extended Data Fig./Table 5
Impact of fine-tuning set size on model performance.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wei, Z., Wang, Y., Gao, Y. et al. Benchmarking algorithms for generalizable single-cell perturbation response prediction. Nat Methods 23, 451–464 (2026). https://doi.org/10.1038/s41592-025-02980-0
DOI: https://doi.org/10.1038/s41592-025-02980-0


