Benchmarking algorithms for generalizable single-cell perturbation response prediction

Wei, Zhiting; Wang, Yiheng; Gao, Yicheng; Wang, Shuguang; Li, Ping; Si, Duanmiao; Gao, Yuli; Wu, Siqi; Li, Danlu; Dong, Kejing; Yang, Xingbo; Tang, Chen; Fu, Shaliu; Chen, Xiaohan; Li, Wannian; You, Yuzhou; Zhang, Chen; Liang, Aibin; Chuai, Guohui; Liu, Qi

doi:10.1038/s41592-025-02980-0

Analysis
Published: 11 December 2025

Zhiting Wei ORCID: orcid.org/0009-0003-7382-6284^1,2,3,4^na1,
Yiheng Wang^1,2,5^na1,
Yicheng Gao^1,2^na1,
Shuguang Wang ORCID: orcid.org/0000-0002-4425-3291^1,2^na1,
Ping Li^1^,
Duanmiao Si^1,2,
Yuli Gao^1,2,
Siqi Wu^1,2,
Danlu Li^1,2,
Kejing Dong ORCID: orcid.org/0009-0004-2805-4438^1,2,
Xingbo Yang^1,2,
Chen Tang^1,2,
Shaliu Fu ORCID: orcid.org/0000-0003-1707-5474^1,2,
Xiaohan Chen^1,2,
Wannian Li^1,2,
Yuzhou You^1,2,
Chen Zhang^1,2,
Aibin Liang ORCID: orcid.org/0000-0002-8978-1987^1^,
Guohui Chuai ORCID: orcid.org/0000-0003-2423-8411^1,2,3 &
Qi Liu ORCID: orcid.org/0000-0003-2578-1221^1,2,3

Nature Methods volume 23, pages 451–464 (2026)

Abstract

Single-cell perturbation technologies enable systematic investigation of gene functions and regulatory networks with single-cell resolution. However, performing large-scale and combinatorial perturbation screens poses notable challenges due to their exponentially increased complexity. Computational methods, including foundation models, have been developed to predict perturbation effects. Yet despite claims of promising performance, concerns remain about their true efficacy, particularly when evaluated across diverse and previously unseen cellular contexts and perturbation scenarios. Here, we present a comprehensive benchmark of 27 methods for single-cell perturbation response prediction, evaluated across 29 datasets using 6 complementary performance metrics. By evaluating them under multiple scenarios, we systematically assess their generalizability, including that of emerging foundation models. Our results provide practical guidance for method selection and underscore the need for cellular context embedding approaches to enhance the generalizability of perturbation effect prediction in single-cell research.

**Fig. 1: Workflow and datasets for benchmarking the single-cell perturbation effect prediction.**

**Fig. 2: Benchmarking results for the o.o.d. setting in the cellular context generalization scenario.**

**Fig. 3: Limitation of current methods in the cellular context generalization scenario.**

**Fig. 4: Benchmarking results for genetic perturbation in the perturbation generalization scenario.**

**Fig. 5: Benchmarking results in the perturbation generalization scenario.**

Data availability

We have uploaded all the processed datasets used in our benchmark study to Figshare at https://doi.org/10.6084/m9.figshare.28143422 (ref. ⁶³) and https://doi.org/10.6084/m9.figshare.28147883 (ref. ⁶⁴) and Zenodo via https://doi.org/10.5281/zenodo.14607156 (ref. ⁶⁵) and https://doi.org/10.5281/zenodo.14638779 (ref. ⁶⁶).

The kangCrossCell³³, kangCrossPatient³³ and Haber³⁷ datasets consist of preprocessed data from Lotfollahi et al., which can be downloaded directly from Google Drive via https://drive.google.com/drive/folders/1n1SLbXha4OH7j7zZ0zZAxrj_-2kczgl8. The Parekh⁵⁸, CrossPatient³⁵ and sciPlex3 (ref. ¹) datasets were obtained from the PerturBase database via http://www.perturbase.cn/. KaggleCrossPatient and KaggleCrossCell can be downloaded from the Kaggle competition webpage via https://www.kaggle.com/competitions/open-problems-single-cell-perturbations/data/. The McFarland³² dataset was downloaded from the scPerturb database (version 1.3), which is available at Zenodo via https://doi.org/10.5281/zenodo.10044268 (ref. ⁵⁹). CrossSpecies³⁴ is available at https://github.com/theislab/scgen-reproducibility/blob/master/code/DataDownloader.py/. The Afriat⁶⁰ perturbation dataset was downloaded from the biolord tutorial documentation via https://biolord.readthedocs.io/en/latest/tutorials/biolord_pipeline.html. The sciPlex3_comb dataset¹ was downloaded from the CPA tutorial website and can be accessed at https://drive.google.com/uc?export=download&id=1RRV0_qYKGTvD3oCklKfoZQFYqKJy4l6t. The remaining 16 datasets were downloaded from the PerturBase database via http://www.perturbase.cn/: Adamson³, Frangieh⁴⁵, TianActivation⁴⁴, TianInhibition⁴⁴, Replogle_exp6 (ref. ⁴³), Replogle_exp7 (ref. ⁴³), Replogle_exp8 (ref. ⁴³), Papalexi⁴², Replogle_RPE1essential⁴¹, Replogle_K562essential⁴¹, Norman⁴⁰, Wessels³⁹, Schmidt³⁸, sciPlex_A549 (ref. ¹), sciPlex3_K562 (ref. ¹) and sciPlex3_MCF7 (ref. ¹). Source data are provided with this paper.

Code availability

The scripts used in this study are available via GitHub at https://github.com/bm2-lab/scPerturBench/ and Zenodo at https://doi.org/10.5281/zenodo.15904698 (ref. ⁶⁵). To promote transparency and reproducibility, we provide a Podman image that contains all major scripts used in our benchmark, along with the complete set of preconfigured conda environments (https://github.com/bm2-lab/scPerturBench/). We have created an online platform that hosts the benchmarking results of all evaluated tools across all datasets and performance metrics included in our study (https://bm2-lab.github.io/scPerturBench-reproducibility/).

References

1. Srivatsan, S. R. et al. Massively multiplex chemical transcriptomics at single-cell resolution. Science 367, 45–51 (2020).
2. Dixit, A. et al. Perturb-seq: dissecting molecular circuits with scalable single-cell RNA profiling of pooled genetic screens. Cell 167, 1853–1866 (2016).
3. Adamson, B. et al. A multiplexed single-cell CRISPR screening platform enables systematic dissection of the unfolded protein response. Cell 167, 1867–1882 (2016).
4. Lotfollahi, M., Wolf, F. A. & Theis, F. J. scGen predicts single-cell perturbation responses. Nat. Methods 16, 715–721 (2019).
5. Roohani, Y., Huang, K. & Leskovec, J. Predicting transcriptional outcomes of novel multigene perturbations with GEARS. Nat. Biotechnol. 42, 927–935 (2024).
6. Hao, M. et al. Large-scale foundation model on single-cell transcriptomics. Nat. Methods 21, 1481–1491 (2024).
7. Cui, H. et al. scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nat. Methods 21, 1470–1480 (2024).
8. Yang, F. et al. scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data. Nat. Mach. Intell. 4, 852–866 (2022).
9. Theodoris, C. V. et al. Transfer learning enables predictions in network biology. Nature 618, 616–624 (2023).
10. Wu, Y. et al. PerturBench: benchmarking machine learning models for cellular perturbation analysis. Preprint at https://arxiv.org/abs/2408.10609 (2024).
11. Bendidi, I. et al. Benchmarking transcriptomics foundation models for perturbation analysis: one PCA still rules them all. Preprint at https://arxiv.org/abs/2410.13956 (2024).
12. Ahlmann-Eltze, C., Huber, W. & Anders, S. Deep-learning-based gene perturbation effect prediction does not yet outperform simple linear baselines. Nat. Methods 22, 1657–1661 (2025).
13. Peidli, S. et al. scPerturb: harmonized single-cell perturbation data. Nat. Methods 21, 531–540 (2024).
14. Bunne, C. et al. Learning single-cell perturbation responses using neural optimal transport. Nat. Methods 20, 1759–1768 (2023).
15. Jiang, Q., Chen, S., Chen, X. & Jiang, R. scPRAM accurately predicts single-cell gene expression perturbation response based on attention mechanism. Bioinformatics 40, btae265 (2024).
16. Piran, Z., Cohen, N., Hoshen, Y. & Nitzan, M. Disentanglement of single-cell data with biolord. Nat. Biotechnol. 42, 1678–1683 (2024).
17. Yeh, C.-H., Chen, Z.-G., Liou, C.-Y. & Chen, M.-J. Homogeneous space construction and projection for single-cell expression prediction based on deep learning. Bioengineering 10, 996 (2023).
18. Zhang, Z., Zhao, X., Bindra, M., Qiu, P. & Zhang, X. scDisInFact: disentangled learning for integration and prediction of multi-batch multi-condition single-cell RNA-sequencing data. Nat. Commun. 15, 912 (2024).
19. Wei, X., Dong, J. & Wang, F. scPreGAN, a deep generative model for predicting the response of single-cell expression to perturbation. Bioinformatics 38, 3377–3384 (2022).
20. Wang, H., Wang, Y., Jiang, Q., Zhang, Y. & Chen, S. SCREEN: predicting single-cell gene expression perturbation responses via optimal transport. Front. Comput. Sci. 18, 2095–2228 (2024).
21. Kana, O. et al. Generative modeling of single-cell gene expression for dose-dependent chemical perturbations. Patterns 4, 100817 (2023).
22. Lotfollahi, M., Naghipourfar, M., Theis, F. J. & Wolf, F. A. Conditional out-of-distribution generation for unpaired data using transfer VAE. Bioinformatics 36, i610–i617 (2020).
23. Bai, D., Ellington, C. N., Mo, S., Song, L. & Xing, E. P. AttentionPert: accurately modeling multiplexed genetic perturbations with multi-scale effects. Bioinformatics 40, i453–i461 (2024).
24. Lotfollahi, M. et al. Predicting cellular responses to complex perturbations in high-throughput screens. Mol. Syst. Biol. 19, e11517 (2023).
25. Chen, Y. & Zou, J. Simple and effective embedding model for single-cell biology built from ChatGPT. Nat. Biomed. Eng. 9, 483–493 (2025).
26. Hetzel, L., Boehm, S., Kilbertus, N., Günnemann, S. & Theis, F. Predicting cellular responses to novel drug perturbations at a single-cell resolution. Adv. Neural Inf. Process. Syst. 35, 26711–26722 (2022).
27. Zhu, O. & Li, J. Scouter: predicting transcriptional responses to genetic perturbations with LLM embeddings. Preprint at bioRxiv https://doi.org/10.1101/2024.12.06.627290 (2024).
28. Liu, T., Chen, T., Zheng, W., Luo, X. & Zhao, H. scELMo: embeddings from language models are good learners for single-cell data analysis. Preprint at bioRxiv https://doi.org/10.1101/2023.12.07.569910 (2023).
29. Yang, X. et al. GeneCompass: deciphering universal gene regulatory mechanisms with a knowledge-informed cross-species foundation model. Cell Res. 34, 830–845 (2024).
30. Qi, X. et al. Predicting transcriptional responses to novel chemical perturbations using deep generative model for drug discovery. Nat. Commun. 15, 9256 (2024).
31. Huang, W. & Liu, H. Predicting single-cell cellular responses to perturbations using cycle consistency learning. Bioinformatics 40, i462–i470 (2024).
32. McFarland, J. M. et al. Multiplexed single-cell transcriptional response profiling to define cancer vulnerabilities and therapeutic mechanism of action. Nat. Commun. 11, 4296 (2020).
33. Kang, H. M. et al. Multiplexed droplet single-cell RNA-sequencing using natural genetic variation. Nat. Biotechnol. 36, 89–94 (2018).
34. Hagai, T. et al. Gene expression variability across cells and species shapes innate immunity. Nature 563, 197–202 (2018).
35. Zhao, W. et al. Deconvolution of cell type-specific drug responses in human tumor tissue with single-cell RNA-seq. Genome Med. 13, 82 (2021).
36. Nault, R., Fader, K. A., Bhattacharya, S. & Zacharewski, T. R. Single-nuclei RNA sequencing assessment of the hepatic effects of 2,3,7,8-tetrachlorodibenzo-p-dioxin. Cell Mol. Gastroenterol. Hepatol. 11, 147–159 (2021).
37. Haber, A. L. et al. A single-cell survey of the small intestinal epithelium. Nature 551, 333–339 (2017).
38. Schmidt, R. et al. CRISPR activation and interference screens decode stimulation responses in primary human T cells. Science 375, eabj4008 (2022).
39. Wessels, H. H. et al. Efficient combinatorial targeting of RNA transcripts in single cells with Cas13 RNA Perturb-seq. Nat. Methods 20, 86–94 (2023).
40. Norman, T. M. et al. Exploring genetic interaction manifolds constructed from rich single-cell phenotypes. Science 365, 786–793 (2019).
41. Replogle, J. M. et al. Mapping information-rich genotype-phenotype landscapes with genome-scale Perturb-seq. Cell 185, 2559–2575 (2022).
42. Papalexi, E. et al. Characterizing the molecular regulation of inhibitory immune checkpoints with multimodal single-cell screens. Nat. Genet. 53, 322–331 (2021).
43. Replogle, J. M. et al. Combinatorial single-cell CRISPR screens by direct guide RNA capture and targeted sequencing. Nat. Biotechnol. 38, 954–961 (2020).
44. Tian, R. et al. Genome-wide CRISPRi/a screens in human neurons link lysosomal failure to ferroptosis. Nat. Neurosci. 24, 1020–1034 (2021).
45. Frangieh, C. J. et al. Multimodal pooled Perturb-CITE-seq screens in patient models define mechanisms of cancer immune evasion. Nat. Genet. 53, 332–341 (2021).
46. Wei, Z. et al. PerturBase: a comprehensive database for single-cell perturbation data analysis and visualization. Nucleic Acids Res. 53, D1099–D1111 (2025).
47. Ji, Y. et al. Optimal distance metrics for single-cell RNA-seq populations. Preprint at bioRxiv https://doi.org/10.1101/2023.12.26.572833 (2023).
48. Gaudelet, T. et al. Season combinatorial intervention predictions with Salt & Peper. Preprint at https://arxiv.org/html/2404.16907v1 (2024).
49. Luecken, M. D. et al. Benchmarking atlas-level data integration in single-cell genomics. Nat. Methods 19, 41–50 (2022).
50. Yuan, Z. et al. Benchmarking spatial clustering methods with spatially resolved transcriptomics data. Nat. Methods 21, 712–722 (2024).
51. Shahapure, K. R. & Nicholas, C. Cluster quality analysis using silhouette score. In 2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA) 747–748 (IEEE, 2020).
52. Gao, Y. et al. Toward subtask-decomposition-based learning and benchmarking for predicting genetic perturbation outcomes and beyond. Nat. Comput. Sci. 4, 773–785 (2024).
53. Subramanian, A. et al. A next generation connectivity map: L1000 platform and the first 1,000,000 profiles. Cell 171, 1437–1452 (2017).
54. Song, D. et al. scDesign3 generates realistic in silico data for multimodal single-cell and spatial omics. Nat. Biotechnol. 42, 247–252 (2024).
55. Rood, J. E., Hupalowska, A. & Regev, A. Toward a foundation model of causal cell and tissue biology with a perturbation cell and tissue atlas. Cell 187, 4520–4545 (2024).
56. Zhang, J. et al. Tahoe-100M: a giga-scale single-cell perturbation atlas for context-dependent gene function and cellular modeling. Preprint at bioRxiv https://doi.org/10.1101/2025.02.20.639398 (2025).
57. Huang, A. C. et al. X-Atlas/Orion: genome-wide Perturb-seq datasets via a scalable fix-cryopreserve platform for training dose-dependent biological foundation models. Preprint at bioRxiv https://doi.org/10.1101/2025.06.11.659105 (2025).
58. Parekh, U. et al. Mapping cellular reprogramming via pooled overexpression screens with paired fitness and single-cell RNA-sequencing readout. Cell Syst. 7, 548–555 (2018).
59. Peidli, S. et al. scPerturb single-cell perturbation data: RNA and protein h5ad files (1.3). Zenodo https://doi.org/10.5281/zenodo.10044268 (2022).
60. Afriat, A. et al. A spatiotemporally resolved single-cell atlas of the Plasmodium liver stage. Nature 611, 563–569 (2022).
61. Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).
62. Heumos, L. et al. Pertpy: an end-to-end framework for perturbation analysis. Preprint at bioRxiv https://doi.org/10.1101/2024.08.04.606516 (2024).
63. Wei, Z. Cellular context generalization datasets. Figshare https://doi.org/10.6084/m9.figshare.28143422 (2025).
64. Wei, Z. Perturbation generalization datasets. Figshare https://doi.org/10.6084/m9.figshare.28147883 (2025).
65. Wei, Z. et al. Benchmarking algorithms for generalizable single-cell perturbation response prediction. Zenodo https://doi.org/10.5281/zenodo.14607156 (2025).
66. Wei, Z. Perturbation generalization H5ad datasets. Zenodo https://doi.org/10.5281/zenodo.14638780 (2025).

Acknowledgements

We gratefully acknowledge all single-cell perturbation dataset owners for generously sharing their data. Q.L. was supported by the National Key Research and Development Program of China (grant no. 2025YFC3409300), National Natural Science Foundation of China (grant no. T2425019, 32341008), Shanghai Pilot Program for Basic Research; Shanghai Science and Technology Innovation Action Plan—Key Specialization in Computational Biology, Shanghai Shuguang Scholars Project, Shanghai Excellent Academic Leader Project, Shanghai Municipal Science and Technology Major Project (2021SHZDZX0100), Fundamental Research Funds for the Central Universities, Funding for open access charge, National Natural Science Foundation of China. Z.W. was supported by the National Natural Science Foundation of China (32500555).

Author information

These authors contributed equally: Zhiting Wei, Yiheng Wang, Yicheng Gao, Shuguang Wang.

Authors and Affiliations

Department of Hematology, Tongji Hospital, Frontier Science Center for Stem Cell Research, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, China
Zhiting Wei, Yiheng Wang, Yicheng Gao, Shuguang Wang, Ping Li, Duanmiao Si, Yuli Gao, Siqi Wu, Danlu Li, Kejing Dong, Xingbo Yang, Chen Tang, Shaliu Fu, Xiaohan Chen, Wannian Li, Yuzhou You, Chen Zhang, Aibin Liang, Guohui Chuai & Qi Liu
Shanghai Key Laboratory of Anesthesiology and Brain Functional Modulation, Clinical Research Center for Anesthesiology and Perioperative Medicine, Translational Research Institute of Brain and Brain-Like Intelligence, Shanghai Fourth People’s Hospital, Frontier Science Center for Stem Cell Research, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, China
Zhiting Wei, Yiheng Wang, Yicheng Gao, Shuguang Wang, Duanmiao Si, Yuli Gao, Siqi Wu, Danlu Li, Kejing Dong, Xingbo Yang, Chen Tang, Shaliu Fu, Xiaohan Chen, Wannian Li, Yuzhou You, Chen Zhang, Guohui Chuai & Qi Liu
State Key Laboratory of Autonomous Intelligent Unmanned Systems, Frontiers Science Center for Intelligent Autonomous Systems, Ministry of Education, Shanghai Research Institute for Intelligent Autonomous Systems, Shanghai, China
Zhiting Wei, Guohui Chuai & Qi Liu
Institute for Data-Driven Tumor Immunology, Chongqing Medical University, Chongqing, China
Zhiting Wei
Institute of Biophysics, Chinese Academy of Sciences; College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
Yiheng Wang

Contributions

Z.W., Y.W., Y.C.G., A.L., G.C. and Q.L. conceived and designed the study. Z.W. designed the metrics, established the benchmarking pipeline and collected the methods and datasets. Z.W. and Y.W. implemented the benchmarking pipeline. Y.C.G. developed the cell-line embedding strategy. Z.W. and Y.W. analyzed the results and generated the figures with help from P.L., D.S., Y.L.G., S.Q.W., D.L., K.D., X.Y., C.T., S.F., X.C., W.L., Y.Y. and C.Z. Z.W., Y.W. and Q.L. wrote the manuscript. Z.W. and S.G.W. designed the website. Q.L., G.C. and A.L. supervised the entire project. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Aibin Liang, Guohui Chuai or Qi Liu.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks Yi Zhao and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Lin Tang, in collaboration with the Nature Methods team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Correlation of model performance across 13 commonly used evaluation metrics.

(a–d) Pairwise correlations among the 13 metrics used to assess model performance on the KangCrossCell, Haber, Replogle-RPE1essential, and Replogle-K562essential datasets, respectively. Each panel displays the Spearman correlation coefficients between different metrics, computed across all prediction methods evaluated in each dataset.
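The correlation analysis in panels a–d can be illustrated with a minimal pure-Python sketch. It assumes each metric is stored as a list with one score per method, in the same method order for every metric, and computes Spearman's rho as the Pearson correlation of average (tie-aware) ranks. The metric name `E_distance` and all scores are hypothetical placeholders, not values from the benchmark.

```python
from itertools import combinations


def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with values[order[i]]
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5  # undefined if either input is constant


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def metric_correlations(scores):
    """scores maps metric name -> list of values, one per method (same method order)."""
    names = list(scores)
    corr = {(m, m): 1.0 for m in names}
    for a, b in combinations(names, 2):
        corr[(a, b)] = corr[(b, a)] = spearman(scores[a], scores[b])
    return corr


# Hypothetical scores for four methods on one dataset
scores = {
    "MSE": [0.10, 0.20, 0.15, 0.30],
    "PCC_delta": [0.9, 0.6, 0.8, 0.4],
    "E_distance": [1.0, 2.5, 1.8, 3.0],
}
corr = metric_correlations(scores)
print(corr[("MSE", "PCC_delta")], corr[("MSE", "E_distance")])  # -1.0 1.0
```

Because MSE is lower-is-better while PCC-delta is higher-is-better, a strong negative rho between them indicates agreement in how they order methods; `scipy.stats.spearmanr` offers the same statistic with a p-value.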

Extended Data Fig. 2 Effects of covariates on model performance across datasets.

(a) For each of the 12 benchmark datasets, we evaluated the impact of covariates—including cellular context, perturbation, and model (method)—on prediction performance using ANOVA based on ordinary least squares (OLS) regression. In datasets containing only a single perturbation condition (KangCrossCell, CrossSpecies, KangCrossPatient and TCDD), only cellular context and model were included as covariates. For the remaining datasets, perturbation identity was additionally included as a third covariate.
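An OLS-based ANOVA over categorical covariates can be sketched with NumPy as below. This is a minimal illustration under stated assumptions, not the authors' pipeline: the legend does not say which ANOVA type was used, so the sketch assumes sequential (Type I) sums of squares with treatment-coded dummies, and the factor names (`context`, `model`) and toy data are hypothetical. For a fitted statsmodels OLS model, `statsmodels.stats.anova.anova_lm(fit, typ=1)` reports the same sequential table.

```python
import numpy as np


def dummy_columns(labels):
    """Treatment-coded indicator columns; the first (sorted) level is the reference."""
    levels = sorted(set(labels))
    cols = [[1.0 if lab == lev else 0.0 for lev in levels[1:]] for lab in labels]
    return np.array(cols, dtype=float).reshape(len(labels), len(levels) - 1)


def residual_ss(X, y):
    """Residual sum of squares of the least-squares fit y ~ X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)


def sequential_anova(y, factors):
    """Type I ANOVA; factors is a list of (name, labels) added in the given order.

    Returns {name: (df, sum_sq, F)} plus a "Residual" entry (df, sum_sq, None).
    """
    y = np.asarray(y, dtype=float)
    X = np.ones((len(y), 1))  # start from the intercept-only model
    prev_rss, prev_rank = residual_ss(X, y), 1
    rows = []
    for name, labels in factors:
        X = np.hstack([X, dummy_columns(labels)])
        cur_rss, cur_rank = residual_ss(X, y), int(np.linalg.matrix_rank(X))
        # df = number of newly identifiable columns; SS = drop in residual SS
        rows.append((name, cur_rank - prev_rank, prev_rss - cur_rss))
        prev_rss, prev_rank = cur_rss, cur_rank
    df_resid = len(y) - prev_rank
    ms_resid = prev_rss / df_resid
    table = {}
    for name, df, ss in rows:
        f = (ss / df) / ms_resid if df > 0 and ms_resid > 0 else None
        table[name] = (df, ss, f)
    table["Residual"] = (df_resid, prev_rss, None)
    return table


# Hypothetical toy data: 2 contexts x 2 models x 2 replicates
context = ["A"] * 4 + ["B"] * 4
model = ["m1", "m1", "m2", "m2"] * 2
noise = [0.1, -0.1] * 4
y = [2.0 * (c == "B") + 1.0 * (m == "m2") + e for c, m, e in zip(context, model, noise)]
table = sequential_anova(y, [("context", context), ("model", model)])
for name, (df, ss, f) in table.items():
    print(name, df, round(ss, 4), f if f is None else round(f, 1))
```

For a single-perturbation dataset, only the context and model factors would be passed; for the others, a perturbation factor would be appended to the list. In an unbalanced design, Type I sums of squares depend on factor order, which is why the order is an explicit argument here.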

Extended Data Fig. 3 Effects of covariates on model performance across multicondition datasets.

(a–e) For each of the five multicondition benchmark datasets, we evaluated the impact of covariates, including cellular context, perturbation, model (method) and time point/dosage, on prediction performance using ANOVA based on ordinary least squares (OLS) regression. Only scDisInFact, biolord and scVIDR explicitly incorporate time-point/dosage information; consequently, only these three methods were evaluated on the five multicondition datasets. In datasets containing only a single perturbation condition (CrossSpecies and TCDD), only cellular context, model and time point/dosage were included as covariates. For the remaining datasets, perturbation identity was additionally included as a covariate.

Extended Data Fig. 4 Impact of inter- and intra-heterogeneity on model performance.

(a,b) Correlation between inter-heterogeneity and model performance, measured by MSE and PCC-delta. A higher degree of inter-heterogeneity indicates greater variation across cellular contexts, which typically makes perturbation effects harder to generalize. (c,d) Correlation between intra-heterogeneity and model performance, measured by MSE and PCC-delta. In both cases, results from test contexts across 12 datasets were aggregated, with each point representing a test context within a specific dataset. MSE and PCC-delta were chosen as representative performance metrics (see Methods). In a–d, a linear regression line with 95% confidence interval is shown; Pearson correlation coefficients were calculated and statistical significance was assessed using two-sided t-tests, without adjustment for multiple comparisons. (e,f) Two-way ANOVA assessing the effects of inter- and intra-heterogeneity on model performance, measured by MSE (e) and PCC-delta (f).
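The two performance metrics and the correlation test named in this legend can be sketched as follows. The legend defers metric definitions to Methods, so this sketch assumes the commonly used definitions: MSE between predicted and observed mean expression profiles, and PCC-delta as the Pearson correlation between predicted and observed changes relative to the control mean. The test statistic is the standard t = r·sqrt((n − 2)/(1 − r²)) with n − 2 degrees of freedom; the function names and toy inputs are hypothetical, and heterogeneity scores are simply taken as given.

```python
import numpy as np


def mse(pred_mean, true_mean):
    """Mean squared error between predicted and observed mean expression profiles."""
    pred_mean, true_mean = np.asarray(pred_mean, float), np.asarray(true_mean, float)
    return float(np.mean((pred_mean - true_mean) ** 2))


def pcc_delta(pred_mean, true_mean, ctrl_mean):
    """Pearson r between predicted and observed changes vs. control (assumed definition)."""
    ctrl_mean = np.asarray(ctrl_mean, float)
    d_pred = np.asarray(pred_mean, float) - ctrl_mean
    d_true = np.asarray(true_mean, float) - ctrl_mean
    return float(np.corrcoef(d_pred, d_true)[0, 1])


def pearson_t(x, y):
    """Pearson r plus the t statistic for the two-sided test of H0: rho = 0 (df = n - 2)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    r = float(np.corrcoef(x, y)[0, 1])
    df = len(x) - 2
    t = r * np.sqrt(df / (1.0 - r ** 2))  # infinite when |r| == 1
    return r, float(t), df


# Hypothetical per-context values: heterogeneity score vs. MSE of one method
heterogeneity = [1, 2, 3, 4, 5]
error = [2, 1, 4, 3, 5]
r, t, df = pearson_t(heterogeneity, error)  # r is approximately 0.8, df = 3
```

The two-sided p-value is `2 * scipy.stats.t.sf(abs(t), df)`, which matches the p-value returned by `scipy.stats.pearsonr`; it is left out here to keep the sketch NumPy-only.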

Extended Data Fig. 5 Impact of fine-tuning set size on model performance.

(a) Impact of fine-tuning set size on model performance in the Replogle-K562essential dataset. (b) Impact of fine-tuning set size on model performance in the Replogle-RPE1essential dataset.

Supplementary information

Supplementary Information (PDF)

Supplementary Notes 1–18 and Figs. 1–45

Reporting Summary (PDF)

Peer Review File (PDF)

Supplementary Table 1 (XLSX)

Metrics used in single-cell perturbation prediction methods.

Supplementary Table 2 (XLSX)

Effects of covariates on model performance across multicondition datasets.

Supplementary Table 3 (XLSX)

Detailed information on the datasets used in the cellular context and perturbation generalization scenarios.

Supplementary Table 4 (XLSX)

Detailed information on the simulated datasets used for robustness experiments in the cellular context and perturbation generalization scenarios.

Supplementary Table 5 (XLSX)

Comparison of method performance between our study and prior studies.

Source data

Source Data Fig. 1 (XLSX)

Datasets for benchmarking the single-cell perturbation effect prediction.

Source Data Fig. 2 (XLSX)

Benchmarking results for the o.o.d. setting in the cellular context generalization scenario.

Source Data Fig. 3 (XLSX)

Limitation of current methods in the cellular context generalization scenario.

Source Data Fig. 4 (XLSX)

Benchmarking results for genetic perturbation in the perturbation generalization scenario.

Source Data Fig. 5 (XLSX)

Benchmarking results in the perturbation generalization scenario.

Source Data Extended Data Fig./Table 1 (XLSX)

Correlation of model performance across 13 commonly used evaluation metrics.

Source Data Extended Data Fig./Table 2 (XLSX)

Effects of covariates on model performance across datasets.

Source Data Extended Data Fig./Table 3 (XLSX)

Effects of covariates on model performance across multicondition datasets.

Source Data Extended Data Fig./Table 4 (XLSX)

Impact of inter-heterogeneity and intra-heterogeneity on model performance.

Source Data Extended Data Fig./Table 5 (XLSX)

Impact of fine-tuning set size on model performance.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wei, Z., Wang, Y., Gao, Y. et al. Benchmarking algorithms for generalizable single-cell perturbation response prediction. Nat Methods 23, 451–464 (2026). https://doi.org/10.1038/s41592-025-02980-0

Received: 20 January 2025
Accepted: 05 November 2025
Published: 11 December 2025
Version of record: 11 December 2025
Issue date: February 2026
DOI: https://doi.org/10.1038/s41592-025-02980-0
