A pre-trained large generative model for translating single-cell transcriptomes to proteomes

Liu, Linjing; Li, Wei; Wang, Fang; Li, Yiming; Huang, Long-Kai; Wong, Ka-Chun; Yang, Fan; Yao, Jianhua

doi:10.1038/s41551-025-01528-z

Article
Published: 05 November 2025

Nature Biomedical Engineering (2025)

Abstract

Measuring protein abundance at the single-cell level can facilitate a high-resolution understanding of biological mechanisms in cellular processes and disease progression. However, current single-cell proteomic technologies face challenges such as limited coverage, constrained throughput and sensitivity, batch effects, high costs and stringent experimental operations. Inspired by the translation procedure in both natural language processing and the genetic central dogma, we propose a pre-trained, large generative model named single-cell translator (scTranslator). scTranslator can generate multi-omics data by inferring the missing single-cell proteome based on the transcriptome. Through systematic benchmarking and validation on independent datasets, we have confirmed the accuracy, stability and flexibility of scTranslator across various profiling techniques (for example, CITE-seq, spatial CITE-seq, REAP-seq, NEAT-seq), cell types (for example, monocytes, macrophages, T cells, B cells), tissues (for example, blood, lung, brain) and a wide range of disease contexts, including infectious, metabolic and oncologic conditions. Furthermore, scTranslator shows its superiority in assisting various downstream analyses and applications, including gene/protein interaction inference, perturbation prediction, cell clustering, batch correction and cell origin recognition in pan-cancer data.

**Fig. 1: Overview of scTranslator and its downstream applications.**

**Fig. 2: Overall performance of scTranslator.**

**Fig. 3: Systematic benchmarks on protein inference.**

**Fig. 4: Overview of independent datasets and comparative experiment results.**

**Fig. 5: Integrative regulatory inference and gene perturbation results.**

**Fig. 6: Exploration of cell heterogeneity in CITE-seq PBMCs and cell origin in pan-cancer data.**

Data availability

All datasets used in this study are publicly available from established community or institutional repositories. TCGA data were accessed from the National Cancer Institute (https://www.cancer.gov). The CPTAC and MSKCC datasets were accessed through cBioPortal for Cancer Genomics (https://www.cbioportal.org). The Broad Institute Cancer Cell Line Encyclopedia dataset was obtained from the Broad Institute (https://sites.broadinstitute.org/ccle). All single-cell datasets used for second-stage pretraining and the CITE-seq DCs dataset were accessed via the Single-cell Proteomic DataBase portal (https://scproteomicsdb.com). The Perturb-CITE-seq dataset was downloaded from the Broad Institute Single Cell Portal (https://singlecell.broadinstitute.org/single_cell/study/SCP1064). Other datasets were accessed from the NCBI Gene Expression Omnibus: Seurat CITE-seq and ECCITE-seq PBMCs datasets (GSE164378), REAP-seq PBMCs dataset (GSE100501), CITE-seq CBMCs dataset (GSE100866), CITE-seq MCs dataset (GSE163120), CITE-seq mouse dataset (GSE150599), spatial CITE-seq dataset (GSE213264), NEAT-seq dataset (GSE178707) and the single-cell pan-cancer dataset (GSE154763). Source data are provided with this paper.

Code availability

The source code is available under the Apache License 2.0 via GitHub at https://github.com/TencentAILabHealthcare/scTranslator. For reproducibility, the scripts for all experiments and results analysis are also included in that repository.

Acknowledgements

We thank Z. Zheng and D. Tang for their valuable suggestions and discussion during the preparation of this manuscript. This research was substantially sponsored by the research projects (grant numbers 32170654 and 32000464 (K.-C.W.)) supported by the National Natural Science Foundation of China and was substantially supported by the Shenzhen Research Institute, City University of Hong Kong. The work described in this paper was substantially supported by the grant from the Research Grants Council of the Hong Kong Special Administrative Region (CityU 11203723 (K.-C.W.)). This project was substantially funded by the Strategic Interdisciplinary Research Grant of City University of Hong Kong (project number 2021SIRG036 (K.-C.W.)). The work described in this paper was partially supported by the grant from City University of Hong Kong (CityU 9667265 (K.-C.W.)) and Key-Area Research and Development Program of Guangdong Province (2021B0101420005 (K.-C.W.)). F.Y. was supported by the Young Elite Scientists Sponsorship Program by CAST (2023QNRC001).

Author information

Authors and Affiliations

Department of Computer Science, City University of Hong Kong, Hong Kong, China
Linjing Liu & Ka-Chun Wong
AI Lab, Tencent, Shenzhen, China
Linjing Liu, Wei Li, Fang Wang, Yiming Li, Long-Kai Huang, Fan Yang & Jianhua Yao
Department of Chemical Pathology, Prince of Wales Hospital, The Chinese University of Hong Kong, Hong Kong, China
Linjing Liu
Centre for Novostics, Hong Kong Science Park, Hong Kong, China
Linjing Liu

Contributions

L.L., F.Y. and J.Y. conceived the project. L.L. designed the algorithm framework and developed the method. L.L. performed research and conducted experiments under the supervision of K.-C.W., F.Y. and J.Y. L.L. and F.Y. analysed the results and wrote the manuscript. K.-C.W. and J.Y. revised the manuscript. W.L. provided suggestions for downstream analysis, assisted with figure polishing and contributed to manuscript improvement. F.W. and Y.L. contributed to data collection and manuscript improvement. L.-K.H. provided suggestions and a plan for incremental learning. All authors reviewed and approved the manuscript.

Corresponding authors

Correspondence to Ka-Chun Wong, Fan Yang or Jianhua Yao.

Ethics declarations

Competing interests

Authors L.L., W.L., F.W., F.Y. and J.Y. are inventors on patent applications related to this work filed by Tencent Technology (Shenzhen) Company Ltd. F.W., L.-K.H., F.Y. and J.Y. are employees of Tencent AI for Life Science Lab. The other authors declare no competing interests.

Peer review

Peer review information

Nature Biomedical Engineering thanks Xiaohui Fan, Nguyen Quoc Khanh Le and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Joint plots for evaluating prediction performance at individual level, related to Fig. 2b.

Joint plots of distribution histogram and scatter plot for CD1 (a), CD49 (b), CD158 (c), and CD307 (d) protein family members.

Extended Data Fig. 2 Pre-training data size observation and ablation study results, related to Fig. 2.

a, Influence of pre-training dataset sizes. All plots are reported based on 10 independent runs with different random seeds. Lines show the mean performance, and shaded error bands represent the 95% confidence interval (CI) across replicates. b, Ablation study of re-indexed GPE module. Each box plot represents results from 10 technical replicates with varying random seeds. Box plots show medians (center lines), interquartile range (boxes), whiskers extending to 1.5 × IQR, and outliers as individual points.

Source data.
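
The caption above summarizes 10 seeded runs as a mean with a 95% confidence interval. The preview does not say how the interval was computed, so the following is only a minimal sketch assuming a percentile bootstrap over the per-run scores. The function name `bootstrap_ci`, its parameters and the example scores are hypothetical and are not taken from the scTranslator repository.

```python
import random
from statistics import fmean


def bootstrap_ci(values, n_boot=2000, level=0.95, seed=0):
    """Return (mean, lower, upper) for a percentile-bootstrap CI of the mean.

    Assumption: `values` holds one score per independent run (e.g. 10 seeds).
    """
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    # Resample the runs with replacement and record the mean of each resample.
    boot_means = sorted(fmean(rng.choices(values, k=n)) for _ in range(n_boot))
    alpha = (1.0 - level) / 2.0
    lower = boot_means[int(alpha * n_boot)]
    upper = boot_means[int((1.0 - alpha) * n_boot) - 1]
    return fmean(values), lower, upper


# Made-up scores from 10 runs with different random seeds.
run_scores = [0.80, 0.82, 0.79, 0.81, 0.83, 0.78, 0.80, 0.82, 0.81, 0.79]
mean, lower, upper = bootstrap_ci(run_scores)
print(f"mean={mean:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

With only 10 replicates, a t-based interval is a common alternative and would usually be slightly wider.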

Extended Data Fig. 3 Influence of sample amount on scTranslator in fine-tuning stage.

Each data point represents the mean performance computed across 10 independent runs using different random seeds. Shaded error bands denote the 95% confidence interval (CI) around the mean. a, Experiment results of sample amount effect in aligned mode. b, Experiment results of sample amount effect in un-aligned mode.

Source data.

Extended Data Fig. 4 Benchmarking results of aligned mode evaluated by cosine similarity, PCC, MSE, and MAE of SOTA methods on the test sets.

All metrics are computed over 10 independent runs using different random seeds. Box plots show medians, interquartile ranges, and whiskers extending to 1.5 × IQR. All replicate values are visualized as jittered dots to reflect the distribution across runs.
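
The box-plot convention described in the caption above (median, interquartile range, whiskers at 1.5 × IQR) can be made concrete with a small sketch. It assumes linear-interpolation quartiles, which the caption does not specify, and places each whisker at the most extreme data point inside the 1.5 × IQR fences. The helper name `boxplot_stats` and the example values are hypothetical.

```python
from statistics import quantiles


def boxplot_stats(values):
    """Summarize values the way a Tukey-style box plot draws them."""
    data = sorted(values)
    # method="inclusive" interpolates linearly between order statistics.
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "whisker_low": inside[0],    # smallest point within the fences
        "whisker_high": inside[-1],  # largest point within the fences
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


# Ten made-up replicate values with one extreme point.
print(boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]))
```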

Extended Data Fig. 5 Benchmarking results of un-aligned mode evaluated by cosine similarity, PCC, MSE, and MAE of SOTA methods on the test sets.

All metrics are computed over 10 independent runs using different random seeds. Box plots show medians, interquartile ranges, and whiskers extending to 1.5 × IQR. All replicate values are visualized as jittered dots to reflect the distribution across runs.
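
The four metrics named in the two benchmarking captions above have standard definitions, sketched below in dependency-free Python. How the study aggregates them (per cell, per protein or over the flattened matrix) is not stated in this preview. The sketch assumes cosine similarity and PCC are computed per cell and then averaged, while MSE and MAE are taken over all entries. The function name `protein_metrics` and the toy inputs are hypothetical.

```python
import math


def _cosine(x, y):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def _pearson(x, y):
    """Pearson correlation coefficient (PCC); assumes neither vector is constant."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def protein_metrics(observed, predicted):
    """Compare observed and predicted protein matrices (one row per cell)."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted need the same number of cells")
    n_cells = len(observed)
    errors = [p - o for obs, pred in zip(observed, predicted)
              for o, p in zip(obs, pred)]
    return {
        "cosine": sum(_cosine(o, p) for o, p in zip(observed, predicted)) / n_cells,
        "pcc": sum(_pearson(o, p) for o, p in zip(observed, predicted)) / n_cells,
        "mse": sum(e * e for e in errors) / len(errors),
        "mae": sum(abs(e) for e in errors) / len(errors),
    }


# Toy example: two cells, three proteins, predictions offset by +1.
observed = [[1.0, 2.0, 3.0], [2.0, 4.0, 8.0]]
predicted = [[2.0, 3.0, 4.0], [3.0, 5.0, 9.0]]
print(protein_metrics(observed, predicted))
```

A constant offset leaves PCC at 1 while MSE and MAE register the error, which is one reason the captions report correlation-type and error-type metrics side by side.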

Extended Data Fig. 6 Gene regulatory and protein interaction investigation.

a, Heatmaps of attention score matrices, where the left and right panels correspond to the encoder’s (gene regulatory) and decoder’s (protein interaction) attention scores, respectively. b, Zoomed in attention matrix, related to Fig. 5a. c, Bar plots show the most highly focused gene for proteins of TCR α/β, TCR γ/δ, CD45, and CLEC2. d, The overlap among TCR α/β, TCR γ/δ, and CLEC2. e, The overlap among TCR α/β, TCR γ/δ, and CD45.

Extended Data Fig. 7 Gene knockout results of the IFN-γ response pathway.

Violin plots show the distribution of predicted protein levels for downregulated (left panels) and upregulated (right panels) proteins following gene knockout of (a) JAK1 (n = 360 cells), (b) JAK2 (n = 401 cells), (c) IFNGR1 (n = 287 cells), and (d) IFNGR2 (n = 435 cells), compared to the control group (n = 12,627 cells). Statistical significance was assessed using a two-sided Mann–Whitney U test (no correction for multiple comparisons). Significance levels are denoted as: P < 0.05 (*), P < 0.01 (**), P < 0.001 (***). Outliers were excluded using z-score filtering (|z| < 3). Each subplot displays the density distribution of protein levels for both perturbed and control cells, with embedded box plots showing the median, interquartile range and whiskers extending to 1.5 × IQR.
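
The statistical procedure in this caption (z-score filtering followed by a two-sided Mann–Whitney U test) can be sketched without third-party dependencies. This is an illustration under stated assumptions, not the authors' analysis code. It filters each group separately using the population standard deviation, and it obtains the P value from the tie-corrected normal approximation without continuity correction. All function names are hypothetical. In practice `scipy.stats.mannwhitneyu` is the usual choice.

```python
import math
from statistics import NormalDist, fmean, pstdev


def zscore_filter(values, threshold=3.0):
    """Keep values whose absolute z-score is below the threshold."""
    mu, sd = fmean(values), pstdev(values)
    if sd == 0:
        return list(values)
    return [v for v in values if abs((v - mu) / sd) < threshold]


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test; returns (U statistic for x, P value)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x, tie_term, i = 0.0, 0.0, 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        ties = j - i
        tie_term += ties ** 3 - ties
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u_x = rank_sum_x - n1 * (n1 + 1) / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_x, 1.0
    z = (u_x - n1 * n2 / 2.0) / math.sqrt(var_u)
    return u_x, 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def significance_stars(p):
    """Map a P value to the star notation used in the caption."""
    return "***" if p < 0.001 else "**" if p < 0.01 else "*" if p < 0.05 else ""


# Made-up predicted protein levels for knockout and control cells.
knockout = zscore_filter([1.0, 2.0, 3.0, 4.0, 5.0])
control = zscore_filter([6.0, 7.0, 8.0, 9.0, 10.0])
u, p = mann_whitney_u(knockout, control)
print(u, p, significance_stars(p))
```

Because the caption notes that no multiple-comparison correction was applied, the stars reflect raw per-test P values.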

Extended Data Fig. 8 Visualization of markers.

UMAP plots of predicted protein colored by expression level of potential markers for monocyte (a), NK cells (b), DCs (c), and B cells (d).

Extended Data Fig. 9 Visualization to observe batch effect.

UMAP plots of protein abundance on Dataset 1, colored by cell type in top panel and by batch information in bottom panel. The left plots are visualization for predicted protein by scTranslator and the right plots are visualization for observed protein.

Extended Data Fig. 10 Illustration of gene position embedding and expression value embedding.

Supplementary information

Reporting Summary

Supplementary Tables

Supplementary Tables 1–7.

Source data

Source Data Figs. 2, 4, 5 and 6 and Extended Data Figs. 2 and 3

Statistical source data.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Liu, L., Li, W., Wang, F. et al. A pre-trained large generative model for translating single-cell transcriptomes to proteomes. Nat. Biomed. Eng (2025). https://doi.org/10.1038/s41551-025-01528-z

Received: 22 January 2024
Accepted: 02 September 2025
Published: 05 November 2025
Version of record: 05 November 2025
DOI: https://doi.org/10.1038/s41551-025-01528-z