A pre-trained large generative model for translating single-cell transcriptomes to proteomes

Abstract

Measuring protein abundance at the single-cell level can facilitate a high-resolution understanding of biological mechanisms in cellular processes and disease progression. However, current single-cell proteomic technologies face challenges such as limited coverage, constrained throughput and sensitivity, batch effects, high costs and stringent experimental operations. Inspired by the translation procedure in both natural language processing and the genetic central dogma, we propose a pre-trained, large generative model named single-cell translator (scTranslator). scTranslator can generate multi-omics data by inferring the missing single-cell proteome based on the transcriptome. Through systematic benchmarking and validation on independent datasets, we have confirmed the accuracy, stability and flexibility of scTranslator across various profiling techniques (for example, CITE-seq, spatial CITE-seq, REAP-seq, NEAT-seq), cell types (for example, monocytes, macrophages, T cells, B cells), tissues (for example, blood, lung, brain) and a wide range of disease contexts, including infectious, metabolic and oncologic conditions. Furthermore, scTranslator shows its superiority in assisting various downstream analyses and applications, including gene/protein interaction inference, perturbation prediction, cell clustering, batch correction and cell origin recognition in pan-cancer data.

Fig. 1: Overview of scTranslator and its downstream applications.
Fig. 2: Overall performance of scTranslator.
Fig. 3: Systematic benchmarks on protein inference.
Fig. 4: Overview of independent datasets and comparative experiment results.
Fig. 5: Integrative regulatory inference and gene perturbation results.
Fig. 6: Exploration of cell heterogeneity in CITE-seq PBMCs and cell origin in pan-cancer data.

Data availability

All datasets used in this study are publicly available from established community or institutional repositories. TCGA data were accessed from the National Cancer Institute (https://www.cancer.gov). The CPTAC and MSKCC datasets were accessed through cBioPortal for Cancer Genomics (https://www.cbioportal.org). The Broad Institute Cancer Cell Line Encyclopedia dataset was obtained from the Broad Institute (https://sites.broadinstitute.org/ccle). All single-cell datasets used for second-stage pretraining and the CITE-seq DCs dataset were accessed via the Single-cell Proteomic DataBase portal (https://scproteomicsdb.com). The Perturb-CITE-seq dataset was downloaded from the Broad Institute Single Cell Portal (https://singlecell.broadinstitute.org/single_cell/study/SCP1064). Other datasets were accessed from the NCBI Gene Expression Omnibus: Seurat CITE-seq and ECCITE-seq PBMCs datasets (GSE164378), REAP-seq PBMCs dataset (GSE100501), CITE-seq CBMCs dataset (GSE100866), CITE-seq MCs dataset (GSE163120), CITE-seq mouse dataset (GSE150599), spatial CITE-seq dataset (GSE213264), NEAT-seq dataset (GSE178707) and single-cell pan-cancer dataset (GSE154763). Source data are provided with this paper.

Code availability

The source code is available under the Apache License 2.0 via GitHub at https://github.com/TencentAILabHealthcare/scTranslator. For reproducibility, the scripts for all experiments and result analyses are also included in this repository.

Acknowledgements

We thank Z. Zheng and D. Tang for their valuable suggestions and discussion during the preparation of this manuscript. This research was substantially sponsored by the research projects (grant numbers 32170654 and 32000464 (K.-C.W.)) supported by the National Natural Science Foundation of China and was substantially supported by the Shenzhen Research Institute, City University of Hong Kong. The work described in this paper was substantially supported by the grant from the Research Grants Council of the Hong Kong Special Administrative Region (CityU 11203723 (K.-C.W.)). This project was substantially funded by the Strategic Interdisciplinary Research Grant of City University of Hong Kong (project number 2021SIRG036 (K.-C.W.)). The work described in this paper was partially supported by the grant from City University of Hong Kong (CityU 9667265 (K.-C.W.)) and Key-Area Research and Development Program of Guangdong Province (2021B0101420005 (K.-C.W.)). F.Y. was supported by the Young Elite Scientists Sponsorship Program by CAST (2023QNRC001).

Author information

Contributions

L.L., F.Y. and J.Y. conceived the project. L.L. designed the algorithm framework and developed the method. L.L. performed research and conducted experiments under the supervision of K.-C.W., F.Y. and J.Y. L.L. and F.Y. analysed the results and wrote the manuscript. K.-C.W. and J.Y. revised the manuscript. W.L. provided suggestions for downstream analysis, assisted with figure polishing and contributed to manuscript improvement. F.W. and Y.L. contributed to data collection and manuscript improvement. L.-K.H. provided suggestions and a plan for incremental learning. All authors reviewed and approved the manuscript.

Corresponding authors

Correspondence to Ka-Chun Wong, Fan Yang or Jianhua Yao.

Ethics declarations

Competing interests

Authors L.L., W.L., F.W., F.Y. and J.Y. are inventors on patent applications related to this work filed by Tencent Technology (Shenzhen) Company Ltd. F.W., L.-K.H., F.Y. and J.Y. are employees of Tencent AI for Life Science Lab. The other authors declare no competing interests.

Peer review

Peer review information

Nature Biomedical Engineering thanks Xiaohui Fan, Nguyen Quoc Khanh Le and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Joint plots for evaluating prediction performance at individual level, related to Fig. 2b.

Joint plots of distribution histogram and scatter plot for CD1 (a), CD49 (b), CD158 (c), and CD307 (d) protein family members.

Extended Data Fig. 2 Pre-training data size observation and ablation study results, related to Fig. 2.

a, Influence of pre-training dataset sizes. All plots are reported based on 10 independent runs with different random seeds. Lines show the mean performance, and shaded error bands represent the 95% confidence interval (CI) across replicates. b, Ablation study of re-indexed GPE module. Each box plot represents results from 10 technical replicates with varying random seeds. Box plots show medians (center lines), interquartile range (boxes), whiskers extending to 1.5 × IQR, and outliers as individual points.
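
As an illustration of how a mean and a 95% CI across 10 seeded runs could be summarized, the following minimal sketch uses a Student's t interval. The caption does not state how the interval was computed, so the t-based formula, the function name `mean_ci95` and the example scores are assumptions made for illustration only.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 9 degrees of freedom,
# i.e. valid for exactly 10 replicates.
T_CRIT_DF9 = 2.262


def mean_ci95(values, t_crit=T_CRIT_DF9):
    """Return (mean, lower, upper) for a t-based 95% confidence interval.

    Assumption: a t interval; the paper may have used another estimator
    (for example, a bootstrap).
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two replicates")
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(n)  # standard error of the mean
    half_width = t_crit * sem
    return mean, mean - half_width, mean + half_width


# Hypothetical scores from 10 runs with different random seeds.
scores = [0.91, 0.93, 0.92, 0.90, 0.94, 0.92, 0.91, 0.93, 0.92, 0.92]
mean, lower, upper = mean_ci95(scores)
print(f"mean={mean:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

The default critical value only applies to 10 replicates; a different number of runs would need the matching t quantile passed as `t_crit`.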

Extended Data Fig. 3 Influence of sample amount on scTranslator in fine-tuning stage.

Each data point represents the mean performance computed across 10 independent runs using different random seeds. Shaded error bands denote the 95% confidence interval (CI) around the mean. a, Experiment results of sample amount effect in aligned mode. b, Experiment results of sample amount effect in un-aligned mode.

Extended Data Fig. 4 Benchmarking results of aligned mode evaluated by cosine similarity, PCC, MSE, and MAE of SOTA methods on the test sets.

All metrics are computed over 10 independent runs using different random seeds. Box plots show medians, interquartile ranges, and whiskers extending to 1.5 × IQR. All replicate values are visualized as jittered dots to reflect the distribution across runs.
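
The four metrics named in this caption can be sketched in plain Python as follows. This is an illustrative sketch only: the function names are hypothetical, and whether the paper computes each metric per cell, per protein or over flattened matrices is not stated here, so the code simply compares two equal-length vectors.

```python
import math


def _check(pred, true):
    if len(pred) != len(true) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")


def cosine_similarity(pred, true):
    """Cosine of the angle between the predicted and observed vectors."""
    _check(pred, true)
    dot = sum(p * t for p, t in zip(pred, true))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_t = math.sqrt(sum(t * t for t in true))
    return dot / (norm_p * norm_t)


def pearson_cc(pred, true):
    """Pearson correlation coefficient (PCC)."""
    _check(pred, true)
    mean_p = sum(pred) / len(pred)
    mean_t = sum(true) / len(true)
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, true))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in true)
    return cov / math.sqrt(var_p * var_t)


def mse(pred, true):
    """Mean squared error."""
    _check(pred, true)
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred)


def mae(pred, true):
    """Mean absolute error."""
    _check(pred, true)
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)


# Hypothetical predicted and observed protein abundances for one cell.
predicted = [0.2, 1.5, 0.9, 3.1]
observed = [0.1, 1.7, 1.0, 2.8]
for name, fn in [("cosine", cosine_similarity), ("PCC", pearson_cc),
                 ("MSE", mse), ("MAE", mae)]:
    print(name, round(fn(predicted, observed), 4))
```

Cosine similarity and PCC are scale-insensitive, whereas MSE and MAE penalize absolute deviations, which is one reason to report both kinds of metric side by side.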

Extended Data Fig. 5 Benchmarking results of un-aligned mode evaluated by cosine similarity, PCC, MSE, and MAE of SOTA methods on the test sets.

All metrics are computed over 10 independent runs using different random seeds. Box plots show medians, interquartile ranges, and whiskers extending to 1.5 × IQR. All replicate values are visualized as jittered dots to reflect the distribution across runs.
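
For readers unfamiliar with the box-plot convention used in these captions, the sketch below computes the median, the interquartile range and whiskers at 1.5 × IQR for a list of replicate values. It is an illustration under stated assumptions: quartiles use linear interpolation (the `inclusive` method of Python's `statistics.quantiles`), whiskers stop at the most extreme data point inside the fences, and `box_plot_stats` is a hypothetical helper, not part of the scTranslator code.

```python
import statistics


def box_plot_stats(values, whisker=1.5):
    """Summarize values the way a conventional box plot would draw them."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    # Whiskers end at the most extreme observations inside the fences.
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


# Hypothetical metric values from 10 runs, one of which is an outlier.
runs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
print(box_plot_stats(runs))
```

Values beyond the fences are reported separately, matching captions that show outliers as individual points.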

Extended Data Fig. 6 Gene regulatory and protein interaction investigation.

a, Heatmaps of attention score matrices, where the left and right panels correspond to the encoder’s (gene regulatory) and decoder’s (protein interaction) attention scores, respectively. b, Zoomed-in attention matrix, related to Fig. 5a. c, Bar plots show the most highly focused gene for proteins of TCR α/β, TCR γ/δ, CD45, and CLEC2. d, The overlap among TCR α/β, TCR γ/δ, and CLEC2. e, The overlap among TCR α/β, TCR γ/δ, and CD45.

Extended Data Fig. 7 Gene knockout results of IFN-γ response pathway.

Violin plots show the distribution of predicted protein levels for downregulated (left panels) and upregulated (right panels) proteins following gene knockout of (a) JAK1 (n = 360 cells), (b) JAK2 (n = 401 cells), (c) IFNGR1 (n = 287 cells), and (d) IFNGR2 (n = 435 cells), compared to the control group (n = 12,627 cells). Statistical significance was assessed using a two-sided Mann–Whitney U test (no correction for multiple comparisons). Significance levels are denoted as: P < 0.05 (*), P < 0.01 (**), P < 0.001 (***). Outliers were excluded using z-score filtering (z < 3). Each subplot displays the density distribution of protein levels for both perturbed and control cells, with embedded box plots showing the median, interquartile range and whiskers extending to 1.5 × IQR.
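
The statistical procedure in this caption (z-score filtering, a two-sided Mann–Whitney U test and star annotation) can be sketched as below. Several details are assumptions rather than facts from the paper: the filter keeps values with |z| < 3 using the population standard deviation, the P value uses the normal approximation with tie correction and no continuity correction, and all function names and example values are hypothetical.

```python
import math
import statistics
from statistics import NormalDist


def zscore_filter(values, threshold=3.0):
    """Keep values whose absolute z-score is below the threshold (assumed |z| < 3)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return list(values)
    return [v for v in values if abs((v - mean) / sd) < threshold]


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test; returns (U for x, approximate P value)."""
    n1, n2 = len(x), len(y)
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(combined)
    tie_term = 0
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j for tied values
        for k in range(i, j):
            ranks[k] = avg_rank
        tie_term += (j - i) ** 3 - (j - i)
        i = j
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    n = n1 + n2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1))))
    if sigma == 0:
        return u_x, 1.0
    z = (u_x - mu) / sigma
    return u_x, 2 * (1 - NormalDist().cdf(abs(z)))


def stars(p):
    """Map a P value to the significance marks used in the caption."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""


# Hypothetical predicted protein levels for knockout and control cells.
knockout = zscore_filter([0.8, 0.9, 1.0, 1.1, 0.7, 0.95, 0.85, 1.05])
control = zscore_filter([1.4, 1.6, 1.5, 1.7, 1.3, 1.55, 1.45, 1.65])
u, p = mann_whitney_u(knockout, control)
print(f"U={u}, P={p:.4g} {stars(p)}")
```

For small samples an exact test is usually preferable to the normal approximation; established statistics packages offer both.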

Extended Data Fig. 8 Visualization of markers.

UMAP plots of predicted protein colored by expression level of potential markers for monocytes (a), NK cells (b), DCs (c), and B cells (d).

Extended Data Fig. 9 Visualization to observe batch effect.

UMAP plots of protein abundance on Dataset 1, colored by cell type in the top panel and by batch information in the bottom panel. The left plots visualize protein abundance predicted by scTranslator and the right plots visualize observed protein abundance.

Extended Data Fig. 10 Illustration of gene position embedding and expression value embedding.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Liu, L., Li, W., Wang, F. et al. A pre-trained large generative model for translating single-cell transcriptomes to proteomes. Nat. Biomed. Eng (2025). https://doi.org/10.1038/s41551-025-01528-z

  • DOI: https://doi.org/10.1038/s41551-025-01528-z
