Comprehensive prediction and analysis of human protein essentiality based on a pretrained large language model

A preprint version of the article is available at bioRxiv.

Abstract

Human essential proteins (HEPs) are indispensable for individual viability and development. However, experimental methods to identify HEPs are often costly, time-consuming and labor-intensive. In addition, existing computational methods predict HEPs only at the cell line level, but HEPs vary across living humans, cell lines and animal models. Here we develop a sequence-based deep learning model, Protein Importance Calculator (PIC), by fine-tuning a pretrained protein language model. PIC not only substantially outperforms existing methods for predicting HEPs but also provides comprehensive prediction results across three levels: human, cell line and mouse. Furthermore, we define the protein essential score (PES), derived from PIC, to quantify human protein essentiality and validate its effectiveness by a series of biological analyses. We also demonstrate the biomedical value of the PES by identifying potential prognostic biomarkers for breast cancer and quantifying the essentiality of 617,462 human microproteins.
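The abstract describes PIC as a sequence-based classifier built on a pretrained protein language model, but it gives no architectural detail. As a rough illustration of that general pattern only, the toy sketch below pools per-residue vectors into one protein-level vector and passes it through a logistic output. Everything here is an assumption made for illustration: `toy_embed` is a random stand-in for a real language model, and `mean_pool`, `essentiality_probability`, the embedding size and the weights are hypothetical names and values, not taken from PIC.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_embed(sequence, dim=8, seed=0):
    """Hypothetical stand-in for a pretrained protein language model.

    Returns one fixed pseudo-random vector per residue. A real model would
    return contextual embeddings; this only mimics the output shape.
    """
    rng = random.Random(seed)
    table = {aa: [rng.gauss(0.0, 1.0) for _ in range(dim)] for aa in AMINO_ACIDS}
    unknown = [0.0] * dim  # non-standard residues map to a zero vector
    return [table.get(aa, unknown) for aa in sequence.upper()]


def mean_pool(residue_vectors):
    """Average per-residue vectors into a single protein-level vector."""
    if not residue_vectors:
        raise ValueError("cannot pool an empty sequence")
    n = len(residue_vectors)
    dim = len(residue_vectors[0])
    return [sum(vec[i] for vec in residue_vectors) / n for i in range(dim)]


def essentiality_probability(sequence, weights, bias=0.0):
    """Logistic output on the pooled embedding (toy classification head)."""
    pooled = mean_pool(toy_embed(sequence, dim=len(weights)))
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


if __name__ == "__main__":
    # Arbitrary example sequence and untrained weights, for shape only.
    print(essentiality_probability("MKTAYIAKQR", [0.1] * 8))
```

In a fine-tuning setting the weights (and possibly the language model itself) would be learned from labeled essential and non-essential proteins; the sketch leaves training out entirely.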

Fig. 1: Overall workflow.
Fig. 2: Ablation studies of PIC models.
Fig. 3: Performance presentation and comparison of PIC models.
Fig. 4: Biological relevance of PES generated by PIC models.
Fig. 5: Cross-level analyses based on PES at different levels.
Fig. 6: Finding potential prognostic biomarkers for breast cancer through PES.

Data availability

Protein essentiality data were collected from the gnomAD (https://gnomad.broadinstitute.org/), Project Score (https://www.sanger.ac.uk/tool/project-score-database/) and OGEE (https://v3.ogee.info/#/home) databases to train PIC-human, PIC-cell and PIC-mouse, respectively. PPI network node degree is accessible through the STRING (https://cn.string-db.org/) database. Gene expression values in normal tissue and cancer tissue are accessible through the Human Protein Atlas (https://www.proteinatlas.org/) database. PhyloP and phastCons scores are accessible through the UCSC genome browser (https://genome.ucsc.edu/). The numbers of protein-related diseases are accessible through the DisGeNET (https://www.disgenet.com/) database. The transcriptome data and clinical information for patients with breast cancer can be collected from the TCGA (https://www.cancer.gov/ccg/research/genome-sequencing/tcga) database. Additional experimental validation cohort data were obtained from the GEO (https://www.ncbi.nlm.nih.gov/geo/) database, including GSE58644, GSE25066, GSE86166, GSE96058, GSE199633 and GSE202203. The protein sequences of microproteins are accessible through the smORFunction (http://www.cuilab.cn/smorfunction) website. The data used in this study are available on Zenodo at https://doi.org/10.5281/zenodo.13994480 (ref. 39). Source data are provided with this paper.
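For readers who want to locate the six GEO validation cohorts listed above, the following minimal sketch builds the landing-page URL for each series accession. It assumes the standard GEO accession viewer at `https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi`; the helper name `geo_series_url` is hypothetical, and the sketch only constructs links without downloading anything.

```python
from urllib.parse import urlencode

# Series accessions as listed in the Data availability statement.
GEO_SERIES = [
    "GSE58644",
    "GSE25066",
    "GSE86166",
    "GSE96058",
    "GSE199633",
    "GSE202203",
]

# Assumed endpoint: the public GEO accession viewer.
GEO_ACCESSION_ENDPOINT = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi"


def geo_series_url(accession):
    """Return the GEO landing-page URL for a series accession such as GSE58644."""
    if not (accession.startswith("GSE") and accession[3:].isdigit()):
        raise ValueError(f"not a GEO series accession: {accession!r}")
    return f"{GEO_ACCESSION_ENDPOINT}?{urlencode({'acc': accession})}"


if __name__ == "__main__":
    for acc in GEO_SERIES:
        print(acc, geo_series_url(acc))
```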

Code availability

The PIC web server is available at http://www.cuilab.cn/pic. The PIC source code is available on GitHub at https://github.com/KangBoming/PIC and Zenodo at https://doi.org/10.5281/zenodo.13994480 (ref. 39).

References

  1. Bartha, I., di Iulio, J., Venter, J. C. & Telenti, A. Human gene essentiality. Nat. Rev. Genet. 19, 51–62 (2018).

  2. Ji, X., Rajpal, D. K. & Freudenberg, J. M. The essentiality of drug targets: an analysis of current literature and genomic databases. Drug Discov. Today 24, 544–550 (2019).

  3. Aromolaran, O., Aromolaran, D., Isewon, I. & Oyelade, J. Machine learning approach to gene essentiality prediction: a review. Brief. Bioinf. 22, bbab128 (2021).

  4. Joy, M. P., Brock, A., Ingber, D. E. & Huang, S. High-betweenness proteins in the yeast protein interaction network. J. Biomed. Biotechnol. 2005, 96–103 (2005).

  5. Wuchty, S. & Stadler, P. F. Centers of complex networks. J. Theor. Biol. 223, 45–53 (2003).

  6. Hahn, M. W. & Kern, A. D. Comparative genomics of centrality and essentiality in three eukaryotic protein-interaction networks. Mol. Biol. Evol. 22, 803–806 (2005).

  7. Li, G. et al. Predicting essential proteins based on subcellular localization, orthology and PPI networks. BMC Bioinf. 17, 279 (2016).

  8. Guo, F. B. et al. Accurate prediction of human essential genes using only nucleotide composition and association information. Bioinformatics 33, 1758–1764 (2017).

  9. Hasan, M. A. & Lonardi, S. DeeplyEssential: a deep neural network for predicting essential genes in microbes. BMC Bioinf. 21, 367 (2020).

  10. Zhang, X., Xiao, W. & Xiao, W. DeepHE: accurately predicting human essential genes based on deep learning. PLoS Comput. Biol. 16, e1008229 (2020).

  11. Zeng, M. et al. A deep learning framework for identifying essential proteins by integrating multiple types of biological information. IEEE/ACM Trans. Comput. Biol. Bioinf. 18, 296–305 (2021).

  12. Li, Y., Zeng, M., Wu, Y., Li, Y. & Li, M. Accurate prediction of human essential proteins using ensemble deep learning. IEEE/ACM Trans. Comput. Biol. Bioinf. 19, 3263–3271 (2022).

  13. Li, Y., Zeng, M., Zhang, F., Wu, F. X. & Li, M. DeepCellEss: cell line-specific essential protein prediction with attention-based interpretable deep learning. Bioinformatics 39, btac779 (2023).

  14. Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).

  15. Elnaggar, A. et al. ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 44, 7112–7127 (2022).

  16. Thumuluri, V., Almagro Armenteros, J. J., Johansen, A. R., Nielsen, H. & Winther, O. DeepLoc 2.0: multi-label subcellular localization prediction using protein language models. Nucleic Acids Res. 50, W228–W234 (2022).

  17. Teufel, F. et al. SignalP 6.0 predicts all five types of signal peptides using protein language models. Nat. Biotechnol. 40, 1023–1025 (2022).

  18. Hou, X., Wang, Y., Bu, D., Wang, Y. & Sun, S. EMNGly: predicting N-linked glycosylation sites using the language models for feature extraction. Bioinformatics 39, btad650 (2023).

  19. Chen, S. et al. A genomic mutational constraint map using variation in 76,156 human genomes. Nature 625, 92–100 (2024).

  20. Eppig, J. T. Mouse Genome Informatics (MGI) Resource: genetic, genomic, and biological knowledgebase for the laboratory mouse. ILAR J. 58, 17–41 (2017).

  21. Dwane, L. et al. Project Score database: a resource for investigating cancer cell dependencies and prioritizing therapeutic targets. Nucleic Acids Res. 49, D1365–D1372 (2021).

  22. Breiman, L. Bagging predictors. Mach. Learn. 24, 123–140 (1996).

  23. Chen, H. et al. New insights on human essential genes based on integrated analysis and the construction of the HEGIAP web-based platform. Brief. Bioinf. 21, 1397–1410 (2020).

  24. Uhlén, M. et al. Proteomics. Tissue-based map of the human proteome. Science 347, 1260419 (2015).

  25. Wang, J. D., Xu, J. Q., Long, Z. J. & Weng, J. Y. Disruption of mitochondrial oxidative phosphorylation by chidamide eradicates leukemic cells in AML. Clin. Transl. Oncol. 25, 1805–1820 (2023).

  26. Liu, L. et al. High metabolic dependence on oxidative phosphorylation drives sensitivity to metformin treatment in MLL/AF9 acute myeloid leukemia. Cancers 14, 486 (2022).

  27. UniProt Consortium. UniProt: the Universal Protein knowledgebase in 2023. Nucleic Acids Res. 51, D523–D531 (2023).

  28. Jabbarzadeh Kaboli, P. et al. Unlocking c-MET: a comprehensive journey into targeted therapies for breast cancer. Cancer Lett. 588, 216780 (2024).

  29. Zheng, L. et al. A potential tumor marker: chaperonin containing TCP‑1 controls the development of malignant tumors (Review). Int. J. Oncol. 63, 106 (2023).

  30. Cai, M., Li, H., Chen, R. & Zhou, X. MRPL13 promotes tumor cell proliferation, migration and EMT process in breast cancer through the PI3K–AKT–mTOR pathway. Cancer Manag. Res. 13, 2009–2024 (2021).

  31. Zhao, Y. et al. Deubiquitinase PSMD7 regulates cell fate and is associated with disease progression in breast cancer. Am. J. Transl. Res. 12, 5433–5448 (2020).

  32. Vishnubalaji, R. & Alajez, N. M. Single-cell transcriptome analysis revealed heterogeneity and identified novel therapeutic targets for breast cancer subtypes. Cells 12, 1182 (2023).

  33. Gui, Z., Liu, P., Zhang, D. & Wang, W. Clinical implications and immune implications features of TARS1 in breast cancer. Front. Oncol. 13, 1207867 (2023).

  34. Song, S. et al. CHMP4A stimulates CD8+ T-lymphocyte infiltration and inhibits breast tumor growth via the LSD1/IFNβ axis. Cancer Sci. 114, 3162–3175 (2023).

  35. Ji, X., Cui, C. & Cui, Q. smORFunction: a tool for predicting functions of small open reading frames and microproteins. BMC Bioinf. 21, 455 (2020).

  36. Polycarpou-Schwarz, M. et al. The cancer-associated microprotein CASIMO1 controls cell proliferation and interacts with squalene epoxidase modulating lipid droplet formation. Oncogene 37, 4750–4768 (2018).

  37. Makarewich, C. A. et al. MOXI is a mitochondrial micropeptide that enhances fatty acid β-oxidation. Cell Rep. 23, 3701–3709 (2018).

  38. Bhatta, A. et al. A mitochondrial micropeptide is required for activation of the Nlrp3 inflammasome. J. Immunol. 204, 428–437 (2020).

  39. Kang, B., Fan, R., Cui, C. & Cui, Q. Comprehensive prediction and analysis of human protein essentiality based on a pre-trained protein large language model (v1.0). Zenodo https://doi.org/10.5281/zenodo.13994480 (2024).

  40. Martin, F. J. et al. Ensembl 2023. Nucleic Acids Res. 51, D933–D941 (2023).

Acknowledgements

This study was supported by grants from the National Natural Science Foundation of China (62025102, 32301239 and 81921001) and the Scientific and Technological Research Project of Xinjiang Production and Construction Corps (2023AB057).

Author information

Contributions

Q.C. presented the original idea. B.K. and R.F. designed the study. B.K. performed the study. B.K., R.F., C.C. and Q.C. wrote or edited the paper.

Corresponding author

Correspondence to Qinghua Cui.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Computational Science thanks Min Li and Stefano Lonardi for their contribution to the peer review of this work. Primary Handling Editor: Jie Pan, in collaboration with the Nature Computational Science team. Peer reviewer reports are available.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information (PDF)

Supplementary Notes 1–3, Figs. 1–8 and Tables 1–14 and descriptions of Supplementary Data 1–8.

Reporting Summary (PDF)

Peer Review File (PDF)

Supplementary Data (ZIP)

Supplementary Data 1–6 and 8.

Source data

Source Data Fig. 2 (XLSX)

The source data for Fig. 2.

Source Data Fig. 3 (XLSX)

The source data for Fig. 3.

Source Data Fig. 4 (XLSX)

The source data for Fig. 4.

Source Data Fig. 5 (XLSX)

The source data for Fig. 5.

Source Data Fig. 6 (XLSX)

The source data for Fig. 6.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Kang, B., Fan, R., Cui, C. et al. Comprehensive prediction and analysis of human protein essentiality based on a pretrained large language model. Nat Comput Sci 5, 196–206 (2025). https://doi.org/10.1038/s43588-024-00733-1

  • DOI: https://doi.org/10.1038/s43588-024-00733-1
