Scientific Reports
Root-associated protein prediction using a protein large language model and hypergraph convolutional networks
  • Article
  • Open access
  • Published: 08 January 2026

  • Lei Chen1,
  • Xingyu Xun1 &
  • Bo Zhou2 

Scientific Reports (2026)

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Subjects

  • Computational biology and bioinformatics
  • Plant sciences

Abstract

Plant root-associated proteins promote plant growth and enhance stress tolerance. They participate in signaling and growth regulation and thus play key roles in plant growth, development and environmental adaptation. However, root-associated proteins have not yet been fully catalogued, so identifying those that remain undiscovered is essential. Traditional methods for determining root-associated proteins (proteomic analysis, transcriptome and expression analysis) rely heavily on data generated by biochemical experiments, which are expensive and time-consuming. Meanwhile, current computational models perform weakly, leaving considerable room for improvement. In this study, we propose a new computational model, Hypergraph-Root, for predicting root-associated proteins. The model represents each protein with several feature types, derived from the BLOSUM62 matrix, position-specific scoring matrices and a protein language model. These features are refined by a hypergraph convolutional network and multi-head attention, and a fully connected layer produces the final prediction. The model achieved high performance, with an AUC of about 0.9 on both the training and independent datasets, and showed clear advantages over existing models. Additional tests were conducted to validate the rationality of the model’s structure.
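
To make the feature-refinement step more concrete, the following is a minimal NumPy sketch of a single hypergraph convolution layer. It is not the authors' implementation. It assumes the commonly used HGNN propagation rule, relu(Dv^-1/2 H W De^-1 H^T Dv^-1/2 X Theta), and a hypothetical hypergraph in which each protein forms one hyperedge with its k nearest neighbours in feature space; the abstract does not state how Hypergraph-Root builds its hyperedges. The names `knn_incidence` and `hypergraph_conv`, the toy dimensions and the random inputs are all illustrative.

```python
import numpy as np


def knn_incidence(features: np.ndarray, k: int) -> np.ndarray:
    """Build a node-by-hyperedge incidence matrix H.

    Hypothetical construction (an assumption, not taken from the paper):
    hyperedge e contains protein e and its k nearest neighbours.
    """
    n = features.shape[0]
    # Pairwise squared Euclidean distances between feature vectors.
    diff = features[:, None, :] - features[None, :, :]
    dist = (diff ** 2).sum(axis=-1)
    H = np.zeros((n, n))
    for e in range(n):
        members = np.argsort(dist[e])[: k + 1]  # closest k + 1 points
        H[members, e] = 1.0
        H[e, e] = 1.0  # make sure the centre protein is always a member
    return H


def hypergraph_conv(X, H, Theta, edge_weights=None):
    """One HGNN-style layer: relu(Dv^-1/2 H W De^-1 H^T Dv^-1/2 X Theta)."""
    n_edges = H.shape[1]
    if edge_weights is None:
        w = np.ones(n_edges)
    else:
        w = np.asarray(edge_weights, dtype=float)
    dv = H @ w          # weighted node degrees
    de = H.sum(axis=0)  # hyperedge degrees (number of member nodes)
    dv_inv_sqrt = 1.0 / np.sqrt(dv)
    # Normalised node-to-node propagation operator (n_nodes x n_nodes).
    A = (dv_inv_sqrt[:, None] * H) @ np.diag(w / de) @ (H.T * dv_inv_sqrt[None, :])
    return np.maximum(A @ X @ Theta, 0.0)


rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))      # 6 toy proteins with 8-dimensional features
H = knn_incidence(X, k=2)
Theta = rng.normal(size=(8, 4))  # stands in for weights learned during training
Z = hypergraph_conv(X, H, Theta)
print(Z.shape)  # (6, 4)
```

Each output row mixes a protein's features with those of the proteins that share a hyperedge with it, which is the sense in which such a layer can refine per-protein descriptors. In a full model of the kind the abstract describes, the refined features would then pass through multi-head attention and a fully connected layer; neither is shown here.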

Data availability

The data underlying this study are openly available in the RGPDB database at http://sysbio.unl.edu/RGPDB/. The code and refined data are available at https://github.com/Xxy0413-1119/Hypergraph-Root.

Author information

Authors and Affiliations

  1. College of Information Engineering, Shanghai Maritime University, Shanghai, China

    Lei Chen & Xingyu Xun

  2. School of Basic Medical Sciences, Shanghai University of Medicine and Health Sciences, Shanghai, 201318, China

    Bo Zhou

Contributions

L.C. designed the research; L.C., X.X. and B.Z. conducted the experiments; X.X. and B.Z. analyzed the results; L.C. and X.X. wrote the manuscript. All authors have read and approved the manuscript.

Corresponding author

Correspondence to Lei Chen.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary Material 1

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Chen, L., Xun, X. & Zhou, B. Root-associated protein prediction using a protein large language model and hypergraph convolutional networks. Sci Rep (2026). https://doi.org/10.1038/s41598-026-35110-7

  • Received: 17 September 2025

  • Accepted: 02 January 2026

  • Published: 08 January 2026

  • DOI: https://doi.org/10.1038/s41598-026-35110-7

Keywords

  • Protein classification
  • ProtT5
  • BLOSUM62 matrix
  • Position-specific scoring matrix
  • Hypergraph
  • Deep learning