Abstract
Plant root-associated proteins promote plant growth and enhance stress tolerance. They participate in signaling and growth regulation, and play key roles in plant growth, development and environmental adaptation. However, root-associated proteins have not yet been fully discovered, so it is essential to identify latent ones. Traditional methods for determining root-associated proteins (proteomic analysis, transcriptome and expression analysis) rely heavily on data generated by biochemical experiments, which are expensive and time-consuming. Meanwhile, current computational models show limited predictive ability, leaving considerable room for improvement. In this study, we propose a new computational model, Hypergraph-Root, for predicting root-associated proteins. The model represents proteins with several feature types, derived from the BLOSUM62 matrix, position-specific scoring matrices and a protein language model. These features are refined by a hypergraph convolutional network and multi-head attention, and the final prediction is produced by a fully connected layer. The model achieved high performance, with an AUC of about 0.9 on both the training and independent datasets, and showed clear advantages over existing models. Additional tests were conducted to justify the model’s structure.
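To make the hypergraph convolution step concrete, the following is a minimal NumPy sketch of one propagation layer in the commonly used HGNN form, X' = ReLU(Dv^(-1/2) H W De^(-1) H^T Dv^(-1/2) X Θ). It is an illustration under stated assumptions, not the Hypergraph-Root implementation: the function name `hypergraph_conv`, the toy incidence matrix, and the way hyperedges group proteins are all hypothetical, and the abstract does not specify how the actual hypergraph is built.

```python
import numpy as np


def hypergraph_conv(X, H, Theta, edge_weights=None):
    """One hypergraph convolution step (common HGNN form, illustrative only).

    X            : (n_nodes, d_in) node (protein) feature matrix
    H            : (n_nodes, n_edges) incidence matrix, H[v, e] = 1 if node v
                   belongs to hyperedge e
    Theta        : (d_in, d_out) weight matrix (learned in a real model)
    edge_weights : optional (n_edges,) hyperedge weights, defaults to ones
    """
    X = np.asarray(X, dtype=float)
    H = np.asarray(H, dtype=float)
    Theta = np.asarray(Theta, dtype=float)
    n_edges = H.shape[1]
    w = np.ones(n_edges) if edge_weights is None else np.asarray(edge_weights, dtype=float)

    dv = H @ w           # weighted node degrees
    de = H.sum(axis=0)   # hyperedge degrees (number of member nodes)

    # Inverse degrees, with isolated nodes / empty hyperedges mapped to 0.
    dv_inv_sqrt = np.zeros_like(dv)
    dv_inv_sqrt[dv > 0] = 1.0 / np.sqrt(dv[dv > 0])
    de_inv = np.zeros_like(de)
    de_inv[de > 0] = 1.0 / de[de > 0]

    left = dv_inv_sqrt[:, None] * H          # Dv^(-1/2) H
    middle = left * (w * de_inv)[None, :]    # ... W De^(-1)
    A = middle @ left.T                      # ... H^T Dv^(-1/2)

    return np.maximum(A @ X @ Theta, 0.0)    # ReLU activation


# Toy example (hypothetical data): 4 proteins, 2 hyperedges, 3-dim features.
H_toy = np.array([[1, 0], [1, 0], [1, 1], [0, 1]])
X_toy = np.arange(12, dtype=float).reshape(4, 3)
Theta_toy = np.eye(3)
out_toy = hypergraph_conv(X_toy, H_toy, Theta_toy)
```

In a full model of the kind the abstract describes, the output of such a layer would then pass through multi-head attention and a fully connected layer; those stages are omitted here because their configuration is not given in this section.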
Data availability
The data underlying this study are openly available in the RGPDB database at http://sysbio.unl.edu/RGPDB/. The code and refined data are available at https://github.com/Xxy0413-1119/Hypergraph-Root.
Author information
Contributions
L.C. designed the research; L.C., X.X. and B.Z. conducted the experiments; X.X. and B.Z. analyzed the results; L.C. and X.X. wrote the manuscript. All authors have read and approved the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Chen, L., Xun, X. & Zhou, B. Root-associated protein prediction using a protein large language model and hypergraph convolutional networks. Sci Rep (2026). https://doi.org/10.1038/s41598-026-35110-7
DOI: https://doi.org/10.1038/s41598-026-35110-7


