Nature Communications
High-resolution phage-host assignment through key proteins using large language models
  • Article
  • Open access
  • Published: 20 March 2026

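  • Zhihua Du (affiliation 1; equal contribution),
  • Min Li (affiliations 2, 3, 4, 5; equal contribution),
  • Kaihuang Lin (affiliation 1; equal contribution),
  • Bo Xing (affiliations 2, 3),
  • Yuehua Ou (affiliation 1),
  • Zihao Lin (affiliation 1),
  • Wenchen Song (affiliation 2; ORCID: orcid.org/0000-0002-3114-7254),
  • Jie Chen (affiliation 6),
  • Junhua Li (affiliations 4, 5, 7; ORCID: orcid.org/0000-0001-6784-1873),
  • Jianqiang Li (affiliation 6) &
  • Minfeng Xiao (affiliations 2, 7, 8; ORCID: orcid.org/0000-0002-0507-7352)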

Nature Communications (2026)

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Subjects

  • Bacteriophages
  • Bioinformatics
  • Microbial communities
  • Genomics
  • Machine learning

Abstract

Viral sequences in diverse environments remain largely uncharacterized, impeding our comprehension of their genetic makeup, biological interactions, and potential applications. This underscores an urgent need for innovative analytical methods. Here, we present the VirHost Hunter framework, which uses phage tail proteins and lysins rather than full genomes for efficient, high-resolution host assignment. By harnessing protein language models and vision transformers, VirHost Hunter captures protein functional homology despite sequence dissimilarity, significantly boosting prediction accuracy. Applied to disease-associated gut bacteria, the calibrated VirHost Hunter surpasses existing methods, doubling phage host assignments, expanding taxonomic reach, and revealing previously uncharacterized phages targeting gut bacteria, including Akkermansia and Prevotella. We further establish a gut phage lysin database, enabling the synthesis of a lysin that effectively and specifically targets an obesity-promoting bacterium. VirHost Hunter’s precision and scalability mark a significant leap forward in virome research and present a promising avenue for microbiome therapies.

Data availability

The Gut Phage Lysin Database (GPLD) protein sequence data generated in this study have been deposited in the CNGB Sequence Archive (CNSA; ref. 111) of the China National GeneBank DataBase (CNGBdb; ref. 112) with accession number CNP0005794. The benchmark datasets and the derived training and testing data generated in this study are available in the Zenodo repository under https://doi.org/10.5281/zenodo.17340915. All accession codes are from previously published datasets and are provided in Supplementary Data 5. The bacterial strains and plasmids used in this study can be requested by direct correspondence with the lead contact, Dr. Minfeng Xiao (xiaominfeng@genomics.cn). These resources are strictly limited to academic research use, and this publication must be cited accordingly. Source data supporting the findings of this study are provided with this paper in the Source Data file.

Code availability

The models constructed in this study, together with the corresponding scripts, are publicly available on GitHub at https://github.com/YuehuaOu/Viral-Host-Hunter and have been archived on Zenodo under https://doi.org/10.5281/zenodo.18399248. The pre-trained model weights generated in this study are available in the Zenodo repository under https://doi.org/10.5281/zenodo.17340381.

References

  1. Bayfield, O. W. et al. Structural atlas of a human gut crassvirus. Nature 617, 409–416 (2023).

  2. Koskella, B. & Brockhurst, M. A. Bacteria-phage coevolution as a driver of ecological and evolutionary processes in microbial communities. FEMS Microbiol. Rev. 38, 916–931 (2014).

  3. Borin, J. M., Avrani, S., Barrick, J. E., Petrie, K. L. & Meyer, J. R. Coevolutionary phage training leads to greater bacterial suppression and delays the evolution of phage resistance. Proc. Natl. Acad. Sci. USA 118, e2104592118 (2021).

  4. Blazanin, M. & Turner, P. E. Community context matters for bacteria-phage ecology and evolution. ISME J. 15, 3119–3128 (2021).

  5. Lawrence, D., Baldridge, M. T. & Handley, S. A. Phages and human health: more than idle hitchhikers. Viruses 11, 587 (2019).

  6. Federici, S., Nobs, S. P. & Elinav, E. Phages and their potential to modulate the microbiome and immunity. Cell Mol. Immunol. 18, 889–904 (2021).

  7. Bhargava, K., Nath, G., Bhargava, A., Aseri, G. K. & Jain, N. Phage therapeutics: from promises to practices and prospectives. Appl. Microbiol. Biotechnol. 105, 9047–9067 (2021).

  8. Vijay, A. & Valdes, A. M. Role of the gut microbiome in chronic diseases: a narrative review. Eur. J. Clin. Nutr. 76, 489–501 (2022).

  9. Guerin, E. & Hill, C. Shining light on human gut bacteriophages. Front Cell Infect. Microbiol. 10, 481 (2020).

  10. Porter, N. T. et al. Phase-variable capsular polysaccharides and lipoproteins modify bacteriophage susceptibility in Bacteroides thetaiotaomicron. Nat. Microbiol. 5, 1170–1181 (2020).

  11. Vazquez, R., Garcia, E. & Garcia, P. Phage lysins for fighting bacterial respiratory infections: a new generation of antimicrobials. Front. Immunol. 9, 2252 (2018).

  12. Ghose, C. & Euler, C. W. Gram-negative bacterial lysins. Antibiotics 9, 74 (2020).

  13. Danis-Wlodarczyk, K. M., Wozniak, D. J. & Abedon, S. T. Treating bacterial infections with bacteriophage-based enzybiotics: in vitro, in vivo and clinical application. Antibiotics 10, 1497 (2021).

  14. Rahman, M. U. et al. Endolysin, a promising solution against antimicrobial resistance. Antibiotics 10, 1277 (2021).

  15. Lee, C., Kim, H. & Ryu, S. Bacteriophage and endolysin engineering for biocontrol of food pathogens/pathogens in the food: recent advances and future trends. Crit. Rev. Food Sci. Nutr. 63, 8919–8938 (2023).

  16. Khan, F. M., Chen, J. H., Zhang, R. & Liu, B. A comprehensive review of the applications of bacteriophage-derived endolysins for foodborne bacterial pathogens and food safety: recent advances, challenges, and future perspective. Front. Microbiol. 14, 1259210 (2023).

  17. Criel, B., Taelman, S., Van Criekinge, W., Stock, M. & Briers, Y. PhaLP: a database for the study of phage lytic proteins and their evolution. Viruses 13, 1240 (2021).

  18. Coutinho, F. H. et al. RaFAH: Host prediction for viruses of Bacteria and Archaea based on protein content. Patterns 2, 100274 (2021).

  19. Pons, J. C. et al. VPF-Class: taxonomic assignment and host prediction of uncultivated viruses based on viral protein families. Bioinformatics 37, 1805–1813 (2021).

  20. Amgarten, D., Iha, B. K. V., Piroupo, C. M., da Silva, A. M. & Setubal, J. C. vHULK, a new tool for bacteriophage host prediction based on annotated genomic features and neural networks. Phage 3, 204–212 (2022).

  21. Zielezinski, A., Barylski, J. & Karlowski, W. M. Taxonomy-aware, sequence similarity ranking reliably predicts phage-host relationships. BMC Biol. 19, 223 (2021).

  22. Dion, M. B. et al. Streamlining CRISPR spacer-based bacterial host predictions to decipher the viral dark matter. Nucleic Acids Res. 49, 3127–3138 (2021).

  23. Zhang, R. et al. SpacePHARER: sensitive identification of phages from CRISPR spacers in prokaryotic hosts. Bioinformatics 37, 3364–3366 (2021).

  24. Edwards, R. A., McNair, K., Faust, K., Raes, J. & Dutilh, B. E. Computational approaches to predict bacteriophage-host relationships. FEMS Microbiol. Rev. 40, 258–272 (2016).

  25. Li, J., Yang, F., Xiao, M. & Li, A. Advances and challenges in cataloguing the human gut virome. Cell Host Microbe 30, 908–916 (2022).

  26. Galiez, C., Siebert, M., Enault, F., Vincent, J. & Söding, J. WIsH: who is the host? Predicting prokaryotic hosts from metagenomic phage contigs. Bioinformatics 33, 3113–3114 (2017).

  27. Leite, D. M. C. et al. Computational prediction of inter-species relationships through omics data analysis and machine learning. BMC Bioinforma. 19, 420 (2018).

  28. Li, M. et al. A deep learning-based method for identification of bacteriophage-host interaction. IEEE/ACM Trans. Comput Biol. Bioinform 18, 1801–1810 (2021).

  29. Ruohan, W., Xianglilan, Z., Jianping, W. & Shuai Cheng, L. I. DeepHost: phage host prediction with convolutional neural network. Brief Bioinform. 23, bbab385 (2022).

  30. Liu, D., Ma, Y., Jiang, X. & He, T. Predicting virus-host association by Kernelized logistic matrix factorization and similarity network fusion. BMC Bioinforma. 20, 594 (2019).

  31. Wang, W. et al. A network-based integrated framework for predicting virus-prokaryote interactions. NAR Genom. Bioinform. 2, lqaa044 (2020).

  32. Lu, C. et al. Prokaryotic virus host predictor: a Gaussian model for host prediction of prokaryotic viruses in metagenomics. BMC Biol. 19, 5 (2021).

  33. Shang, J. & Sun, Y. Predicting the hosts of prokaryotic viruses using GCN-based semi-supervised learning. BMC Biol. 19, 250 (2021).

  34. Shang, J. & Sun, Y. CHERRY: A computational method for accurate prediction of virus-pRokarYotic interactions using a graph encoder-decoder model. Brief Bioinform. 23, bbac182 (2022).

  35. Tan, J. et al. HoPhage: an ab initio tool for identifying hosts of phage fragments from metaviromes. Bioinformatics 38, 543–545 (2022).

  36. Tang, T., Hou, S., Fuhrman, J. A. & Sun, F. Phage-bacterial contig association prediction with a convolutional neural network. Bioinformatics 38, i45–i52 (2022).

  37. Zielezinski, A., Deorowicz, S. & Gudys, A. PHIST: fast and accurate prediction of prokaryotic hosts from metagenomic viral sequences. Bioinformatics 38, 1447–1449 (2022).

  38. Villarroel, J. et al. HostPhinder: a phage host prediction tool. Viruses 8, 116 (2016).

  39. Boeckaerts, D. et al. Predicting bacteriophage hosts based on sequences of annotated receptor-binding proteins. Sci. Rep. 11, 1467 (2021).

  40. Gonzales, M. E. M., Ureta, J. C. & Shrestha, A. M. S. Protein embeddings improve phage-host interaction prediction. PLoS ONE 18, e0289030 (2023).

  41. Paez-Espino, D. et al. Uncovering Earth’s virome. Nature 536, 425–430 (2016).

  42. Gregory, A. C. et al. The gut virome database reveals age-dependent patterns of virome diversity in the human gut. Cell Host Microbe 28, 724–740.e728 (2020).

  43. Camarillo-Guerrero, L. F., Almeida, A., Rangel-Pineros, G., Finn, R. D. & Lawley, T. D. Massive expansion of human gut bacteriophage diversity. Cell 184, 1098–1109.e1099 (2021).

  44. Nayfach, S. et al. Metagenomic compendium of 189,680 DNA viruses from the human gut microbiome. Nat. Microbiol. 6, 960–970 (2021).

  45. Tisza, M. J. & Buck, C. B. A catalog of tens of thousands of viruses from human metagenomes reveals hidden associations with chronic diseases. Proc. Natl. Acad. Sci. USA 118, e2023202118 (2021).

  46. Roux, S. et al. iPHoP: an integrated machine learning framework to maximize host prediction for metagenome-derived viruses of archaea and bacteria. PLoS Biol. 21, e3002083 (2023).

  47. McGinnis, S. & Madden, T. L. BLAST: at the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Res. 32, W20–W25 (2004).

  48. Ahlgren, N. A., Ren, J., Lu, Y. Y., Fuhrman, J. A. & Sun, F. Alignment-free $d_2^*$ oligonucleotide frequency dissimilarity measure improves prediction of hosts from metagenomically-derived viral sequences. Nucleic Acids Res. 45, 39–53 (2017).

  49. Garcia-Doval, C. & van Raaij, M. J. Structure of the receptor-binding carboxy-terminal domain of bacteriophage T7 tail fibers. Proc. Natl. Acad. Sci. USA 109, 9390–9395 (2012).

  50. Yehl, K. et al. Engineering phage host-range and suppressing bacterial resistance through phage tail fiber mutagenesis. Cell 179, 459–469.e459 (2019).

  51. Chen, M. et al. Alterations in gp37 expand the host range of a T4-like phage. Appl. Environ. Microbiol. 83, e01576–e01617 (2017).

  52. Santos, S. B. et al. Selection and characterization of a multivalent Salmonella phage and its production in a nonpathogenic Escherichia coli strain. Appl Environ. Microbiol. 76, 7338–7342 (2010).

  53. Pas, C., Latka, A., Fieseler, L. & Briers, Y. Phage tailspike modularity and horizontal gene transfer reveals specificity towards E. coli O-antigen serogroups. Virol. J. 20, 174 (2023).

  54. Nobrega, F. L. et al. Targeting mechanisms of tailed bacteriophages. Nat. Rev. Microbiol. 16, 760–773 (2018).

  55. Dams, D., Brøndsted, L., Drulis-Kawa, Z. & Briers, Y. Engineering of receptor-binding proteins in bacteriophages and phage tail-like bacteriocins. Biochem Soc. Trans. 47, 449–460 (2019).

  56. Opperman, C. J., Wojno, J. M. & Brink, A. J. Treating bacterial infections with bacteriophages in the 21st century. S Afr. J. Infect. Dis. 37, 346 (2022).

  57. Rakhuba, D. V., Kolomiets, E. I., Dey, E. S. & Novik, G. I. Bacteriophage receptors, mechanisms of phage adsorption and penetration into host cell. Pol. J. Microbiol. 59, 145–155 (2010).

  58. Li, M. et al. Enhancing strain-level phage-host prediction through experimentally validated negatives and feature optimization strategies. bioRxiv https://doi.org/10.1101/2025.05.31.656987 (2025).

  59. Islam, M. Z. et al. Molecular anatomy of the receptor binding module of a bacteriophage long tail fibre. PLoS Pathog. 15, e1008193 (2019).

  60. Nelson, D., Schuch, R., Chahales, P., Zhu, S. & Fischetti, V. A. PlyC: a multimeric bacteriophage lysin. Proc. Natl. Acad. Sci. USA 103, 10765–10770 (2006).

  61. Flamholz, Z. N., Biller, S. J. & Kelly, L. Large language models improve annotation of prokaryotic viral proteins. Nat. Microbiol. 9, 537–549 (2024).

  62. Dosovitskiy, A. et al. An image is worth 16x16 words: transformers for image recognition at scale. In International Conference on Learning Representations (2021).

  63. Elnaggar, A. et al. ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 44, 7112–7127 (2022).

  64. Kim, G. B., Gao, Y., Palsson, B. O. & Lee, S. Y. DeepTFactor: a deep learning-based tool for the prediction of transcription factors. Proc. Natl. Acad. Sci. USA 118, e2021171118 (2021).

  65. Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).

  66. Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150–3152 (2012).

  67. Lloyd-Price, J. et al. Multi-omics of the gut microbial ecosystem in inflammatory bowel diseases. Nature 569, 655–662 (2019).

  68. Schirmer, M., Garner, A., Vlamakis, H. & Xavier, R. J. Microbial genes and pathways in inflammatory bowel disease. Nat. Rev. Microbiol. 17, 497–511 (2019).

  69. Ternes, D. et al. The gut microbial metabolite formate exacerbates colorectal cancer progression. Nat. Metab. 4, 458–475 (2022).

  70. Wong, S. H. & Yu, J. Gut microbiota in colorectal cancer: mechanisms of action and clinical applications. Nat. Rev. Gastroenterol. Hepatol. 16, 690–704 (2019).

  71. Qin, Y. et al. Consistent signatures in the human gut microbiome of old- and young-onset colorectal cancer. Nat. Commun. 15, 3396 (2024).

  72. Liu, R. et al. Gut microbiome and serum metabolome alterations in obesity and after weight-loss intervention. Nat. Med. 23, 859–868 (2017).

  73. Jie, Z. et al. The gut microbiome in atherosclerotic cardiovascular disease. Nat. Commun. 8, 845 (2017).

  74. Wu, C. et al. Obesity-enriched gut microbe degrades myo-inositol and promotes lipid absorption. Cell Host Microbe 32, 1301–1314.e1309 (2024).

  75. Wang, T. et al. Divergent age-associated and metabolism-associated gut microbiome signatures modulate cardiovascular disease risk. Nat. Med 30, 1722–1731 (2024).

  76. Qin, J. et al. A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature 490, 55–60 (2012).

  77. Shen, J. et al. Large-scale phage cultivation for commensal human gut bacteria. Cell Host Microbe 31, 665–677.e667 (2023).

  78. Zhu, F. et al. Metagenome-wide association of gut microbiome features for schizophrenia. Nat. Commun. 11, 1612 (2020).

  79. Alpizar-Rodriguez, D. et al. Prevotella copri in individuals at risk for rheumatoid arthritis. Ann. Rheum. Dis. 78, 590–593 (2019).

  80. Scher, J. U. et al. Expansion of intestinal Prevotella copri correlates with enhanced susceptibility to arthritis. Elife 2, e01202 (2013).

  81. Maeda, Y. et al. Dysbiosis contributes to arthritis development via activation of autoreactive T cells in the intestine. Arthritis Rheumatol. 68, 2646–2661 (2016).

  82. Tsai, C. Y. et al. Abundance of Prevotella copri in the gut microbiota is inversely related to a healthy diet in patients with type 2 diabetes. J. Food Drug Anal. 31, 599–608 (2023).

  83. Yue, T. et al. High-risk genotypes for type 1 diabetes are associated with the imbalance of gut microbiome and serum metabolites. Front. Immunol. 13, 1033393 (2022).

  84. Yang, C. et al. Prevotella copri alleviates hyperglycemia and regulates gut microbiota and metabolic profiles in mice. mSystems 9, e0053224 (2024).

  85. Devoto, A. E. et al. Megaphages infect Prevotella, and variants are widespread in gut microbiomes. Nat. Microbiol. 4, 693–700 (2019).

  86. Weston, J., Elisseeff, A., Zhou, D., Leslie, C. S. & Noble, W. S. Protein ranking: from local to global structure in the protein similarity network. Proc. Natl. Acad. Sci. USA 101, 6559–6563 (2004).

  87. Copp, J. N., Anderson, D. W., Akiva, E., Babbitt, P. C. & Tokuriki, N. Exploring the sequence, function, and evolutionary space of protein superfamilies using sequence similarity networks and phylogenetic reconstructions. Methods Enzymol. 620, 315–347 (2019).

  88. Dey, K. K., Xie, D. & Stephens, M. A new sequence logo plot to highlight enrichment and depletion. BMC Bioinforma. 19, 473 (2018).

  89. Gupta, A., Osadchiy, V. & Mayer, E. A. Brain-gut-microbiome interactions in obesity and food addiction. Nat. Rev. Gastroenterol. Hepatol. 17, 655–672 (2020).

  90. Kasai, C. et al. Comparison of the gut microbiota composition between obese and non-obese individuals in a Japanese population, as analyzed by terminal restriction fragment length polymorphism and next-generation sequencing. BMC Gastroenterol. 15, 100 (2015).

  91. Kocelak, P. et al. Resting energy expenditure and gut microbiota in obese and normal weight subjects. Eur. Rev. Med Pharm. Sci. 17, 2816–2821 (2013).

  92. Fujimoto, K. et al. An enterococcal phage-derived enzyme suppresses graft-versus-host disease. Nature 632, 174–181 (2024).

  93. Dzuvor, C. K. O. et al. Engineering self-assembled endolysin nanoparticles against antibiotic-resistant bacteria. ACS Appl. Bio. Mater. https://doi.org/10.1021/acsabm.2c00741 (2022).

  94. Lee, C., Kim, J., Son, B. & Ryu, S. Development of advanced chimeric endolysin to control multidrug-resistant Staphylococcus aureus through domain shuffling. ACS Infect. Dis. 7, 2081–2092 (2021).

  95. Diez-Martinez, R. et al. Improving the lethal effect of Cpl-7, a pneumococcal phage lysozyme with broad bactericidal activity, by inverting the net charge of its cell wall-binding module. Antimicrob. Agents Chemother. 57, 5355–5365 (2013).

  96. Zampara, A. et al. Exploiting phage receptor-binding proteins to enable endolysins to kill Gram-negative bacteria. Sci. Rep. 10, 12087 (2020).

  97. Hendrix, R. W., Smith, M. C., Burns, R. N., Ford, M. E. & Hatfull, G. F. Evolutionary relationships among diverse bacteriophages and prophages: all the world’s a phage. Proc. Natl. Acad. Sci. USA 96, 2192–2197 (1999).

  98. Adriaenssens, E. M. Phage diversity in the human gut microbiome: a taxonomist’s perspective. mSystems 6, e0079921 (2021).

  99. Raghu, M., Unterthiner, T., Kornblith, S., Zhang, C. & Dosovitskiy, A. Do vision transformers see like convolutional neural networks? Adv. neural Inf. Process. Syst. 34, 12116–12128 (2021).

  100. Hyatt, D. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinforma. 11, 119 (2010).

  101. Li, W. et al. RefSeq: expanding the Prokaryotic Genome Annotation Pipeline reach with protein family model curation. Nucleic Acids Res 49, D1020–D1028 (2021).

  102. Cantalapiedra, C. P., Hernandez-Plaza, A., Letunic, I., Bork, P. & Huerta-Cepas, J. eggNOG-mapper v2: functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Mol. Biol. Evol. 38, 5825–5829 (2021).

  103. Aziz, R. K. et al. The RAST Server: rapid annotations using subsystems technology. BMC Genomics 9, 75 (2008).

  104. Jones, P. et al. InterProScan 5: genome-scale protein function classification. Bioinformatics 30, 1236–1240 (2014).

  105. Lu, Y. Y., Noble, W. S. & Keich, U. A BLAST from the past: revisiting blastp's E-value. Bioinformatics 40, btae729 (2024).

  106. Skewes-Cox, P., Sharpton, T. J., Pollard, K. S. & DeRisi, J. L. Profile hidden Markov models for the detection of viruses within metagenomic sequence data. PLoS One 9, e105067 (2014).

  107. Mitchell, A. L. et al. InterPro in 2019: improving coverage, classification and access to protein sequence annotations. Nucleic Acids Res 47, D351–D360 (2019).

  108. Shen, W., Le, S., Li, Y. & Hu, F. SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLoS One 11, e0163962 (2016).

  109. Song, W. et al. Prophage Hunter: an integrative hunting tool for active prophages. Nucleic Acids Res 47, W74–W80 (2019).

  110. Wickham, H. & Wickham, H. Data analysis. (Springer, 2016).

  111. Guo, X. et al. CNSA: a data repository for archiving omics data. Database (Oxford) 2020 (2020).

  112. Chen, F. Z. et al. CNGBdb: China National Genebank Database. Yi Chuan 42, 799–809 (2020).


Acknowledgments

This work was supported by the National Key R&D Program of China (Grant No. 2020YFA0908700) to M.X., Z.D., Junhua L., and Jianqiang L., and in part by the National Natural Science Fund for Distinguished Young Scholars (Grant No. 62325307) to Jianqiang L. We sincerely thank the China National GeneBank Database (CNGB) for providing valuable data support and computational resources.

Author information

Author notes
  1. These authors contributed equally: Zhihua Du, Min Li, Kaihuang Lin.

Authors and Affiliations

  1. College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, 518060, China

    Zhihua Du, Kaihuang Lin, Yuehua Ou & Zihao Lin

  2. BGI Research, Shenzhen, 518083, China

    Min Li, Bo Xing, Wenchen Song & Minfeng Xiao

  3. University of Chinese Academy of Sciences, Beijing, 100049, China

    Min Li & Bo Xing

  4. BGI Research, Belgrade, 11000, Serbia

    Min Li & Junhua Li

  5. Shenzhen Key Laboratory of Unknown Pathogen Identification, BGI Research, Shenzhen, 518083, China

    Min Li & Junhua Li

  6. School of Artificial Intelligence, Shenzhen University, Shenzhen, 518060, China

    Jie Chen & Jianqiang Li

  7. State Key Laboratory of Genome and Multi-omics Technologies, BGI Research, Shenzhen, 518083, China

    Junhua Li & Minfeng Xiao

  8. BGI College, Shenzhen, 518083, China

    Minfeng Xiao

Contributions

M.X. conceived the study. Z.D., K.L., and Y.O. developed the tool. M.L. and B.X. compiled the training, validation, and test sets. K.L., M.L., B.X., and M.X. analyzed the viral dark matter. K.L., M.L., B.X., and Y.O. drafted the manuscript and made the figures. Z.L. and M.L. maintained and improved the code. Z.D., M.X., and Junhua L. revised the manuscript. W.S., J.C. and Jianqiang L. provided consultation. B.X., Y.O., and Z.L. contributed equally to this work. All authors read, edited, and approved the final manuscript.

Corresponding authors

Correspondence to Junhua Li, Jianqiang Li or Minfeng Xiao.

Ethics declarations

Competing interests

Authors affiliated with BGI Research (M.L., M.X., and Junhua L.) and authors affiliated with Shenzhen University (Z.D., K.L., and Jianqiang L.) have filed a patent application titled “Method and System for Predicting Host Range of Bacteriophages, and Corresponding Computer Device or Medium” (Application No. PCT/CN2025/075560). The application describes the prediction algorithm from this study, which predicts the host specificity of phages based on tail and lysin proteins. The other authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks Karthik Anantharaman and the other anonymous reviewers for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information (PDF)

Description of Additional Supplementary Files (PDF)

Supplementary Data 1 (XLSX)

Supplementary Data 2 (XLSX)

Supplementary Data 3 (CSV)

Supplementary Data 4 (XLSX)

Supplementary Data 5 (XLSX)

Reporting Summary (PDF)

Transparent Peer Review file (PDF)

Source data

Source Data (XLSX)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Du, Z., Li, M., Lin, K. et al. High-resolution phage-host assignment through key proteins using large language models. Nat Commun (2026). https://doi.org/10.1038/s41467-026-70613-x

  • Received: 06 December 2024

  • Accepted: 26 February 2026

  • Published: 20 March 2026

  • DOI: https://doi.org/10.1038/s41467-026-70613-x
