Abstract
Viral sequences from diverse environments remain largely uncharacterized, which limits our understanding of their genetic makeup, biological interactions, and potential applications and calls for new analytical methods. Here, we present the VirHost Hunter framework, which uses phage tail proteins and lysins rather than full genomes for efficient, high-resolution host assignment. By combining Protein Language Models and Vision Transformers, VirHost Hunter captures functional homology between proteins despite sequence dissimilarity, improving prediction accuracy. Applied to disease-associated gut bacteria, the calibrated VirHost Hunter outperforms existing methods: it doubles the number of phage host assignments, expands taxonomic reach, and reveals previously uncharacterized phages targeting gut bacteria, including Akkermansia and Prevotella. Building on these assignments, we establish a gut phage lysin database, which enabled the synthesis of a lysin that effectively and specifically targets an obesity-promoting bacterium. VirHost Hunter’s precision and scalability advance virome research and offer a promising avenue for microbiome therapies.
Data availability
The Gut Phage Lysin Database (GPLD) protein sequence data generated in this study have been deposited in the CNGB Sequence Archive (CNSA; ref. 111) of the China National GeneBank DataBase (CNGBdb; ref. 112) under accession number CNP0005794. The benchmark datasets and the derived training and testing data generated in this study are available in the Zenodo repository at https://doi.org/10.5281/zenodo.17340915. All accession codes used come from previously published datasets and are listed in Supplementary Data 5. The bacterial strains and plasmids used in this study can be requested by direct correspondence with the lead contact, Dr. Minfeng Xiao (xiaominfeng@genomics.cn). These resources are strictly limited to academic research use, and this publication must be cited accordingly. Source data supporting the findings of this study are provided with this paper.
Code availability
The models constructed in this study, together with the corresponding scripts, are publicly available on GitHub at https://github.com/YuehuaOu/Viral-Host-Hunter and have been archived on Zenodo under https://doi.org/10.5281/zenodo.18399248. The pre-trained model weights generated in this study are available in the Zenodo repository under https://doi.org/10.5281/zenodo.17340381.
References
Bayfield, O. W. et al. Structural atlas of a human gut crassvirus. Nature 617, 409–416 (2023).
Koskella, B. & Brockhurst, M. A. Bacteria-phage coevolution as a driver of ecological and evolutionary processes in microbial communities. FEMS Microbiol. Rev. 38, 916–931 (2014).
Borin, J. M., Avrani, S., Barrick, J. E., Petrie, K. L. & Meyer, J. R. Coevolutionary phage training leads to greater bacterial suppression and delays the evolution of phage resistance. Proc. Natl. Acad. Sci. USA 118, e2104592118 (2021).
Blazanin, M. & Turner, P. E. Community context matters for bacteria-phage ecology and evolution. ISME J. 15, 3119–3128 (2021).
Lawrence, D., Baldridge, M. T. & Handley, S. A. Phages and human health: more than idle hitchhikers. Viruses 11, 587 (2019).
Federici, S., Nobs, S. P. & Elinav, E. Phages and their potential to modulate the microbiome and immunity. Cell. Mol. Immunol. 18, 889–904 (2021).
Bhargava, K., Nath, G., Bhargava, A., Aseri, G. K. & Jain, N. Phage therapeutics: from promises to practices and prospectives. Appl. Microbiol. Biotechnol. 105, 9047–9067 (2021).
Vijay, A. & Valdes, A. M. Role of the gut microbiome in chronic diseases: a narrative review. Eur. J. Clin. Nutr. 76, 489–501 (2022).
Guerin, E. & Hill, C. Shining light on human gut bacteriophages. Front. Cell. Infect. Microbiol. 10, 481 (2020).
Porter, N. T. et al. Phase-variable capsular polysaccharides and lipoproteins modify bacteriophage susceptibility in Bacteroides thetaiotaomicron. Nat. Microbiol. 5, 1170–1181 (2020).
Vazquez, R., Garcia, E. & Garcia, P. Phage lysins for fighting bacterial respiratory infections: a new generation of antimicrobials. Front. Immunol. 9, 2252 (2018).
Ghose, C. & Euler, C. W. Gram-negative bacterial lysins. Antibiotics 9, 74 (2020).
Danis-Wlodarczyk, K. M., Wozniak, D. J. & Abedon, S. T. Treating bacterial infections with bacteriophage-based enzybiotics: in vitro, in vivo and clinical application. Antibiotics 10, 1497 (2021).
Rahman, M. U. et al. Endolysin, a promising solution against antimicrobial resistance. Antibiotics 10, 1277 (2021).
Lee, C., Kim, H. & Ryu, S. Bacteriophage and endolysin engineering for biocontrol of food pathogens/pathogens in the food: recent advances and future trends. Crit. Rev. Food Sci. Nutr. 63, 8919–8938 (2023).
Khan, F. M., Chen, J. H., Zhang, R. & Liu, B. A comprehensive review of the applications of bacteriophage-derived endolysins for foodborne bacterial pathogens and food safety: recent advances, challenges, and future perspective. Front. Microbiol. 14, 1259210 (2023).
Criel, B., Taelman, S., Van Criekinge, W., Stock, M. & Briers, Y. PhaLP: a database for the study of phage lytic proteins and their evolution. Viruses 13, 1240 (2021).
Coutinho, F. H. et al. RaFAH: Host prediction for viruses of Bacteria and Archaea based on protein content. Patterns 2, 100274 (2021).
Pons, J. C. et al. VPF-Class: taxonomic assignment and host prediction of uncultivated viruses based on viral protein families. Bioinformatics 37, 1805–1813 (2021).
Amgarten, D., Iha, B. K. V., Piroupo, C. M., da Silva, A. M. & Setubal, J. C. vHULK, a new tool for bacteriophage host prediction based on annotated genomic features and neural networks. Phage 3, 204–212 (2022).
Zielezinski, A., Barylski, J. & Karlowski, W. M. Taxonomy-aware, sequence similarity ranking reliably predicts phage-host relationships. BMC Biol. 19, 223 (2021).
Dion, M. B. et al. Streamlining CRISPR spacer-based bacterial host predictions to decipher the viral dark matter. Nucleic Acids Res. 49, 3127–3138 (2021).
Zhang, R. et al. SpacePHARER: sensitive identification of phages from CRISPR spacers in prokaryotic hosts. Bioinformatics 37, 3364–3366 (2021).
Edwards, R. A., McNair, K., Faust, K., Raes, J. & Dutilh, B. E. Computational approaches to predict bacteriophage-host relationships. FEMS Microbiol. Rev. 40, 258–272 (2016).
Li, J., Yang, F., Xiao, M. & Li, A. Advances and challenges in cataloguing the human gut virome. Cell Host Microbe 30, 908–916 (2022).
Galiez, C., Siebert, M., Enault, F., Vincent, J. & Söding, J. WIsH: who is the host? Predicting prokaryotic hosts from metagenomic phage contigs. Bioinformatics 33, 3113–3114 (2017).
Leite, D. M. C. et al. Computational prediction of inter-species relationships through omics data analysis and machine learning. BMC Bioinforma. 19, 420 (2018).
Li, M. et al. A deep learning-based method for identification of bacteriophage-host interaction. IEEE/ACM Trans. Comput. Biol. Bioinform. 18, 1801–1810 (2021).
Ruohan, W., Xianglilan, Z., Jianping, W. & Shuai Cheng, L. I. DeepHost: phage host prediction with convolutional neural network. Brief Bioinform. 23, bbab385 (2022).
Liu, D., Ma, Y., Jiang, X. & He, T. Predicting virus-host association by Kernelized logistic matrix factorization and similarity network fusion. BMC Bioinforma. 20, 594 (2019).
Wang, W. et al. A network-based integrated framework for predicting virus-prokaryote interactions. NAR Genom. Bioinform. 2, lqaa044 (2020).
Lu, C. et al. Prokaryotic virus host predictor: a Gaussian model for host prediction of prokaryotic viruses in metagenomics. BMC Biol. 19, 5 (2021).
Shang, J. & Sun, Y. Predicting the hosts of prokaryotic viruses using GCN-based semi-supervised learning. BMC Biol. 19, 250 (2021).
Shang, J. & Sun, Y. CHERRY: A computational method for accurate prediction of virus-pRokarYotic interactions using a graph encoder-decoder model. Brief Bioinform. 23, bbac182 (2022).
Tan, J. et al. HoPhage: an ab initio tool for identifying hosts of phage fragments from metaviromes. Bioinformatics 38, 543–545 (2022).
Tang, T., Hou, S., Fuhrman, J. A. & Sun, F. Phage-bacterial contig association prediction with a convolutional neural network. Bioinformatics 38, i45–i52 (2022).
Zielezinski, A., Deorowicz, S. & Gudys, A. PHIST: fast and accurate prediction of prokaryotic hosts from metagenomic viral sequences. Bioinformatics 38, 1447–1449 (2022).
Villarroel, J. et al. HostPhinder: a phage host prediction tool. Viruses 8, 116 (2016).
Boeckaerts, D. et al. Predicting bacteriophage hosts based on sequences of annotated receptor-binding proteins. Sci. Rep. 11, 1467 (2021).
Gonzales, M. E. M., Ureta, J. C. & Shrestha, A. M. S. Protein embeddings improve phage-host interaction prediction. PLoS ONE 18, e0289030 (2023).
Paez-Espino, D. et al. Uncovering Earth’s virome. Nature 536, 425–430 (2016).
Gregory, A. C. et al. The gut virome database reveals age-dependent patterns of virome diversity in the human gut. Cell Host Microbe 28, 724–740.e728 (2020).
Camarillo-Guerrero, L. F., Almeida, A., Rangel-Pineros, G., Finn, R. D. & Lawley, T. D. Massive expansion of human gut bacteriophage diversity. Cell 184, 1098–1109.e1099 (2021).
Nayfach, S. et al. Metagenomic compendium of 189,680 DNA viruses from the human gut microbiome. Nat. Microbiol. 6, 960–970 (2021).
Tisza, M. J. & Buck, C. B. A catalog of tens of thousands of viruses from human metagenomes reveals hidden associations with chronic diseases. Proc. Natl. Acad. Sci. USA 118, e2023202118 (2021).
Roux, S. et al. iPHoP: an integrated machine learning framework to maximize host prediction for metagenome-derived viruses of archaea and bacteria. PLoS Biol. 21, e3002083 (2023).
McGinnis, S. & Madden, T. L. BLAST: at the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Res. 32, W20–W25 (2004).
Ahlgren, N. A., Ren, J., Lu, Y. Y., Fuhrman, J. A. & Sun, F. Alignment-free $d_2^*$ oligonucleotide frequency dissimilarity measure improves prediction of hosts from metagenomically-derived viral sequences. Nucleic Acids Res. 45, 39–53 (2017).
Garcia-Doval, C. & van Raaij, M. J. Structure of the receptor-binding carboxy-terminal domain of bacteriophage T7 tail fibers. Proc. Natl. Acad. Sci. USA 109, 9390–9395 (2012).
Yehl, K. et al. Engineering phage host-range and suppressing bacterial resistance through phage tail fiber mutagenesis. Cell 179, 459–469.e459 (2019).
Chen, M. et al. Alterations in gp37 expand the host range of a T4-like phage. Appl. Environ. Microbiol. 83, e01576-17 (2017).
Santos, S. B. et al. Selection and characterization of a multivalent Salmonella phage and its production in a nonpathogenic Escherichia coli strain. Appl. Environ. Microbiol. 76, 7338–7342 (2010).
Pas, C., Latka, A., Fieseler, L. & Briers, Y. Phage tailspike modularity and horizontal gene transfer reveals specificity towards E. coli O-antigen serogroups. Virol. J. 20, 174 (2023).
Nobrega, F. L. et al. Targeting mechanisms of tailed bacteriophages. Nat. Rev. Microbiol. 16, 760–773 (2018).
Dams, D., Brøndsted, L., Drulis-Kawa, Z. & Briers, Y. Engineering of receptor-binding proteins in bacteriophages and phage tail-like bacteriocins. Biochem. Soc. Trans. 47, 449–460 (2019).
Opperman, C. J., Wojno, J. M. & Brink, A. J. Treating bacterial infections with bacteriophages in the 21st century. S. Afr. J. Infect. Dis. 37, 346 (2022).
Rakhuba, D. V., Kolomiets, E. I., Dey, E. S. & Novik, G. I. Bacteriophage receptors, mechanisms of phage adsorption and penetration into host cell. Pol. J. Microbiol. 59, 145–155 (2010).
Li, M. et al. Enhancing strain-level phage-host prediction through experimentally validated negatives and feature optimization strategies. bioRxiv https://doi.org/10.1101/2025.05.31.656987 (2025).
Islam, M. Z. et al. Molecular anatomy of the receptor binding module of a bacteriophage long tail fibre. PLoS Pathog. 15, e1008193 (2019).
Nelson, D., Schuch, R., Chahales, P., Zhu, S. & Fischetti, V. A. PlyC: a multimeric bacteriophage lysin. Proc. Natl. Acad. Sci. USA 103, 10765–10770 (2006).
Flamholz, Z. N., Biller, S. J. & Kelly, L. Large language models improve annotation of prokaryotic viral proteins. Nat. Microbiol. 9, 537–549 (2024).
Dosovitskiy, A. et al. An image is worth 16x16 words: transformers for image recognition at scale. In International Conference on Learning Representations (2021).
Elnaggar, A. et al. ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 44, 7112–7127 (2022).
Kim, G. B., Gao, Y., Palsson, B. O. & Lee, S. Y. DeepTFactor: a deep learning-based tool for the prediction of transcription factors. Proc. Natl. Acad. Sci. USA 118, e2021171118 (2021).
Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150–3152 (2012).
Lloyd-Price, J. et al. Multi-omics of the gut microbial ecosystem in inflammatory bowel diseases. Nature 569, 655–662 (2019).
Schirmer, M., Garner, A., Vlamakis, H. & Xavier, R. J. Microbial genes and pathways in inflammatory bowel disease. Nat. Rev. Microbiol. 17, 497–511 (2019).
Ternes, D. et al. The gut microbial metabolite formate exacerbates colorectal cancer progression. Nat. Metab. 4, 458–475 (2022).
Wong, S. H. & Yu, J. Gut microbiota in colorectal cancer: mechanisms of action and clinical applications. Nat. Rev. Gastroenterol. Hepatol. 16, 690–704 (2019).
Qin, Y. et al. Consistent signatures in the human gut microbiome of old- and young-onset colorectal cancer. Nat. Commun. 15, 3396 (2024).
Liu, R. et al. Gut microbiome and serum metabolome alterations in obesity and after weight-loss intervention. Nat. Med. 23, 859–868 (2017).
Jie, Z. et al. The gut microbiome in atherosclerotic cardiovascular disease. Nat. Commun. 8, 845 (2017).
Wu, C. et al. Obesity-enriched gut microbe degrades myo-inositol and promotes lipid absorption. Cell Host Microbe 32, 1301–1314.e1309 (2024).
Wang, T. et al. Divergent age-associated and metabolism-associated gut microbiome signatures modulate cardiovascular disease risk. Nat. Med. 30, 1722–1731 (2024).
Qin, J. et al. A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature 490, 55–60 (2012).
Shen, J. et al. Large-scale phage cultivation for commensal human gut bacteria. Cell Host Microbe 31, 665–677.e667 (2023).
Zhu, F. et al. Metagenome-wide association of gut microbiome features for schizophrenia. Nat. Commun. 11, 1612 (2020).
Alpizar-Rodriguez, D. et al. Prevotella copri in individuals at risk for rheumatoid arthritis. Ann. Rheum. Dis. 78, 590–593 (2019).
Scher, J. U. et al. Expansion of intestinal Prevotella copri correlates with enhanced susceptibility to arthritis. Elife 2, e01202 (2013).
Maeda, Y. et al. Dysbiosis contributes to arthritis development via activation of autoreactive T cells in the intestine. Arthritis Rheumatol. 68, 2646–2661 (2016).
Tsai, C. Y. et al. Abundance of Prevotella copri in the gut microbiota is inversely related to a healthy diet in patients with type 2 diabetes. J. Food Drug Anal. 31, 599–608 (2023).
Yue, T. et al. High-risk genotypes for type 1 diabetes are associated with the imbalance of gut microbiome and serum metabolites. Front. Immunol. 13, 1033393 (2022).
Yang, C. et al. Prevotella copri alleviates hyperglycemia and regulates gut microbiota and metabolic profiles in mice. mSystems 9, e0053224 (2024).
Devoto, A. E. et al. Megaphages infect Prevotella, and variants are widespread in gut microbiomes. Nat. Microbiol. 4, 693–700 (2019).
Weston, J., Elisseeff, A., Zhou, D., Leslie, C. S. & Noble, W. S. Protein ranking: from local to global structure in the protein similarity network. Proc. Natl. Acad. Sci. USA 101, 6559–6563 (2004).
Copp, J. N., Anderson, D. W., Akiva, E., Babbitt, P. C. & Tokuriki, N. Exploring the sequence, function, and evolutionary space of protein superfamilies using sequence similarity networks and phylogenetic reconstructions. Methods Enzymol. 620, 315–347 (2019).
Dey, K. K., Xie, D. & Stephens, M. A new sequence logo plot to highlight enrichment and depletion. BMC Bioinforma. 19, 473 (2018).
Gupta, A., Osadchiy, V. & Mayer, E. A. Brain-gut-microbiome interactions in obesity and food addiction. Nat. Rev. Gastroenterol. Hepatol. 17, 655–672 (2020).
Kasai, C. et al. Comparison of the gut microbiota composition between obese and non-obese individuals in a Japanese population, as analyzed by terminal restriction fragment length polymorphism and next-generation sequencing. BMC Gastroenterol. 15, 100 (2015).
Kocelak, P. et al. Resting energy expenditure and gut microbiota in obese and normal weight subjects. Eur. Rev. Med. Pharmacol. Sci. 17, 2816–2821 (2013).
Fujimoto, K. et al. An enterococcal phage-derived enzyme suppresses graft-versus-host disease. Nature 632, 174–181 (2024).
Dzuvor, C. K. O. et al. Engineering self-assembled endolysin nanoparticles against antibiotic-resistant bacteria. ACS Appl. Bio. Mater. https://doi.org/10.1021/acsabm.2c00741 (2022).
Lee, C., Kim, J., Son, B. & Ryu, S. Development of advanced chimeric endolysin to control multidrug-resistant Staphylococcus aureus through domain shuffling. ACS Infect. Dis. 7, 2081–2092 (2021).
Diez-Martinez, R. et al. Improving the lethal effect of Cpl-7, a pneumococcal phage lysozyme with broad bactericidal activity, by inverting the net charge of its cell wall-binding module. Antimicrob. Agents Chemother. 57, 5355–5365 (2013).
Zampara, A. et al. Exploiting phage receptor-binding proteins to enable endolysins to kill Gram-negative bacteria. Sci. Rep. 10, 12087 (2020).
Hendrix, R. W., Smith, M. C., Burns, R. N., Ford, M. E. & Hatfull, G. F. Evolutionary relationships among diverse bacteriophages and prophages: all the world’s a phage. Proc. Natl. Acad. Sci. USA 96, 2192–2197 (1999).
Adriaenssens, E. M. Phage diversity in the human gut microbiome: a taxonomist’s perspective. mSystems 6, e0079921 (2021).
Raghu, M., Unterthiner, T., Kornblith, S., Zhang, C. & Dosovitskiy, A. Do vision transformers see like convolutional neural networks? Adv. Neural Inf. Process. Syst. 34, 12116–12128 (2021).
Hyatt, D. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinforma. 11, 119 (2010).
Li, W. et al. RefSeq: expanding the Prokaryotic Genome Annotation Pipeline reach with protein family model curation. Nucleic Acids Res. 49, D1020–D1028 (2021).
Cantalapiedra, C. P., Hernandez-Plaza, A., Letunic, I., Bork, P. & Huerta-Cepas, J. eggNOG-mapper v2: functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Mol. Biol. Evol. 38, 5825–5829 (2021).
Aziz, R. K. et al. The RAST Server: rapid annotations using subsystems technology. BMC Genomics 9, 75 (2008).
Jones, P. et al. InterProScan 5: genome-scale protein function classification. Bioinformatics 30, 1236–1240 (2014).
Lu, Y. Y., Noble, W. S. & Keich, U. A BLAST from the past: revisiting blastp’s E-value. Bioinformatics 40, btae729 (2024).
Skewes-Cox, P., Sharpton, T. J., Pollard, K. S. & DeRisi, J. L. Profile hidden Markov models for the detection of viruses within metagenomic sequence data. PLoS ONE 9, e105067 (2014).
Mitchell, A. L. et al. InterPro in 2019: improving coverage, classification and access to protein sequence annotations. Nucleic Acids Res. 47, D351–D360 (2019).
Shen, W., Le, S., Li, Y. & Hu, F. SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLoS ONE 11, e0163962 (2016).
Song, W. et al. Prophage Hunter: an integrative hunting tool for active prophages. Nucleic Acids Res. 47, W74–W80 (2019).
Wickham, H. Data analysis. In ggplot2: Elegant Graphics for Data Analysis (Springer, 2016).
Guo, X. et al. CNSA: a data repository for archiving omics data. Database (Oxford) 2020 (2020).
Chen, F. Z. et al. CNGBdb: China National Genebank Database. Yi Chuan 42, 799–809 (2020).
Acknowledgments
This work was supported by the National Key R&D Program of China (Grant No. 2020YFA0908700) to M.X., Z.D., Junhua L., and Jianqiang L., and in part by the National Natural Science Fund for Distinguished Young Scholars (Grant No. 62325307) to Jianqiang L. We sincerely thank the China National GeneBank Database (CNGB) for providing valuable data support and computational resources.
Author information
Contributions
M.X. conceived the study. Z.D., K.L., and Y.O. developed the tool. M.L. and B.X. compiled the training, validation, and test sets. K.L., M.L., B.X., and M.X. analyzed the viral dark matter. K.L., M.L., B.X., and Y.O. drafted the manuscript and made the figures. Z.L. and M.L. maintained and improved the code. Z.D., M.X., and Junhua L. revised the manuscript. W.S., J.C., and Jianqiang L. provided consultation. B.X., Y.O., and Z.L. contributed equally to this work. All authors read, edited, and approved the final manuscript.
Ethics declarations
Competing interests
Authors affiliated with BGI Research (M.L., M.X., and Junhua L.) and authors affiliated with Shenzhen University (Z.D., K.L., and Jianqiang L.) have filed a patent application titled “Method and System for Predicting Host Range of Bacteriophages, and Corresponding Computer Device or Medium” (Application No. PCT/CN2025/075560). The application describes the prediction algorithm developed in this study, which predicts the host specificity of phages based on tail and lysin proteins. The other authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Karthik Anantharaman and the other anonymous reviewers for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Du, Z., Li, M., Lin, K. et al. High-resolution phage-host assignment through key proteins using large language models. Nat Commun (2026). https://doi.org/10.1038/s41467-026-70613-x