Integrating metagenomics and cultivation unveils oral phage diversity and potential impact on hosts

Jie, Zhuye; Liang, Hewei; Meng, Yanzheng; Zhang, Jiahao; Zhang, Tao; Li, Wenxi; Lin, Xiaoqian; Hu, Tongyuan; Han, Mo; Liang, Weiting; Ju, Yanmei; Tong, Xin; Jin, Xin; Xu, Xun; Zhang, Wenwei; Wang, Jian; Yang, Huanming; Kristiansen, Karsten; Xiao, Liang; Zou, Yuanqiang

doi:10.1038/s41522-025-00773-z

Article
Open access
Published: 26 July 2025

Zhuye Jie^1,2,3^na1,
Hewei Liang^1,4^na1,
Yanzheng Meng^4,5^na1,
Jiahao Zhang⁴^na1,
Tao Zhang^1,2^na1,
Wenxi Li^4,6,
Xiaoqian Lin⁴,
Tongyuan Hu¹,
Mo Han⁷,
Weiting Liang¹,
Yanmei Ju¹,
Xin Tong^2,4,
Xin Jin⁸,
Xun Xu⁸,
Wenwei Zhang⁸,
Jian Wang^4,9,
Huanming Yang^4,9,
Karsten Kristiansen^3,4,10,
Liang Xiao^8,11 &
Yuanqiang Zou^8,11

npj Biofilms and Microbiomes volume 11, Article number: 145 (2025)

Subjects

Metagenomics

Abstract

Bacteriophages play important roles in the regulation of bacterial communities and may thus impact human health. However, the scarcity of genomic data on oral microorganisms has limited information on oral phages. We collected data on 5427 metagenomic samples and 2178 cultivated bacterial genomes across different geographical areas and populations, generating the Oral Phage Database (OPD), comprising 189,859 representative phage genome sequences, including 3416 huge phages (genome size > 200 kbp). OPD reveals that most oral viruses are unknown and encode an enormous variety of dark proteins. Numerous oral phages infecting a broad range of hosts carry anti-defense genes, auxiliary metabolic genes, and virulence factors that may affect bacterial metabolism and influence human health. The composition of oral phages varies among different populations, and several phages have the potential to act as biomarkers for disease. OPD expands our knowledge of phage-bacteria interactions, huge phage diversity, and potential impacts on human health.

Introduction

The human body hosts an abundance of microorganisms that mainly reside in the gut, in the oral cavity, and on the skin¹. As the organ with the second highest abundance of microorganisms², the human oral cavity harbors ~10⁸ virus-like particles per milliliter of saliva³, most of which are bacteriophages. Because phages can modulate oral microbial communities by infecting host bacteria and transferring genes, and may thereby shape oral microbial ecosystems⁴, they could impact human health and hold therapeutic potential against dental diseases caused by bacterial infections^5,6.

In recent years, bioinformatics tools and pipelines for recovering viruses from assembled sequences have been developed^7,8, leading to a series of virus genome catalogs, such as the human Gut Virome Database (GVD)⁹, the Gut Phage Database (GPD)¹⁰, the Metagenomic Gut Virus catalog (MGV)¹¹, and the Oral Virus Database (OVD)¹². These catalogs constitute important resources for studies exploring the relationships between phages, bacteria, and humans¹³, expanding disease-associated biomarkers¹⁴, and revealing horizontal gene transfer (HGT) via phages¹⁵. Nevertheless, these studies have several limitations. First, data based mainly on a series of independent microbiome studies may include batch effects and contamination. Second, the phage-to-host and phage-to-phage interaction networks were not characterized in detail. Third, ignoring phenotype information (e.g., lifestyle, disease, ethnicity) limits investigations of associations between phage composition and human phenotypes. Moreover, compared to the large number and high diversity of viral genomes described in the gut^16,17, the phageome of the oral cavity has not been fully explored, calling for the construction of an oral phage genome catalog combined with genome annotation, host assignment, relative abundance analysis, and database comparisons. Despite recent attempts to perform metagenomic exploration of phages in the oral cavity¹², a number of human oral microbiota sequencing datasets were not fully used¹⁸, and the phage distribution in the oral microbiota remains largely underexplored.

To expand our knowledge of the oral phageome, we performed a comprehensive large-scale viral genome identification based on 5427 oral metagenomes and 2178 genomes of oral isolated bacteria to establish an extensive phylogenomic and gene functional database, the Oral Phage Database (OPD). More than 90% (4918/5427) of the metagenomics samples with phenotypic information were from China and were sequenced on our in-house platform. We uncovered a diverse array of 189,859 nonredundant phage genomes with less than 95% nucleotide similarity, with a median size of 27.61 kbp, including 3416 huge phages with genome sizes larger than 200 kbp. By combining both genome-wide and proteome-wide analyses, we revealed a complex interaction network between oral phages and their hosts and further revealed specific phage compositions in different populations. Thus, OPD represents a comprehensive source of oral phages and provides insight into their potential impact on the host.

Results

Construction of the Oral Phage Database (OPD)

To explore the diversity of the phageome in the human oral cavity, 5427 oral metagenomic biosamples from multiple cohorts and 2178 genomes from cultivated oral bacteria were collected as the data source. The data comprised a total of 17.26 Tbp of sequencing data generated in-house and from public datasets, including the following: (1) 3953 samples from Shenzhen, China, an industrialized area, including 2675 samples from the 4D-SZ (Disease, Drug, Diet, Daily life) cohort¹⁸, and 1278 newly sequenced samples; (2) 671 samples from the Chinese Yunnan cohort representing individuals living in a non-industrialized environment¹⁸; (3) 294 samples including rheumatoid arthritis (RA) patients from the Chinese Beijing cohorts¹⁹; and (4) a total of 509 public oral metagenomes downloaded from the NCBI SRA database, including 102 from Fiji²⁰, 149 from France²¹, 72 from Germany²¹, 85 from Luxembourg²², and 101 from the United States²³ (Supplementary Fig. 1A, Supplementary Table 1, Supplementary Table 2). In addition, we retrieved the genomes of 2178 bacterial isolates cultured from the human oral cavity, of which 1089 were derived from healthy Chinese individuals²⁴, and the remaining 1089 genomes were collected from the expanded Human Oral Microbiome Database (eHOMD)². The retrieval of cultured bacterial genomes provided curated viral sequences along with specific host genome-resolved information. Over 670 million raw contigs from the data source were scanned by VirFinder and VirSorter2 to identify viral-like sequences^7,8. Initially, a total of 12,545,619 virus-like contigs were identified.
Since some contaminating mobile genetic elements (MGEs) such as plasmids or integrative and conjugative elements, as well as human sequences, were present in these virus-like contigs, and virus-like contigs from different samples may exhibit sequence identity exceeding 95%, a quality control pipeline was designed to filter out such sequences, as well as those shorter than 10 kbp in metagenomes and 1 kbp in the genomes of bacterial isolates. Finally, 189,859 viral-like contigs with a median size of 27.61 kbp were obtained, resulting in the establishment of an integrated, consistently processed, non-redundant database, the OPD (Fig. 1A). We used CheckV to evaluate the level of completeness and contamination²⁵. As a result, 4709 sequences (2.5%) were assigned as complete and high quality (>90% completeness) and 53,432 sequences (28.1%) were assigned as medium quality (50-90% completeness) (Supplementary Fig. 1B, Supplementary Table 3). The sequences with a completeness greater than 50% were defined as viral draft genomes, with a median genome length of 48,519 bp and a median CheckV genome completeness of 65.1% (Fig. 1B and C), and these genomes were used for subsequent analyses. To generalize about global phage distribution properties, viral genomes sharing high sequence identity can be further clustered at suitable taxonomic levels for subsequent analysis. We grouped the oral viral draft genomes with reference genomes based on shared protein clusters by vConTACT2²⁶, whereby 1915 non-singleton viral clusters (VCs) with 10,382 sub viral clusters (subVCs) were generated. After elimination of subVCs only containing vConTACT2 reference genomes, 9983 subVCs remained (Supplementary Table 4), of which 489 represented phages from cultured bacterial isolate data (Fig. 1D). vConTACT2 provides high-confidence genus assignments for identified VCs. For each subVC, the genome with the highest quality score assessed by CheckV was selected as the representative²⁵.
Among these subVCs, 64.8% comprised only one member, indicating that their genomes are distant from those of other phages (Fig. 1E). More than 57% of the subVCs were predicted to be lytic phages, which is consistent with previously reported studies¹² (Fig. 1F).
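
The completeness tiers described above can be sketched as a small helper. This is a minimal illustration, assuming CheckV completeness is supplied as a percentage; the names `quality_tier` and `tier_fractions` are hypothetical, and the handling of values lying exactly on the 50% and 90% boundaries is an assumption rather than something stated in the study.

```python
def quality_tier(completeness: float) -> str:
    """Map a CheckV completeness estimate (percent) to the tiers used in the text."""
    if completeness > 90:
        return "high"    # complete and high quality (>90% completeness)
    if completeness >= 50:
        return "medium"  # medium quality (50-90%); boundary handling is assumed
    return "low"         # not retained as a viral draft genome


def tier_fractions(counts: dict, total: int) -> dict:
    """Return each tier's share of all representative sequences, in percent."""
    return {tier: round(100 * n / total, 1) for tier, n in counts.items()}


# Counts reported in the text for the 189,859 representative sequences.
reported = {"high": 4709, "medium": 53432}
fractions = tier_fractions(reported, 189859)

# Sequences above 50% completeness form the viral draft genome set.
draft_genomes = sum(reported.values())
```

The two retained tiers together give the 58,141 viral draft genomes that later serve as the denominator for host assignment.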

**Fig. 1: Framework for constructing the Oral Phage Database (OPD) and genome feature statistics.**

To identify the unique phage genomes in OPD, we compared the OPD catalog with existing virus catalogs, including GVD, GPD, CHVD, and OVD, by clustering viral draft genomes via vConTACT2. Note that we also attempted to cluster OPD with MGV, but this exceeded the memory limits of our high-performance computing (HPC) cluster. Strikingly, we found that the oral phages from OPD exhibited little overlap with phages from gut virus catalogs, indicating distinct phage compositions in the gut and the oral cavity (Fig. 1G). A total of 20,136 phage genomes did not cluster with genomes from other catalogs, suggesting that OPD considerably expands the list of phages in the human oral cavity (Fig. 1H).

OPD expands oral phage genome diversity

To systematically assess the taxonomic diversity within the OPD, we annotated all subVC representative genomes using geNomad with the International Committee on Taxonomy of Viruses (ICTV) MSL39 database^27,28. Because phage genomes mutate frequently and no gene is shared by all phage genomes, we used VIPtree to construct phylogenetic trees based on the proteomes of each subVC representative to investigate the phylogeny of the phage genomes (Fig. 2)²⁹. In total, 99.87% (9970/9983) of subVCs were annotated at the class level, with the majority classified as Caudoviricetes at 99.68% (9951/9983), and the remaining as Malgrandaviricetes (16/9983) and Faserviricetes (2/9983). However, only 238 subVCs could be annotated at the order level, of which 220 were Crassvirales, accounting for 2.20% (220/9983) of all oral phages. Furthermore, only 67 phages were annotated at the family level, indicating that the majority of oral phages remain challenging to classify taxonomically. Our research thus reveals many genomes from potentially novel phages.

The sample information of OPD enabled us to identify phages that were shared between geographic regions. Since China contributed the most data, we compared the phage composition of subVCs between samples from China and samples from other countries (Supplementary Fig. 2). In samples from China, we identified 9566 subVCs, a number far greater than the number observed in samples from other countries. Among these 9566 subVCs, 7620 (79.65%) were not detected in samples from other countries. Thirty-three subVCs were present in all countries, which reflects globally distributed phage strains that may infect globally distributed bacteria.
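
The regional comparison above amounts to set operations on per-country subVC presence. The following is a minimal sketch, assuming presence has already been reduced to one set of subVC identifiers per region; the function `compare_regions`, the region keys, and the toy identifiers are all hypothetical and do not reproduce the study's data.

```python
def compare_regions(presence: dict, focus: str = "China") -> dict:
    """Summarize subVC sharing; presence maps region name -> set of subVC ids."""
    focal = presence[focus]
    others = [ids for region, ids in presence.items() if region != focus]
    # SubVCs seen in at least one region other than the focal one.
    elsewhere = set().union(*others) if others else set()
    # SubVCs detected in every region, including the focal one.
    everywhere = set.intersection(*presence.values())
    return {
        "focus_total": len(focal),
        "focus_only": len(focal - elsewhere),
        "shared_by_all": len(everywhere),
    }


# Toy presence data, invented purely for illustration.
toy = {
    "China": {"subVC_a", "subVC_b", "subVC_c", "subVC_d"},
    "Fiji": {"subVC_a", "subVC_b", "subVC_e"},
    "France": {"subVC_a", "subVC_c"},
}
summary = compare_regions(toy)
```

In the study, the corresponding quantities are 9566 subVCs in Chinese samples, 7620 of them not detected elsewhere, and 33 present in all countries.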

Proteome-wide comparison of structures identifies anti-defense genes in oral phages

To understand the function of oral phages, we investigated the proteins encoded by these phages. A total of 9,181,412 encoded proteins in OPD were annotated using the eggNOG-mapper^30,31, yielding results covering several functional databases including COG, GO, KEGG, and PFAM. Among the proteins with COG functional annotations, the most enriched functional category was information storage and processing with 9.24% of all proteins fitting these typical viral functions (Fig. 3A). However, 78.71% (7,226,689) of these proteins were unknown, indicating that little is known about the functional potential of human oral phages.

**Fig. 3: Identification of putative functional proteins from OPD using sequence and structure information.**

Previous studies have reported that phages from different habitats may carry specific genes to promote their survival³². To identify OPD-specific proteins, we clustered the OPD proteins together with all proteins from existing virus catalogs, including GVD, GPD, CHVD, MGV, and OVD^9,10,11,12,33. First, we used MMseqs2 to cluster all proteins on the basis of 70% sequence identity and a 90% sequence alignment, resulting in 2,762,333 protein clusters (PCs)³⁴. Examining the composition of these clusters, we observed that 522,534 proteins were unique to OPD and did not cluster with proteins from other catalogs (Fig. 3B). This high number of proteins found only in the OPD suggests an untapped novel functional potential for proteins in the human oral cavity. Next, we re-clustered all the OPD-specific protein sequences with 30% sequence identity and a 70% sequence alignment into 279,401 PCs, of which 111,622 had COG annotations (Supplementary Fig. 3A).

As recent breakthroughs in protein 3D structure prediction have enabled accurate and fast structural characterization of protein sequences, we ran ESMFold to fold the representative sequences of OPD-specific PCs with sequence lengths between 30 and 1800 amino acids³⁵. The results are summarized in Fig. 3C. Overall, among these 271,139 proteins, 92,192 were predicted by ESMFold with good confidence (mean pLDDT > 0.5 and pTM > 0.5), corresponding to 34% of all of the proteins, and 42,479 were predicted with high confidence (mean pLDDT > 0.7 and pTM > 0.7), corresponding to 15.7% of the total folded structures (Supplementary Table 5).
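
The two confidence cutoffs above can be expressed as a simple filter. This is a sketch under stated assumptions: both scores are taken on the 0 to 1 scale used in the text, the high-confidence set is treated as a subset of the good-confidence set, and the names `passes` and `summarize` as well as the toy predictions are hypothetical.

```python
def passes(mean_plddt: float, ptm: float, cutoff: float) -> bool:
    """True when both confidence scores exceed the same cutoff."""
    return mean_plddt > cutoff and ptm > cutoff


def summarize(predictions: list) -> dict:
    """Count predictions passing the good (0.5) and high (0.7) cutoffs."""
    return {
        "total": len(predictions),
        "good": sum(passes(p, t, 0.5) for p, t in predictions),
        "high": sum(passes(p, t, 0.7) for p, t in predictions),
    }


# Toy (mean pLDDT, pTM) pairs, invented for illustration.
toy_predictions = [(0.82, 0.75), (0.64, 0.58), (0.91, 0.45), (0.40, 0.30)]
toy_summary = summarize(toy_predictions)

# Percentages recomputed from the counts reported in the text.
reported_good_pct = round(100 * 92192 / 271139, 1)
reported_high_pct = round(100 * 42479 / 271139, 1)
```

Requiring both scores to pass means that a structure with confident local geometry but an uncertain overall fold (the third toy entry) is excluded.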

Structural comparisons enabled the identification of distant homologs that were not identified by sequence alignment. Phages have evolved many mechanisms to counter bacterial defense systems as the result of an “arms race” between phage and host³⁶. One recently reported anti-defense gene is Thoeris anti-defense 2 (Tad2), which encodes a protein that can bind to 1″–3′ gcADPR to disarm the Thoeris defense system of the bacteria³². Tad2 has been reported in gut phages, but its homologs have not been investigated in oral phages. We used Foldseek to search the structure of SPO1 phage Tad2 (PDB: 8smf) against high-quality structures in OPD³⁷, and obtained 6 candidates (Supplementary Fig. 3B). Phylogenetic analysis of these proteins demonstrated that they are distinct homologs of Tad2 (Fig. 3D). As Tad2 forms a homotetrameric structure to bind 1″–3′ gcADPR, we used ColabFold to predict the homotetrameric structures of these 6 candidates³⁸. However, only opdg_86857_54 was predicted successfully with high pLDDT and ipTM. The sequence identity of opdg_86857_54 to 8smf was only 20.93%, but the structural similarity was very high, with an RMSD of 2.55 Å and a TM-score of 0.78 (Fig. 3E). To further corroborate its molecular function, molecular dynamics (MD) simulation was used to assess the binding stability of the complex between 1″–3′ gcADPR and the opdg_86857_54 homotetramer. Specifically, we set 1″–3′ gcADPR as the ligand, the opdg_86857_54 homotetramer as the protein, amber/ff14SB as the protein force field, gaff-2.11 as the ligand force field, and TIP3P as the water model. After minimizing the ligand-protein complex, we ran the simulation for 100 ns (Fig. 3F). The analysis indicated that 1″–3′ gcADPR can bind tightly to the protein pocket in a reasonable pose, suggesting that oral phages may encode a series of novel Tad2 homologs to promote resistance to host-elicited killing.
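
To make the TM-score reported above more concrete, the sketch below evaluates the standard TM-score sum for one fixed superposition. It is illustrative only: alignment tools search over superpositions to maximize this quantity, which is not attempted here, and the function names, the 0.5 Å floor on the distance scale, and the toy inputs are assumptions.

```python
def d0_scale(target_length: int) -> float:
    """Length-dependent distance scale (in angstroms) used by the TM-score."""
    if target_length <= 15:
        return 0.5  # assumed floor for very short chains
    return max(0.5, 1.24 * (target_length - 15) ** (1 / 3) - 1.8)


def tm_score(distances: list, target_length: int) -> float:
    """TM-score for one superposition.

    distances holds the C-alpha distance (angstroms) of each aligned residue
    pair; unaligned target residues contribute nothing to the sum.
    """
    d0 = d0_scale(target_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length


# A perfect superposition of every residue scores 1.0.
perfect = tm_score([0.0] * 100, 100)
# Aligning only half of the residues, even perfectly, caps the score at 0.5.
half_aligned = tm_score([0.0] * 50, 100)
```

Because the score is normalized by the target length rather than by the number of aligned residues, it penalizes partial matches. Values above roughly 0.5 are commonly read as indicating the same overall fold, which is why a TM-score of 0.78 at only 20.93% sequence identity points to a distant homolog.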

Broad host range of bacteriophages and auxiliary functions

The interaction between phages and bacteria plays an important role in the regulation of oral microorganisms¹³, but the host range of most oral phages has not been described in detail. The retrieval of 2178 genomes of isolated bacteria from the oral cavity enabled the identification of spacer sequences in CRISPR arrays that are copied from bacteriophages. We screened all 2178 genomes of isolated bacteria from our data source, mining 23,636 CRISPR spacers and 1810 Cas proteins with high confidence (Supplementary Fig. 4A). After matching the spacers with all the viral draft genomes in OPD, a total of 99,612 matched pairs enabled the assignment of 25.34% (14,732/58,141) of the phages to hosts. To obtain an overview of phage-host interactions, we decorated the bacterial phylogenetic tree with infection relationships derived from 6584 bacterial species-subVC pairs, and subsequently constructed a genus-level interaction network comprising 578 bacterial genus-subVC pairs (Fig. 4A, Supplementary Fig. 4B, Supplementary Table 6). We further conducted host prediction using the integrated machine learning framework iPHoP³⁹. The iPHoP results revealed a significantly broader host range. Although we only considered the top 5 host predictions from each method, these predictions matched 61.59% (4055/6584) of the bacterial species-subVC pairs predicted by the oral bacterial CRISPR spacer matching described above (Supplementary Fig. 4C). Based on this concordance, we used the CRISPR spacer-based results as our primary reference for host assignment in downstream analyses. At the species level, our analyses indicated that nearly half (1436/2896) of the subVCs could infect more than one bacterial species (Supplementary Fig. 4D). Such cross-species infecting phages have been described in the gut phageome, but this phenomenon seems to be more common in the microbiota of the oral cavity¹⁰.
We next determined the diversity of phages infecting oral bacteria and found that the dominant bacteria in the oral cavity harbored a highly diverse population of phages (Fig. 4B). For instance, the three major species colonizing the oral cavity, Neisseria flavescens, Neisseria sicca, and Streptococcus oralis were found to harbor more than 500 subVCs, ranking these species as the top 3 species with the highest viral diversity. To further explore the associations between these cross-species infecting subVCs and their hosts in a more comprehensive manner, we mapped subVCs to the phylogenetic tree of oral bacteria (Fig. 4A). Most of the cross-species infecting subVCs targeted species of Streptococcus and Prevotella. In contrast, as the most abundant genus in the oral cavity, Neisseria harbored the most diversified subVCs, but none of these subVCs were found to be present in other species, which indicates a high specificity of phages infecting Neisseria. The bacteria from the Actinobacteriota and Fusobacteriota phyla harbored few cross-species phages. Taken together, we concluded that the host range of oral phages is closely related to bacterial phylogeny, reflecting the coevolution of phages and bacteria.
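
The host-range summary above can be sketched as a grouping step over spacer-derived pairs. This is a minimal illustration, assuming spacer matching has already produced (subVC, bacterial species) pairs; the function names and the toy pairs are hypothetical and are not taken from the study's data.

```python
from collections import defaultdict


def host_ranges(pairs: list) -> dict:
    """Group (subVC, species) pairs into a subVC -> set of host species map."""
    hosts = defaultdict(set)
    for subvc, species in pairs:
        hosts[subvc].add(species)
    return dict(hosts)


def multi_host_fraction(hosts: dict) -> float:
    """Fraction of host-assigned subVCs linked to more than one species."""
    if not hosts:
        return 0.0
    multi = sum(1 for species in hosts.values() if len(species) > 1)
    return multi / len(hosts)


# Toy pairs, invented for illustration only.
toy_pairs = [
    ("subVC_1", "Streptococcus oralis"),
    ("subVC_1", "Streptococcus mitis"),
    ("subVC_2", "Neisseria sicca"),
    ("subVC_3", "Prevotella melaninogenica"),
    ("subVC_3", "Prevotella histicola"),
    ("subVC_4", "Neisseria flavescens"),
]
toy_fraction = multi_host_fraction(host_ranges(toy_pairs))

# Share of cross-species subVCs recomputed from the counts in the text.
reported_pct = round(100 * 1436 / 2896, 1)
```

Using sets per subVC means that repeated spacer hits to the same species count once, so the fraction reflects the breadth of the host range rather than the number of matches.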

**Fig. 4: The interaction between oral phages and bacterial hosts.**

The genes carried by bacteriophages may assist host bacteria in survival and may have an impact on humans. Thus, we focused on proteins involved in key functions such as metabolism and virulence. Auxiliary metabolic genes (AMGs) refer to phage-encoded genes that enhance the metabolic capacity of the host to facilitate infection^40,41. We investigated the presence of AMGs in oral phages by focusing on KEGG Orthologies (KOs) associated with metabolism identified in oral phages and bacteria (Supplementary Fig. 5). Overall, 76.16% (1083/1422) of the KOs were shared by oral phages and bacteria, and 91.60% (992/1083) of the shared KOs could be defined as AMGs (Fig. 4C). As visualized by the metabolic map, main metabolic pathways including carbohydrate metabolism, energy metabolism, and amino acid metabolism were shared by oral phages and bacteria and were mainly classified as AMGs. Additionally, most of the KOs associated with nucleotide metabolism were also encoded by phages and bacteria, but were not reported as AMGs⁴⁰.

Among all proteins in OPD, 77,773 proteins from 10,267 phages were defined as virulence factors (VFs) according to the Virulence Factor Database (VFDB)⁴², and more than half (13 of 25) of the top 25 types of VFs were significantly enriched in lysogenic phages (Fig. 4D), with only 3 of 25 being significantly enriched in lytic phages. This result supports the assumption of inclusive fitness within bacterial populations as an explanation of why the retention of phage-mediated VFs benefits the bacterial host. Notably, according to VFDB annotations, there are numerous phage-encoded proteins associated with bacterial adhesion, including 2167 proteins related to elongation factor Tu (EF-Tu), 1642 proteins related to the autotransporter adhesin UpaG, 723 proteins related to Bartonella adhesin A (BadA), and 610 proteins related to type IV pili (T4P). Interestingly, T4P has been reported as a coat protein of phages⁴³, and the hosts of these phages all belong to Neisseria, including Neisseria sicca and Neisseria flavescens, which are defined as opportunistic pathogens, suggesting that some oral phages might assist pathogens in adhering to host cells.

Marked variation in the phageome across populations

Previous studies have reported that environmental factors (e.g., lifestyle, geography, urbanization) play an important role in shaping the composition of the microbiota^44,45,46,47. We reasoned that the oral phage community would also be affected by these factors. To investigate the variation in the oral phageome, we profiled the human oral microbiome by mapping the metagenomic reads to a comprehensive database including representative phage genomes from OPD, bacterial genomes from the oral species-level genome bins (SGBs) catalog¹⁸, and fungal genomes from NCBI Taxonomy⁴⁸. Although the relative abundances of phages varied between different countries, they were consistently less than 8.47% among all kinds of microorganisms (Fig. 5A), and the relative abundance in Chinese samples was greater than that in non-Chinese samples (Supplementary Fig. 6A), possibly reflecting differences related to ethnicity, lifestyle, and geography, all factors warranting further exploration.

**Fig. 5: Variations of oral phage compositions across different populations.**

To further examine to what extent the oral phageome was affected by human lifestyle, we took advantage of our Chinese cohorts, which included 2675 individuals from Shenzhen, a highly industrialized and densely populated metropolitan city, and 671 individuals from the Yunnan Province, a less developed province where people have relatively low incomes and follow a more traditional rural lifestyle⁴⁷ (Supplementary Table 2). The relative abundance of oral phages in the population from the industrialized area was significantly greater than that in the population living in the less industrialized area (Fig. 5B), with the opposite trend for the abundance of bacteria. Principal coordinates analysis (PCoA) revealed a clear separation between the populations from Shenzhen and Yunnan (Fig. 5C). We defined the dominant phages as the 10 most abundant phages in each population (Supplementary Fig. 6B). Although the phage composition of the populations from Shenzhen and Yunnan varied significantly, 6 core phages dominated the populations from both Shenzhen and Yunnan. Individuals from Shenzhen showed an enrichment of VC_486_17 and VC_1899_0, a complete circular lytic phage (genome size of 106 kbp) without host assignment, compared with populations from Yunnan. These findings were consistent with previous studies reporting that gut phageome structure variations are associated with urbanization and lifestyle⁴⁷. Taken together, phage composition and diversity differed between populations, suggesting that industrialization and lifestyle affect the human oral ecosystem.

Studies have demonstrated that many human intestinal microorganisms can act as biomarkers relevant to human health and disease^49,50,51. By exploring the microbiome composition of the 4D-SZ (n = 2403) and Beijing (n = 97) cohorts with phenotype information (Supplementary Table 7), we were able to compare the potential value of different oral microorganisms (i.e., bacteria, fungi, and viruses) as biomarkers for dental diseases (caries, calculus, and dental ulcers) and systemic diseases (obesity and rheumatoid arthritis). We established a series of random forest (RF) classifiers with fivefold cross-validation using microbial abundances in our cohorts as features, including bacteria, fungi, viruses (phages), and combinations of these features (Supplementary Table 8).

Most RF models exhibited an area under the receiver-operating characteristic curve (AUC) value < 0.65 for the testing sets, indicating that the composition of oral microorganisms/phages in most cases had limited diagnostic clinical value. However, in two cases, dental calculus and rheumatoid arthritis (RA), the AUCs for the testing sets approached 0.8 and 0.9, respectively, showing a potential clinical value (Fig. 5D). For dental calculus, the composition of the fungal community was not informative, whereas information on other features had comparable predictive values. Notably, for RA, the composition of bacteria, fungi, viruses, and the combination of these parameters had AUCs of approximately 0.9, suggesting that such information has the potential for being of clinical value.
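
The AUC values discussed above measure how often a classifier scores a randomly chosen case higher than a randomly chosen control. The sketch below computes that quantity directly from labels and scores and shows one way to form five folds. It does not reproduce the study's random forest models; the function names, the round-robin fold assignment, and the toy data are assumptions.

```python
def roc_auc(labels: list, scores: list) -> float:
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def k_fold_indices(n_samples: int, k: int = 5) -> list:
    """Assign sample indices to k test folds in round-robin order."""
    folds = [[] for _ in range(k)]
    for i in range(n_samples):
        folds[i % k].append(i)
    return folds


# Toy labels (1 = case, 0 = control) and predicted scores, for illustration.
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
toy_auc = roc_auc(toy_labels, toy_scores)
toy_folds = k_fold_indices(10, k=5)
```

An AUC of 0.5 corresponds to random ranking, which is why test-set values below 0.65 are read as limited diagnostic value while values near 0.8 to 0.9 suggest a usable signal.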

To further explore the value of oral phages as biomarkers and identify key viral biomarkers, we calculated the importance of viral features in the dental calculus and RA models trained on phage data, for which AUC values > 0.75 were achieved. According to the RA phage models, VC_769 appeared to be a key player since four of its subVCs ranked amongst the top five subVCs with the highest importance values (higher than 0.006) (Fig. 5E). VC_769 was annotated as a lysogenic phage VC infecting Lactococcus lactis. Previous studies have suggested that Lactococcus lactis is a beneficial bacterium enriched in healthy individuals rather than in RA patients¹⁹. For the dental calculus model, VC_1293 and VC_1377 played important roles as they represented subVCs with the highest importance values, higher than 0.004 (Fig. 5F). VC_1377 is a lytic phage VC that infects Porphyromonas gingivalis, a periodontal pathogen. Together with the results above, we reasoned that the abundance of certain strictly host-specific phages would follow that of their bacterial hosts, which would enable phages to serve as biomarkers.

The presence of numerous huge phages containing the CRISPR system in the human oral cavity

By screening the genome sizes in the OPD, we identified 3416 phages with genome sizes greater than 200 kbp, which were defined as huge phages⁵². To investigate the distribution of huge phages in different ecological systems, we generated a large-scale genome dataset by integrating nearly 4000 huge phages from multiple ecological niches^10,52. After evaluating the quality of these huge phage genomes using CheckV, we selected 3640 “no-provirus” genomes with completeness >50% for the subsequent analyses to avoid interference from bacterial sequences, including 3232 from the human oral cavity, 187 from the human gut, and 221 from other environmental ecosystems. To explore their evolutionary characteristics, we next constructed a phylogenetic tree using the proteomes of the high-quality huge phages (Supplementary Fig. 7A). Almost all huge phages found in the human gut and various environmental ecosystems, as well as a small portion found in the human oral cavity, exhibited relatively short branch lengths in the tree, indicating that these phages may share some evolutionary features. Notably, most oral huge phages were evolutionarily distant from phages found in the human gut and various environmental ecosystems, and these oral phages were difficult to annotate taxonomically.

The CRISPR–Cas systems identified in phages function to eliminate intruding competing phages⁵². We screened the huge phage genomes and identified 265 phages containing CRISPR–Cas systems. After extracting their spacer sequences and aligning them against the OPD, 11,638 pairs of CRISPR-associated interactions were revealed, demonstrating the existence of widespread CRISPR-dependent interactions between different phages in human symbiotic microbes (Supplementary Fig. 7B). This analysis enabled the identification of phages sharing the same host, or ‘co-host’ phages. For example, the huge phages opdg_2921 and opdg_2470 have CRISPR systems targeting 25 normal phages in OPD (Fig. 6A). The host of many of these phages was identified as Pauljensenia hongkongensis, a gram-positive, strictly anaerobic and non-spore-forming bacterium from the Actinomycetaceae family. Accordingly, it is likely that the targeted phages with unknown hosts could also infect P. hongkongensis, since only in this way could their genomes be recorded by opdg_2921 and opdg_2470 as CRISPR spacers.
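
The co-host reasoning above can be sketched as propagating known hosts through the phage-to-phage spacer network. This is a simplified illustration, not the study's procedure: the majority-vote rule, the function name `infer_co_hosts`, and the toy identifiers are all assumptions.

```python
from collections import Counter


def infer_co_hosts(targets: dict, known_hosts: dict) -> dict:
    """Propose hosts for targeted phages that lack a host assignment.

    targets maps a CRISPR-carrying phage id to the set of phage ids matched by
    its spacers; known_hosts maps a phage id to its assigned host species.
    """
    inferred = {}
    for targeted in targets.values():
        votes = Counter(known_hosts[p] for p in targeted if p in known_hosts)
        if not votes:
            continue  # no targeted phage has a known host, nothing to propagate
        # Assumed rule: take the most frequent known host among the targets.
        host, _ = votes.most_common(1)[0]
        for phage in targeted:
            if phage not in known_hosts:
                inferred[phage] = host
    return inferred


# Toy network, invented for illustration.
toy_targets = {"huge_A": {"phage_1", "phage_2", "phage_3"}}
toy_known = {
    "phage_1": "Pauljensenia hongkongensis",
    "phage_2": "Pauljensenia hongkongensis",
}
toy_inferred = infer_co_hosts(toy_targets, toy_known)
```

Any host proposed this way remains a hypothesis, since a spacer match shows only that the two phages met inside some shared cell, not which species that cell belonged to.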

**Fig. 6: Characteristics of huge phages.**

One hundred Cas proteins were predicted in these huge phages with high confidence, and most of these belong to the type V CRISPR system (Supplementary Fig. 7C). Cas12 is an example of a small Cas protein from the type V CRISPR system that can cleave dsDNA molecules and has gene editing potential⁵³. We used ESMFold to fold all proteins predicted to be Cas12 homologs and obtained 63 structures (Supplementary Fig. 7D). The phylogenetic tree demonstrated that some Cas12 proteins from huge phages formed a new clade (Fig. 6B). Among these proteins, a notable member is opdg_150_142, a protein comprising only 465 amino acid residues and with high prediction quality (Fig. 6C). Using the predicted opdg_150_142 structure as the query, we searched the PDB with Foldseek⁵⁴ and found that it exhibited topological similarity to known Cas12 proteins. For example, Cas12f1 is a Cas protein from an uncultured archaeon capable of forming a dimer to cleave dsDNA⁵⁵. The pairwise structural alignment showed that opdg_150_142 and Cas12f1 have the same fold, with a TM-score of 0.56 (Fig. 6C), indicating that opdg_150_142 may be a candidate gene editing enzyme.

Discussion

Although the oral cavity contains numerous commensal microorganisms, there is limited detailed information on the structure and composition of its bacteriophages. In this study, we performed large-scale data mining of 5427 oral metagenomes and 2178 genomes of oral isolated bacteria, generating a comprehensive oral phage database with genome-wide and proteome-wide annotations. Taking advantage of large cohorts and genomes sequenced in house, a total of 9983 non-redundant oral phage subVCs were identified, and many of these phages have not been reported previously, representing numerous diverse and previously uncharacterized viral groups in the oral cavity. As a useful resource, OPD contains more than three thousand huge phages, which constitute a valuable dataset for further exploring the characteristics of huge phages and investigating their role in microbial ecosystems. Beyond genome catalog generation, we also explored the interactions between these phages and their oral bacterial hosts, showing that the broad host range of oral phages is closely related to host phylogeny. The prevalence of oral phages with a broad host range differs substantially from previous studies reporting that most human phages are highly host-specific. Some phages can encode anti-defense proteins to resist bacteria, while others carry AMGs or VFs possibly impacting host metabolism or enhancing virulence, which may modulate host adaptation. Based on the cohorts sequenced by us, we found that the abundance of oral phages was significantly greater in industrialized areas than in non-industrialized areas and revealed a difference in phage composition between these two areas. Our study identifies candidate phage-based biomarkers for dental calculus and rheumatoid arthritis, which may complement conventional bacterial and fungal diagnostic markers and could inform phage-based interventions.

Powerful protein structure prediction tools based on large language models, such as ESMFold, make it possible to perform proteome-wide structure analyses from metagenomic data with high efficiency. By using protein structural information and fast structural alignment to uncover novel anti-defense proteins and Cas proteins, we present evidence that structural similarity can be an alternative to traditional sequence-based methods in microbiome research. Molecular dynamics can simulate the binding and interaction between target proteins and ligands, and can be applied in situations where experimental validation of protein function is difficult. We obtained approximately 90,000 high-quality protein structures in OPD, but most of these proteins have not been fully characterized and need further investigation.

Although this study improved our understanding of oral phages, it has certain limitations. Variations in data volume and sample size among different cohorts may influence the assessment of viral diversity and relative abundance across regions. Virus-like particle (VLP) sequencing data, which are the most accurate data for the virome, are missing in this study. The lack of VLP sequencing data may have reduced sensitivity in detecting low-abundance lytic phages. For host prediction, broad host ranges were determined using CRISPR-spacer matching, a widely used method whose accuracy is difficult to confirm. Additionally, functional exploration of viral genomes, particularly AMG analysis, was based on sequence homology and requires cautious interpretation because of potential host contamination. The absence of experimental validation (e.g., phage isolation, functional assays) means that our computational predictions should be treated as putative biological hypotheses rather than confirmed mechanisms. Despite these limitations, our study offers valuable insights and targets for future experiments. We therefore anticipate that OPD will constitute a valuable resource for further exploration of human oral phages and their interactions with their hosts.

Methods

Data collection and phage sequence detection

The sample collection and analysis were approved by the Institutional Review Board on Bioethics and Biosafety of BGI under the numbers BGI-IRB 19121 and BGI-IRB 22112-T1.

The 4D-SZ cohort of Chinese living in Shenzhen comprises 3953 oral metagenomic samples, among which 1278 were newly sequenced (Supplementary Table 1). The Yunnan cohort consisted of 671 salivary samples from the Yunnan province. Other public oral metagenomic datasets were downloaded from the NCBI SRA database with accession codes SRP133047, SRP029441, SRS3984307, ERP110622, and ERP006678. All the high-quality reads were individually assembled using the assembly module of the METAPI pipeline, applying SPAdes v3.13.0 with option ‘-meta’⁵⁶. For other details, see the methods of Zhu et al.¹⁸. In addition to metagenomic data, we also collected data from the Cultivated Oral Bacteria Genome Reference (COGR)²⁴ and the Expanded Human Oral Microbiome Database (eHOMD)². The culture conditions for oral bacteria and genome assembly methods were based on the methods described by Zou et al.⁵⁷, Li et al.²⁴, and Pride et al.².

To mine phage genomes from sequencing data, a unified pipeline was developed, mainly based on VirFinder and VirSorter2. VirFinder identifies viral sequences using k-mer frequencies and machine learning⁸. It uses the k-mer frequencies of a sequence as features for a machine learning classifier that requires no reference database, which significantly improves the speed and accuracy of virus sequence identification. VirSorter2, on the other hand, applies a multi-classifier, expert-guided approach to detect diverse DNA and RNA viral genomes and determines the presence of virus sequences after comparing them to its own database⁷. Using the strict criteria ‘VirFinder score >= 0.9, p-value <= 0.05 AND VirSorter2 score > 0.7’, more than 12 million sequences longer than 1 kb were retained.
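The combined detection criteria can be expressed as a single predicate. The Python sketch below is an illustration only: the function name, argument names, and example contigs are hypothetical, and the comma in the quoted criteria is read as a logical AND, which is an assumption about how the two VirFinder thresholds were combined.

```python
def passes_detection(length_bp, virfinder_score, virfinder_pvalue, virsorter2_score):
    """Return True if a contig meets all detection thresholds quoted above."""
    return (
        length_bp > 1000                # longer than 1 kb
        and virfinder_score >= 0.9      # VirFinder score
        and virfinder_pvalue <= 0.05    # VirFinder p-value
        and virsorter2_score > 0.7      # VirSorter2 score
    )


# Hypothetical contigs: (name, length, VirFinder score, p-value, VirSorter2 score)
contigs = [
    ("contig_a", 15200, 0.97, 0.010, 0.95),
    ("contig_b", 8400, 0.95, 0.020, 0.60),   # fails the VirSorter2 threshold
    ("contig_c", 900, 0.99, 0.001, 0.99),    # too short
]
retained = [name for name, *values in contigs if passes_detection(*values)]
print(retained)
```

Requiring agreement between two independent classifiers trades sensitivity for precision, which is consistent with the description of the criteria as strict.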

Sequence decontamination

To further improve the quality of the viral genomes, reduce the contamination rate, and select high-quality data for subsequent analysis, a set of quality screening and control steps was developed. The first step involved processing the 12,545,619 sequences from the previous step by removing human contamination and eukaryotic viruses, which was done mainly with BLAST v2.9.0⁵⁸. All sequences were aligned to the human genome GRCh38 with the strict parameters ‘-word_size 28 -best_hit_overhang 0.1 -best_hit_score_edge 0.1 -dust yes -evalue 0.0001 -min_raw_gapped_score 100 -penalty -5 -perc_identity 90 -soft_masking true’, and the matched sequences were excluded. We further removed sequences shorter than 10 kb, after which 566,789 sequences remained. To remove redundancy and select representative sequences, the sequences from metagenome data were clustered using CD-HIT with parameters ‘-c 0.95 -G 0 -aS 0.75’, which yielded 240,215 cluster representative sequences⁵⁹. Considering that some sequences, such as plasmids and long mobile elements, could be misclassified as viruses, the artificial neural network used to construct the Gut Phage Database was also introduced for quality control¹⁰. Using this tool, a further 50,356 contaminating sequences were removed. The remaining 189,859 sequences constitute the genomes of the Oral Phage Database.
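The sequence counts reported in this section form a simple filtering funnel, and the arithmetic can be checked with a few lines of Python. The counts are those stated in the text; the step labels and the function name are shorthand introduced here for illustration.

```python
STEPS = [
    ("detected by VirFinder and VirSorter2", 12_545_619),
    ("after decontamination and >= 10 kb filter", 566_789),
    ("CD-HIT cluster representatives", 240_215),
    ("after neural-network quality control", 189_859),
]


def funnel(steps):
    """Return (label, count, removed, fraction kept) for each step."""
    rows = []
    previous = None
    for label, count in steps:
        removed = 0 if previous is None else previous - count
        kept = 1.0 if previous is None else count / previous
        rows.append((label, count, removed, kept))
        previous = count
    return rows


for label, count, removed, kept in funnel(STEPS):
    print(f"{label:<45} {count:>12,}  removed {removed:>12,}  kept {kept:6.1%}")
```

The last row reproduces the 50,356 sequences removed by the neural network (240,215 − 189,859).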

Genome quality assessment

CheckV v0.7.0 is a fully automated pipeline for assessing the quality of single viral contigs²⁵, including identification of host contamination in integrated proviruses, estimation of genome completeness, and identification of closed genomes. Results are rated based on viral genome completeness and are divided into five classes: Complete (100% completeness), High-quality (greater than or equal to 90% and less than 100% completeness), Medium-quality (greater than or equal to 50% and less than 90% completeness), Low-quality (less than 50% completeness), and Not-determined (no viral gene found in the sequence). We applied CheckV v0.7.0 to evaluate the obtained genomes and used the complete, high-quality, and medium-quality sequences as phage draft genomes for subsequent analysis, obtaining a total of 58,141 sequences.
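As an illustration only, the tier boundaries described above can be written as a small Python function. The function names and the use of `None` for sequences without a completeness estimate are assumptions made for this sketch; they do not reproduce CheckV's own implementation, which also uses evidence such as terminal repeats when calling complete genomes.

```python
def quality_tier(completeness):
    """Map a completeness estimate (percent, or None) to the tiers listed above."""
    if completeness is None:
        return "Not-determined"
    if completeness >= 100:
        return "Complete"
    if completeness >= 90:
        return "High-quality"
    if completeness >= 50:
        return "Medium-quality"
    return "Low-quality"


def is_draft_genome(completeness):
    """Draft genomes are the Complete, High-quality, and Medium-quality tiers."""
    return quality_tier(completeness) in {"Complete", "High-quality", "Medium-quality"}


for value in (100.0, 93.5, 50.0, 49.9, None):
    print(value, quality_tier(value), is_draft_genome(value))
```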

Generation of viral clusters and phylogenetic tree

To cluster viral genomes into viral clusters (VCs), the network-based classification tool vConTACT2 was applied²⁶. Genomes with completeness higher than 50% were defined as viral draft genomes. For these sequences, we used protein sequences predicted by Prodigal, together with a gene-to-genome mapping file, as the input to generate protein clusters (PCs) and VCs, with parameters ‘--raw-proteins --rel-mode Diamond --proteins-fp --pcs-mode MCL --vcs-mode ClusterONE --c1-bin cluster_one-1.0.jar --db ProkaryoticViralRefSeq88-Merged’. For each subVC, the genome with the highest completeness and longest length was selected as the representative sequence. The reference sequences from the ProkaryoticViralRefSeq88-Merged database were removed, resulting in 9983 subVCs. We also clustered the viral draft genomes of OPD together with those of other databases, including GVD, GPD, CHVD, and OVD^9,10,12,33. A proteomics-based phylogenetic tree of reference sequences of the 9983 subVCs was constructed via ViPTree with default parameters²⁹.
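The choice of a representative genome per subVC can be sketched as follows. This is a minimal illustration, assuming that completeness is the primary criterion and length is used to break ties; the text does not state how the two criteria were combined, and the record fields and example genomes are hypothetical.

```python
def pick_representatives(genomes):
    """Return {subVC: genome id} using (completeness, length) as the ranking key.

    genomes -- iterable of dicts with keys 'id', 'subvc', 'completeness', 'length'
    """
    best = {}
    for genome in genomes:
        key = (genome["completeness"], genome["length"])
        current = best.get(genome["subvc"])
        if current is None or key > (current["completeness"], current["length"]):
            best[genome["subvc"]] = genome
    return {subvc: genome["id"] for subvc, genome in best.items()}


# Hypothetical draft genomes
genomes = [
    {"id": "g1", "subvc": "VC_1_0", "completeness": 92.0, "length": 41000},
    {"id": "g2", "subvc": "VC_1_0", "completeness": 100.0, "length": 39500},
    {"id": "g3", "subvc": "VC_2_0", "completeness": 75.0, "length": 52000},
    {"id": "g4", "subvc": "VC_2_0", "completeness": 75.0, "length": 56000},
]
print(pick_representatives(genomes))
```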

Taxonomic assignment

geNomad, a tool for virus identification and annotation with improved classification performance, was applied for taxonomic assignment^10,27. Viral draft genomes were annotated with the parameters ‘--lenient-taxonomy --full-ictv-lineage’. Specifically, for each genome, the genes encoded by the sequence were aligned to the geNomad marker gene set, which is associated with viral taxa defined in ICTV MSL 39²⁸. Each gene was then classified based on the taxonomic lineage of the assigned marker. If more than 2 genes hit the marker set and more than half of all hit genes belonged to a specific taxon, the taxonomy of the sequence was determined as the most specific such taxon. For each subVC, the annotation of the representative genome was defined as the subVC taxonomic assignment.
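The majority rule described above can be illustrated with a short Python sketch. It is not geNomad's implementation: it only encodes the rule as stated in the text (more than 2 genes with a marker hit, and the most specific taxon supported by more than half of them). Lineages are represented as tuples ordered from the broadest to the most specific rank, and the function name and example genes are assumptions.

```python
from collections import Counter


def assign_taxonomy(gene_lineages, min_genes=3):
    """Return the deepest lineage prefix shared by more than half of the hit genes.

    gene_lineages -- one lineage tuple per gene that hit the marker set
    Returns None if fewer than min_genes genes hit or no taxon has a majority.
    """
    if len(gene_lineages) < min_genes:
        return None
    counts = Counter()
    for lineage in gene_lineages:
        for depth in range(1, len(lineage) + 1):
            counts[tuple(lineage[:depth])] += 1
    half = len(gene_lineages) / 2
    majority = [prefix for prefix, n in counts.items() if n > half]
    return max(majority, key=len) if majority else None


caudo = ("Duplodnaviria", "Heunggongvirae", "Uroviricota", "Caudoviricetes")
crass = caudo + ("Crassvirales",)
other = ("Monodnaviria",)

# Two of four genes support Crassvirales (not a majority); three support Caudoviricetes.
print(assign_taxonomy([crass, crass, caudo, other]))
```

Because majority-supported prefixes are nested, the deepest one is unique, so the genome in the example is assigned at the class level rather than at the order level.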

Host prediction and analyses of oral bacteria

We used Crass to mine CRISPR spacer sequences from the 2178 oral bacterial genomes, and CRISPRCasTyper to predict the Cas proteins^60,61. These CRISPR spacers were matched to phage draft genomes by BLASTn with the parameters ‘-perc_identity 100 -qcov_hsp_perc 100’. Based on the matching results, phage-host relationships were classified at the subVC-species level and the VC-genus level, and the infection network was displayed with Cytoscape (v3.8.2)⁶². iPHoP was employed for additional validation using default parameters, retaining only matches with Confidence Scores >90 and the top 5 predictions from each method³⁹. Note that iPHoP results were not incorporated into downstream analyses.
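Once exact spacer matches are available, aggregating them into host ranges at the VC-genus level is a simple grouping step. The Python sketch below illustrates this under stated assumptions: the input is a list of (host genome, phage genome) pairs that already passed the BLASTn filters, and the identifiers, lookup tables, and phage-host links in the example are invented for illustration.

```python
from collections import defaultdict


def host_ranges(hits, host_genus, phage_vc):
    """Group exact spacer matches into {VC: set of host genera}.

    hits       -- iterable of (host_genome_id, phage_genome_id) pairs
    host_genus -- dict mapping host genome id to genus
    phage_vc   -- dict mapping phage genome id to viral cluster
    """
    ranges = defaultdict(set)
    for host_id, phage_id in hits:
        ranges[phage_vc[phage_id]].add(host_genus[host_id])
    return dict(ranges)


# Invented example data
host_genus = {"iso_1": "Streptococcus", "iso_2": "Streptococcus", "iso_3": "Veillonella"}
phage_vc = {"ph_1": "VC_10", "ph_2": "VC_10", "ph_3": "VC_11"}
hits = [("iso_1", "ph_1"), ("iso_3", "ph_2"), ("iso_2", "ph_3"), ("iso_1", "ph_3")]

for vc, genera in sorted(host_ranges(hits, host_genus, phage_vc).items()):
    print(vc, len(genera), sorted(genera))
```

The same grouping applies at the subVC-species level by swapping in the corresponding lookup tables.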

GTDB-Tk (v1.5.0) with database release 202 was used to perform taxonomic annotation of the 2178 genomes of oral bacterial isolates⁶³. We then selected the genomes with the highest completeness, as calculated by CheckM, and the longest length as the species-level representative genomes and used them to construct a maximum-likelihood phylogenetic tree based on 120 conserved single-copy genes⁶⁴. The phages were then mapped to their hosts on the phylogenetic tree, which was visualized using iTOL v6⁶⁵.

Protein-coding gene calling and functional analysis based on sequence information

To identify protein-coding genes for generating a gene and proteome catalog, Prodigal v2.6.3 was applied for all genomes from OPD, oral bacteria, and other downloaded phage genomes⁶⁶. Overall, 9,181,412 genes from OPD were predicted and translated into protein sequences. Then these genes were annotated by the eggNOG-mapper with default parameters and the eggNOG database, including GO, KEGG, COG, and PFAM30³⁰. We then used MMseqs2 to cluster the proteins from OPD and other virome databases for comparison³⁸. All proteins were clustered with the parameters ‘-s 7.5 -c 0.9 --min-seq-id 0.7 --cov-mode 0 -e 0.001 --cluster-mode 0’, resulting in 2,762,333 clusters. The cluster result was visualized by the ‘UpSetR’ R package.

For the AMGs, we matched the annotation to the KOs-list ‘VIBRANT_AMGs.tsv’ supplied by VIBRANT⁴⁰. BLAST v2.9.0 was used to align the gene sequences to VFDB for annotation of virulence factors⁴².

Proteome-wide structure prediction for functional protein identification

The sequences from OPD-specific protein clusters were re-clustered with the parameters ‘-s 7.5 -c 0.7 --min-seq-id 0.3 --cov-mode 0 -e 0.001 --cluster-mode 0’. The structures of the reference sequences of these protein clusters with lengths between 30 and 1800 amino acid residues were predicted via ESMFold on 4 Nvidia A100 GPUs over one week³⁵.

We used Foldseek to search the SPO1 phage Tad2 (PDB:8smf) against high-quality structures in OPD and then performed pairwise comparison with MM-align^37,67. The homotetrameric structures of 6 candidates were predicted by ColabFold with the AlphaFold_multimer v2.3.1 model and the SPO1 phage Tad2 as template input. The molecular dynamics simulation for opdg_86857_54 and the putative ligand 1″–3′ gcADPR was performed with an OpenMM-based protocol (https://github.com/tdudgeon/simple-simulate-complex)⁶⁸. The phylogenetic tree of Tad2 proteins was built with FastTree and visualized with ggtree⁶⁹.

Metagenomic reads mapping

To estimate the relative abundances of viruses, bacteria, and fungi in each sample, we first generated a customized database with OPD, the oral SGBs from Zhu et al. ¹⁸, and the fungal genomes downloaded from NCBI. Next, we mapped metagenomic reads to the database using Kraken⁷⁰. Relative abundances of viruses, bacteria, and fungi were determined by Bracken with default parameters⁷¹.

The variation in oral viruses between different regions was assessed by permutational analysis of variance (PERMANOVA) on Bray-Curtis dissimilarities calculated from the relative abundance profiles of viruses, using the R vegan package⁷².
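To make the test concrete, the sketch below implements Bray-Curtis dissimilarity and a one-factor PERMANOVA in plain Python. It follows the standard pseudo-F formulation (among-group versus within-group sums of squared distances, with significance from label permutation) and is not the vegan code used in the analysis; the function names, the permutation count, and the toy abundance profiles are assumptions.

```python
import random
from itertools import combinations


def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two abundance profiles."""
    total = sum(a + b for a, b in zip(x, y))
    return sum(abs(a - b) for a, b in zip(x, y)) / total if total else 0.0


def pseudo_f(dist, labels):
    """Pseudo-F statistic from a distance matrix and group labels."""
    n = len(labels)
    groups = {}
    for index, label in enumerate(labels):
        groups.setdefault(label, []).append(index)
    ss_total = sum(dist[i][j] ** 2 for i, j in combinations(range(n), 2)) / n
    ss_within = sum(
        sum(dist[i][j] ** 2 for i, j in combinations(members, 2)) / len(members)
        for members in groups.values()
    )
    a = len(groups)
    return ((ss_total - ss_within) / (a - 1)) / (ss_within / (n - a))


def permanova(profiles, labels, permutations=999, seed=0):
    """Return (observed pseudo-F, permutation p-value)."""
    n = len(profiles)
    dist = [[bray_curtis(profiles[i], profiles[j]) for j in range(n)] for i in range(n)]
    observed = pseudo_f(dist, labels)
    rng = random.Random(seed)
    shuffled = list(labels)
    at_least_as_large = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if pseudo_f(dist, shuffled) >= observed:
            at_least_as_large += 1
    return observed, (at_least_as_large + 1) / (permutations + 1)


# Toy relative-abundance profiles for two regions (three samples each)
profiles = [
    [0.70, 0.20, 0.10], [0.60, 0.30, 0.10], [0.65, 0.25, 0.10],
    [0.10, 0.20, 0.70], [0.20, 0.20, 0.60], [0.15, 0.15, 0.70],
]
labels = ["A", "A", "A", "B", "B", "B"]
f_stat, p_value = permanova(profiles, labels)
print(f"pseudo-F = {f_stat:.2f}, p = {p_value:.3f}")
```

With only six samples there are few distinct label arrangements, so the smallest attainable p-value is limited; real cohorts provide far more resolution.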

Construction of prediction models

Random forest models for each disease were built with the scikit-learn and imblearn packages in Python 3.7 using the abundance profiles of the samples. Models were built with 1000 trees and evaluated by fivefold cross-validation. ROC curves and feature importance were visualized with the Python seaborn package.
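The fold structure of a fivefold cross-validation can be illustrated without any machine learning library. The sketch below covers only the data splitting, not the random forest itself, and it stratifies the folds by class label, which is an assumption: the text states only that fivefold cross-validation was used. Function names and the example labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign sample indices to k folds while preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def train_test_splits(labels, k=5, seed=0):
    """Yield (train indices, test indices) once per held-out fold."""
    folds = stratified_folds(labels, k, seed)
    for held_out in range(k):
        test = sorted(folds[held_out])
        train = sorted(i for f, fold in enumerate(folds) if f != held_out for i in fold)
        yield train, test


# Hypothetical cohort: 10 cases (1) and 40 controls (0)
labels = [1] * 10 + [0] * 40
for train, test in train_test_splits(labels):
    cases = sum(labels[i] for i in test)
    print(f"train={len(train)} test={len(test)} cases in test={cases}")
```

Each sample is held out exactly once, so a model trained on the remaining four folds can be scored on every sample without seeing it during training.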

Prediction of CRISPR systems in huge phages

CRISPR systems in huge phages were predicted and visualized by CRISPRCasTyper⁶⁰ with default parameters. The spacers predicted from the huge phage genomes were then aligned to other phages in OPD by BLAST with the parameters ‘--evalue 1e-6’, and hits with more than 95% sequence identity were retained. The interaction network was plotted with Cytoscape. The structures of the putative Cas12 proteins were predicted by ESMFold³⁵.

Data availability

The data that support the findings of this study have been deposited into CNGB Sequence Archive (CNSA)⁷³ of China National GeneBank DataBase (CNGBdb)⁷⁴ with accession number CNP0003685.

References

1. Liang, G. & Bushman, F. D. The human virome: assembly, composition and host interactions. Nat. Rev. Microbiol. 19, 514–527 (2021).
2. Escapa, I. F. et al. New insights into human nostril microbiome from the expanded human oral microbiome database (eHOMD): a resource for the microbiome of the human aerodigestive tract. Msystems 3, 00187–00118 (2018).
3. Pride, D. T. et al. Evidence of a robust resident bacteriophage population revealed through analysis of the human salivary virome. ISME J. 6, 915–926 (2012).
4. Wang, J., Gao, Y. & Zhao, F. Phage–bacteria interaction network in human oral microbiome. Environ. Microbiol. 18, 2143–2158 (2016).
5. Ho, S. X., Min, N., Wong, E. P. Y., Chong, C. Y. & Chu, J. J. H. Characterization of oral virome and microbiome revealed distinctive microbiome disruptions in paediatric patients with hand, foot and mouth disease. npj Biofilms Microbiomes 7, 19 (2021).
6. Khalifa, L. et al. Phage therapy against Enterococcus faecalis in dental root canals. J. Oral. Microbiol. 8, 32157 (2016).
7. Guo, J. et al. VirSorter2: a multi-classifier, expert-guided approach to detect diverse DNA and RNA viruses. Microbiome 9, 1–13 (2021).
8. Ren, J., Ahlgren, N. A., Lu, Y. Y., Fuhrman, J. A. & Sun, F. VirFinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data. Microbiome 5, 1–20 (2017).
9. Gregory, A. C. et al. The gut virome database reveals age-dependent patterns of virome diversity in the human gut. Cell Host Microbe 28, 724–740.e728 (2020).
10. Camarillo-Guerrero, L. F., Almeida, A., Rangel-Pineros, G., Finn, R. D. & Lawley, T. D. Massive expansion of human gut bacteriophage diversity. Cell 184, 1098–1109.e1099 (2021).
11. Nayfach, S. et al. Metagenomic compendium of 189,680 DNA viruses from the human gut microbiome. Nat. Microbiol. 6, 960–970 (2021).
12. Li, S. et al. A catalog of 48,425 nonredundant viruses from oral metagenomes expands the horizon of the human oral virome. Iscience 25 (2022).
13. Wahida, A., Tang, F. & Barr, J. J. Rethinking phage-bacteria-eukaryotic relationships and their influence on human health. Cell Host Microbe 29, 681–688 (2021).
14. Shen, S., Huo, D., Ma, C., Jiang, S. & Zhang, J. Expanding the colorectal cancer biomarkers based on the human gut phageome. Microbiol. Spectr. 9, e00090–00021 (2021).
15. Arnold, B. J., Huang, I.-T. & Hanage, W. P. Horizontal gene transfer and adaptive evolution in bacteria. Nat. Rev. Microbiol. 20, 206–218 (2022).
16. Li, R., Wang, Y., Hu, H., Tan, Y. & Ma, Y. Metagenomic analysis reveals unexplored diversity of archaeal virome in the human gut. Nat. Commun. 13, 7978 (2022).
17. Wang, J. et al. Maternal and neonatal viromes indicate the risk of offspring’s gastrointestinal tract exposure to pathogenic viruses of vaginal origin during delivery. Mlife 1, 303–310 (2022).
18. Zhu, J. et al. Over 50,000 metagenomically assembled draft genomes for the human oral microbiome reveal new taxa. Genomics Proteom. Bioinforma. 20, 246–259 (2022).
19. Zhang, X. et al. The oral and gut microbiomes are perturbed in rheumatoid arthritis and partly normalized after treatment. Nat. Med. 21, 895–905 (2015).
20. Brito, I. L. et al. Transmission of human-associated microbiota along family and social networks. Nat. Microbiol. 4, 964–971 (2019).
21. Zeller, G. et al. Potential of fecal microbiota for early-stage detection of colorectal cancer. Mol. Syst. Biol. 10, 766 (2014).
22. Heintz-Buschart, A. et al. Integrated multi-omics of the human gut microbiome in a case study of familial type 1 diabetes. Nat. Microbiol. 2, 1–13 (2016).
23. Goltsman, D. S. A. et al. Metagenomic analysis with strain-level resolution reveals fine-scale variation in the human pregnancy microbiome. Genome Res. 28, 1467–1480 (2018).
24. Li, W. et al. A catalog of bacterial reference genomes from cultivated human oral bacteria. npj Biofilms Microbiomes 9, 45 (2023).
25. Nayfach, S. et al. CheckV assesses the quality and completeness of metagenome-assembled viral genomes. Nat. Biotechnol. 39, 578–585 (2021).
26. Bin Jang, H. et al. Taxonomic assignment of uncultivated prokaryotic virus genomes is enabled by gene-sharing networks. Nat. Biotechnol. 37, 632–639 (2019).
27. Camargo, A. P. et al. Identification of mobile genetic elements with geNomad. Nat. Biotechnol. 42, 1303–1312 (2024).
28. Simmonds, P. et al. Changes to virus taxonomy and the ICTV Statutes ratified by the International Committee on Taxonomy of Viruses (2024). Arch. Virol. 169, 236 (2024).
29. Nishimura, Y. et al. ViPTree: the viral proteomic tree server. Bioinformatics 33, 2379–2380, https://doi.org/10.1093/bioinformatics/btx157 (2017).
30. Cantalapiedra, C. P., Hernández-Plaza, A., Letunic, I., Bork, P. & Huerta-Cepas, J. eggNOG-mapper v2: functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Mol. Biol. Evol. 38, 5825–5829 (2021).
31. Huerta-Cepas, J. et al. eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res. 47, D309–D314 (2019).
32. Yirmiya, E. et al. Phages overcome bacterial immunity via diverse anti-defence proteins. Nature 625, 352–359, https://doi.org/10.1038/s41586-023-06869-w (2024).
33. Tisza, M. J., Belford, A. K., Dominguez-Huerta, G., Bolduc, B. & Buck, C. B. Cenote-Taker 2 democratizes virus discovery and sequence annotation. Virus Evol. 7, veaa100 (2021).
34. Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).
35. Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130, https://doi.org/10.1126/science.ade2574 (2023).
36. Hampton, H. G., Watson, B. N. J. & Fineran, P. C. The arms race between bacteria and their phage foes. Nature 577, 327–336, https://doi.org/10.1038/s41586-019-1894-8 (2020).
37. van Kempen, M. et al. Fast and accurate protein structure search with Foldseek. Nat. Biotechnol. 42, 243–246, https://doi.org/10.1038/s41587-023-01773-0 (2024).
38. Mirdita, M. et al. ColabFold: making protein folding accessible to all. Nat. Methods 19, 679–682, https://doi.org/10.1038/s41592-022-01488-1 (2022).
39. Roux, S. et al. iPHoP: An integrated machine learning framework to maximize host prediction for metagenome-derived viruses of archaea and bacteria. PLoS Biol. 21, e3002083 (2023).
40. Kieft, K., Zhou, Z. & Anantharaman, K. VIBRANT: automated recovery, annotation and curation of microbial viruses, and evaluation of viral community function from genomic sequences. Microbiome 8, 1–23 (2020).
41. Thompson, L. R. et al. Phage auxiliary metabolic genes and the redirection of cyanobacterial host carbon metabolism. Proc. Natl. Acad. Sci. 108, E757–E764 (2011).
42. Liu, B., Zheng, D., Zhou, S., Chen, L. & Yang, J. VFDB 2022: a general classification scheme for bacterial virulence factors. Nucleic Acids Res. 50, D912–D917 (2022).
43. Karaolis, D. K., Somara, S., Maneval, D. R. Jr, Johnson, J. A. & Kaper, J. B. A bacteriophage encoding a pathogenicity island, a type-IV pilus and a phage receptor in cholera bacteria. Nature 399, 375–379 (1999).
44. Nishijima, S. et al. Extensive gut virome variation and its associations with host and environmental factors in a population-level cohort. Nat. Commun. 13, 5252 (2022).
45. Olm, M. R. et al. Robust variation in infant gut microbiome assembly across a spectrum of lifestyles. Science 376, 1220–1223 (2022).
46. Shkoporov, A. N. et al. The human gut virome is highly diverse, stable, and individual specific. Cell Host Microbe 26, 527–541.e525 (2019).
47. Zuo, T. et al. Human-gut-DNA virome variations across geography, ethnicity, and urbanization. Cell Host Microbe 28, 741–751.e744 (2020).
48. Schoch, C. L. et al. NCBI Taxonomy: a comprehensive update on curation, resources and tools. Database 2020, baaa062 (2020).
49. Jie, Z. et al. The gut microbiome in atherosclerotic cardiovascular disease. Nat. Commun. 8, 845 (2017).
50. Jie, Z. et al. A consortium of three-bacteria isolated from human feces inhibits formation of atherosclerotic deposits and lowers lipid levels in a mouse model. Iscience 26 (2023).
51. Zhou, C. et al. Metagenomic profiling of the pro-inflammatory gut microbiota in ankylosing spondylitis. J. Autoimmun. 107, 102360 (2020).
52. Al-Shayeb, B. et al. Clades of huge phages from across Earth’s ecosystems. Nature 578, 425–431 (2020).
53. Yan, W. X. et al. Functionally diverse type V CRISPR-Cas systems. Science 363, 88–91, https://doi.org/10.1126/science.aav7271 (2019).
54. Burley, S. K. et al. RCSB Protein Data Bank (RCSB.org): delivery of experimentally-determined PDB structures alongside one million computed structure models of proteins from artificial intelligence/machine learning. Nucleic Acids Res. 51, D488–D508 (2023).
55. Takeda, S. N. et al. Structure of the miniature type V-F CRISPR-Cas effector enzyme. Mol. Cell 81, 558–570.e553, https://doi.org/10.1016/j.molcel.2020.11.035 (2021).
56. Bankevich, A. et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19, 455–477 (2012).
57. Zou, Y. et al. 1520 reference genomes from cultivated human gut bacteria enable functional microbiome analyses. Nat. Biotechnol. 37, 179–185 (2019).
58. Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997).
59. Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150–3152 (2012).
60. Russel, J., Pinilla-Redondo, R., Mayo-Muñoz, D., Shah, S. A. & Sørensen, S. J. CRISPRCasTyper: automated identification, annotation, and classification of CRISPR-Cas loci. CRISPR J. 3, 462–469 (2020).
61. Skennerton, C. T., Imelfort, M. & Tyson, G. W. Crass: identification and reconstruction of CRISPR from unassembled metagenomic data. Nucleic Acids Res. 41, e105–e105 (2013).
62. Shannon, P. et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13, 2498–2504 (2003).
63. Chaumeil, P.-A., Mussig, A. J., Hugenholtz, P. & Parks, D. H. GTDB-Tk: a toolkit to classify genomes with the Genome Taxonomy Database. Bioinformatics 36, 1925–1927 (2020).
64. Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P. & Tyson, G. W. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 25, 1043–1055 (2015).
65. Letunic, I. & Bork, P. Interactive Tree Of Life (iTOL) v5: an online tool for phylogenetic tree display and annotation. Nucleic Acids Res. 49, W293–W296 (2021).
66. Hyatt, D. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinforma. 11, 1–11 (2010).
67. Mukherjee, S. & Zhang, Y. MM-align: a quick algorithm for aligning multiple-chain protein complex structures using iterative dynamic programming. Nucleic Acids Res. 37, e83, https://doi.org/10.1093/nar/gkp318 (2009).
68. Eastman, P. et al. OpenMM 7: Rapid development of high performance algorithms for molecular dynamics. PLoS Comput. Biol. 13, e1005659 (2017).
69. Yu, G., Smith, D. K., Zhu, H., Guan, Y. & Lam, T. T. Y. ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods Ecol. Evol. 8, 28–36 (2017).
70. Wood, D. E., Lu, J. & Langmead, B. Improved metagenomic analysis with Kraken 2. Genome Biol. 20, 1–13 (2019).
71. Lu, J., Breitwieser, F. P., Thielen, P. & Salzberg, S. L. Bracken: estimating species abundance in metagenomics data. PeerJ Comput. Sci. 3, e104 (2017).
72. Oksanen, J. et al. Package ‘vegan’: Community ecology package. Version 2, 1–295 (2013).
73. Guo, X. et al. CNSA: a data repository for archiving omics data. Database 2020, baaa055 (2020).
74. Chen, F. Z. et al. CNGBdb: China National GeneBank DataBase. Hereditas 42, 799–809 (2020).

Acknowledgements

This work was supported by grants from the National Key Research and Development Program of China (No. 2018YFC1313801), the Natural Science Foundation of Guangdong Province, China (No. 2019B020230001), and the Shenzhen Municipal Government of China (No. XMHT20220104017). We thank our colleagues at BGI-Shenzhen for sample collection and discussions, and China National GeneBank (CNGB) Shenzhen for DNA extraction, library construction, and sequencing. We additionally thank BGI Precision Nutrition (Shenzhen) Technology Co., Ltd for their technical guidance and support.

Author information

These authors contributed equally: Zhuye Jie, Hewei Liang, Yanzheng Meng, Jiahao Zhang, Tao Zhang.

Authors and Affiliations

BGI Research, Wuhan, 430074, China
Zhuye Jie, Hewei Liang, Tao Zhang, Tongyuan Hu, Weiting Liang & Yanmei Ju
Shenzhen Key Laboratory of Human Commensal Microorganisms and Health Research, BGI Research, Shenzhen, 518083, China
Zhuye Jie, Tao Zhang & Xin Tong
Laboratory of Integrative Biomedicine, Department of Biology, University of Copenhagen, Universitetsparken 13, 2100, Copenhagen, Denmark
Zhuye Jie & Karsten Kristiansen
BGI Research, Shenzhen, 518083, China
Hewei Liang, Yanzheng Meng, Jiahao Zhang, Wenxi Li, Xiaoqian Lin, Xin Tong, Jian Wang, Huanming Yang & Karsten Kristiansen
Peking University–Tsinghua University–National Institute of Biological Sciences Joint Graduate Program, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, 100871, China
Yanzheng Meng
School of Bioscience and Biotechnology, South China University of Technology, Guangzhou, 510006, China
Wenxi Li
BGI Research, Sanya, 572025, China
Mo Han
State Key Laboratory of Genome and Multi-omics Technologies, BGI Research, Shenzhen, 518083, China
Xin Jin, Xun Xu, Wenwei Zhang, Liang Xiao & Yuanqiang Zou
James D. Watson Institute of Genome Sciences, Hangzhou, 310058, China
Jian Wang & Huanming Yang
PREDICT, Center for Molecular Prediction of Inflammatory Bowel Disease, Faculty of Medicine, Aalborg University, 2450, Copenhagen, Denmark
Karsten Kristiansen
BGI Precision Nutrition (Shenzhen) Technology Co., Ltd, Shenzhen, 518083, China
Liang Xiao & Yuanqiang Zou

Contributions

Conceived and designed the study: Y.Z., Z.J., H.L., Y.M., J.Z. Performed the analysis: Y.Z., Z.J., H.L., Y.M., J.Z., W.Li., X.L., T.H., W.Liang., Y.J., X.T. Contributed reagents/materials/analysis tools: Y.Z., Z.J., L.X., M.H., T.Z., X.J., X.X., J.W. Wrote the paper: Y.Z., Z.J., H.L., Y.M., J.Z. Supervised the work: H.Y., W.Z., T.Z., L.X., K.K. Revised the paper: K.K. All authors commented on the manuscript.

Corresponding authors

Correspondence to Karsten Kristiansen, Liang Xiao or Yuanqiang Zou.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information (PDF)

Supplementary Table (XLSX)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Jie, Z., Liang, H., Meng, Y. et al. Integrating metagenomics and cultivation unveils oral phage diversity and potential impact on hosts. npj Biofilms Microbiomes 11, 145 (2025). https://doi.org/10.1038/s41522-025-00773-z

Received: 31 October 2024
Accepted: 02 July 2025
Published: 26 July 2025
Version of record: 26 July 2025
DOI: https://doi.org/10.1038/s41522-025-00773-z