Abstract
Bacteriophages play important roles in the regulation of bacterial communities and may thus impact human health. However, the scarcity of genomic data on oral microorganisms has limited information on oral phages. We collected data on 5427 metagenomic samples and 2178 cultivated bacterial genomes across different geographical areas and populations, generating the Oral Phage Database (OPD), comprising 189,859 representative phage genome sequences, including 3416 huge phages (genome size > 200 kbp). OPD reveals that most oral viruses are unknown and encode an enormous variety of dark proteins. Numerous oral phages infecting a broad range of hosts carry anti-defense genes, auxiliary metabolic genes, and virulence factors that may affect bacterial metabolism and influence human health. The composition of oral phages varies among different populations, and several phages have the potential to act as biomarkers for disease. OPD expands our knowledge of phage-bacteria interactions, huge phage diversity, and potential impacts on human health.
Introduction
The human body hosts an abundance of microorganisms that mainly reside in the gut, in the oral cavity, and on the skin1. As the organ with the second highest abundance of microorganisms2, the human oral cavity harbors ~10^8 virus-like particles per milliliter of saliva3, most of which are bacteriophages. Phages can modulate human oral microbial communities by infecting host bacteria and transferring genes, and may thereby play a role in shaping oral microbial ecosystems4. They could therefore impact human health and hold potential for therapy in addressing dental diseases caused by bacterial infections5,6.
In recent years, bioinformatics tools or pipelines aimed at recovering viruses from assembled sequences have been developed7,8, resulting in the generation of a series of virus genome catalogs based on these tools, such as the human Gut Virome Database (GVD)9, the Gut Phage Database (GPD)10, the Metagenomic Gut Virus catalog (MGV)11, and the Oral Virus Database (OVD)12. These catalogs constitute important resources for studies exploring the relationships between phages, bacteria, and humans13, expanding disease-associated biomarkers14, and revealing horizontal gene transfer (HGT) via phages15. Nevertheless, these studies have several limitations. First, data based mainly on a series of independent microbiome studies may include batch effects and contamination. Second, the interaction networks of phage-to-host and phage-to-phage were not characterized in detail. Third, ignoring phenotype information (e.g., lifestyle, disease, ethnicity) limits investigations on associations between phage composition and human phenotypes. Moreover, compared to the large number and high diversity of viral genomes in the gut16,17, the phageome of the oral cavity has not been fully explored, calling for the construction of an oral phage genome catalog combined with genome annotation, host assignment, relative abundance analysis, and database comparisons. Despite recent attempts to perform metagenomic exploration of phages in the oral cavity12, a number of human oral microbiota sequencing datasets were not fully used18, and the phage distribution in the oral microbiota remains largely underexplored.
To expand our knowledge of the oral phageome, we performed a comprehensive large-scale viral genome identification based on 5427 oral metagenomes and 2178 genomes of oral isolated bacteria to establish an extensive phylogenomic and gene functional database, the Oral Phage Database (OPD). More than 90% (4918/5427) of the metagenomics samples with phenotypic information were from China and were sequenced on our in-house platform. We uncovered a diverse array of 189,859 nonredundant phage genomes with less than 95% nucleotide similarity, with a median size of 27.61 kbp, including 3416 huge phages with genome sizes larger than 200 kbp. By combining both genome-wide and proteome-wide analyses, we revealed a complex interaction network between oral phages and their hosts and further revealed specific phage compositions in different populations. Thus, OPD represents a comprehensive source of oral phages and provides insight into their potential impact on the host.
Results
Construction of the Oral Phage Database (OPD)
To explore the diversity of the phageome in the human oral cavity, 5427 oral metagenomic biosamples from multiple cohorts and 2178 genomes from cultivated oral bacteria were collected as the data source. The data comprised a total of 17.26 Tbp sequencing data generated in-house and from public datasets, including the following: (1) 3953 samples from Shenzhen, China, an industrialized area, including 2675 samples from the 4D-SZ (Disease, Drug, Diet, Daily life) cohort18, and 1278 newly sequenced samples; (2) 671 samples from the Chinese Yunnan cohort representing individuals living in a non-industrialized environment18; (3) 294 samples including rheumatoid arthritis (RA) patients from the Chinese Beijing cohorts19; and (4) a total of 509 public oral metagenomes downloaded from the NCBI SRA databases, including 102 from Fiji20, 149 from France21, 72 from Germany21, 85 from Luxembourg22, and 101 from the United States23 (Supplementary Fig. 1A, Supplementary Table 1, Supplementary Table 2). In addition, we retrieved the genomes of 2178 bacterial isolates cultured from the human oral cavity, of which 1089 were derived from healthy Chinese individuals24, and the remaining 1089 genomes were collected from the expanded Human Oral Microbiome Database (eHOMD)2. The retrieval of cultured bacterial genomes provided curated viral sequences along with specific host genome-resolved information. Over 670 million raw contigs from the data source were scanned by VirFinder and VirSorter2 to identify viral-like sequences7,8. Initially, a total of 12,545,619 virus-like contigs were identified. 
Because these virus-like contigs included contaminating mobile genetic elements (MGEs), such as plasmids or integrative and conjugative elements, as well as human sequences, and because virus-like contigs from different samples may exhibit sequence identity exceeding 95%, a quality control pipeline was designed to filter out such sequences, as well as those shorter than 10 kbp in metagenomes and 1 kbp in the genomes of bacterial isolates. Finally, 189,859 viral-like contigs with a median size of 27.61 kbp were obtained, resulting in the establishment of an integrated, consistently processed, non-redundant database, the OPD (Fig. 1A). We used CheckV to evaluate the level of completeness and contamination25. As a result, 4709 sequences (2.5%) were assigned as complete and high quality (>90% completeness) and 53,432 sequences (28.1%) were assigned as medium quality (50-90% completeness) (Supplementary Fig. 1B, Supplementary Table 3). The sequences with a completeness greater than 50% were defined as viral draft genomes with a median genome length of 48,519 bp and a median CheckV genome completeness of 65.1% (Fig. 1B and C), and these genomes were used for subsequent analyses. To generalize about global phage distribution properties, viral genomes sharing high sequence identity can be further clustered into suitable taxonomic levels for subsequent analysis. We grouped the oral viral draft genomes with reference genomes based on shared protein clusters by vConTACT226, whereby 1915 non-singleton viral clusters (VCs) with 10,382 sub viral clusters (subVCs) were generated. After elimination of subVCs only containing vConTACT2 reference genomes, 9983 subVCs remained (Supplementary Table 4), of which 489 represented phages from cultured bacterial isolate data (Fig. 1D). vConTACT2 provides high-confidence genus assignments for identified VCs. For each subVC, the genome with the highest quality score assessed by CheckV was selected as the representative25.
Among these subVCs, 64.8% comprised only one member, indicating that their genomes are distant from those of other phages (Fig. 1E). More than 57% of the subVCs were predicted to be lytic phages, which is consistent with previously reported studies12 (Fig. 1F).
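The completeness-based tiers described above can be illustrated with a minimal Python sketch. It is not the pipeline used to build OPD: the function names, the toy contigs, and the handling of a completeness of exactly 50% are assumptions made for illustration, and only the thresholds and counts are taken from the text.

```python
from collections import Counter


def quality_tier(completeness):
    """Map an estimated completeness (percent, 0-100) to a tier label."""
    if completeness is None:
        return "not-determined"
    if completeness > 90:
        return "high"
    if completeness >= 50:  # treating exactly 50% as medium is an assumption
        return "medium"
    return "low"


def is_draft_genome(completeness):
    """Draft genomes are taken here as the high- plus medium-quality tiers."""
    return quality_tier(completeness) in {"high", "medium"}


# Hypothetical contigs: (identifier, estimated completeness in percent).
contigs = [("c1", 100.0), ("c2", 65.1), ("c3", 48.0), ("c4", None), ("c5", 91.2)]
tier_counts = Counter(quality_tier(c) for _, c in contigs)
drafts = [name for name, c in contigs if is_draft_genome(c)]

# Consistency check on the counts reported in the text.
total, high, medium = 189_859, 4_709, 53_432
high_pct = round(100 * high / total, 1)      # share of high-quality sequences
medium_pct = round(100 * medium / total, 1)  # share of medium-quality sequences
draft_total = high + medium                  # number of viral draft genomes
```

The two tiers together give 58,141 draft genomes, which is the denominator used later for host assignment.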
A Massively assembled contigs were predicted and filtered to obtain phage genomes. B, C Boxenplots showing the genome differences between the viral draft genomes and the genome fragments. B Sequence length distribution. C CheckV quality score. D–F Summary statistics of viral clusters (VCs) and subVCs. D Genome data source for each subVC. E Number of members in each subVC. F Lifestyle type for each subVC. G, H Clustering of genomes into subVCs comparing OPD to the four other existing virome catalogs. G Venn diagram showing unique and shared subVCs comparing OPD against the other four databases. H Venn diagram illustrating unique genomes in the OPD and those matching with other databases, as determined by subVC clustering.
To identify the unique phage genomes from OPD, we compared the OPD catalog with existing virus catalogs, including GVD, GPD, CHVD, and OVD, by clustering viral draft genomes via vConTACT2. Note that we also attempted to cluster OPD and MGV, but this exceeded the memory limits of our high-performance computing (HPC) cluster. Strikingly, we found that the oral phages from OPD exhibited little overlap with phages from gut virus catalogs, indicating distinct phage compositions in the gut and the oral cavity (Fig. 1G). A total of 20,136 phage genomes did not cluster with genomes from other catalogs, suggesting that OPD considerably expanded the list of phages in the human oral cavity (Fig. 1H).
OPD expands oral phage genome diversity
To systematically assess the taxonomic diversity within the OPD, we annotated all subVC representative genomes using geNomad with the International Committee on Taxonomy of Viruses (ICTV) MSL39 database27,28. Due to frequent genome mutations and the lack of common genes existing in all phage genomes, we used VIPtree to construct phylogenetic trees based on the proteomes of each subVC representative to investigate the phylogeny of the phage genomes (Fig. 2)29. 99.87% (9970/9983) of subVCs were annotated at the class level, with the majority classified as Caudoviricetes at 99.68% (9951/9983), and the remaining as Malgrandaviricetes (16/9983) and Faserviricetes (2/9983). However, only 238 subVCs could be annotated at the order level, of which 220 were Crassvirales, accounting for 2.20% (220/9983) of all oral phages. Furthermore, only 67 phages were annotated at the family level, indicating that the majority of oral phages remains challenging to classify taxonomically. Our research reveals many genomes from potentially novel phages.
The phylogenetic tree of subVC representative sequences. The inner circle is colored sequentially by taxonomic annotations: class, order, and family. The outer bars represent genome length, and subVCs containing genomes from isolated bacteria are highlighted in cyan on the tree.
The sample information of OPD enabled us to identify phages that were shared between geographic regions. Since China contributed the most data, we compared the phage composition of subVCs between samples from China and samples from other countries (Supplementary Fig. 2). In samples from China, we identified 9566 subVCs, a number far greater than the number observed in samples from other countries. Among these 9566 subVCs, 7620 (79.65%) were not detected in samples from other countries. Thirty-three subVCs were present in all countries, which reflects globally distributed phage strains that may infect globally distributed bacteria.
Proteome-wide comparison of structures identifies anti-defense genes in oral phages
To understand the function of oral phages, we investigated the proteins encoded by these phages. A total of 9,181,412 encoded proteins in OPD were annotated using the eggNOG-mapper30,31, yielding results covering several functional databases including COG, GO, KEGG, and PFAM. Among the proteins with COG functional annotations, the most enriched functional category was information storage and processing with 9.24% of all proteins fitting these typical viral functions (Fig. 3A). However, 78.71% (7,226,689) of these proteins were unknown, indicating that little is known about the functional potential of human oral phages.
A Functional annotations via COG for protein-encoding phage genes. B UpSet plot showing overlapping protein clusters (PCs) across six existing virome catalogs. C Distribution of sequence length and ESMFold predicted local distance difference test (pLDDT) for PC representatives. D Phylogenetic distribution of Thoeris anti-defense 2 (Tad2) candidates identified from OPD. E An example of AlphaFold_multimer model of opdg_86857_54 superposed onto Tad2. F Chart plot showing the RMSD changes of the 1″–3′ gcADPR and the opdg_86857_54 homotetramer C-alpha atoms during molecular dynamics simulations.
Previous studies have reported that the phages from different habitats may carry specific genes to promote their survival32. To identify the OPD-specific proteins, we clustered all proteins from existing virus catalogs, including GVD, GPD, CHVD, MGV, and OVD9,10,11,12,33. First, we used MMseqs2 to cluster all proteins on the basis of 70% sequence identity and a 90% sequence alignment, resulting in 2,762,333 protein clusters (PCs)34. Examining the structures of these clusters, we observed that 522,534 proteins were unique to OPD and not clustering with proteins from other catalogs (Fig. 3B). This high number of proteins selectively found in the OPD suggests an untapped novel functional potential for proteins in the human oral cavity. Next, we re-clustered all the OPD-specific protein sequences with 30% sequence identity and a 70% sequence alignment into 279,401 PCs, of which 111,622 had COG annotations (Supplementary Fig. 3A).
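Once proteins from all catalogs have been clustered, deciding which clusters are OPD-specific reduces to checking the origin of each cluster's members. The following is a minimal sketch of that step only, assuming a hypothetical in-memory mapping from cluster identifier to (catalog, protein) pairs; the clustering itself (MMseqs2 in the text) is not reproduced, and all identifiers are illustrative.

```python
def catalog_specific_clusters(clusters, catalog):
    """Return sorted ids of clusters whose members all come from `catalog`."""
    return sorted(
        cluster_id
        for cluster_id, members in clusters.items()
        if members and all(source == catalog for source, _ in members)
    )


# Hypothetical clusters: cluster id -> list of (source catalog, protein id).
toy_clusters = {
    "PC1": [("OPD", "p1"), ("GPD", "p2")],   # shared with a gut catalog
    "PC2": [("OPD", "p3"), ("OPD", "p4")],   # OPD only
    "PC3": [("OVD", "p5")],                  # no OPD member
    "PC4": [("OPD", "p6")],                  # OPD singleton
}

opd_specific = catalog_specific_clusters(toy_clusters, "OPD")
```

In this toy input, `PC2` and `PC4` are reported as OPD-specific because none of their members originates from another catalog.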
As recent breakthroughs in protein 3D structure prediction have enabled accurate and fast structural characterization of protein sequences, we ran ESMFold to fold the representative sequences of OPD-specific PCs with sequence lengths between 30 and 1800 amino acids35. The results are summarized in Fig. 3C. Overall, among these 271,139 proteins, 92,192 provided predictions in ESMFold with good confidence (mean pLDDT > 0.5 and pTM > 0.5), corresponding to 34% of all of the proteins, and 42,479 provided predictions in ESMFold with high confidence (mean pLDDT > 0.7 and pTM > 0.7), which corresponds to 15.7% of the total folded structures (Supplementary Table 5).
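The two confidence levels can be expressed as a simple threshold filter. The sketch below is illustrative only: the record layout and identifiers are hypothetical, and scores are assumed to be on the 0-1 scale used in the text. The last lines recompute the reported proportions from the stated counts.

```python
def passes(mean_plddt, ptm, cutoff):
    """True when both confidence scores exceed the same cutoff (0-1 scale)."""
    return mean_plddt > cutoff and ptm > cutoff


# Hypothetical predictions: (identifier, mean pLDDT, pTM).
predictions = [
    ("pc_1", 0.82, 0.75),
    ("pc_2", 0.61, 0.55),
    ("pc_3", 0.74, 0.48),  # good pLDDT but low pTM, so it fails both levels
    ("pc_4", 0.35, 0.20),
]

good = [name for name, plddt, ptm in predictions if passes(plddt, ptm, 0.5)]
high = [name for name, plddt, ptm in predictions if passes(plddt, ptm, 0.7)]

# Proportions recomputed from the counts reported in the text.
folded, n_good, n_high = 271_139, 92_192, 42_479
good_pct = round(100 * n_good / folded, 1)
high_pct = round(100 * n_high / folded, 1)
```

Because the high-confidence cutoff is stricter on both scores, every high-confidence structure is also counted among the good-confidence ones.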
Structural comparisons enabled the identification of distant homologs that were not identified by sequence alignment. Phages have evolved many mechanisms to counter bacterial defense systems as the result of an “arms race” between phage and host36. A recently reported gene is Thoeris anti-defense 2 (Tad2), which encodes a protein that can bind to 1″–3′ gcADPR to disarm the Thoeris defense system of the bacteria32. Tad2 has been reported in gut phages, but its homologs have not been investigated in oral phages. We used Foldseek to search the structure of SPO1 phage Tad2 (PDB:8smf) against high-quality structures in OPD37, and obtained 6 candidates (Supplementary Fig. 3B). Phylogenetic analysis of these proteins demonstrated that they are distinct homologs of Tad2 (Fig. 3D). As Tad2 forms a homotetrameric structure to bind 1″–3′ gcADPR, we used ColabFold to predict the homotetrameric structures of these 6 candidates38. However, only opdg_86857_54 was predicted successfully with high pLDDT and ipTM. The sequence identity of opdg_86857_54 to 8smf was only 20.93%, but the structural similarity was very high, with an RMSD of 2.55 Å and a TM-score of 0.78 (Fig. 3E). To further corroborate its molecular function, molecular dynamics (MD) simulation was used to determine the binding stability of the 1″–3′ gcADPR-opdg_86857_54 homotetramer complex. Specifically, we set 1″–3′ gcADPR as the ligand, the opdg_86857_54 homotetramer as the protein, amber/ff14SB as the protein force field, gaff-2.11 as the ligand force field, and tip3p as the water model. After minimizing the ligand-protein complex, we ran the simulation for 100 ns (Fig. 3F). The analysis revealed that 1″–3′ gcADPR can tightly bind to the protein pocket in a reasonable pose, indicating that a series of novel Tad2s can be encoded by oral phages to promote resistance to host-elicited killing.
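Both the structural comparison and the MD stability analysis rely on the root-mean-square deviation (RMSD) between matched atoms. The sketch below shows the basic calculation under a strong simplifying assumption: the two coordinate sets are already superposed and paired atom by atom, so no alignment or fitting step is performed. The coordinates are toy values, not data from the study.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally sized lists of (x, y, z) coordinates.

    Assumes the structures are already superposed and atoms are paired by index.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Toy trajectory: the second frame is the reference shifted by 1 unit along z.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
shifted = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
rmsd_series = [rmsd(reference, frame) for frame in (reference, shifted)]
```

A flat RMSD series over the course of a simulation is what is usually read as a stable complex; in practice each frame would first be fitted to the reference.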
Broad host range of bacteriophages and auxiliary functions
The interaction between phages and bacteria plays an important role in the regulation of oral microorganisms13, but the host range of most oral phages has not been described in detail. The retrieval of 2178 genomes of isolated bacteria from the oral cavity enabled the identification of spacer sequences in CRISPR arrays that are copied from bacterial phages. We screened all 2178 genomes of isolated bacteria from our data source, mining 23,636 CRISPR spacers and 1810 Cas proteins with high confidence (Supplementary Fig. 4A). After matching the spacers with all the viral draft genomes in OPD, a total of 99,612 matched pairs enabled the assignment of 25.34% (14,732/58,141) of the phages to hosts. To overview phage-host interactions, we decorated the bacterial phylogeny tree with infection relationships derived from 6584 bacterial species-subVC pairs, and subsequently constructed a genus-level interaction network comprising 578 bacterial genus-subVC pairs (Fig. 4A, Supplementary Fig. 4B, Supplementary Table 6). We further conducted host prediction using the integrated machine learning framework iPHoP39. The iPHoP results revealed a significantly broader host range. Although we only considered the top 5 host predictions from each method, these predictions matched 61.59% (4055/6584) of the bacterial species-subVC pairs predicted based on the foregoing oral bacterial CRISPR spacer matching (Supplementary Fig. 4C). Based on this concordance, we utilized the CRISPR-spacers-based results as our primary reference for host assignment in downstream analyses. At the species level, our analyses indicated that nearly half (1436/2896) of the subVCs could infect more than one bacterial species (Supplementary Fig. 4D). Such cross-species infecting phages have been described in the gut phageome, but this phenomenon seems to be more common in the microbiota of the oral cavity10. 
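The step from spacer matches to host range can be sketched as grouping matched pairs by subVC and counting distinct host species. This is an illustration under assumptions: the pair layout, the subVC labels, and the toy matches are hypothetical, and the actual matching criteria used in the study are not reproduced. The final lines recompute two proportions from the counts given in the text.

```python
from collections import defaultdict


def hosts_per_subvc(matches):
    """Group (host species, subVC) spacer matches into subVC -> set of hosts."""
    hosts = defaultdict(set)
    for host_species, subvc in matches:
        hosts[subvc].add(host_species)
    return dict(hosts)


def count_multi_host(hosts):
    """Return (number of subVCs with more than one host species, total subVCs)."""
    multi = sum(1 for species in hosts.values() if len(species) > 1)
    return multi, len(hosts)


# Hypothetical spacer matches; repeated pairs collapse into one host entry.
toy_matches = [
    ("Streptococcus oralis", "subVC_a"),
    ("Streptococcus mitis", "subVC_a"),
    ("Neisseria sicca", "subVC_b"),
    ("Neisseria sicca", "subVC_b"),
    ("Prevotella melaninogenica", "subVC_c"),
    ("Streptococcus oralis", "subVC_c"),
]
toy_hosts = hosts_per_subvc(toy_matches)
multi, total = count_multi_host(toy_hosts)

# Proportions recomputed from the counts reported in the text.
assigned_pct = round(100 * 14_732 / 58_141, 2)   # phages assigned to hosts
concordance_pct = round(100 * 4_055 / 6_584, 2)  # pairs also found by iPHoP
```

Here `subVC_a` and `subVC_c` would count as cross-species, while `subVC_b` is specific to a single species.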
We next determined the diversity of phages infecting oral bacteria and found that the dominant bacteria in the oral cavity harbored a highly diverse population of phages (Fig. 4B). For instance, the three major species colonizing the oral cavity, Neisseria flavescens, Neisseria sicca, and Streptococcus oralis were found to harbor more than 500 subVCs, ranking these species as the top 3 species with the highest viral diversity. To further explore the associations between these cross-species infecting subVCs and their hosts in a more comprehensive manner, we mapped subVCs to the phylogenetic tree of oral bacteria (Fig. 4A). Most of the cross-species infecting subVCs targeted species of Streptococcus and Prevotella. In contrast, as the most abundant genus in the oral cavity, Neisseria harbored the most diversified subVCs, but none of these subVCs were found to be present in other species, which indicates a high specificity of phages infecting Neisseria. The bacteria from the Actinobacteriota and Fusobacteriota phyla harbored few cross-species phages. Taken together, we concluded that the host range of oral phages is closely related to bacterial phylogeny, reflecting the coevolution of phages and bacteria.
A Phylogenetic tree of oral bacteria. The height of the orange bars denotes the number of subVCs the host harbors. Connection in blue represents one VC that can infect bacterial species belonging to different genera, and connection in gold represents one VC that can infect bacterial species from different phyla. B Top 20 bacterial species that are infected by the largest numbers of subVCs. The bars are colored according to the phylum to which the species belong. C Classification of KOs encoded by oral bacteria and phages. D Top 25 VFs in OPD.
The genes carried by bacteriophages may assist host bacteria in survival and may have an impact on humans. Thus, we focused on proteins involved in key functions such as metabolism and virulence. Auxiliary metabolic genes (AMGs) refer to phage-encoded genes that enhance the metabolic capacity of the host to facilitate infection40,41. We investigated the presence of AMGs in oral phages by focusing on KEGG Orthologies (KOs) associated with metabolism identified in oral phages and bacteria (Supplementary Fig. 5). Overall, 76.16% (1083/1422) of the KOs were shared by oral phages and bacteria, and 91.60% (992/1083) of the shared KOs could be defined as AMGs (Fig. 4C). As visualized by the metabolic map, main metabolic pathways including carbohydrate metabolism, energy metabolism, and amino acid metabolism were shared by oral phages and bacteria and were mainly classified as AMGs. Additionally, most of the KOs associated with nucleotide metabolism were also encoded by phages and bacteria, but were not reported as AMGs40.
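The KO comparison amounts to set operations on the KO identifiers found in phages and in bacteria. The sketch below is a toy illustration: the KO identifiers are placeholders, and the set of previously reported AMG KOs is a hypothetical input, since the text does not list the reference set. The last lines recompute the reported percentages from the stated counts.

```python
def shared_kos(phage_kos, bacterial_kos):
    """KOs present in both the phage and the bacterial annotation sets."""
    return set(phage_kos) & set(bacterial_kos)


def amg_share(shared, reported_amg_kos):
    """Fraction of shared KOs that appear in a set of reported AMG KOs."""
    if not shared:
        return 0.0
    return len(shared & set(reported_amg_kos)) / len(shared)


# Placeholder identifiers, used only to exercise the functions.
phage_kos = {"K00001", "K00002", "K00003", "K00004"}
bacterial_kos = {"K00002", "K00003", "K00004", "K00005"}
reported_amg_kos = {"K00002", "K00003"}

shared = shared_kos(phage_kos, bacterial_kos)
toy_amg_share = amg_share(shared, reported_amg_kos)

# Percentages recomputed from the counts reported in the text.
shared_pct = round(100 * 1_083 / 1_422, 2)
amg_pct = round(100 * 992 / 1_083, 2)
```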
Among all proteins in OPD, 77,773 proteins from 10,267 phages were defined as virulence factors (VFs) according to the Virulence Factor Database (VFDB)42, and more than half (13 of 25) of the top 25 types of VFs were significantly enriched in lysogenic phages (Fig. 4D), with only 3 of 25 being significantly enriched in lytic phages. This result supports the assumption of inclusive fitness within bacterial populations as an explanation for why the retention of phage-mediated VFs benefits the bacterial host. Notably, according to VFDB annotations, there are numerous phage-encoded proteins associated with bacterial adhesion, including 2167 proteins related to elongation factor Tu (EF-Tu), 1642 proteins related to the autotransporter adhesin UpaG, 723 proteins related to Bartonella adhesin A (BadA), and 610 proteins related to type IV pili (T4P). Interestingly, T4P has been reported as a coat protein of phages43, and the hosts of these phages all belong to Neisseria, including Neisseria sicca and Neisseria flavescens, which are defined as opportunistic pathogens, suggesting that some oral phages might assist pathogens in adhering to host cells.
Marked variation in the phageome across populations
Previous studies have reported that environmental factors (e.g., lifestyle, geography, urbanization) play an important role in shaping the composition of the microbiota44,45,46,47. We reasoned that the oral phage community would also be affected by these factors. To investigate the variation in the oral phageome, we profiled the human oral microbiome by mapping the metagenomic reads to a comprehensive database including representative phage genomes from OPD, bacterial genomes from the oral species-level genome bins (SGBs) catalog18, and fungal genomes from NCBI Taxonomy48. Although the relative abundances of phages varied between different countries, they were consistently less than 8.47% among all kinds of microorganisms (Fig. 5A), and the relative abundance in Chinese samples was greater than that in non-Chinese samples (Supplementary Fig. 6A), possibly reflecting differences related to ethnicity, lifestyle, and geography, all factors warranting further exploration.
A Relative abundance of viruses (phages), bacteria, and fungi in samples from 5 countries. B Comparison of oral phage abundances between Shenzhen, a highly industrialized metropolitan city, and Yunnan, a less developed province. C PCoA plot based on Bray-Curtis dissimilarities of oral phages between industrialized and non-industrialized areas. D Area under the ROC curve (AUC) of the different disease prediction models on test datasets. E, F Top 5 subVCs exhibiting the highest importance values for diseases: E for RA and F for dental calculus.
To further examine to what extent the oral phageome was affected by human lifestyle, we took advantage of our Chinese cohorts, which included 2675 individuals from Shenzhen, a highly industrialized and densely populated metropolitan city, and 671 individuals from the Yunnan Province, a less developed province where people have relatively low incomes and follow a more traditional rural lifestyle47 (Supplementary Table 2). The relative abundance of oral phages in the population from the industrialized area was significantly greater than that in the population living in the less industrialized area (Fig. 5B), with the opposite trend for the abundance of bacteria. Principal Co-ordinates Analysis (PCoA) revealed a clear separation between the populations from Shenzhen and Yunnan (Fig. 5C). We defined the dominant phages as the top 10 most abundant phages in each population (Supplementary Fig. 6B). Although the phage composition of the populations from Shenzhen and Yunnan varied significantly, 6 core phages dominated the populations from both Shenzhen and Yunnan. Individuals from Shenzhen showed an enrichment of VC_486_17 and VC_1899_0, a complete circular lytic phage (genome size of 106 kbp) without host assignment, compared with populations from Yunnan. These findings were consistent with previous studies reporting that gut phageome structure variations are associated with urbanization and lifestyle47. Taken together, phage composition and diversity differed between populations, indicating that industrialization and lifestyle impacted the human oral ecosystems.
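The PCoA above is built on pairwise Bray-Curtis dissimilarities between samples. For two abundance profiles a and b, the dissimilarity is the sum of absolute abundance differences divided by the sum of all abundances. A minimal sketch, assuming profiles stored as dictionaries keyed by hypothetical phage labels (the ordination step itself is not shown):

```python
def bray_curtis(profile_a, profile_b):
    """Bray-Curtis dissimilarity between two abundance profiles (dicts)."""
    taxa = set(profile_a) | set(profile_b)
    numerator = sum(
        abs(profile_a.get(t, 0.0) - profile_b.get(t, 0.0)) for t in taxa
    )
    denominator = sum(
        profile_a.get(t, 0.0) + profile_b.get(t, 0.0) for t in taxa
    )
    if denominator == 0:
        raise ValueError("both profiles are empty or all-zero")
    return numerator / denominator


# Toy relative-abundance profiles with hypothetical phage labels.
sample_1 = {"phage_x": 0.5, "phage_y": 0.5}
sample_2 = {"phage_x": 0.5, "phage_z": 0.5}
sample_3 = {"phage_w": 1.0}

d_12 = bray_curtis(sample_1, sample_2)  # half of the abundance is shared
d_13 = bray_curtis(sample_1, sample_3)  # no phage in common
```

The value ranges from 0 for identical profiles to 1 for profiles with no shared phage, which is why well-separated populations form distinct groups in the ordination.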
Studies have demonstrated that many human intestinal microorganisms can act as biomarkers relevant to human health and disease49,50,51. By exploring the microbiome composition of the 4D-SZ (n = 2403) and the Beijing cohorts (n = 97) with phenotype information (Supplementary Table 7), we were able to compare the potential value of different oral microorganisms as biomarkers (i.e., bacteria, fungi, and viruses) in relation to dental diseases (caries, calculus, and dental ulcers), and systemic diseases (obesity and rheumatoid arthritis). We established a series of random forest (RF) classifiers with fivefold cross-validation using multiple microbial abundances in our cohorts as features, including bacteria, fungi, viruses (phages), and the combinations of these features (Supplementary Table 8).
Most RF models exhibited an area under the receiver-operating characteristic curve (AUC) value < 0.65 for the testing sets, indicating that the composition of oral microorganisms/phages in most cases had limited diagnostic clinical value. However, in two cases, dental calculus and rheumatoid arthritis (RA), the AUCs for the testing sets approached 0.8 and 0.9, respectively, showing a potential clinical value (Fig. 5D). For dental calculus, the composition of the fungal community was not informative, whereas information on other features had comparable predictive values. Notably, for RA, the composition of bacteria, fungi, viruses, and the combination of these parameters had AUCs of approximately 0.9, suggesting that such information has the potential for being of clinical value.
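For readers less familiar with the metric, the AUC equals the probability that a randomly chosen case receives a higher predicted score than a randomly chosen control, with ties counted as one half. The sketch below computes it directly from that definition; the scores are toy values, and the random forest training itself is not shown.

```python
def auc_from_scores(case_scores, control_scores):
    """AUC as the fraction of (case, control) pairs ranked correctly.

    Ties between a case and a control contribute 0.5.
    """
    if not case_scores or not control_scores:
        raise ValueError("both groups must contain at least one score")
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Toy predicted probabilities for cases and controls.
toy_auc = auc_from_scores([0.9, 0.4], [0.5, 0.1])    # 3 of 4 pairs correct
perfect_auc = auc_from_scores([0.9, 0.8], [0.2, 0.1])
chance_auc = auc_from_scores([0.5], [0.5])            # a single tied pair
```

On this scale, 0.5 corresponds to random ranking, which is why AUC values below 0.65 are read as limited diagnostic value.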
To further explore the value of oral microbial phages as biomarkers and identify key viral biomarkers, we calculated the importance of viral features in relation to dental calculus and RA models trained on phage data for which AUC values > 0.75 were achieved. According to the RA phage models, VC_769 appeared to be a key player since four of its subVCs ranked amongst the top five subVCs with the highest importance values (higher than 0.006) (Fig. 5E). VC_769 was annotated as a lysogenic phage VC infecting Lactococcus lactis. Previous studies have suggested that Lactococcus lactis is a beneficial bacterium enriched in healthy individuals rather than in RA patients19. For the dental calculus model, VC_1293 and VC_1377 played important roles as they represented subVCs with the highest importance values, higher than 0.004 (Fig. 5F). VC_1377 is a lytic phage VC that infects Porphyromonas gingivalis, a periodontal pathogen. Together with the results above, we reasoned that the abundance of certain strictly host-specific phages would follow their bacterial hosts, which would enable phages to become biomarkers.
The presence of numerous huge phages containing the CRISPR system in the human oral cavity
By screening the genome sizes of the OPD, 3416 phages with genome sizes greater than 200 kbp were identified, and these were defined as huge phages52. To investigate the distribution of huge phages in different ecological systems, we generated a large-scale genome dataset by integrating nearly 4000 huge phages from multiple ecological niches10,52. After evaluating the quality of these huge phage genomes using CheckV, we selected 3640 “no-provirus” genomes with completeness >50% for the subsequent analyses to avoid interference from bacterial sequences, including 3232 from the human oral cavity, 187 from the human gut, and 221 from other environmental ecosystems. To explore their evolutionary characteristics, we next constructed a phylogenetic tree using the proteomes of the high-quality huge phages (Supplementary Fig. 7A). Almost all large phages found in the human gut and various environmental ecosystems, as well as a small portion found in the human oral cavity, exhibited relatively short branch lengths in the tree, indicating that these phages may harbor some similar evolutionary features. Notably, most oral huge phages were evolutionarily distant from phages found in the human gut and various environmental ecosystems, and these oral phages were difficult to annotate taxonomically.
The CRISPR-Cas systems identified in phages function to eliminate intruding competing phages52. We screened the huge phage genomes and identified 265 phages containing CRISPR-Cas systems. After extracting their spacer sequences and aligning them within the OPD, 11,638 pairs of CRISPR-associated interactions were revealed, demonstrating the existence of universal CRISPR-dependent interactions between different phages in human symbiotic microbes (Supplementary Fig. 7B). This analysis enabled the identification of phages sharing the same host, or 'co-host' phages. For example, the huge phages opdg_2921 and opdg_2470 have CRISPR systems targeting 25 normal phages in OPD (Fig. 6A). The host of many of these phages was identified as Pauljensenia hongkongensis, a gram-positive, strictly anaerobic and non-spore-forming bacterium from the Actinomycetaceae family. Accordingly, it is likely that the other targeted phages with unknown hosts could also infect P. hongkongensis, since only in this way could their genomes be recorded by opdg_2921 and opdg_2470, forming CRISPR spacers in their genomes.
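The co-host reasoning can be sketched in two steps: find which phage genomes contain a huge phage's spacers, then propagate known hosts to targeted phages that lack a host assignment. The sketch assumes exact substring matching on either strand, which is a simplification (the alignment settings used in the study are not stated here), and all sequences and identifiers are toy values.

```python
def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def spacer_targets(spacers_by_phage, genomes):
    """Sorted (source, target) pairs where a spacer occurs in a target genome."""
    edges = set()
    for source, spacers in spacers_by_phage.items():
        for spacer in spacers:
            rc = reverse_complement(spacer)
            for target, genome in genomes.items():
                if target != source and (spacer in genome or rc in genome):
                    edges.add((source, target))
    return sorted(edges)


def candidate_hosts(edges, known_hosts):
    """Propose hosts for unassigned targets from co-targeted phages."""
    by_source = {}
    for source, target in edges:
        by_source.setdefault(source, []).append(target)
    proposals = {}
    for targets in by_source.values():
        hosts = {known_hosts[t] for t in targets if t in known_hosts}
        for t in targets:
            if t not in known_hosts and hosts:
                proposals.setdefault(t, set()).update(hosts)
    return proposals


# Toy data: one spacer matches phage_A directly and phage_B on the other strand.
toy_genomes = {
    "phage_A": "TTGACCGTAGGCTAACG",
    "phage_B": "GGGCGTTAGCCTACGGTT",
    "phage_C": "ATATATATATAT",
}
toy_spacers = {"huge_1": ["CCGTAGGCTAAC"]}
toy_edges = spacer_targets(toy_spacers, toy_genomes)
toy_proposals = candidate_hosts(toy_edges, {"phage_A": "Pauljensenia hongkongensis"})
```

In the toy case, `phage_B` has no assigned host but is targeted by the same huge phage as `phage_A`, so the host of `phage_A` is proposed as a candidate for it.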
A The interaction network of huge phage CRISPR-Cas systems of opdg_2921 and opdg_2470, huge phages infecting Actinobacteriota. The phages in this network for which hosts were predicted by CRISPR spacers are colored in purple, and the arrows indicate spacers from huge phages targeting other normal phages. B Phylogenetic distribution of predicted Type V Cas proteins (Cas12) candidates identified from huge phages. C An example of ESMFold model of opdg_150_142 superimposed onto the Cas12f1 complex.
One hundred Cas proteins were predicted in these huge phages with high confidence, and most of these belong to the type V CRISPR system (Supplementary Fig. 7C). Cas12 is an example of a small Cas protein from the type V CRISPR system that can cleave dsDNA molecules and has gene editing potential53. We used ESMFold to fold all proteins predicted to be Cas12 homologs and obtained 63 structures (Supplementary Fig. 7D). The phylogenetic tree demonstrated that some Cas12 proteins from huge phages formed a new clade (Fig. 6B). Among these proteins, a notable member is opdg_150_142, a protein comprising only 465 amino acid residues and with high prediction quality (Fig. 6C). Using the predicted opdg_150_142 structure as the query, we searched the PDB database with Foldseek54 and found that it exhibited topological similarity to known Cas12 proteins. For example, Cas12f1 is a Cas protein from an uncultured archaeon capable of forming a dimer to cleave dsDNA55. The pairwise structural alignment showed that opdg_150_142 and Cas12f1 have the same fold with a TM-score of 0.56 (Fig. 6C), indicating that opdg_150_142 and related proteins may be candidates for gene editing enzymes.
Discussion
Although the oral cavity contains numerous commensal microorganisms, detailed information on the structure and composition of its bacteriophages is limited. In this study, we performed large-scale data mining of 5427 oral metagenomes and 2178 genomes of oral isolated bacteria, generating a comprehensive oral phage database with genome-wide and proteome-wide annotations. Taking advantage of large cohorts and genomes sequenced in house, we identified a total of 9983 non-redundant oral phage subVCs, many of which have not been reported previously and represent diverse, previously uncharacterized viral groups in the oral cavity. OPD contains more than three thousand huge phages, providing a valuable resource for further exploring the characteristics of huge phages and investigating their role in microbial ecosystems. Beyond genome catalog generation, we also explored the interactions between these phages and their oral bacterial hosts, and found that the broad host range of oral phages is closely related to host phylogeny. The prevalence of oral phages with a broad host range differs substantially from previous studies reporting that most human phages are highly host-specific. Some phages encode anti-defense proteins to resist bacterial defenses, while others carry AMGs or VFs that possibly impact host metabolism or enhance virulence, which may modulate host adaptation. Based on the cohorts sequenced by us, we found that the abundance of oral phages was significantly greater in industrialized areas than in non-industrialized areas, and that phage composition differed between these two areas. Our study identifies candidate phage-based biomarkers for oral diseases, which may complement conventional bacterial and fungal diagnostic markers and may hold therapeutic potential for phage-based interventions.
Powerful protein structure prediction tools based on large language models, such as ESMFold, make it possible to perform proteome-wide structural analyses of metagenomic data with high efficiency. By using protein structural information and fast structural alignment to uncover novel anti-defense proteins and Cas proteins, we present evidence that structural similarity can serve as an alternative to traditional sequence-based methods in microbiome research. Molecular dynamics can simulate the binding and interaction between target proteins and ligands, and can be applied when experimental validation of protein function is difficult. We obtained about 90,000 high-quality protein structures in OPD, but most of these proteins have not been fully characterized and need further investigation.
Although this study improved our understanding of oral phages, it has certain limitations. Variations in data volume and sample size among different cohorts may influence the assessment of viral diversity and relative abundance across regions. Virus-like particle (VLP) sequencing data, which are the most accurate data for the virome, are missing in this study. The lack of VLP sequencing data may have reduced sensitivity in detecting low-abundance lytic phages. For host prediction, broad host ranges were determined using CRISPR-spacer matching, a widely used method whose accuracy is difficult to confirm. Additionally, functional exploration of viral genomes, particularly AMG analysis, was based on sequence homology, requiring cautious interpretation due to potential host contamination. The absence of experimental validation (e.g., phage isolation, functional assays) means that our computational predictions should be treated as putative biological hypotheses rather than confirmed mechanisms. Despite these limitations, our study offers valuable insights and targets for future experiments. Thus, we anticipate that OPD will constitute a valuable resource for further exploration of human oral phages and their interactions with their hosts.
Methods
Data collection and phage sequence detection
The sample collection and analysis were approved by the Institutional Review Board on Bioethics and Biosafety of BGI under the numbers BGI-IRB 19121 and BGI-IRB 22112-T1.
The 4D-SZ cohort of Chinese individuals living in Shenzhen comprises 3953 oral metagenomic samples, of which 1278 were newly sequenced (Supplementary Table 1). The Yunnan cohort consisted of 671 salivary samples from Yunnan province. Other public oral metagenomic datasets were downloaded from the NCBI SRA database under accession codes SRP133047, SRP029441, SRS3984307, ERP110622, and ERP006678. All high-quality reads were assembled individually for each sample using the assembly module of the METAPI pipeline, applying SPAdes v3.13.0 with the option ‘-meta’56. For other details, see the methods of Zhu et al. 18. In addition to metagenomic data, we also collected data from the Cultivated Oral Bacteria Genome Reference (COGR)24 and the Expanded Human Oral Microbiome Database (eHOMD)2. The culture conditions for oral bacteria and the genome assembly methods followed those described by Zou et al. 57, Li et al. 24, and Escapa et al. 2.
To mine phage genomes from the sequencing data, we developed a unified pipeline based mainly on VirFinder and VirSorter2. VirFinder identifies viral sequences using k-mer frequencies and machine learning8. Because its classifier is built on k-mer frequency profiles rather than on alignment to reference genomes, it can identify viral sequences quickly and without a reference. VirSorter2, on the other hand, applies a multi-classifier, expert-guided approach to detect diverse DNA and RNA viral genomes, and determines the presence of viral sequences after comparing them to its own database7. Using the strict criteria ‘VirFinder score >= 0.9, p-value <= 0.05 AND VirSorter2 score > 0.7’, more than 12 million sequences longer than 1 kb were retained.
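The combined retention rule quoted above can be written as a simple predicate. The following is a minimal sketch under stated assumptions, not the authors' pipeline: the record type `ContigCall`, its field names, and the example contigs are hypothetical, and the scores are assumed to have already been parsed from each tool's output.

```python
from dataclasses import dataclass


@dataclass
class ContigCall:
    """Hypothetical per-contig record holding parsed scores from both tools."""
    contig_id: str
    length_bp: int
    virfinder_score: float
    virfinder_pvalue: float
    virsorter2_score: float


def passes_viral_filter(call, min_length_bp=1000):
    """Apply the criteria stated in the text; all conditions must hold."""
    return (
        call.length_bp > min_length_bp
        and call.virfinder_score >= 0.9
        and call.virfinder_pvalue <= 0.05
        and call.virsorter2_score > 0.7
    )


# Made-up example contigs.
calls = [
    ContigCall("contig_a", 15200, 0.97, 0.001, 0.95),
    ContigCall("contig_b", 8400, 0.92, 0.030, 0.65),  # fails the VirSorter2 cutoff
    ContigCall("contig_c", 900, 0.99, 0.001, 0.99),   # not longer than 1 kb
]
retained = [call.contig_id for call in calls if passes_viral_filter(call)]
```

Requiring agreement between the two tools trades sensitivity for precision, which is consistent with the later decontamination steps being applied to an already conservative candidate set.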
Sequence decontamination
To further improve the quality of the viral genomes, reduce contamination, and select high-quality data for subsequent analysis, we developed a set of quality screening and control procedures. First, human contamination and eukaryotic viruses were removed from the 12,545,619 sequences obtained in the previous step, mainly using BLAST v2.9.058. All sequences were aligned to the human genome GRCh38 with the strict parameters ‘-word_size 28 -best_hit_overhang 0.1 -best_hit_score_edge 0.1 -dust yes -evalue 0.0001 -min_raw_gapped_score 100 -penalty -5 -perc_identity 90 -soft_masking true’, and the matched sequences were excluded. We then removed sequences shorter than 10 kb, leaving 566,789 sequences. To remove redundancy and select representative sequences, the sequences from metagenomic data were clustered using CD-HIT with the parameters ‘-c 0.95 -G 0 -aS 0.75’, yielding 240,215 cluster representative sequences59. Because some sequences, such as plasmids and long mobile elements, can be misclassified as viruses, the artificial neural network used to construct the Gut Phage Database was also applied for quality control10. This step removed a further 50,356 contaminating sequences. The remaining 189,859 sequences constitute the genomes of the Oral Phage Database.
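As a sanity check on the counts above, the last two steps can be reproduced arithmetically, and the first screening step can be sketched as a predicate. This is an illustrative sketch only: the function name and its boolean inputs are hypothetical stand-ins for the outcome of the BLAST-based screening, which is not re-implemented here.

```python
MIN_LENGTH_BP = 10_000  # sequences shorter than 10 kb are removed


def survives_first_screen(length_bp, matches_human, is_eukaryotic_virus):
    """Return True if a candidate passes host/eukaryotic-virus and length screening.

    matches_human and is_eukaryotic_virus are assumed to come from upstream
    alignment results (hypothetical inputs).
    """
    return (
        not matches_human
        and not is_eukaryotic_virus
        and length_bp >= MIN_LENGTH_BP
    )


# Counts reported in the text for the last two steps.
cluster_representatives = 240_215
removed_by_classifier = 50_356
final_catalog_size = cluster_representatives - removed_by_classifier
```

The difference equals 189,859, matching the size of the final catalog stated in the text.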
Genome quality assessment
CheckV v0.7.0 is a fully automated quality control and assessment process for assessing the quality of single viral contigs25, including identification of contamination by the host of the integrated provirus, estimation of genomic fragment integrity, and identification of closed genomes. Results are rated based on viral sequence integrity and are divided into five classes: Complete (100% completeness), High-quality (greater than or equal to 90% and less than 100% completeness), Medium-quality (greater than or equal to 50% and less than 90% completeness), Low-quality (less than 50% completeness), Not-determined (no viral gene found in the sequence). We applied CheckV v0.7.0 to evaluate the obtained genomes and used the complete, high-quality, and medium-quality sequences as phage draft genomes for subsequent analysis, obtaining a total of 58,141 sequences.
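The five tiers and the draft-genome selection described above can be sketched as a small lookup. This is a minimal sketch that follows the completeness thresholds exactly as stated in the text; it is not CheckV's code, and CheckV itself may rely on additional evidence when assigning tiers. The function names are hypothetical, and `None` is used here as an assumed stand-in for a sequence with no viral gene found.

```python
from typing import Optional

DRAFT_TIERS = {"Complete", "High-quality", "Medium-quality"}


def quality_tier(completeness: Optional[float]) -> str:
    """Map estimated completeness (percent) to the tiers listed in the text."""
    if completeness is None:
        return "Not-determined"
    if completeness >= 100:
        return "Complete"
    if completeness >= 90:
        return "High-quality"
    if completeness >= 50:
        return "Medium-quality"
    return "Low-quality"


def is_draft_genome(completeness: Optional[float]) -> bool:
    """Draft genomes are the complete, high-quality, and medium-quality tiers."""
    return quality_tier(completeness) in DRAFT_TIERS
```

Under this reading, a genome at exactly 50% completeness counts as a draft genome, while one at 49.9% does not.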
Generation of viral clusters and phylogenetic tree
To cluster the viral genomes into viral clusters (VCs), the network-based classification tool vConTACT2 was applied26. Genomes with completeness of at least 50% were defined as viral draft genomes. For these sequences, we used the protein sequences predicted by Prodigal, together with a gene-to-genome mapping file, as input to generate protein clusters (PCs) and VCs, with the parameters ‘--raw-proteins --rel-mode Diamond --proteins-fp --pcs-mode MCL --vcs-mode ClusterONE --c1-bin cluster_one-1.0.jar --db ProkaryoticViralRefSeq88-Merged’. For each subVC, the genome with the highest completeness and longest length was selected as the representative sequence. The reference sequences from the ProkaryoticViralRefSeq88-Merged database were removed, resulting in 9983 subVCs. We also clustered the viral draft genomes of OPD together with those of other databases, including GVD, GPD, CHVD, and OVD9,10,12,33. A proteome-based phylogenetic tree of the representative sequences of the 9983 subVCs was constructed with ViPTree using default parameters29.
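The representative-selection rule can be sketched as follows. This is an illustrative sketch, not the authors' code: the dictionary keys and genome identifiers are hypothetical, and treating length as the tie-breaker after completeness is an assumption, since the text names both criteria without stating their order.

```python
def _rank(genome):
    # Assumed ordering: completeness first, then length as the tie-breaker.
    return (genome["completeness"], genome["length_bp"])


def pick_representatives(genomes):
    """Return {subVC: genome_id} choosing the best-ranked genome per subVC."""
    best = {}
    for genome in genomes:
        current = best.get(genome["subvc"])
        if current is None or _rank(genome) > _rank(current):
            best[genome["subvc"]] = genome
    return {subvc: genome["genome_id"] for subvc, genome in best.items()}


# Made-up example genomes.
example_genomes = [
    {"genome_id": "g1", "subvc": "subVC_1", "completeness": 92.0, "length_bp": 40000},
    {"genome_id": "g2", "subvc": "subVC_1", "completeness": 92.0, "length_bp": 45000},
    {"genome_id": "g3", "subvc": "subVC_1", "completeness": 60.0, "length_bp": 90000},
    {"genome_id": "g4", "subvc": "subVC_2", "completeness": 55.0, "length_bp": 20000},
]
representatives = pick_representatives(example_genomes)
```

In the example, g2 is chosen over g1 on length alone, and the much longer but less complete g3 is not selected.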
Taxonomic assignment
geNomad, a tool for virus identification and annotation with improved classification performance, was applied for taxonomic assignment10,27. Viral draft genomes were annotated with the parameters ‘--lenient-taxonomy --full-ictv-lineage’. Specifically, for each genome, the encoded genes were aligned to the geNomad marker gene set, which is associated with viral taxa defined in ICTV MSL 3928. Each gene was then classified according to the taxonomic lineage of its assigned marker. If more than 2 genes hit the marker set and more than half of all hit genes belonged to a specific taxon, the sequence was assigned to the most specific such taxon. For each subVC, the annotation of the representative genome was taken as the subVC taxonomic assignment.
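The majority rule described above can be sketched as a vote over lineage prefixes. This is a simplified reading of the rule as worded in the text, not geNomad's implementation, which may weight genes differently. The function name and the taxon labels in the example are hypothetical; each gene's lineage is assumed to be a list ordered from the most general to the most specific rank.

```python
from collections import Counter


def assign_taxonomy(gene_lineages):
    """Return the most specific lineage supported by more than half of hit genes.

    gene_lineages: one lineage (list of taxon names, general to specific) per
    gene that hit the marker set. Returns None when no assignment is made.
    """
    n_hits = len(gene_lineages)
    if n_hits <= 2:  # the rule requires more than 2 marker hits
        return None
    support = Counter()
    for lineage in gene_lineages:
        # Each gene supports every prefix of its own lineage.
        for depth in range(1, len(lineage) + 1):
            support[tuple(lineage[:depth])] += 1
    majority = [prefix for prefix, count in support.items() if count > n_hits / 2]
    if not majority:
        return None
    # Majority-supported prefixes are nested, so the longest is the most specific.
    return list(max(majority, key=len))


# Made-up example: four genes agree at the class level, only two at the order level.
example = [
    ["ClassA", "OrderA1"],
    ["ClassA", "OrderA1"],
    ["ClassA"],
    ["ClassA", "OrderA2"],
]
assigned = assign_taxonomy(example)
```

Here the order-level label is supported by exactly half of the hit genes, which does not exceed one half, so the assignment stops at the class level.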
Host prediction and analyses of oral bacteria
We used Crass to mine CRISPR spacer sequences from the 2178 oral bacterial genomes, and CRISPRCasTyper to predict the Cas proteins60,61. These CRISPR spacers were matched to phage draft genomes by BLASTn with the parameters ‘-perc_identity 100 -qcov_hsp_perc 100’. Based on the matching results, phage-host relationships were classified at the subVC-species level and the VC-genus level, and the infection network was displayed with Cytoscape (v3.8.2)62. iPHoP was employed for additional validation using default parameters, retaining only matches with confidence scores >90 and the top 5 predictions from each method39. Note that iPHoP results were not incorporated into downstream analyses.
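The spacer-to-phage matching step can be sketched as a small parser that turns alignment hits into a phage-to-host map. This is a minimal sketch under stated assumptions, not the authors' code: it assumes the hits are available as standard 12-column tabular BLAST output (query, subject, percent identity, and so on), and the spacer-to-host lookup table, identifiers, and example rows are hypothetical.

```python
from collections import defaultdict


def hosts_from_spacer_hits(blast_lines, spacer_to_host):
    """Collect, for each phage, the host species whose spacers match it perfectly.

    blast_lines: tab-separated hits with spacer ID, phage ID, and percent
    identity in the first three columns (assumed layout).
    spacer_to_host: hypothetical lookup from spacer ID to host species.
    """
    hosts = defaultdict(set)
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        spacer_id, phage_id, identity = fields[0], fields[1], float(fields[2])
        # Keep perfect matches only, mirroring the 100% identity requirement.
        if identity == 100.0 and spacer_id in spacer_to_host:
            hosts[phage_id].add(spacer_to_host[spacer_id])
    return dict(hosts)


# Made-up example hits.
example_hits = [
    "spacer_1\tphage_x\t100.000\t32\t0\t0\t1\t32\t500\t531\t1e-10\t60.2",
    "spacer_2\tphage_x\t100.000\t30\t0\t0\t1\t30\t900\t929\t1e-09\t56.5",
    "spacer_3\tphage_y\t96.875\t32\t1\t0\t1\t32\t100\t131\t1e-07\t52.8",
]
example_lookup = {
    "spacer_1": "Species A",
    "spacer_2": "Species B",
    "spacer_3": "Species A",
}
phage_hosts = hosts_from_spacer_hits(example_hits, example_lookup)
```

A phage that accumulates several host species in this map is a candidate broad-host-range phage, subject to the caveats about spacer-based prediction raised in the Discussion.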
GTDB-Tk (v1.5.0) with database release 202 was used for taxonomic annotation of the 2178 genomes of oral isolated bacteria63. For each species, we selected the genome with the highest completeness (calculated by CheckM) and the longest length as the representative genome, and used these representatives to construct a maximum-likelihood phylogenetic tree based on 120 conserved single-copy genes64. We then mapped the phages to their hosts on the phylogenetic tree. The tree was visualized using iTOL (v6)65.
Protein-coding gene calling and functional analysis based on sequence information
To identify protein-coding genes for generating a gene and proteome catalog, Prodigal v2.6.3 was applied for all genomes from OPD, oral bacteria, and other downloaded phage genomes66. Overall, 9,181,412 genes from OPD were predicted and translated into protein sequences. Then these genes were annotated by the eggNOG-mapper with default parameters and the eggNOG database, including GO, KEGG, COG, and PFAM3030. We then used MMseqs2 to cluster the proteins from OPD and other virome databases for comparison38. All proteins were clustered with the parameters ‘-s 7.5 -c 0.9 --min-seq-id 0.7 --cov-mode 0 -e 0.001 --cluster-mode 0’, resulting in 2,762,333 clusters. The cluster result was visualized by the ‘UpSetR’ R package.
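The cross-database comparison summarized in the UpSet plot amounts to counting, for each protein cluster, which databases contribute members. The following is an illustrative sketch, not the authors' code: it assumes the clustering result is available as (representative, member) pairs, and the `DATABASE|protein` naming convention used to recover the source database is hypothetical.

```python
from collections import Counter, defaultdict


def database_combinations(cluster_rows, database_of):
    """Count clusters by the exact set of databases contributing members.

    cluster_rows: iterable of (representative_id, member_id) pairs.
    database_of: function mapping a protein ID to its source database.
    """
    databases = defaultdict(set)
    for representative, member in cluster_rows:
        databases[representative].add(database_of(member))
    return Counter(frozenset(dbs) for dbs in databases.values())


# Made-up example rows; IDs carry a hypothetical database prefix.
rows = [
    ("OPD|p1", "OPD|p1"),
    ("OPD|p1", "GPD|p9"),
    ("OPD|p2", "OPD|p2"),
    ("OPD|p2", "OPD|p3"),
    ("GVD|p4", "GVD|p4"),
]
combinations = database_combinations(rows, lambda pid: pid.split("|")[0])
opd_specific = combinations[frozenset({"OPD"})]
```

Clusters whose database set is exactly {OPD} correspond to the OPD-specific protein clusters carried forward to structure prediction.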
For the AMGs, we matched the annotation to the KOs-list ‘VIBRANT_AMGs.tsv’ supplied by VIBRANT40. BLAST v2.9.0 was used to align the gene sequences to VFDB for annotation of virulence factors42.
Proteome-wide structure prediction for functional protein identification
The sequences from OPD-specific protein clusters were re-clustered with the parameters ‘-s 7.5 -c 0.7 --min-seq-id 0.3 --cov-mode 0 -e 0.001 --cluster-mode 0’. The structures of the representative sequences of these protein clusters with lengths between 30 and 1800 amino acid residues were predicted with ESMFold on 4 Nvidia A100 GPUs over one week35.
We used Foldseek to search the structure of SPO1 phage Tad2 (PDB:8smf) against the high-quality structures in OPD and then performed pairwise comparisons with MM-align37,67. The homotetrameric structures of 6 candidates were predicted by ColabFold with the AlphaFold_multimer v2.3.1 model, using SPO1 phage Tad2 as the template input. The molecular dynamics simulation of opdg_86857_54 with the putative ligand 1″–3′ gcADPR was performed using an OpenMM-based protocol (https://github.com/tdudgeon/simple-simulate-complex)68. The phylogenetic tree of Tad2 proteins was built with FastTree and visualized with ggtree69.
Metagenomic reads mapping
To estimate the relative abundances of viruses, bacteria, and fungi in each sample, we first generated a customized database with OPD, the oral SGBs from Zhu et al. 18, and the fungal genomes downloaded from NCBI. Next, we mapped metagenomic reads to the database using Kraken70. Relative abundances of viruses, bacteria, and fungi were determined by Bracken with default parameters71.
The variation in oral viruses between different regions was assessed by permutational multivariate analysis of variance (PERMANOVA) on Bray-Curtis dissimilarities computed from the viral relative abundance profiles, using the R vegan package72.
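For readers unfamiliar with the dissimilarity measure underlying this test, the Bray-Curtis dissimilarity between two abundance profiles x and y is the sum of |x_i − y_i| divided by the sum of (x_i + y_i). The sketch below illustrates the measure only; it is not the vegan implementation and does not perform the permutation test. The function name and example profiles are hypothetical, and returning 0 for two all-zero profiles is an assumed convention.

```python
def bray_curtis(profile_x, profile_y):
    """Bray-Curtis dissimilarity between two non-negative abundance profiles."""
    if len(profile_x) != len(profile_y):
        raise ValueError("profiles must cover the same set of taxa")
    total = sum(profile_x) + sum(profile_y)
    if total == 0:
        return 0.0  # assumed convention for two empty profiles
    return sum(abs(a - b) for a, b in zip(profile_x, profile_y)) / total


# Made-up relative abundance profiles over three phages.
sample_a = [0.5, 0.5, 0.0]
sample_b = [0.5, 0.0, 0.5]
dissimilarity = bray_curtis(sample_a, sample_b)
```

The value ranges from 0 for identical profiles to 1 for profiles that share no taxa; the two example samples share half of their abundance and score 0.5.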
Construction of prediction models
Random forest models for each disease were built with the scikit-learn and imblearn packages in Python 3.7 using the abundance profiles of the samples. Models were built with 1000 trees and evaluated by fivefold cross-validation. ROC curves and feature importance were visualized with the Python seaborn package.
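The fivefold cross-validation scheme can be illustrated independently of any modeling library. The sketch below uses only the standard library and shows how samples might be partitioned into five folds; it is not the authors' code and does not train a random forest. Stratifying by class label is an assumption made here because disease cohorts are often imbalanced; the text states only that fivefold cross-validation was used. The function name, labels, and seed are hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Split sample indices into folds with near-equal class proportions."""
    rng = random.Random(seed)
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    for label in sorted(indices_by_class):
        indices = indices_by_class[label]
        rng.shuffle(indices)
        # Deal shuffled indices round-robin so each fold gets a similar share.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


# Made-up imbalanced cohort: 10 cases and 40 controls.
example_labels = ["case"] * 10 + ["control"] * 40
folds = stratified_folds(example_labels)
```

Each fold serves once as the held-out test set while the remaining four are used for training, so every sample contributes exactly one out-of-fold prediction to the ROC curve.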
Prediction of CRISPR systems in huge phages
CRISPR systems in huge phages were predicted and visualized by CRISPRCasTyper60 with default parameters. The spacers predicted from the huge phage genomes were then aligned to other phages in OPD by BLAST with the parameter ‘-evalue 1e-6’, and hits with more than 95% sequence identity were retained. The interaction network was plotted with Cytoscape. The structures of the putative Cas12 proteins were predicted by ESMFold35.
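The construction of the phage-phage interaction network, and the co-host reasoning it supports, can be sketched as follows. This is an illustrative sketch, not the authors' code: the hit tuples, identifiers, and function names are hypothetical, excluding self-hits is an assumption, and the co-host suggestion is a heuristic that only proposes a host when all targets with a known host agree.

```python
from collections import defaultdict


def build_edges(hits, min_identity=95.0, max_evalue=1e-6):
    """Keep (huge_phage, target_phage) pairs passing the stated cutoffs.

    hits: iterable of (huge_phage, target_phage, percent_identity, evalue).
    """
    edges = set()
    for huge_phage, target, identity, evalue in hits:
        # Self-hits are skipped (assumption); identity must exceed 95%.
        if huge_phage != target and identity > min_identity and evalue <= max_evalue:
            edges.add((huge_phage, target))
    return edges


def suggest_cohosts(edges, known_hosts):
    """Propose hosts for targets with unknown hosts from co-targeted phages."""
    targets_by_source = defaultdict(set)
    for huge_phage, target in edges:
        targets_by_source[huge_phage].add(target)
    suggestions = {}
    for targets in targets_by_source.values():
        hosts = {known_hosts[t] for t in targets if t in known_hosts}
        if len(hosts) == 1:  # only propose when the known hosts agree
            host = next(iter(hosts))
            for target in targets:
                if target not in known_hosts:
                    suggestions[target] = host
    return suggestions


# Made-up example hits.
example_hits = [
    ("huge_1", "phage_a", 100.0, 1e-12),
    ("huge_1", "phage_b", 97.5, 1e-8),
    ("huge_1", "phage_c", 93.0, 1e-9),  # identity too low
    ("huge_1", "phage_d", 99.0, 1e-3),  # e-value too high
]
edges = build_edges(example_hits)
suggested = suggest_cohosts(edges, {"phage_a": "Pauljensenia hongkongensis"})
```

The resulting edge list is directed from the spacer-carrying huge phage to its target, matching the arrow convention used in the network figure.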
References
Liang, G. & Bushman, F. D. The human virome: assembly, composition and host interactions. Nat. Rev. Microbiol. 19, 514–527 (2021).
Escapa, I. F. et al. New insights into human nostril microbiome from the expanded human oral microbiome database (eHOMD): a resource for the microbiome of the human aerodigestive tract. Msystems 3, 00187–00118 (2018).
Pride, D. T. et al. Evidence of a robust resident bacteriophage population revealed through analysis of the human salivary virome. ISME J. 6, 915–926 (2012).
Wang, J., Gao, Y. & Zhao, F. Phage–bacteria interaction network in human oral microbiome. Environ. Microbiol. 18, 2143–2158 (2016).
Ho, S. X., Min, N., Wong, E. P. Y., Chong, C. Y. & Chu, J. J. H. Characterization of oral virome and microbiome revealed distinctive microbiome disruptions in paediatric patients with hand, foot and mouth disease. npj Biofilms Microbiomes 7, 19 (2021).
Khalifa, L. et al. Phage therapy against Enterococcus faecalis in dental root canals. J. Oral. Microbiol. 8, 32157 (2016).
Guo, J. et al. VirSorter2: a multi-classifier, expert-guided approach to detect diverse DNA and RNA viruses. Microbiome 9, 1–13 (2021).
Ren, J., Ahlgren, N. A., Lu, Y. Y., Fuhrman, J. A. & Sun, F. VirFinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data. Microbiome 5, 1–20 (2017).
Gregory, A. C. et al. The gut virome database reveals age-dependent patterns of virome diversity in the human gut. Cell Host Microbe 28, 724–740.e728 (2020).
Camarillo-Guerrero, L. F., Almeida, A., Rangel-Pineros, G., Finn, R. D. & Lawley, T. D. Massive expansion of human gut bacteriophage diversity. Cell 184, 1098–1109.e1099 (2021).
Nayfach, S. et al. Metagenomic compendium of 189,680 DNA viruses from the human gut microbiome. Nat. Microbiol. 6, 960–970 (2021).
Li, S. et al. A catalog of 48,425 nonredundant viruses from oral metagenomes expands the horizon of the human oral virome. Iscience 25 (2022).
Wahida, A., Tang, F. & Barr, J. J. Rethinking phage-bacteria-eukaryotic relationships and their influence on human health. Cell Host Microbe 29, 681–688 (2021).
Shen, S., Huo, D., Ma, C., Jiang, S. & Zhang, J. Expanding the colorectal cancer biomarkers based on the human gut phageome. Microbiol. Spectr. 9, e00090–00021 (2021).
Arnold, B. J., Huang, I.-T. & Hanage, W. P. Horizontal gene transfer and adaptive evolution in bacteria. Nat. Rev. Microbiol. 20, 206–218 (2022).
Li, R., Wang, Y., Hu, H., Tan, Y. & Ma, Y. Metagenomic analysis reveals unexplored diversity of archaeal virome in the human gut. Nat. Commun. 13, 7978 (2022).
Wang, J. et al. Maternal and neonatal viromes indicate the risk of offspring’s gastrointestinal tract exposure to pathogenic viruses of vaginal origin during delivery. Mlife 1, 303–310 (2022).
Zhu, J. et al. Over 50,000 metagenomically assembled draft genomes for the human oral microbiome reveal new taxa. Genomics Proteom. Bioinforma. 20, 246–259 (2022).
Zhang, X. et al. The oral and gut microbiomes are perturbed in rheumatoid arthritis and partly normalized after treatment. Nat. Med. 21, 895–905 (2015).
Brito, I. L. et al. Transmission of human-associated microbiota along family and social networks. Nat. Microbiol. 4, 964–971 (2019).
Zeller, G. et al. Potential of fecal microbiota for early-stage detection of colorectal cancer. Mol. Syst. Biol. 10, 766 (2014).
Heintz-Buschart, A. et al. Integrated multi-omics of the human gut microbiome in a case study of familial type 1 diabetes. Nat. Microbiol. 2, 1–13 (2016).
Goltsman, D. S. A. et al. Metagenomic analysis with strain-level resolution reveals fine-scale variation in the human pregnancy microbiome. Genome Res. 28, 1467–1480 (2018).
Li, W. et al. A catalog of bacterial reference genomes from cultivated human oral bacteria. npj Biofilms Microbiomes 9, 45 (2023).
Nayfach, S. et al. CheckV assesses the quality and completeness of metagenome-assembled viral genomes. Nat. Biotechnol. 39, 578–585 (2021).
Bin Jang, H. et al. Taxonomic assignment of uncultivated prokaryotic virus genomes is enabled by gene-sharing networks. Nat. Biotechnol. 37, 632–639 (2019).
Camargo, A. P. et al. Identification of mobile genetic elements with geNomad. Nat. Biotechnol. 42, 1303–1312 (2024).
Simmonds, P. et al. Changes to virus taxonomy and the ICTV Statutes ratified by the International Committee on Taxonomy of Viruses (2024). Arch. Virol. 169, 236 (2024).
Nishimura, Y. et al. ViPTree: the viral proteomic tree server. Bioinformatics 33, 2379–2380, https://doi.org/10.1093/bioinformatics/btx157 (2017).
Cantalapiedra, C. P., Hernández-Plaza, A., Letunic, I., Bork, P. & Huerta-Cepas, J. eggNOG-mapper v2: functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Mol. Biol. Evol. 38, 5825–5829 (2021).
Huerta-Cepas, J. et al. eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res. 47, D309–D314 (2019).
Yirmiya, E. et al. Phages overcome bacterial immunity via diverse anti-defence proteins. Nature 625, 352–359, https://doi.org/10.1038/s41586-023-06869-w (2024).
Tisza, M. J., Belford, A. K., Dominguez-Huerta, G., Bolduc, B. & Buck, C. B. Cenote-Taker 2 democratizes virus discovery and sequence annotation. Virus Evol. 7, veaa100 (2021).
Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).
Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130, https://doi.org/10.1126/science.ade2574 (2023).
Hampton, H. G., Watson, B. N. J. & Fineran, P. C. The arms race between bacteria and their phage foes. Nature 577, 327–336, https://doi.org/10.1038/s41586-019-1894-8 (2020).
van Kempen, M. et al. Fast and accurate protein structure search with Foldseek. Nat. Biotechnol. 42, 243–246, https://doi.org/10.1038/s41587-023-01773-0 (2024).
Mirdita, M. et al. ColabFold: making protein folding accessible to all. Nat. Methods 19, 679–682, https://doi.org/10.1038/s41592-022-01488-1 (2022).
Roux, S. et al. iPHoP: An integrated machine learning framework to maximize host prediction for metagenome-derived viruses of archaea and bacteria. PLoS Biol. 21, e3002083 (2023).
Kieft, K., Zhou, Z. & Anantharaman, K. VIBRANT: automated recovery, annotation and curation of microbial viruses, and evaluation of viral community function from genomic sequences. Microbiome 8, 1–23 (2020).
Thompson, L. R. et al. Phage auxiliary metabolic genes and the redirection of cyanobacterial host carbon metabolism. Proc. Natl. Acad. Sci. 108, E757–E764 (2011).
Liu, B., Zheng, D., Zhou, S., Chen, L. & Yang, J. VFDB 2022: a general classification scheme for bacterial virulence factors. Nucleic Acids Res. 50, D912–D917 (2022).
Karaolis, D. K., Somara, S., Maneval, D. R. Jr, Johnson, J. A. & Kaper, J. B. A bacteriophage encoding a pathogenicity island, a type-IV pilus and a phage receptor in cholera bacteria. Nature 399, 375–379 (1999).
Nishijima, S. et al. Extensive gut virome variation and its associations with host and environmental factors in a population-level cohort. Nat. Commun. 13, 5252 (2022).
Olm, M. R. et al. Robust variation in infant gut microbiome assembly across a spectrum of lifestyles. Science 376, 1220–1223 (2022).
Shkoporov, A. N. et al. The human gut virome is highly diverse, stable, and individual specific. Cell Host Microbe 26, 527–541.e525 (2019).
Zuo, T. et al. Human-gut-DNA virome variations across geography, ethnicity, and urbanization. Cell Host Microbe 28, 741–751.e744 (2020).
Schoch, C. L. et al. NCBI Taxonomy: a comprehensive update on curation, resources and tools. Database 2020, baaa062 (2020).
Jie, Z. et al. The gut microbiome in atherosclerotic cardiovascular disease. Nat. Commun. 8, 845 (2017).
Jie, Z. et al. A consortium of three-bacteria isolated from human feces inhibits formation of atherosclerotic deposits and lowers lipid levels in a mouse model. Iscience 26 (2023).
Zhou, C. et al. Metagenomic profiling of the pro-inflammatory gut microbiota in ankylosing spondylitis. J. Autoimmun. 107, 102360 (2020).
Al-Shayeb, B. et al. Clades of huge phages from across Earth’s ecosystems. Nature 578, 425–431 (2020).
Yan, W. X. et al. Functionally diverse type V CRISPR-Cas systems. Science 363, 88–91, https://doi.org/10.1126/science.aav7271 (2019).
Burley, S. K. et al. RCSB Protein Data Bank (RCSB.org): delivery of experimentally-determined PDB structures alongside one million computed structure models of proteins from artificial intelligence/machine learning. Nucleic Acids Res. 51, D488–D508 (2023).
Takeda, S. N. et al. Structure of the miniature type V-F CRISPR-Cas effector enzyme. Mol. Cell 81, 558–570.e553, https://doi.org/10.1016/j.molcel.2020.11.035 (2021).
Bankevich, A. et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19, 455–477 (2012).
Zou, Y. et al. 1520 reference genomes from cultivated human gut bacteria enable functional microbiome analyses. Nat. Biotechnol. 37, 179–185 (2019).
Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997).
Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150–3152 (2012).
Russel, J., Pinilla-Redondo, R., Mayo-Muñoz, D., Shah, S. A. & Sørensen, S. J. CRISPRCasTyper: automated identification, annotation, and classification of CRISPR-Cas loci. CRISPR J. 3, 462–469 (2020).
Skennerton, C. T., Imelfort, M. & Tyson, G. W. Crass: identification and reconstruction of CRISPR from unassembled metagenomic data. Nucleic Acids Res. 41, e105–e105 (2013).
Shannon, P. et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13, 2498–2504 (2003).
Chaumeil, P.-A., Mussig, A. J., Hugenholtz, P. & Parks, D. H. GTDB-Tk: a toolkit to classify genomes with the Genome Taxonomy Database. Bioinformatics 36, 1925–1927 (2020).
Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P. & Tyson, G. W. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 25, 1043–1055 (2015).
Letunic, I. & Bork, P. Interactive Tree Of Life (iTOL) v5: an online tool for phylogenetic tree display and annotation. Nucleic Acids Res. 49, W293–W296 (2021).
Hyatt, D. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinforma. 11, 1–11 (2010).
Mukherjee, S. & Zhang, Y. MM-align: a quick algorithm for aligning multiple-chain protein complex structures using iterative dynamic programming. Nucleic Acids Res. 37, e83, https://doi.org/10.1093/nar/gkp318 (2009).
Eastman, P. et al. OpenMM 7: Rapid development of high performance algorithms for molecular dynamics. PLoS Comput. Biol. 13, e1005659 (2017).
Yu, G., Smith, D. K., Zhu, H., Guan, Y. & Lam, T. T. Y. ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods Ecol. Evol. 8, 28–36 (2017).
Wood, D. E., Lu, J. & Langmead, B. Improved metagenomic analysis with Kraken 2. Genome Biol. 20, 1–13 (2019).
Lu, J., Breitwieser, F. P., Thielen, P. & Salzberg, S. L. Bracken: estimating species abundance in metagenomics data. PeerJ Comput. Sci. 3, e104 (2017).
Oksanen, J. et al. Package ‘vegan’: Community ecology package. Version 2, 1–295 (2013).
Guo, X. et al. CNSA: a data repository for archiving omics data. Database 2020, baaa055 (2020).
Chen, F. Z. et al. CNGBdb: China National GeneBank DataBase. Hereditas 42, 799–809 (2020).
Acknowledgements
This work was supported by grants from the National Key Research and Development Program of China (No. 2018YFC1313801), the Natural Science Foundation of Guangdong Province, China (No. 2019B020230001), and the Shenzhen Municipal Government of China (No. XMHT20220104017). We also thank the colleagues at BGI-Shenzhen for sample collection and discussions, and China National GeneBank (CNGB) Shenzhen for DNA extraction, library construction, and sequencing. We additionally thank BGI Precision Nutrition (Shenzhen) Technology Co., Ltd for their technical guidance and support.
Author information
Contributions
Conceived and designed the study: Y.Z., Z.J., H.L., Y.M., J.Z. Performed the analysis: Y.Z., Z.J., H.L., Y.M., J.Z., W.Li., X.L., T.H., W.Liang., Y.J., X.T. Contributed reagents/materials/analysis tools: Y.Z., Z.J., L.X., M.H., T.Z., X.J., X.X., J.W. Wrote the paper: Y.Z., Z.J., H.L., Y.M., J.Z. Supervised the work: H.Y., W.Z., T.Z., L.X., K.K. Revised the paper: K.K. All authors commented on the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Jie, Z., Liang, H., Meng, Y. et al. Integrating metagenomics and cultivation unveils oral phage diversity and potential impact on hosts. npj Biofilms Microbiomes 11, 145 (2025). https://doi.org/10.1038/s41522-025-00773-z
DOI: https://doi.org/10.1038/s41522-025-00773-z








