Introduction

The human gut microbiome is a complex and diverse ecosystem that is established at birth and becomes increasingly diverse over the following few years, until a stable ‘adult-like’ composition is reached during preschool years1,2,3,4,5. Within this period of maturation, several factors have been associated with differential development of the gut microbiome. The most well studied of these is birth mode6,7, but factors such as medication use8, diet2,9,10, growing up in rural or urban environments11,12,13, and the influence of siblings and pets have also been found to influence bacterial composition14,15. Early life bacterial gut microbiome dysbiosis has been associated with a range of disease outcomes in later life including asthma16,17, allergy18,19,20, and inflammatory bowel disease21. Whilst much work has been done on the bacterial component, recent studies have also shown the gut viral community to be altered in certain diseases22,23,24,25,26, suggesting that they may also play a role in disease etiology. Viruses make up a large proportion (up to 5–6% of total DNA)27 of the gut microbiome and are collectively referred to as the ‘gut virome’. Their role in the gut microbiome is a research area of growing interest although much less is yet known about them compared to their bacterial counterparts.

Although a healthy human gut virome does contain some eukaryotic viruses, here we consider only bacteriophages (phages) that infect the bacteria in the gut and make up the vast majority of the human gut virome. Shifts in the phage composition of the gut have been associated with an increasing number of diseases such as Crohn’s disease, Ulcerative Colitis22,28, and arthritis29. These phages are thought to be virulent, but previous work has shown that the gut virome contains a significant proportion of temperate phages as well30,31,32. Temperate phages can integrate into the genomes of their bacterial hosts and be maintained through generations either as a part of the host genomes or as a plasmid-like genetic element; but can also return to the lytic cycle if the bacterial host is stressed by external factors33. While less than 20% of the phages found in faecal samples from adults were predicted to be temperate31, they appear to be dominant in the infant gut virome32 and carry a specific association with later risk of asthma34. Previous studies have also found that multiple strains of bacteria from the gut contain prophages, suggesting that lysogeny is a widespread phenomenon35. Whilst virulent phages infect and kill their host cell, temperate phages may be under a selective pressure to provide useful functions to their bacterial hosts.

Prophages can influence the metabolism and function of their bacterial hosts by harbouring morons – genes that are not essential for the phage itself, but may benefit the bacterial host by providing additional functions that increase its fitness36. These can include antibiotic resistance genes37,38,39, and toxins or virulence genes that increase the fitness of pathogenic bacteria during infection40,41,42. The presence of a prophage within a bacterium also protects the host against infection by other phages. Superinfection exclusion is a mechanism by which prophages prevent infection of their host by related or more distant phages43,44,45. This is achieved by a variety of means including alterations to the cell membrane46,47, repressor-based immunity48, and inhibition of DNA translocation into the cell cytoplasm49,50, amongst others. Whilst there are several potential benefits to the bacterium from carrying a prophage, negative effects have also been identified in some cases. For example, in a monoxenic mouse model system the carriage of lambda prophage in Escherichia coli was detrimental to the host bacterium due to frequent reactivation of the prophage51. Additionally, in Streptococcus pneumoniae the carriage and expression of prophage element Spn1 has been shown to be detrimental to the fitness of the pathogen, by reducing its ability to colonise the nasopharynx52. Whether their effects are positive or negative, a growing body of evidence points to temperate phages being key components of the gut microbiome that can influence the bacterial host assemblage diversity and functional potential31,35,53,54.

Since most phages within the human gut are still uncharacterised, cataloguing and quantifying them within sequenced virome samples has been challenging. Recently however, huge catalogues of human gut phages have been published, e.g. the Metagenomic Gut Virus (MGV) catalogue55 and the Gut Phage Database (GPD)56, and these aid researchers in cataloguing and tallying their samples by simply mapping virome reads, or even shotgun metagenomics reads against them. However, as these databases are based primarily on assembled bulk metagenomics data, it is still unclear if the genomes they contain originate from actively blooming viral populations or whether they comprise fragments of chromosomally integrated prophages that might even be inactive. Likewise, numerous major human virome studies have relied on bulk metagenomics for profiling the gut virome57,58. Again, here it is uncertain to what extent such profiling covers propagating viruses or integrated prophages.

One approach to resolving this is to combine shotgun metagenomics data with bona fide virome sequencing data for the same samples, making it possible to distinguish actively induced prophages from dormant ones. It is thought that temperate phages spend most of their lives as prophages, and this is supported by experimental models where induction is only triggered following strong chemical stress59. However, within natural environments, little is known about the overall extent of prophage induction, and analysing paired metagenomic and viromic data for the same samples could answer that question systematically. In this work we utilised a set of 662 previously sequenced infant gut metagenomes60 to identify prophages and explore their role by analysing their accessory genes. The samples originate from children of the Copenhagen Prospective Studies on Asthma in Childhood 2010 (COPSAC2010) cohort, an ongoing mother-child cohort study followed since pregnancy and throughout early life with exhaustive phenotyping and sample collection. This allows for statistical testing of hypotheses about the biology of gut prophage composition during infancy and its potential link to later disease. Moreover, viral metagenomes (viromes) have previously been deeply sequenced for the same samples32. This configuration of deep metagenomics and viromics for the same samples allowed us to distinguish active prophages from dormant ones. By using the paired metagenome and virome data we were able to estimate whether the prophages were actively induced and whether this was affected by external variables such as antibiotic usage. While our work is consistent with the current hypothesis that the high virome diversity in early life stems from induced prophages, it goes on to show that this induction is pervasive and constitutive, such that most prophages in the infant gut are induced most of the time.

Methods

Study design and workflow

An outline of the study design summarising the workflow described below is shown in Supplementary Fig 1.

Sample collection

The COPSAC2010 cohort is an ongoing mother-child cohort study of 738 pregnant women and 700 children that have been followed from week 24 of pregnancy, in a protocol designed from the first COPSAC birth cohort (COPSAC2000)61. The infant faecal samples studied here were collected at 1 year of age either at the research clinic or at home by the parents following detailed instructions. All samples were mixed with 1 mL of 10% vol/vol glycerol broth (Statens Serum Institut, Copenhagen, Denmark) and stored at −80 °C until use16.

Metagenome and virome sequencing and assembly

The same samples were used for both metagenome and virome sequencing, and both datasets have been published previously32,34,60. Below, we have reproduced the relevant methodology descriptions from the original publications.

DNA extraction for metagenomic sequencing, taken from Stokholm et al.16

Genomic DNA was extracted from the infants’ samples using the PowerMag® Soil DNA Isolation Kit optimized for epMotion® (MO-BIO Laboratories, Inc., Carlsbad, CA, USA) using the epMotion® robotic platform model 5075 (Eppendorf) according to the manufacturer’s protocol with the following alterations to the workflow: 150–250 μL of the samples were added to the 96-well bead plate containing 750 μL bead/RNase A Solution and 60 μL lysis solution. Centrifugation steps were performed at 3220 × g for 9 min. Removal of enzymatic inhibitors and DNA purification were performed as described by the manufacturer. Finally, the DNA was eluted with 100 μL Tris buffer (10 mM, pH 7.5). DNA concentrations were determined using the Quant-iT™ PicoGreen® quantification system (Life Technologies, CA, USA). Extracted DNA was stored at −20 °C.

Metagenomic sequencing for 1-year fecal samples and data processing, adapted from Li et al.60

Samples were prepared with the Kapa Hyper Prep kit (for Illumina) (KAPA Biosystems, Wilmington, MA, USA). Paired-end (150 bp) sequencing of the 663 samples in the DNA library (1-year samples) was performed with the Illumina NovaSeq apparatus by Admera Health (USA). Out of these 663 samples, only one failed to produce a library. In total, 662 gut samples at 1 year of age were sequenced for this study, generating between 32.6 and 215 million 150-bp paired-end reads per sample (mean ± SD: 48 ± 15.5 million reads). The samples were sequenced in a single batch to avoid any batch effect. Bioinformatic preprocessing was parallelized using GNU Parallel version 20180722 (ref. 62). Sequencing adapters were removed using BBDuk, from BBTools version 38.19 (https://sourceforge.net/projects/bbmap/), using the default options with the following exceptions: “ktrim = r k = 23 mink = 11 hdist = 1 hdist2 = 0 ptpe tbo”. Low-quality sequences and reads shorter than 50 bases were filtered out using Sickle version 1.33 (ref. 63). Human contamination was filtered out using the BBMap feature of BBTools, with default values. The final dataset contained between 14 and 211 million clean reads per sample (mean ± SD: 46.7 ± 15.5 million reads). Clean reads were assembled with SPAdes version 3.12.0 using default metagenomics settings64. Metagenomics diversity was analysed using Nonpareil version 3.30, in kmer mode65.

Virome extraction for 1-year fecal samples, adapted from Deng et al.66

Virome Isolation from Feces

After spiking with known phages, samples were poured into a stomacher filter bag (Interscience BagPage, 100 mL, Saint-Nom-la-Bretèche, France). The mixture was homogenized (Stomacher 80, Seward, UK) for 120 s at the high level setting. Homogenized samples, from the other side of the filter in the bag, were transferred to 50 mL tubes and centrifuged at 5000 × g for 30 min at 4 °C. After centrifugation, the supernatant was filtered through a 0.45 μm PES filter (Minisart® High Flow Syringe Filter, Sartorius, Göttingen, Germany) into the bottom of the outer tube of a Centriprep 50 K device (Millipore, Burlington, MA, USA). Afterwards, the filtrate was purified and concentrated using the Centriprep 50 K device by centrifuging at 1500× g three times in a row, first time for 30 min, second time for 10 min, and third time for 3 min. Extra centrifugation time was sometimes applied to allow the liquid level in the inner tube to be similar to the outer tube. The liquid filtered into the inner tube was poured off after each centrifugation step. A volume of 200 μL SM buffer was added to the inner tube at the end and centrifuged for 3 min. After the final centrifugation, 140 μL of the concentrated virome solution remaining in the outer tube was collected. The Centriprep filter membrane was cut out and added to the virome solution before storing at −80 °C until nucleic acids extraction. The remaining volume was stored at 4 °C for plaque assays.

Nucleic acid extraction of virome from feces

The concentrated virome solution and the cut filter membrane were first treated with 1 μL of 100-fold diluted Pierce™ Universal Nuclease (Thermofisher Scientific, Waltham, MA, USA) for 5 min at room temperature, then the QIAamp viral RNA mini kit (Qiagen, Hilden, Germany) was used for viral DNA/RNA extraction following the procedures described by the manufacturer with modifications as described in ref. 67. Next, 10 μL of the extracted nucleic acids were amplified through Multiple Displacement Amplification (MDA) using the GenomiPhi V3 kit (GE Healthcare Life Sciences, Marlborough, MA, USA) following the instructions of the manufacturer, but the amplification time was shortened to 30 min (from 90 min). Finally, the amplified DNA was cleaned using a Genomic DNA Clean & Concentrator™ Kit (Zymo Research, Irvine, CA, USA) following the manufacturer’s protocol.

Virome sequencing and QC, adapted from Shah et al.32

Virome libraries were sequenced on the Illumina HiSeq X platform to an average depth of 3 Gb per sample with paired-end 2 × 150 bp reads. Satisfactory sequencing results were obtained for 647 out of 660 samples. Virome reads were quality filtered and trimmed using Fastq Quality Trimmer/Filter v0.0.14 (options -Q 33 -t 13 -l 32 -p 90 -q 13), and residual Illumina adaptors were removed using cutadapt (v2.0). Trimmed reads were de-replicated using the derep_prefix command of VSEARCH68 (v2.4.3).

Putative prophage contig identification and classification

Putative prophage sequences were identified using a combination of two methods: DeepVirFinder (v1.0)69 and VIBRANT(v1.0.1)70. Assemblies from all 662 metagenomes were concatenated into a single pool and filtered to those above a minimum length of 4 kb. These assemblies were run through DeepVirFinder (v1.0) using default parameters after creating models from all known phage genomes, downloaded from the Millardlab database in September 201971. Resulting contigs were filtered to include only those with a p-value of <0.05 after FDR correction. The assemblies were also run through VIBRANT (v1.0) setting the nucleotide input and parallelisation options only (VIBRANT_run.py -i assemlies.fna -t 16 …). Only those predicted phages of ‘medium quality’ and above were used further. Both sets of output were compared to the Refseq+plasmid database (available at https://mash.readthedocs.io/en/latest/tutorials.html) with a cut off of 95% identity using MASH (v2.2)72 and any contigs with matches that might be contamination were removed. Resulting contigs from both methods were then combined and dereplicated at 95% ANI with dedupe2.sh73. This set was also run through CheckV(v0.7.0)55 to give the details required for the minimum information about uncultivated viral genomes (MIUViG)74.

Whilst measures have been taken to remove potential bacterial contamination in these sequences by applying appropriate cut-offs with the tools used, there is always the possibility that some bacterial traces still remain and this should be considered in any future analysis. Additionally, whilst the phages identified in this work are referred to as prophages, it is important to remember that this assignment is putative, as their lifestyle has not been experimentally validated and that DNA from phage particles could also contribute to the metagenomic assemblies in addition to the DNA from bacterial chromosomes.

Predicted prophages were clustered using a network analysis performed with vCONTACT2 (v0.9.8)75, using the RefSeq release 88 database, with all other sequenced bacteriophages included using the Millardlab database as of January 202171. The resulting network was visualised in Graphia (v2.1)76.

Prophage annotation and functional analysis

Sequences were annotated using prokka (v1.14.5)77 and a custom database made from all phage genes using the Millardlab database from September 2019. The --addgenes and --locustag options were also used. Resulting amino acid files were clustered at 90% identity using CD-Hit (v4.8.1)78 and representative sequences from each cluster were analysed using EggNOG-mapper (v2.0)79 with default parameters. Phage lifestyle was predicted in the same way as Cook et al.80, using HMM profiles for proteins that indicate a temperate lifestyle, and hmmscan (v3.3)81.

Identification of previously isolated temperate phages

The genome sequences of a set of E. coli temperate phages that were isolated from the same infant faecal samples used here82 were used to evaluate how well the temperate phage identification methods worked, and as reference genomes for comparison. To identify whether any of the predicted sequences were those of the previously identified coliphages, the reference genomes were dereplicated at 95% ANI to match the predicted contigs. The dereplicated reference genomes and the prophage contigs were then mapped against each other with minimap2 using the asm20 preset (-x asm20)83. The sequences of any hits with >70% of the target contig covered were then extracted and clustered using Cluster_genomes.pl (v5.1)84 to agglomerate contigs that were >95% similar over 90% of the genome, a cut-off previously accepted as delineating the same genome85. The longest member of each cluster was kept as the representative sequence.

Distribution of prophages in individuals

A set of reference contigs was constructed using the vOTUs identified from the metagenomes. The genomes of any remaining temperate coliphages isolated from the samples that had not been identified as vOTUs using the method described above were also added. Additionally, a set of 249 reference crAss phage genomes were downloaded from the dataset constructed by Guerin et al.86, with the aim of capturing the diversity of the crAss-like family of viruses that are abundant in the gut. This set of predicted phages, coliphages and crAss phages were then dereplicated at 95% ANI using dedupe2.sh to remove any remaining redundancy73.

Trimmed and QC’d reads from individual metagenomes were mapped against the set of reference contigs using bbsplit.sh with random mapping of ambiguous reads, a minimum identity of 0.95, and the covstats option enabled73. A contig was considered present in a metagenome if ≥70% of its length was covered at a depth of ≥1×, as used by Roux et al.85. Abundances were then calculated as counts per million (CPM). To determine how often the prophages were present in the metagenomes, a binary presence/absence matrix was used so that extreme outliers in abundance would not skew results. The sum of presence for each prophage was calculated and sorted to identify the prophages present in the most metagenomes. Conversely, to determine how many prophages appeared in each child, the sum of presence in each metagenome was calculated.
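The presence call and the two summaries described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the code used in the study: the `ContigStats` record, the function names, and the example values are hypothetical, and CPM is assumed here to mean reads mapped to a contig per million reads in the sample, which the text does not spell out.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ContigStats:
    """Hypothetical per-sample mapping summary for one reference contig."""
    contig: str
    mapped_reads: int
    covered_fraction: float  # fraction of contig bases with >= 1x depth (0..1)


def is_present(stats, min_covered_fraction=0.70):
    # Presence rule from the text: >= 1x depth across >= 70% of the contig.
    return stats.covered_fraction >= min_covered_fraction


def counts_per_million(mapped_reads, total_reads):
    # Assumed CPM definition: mapped reads scaled to one million sample reads.
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return mapped_reads * 1_000_000 / total_reads


def presence_sets(samples):
    # Binary presence/absence, stored as the set of present contigs per sample.
    return {name: {s.contig for s in stats if is_present(s)}
            for name, stats in samples.items()}


def prevalence(presence):
    # Number of samples in which each contig is present.
    return Counter(c for contigs in presence.values() for c in contigs)


def richness(presence):
    # Number of contigs present in each sample.
    return {name: len(contigs) for name, contigs in presence.items()}


# Made-up example values for two samples and two vOTUs.
samples = {
    "sample_a": [ContigStats("vOTU_1", 500, 0.90), ContigStats("vOTU_2", 20, 0.50)],
    "sample_b": [ContigStats("vOTU_1", 100, 0.70), ContigStats("vOTU_2", 300, 0.95)],
}
presence = presence_sets(samples)
```

Working from the binary sets rather than from CPM values means that a single sample with very high abundance counts no more towards prevalence than any other sample in which the contig is present.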

When characterising the distribution of crAssphages in the samples the reference genomes along with any vOTUs that clustered together with them in vCONTACT were considered crAss-like prophages. The sum of presence of this subset of prophages was calculated from the presence/absence matrix.

Host prediction

Bacterial hosts of the prophages were predicted using CrisprOpenDB (v1.0)87 with one mismatch allowed. The hosts with the most prophages predicted to infect them were identified in R, and hosts predicted for more than 1% of the prophages were visualised.

Functional analysis of the most abundant prophage cluster

Proteins from all members of cluster1819 (the most abundant prophage cluster) were extracted and clustered at 90% amino acid identity (AAI) using CD-Hit78 to remove redundancy. These were then analysed with EggNOG-mapper(v2.0)79 as previously described to assign Clusters of Orthologous Groups (COG) categories (Supplementary Table S2). Additionally, resulting Kegg Orthology (KO) codes were mapped onto metabolic pathways using KEGGmapper88.

Phylogenetic analysis of the most abundant prophage cluster

An initial BLAST search showed that Bacteroides phage Hanky p00’ (Hankyphage) was the only phage with significant sequence similarity to the members of cluster1819: the high-quality vOTU_03578, used as a representative of the cluster, had 99.90% identity and 74% query coverage with Bacteroides phage Hanky p00’. Putative terminase genes from all cluster members were therefore identified through analysis of the Hankyphage p00’ genome. The terminase protein sequence of Hankyphage p00’ was downloaded and used to identify the same protein in the cluster members using HMMsearch66, as no protein had been annotated as such. The protein sequences were aligned with ClustalW in MEGA (v10.1.8) and a maximum likelihood tree was also produced in MEGA using the JTT model and 100 bootstraps89. The tree was visualised and manually coloured in iTOL90.

Identification of diversity generating retroelements in the most abundant prophage cluster

Diversity generating retroelements were identified in members of cluster1819 by using both the myDGR web server91 and MetaCSST(v1.0)92 tools. To predict whether the target genes identified were putative tail fibre genes, as has previously been suggested93, the proteins from all members were analysed with PhANNs(v1.0.0)94 and the most significant hit to a tail fibre gene was carried forward. These were then compared with the results from the previous tools.

Determining active prophages

The paired nature of the metagenome and virome sequencing of the samples allowed for a novel exploration of whether the predicted prophages were induced at the time of sampling. Individual virome sample reads were mapped against the original metagenome assemblies containing the prophages that had been excised by the prediction tool VIBRANT70 using bbsplit.sh with a minimum identity of 0.95 and random mapping of ambiguous reads73. A total of 4291 prophages were eligible for this analysis. Using the predicted coordinates of the prophages, the coverage of each background bacterial and predicted prophage region of an assembly was extracted from a bam file that had been sorted and indexed using samtools95. For this analysis it was assumed that there was only a single prophage region per contig; there were only a minority of cases where multiple prophage regions had been predicted for a contig and had passed the quality cut-offs used in this work. Of note, the assumption of a single prophage region may lead to a small number of false negatives in the induction analysis. A small number of virome samples were also mapped against three large chromosomal contigs that were not predicted to contain any prophages as a negative control. The number of reads mapped to sections of 40 kb (the mean size of prophages in this work) were extracted from different regions of the assembly, in the same way as described above to mimic the presence of a prophage and allow us to test for induction in these negative controls.

To determine statistically if prophages were induced, the number of reads mapped to the bacterial part of the assembly and the number of reads mapped against the predicted prophage part were tested for a binomial distribution using pbinom in R(version 3.6.1), and the resulting p-values were corrected for multiple testing using the Bonferroni method. Frequency of significant induction across samples was tested against predicted host and mean RPK per prophage using linear models. Induction patterns of vOTUs were derived from a binary matrix of vOTUs vs samples (Bonferroni-significant induction yes/no) using principal component analysis (PCA) with vOTUs as observations and samples as features to visualise similarities between vOTUs regarding which samples they were induced in. In parallel, the same matrix was transformed to a euclidean distance matrix and analysed with PERMANOVA (adonis2 from the R-package ‘vegan’ v. 2.6-4; with option by = “margin”) to quantify similarities in induction patterns associated with predicted host (top hosts, genus level, vs others as shown in Fig. 5A) and viral cluster (each top cluster vs others as shown in Fig. 5C).
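The induction test can be illustrated with a small standard-library Python sketch. The analysis in this work used pbinom in R; the version below is not that code. It assumes a one-sided test in which, under the null hypothesis, virome reads fall on the prophage region in proportion to its share of the contig length. The text does not state the null proportion, so this choice, together with all function names and example numbers, is an assumption made for illustration.

```python
import math


def log_binom_pmf(k, n, p):
    # log P(X = k) for X ~ Binomial(n, p), with 0 < p < 1.
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def binom_upper_tail(k, n, p):
    # P(X >= k) for X ~ Binomial(n, p), summed in log space for stability.
    if k <= 0:
        return 1.0
    if k > n or p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    logs = [log_binom_pmf(i, n, p) for i in range(k, n + 1)]
    peak = max(logs)
    # Extremely small tails underflow to 0.0, which is acceptable here.
    return min(1.0, math.exp(peak) * sum(math.exp(x - peak) for x in logs))


def induction_p_value(prophage_reads, background_reads, prophage_len, contig_len):
    # Assumed null: reads land on the prophage in proportion to its length.
    total = prophage_reads + background_reads
    expected_share = prophage_len / contig_len
    return binom_upper_tail(prophage_reads, total, expected_share)


def bonferroni(p_values):
    # Multiply each p-value by the number of tests, capped at 1.
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Made-up example: a 40 kb prophage on a 200 kb contig in two virome samples.
p_enriched = induction_p_value(900, 100, 40_000, 200_000)  # reads pile up on prophage
p_flat = induction_p_value(20, 80, 40_000, 200_000)        # reads spread evenly
adjusted = bonferroni([p_enriched, p_flat])
```

In this toy case the first sample places 90% of its reads on a region spanning 20% of the contig and is called induced, whereas the second matches the length-proportional expectation and is not.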

Environmental and clinical factors

To study factors potentially influencing prophage induction, we compared children according to key factors associated with microbiome composition: antibiotics (yes/no), delivery mode (c-section/vaginal), furred pets (yes/no), gastrointestinal infection (yes/no), living environment (urban/rural), and siblings (any/none). Antibiotic exposure was defined as any prescription of ATC code starting with J01 (Antibacterials for systemic use) recorded at the 1-year visit in the Danish prescription register. Delivery mode, furred pets, birth address, and any siblings in the home were assessed by parental interview at the planned 1 week and 1 year visits to the research clinic. Living environment was defined by converting addresses to coordinates and mapping to 100 × 100 m raster maps from the CORINE database of European land cover (https://land.copernicus.eu/) in a 3 km radius and performing PAM clustering on the composition of 5 major land cover types, as previously described in detail13. Gastrointestinal infections were assessed from a prospective symptom diary as any diarrhoea or vomiting within 7 days of the sample collection. Associations between these factors and induction rates were assessed using Wilcoxon tests of sample-wise induction percentages (within-sample matched vs non-matched virome-contig pairs analysed separately) and multiple testing was controlled using false discovery rate (FDR) adjustment and expressed as q-values.
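As an illustration of the FDR step, the sketch below converts a set of p-values into q-values with the Benjamini-Hochberg procedure. The text does not name the FDR method, so Benjamini-Hochberg is an assumption here, and the function name and example p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg q-values in the same order as the input."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


# Made-up p-values standing in for four of the tested factors.
example_q = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```

A factor would then be reported as associated with induction rate only if its q-value falls below the chosen threshold.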

Results

Identification and classification of novel prophages in the infant gut

From the 662 infant metagenomes obtained at age 1 year, we identified 10645 vOTUs after dereplication (Supplementary Table 1).

We grouped these viral contigs into approximate genus or subfamily-level classifications using vCONTACT2 and included the genomes of all sequenced bacteriophages as references, resulting in 2934 clusters. Of these, 364 were singletons and 2221 were outliers. Of all the clusters identified, 953 were composed solely of vOTUs identified in this work, and of these 953 clusters, 177 were made up of a single member (Fig. 1). The hosts of 65% of the vOTUs could be predicted; the most common assignment at the genus level was Bacteroides (12.2%), followed by Salmonella/E. coli (6.3%) and Bifidobacterium (5.8%) (Fig. 2A).

Fig. 1: vCONTACT2 network analysis of vOTUs from this study and a database of phage genomes extracted from millardlab.org in January 2021.

Each node represents a viral genome: vOTUs identified in this work are coloured in blue and reference genomes are grey. The largest and key viral families have been annotated, and viral clusters characterised in this work (VC_1819 and additional crAss-like prophages) have also been annotated. The number of clusters highlighted in blue, and their distribution throughout the network reflects the diversity of vOTUs identified.

Fig. 2: Characterising putative prophages.

A Host prediction for the putative prophages. Only those that make up greater than 1% of the predicted hosts are shown. Salmonella and Escherichia have been grouped together, as CrisprOpenDB is known to not be able to differentiate between the two, due to the high number of Salmonella spacers available in the database. This is likely to be an overprediction of Salmonella. B Top 50 most prevalent vOTUs coloured by their viral cluster.

Previous work using these samples resulted in isolation of 35 Escherichia temperate phages and sequencing of their genomes82. There were five vOTUs that shared significant similarity with these isolated phages (Table 1). The longest sequence from each cluster was kept as representative, resulting in four predicted sequences being replaced with the isolated phage genomes and one isolated phage genome replaced by a predicted prophage genome. The ability to identify five vOTUs with similarity to previously isolated phages may reflect a level of microdiversity within this group of closely related phages. Both microdiversity and close similarity of sequences within a sample are known to cause problems with assembly and may explain why more were not found96,97. It may also reflect the fact that the isolated phages may not be abundant enough in the metagenomic sequence data to assemble fully.

Table 1 Percent identity of vOTUs identified bioinformatically in this work, and Escherichia coli temperate phages isolated in previous work82

Abundance analysis suggests no core provirome is established in infants

The distribution of the number of prophage vOTUs in each sample shows a mean of ~100, with values as low as four and as high as 400. No single vOTU was found in all children. The most widespread vOTU was found in ~70% of the children (Fig. 2B), which is below the 95% cut-off used in this work to designate a prophage as core. Using a 50% cut-off that has been used in previous work for the same designation34,98 results in one additional vOTU. Whilst no individual prophage could be found in all samples, the top 50 most prevalent prophages were spread between only eight viral clusters (excluding those without an assigned cluster) (Fig. 2B), showing more conservation at the genus/subfamily level than at the level of individual viral contigs. To further examine this, we compared the prevalences of each viral cluster analogously to Fig. 2B and found no evidence of a core provirome at the cluster level either (Supplementary Fig 2), with only one viral cluster present in more than 50% of the samples, excluding singletons and outliers.

CrAss-like prophages were present in 195 (29%) of the samples sequenced, and if a sample had crAss-like prophages identified, it was likely to possess only one type, as only 38% of the crAss-positive subjects contained two or more types. The identification of 109 vOTUs that clustered together with the reference crAssphage genomes has also expanded our knowledge of crAssphage and crAss-like phages, particularly those of the infant gut: an environment where they are thought to be present much less frequently than in adults86.

The functional potential of viral OTUs showed no significant patterns on the individual phage level

The percentage of all proteins assigned to the different COG categories showed that the majority (64.9%) of proteins fell into category S – those of Unknown Function. This was followed by categories reflecting viral replication: Replication, Recombination, and Repair (12.9%); Transcription (6.6%); and Cell Wall/Membrane Biogenesis (3%). Other categories each accounted for very small percentages of the total proteins.

Cluster1819 is the most abundant phage cluster and contains an abundance of DGRs and morons

Cluster1819, containing 82 members, was found to be the most abundant in the samples (Fig. 3A) and is the second most prevalent cluster in the children (Supplementary Fig 2) so was characterized in more detail. Phylogenetic analysis of the large terminase gene revealed a single relative: Bacteroides Hankyphage p00’ (Accession BK010646) (Fig. 3B). The bacterial host for Hankyphage was previously identified as a Bacteroides which is the same as the predicted host for many members of this cluster81. However, there were variations on this with some vOTUs predicted to infect Prevotella and Butyricimonas.

Fig. 3: Characterisation of cluster1819.

A Cumulative abundance of the top 10 most abundant viral clusters. B Phylogenetic tree based on a terminase protein alignment for members of cluster1819 and including the Bacteroides phage Hankyp00’. Only bootstrap values > 70% are shown as circles. The next most closely related phages were used as an outgroup due to their limited genome similarity. Any known taxonomy has been highlighted in red. C COG Category assignments for proteins belonging to members of cluster1819. The ‘other’ group is made up of those categories that comprised less than 1% of the total assignments.

The combination of MetaCSST and MyDGR identified 53 diversity-generating retroelements (DGRs) in the cluster, an element that is also present in Hankyphage. Of the 82 cluster members, 49 contained at least one DGR, with some containing more than one, and the target sequences were used to predict the gene in which each DGR would generate diversity. PhaNNs was used to predict the structural genes of the cluster members, including the tail fibre genes, which are a common target for DGRs; when the target gene sequences were compared with the structural gene predictions, all target genes were predicted to be tail fibres.

The majority of identified proteins in cluster 1819 belong to category S, those of unknown function (Fig. 3C). This is followed by category L, which represents proteins involved in replication, recombination, and repair; categories O and V are equally abundant and represent proteins involved in post-translational modification and defence mechanisms, respectively. Cell cycle control and nucleotide transport and metabolism are also categories of note. Combining these COG codes with the KEGG pathway database gave a clearer overview of which host pathways could be affected by these phages. These pathways included dTDP-L-rhamnose biosynthesis and menaquinone biosynthesis, amongst others (Supplementary Table S2).
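The category breakdown summarised in Fig. 3C can be illustrated with a short sketch. It assumes one COG category letter per annotated protein (annotation tools can assign several letters to one protein, which is not handled here), and pools categories below 1% into an ‘other’ group as described in the figure legend. The function name and the input counts are hypothetical and are not data from this study.

```python
from collections import Counter


def cog_category_percentages(categories, other_threshold=1.0):
    """Return {category: percent of proteins}.

    Categories whose share is below `other_threshold` percent are pooled
    into a single 'other' entry, mirroring the grouping used in Fig. 3C.
    """
    counts = Counter(categories)
    total = sum(counts.values())
    if total == 0:
        return {}
    percentages = {}
    other = 0.0
    for category, n in counts.items():
        pct = 100.0 * n / total
        if pct < other_threshold:
            other += pct
        else:
            percentages[category] = pct
    if other > 0:
        percentages["other"] = other
    return percentages


# Hypothetical input: one COG letter per protein (illustrative counts only).
example = ["S"] * 650 + ["L"] * 130 + ["O"] * 115 + ["K"] * 70 + ["M"] * 30 + ["F"] * 5
result = cog_category_percentages(example)
for category, pct in sorted(result.items(), key=lambda item: -item[1]):
    print(f"{category}: {pct:.1f}%")
```

In this toy input, category F makes up 0.5% of the proteins and is therefore reported only as part of ‘other’.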

Estimation of the proportion of putative active prophages

Previously, we have shown that the mean relative abundances of both virulent and temperate phages in the virome are highly positively correlated with the corresponding abundances of their cognate host bacteria in the metagenome, at least at the host genus level and study-wide32. However, differences could still exist between different prophage clusters or even between samples, holding important clues about their biology.

Here we used the results of the virome reads mapped against the large metagenome contigs containing prophages to discover prophages that were potentially induced and present in the samples as actively propagating viral particles. A subset comprising 4291 of the predicted prophages could be tested for induction via a read mapping approach, as they had been excised from a larger assembly in the prediction process, thus allowing for a comparison against the bacterial background on the same contig. We quantified and tested induction as the degree of preferential mapping of virome reads inside the predicted bounds of the prophage compared with the rest of the contig; for examples see Fig. 4A, B. This showed that induction is a widespread phenomenon: 4041 (94.2%) of the prophages were induced in at least one sample and remained significantly so (p < 0.05) after Bonferroni correction, resulting in 4.59% significant virome-prophage contig pairs (Fig. 4C). Only 250 prophages were never found to be significantly induced in any sample (Fig. 4D). When only considering contigs with 100 reads or more mapping in a sample, 83,418 out of 321,232 pairs (26.0%) were found to be significant.
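The idea of preferential mapping can be made concrete with a minimal sketch. The statistical test used in this study is not specified in this section, so the sketch assumes a one-sided binomial test in which, under the null hypothesis, virome reads fall uniformly along the contig. The function names, the log2-fold threshold of 1, and the number of tests are illustrative assumptions, not the pipeline used here.

```python
import math


def log_binom_pmf(k, n, p):
    """Natural log of the Binomial(n, p) probability mass at k."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space for stability."""
    if k <= 0:
        return 1.0
    logs = [log_binom_pmf(i, n, p) for i in range(k, n + 1)]
    m = max(logs)
    return min(1.0, math.exp(m) * sum(math.exp(x - m) for x in logs))


def induction_test(reads_in, reads_total, prophage_len, contig_len,
                   n_tests, alpha=0.05, min_log2_fold=1.0):
    """Return (log2 fold coverage, p-value, significant) for one virome-contig pair.

    Assumed null model: each read lands inside the prophage region with
    probability prophage_len / contig_len. Significance requires both the
    fold threshold and the Bonferroni-adjusted cutoff alpha / n_tests.
    """
    if not 0 < prophage_len < contig_len:
        raise ValueError("prophage must be shorter than its contig")
    reads_out = reads_total - reads_in
    cov_in = reads_in / prophage_len
    cov_out = reads_out / (contig_len - prophage_len)
    if cov_in == 0:
        log2_fold = float("-inf")
    elif cov_out == 0:
        log2_fold = float("inf")
    else:
        log2_fold = math.log2(cov_in / cov_out)
    p_value = binom_upper_tail(reads_in, reads_total, prophage_len / contig_len)
    significant = log2_fold > min_log2_fold and p_value <= alpha / n_tests
    return log2_fold, p_value, significant


# Hypothetical pairs: a 40 kb prophage on a 200 kb contig, 1000 mapped reads.
induced = induction_test(900, 1000, 40_000, 200_000, n_tests=1000)
not_induced = induction_test(200, 1000, 40_000, 200_000, n_tests=1000)
print(induced[2], not_induced[2])
```

In the second pair the prophage region receives exactly its length-proportional share of reads, so the log2 fold is zero and the pair is not called induced.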

Fig. 4: Evaluating induction in prophages using virome read mapping.

A, B Looking across the entire prophage carrying contig, we assessed induction as differential virome read coverage inside the predicted prophage region (red lines) versus the rest of the contig, which was considered as background. These two examples were chosen to illustrate this phenomenon as a positive (A) and negative (B) example. C Volcano plot showing log2-fold induction vs. p-value (double log scale) distribution of all prophage/sample pairs. All prophage/sample pairs with an induction value > 1 and passing the Bonferroni cutoff were considered significant (red dots), which comprised 4.59% of the entire set. The red area looks larger due to massive overplotting in the lower part of the panel. D Histogram of all prophages by how many children they were significantly induced in, ranging from 0% (no children, 250 prophages) to ~60% of all the children.

Comparing prophages across different predicted hosts, we found differences in the fraction of induced samples (Fig. 5A, log linear model p < 2e−16). Among the genera with the highest rates of induction were Blautia, Bifidobacterium, and Erysipelatoclostridium. The most abundant group of prophages in the metagenomes, cluster1819, was among the most commonly induced (Fig. 5B, C). Comparing the cluster against the rest of the predicted Bacteroides phages showed that the phages belonging to cluster1819 were more often significantly induced than the rest of the group, and more often than all prophages with different predicted hosts.

Fig. 5: Comparing induction frequencies across vOTU and sample characteristics.

A Induction fraction by predicted host of the prophages, showing sizable differences between hosts (overall p < 2e−16), even when adjusted for mean RPK of the prophages. B Induction frequencies of cluster 1819 prophages are much higher than those of other prophages with Bacteroides as their predicted host, as well as those of all other prophages with different predicted hosts. C Principal Component Analysis (PCA) in which each dot is a prophage ordinated according to its induction pattern, i.e. which children the prophage was induced in. Points that are close together signify prophages that tend to be induced in the same children. Points are coloured according to some of the major viral clusters that show homogeneous induction patterns, notably including cluster 1819.

We found similar patterns of induction, i.e. which samples they were significantly induced in, across prophages belonging to the same viral cluster (Fig. 5C), suggesting that specific clusters of prophages are induced together within a sample (PERMANOVA, F = 85.1, R2 = 13.7%, p < 0.001). A similar phenomenon was seen when comparing induction patterns between vOTUs according to predicted host (F = 20.2, R2 = 7.0%, p < 0.001). These effects were only slightly attenuated after including both viral cluster and predicted host in the model; both had significant contributions to the variation in induction patterns (Viral cluster, F = 75.7, R2 = 11.5%, p < 0.001; Predicted host, F = 16.0, R2 = 4.9%, p < 0.001).

The frequency of significant induction was also associated with the mean RPK of the prophages (log linear model, Supplementary Fig. 3); however, adjusting for this did not change the conclusions above.

Next, we examined induction among virome-prophage contig pairs originating from the same sample, compared with those from different samples. For matched virome-prophage contig pairs, a much higher rate of induction was seen (Fig. 6A) than for non-matched pairs. However, the mapping rate was also much higher among these pairs (Fig. 6B, C), which also influenced induction rates. Therefore, we considered both factors simultaneously and saw that in matched pairs, higher mapping rates were associated with higher rates of induction, up to around 75% in contigs attracting many reads. This was only partially seen for non-matched pairs, where induction rates increased until 100–200 reads and thereafter declined slightly (Fig. 6D).
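The stratified comparison shown in Fig. 6D can be sketched as follows. The sketch assumes each virome-contig pair is summarised by its mapped read count, whether virome and contig come from the same sample, and whether induction was significant. The bin edges, function names, and example pairs are hypothetical and were not taken from this study.

```python
from bisect import bisect_right
from collections import defaultdict


def reads_per_kilobase(reads, contig_length_bp):
    """Length-adjusted mapping rate: mapped reads per 1000 bp of contig."""
    return reads / (contig_length_bp / 1000.0)


def induction_rate_by_read_bin(pairs, bin_edges=(1, 10, 100, 1000)):
    """Fraction of significantly induced pairs per (matched, read-count bin).

    `pairs` is an iterable of (reads, matched, induced) tuples. Bin index i
    covers bin_edges[i - 1] <= reads < bin_edges[i]; index 0 holds counts
    below the first edge and the last index holds counts at or above the
    final edge.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for reads, matched, induced in pairs:
        key = (matched, bisect_right(bin_edges, reads))
        totals[key] += 1
        hits[key] += int(induced)
    return {key: hits[key] / totals[key] for key in totals}


# Hypothetical virome-contig pairs: (reads mapped, same sample?, induced?).
pairs = [
    (150, True, True), (500, True, True), (300, True, False),
    (120, False, False), (5, False, False), (7, True, True),
]
rates = induction_rate_by_read_bin(pairs)
for (matched, bin_index), rate in sorted(rates.items()):
    label = "matched" if matched else "non-matched"
    print(f"{label}, bin {bin_index}: {rate:.2f}")
```

Comparing rates within the same read-count bin separates the effect of sample matching from the effect of mapping depth, which is the reason for considering both factors at once.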

Fig. 6: Investigating within-sample induction and contributions of environmental factors.

Comparing all samples’ viromes mapped against prophage contigs assembled in same vs other samples, expressed as sample-wise A mean induction rate (i.e. per-sample mean of prophages that were significantly induced in same vs other samples), B median number of reads mapped to contig (per-sample mean of same vs other), and C median reads per kilobase (RPK), i.e. length-adjusted mapping rate. D Analysis of within-sample vs between-sample induction estimates adjusted for number of reads mapped to contig, showing that contigs with high mapping rates in particular exhibit high rates of induction. Rates of all virome-contig matches are summarised within bins of the number of total reads per contig. E Comparison of sample-wise induction rates split into sample-matched and non-matched virome-contig pairs, comparing samples according to key environmental factors. Children with siblings exhibit lower rates of induction compared to those without siblings.

Finally, we examined whether induction rates differed between samples according to environmental and clinical factors known to be associated with microbiome composition. We analysed matched vs non-matched virome-contig pairs separately, summarising per child to allow a fair comparison. We found no association between induction rate and recent antibiotic treatment, delivery mode, having furred pets, or an urban vs rural living environment. Children who had siblings had a slightly lower induction rate than those without siblings, but this did not remain significant after FDR correction (Wilcoxon test, p = 0.00495, q = 0.059).
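How a nominally significant p-value can lose significance after FDR correction is shown in the sketch below. The FDR procedure is not named in this section, so the sketch assumes the widely used Benjamini-Hochberg adjustment; the function name and the p-values are hypothetical and do not reproduce the tests performed here.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted q-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is capped by the
    # adjusted values of all larger p-values (enforces monotonicity).
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


# Hypothetical p-values for five tests (illustrative only).
p_values = [0.004, 0.03, 0.2, 0.5, 0.8]
q_values = benjamini_hochberg(p_values)
for p, q in zip(p_values, q_values):
    print(f"p = {p:.3f}, q = {q:.3f}")
```

Under this adjustment the smallest p-value is multiplied by the number of tests, so the more factors are examined, the smaller a p-value has to be to stay below the FDR threshold.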

Discussion

This work sought to deeply characterise prophages in the infant gut, and to highlight novel aspects of the associated phage biology. By combining machine learning based identification methods we maximised our ability to identify integrated prophages from metagenomic assemblies. Of those identified, no single prophage could be found in more than 70% of the samples, suggesting that there is no core set of prophages in the infant gut. Much debate has taken place previously over the existence of a core virome, with early work on the topic identifying 23 phage contigs shared by at least 50% of samples from 62 individuals98; however, more recent studies with a larger sample size found that the most ubiquitous viral population was only present in 39% of the metagenomes used, and most of the populations were only sporadically detected at all99. These results suggest that infant viral communities are more unique to the individual than they are commonly shared. Considering the dynamic nature of the infant gut microbiome, it may not be surprising to find prophages distributed sporadically throughout the samples as they adapt to changing bacterial host abundances100,101.

We were able to assign predicted hosts to ~65% of the prophages identified here, with Bacteroides, Salmonella/Escherichia, and Bifidobacterium the most common hosts; all are key members of the infant gut microbiome1,16. Salmonella and Escherichia predictions were grouped together as the tool used has difficulty distinguishing between them87. Thus, these are likely to be Escherichia-infecting phages rather than Salmonella-infecting phages. It is important to remember that these are predictions and putative hosts have not been experimentally validated. We included some experimentally validated coliphages from the same samples with very high sequence similarity. While the detection of phages with complementing methods is a strength of the study, there is also a potential for bias from increased sensitivity to those coliphages.

We also specifically looked at the prevalence of crAssphages, a well-known and large family of phages that are widespread in gut viromes86. In addition to the reference crAss-like phages that were used, we also identified 109 additional vOTUs that clustered together with them and so were considered part of the crAss-like group. Our results strongly support previous work that has suggested that crAssphages are not abundant or prevalent in infants. Initial reports on crAssphages demonstrated an indeterminate infection strategy, showing no clear signs of lysogeny and unusual lytic infection behaviour102. Our results support the suggestion that at least some of the crAss-like phages are temperate103.

Classification of the putative phages combined with the sequences of all sequenced bacteriophage genomes resulted in the creation of 2934 clusters with 364 singletons. Of all the clusters identified, 953 were comprised solely of vOTUs identified in this work and of these 953 clusters, 177 were made up of a single member.

A more in-depth analysis of the most abundant cluster of prophages revealed an interesting perspective into their potential role in the infant gut. The most abundant group, Cluster1819, falls within the proposed Hannahviridae family32 and comprises 82 prophages closely related (genus or sub-family level) to Bacteroides phage Hanky p00’. This phage was originally identified as an integrated prophage of Bacteroides dorei from metagenomic data93. Hankyphage has been predicted to be present in half of the human population from geographically distant regions, has been found to lysogenize at least 13 different species of Bacteroides93, and is now shown to be abundant in this young age group.

The broad host range of these phages is due to the possession of diversity-generating retroelements (DGRs) that target tail fibres93, which were also found in the present study. Over half of the members of the viral cluster identified here were found to contain at least one DGR, all of which were predicted to target tail fibres. This adaptation to infect different Bacteroides may be vital for their success in this dynamic environment. The cluster was predicted to infect a few different hosts, which could be an artefact of the host prediction method used or due to changing tail fibres. Without experimental evidence this cannot be validated, but it remains intriguing. Importantly, CRISPR-Cas systems are subject to horizontal gene transfer between related host genera, making CRISPR-based host predictions inherently uncertain, and especially insensitive for bacteria that do not normally carry them.

In addition to harbouring DGRs, members of the cluster also possessed several morons, or auxiliary metabolic genes, that may prove beneficial to their host or influence bacterial metabolism. We found genes involved in the dTDP-L-rhamnose biosynthesis pathway, which is responsible for biosynthesis of the O-antigen of lipopolysaccharides in Gram-negative bacteria104,105,106. Bacteroides in particular are known to produce a number of phase-variable capsular polysaccharides (CPS)107,108, which are involved in host-tropism of Bacteroides-targeting phages107. The phase-variable expression of these CPS creates diversity that may help to ensure host infection resilience by maintaining differentially susceptible subpopulations107 and help the phage by superinfection exclusion of other phages46,47. This echoes the piggy-back-the-winner model proposed for the crAss-like phage crAss001 and its Bacteroides host109. Finally, the O-antigen is of importance for recognition by the human immune system and the pathogenicity of the bacterium110,111, as well as the bacteria’s ability to bind to and infect epithelial cells112.

Another gene of interest found in the cluster was menA: a component of the menaquinone (vitamin K2) biosynthesis pathway, an important part of the electron transfer pathway in prokaryotes and vital to humans in the blood clotting process, and bone and nervous system health113,114,115,116,117. The importance of microbially synthesised vitamin K is debated due to the low amount of total vitamin K it would be contributing118,119,120, although its contribution may be more significant in infants121.

Further work is needed to characterise other families of prophages in the infant gut.

Our induction analysis shows that, out of the subset of phages we were able to test, most prophages that were identified in a sample were also induced. This suggests that these prophages are an active part of the community and may play a prominent role in the shaping of the bacterial community in the gut. Previous work has suggested that the infant gut in particular may be dominated by temperate phages that may be induced due to the high turnover rate/constant maturation of the bacterial community during the first few years of life30,31,35. However, whether pervasive prophage induction is also characteristic of more mature gut microbial compositions is still not known. One study of mice colonised with the Oligo Mouse Microbiota community (OMM12) also found widespread prophage induction, but this could equally be attributed to their gnotobiosis, known for the stress it causes hosts122.

Environmental conditions can lead to the induction of prophages from their bacterial hosts, leading to lytic replication and the production of progeny phages. A number of factors have been shown to induce prophages, such as certain chemicals (mitomycin C) and antibiotics such as fluoroquinolones123,124,125. More recently, common oral medications such as the nonsteroidal anti-inflammatory drug diclofenac, and other antibiotics including ampicillin, norfloxacin, and ciprofloxacin, were shown to induce prophages from bacterial isolates of the human gut126. Our work showed no major effects of the clinical and environmental factors tested on the proportion of induced prophages. A priori, we would have assumed that antibiotics in particular could have the potential to influence the overall induction level, but no differences were seen. This may be due to the specific antibiotics used, as most children received regular penicillins, which may not have the same induction potential as the specific aforementioned drugs. It could also be due to time limitations of the method; evidence of induction may have already been turned over in the 4 weeks preceding sampling, and this may be why we cannot see it here. Furthermore, we only had one time point per child; sampling before and after treatment may better uncover changes in induction. We found a nominally significant reduction in the overall induction level in children with siblings. However, this did not remain significant after FDR adjustment. Future clinical studies should examine the potential effect of siblings further.

Sequence composition and genomic structure may influence sequencing efficiency, which can lead to uncertainty or noise in the mapping. Together, this highlights the need to be careful when interpreting the ‘snapshot in time’ of a population from a single time point and indicates the need for more longitudinal data to characterise the biology of prophages in the infant gut.

In summary, our results show that prophages of the infant gut form a diverse community that is different in each individual; no conserved core provirome of temperate phages was apparent at the vOTU level. Our work utilises a large infant cohort to support the previous observation that crAss-like phages are present in small numbers early in life. We also identified a novel cluster of prophages that are the most abundant in the metagenomes, which fall within the newly proposed Hannahviridae and are related to Bacteroides phage Hanky p00’. The possession of DGRs targeting tail fibres in members of this cluster suggests they may be able to infect a range of bacterial hosts. We also found evidence that they may modify host LPS through possession of components of the dTDP-L-rhamnose pathway. Therefore, this group of phages possesses elements that may allow them to maintain differentially susceptible subpopulations of their host bacterium, whilst also containing DGRs that could expand their host range. By utilising the paired metagenome and virome sequencing, we were able to show that, of the identified prophages we were able to test, the majority were induced. However, testing induction against antibiotic usage and other factors revealed no significant associations, although this may reflect the speed at which the evidence of induction is turned over in the gut, highlighting the need for more longitudinal data in the field.