Introduction

Bacteria of the genus Sulfurimonas within class Campylobacterota (formerly Epsilonproteobacteria)1 are one of the most widespread and dominant groups in deep-sea hydrothermal vents. They can grow chemolithoautotrophically with various electron donors, electron acceptors, and inorganic carbon sources2,3, playing important roles in the biogeochemical cycles of hydrothermal ecosystems. Besides, sequences of Sulfurimonas have been identified in the pelagic redox cline, coastal sediments, and many different types of terrestrial habitats worldwide3. So far, 13 Sulfurimonas species have been isolated and characterized2,4,5,6,7,8, while an increasing number of metagenome-assembled genomes (MAGs) and single-cell amplified genomes (SAGs) have been deposited in the public database. For example, an uncultivated Sulfurimonas species was described recently, which was globally abundant and active in deep-sea oxygen-saturated hydrothermal plumes9.

As the most abundant and diverse biological entities on earth, bacteriophages (or phages for short) play significant roles in shaping the structure of microbial communities and contribute to global carbon and nitrogen cycling10. During infection, phages can reprogram host metabolism through the expression of viral-encoded auxiliary metabolic genes (AMGs), and phage-mediated horizontal gene transfer (HGT) can also influence host fitness, driving microbial evolution and diversification11. Typically, phages follow one of three major development strategies: lytic, chronic, and lysogenic infections12. While lytic phages kill their hosts through cell lysis, chronic phages release virions without killing the hosts, and phages undergoing the lysogenic cycles can remain latent and replicate as prophages. It is estimated that prophage elements may account for 10 to 20% of the bacterial genomic DNA, and comprise a large proportion of strain-specific differences within species13,14.

Over the last decade, our knowledge of viral diversity has greatly expanded. With the development of high-throughput sequencing and bioinformatic tools, a large number of novel viral populations and virus-host linkages have been revealed by culture-independent surveys15,16,17,18,19. These uncultivated virus genomes (UViGs) now already represent the vast majority of the taxonomic diversity in public viral genome databases20. In this context, new virus species and higher taxa recovered from sequencing data alone are now accepted by the International Committee on Taxonomy of Viruses (ICTV)21.

Although the diversity of the viral world has been increasingly unveiled, phages infecting Sulfurimonas have never been isolated and characterized to date. Here, we present a comprehensive study of the phylogenetic diversity, genomic features, biogeographic distribution, and phage-host interaction patterns of the Sulfurimonas-associated phages. The results shed new light on the coevolution of this ecologically important genus and their phages and provided a robust foundation to further characterize the biology and ecological roles of these novel viral lineages.

Results and discussion

Overview of prophages and associated UViGs in Sulfurimonas

To determine the prevalence of phage-like elements within Sulfurimonas genomes, we screened genomic assemblies of 219 Sulfurimonas strains (Supplementary Data 1) using a combination of several different methods. As a result, 43 putative phage contigs were identified from 14.6% (32/219) of the Sulfurimonas genomes examined here (Supplementary Data 2). In addition, we mined the IMG/VR v.4 dataset20 for UViGs that were associated with Sulfurimonas. Based on sequence homology, k-mer frequencies, tRNA sequences, and CRISPR spacer similarity, a total of 40 UViGs were predicted to infect the genus Sulfurimonas (Supplementary Data 2).

The Sulfurimonas-associated prophages and UViGs are widely distributed across global oceans and continents. Most of these phages were recovered from marine ecosystems, including deep-sea hydrothermal vents, pelagic, and coastal environments, while a relatively small part of the phages were of terrestrial origin (Fig. 1, Supplementary Data 2). All the sequences were combined and clustered at 95% average nucleotide identity (ANI) and 80% coverage, resulting in 61 viral operational taxonomic units (vOTUs, Supplementary Data 3). These vOTUs were assigned to three different viral realms, i.e., Duplodnaviria (viruses with double-stranded DNA genomes; 40 vOTUs), Monodnaviria (viruses with single-stranded DNA genomes; 16 vOTUs), and Varidnaviria (a portmanteau of various DNA viruses; 5 vOTUs).

Fig. 1: Geographical distribution of the predicted Sulfurimonas phages.
figure 1

Each location was represented by a symbol proportional to the number of phage contigs. The phages belonging to different viral realms were presented by different shapes, and each ecosystem type is indicated by a unique color. This image is plotted using the R package “ggplot2” v4.3.272 and “maps” v3.3.0 (https://doi.org/10.32614/CRAN.package.maps).

According to their CheckV completeness, the viral genomes or genome fragments were categorized into five different quality tiers: complete (100% completeness; 9 vOTUs), high-quality (≥90% completeness; 4 vOTUs), medium-quality (50–90% completeness; 14 vOTUs), low-quality (<50% completeness; 29 vOTUs), and not-determined (no completeness estimate available, 5 vOTUs). The incompleteness of prophages and UViGs was probably due to the fragmented nature of high-throughput sequencing data, as most of them were derived from draft genomes, MAGs, SAGs, or metagenome assemblies. Notably, several vOTUs within the Monodnaviria and Varidnaviria showed very low completeness or had no completeness estimate, even for sequences that contained direct terminal repeats (DTR) and were approximately the expected genome length (Supplementary Data 3). These can also be complete genomes that are distantly related to all reference sequences, as viral lineages other than Caudoviricetes are underrepresented in the CheckV database.

Despite the widespread occurrence of the genus Sulfurimonas, we know next to nothing about phages infecting this group thus far. To better reflect the diversity of Sulfurimonas-associated phages, all of the vOTUs described above were retained for subsequent analysis regardless of their completeness. These vOTUs represent species-level phages that are associated with Sulfurimonas and will greatly expand our understanding of the taxonomic, genomic, and ecological characteristics of Sulfurimonas phages. Remarkably, most of the vOTUs contained a single sequence, indicating that the genetic diversity of Sulfurimonas phages remained to be fully explored.

Taxonomic scope of tailed Sulfurimonas phages

As mentioned above, approximately two-thirds of the Sulfurimonas-associated prophages and UViGs were assigned to the viral class Caudoviricetes (Supplementary Data 3) using geNomad’s markers22. However, only one out of the 40 vOTUs was classified at the established family level (vOTU_23 as Schitoviridae), suggesting that these Sulfurimonas phages were of significant phylogenetic novelty.

To illustrate the phylogenetic relationship among the tailed Sulfurimonas phages and other known prokaryotic double-stranded DNA (dsDNA) viruses, a viral proteomic tree based on genome-wide sequence similarities was built using the ViPtree server23. The resulting tree showed that the 40 vOTUs were distributed in 8 clades, many of which were far from other isolated phages (Fig. 2A). The largest group (clade B) included 22 vOTUs, followed by clade C and clade D comprising 7 and 6 vOTUs, respectively. The remaining clades (clade A, E–H) contained only one vOTU, which was unrelated to each other.

Fig. 2: Phylogenetic and network analysis of tailed Sulfurimonas phages.
figure 2

a The genome-wide proteomic tree of tailed Sulfurimonas phages and other related known dsDNA phages constructed using ViPTree. The colored inner and outer rings represent the virus family and host groups, respectively. Phages identified in this study are indicated with red stars, and the corresponding branches are colored red. The 8 major clades containing Sulfurimonas phages are denoted A–H. b Gene-sharing network analysis of tailed Sulfurimonas phages and reference viral genomes. Phages identified in this study are indicated by triangles, and viral clusters corresponding to the ViPTree clades are labeled with letters.

We also constructed a gene-sharing network to further evaluate the taxonomic position of these vOTUs using vConTACT2 (Fig. 2B). This tool group viral contigs into viral clusters (VCs) that were ~96% concordant with ICTV prokaryotic viral genera24. Clustering with 4534 phage genomes from the ProkaryoticViralRefSeq211 database suggested that 27 vOTUs could be assigned to 6 genus-level clusters, while 9 vOTUs were designated as outliers (which might be related to the VCs they were connected to, but not at the genus level) and 4 vOTUs as singletons (which had few or no gene similarities against other genomes and were not shown in the network). As illustrated in Fig. 2B, several VCs and outliers were connected and formed 3 larger clusters representing subfamily or family-level relationships. These clusters roughly corresponded to the B, C, and D clades defined by ViPtree, except for the affiliation of vOTU_32 and vOTU_37 (Fig. 2, Supplementary Data 3). The vOTU_23 (clade G) was related to viral genomes from the family Schitoviridae but was not clustered with them at the genus level, which was consistent with both the geNomad and ViPtree classification. No previously described phages were clustered with these vOTUs, even when all the NCBI phage genomes (complete or near-complete, n = 25,903 as of August 2023)25 were used as references (Supplementary Data 4).

Overall, the tailed phages of Sulfurimonas were classified into 19 genus-level VCs from 8 families, based on two different viral taxonomy approaches. The results generated by ViPtree and vConTACT2 showed a high degree of agreement and both methods performed well for long (≥10 kbp) viral contigs in previous studies26. All of the genus-level VCs did not include any known tailed phages and might represent novel genera. To date, 29 phages isolated from the phylum Campylobacterota have been deposited in the Virus-Host DB (October 2023), most of which were associated with the pathogenic genera Campylobacter and Helicobacter. Only one temperate phage was induced from the deep-sea vent Campylobacterota, Nitratiruptor sp. SB155-227. As shown in Fig. 2A, phages within clade C and clade D were clustered with the Campylobacter phage CJIE4 and Nitratiruptor phage NrS-1, respectively. The rest of Sulfurimonas phages, however, were distantly related to any phage isolates infecting Campylobacterota.

Putative phages encoding double jelly-roll major capsid proteins (DJR MCPs)

Bacteriophages with dsDNA genomes were included in two viral realms, Duplodnaviria and Varidnaviria. Apart from the diverse tailed bacteriophages belonging to Duplodnaviria, a total of 5 DJR MCP-encoding sequences were identified in Sulfurimonas MAGs (Fig. 3, Supplementary Data 2), which might represent non-tailed dsDNA phages of the realm Varidnaviria. The size of these vOTUs ranged from 7425 to 10,747 bp, with the percentage of G + C content from 33.6% to 37.8%. One of the predicted phages located in a long contig with flanking bacterial genes, indicative of a provirus. Others were viral contigs containing no host genes, including one with DTRs, and might be a complete genome.

Fig. 3: Genomic characterization and phylogenetic analysis of double jelly-roll major capsid protein-encoding phages.
figure 3

a Genome arrangement of putative phages encoding double jelly-roll major capsid proteins (DJR MCPs). ORFs are shown by block arrows drawn to scale. Gene abbreviations: MCP major capsid protein, HTH helix-turn-helix domain-containing protein, B_sand beta-sandwich jelly-roll fold protein, NAT GNAT family acetyltransferase, Rep replication protein. b Maximum likelihood tree of the MCP proteins. Branch color indicates different DJR MCP groups. DJR MCPs in Sulfurimonas genomes and in cultured Turriviridae, Corticoviridae, and Autolykiviridae are labeled with green, purple, red, and blue circles, respectively.

The genomic context of the DJR MCP-encoding sequences was quite variable. Only two genes, the MCP and the upstream packaging ATPase were conserved among all these phage elements (Fig. 3A). This is similar to most of the reported phage groups encoding DJR MCPs28. Transcription regulators containing a helix-turn-helix (HTH) domain were detected in 4 out of the 5 sequences but were missing in the vOTU_60, probably due to the incompleteness of this contig. The vOTU_57 and vOTU_58 encoded an N-acetyltransferase (NAT) which was conserved in the family Autolykiviridae and Corticoviridae29. In comparison, vOTU_60 and vOTU_61 encoded a beta-sandwich jelly-roll fold protein (B_sand), which was present in most members of the STIV group28. Several genes were predicted to be involved in cell lysis, including those encoding holin and cell wall hydrolase. Besides, the majority of the predicted phage genes were of unknown function.

A maximum likelihood phylogenetic tree was constructed for these DJR MCPs and their homologous protein sequences (Fig. 3B). At coarse grain, the MCPs formed two distinct clades, which previously defined as PM2 group and STIV group28. The MCPs of vOTU_57 and vOTU_58 were closely related to those of phage isolates from Autolykiviridae and belonged to the PM2 group. This group also consisted of the cultured Corticoviridae, representing the largest group of known DJR MCPs. The vOTU_59 also fell into the PM2 group but was more distantly related to the phages of these two families. The MCPs of vOTU_60 and vOTU_61 were clustered with the STIV group MCPs. The described members of this group included two Sulfolobus turreted icosahedral viruses (STIV1 and STIV2) from the family Turriviridae and dozens of highly diverged sequences encoded by archaeal and bacterial (pro)viruses. Protein structural homology searches using the Phyre2 also showed that the MCPs of vOTU_57, vOTU_58 and vOTU_59 were most similar to the PM2 group MCP (all three matches had 100% confidence and 98% coverage) while those of vOTU_60 and vOTU_61 resembled the STIV group MCP (99.4% and 99.5% confidence, 93% and 84% coverage, respectively) (Supplementary Fig. 1).

Tailless phages with dsDNA genomes were thought to be abundant in global surface oceans30,31. However, the vast majority of cultivated phages are tailed viruses of the realm Duplodnaviria while the non-tailed phages within Varidnaviria are far less investigated. Metagenomic studies had revealed the diversity and prevalence of this group, based on the presence of the hallmark gene encoding DJR MCPs28,29. In the current study, we used a sensitive profile-based method to screen the Sulfurimonas genomes and discovered several highly diverse DJR MCP-encoding sequences. Despite the conserved tertiary structures, sequence similarities between some MCPs were low (amino acid identity below 30% or not detectable). Thus, we could not rule out the possibility that more divergent MCPs were missed in our analysis. Further explorations are required to unravel the role of this phage group infecting Sulfurimonas.

Diverse single-stranded DNA (ssDNA) phages infecting Sulfurimonas

Bacteriophages from the Inoviridae family are characterized by circular, single-stranded DNA genomes encapsidated in filamentous virions32. Among the reported inovirus genomes, a gene encoding the morphogenesis protein (pI) was the only conserved marker gene33. Due to their unique and diverse gene content, most of the current phage prediction tools were not able to identify inoviruses from genomic or metagenomics sequences. Here, a total of 16 inovirus-like sequences were identified from 15 Sulfurimonas strains using a machine learning approach employed in Inovirus_detector. In addition, 5 UViGs from the IMG/VR v.4 datasets that were predicted to infect Sulfurimonas were classified as inoviruses (Supplementary Data 2). These prophages and UViG sequences were clustered into 16 vOTUs (Supplementary Data 3), including 2 circular contigs, 9 prophages with canonical att sites (direct repeats of ≥10 bp in a tRNA or next to an integrase), and 5 partial genomes (Fig. 4A).

Fig. 4: Genome alignment and phylogenetic analysis of inovirus-like phages.
figure 4

a Alignment and comparison of putative inoviruses identified in Sulfurimonas. ORFs are shown by block arrows drawn to scale and color-coded according to their putative functions. The color of the shading connecting homologous genes indicates amino acid identities between them. b The genome-wide proteomic tree of putative inoviruses identified in Sulfurimonas and other related ssDNA phages constructed using ViPTree. The colored left and right lines represent the virus family and host groups, respectively. Phages identified in this study are labeled with colored stars.

Genome alignments of the inovirus-like sequences (Fig. 4A) indicated that they were highly diverse at the amino acid level, yet displayed genomic synteny to some extent. Most of the homologous proteins shared low sequence similarity with amino acid identity (AAI) values below 50%. An exception was found between the vOTU_44 and vOTU_50, whose hosts were different strains of the same species. These two sequences showed high similarity (AAI > 95%), but the vOTU_44 existed as a circular contig while the vOTU_50 integrated into the host’s chromosome (Supplementary Data 2). The complete genomes encoded 13–18 genes, which were organized in functional modules involved in virion structure, assembly/secretion, regulation, DNA replication, and integration. Some host strains, such as Sulfurimonas sp. SWIR-19 and Sulfurimonas sp. S012_79_esom contained two inovirus-like sequences that were distinct from one another. However, the vOTU_54 in Sulfurimonas sp. SWIR-19 and vOTU_52 in Sulfurimonas sp. S012_79_esom lacked the replication genes and might be defective prophages.

To determine the evolutionary relationship and taxon classification of these putative inoviruses, a viral proteomic tree was constructed using ViPtree23 with all related ssDNA phages as references. As shown in Fig. 4B, all the Sulfurimonas-associated inoviruses were clustering with members of the proposed family Amplinoviridae, consistent with the previous report showing that most host families were associated with a single inovirus lineage33. This proposed family comprised large genomes associated with hosts of Deltaproteobacteria and Campylobacterota but did not include any viral isolates. Pairwise comparison of the AAI percentage of marker genes (i.e., pI-like proteins) suggested that these 16 vOTUs belonged to 10 genus-level clusters (Supplementary Fig. 2), based on the proposed threshold for delineating Inoviridae genera (50% AAI)33.

Filamentous inoviruses are extremely diverse and widespread, infecting a broad diversity of bacterial hosts33. In the current study, highly divergent inovirus-like sequences were identified in different Sulfurimonas strains as integrated prophages or extrachromosomal viral genomes. Like the other known inoviruses, these phages might establish a chronic infection within the hosts and exert significant effect on their growth, adaptability, and evolution34,35. Bacteriophages of the family Microviridae are another group of ssDNA viruses that have also been frequently observed in viromes36. However, we did not find any evidence that microviruses were associated with Sulfurimonas hosts.

Interaction between Sulfurimonas phages and their hosts

Based on the virus-host linkages revealed via prophage prediction and CRISPR spacer match, we further investigated the potential interactions between Sulfurimonas-associated phages and their past/current hosts. CRISPR arrays with the highest evidence level (annotated by CRISPRCasFinder) were detected in 65 Sulfurimonas strains. Of these, seven strains contained spacers that target prophages or UViGs sequences identified in this study (Fig. 5A, Supplementary Table 1).

Fig. 5: Interactions between phages and Sulfurimonas strains.
figure 5

a Phage-host interaction network showing current and past infections. Phages and hosts are shown by different node types, and each ecosystem type is indicated by a unique color. Current infections and historical infections are indicated as red and green arrows, respectively. b Interaction network highlighting the one-to-many associations between phage and host species. Each host species is indicated by a unique color.

CRISPR spacers from five strains were associated with the identified inoviruses, and sometimes the viral sequences were targeted by more than one spacer (Fig. 5A, Supplementary Table 1). These strains did not contain any inovirus-like sequences, indicating that they acquired immunity against the corresponding phages from the historical infection. By contrast, closely related strains or species that lacked the CRISPR spacers might still carry these prophages. For example, S. hydrogeniphila NW15 contained several spacers matching the vOTU_44 and vOTU_50, which were not detected in NW15 but were present in the genomes of S. hydrogeniphila NW367 and S. hydrogeniphila NW9, respectively (Fig. 5B).

Only two CRISPR spacers from Sulfurimonas strains were related to infection histories of dsDNA phages (Fig. 5A, Supplementary Table 1). This proportion was relatively low given that most of the identified Sulfurimonas phages had dsDNA genomes. Moreover, it seemed that dsDNA phages had narrow host ranges compared with ssDNA phages. While many inovirus-like phages interacted with different strains and species (Fig. 5B), most of the dsDNA phages infected only one Sulfurimonas strain. This was consistent with the previous findings that non-tailed phages had a broader host range than the tailed group31,37,38. One exception was the vOTU_4, a provirus identified in the genome of Sulfurimonas sp. RIFOXYD12_FULL_33_39. Spacers targeting this phage were detected in Sulfurospirillum sp. SCADC, indicative of the potential for cross-genus infections. These broad-host-range phages may have considerable impacts on microbial community ecology and evolution39, yet their influence appears to be endemic, as all predicted phage-host interactions are confined within specific habitats (Fig. 5A).

In some Sulfurimonas genomes, multiple phage contigs were detected, suggesting co-infection of two or more phages in a single cell (Fig. 5B). The frequency of co-infection might be overestimated for tailed phages if the large genomes were mis-assembled into several different contigs40. In the case of Sulfurimonas sp. OB8 and Sulfurimonas sp. NORP112, the multiple sequences from Caudoviricetes represented at least two different phages as two distinct copies of the marker gene terL were identified. Co-infections of phages from different viral realms were also observed, especially for inoviruses with persistent infection cycles. As pointed out in previous studies38,41, the presence of multiple phages favored recombination and genetic exchange between phages, which is important for the generation of phage diversity42 and for escape from CRISPR immunity43.

It is worth noting that some UViGs recovered from MAGs are ‘full viral sequences’ that lack the flanking host genes. Consequently, the possibility cannot be excluded that these contigs have been misplaced into the MAG by inaccurate binning44. In such cases, the association between the phage and its host strain may not be as robust as those observed in other virus-host pairs. Besides, given that our analysis was mainly based on uncultured data, the detection of multiple phage contigs within the host genome could also point to scenarios other than co-infection, such as sequential infections. Therefore, additional culture-dependent analyses are imperative to further clarify the interactions between Sulfurimonas phages and their hosts.

Putative AMGs in Sulfurimonas-associated phages

AMGs encoded by phages may participate in microbial metabolic pathways during infection. To better understand the impact of phages on host metabolisms, the vOTUs were examined for the presence of AMGs. Using VIBRANT and DRAM-v pipelines, a total of 11 putative AMGs were identified in the genomes of Sulfurimonas-associated phages (Supplementary Table 2). Based on annotations against the KEGG Orthologs (KO) database, Pfam, and CAZy databases, these AMGs were classified into four functional categories, i.e., energy metabolism, amino acid metabolism, nucleotide metabolism, and carbohydrate metabolism.

All of these predicted AMGs were surrounded by phage genes (DRAM-v auxiliary scores <4), suggesting a true viral origin for them. However, AMGs related to amino acid metabolism and nucleotide metabolism were frequently annotated in viruses and might perform central functions for the virus itself rather than its host45. Putative AMGs involved in carbohydrate metabolism were detected in 2 vOTUs, including the genes encoding glycoside hydrolases (GH) of the family GH19 and GH28. While phage GHs were thought to facilitate the breakdown and utilization of complex carbohydrates46,47, a few studies suggested that GHs were often associated with viral invasion and might not be legitimate AMGs26,48,49. To be conservative, all these genes were excluded and only one AMG was retained for downstream analyses.

The AMG encoding a phosphoadenosine phosphosulfate reductase (PAPS reductase or CysH) was identified in a complete prophage genome (vOTU_1). CysH is a key enzyme involved in assimilatory sulfate reduction, catalyzing the reduction of PAPS to sulfite. Viral-encoded cysH genes have been found in various marine habitats, including oxygen-deficient water columns50, deep-sea hydrothermal vents18, cold seeps16, seamounts47, and hadal trenches51. The putative CysH of Sulfurimonas prophage contains the conserved domain and structural configuration of CysH enzymes (99.9% confidence and 75% coverage of Phyre2 match) that assimilate sulfates for methionine and cysteine biosynthesis in microorganisms (Fig. 6B). However, phylogenetic analysis of cysH genes showed that the phage CysH is distantly related to its homologs in Sulfurimonas, indicating the complex evolutionary history of this AMG (Fig. 6A). Interestingly, while many Sulfurimonas species encode the cysH genes, the host of this prophage (Sulfurimonas sp. 4561-380_metabat1_scaf2bin.038_VB, a MAG recovered from hydrothermal vent metagenome) lacks one. Thus, we suspect that the phage CysH may have potential compensatory effects on host metabolisms.

Fig. 6: Phylogeny and predicted protein structure of the auxiliary metabolic gene encoded by Sulfurimonas phages.
figure 6

a Maximum likelihood tree based on the CysH amino acid sequences. Leaves were colored according to the affiliated taxonomic groups. Nodes with ultrafast bootstrap support ≥95% and SH approximate likelihood ratio test support ≥80% are indicated by black dots. GenBank accession number is given for each sequence. b Predicted tertiary structure and function of the phage CysH proteins. CysH, phosphoadenosine phosphosulfate reductase; PAPS, phosphoadenosine phosphosulfate.

Conclusions

Sulfurimonas play important roles in chemoautotrophic processes and sulfur cycles in various habitats, and infection of phages will inevitably affect their population dynamics and metabolic capacities. In this study, we have revealed a high level of taxonomic and genomic diversity in Sulfurimonas-associated phages for the first time and illustrated their interactions with different host species or strains. To overcome the limitations of the regular viral prediction tools in identifying non-tailed phages, we applied a combination of several strategies and found that these “cryptic” phage elements were more ubiquitous than supposed. Although smaller than tailed phages, their divergent genome architectures and broader host ranges suggested that they might significantly contribute to the genetic diversity and evolution of this bacterial genus. As this is an in silico analysis, further experiments will be necessary to verify the presence and activity of these phages. While this manuscript was in review, our study describing a temperate phage isolated from Sulfurimonas indica NW79 (corresponding to vOTU_25 in this study) has been published52, yet much more efforts are anticipated to gain a deeper insight into phage-host interactions. Nonetheless, the findings reported here greatly expanded the current understanding of phages infecting Sulfurimonas, and provided good basis for future investigations to explore the ecological and evolutionary role of these phages.

Methods

Acquisition of Sulfurimonas-associated phage genomes

Genomic assemblies of all Sulfurimonas strains (including culture isolates, MAGs, and SAGs) were downloaded from the NCBI genome database (June 2023), and sequences from Sulfurimonas isolates, MAGs, and SAGs retrieved from previous studies18 were added into the collection. Medium- and high-quality (completeness ≥50% and contamination ≤10%) genomes evaluated by CheckM v1.1.353 were dereplicated at 99.99% ANI using dRep v2.3.254 (--S_algorithm ANImf -comp 50 -con 10 -sa 0.9999). The final dataset included 219 strain-level genomes of Sulfurimonas, representing isolates (n = 21), MAGs (n = 188), and SAGs (n = 10) recovered from various environments (Supplementary Data 1).

Phage-like sequences in Sulfurimonas genomic assemblies were predicted using a combination of three popular virus-finding tools, including Phaster web server55 (http://phaster.ca/), Virsorter v1.0.656 (--db 2 --diamond) and VIBRANT v1.2.045 (default settings). Putative viral contigs that were identified as higher confidence predictions (i.e., “intact” or “questionable” in Phaster, “category 1/2/4/5” in Virsorter, and “complete”, “high-” or “medium-quality” in VIBRANT) by at least two methods were retained for further analysis. Since the current phage prediction tools are not efficient in identifying non-tailed phages15,56, more specific strategies were also applied to detect other viral groups. For phages encoding double jelly-roll major capsid proteins (DJR MCPs), we screened the Sulfurimonas genomes using the hmmsearch tool in HMMER v3.1b257 (score ≥ 50 and e ≤ 0.001), with a collection of hidden Markov model (HMM) profiles of DJR MCPs representing different clades within Varidnaviria28,29. Additionally, Inovirus_detector, a machine learning approach based on marker gene and genome features33 was used to search the genome assemblies for inovirus-like phages within Monodnaviria.

Phages probably infecting Sulfurimonas strains were also retrieved from the IMG/VR v.4 dataset20. This dataset composed of >15 million virus genomes and genome fragments obtained from (meta)genomes and metatranscriptomes, representing the largest collection of UViGs currently available. To be conservative, only the high-confidence viruses (~5 million sequences) were used in the analysis. Sulfurimonas-associated phages were identified through four computational host prediction methods15,19. (i) Nucleotide sequence homology. UViGs were searched against genomic sequences of Sulfurimonas strains using BLASTn58 with the following thresholds: 70% minimum nucleotide identity over 75% of the contig length, and 0.001 maximum e-value (-perc_identity 70 -qcov_hsp_perc 75 -e-value 0.001). (ii) CRISPR spacer match. Clustered regularly interspaced short palindromic repeats (CRISPR) arrays and cas genes were identified from all Sulfurimonas genomes using CRISPRCasFinder59 (-def General -cas -ccvr -ccc 20000 -gscf -keep -html). CRISPR spacers were queried for exact matches (100% identity over 100% spacer length) against the IMG/VR contigs using the BLASTn-short mode (-task blastn-short -word_size 7 -perc_identity 100 -qcov_hsp_ perc 100). (iii) tRNAs similarity. ARAGORN v1.2.3860 was used with the “-t” option to detect tRNAs from the Sulfurimonas genomes and IMG/VR contigs, and the identified tRNA sequences were compared using BLASTn (-perc_identity 100 -qcov_hsp_perc 100) to select perfect hits (100% coverage and 100% identity). (iv) k-mer frequencies. WIsH v1.061 was run with the default parameters to identify connections between IMG/VR contigs and all reference genomes from the Genome Taxonomy Database62 (GTDB, release 207), with p < 0.05 being considered as a match.

Based on the methodology described above, prophages and UViGs associated with Sulfurimonas were identified and pooled together. Only contigs ≥10 kb were retained for tailed phages from the realm Duplodnaviria. Contigs ≥5 kb were retained for non-tailed phages within Varidnaviria and Monodnaviria, because the genome size of these groups was much smaller. These contigs were then clustered at 95% shared nucleotide identity and 80% coverage to generate viral operational taxonomic units (vOTUs)63. Completeness and contamination of the vOTUs were estimated using CheckV v1.0.164. Based on their completeness, the vOTUs were classified into five quality categories: complete (100% completeness with closed genomes), high-quality (completeness ≥90%), medium-quality (completeness between 50% and 90%), low-quality (completeness <50%), and not-determined (no completeness estimate available).

Taxonomic assignment and network analysis

Taxonomy of the IMG/VR UViGs was retrieved from the metadata, and (pro)phages that clustered with them in vOTUs were assigned to the same taxon. For the remaining prophages, two classification methods were used in combination: (i) marker-based taxonomic assignment by geNomad22 web app using the default option (https://nmdc-edge.org/virus_plasmid/workflow); (ii) last common ancestor (LCA) algorithm-based assignment by CAT v5.0.365 with the “--sensitive” option.

Gene-sharing network analysis was performed using vConTACT2 v0.11.324 (--rel-mode Diamond, --vcs-mode ClusterONE). Briefly, prodigal v2.6.366 was used for ORF prediction from the vOTUs, and the resulting protein sequences were used as input. Viral RefSeq version 211 (n = 4534) was selected as the reference database, and DIAMOND v0.9.2167 was chosen for the all-to-all comparison of the protein sequences. The similarity score between viral contigs was calculated based on the number of shared protein clusters, and viral contigs were clustered through ClusterONE68. Lastly, the genome-content-based network was visualized in Cytoscape v3.7.269.

To identify other potential hosts of Sulfurimonas phages, we further used a larger collection of prokaryotic genomes for host prediction, which composed of (i) all reference genomes from the GTDB62 release 207 (n = 65,703), and (ii) a custom database of marine microbial genomes from the Marine Culture Collection of China (n = 3676). Host-phage interaction networks were constructed based on the information of current infections (existence of phage contigs in genomes) and historical infections (inferred from CRISPR spacer match). The interaction network was visualized by Gephi v0.10.170 using the Fruchterman Reingold layout.

Phylogenetic and comparative genomic analysis

A proteomic tree of the tailed Sulfurimonas phages and all prokaryotic dsDNA viruses in the Virus-Host DB71 (RefSeq release 219) was generated using ViPtree v3.723 (https://www.genome.jp/viptree/). For visualization purposes, only queries and related genomes (SG > 0.02) were selected to construct the final tree (n = 859). For inovirus-like sequences, prokaryotic ssDNA viruses in the Virus-Host DB and previously reported uncultured inoviruses33 were selected as references. Genome alignment of inoviruses was also performed and visualized by the ViPtree server. The amino acid identity (AAI) between the mophogenesis proteins was calculated using CompareM v0.0.32 with default settings (https://github.com/donovan-h-parks/CompareM), and the heatmap was drawn using the R package “ggplot2” v4.3.272.

For phylogenetic analysis of DJR MCPs, previously reported MCPs were used as references28,29. For CysH genes, related sequences were retrieved from the NCBI’s non-redundant protein database using BLASTP58. The amino acid sequences were aligned using MUSCLE v3.8.3173 and the multiple alignments were trimmed with TrimAl v1.274 (-automated1). A maximum likelihood (ML) tree was inferred using IQ-TREE v2.075 with the best substitution model selected by ModelFinder76. Support for nodes in the ML tree was evaluated with 1000 replicates of ultrafast bootstrap77 and Shimodaira-Hasegawa approximate likelihood ratio (SH-aLRT) test78 (--ufboot 1000 --alrt 1000). The constructed tree was then visualized using FigTree v1.4.4 (http://tree.bio.ed.ac.uk/software/figtree/). Structural modeling was performed using ESMFold79 and the Phyre2 web server80 (http://www.sbg.bio.ic.ac.uk/phyre2/index.cgi), and the structures were visualized in ChimeraX v1.581. Genome maps for contigs encoding DJR MCPs and AMGs were drawn using the R package “gggenes” v0.5.1 (https://doi.org/10.32614/CRAN.package.gggenes).

Identification of auxiliary metabolic genes (AMGs)

Viral AMGs were identified and annotated based on VIBRANT v1.2.045 and DRAM-v v1.3.582 as previously described47. (i) VIBRANT pipeline. The vOTU sequences were run through VIBRANT using default parameters. (ii) DRAM-v pipeline. VirSorter2 (--prep-for-dramv) was run first to produce the affi-contigs and the resulting files were used as input for DRAM-v annotation. Putative AMGs were assigned an auxiliary score based on the category of flanking genes, and only AMGs with auxiliary scores < 4 were retained. Finally, the AMGs predicted by the two pipelines were combined. To improve confidence in AMG identification, manual curation was performed to remove genes related to nucleotide metabolism, organic nitrogen, ribosomal proteins, viral invasion, and modification of viral components49,83.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.