Introduction

The global dependence on fossil-fuel energy and the escalating levels of atmospheric carbon dioxide present a pressing challenge to humanity1. In the realm of subsurface energy and carbon reservoirs, oil reservoirs have long harbored diverse microorganisms, and have been a subject of growing interest since the 1930s2. Recent research has intensified the focus on understanding the presence and role of microorganisms in these deep subsurface environments3,4, driven by their dual impact on energy production and carbon emissions5,6.

Despite extreme environmental conditions, oil reservoirs still offer a variety of niches to support various microorganisms7, including hydrocarbon oxidizing microorganisms, fermentative microorganisms, sulfate reducing microorganisms (SRMs), nitrate reducing microorganisms, and methanogens8,9. Microorganisms in oil reservoirs have adapted to these extreme conditions and contribute to important ecological processes, such as the transformation of crude oil via biodegradation and other geochemical cycles5,10. Furthermore, microbial H2S production by SRMs is a prevalent metabolic process within subsurface oil reservoirs11, and this can lead to the corrosion of metal equipment and infrastructure, souring of oil, and health risks, resulting in significant economic and ecological costs11.

While the importance of microorganisms within oil reservoirs has become increasingly recognized, the mechanisms governing the assembly and regulation of microbial communities in these deep subsurface environments remain unclear. Microbial communities in oil reservoirs undergo fluctuations during production-related engineering interventions, particularly when surface activities establish connections with subsurface reservoirs12. Additionally, while variation partitioning of beta-diversity in oil reservoirs revealed that chemical properties and physical conditions explained a substantial fraction of variation in microbial beta-diversity, more than 70% of the community variation could not be explained13. Viruses are directly associated with host physiology and mortality, and further influence microbial community dynamics in aquatic environments14. Therefore, viruses in the unique environment of oil reservoirs may be one of the major contributors to the dynamic changes observed in the microbial communities within these reservoirs.

Viruses are the most abundant biological entities on the planet and are found in nearly all environments15, even in extreme environments such as Antarctic soils16, thawing permafrost17, and cryoconite holes of glaciers18. They play pivotal roles in natural ecosystems by interacting with microbial hosts and thereby exerting significant influence over global biogeochemical cycles, such as the global cycling of nutrients, energy flow, and food web dynamics19. Viruses can function as predators, regulating microbial abundance while releasing organic matter and inorganic nutrients through cell lysis20. Furthermore, viruses can actively regulate and rebuild host metabolism via auxiliary metabolic genes (AMGs) in their genomes that are expressed during infection. Recent studies have uncovered the presence of a substantial number of viruses within oil reservoir environments21,22. A meta-analysis of eight production wells indicated that the abundance of viruses is approximately 3 × 10⁸ mL⁻¹. The viral abundance in production wells during water flooding is higher than that in production wells during microbial flooding21. In addition, a previous study recovered about 800 reference viral genomes (length ≥ 10 kb), revealing that viruses are widespread in hydraulically fractured wells23. Laboratory experiments showed that prophage-induced lysis of dominant microorganisms releases intracellular metabolites that can sustain key fermentative metabolisms23, supporting the persistence of microorganisms in this ecosystem. These findings reveal the potential significance of viruses in shaping microbial communities and reprogramming microbial metabolisms within the oil reservoir ecosystem.

In addition, oil reservoirs harbor a broad diversity of uncultured microorganisms and novel metabolic pathways. Zhou et al. suggested that the archaeon ‘Candidatus Methanoliparum’ alone has the ability to degrade various large hydrocarbons into methane8. Meckenstock et al. discovered complex microbial communities inhabiting small water droplets within the oil phase, and it is widely believed that microbial degradation occurs within these microenvironments24. Diverse functional gene groups identified from Pseudomonas in the oil phase differed significantly from those in the corresponding water phases22. This indicated that oil reservoirs are a unique environment, containing uncultured microorganisms and their co-evolved novel viruses. Therefore, the study of oil reservoir viruses is conducive to revealing global viral diversity.

However, the study of viruses in oil reservoirs is still in its early stages, with only a small number of viral genomes having been identified to date23. Moreover, existing studies on viromes in oil reservoirs have predominantly focused on a limited spatial scale and specific sample types21,23. Consequently, a comprehensive understanding of viral communities in oil reservoirs remains elusive, and little is known about the distribution patterns and assembly processes of viral communities in these environments.

In this study, to gain insights into the diversity, virus-host interactions, and potential ecological roles of viruses within oil reservoirs, we assemble a catalogue encompassing viral and prokaryotic genomes sourced from 182 oil reservoir metagenomes, and analyze the ecological roles of viruses in regulating the community structure of sulfate reducing microorganisms through combined microcosm enrichment experiments and bioinformatics analysis. Our findings reveal that viruses are widely distributed in oil reservoirs and viruses not only inhibit the growth of SRMs, but also shape the structure of the sulfate reducing microbial subcommunity.

Results

Overview of prokaryotic genome and viral genome catalogues in oil reservoirs

To explore the diversity and ecological function of viruses inhabiting oil reservoirs, we collected 59 oil reservoir produced water samples from across China and performed shotgun metagenomic sequencing. To compile catalogues of both prokaryotic and viral genomes from these metagenomic datasets, we created an analysis pipeline (see Supplementary Fig. 1). Employing this pipeline, we analyzed the 59 metagenomes newly generated in this study, plus an additional 123 metagenomes obtained from public databases, originating from oil reservoir samples collected from Europe, Asia, North America, and South America (Fig. 1a). From these 182 metagenomic datasets, we constructed catalogues of both prokaryotic and viral genomes. The prokaryotic genome catalogue consisted of 7229 microbial metagenome-assembled genomes (MAGs) that met or exceeded medium quality (>50% completeness and <10% contamination), encompassing 6686 bacterial and 543 archaeal MAGs, spanning 72 bacterial and 9 archaeal phyla. The bacterial community was dominated by Proteobacteria (2402 MAGs), while the archaeal community was predominantly represented by Halobacteriota (339 MAGs). The viral catalogue of the oil reservoirs consisted of 33,657 putative viral genomes, all exceeding 10 kb in size. This catalogue featured 6218 viral genomes of medium to high quality (>50% completeness). To assess the coverage of viral communities within the oil reservoir, we sequenced virus-like particles (VLPs) from 8 of the 59 produced water samples mentioned above25. The result revealed that 77.32% of the VLPs could be mapped to the viral catalogue, suggesting that the viral catalogue extensively covered the viral population within the oil reservoir environment26.

Fig. 1: Overview of viruses in oil reservoir ecosystems.

a Geographic distribution of collected oil reservoir samples. b Distribution of vOTUs across samples. c Relative proportion and taxonomic classification of vOTUs based on the latest ICTV classification using PhaGCN2.0 and geNomad v1.9. d Predicted lifestyles of vOTUs within the oil reservoirs.

We clustered the 6218 viral genomes into 3886 viral Operational Taxonomic Units (vOTUs) with a sequence similarity threshold of 95%, a level that approximates species-level taxonomy27. The viral richness of all samples ranged from 1 to 1020 with an average of 206 (Fig. 1b). Rarefaction analysis showed that the number of detected vOTUs was saturated (Supplementary Fig. 2), suggesting that our study provided a reasonably comprehensive sampling of the viral communities within oil reservoirs. We assigned the vOTUs to taxonomic ranks based on the latest ICTV classification using PhaGCN2.0 and geNomad v1.9, and found that 44.49% of vOTUs could not be confidently annotated at the family level (Fig. 1c). In addition, only 5.20% of vOTUs could be found in the IMG/VR v3 dataset, underscoring the presence of a substantial number of unidentified viruses within oil reservoirs. Annotated vOTUs were predominantly assigned to the class Caudoviricetes (formerly known as the order Caudovirales; 54.48%), including taxa such as Peduoviridae (24.94%), Casadabanvirus (7.54%), Hendrixvirinae (2.86%), and Casjensviridae (2.47%). Furthermore, we identified three core vOTUs that were present in more than 50% (n = 91) of all 182 oil reservoir metagenomes analyzed in this study. Notably, these core vOTUs represented a mere 0.08% of the total vOTUs. The majority of vOTUs (84.64%) were detected in less than 10% of the samples, highlighting the heterogeneous nature of viral communities within oil reservoirs. Additionally, our analysis predicted 1722 and 1774 vOTUs as virulent and temperate viruses, respectively, while the lifestyles of the remaining vOTUs (n = 390) could not be determined (Fig. 1d).
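The prevalence cut-offs described above (core vOTUs in more than 50% of metagenomes, most vOTUs in less than 10%) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name `classify_votus`, the input layout (a mapping from vOTU to the set of samples where it was detected), the strict inequalities at the boundaries, and the toy data are all assumptions made for illustration.

```python
from typing import Dict, Set


def classify_votus(presence: Dict[str, Set[str]], n_samples: int,
                   core_cutoff: float = 0.5,
                   rare_cutoff: float = 0.1) -> Dict[str, str]:
    """Label each vOTU by prevalence (fraction of samples in which it occurs).

    Hypothetical helper: 'core' if prevalence > core_cutoff,
    'rare' if prevalence < rare_cutoff, otherwise 'intermediate'.
    """
    labels = {}
    for votu, samples in presence.items():
        prevalence = len(samples) / n_samples
        if prevalence > core_cutoff:
            labels[votu] = "core"
        elif prevalence < rare_cutoff:
            labels[votu] = "rare"
        else:
            labels[votu] = "intermediate"
    return labels


# Toy data (invented): 20 samples, three vOTUs with different prevalence.
all_samples = [f"S{i}" for i in range(20)]
toy_presence = {
    "vOTU_a": set(all_samples[:11]),  # 11/20 = 0.55
    "vOTU_b": set(all_samples[:1]),   # 1/20  = 0.05
    "vOTU_c": set(all_samples[:5]),   # 5/20  = 0.25
}
toy_labels = classify_votus(toy_presence, n_samples=len(all_samples))
```

How detection in a sample is decided (for example, a read-mapping coverage threshold) is left outside this sketch, since the text above does not state it.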

Biogeography of viral communities in oil reservoirs

To investigate how viral communities assembled within oil reservoirs, we compared the viral diversity and composition across various oil reservoir samples. We found significant disparities in viral diversity and composition associated with different geographic locations (Supplementary Fig. 3a, Fig. 2a, ANOSIM, R = 0.59, P = 0.001), and this disparity was not due to the sequencing depth of the samples included in the study (Supplementary Fig. 3b, Spearman’s correlation coefficient ρ = −0.04, P = 0.80). These results suggested that geographic location serves as the primary determinant of viral variation between samples on an intercontinental geographic scale. This discovery aligns with previous findings in studies of viral communities in cold seep and acid mine drainage environments28,29. Within the dataset employed in this study, the majority of oil reservoir samples from Canada and China were dominated by Peduoviridae, while a few Chinese oil reservoir samples exhibited dominance by Casadabanvirus. The majority of viruses in samples from oil reservoirs in the United States remained unclassified at the family level (Supplementary Fig. 3c).

Fig. 2: The distribution and assembly process of viral communities.

a Nonmetric multidimensional scaling (NMDS) of viral communities colored by sampling sites. b Distance-decay relationships based on Bray-Curtis similarity (1 − dissimilarity of viral communities). The blue line denotes the least-squares linear regression across all spatial scales. The fitting equation, adjusted R² value, and P value for the distance-decay relationship are based on linear regression and presented above the graph. c Proportions of viral community assembly processes in oil reservoirs, including variable selection, homogenizing selection, homogenizing dispersal, dispersal limitation, and drift. d Correlations between individual physicochemical parameters (pH, K+, and Mg2+) and the Shannon-Wiener (left) and richness (right) indices were examined using a Spearman’s rank test (two-sided P-value); the shaded area shows the 95% confidence interval around the fitted mean line. e NMDS of viral communities from oil reservoirs with different pH. f NMDS of viral communities from oil reservoirs with different temperatures. Sample points are colored according to oil reservoir pH (e) and temperature (f). Color bars from red to yellow represent values from small to large, respectively.

Furthermore, we observed significant negative distance-decay relationships across all samples based on the Bray-Curtis similarities (1 − dissimilarity) of viral communities (slope = −0.05, P < 0.001) (Fig. 2b). Mantel tests indicated that the compositions of viral communities were significantly related to geographic distance (Spearman’s correlation coefficient ρ = 0.30, P = 0.001) (Supplementary Table 1). To determine the relative importance of geographic distance in structuring viral communities, we applied multiple regression on matrices (MRM). The MRM results showed that 16.53% of the variability in viral community similarity was explained by geographic distance (partial regression coefficient b = −0.37, P = 0.0001). To explore the mechanisms for viral community assembly within oil reservoirs, we performed a null model analysis. The result revealed that stochastic processes played a more substantial role than deterministic processes in governing viral community assembly in oil reservoirs (Fig. 2c). Moreover, we utilized environmental metadata associated with this study’s 59 newly sequenced oil reservoir metagenomes to investigate the influence of environmental factors on the viral community. Correlations of individual physicochemical parameters with alpha diversity showed that the alpha diversity of the viral community was significantly correlated with pH, K+, and Mg2+ (P < 0.05; Fig. 2d). The Shannon-Wiener and richness indices both increased as pH increased, but both decreased as K+ and Mg2+ concentrations increased (Fig. 2d). In contrast, no significant correlation between viral alpha diversity and temperature was found (Supplementary Fig. 3d).
For beta diversity among viral communities, Mantel tests showed that the similarity in viral community composition among samples was significantly correlated with environmental distance (Spearman’s correlation coefficient ρ = 0.27, P = 0.0001) and geographic distance (Spearman’s correlation coefficient ρ = 0.19, P = 0.001). MRM was used to further identify the relative contributions of environmental factors versus geographic distance to viral community similarity (Supplementary Table 2). In oil reservoirs, the MRM model explained 14.42% of the variability in viral community similarity (P = 0.0001), with geographic distance, pH, and temperature being the most important variables explaining viral community similarity (partial regression coefficient b = −0.14, −0.11, and −0.071, respectively, P < 0.0001). NMDS analysis also illustrated the significant dissimilarities in viral communities between different pH (Fig. 2e) or temperature (Fig. 2f) oil reservoirs (pH: ANOSIM, R = 0.28, P = 0.001; temperature: ANOSIM, R = 0.24, P = 0.001).
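As an illustration of the distance-decay analysis described above, the following Python sketch computes pairwise Bray-Curtis similarity (1 − dissimilarity), great-circle distance between sites, and the least-squares slope of similarity against distance. It is a simplified sketch under stated assumptions, not the authors' analysis: all function names, the toy abundance table, and the toy coordinates are hypothetical, and regressing on untransformed kilometres (rather than, for example, log-transformed distance) is an assumption, because the text does not specify the scaling.

```python
import math
from itertools import combinations


def bray_curtis_similarity(a, b):
    """Return 1 - Bray-Curtis dissimilarity for two abundance vectors."""
    total = sum(a) + sum(b)
    if total == 0:
        raise ValueError("both samples are empty")
    return 1.0 - sum(abs(x - y) for x, y in zip(a, b)) / total


def haversine_km(p, q):
    """Great-circle distance in km between (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def ols_slope(xs, ys):
    """Slope of the least-squares line y = a + b*x."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def distance_decay_slope(abundance, coords):
    """Regress pairwise similarity on pairwise geographic distance."""
    dists, sims = [], []
    for i, j in combinations(sorted(abundance), 2):
        dists.append(haversine_km(coords[i], coords[j]))
        sims.append(bray_curtis_similarity(abundance[i], abundance[j]))
    return ols_slope(dists, sims)


# Toy data (invented): three sites, three vOTUs.
toy_abundance = {"A": [10, 0, 0], "B": [8, 2, 0], "C": [0, 2, 8]}
toy_coords = {"A": (0.0, 0.0), "B": (0.0, 1.0), "C": (0.0, 10.0)}
toy_slope = distance_decay_slope(toy_abundance, toy_coords)
```

Because pairwise values are not independent, significance in such analyses is usually assessed by permutation (as in the Mantel and MRM tests reported above) rather than from the ordinary regression P value; the sketch returns only the slope.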

Novel viral clusters in oil reservoirs

To uncover novel viruses inhabiting oil reservoirs, we employed vConTACT230 to construct a gene-sharing network that included vOTUs found in oil reservoirs and in a wide variety of other ecosystems, such as groundwater, sediment, and thermal springs. In this weighted network, all vOTUs were grouped into 4011 viral clusters (VCs) (Fig. 3a). Among these VCs, 1934 were exclusively associated with a single ecosystem, while only two were shared across all ecosystems. The limited overlap of viruses between different ecosystems demonstrated a high degree of habitat specificity among viruses. Within the subset of oil reservoir viruses, 2942 out of 3886 vOTUs were clustered into 805 VCs, with 265 VCs (33%) being unique to oil reservoirs. This finding suggested that about one-third of oil reservoir viruses may be endemic to oil reservoirs. Furthermore, we found that acid mine drainage sediments and groundwater shared a higher number of VCs with oil reservoirs (Fig. 3a). Additionally, we constructed a gene-sharing network based on the vOTUs found in oil reservoirs and those deposited in the Viral RefSeq database. Our analysis showed that only a small percentage (n = 495, 14.66%) of vOTUs from oil reservoirs clustered with taxonomically known genomes from Viral RefSeq (Fig. 3b). These results indicated that oil reservoirs possess a vast array of undescribed viruses.

Fig. 3: Taxonomic diversity of viruses in oil reservoirs.

a Gene-sharing network of viruses across various environments, including oil reservoirs, groundwater, cold seep sediments, thermal springs, wetland, acid mine drainage sediments, Minnesota peat, non-marine saline and alkaline, hydrothermal vents, and Stordalen thawing permafrost. The nodes in the network represent viruses, while the edges indicate similarity based on shared protein clusters. Node color represents the origin environment of the viruses. Pie chart showing the relative proportion of shared viral clusters between oil reservoirs and the other nine environmental virus datasets. b Gene-sharing network of viruses from oil reservoirs and RefSeq prokaryotic viruses. Pie graph showing the relative proportion of shared viral clusters between oil reservoirs and RefSeq prokaryotic viruses.

Functional genes encoded by viromes in oil reservoirs

To unravel the functional roles of viruses within oil reservoirs, we clustered all 346,145 predicted protein-coding genes derived from the oil reservoir viral genomes into 322,060 gene clusters. We found that 61.32% of viral gene clusters (accounting for 63.90% of total viral genes) lacked functional annotation against the eggNOG database. The largest gene cluster derived from viromes in oil reservoirs predominantly encoded proteins associated with replication, recombination and repair (L) functions (Supplementary Fig. 4a).

Furthermore, we identified 105 putative auxiliary metabolic genes (AMGs) that might participate in host metabolism and adaptation to the environment, mainly involved in carbon utilization (n = 32), energy metabolism (n = 25), miscellaneous metabolism (MISC, n = 34), and transporters (n = 14). Among these AMGs, genes involved in cobalamin biosynthesis (including cobS and cobT) were the most prevalent (Supplementary Fig. 4b), being found in 25 samples. Cobalamins are a class of structurally diverse cofactors containing cobalt31 that are essential for various biological functions, including amino acid synthesis and carbon metabolism32. VirSorter2 placed cobS and cobT within the interior of the viral genome. In addition, the structural models of virus-encoded CobS and CobT predicted with Phyre2 had 100% confidence (Fig. 4). These findings suggest that viruses in oil reservoirs could potentially influence microbial interactions by regulating the production of cobalamin within their hosts. In addition, we identified 9 AMGs from complete/high-quality viruses that were predicted to participate in energy metabolism (Supplementary Fig. 4c, d, see Supplementary Discussion for more information).

Fig. 4: Genome map of a representative cobS and cobT encoding virus.

AMGs are in orange, virus-like genes are in blue, viral hallmark genes are in purple, and non-virus-like or uncharacterized genes are in gray. Tertiary structures of selected AMGs based on structural modeling using Phyre2.

Virus-host interactions in oil reservoirs

To further study the potential impacts of viruses on microbial ecology within oil reservoirs, we investigated in situ virus-host interactions based on CRISPR spacers and tRNA sequences33. A total of 7197 putative virus-host pairs were predicted (1411 based on CRISPR, 5,786 based on tRNA), in which 1119 viruses (18.00% of the total viruses) were connected to 1217 prokaryotic MAGs (host MAGs, accounting for 16.83% of the total MAGs). The host MAGs were distributed across 3 archaeal phyla and 37 bacterial phyla (Fig. 5a, Supplementary Fig. 5a). The top five most frequently predicted bacterial phyla included Proteobacteria (785 MAGs), Actinobacteriota (69), Desulfobacterota (64), Chloroflexota (50), and Bacteroidota (49). The most frequently predicted archaeal phyla were Halobacteriota (10) and Methanobacteriota (8) (see Supplementary Discussion for more information). In addition, we found 422 host MAGs with the potential for hydrocarbon degradation.

Fig. 5: Linkages between each virus and its predicted host MAG in oil reservoirs.

a Phylum-level maximum-likelihood phylogenetic tree of MAGs detected in the oil reservoirs. The color of clades in the tree indicates whether MAGs are host MAGs (orange represents host MAGs, blue represents no-host MAGs). The inner circle represents the domain, and the outer circle represents the phylum as annotated by GTDB. The tree was constructed using PhyloPhlAn and visualized with iTOL. b Comparison of α-diversity (Shannon index) between the virome and the microbiome. c, d Correlation between the relative abundance of hosts and viruses (c) or VHR (d) in the oil reservoirs. In (b–d), correlations were examined using a Spearman’s rank test (two-sided P-value). Regression lines are shown in red. e, f Associations between the relative abundance of hosts and viruses (e) or VHR (f) (temperate and virulent viruses) in different samples. The red points represent temperate viruses, and the blue points represent virulent viruses; each point represents one sample. Regression lines are shown in red and blue for temperate and virulent viruses, respectively. g, h Correlation between the relative abundance ratio of virulent-to-temperate viruses and the relative abundance of hosts (g) or VHR (h). Regression lines are shown in red.

To investigate the relationship between viruses and microbes in oil reservoirs, we conducted a correlation analysis between viral and microbial profiles. Our analysis revealed a significant positive correlation of α-diversity (Shannon diversity) between the viral and prokaryotic communities (r = 0.68, P < 2.2e−16; Fig. 5b). This result indicated that the structures of the viral and prokaryotic communities were closely related within oil reservoir environments. To further explore their association, we examined the relationship between the relative abundance of each virus and its predicted host. Our results revealed a positive correlation between viral and microbial relative abundances (r = 0.68, P = 1.05e–149; Fig. 5c). Furthermore, a negative correlation was observed between Viral-to-Host Ratios (VHRs) and microbial relative abundance within oil reservoirs, showing that hosts with higher relative abundances tended to exhibit lower VHRs (Fig. 5d). This phenomenon aligns with the ‘piggybacking the winner’ model, which posits that viruses exploit their hosts through lysogeny rather than killing them when host density is high. Temperate viruses protect their host cells from infection by closely related viruses via superinfection exclusion34. Consequently, the contribution of temperate viruses to the host increases with high host abundance, resulting in a ‘more microbes, fewer viruses’ scenario. To further clarify whether viruses affected their prokaryotic hosts in oil reservoirs through a ‘piggybacking the winner’ mechanism, we compared the relative abundance of temperate and virulent viruses with identified host links. Although both virulent and temperate viruses exhibited positive correlations in relative abundance with their hosts, the increase in the relative abundance of temperate viruses occurred at a higher rate than that of virulent viruses (Fig. 5e), and the VHRs of temperate viruses exhibited a milder decline compared to those of virulent viruses (Fig. 5f). Additionally, we found that host relative abundances exhibited a negative correlation with the virulent-to-temperate virus ratio (Fig. 5g), whereas the VHR displayed a positive correlation with this ratio (Fig. 5h). Collectively, these findings suggest that oil reservoir viruses affect their hosts mostly in the manner of ‘piggybacking the winner’.
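The host-abundance and VHR correlations discussed above can be sketched in Python as follows. This is an illustrative sketch only: the text does not give the exact VHR formula, so defining it as viral relative abundance divided by host relative abundance is an assumption, and the function names and toy values are hypothetical. Spearman's ρ is computed here as the Pearson correlation of average ranks.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def virus_to_host_ratio(virus_abundance, host_abundance):
    """Assumed VHR definition: viral / host relative abundance."""
    return virus_abundance / host_abundance


# Toy data (invented): viruses increase with their hosts, but more slowly.
toy_host = [1.0, 2.0, 4.0, 8.0, 16.0]
toy_virus = [2.0, 3.0, 5.0, 8.0, 12.0]
toy_vhr = [virus_to_host_ratio(v, h) for v, h in zip(toy_virus, toy_host)]

rho_abundance = spearman_rho(toy_host, toy_virus)
rho_vhr = spearman_rho(toy_host, toy_vhr)
```

In the toy data, viral abundance rises with host abundance while the VHR falls, which is the qualitative pattern the ‘piggybacking the winner’ interpretation relies on; the sketch omits the significance test.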

Sulfur metabolism mediated by viruses in oil reservoirs

In oil reservoirs, a prevalent metabolic pathway involves the microbial reduction of sulfate to generate hydrogen sulfide (H2S), a process primarily conducted by sulfate-reducing microorganisms (SRMs)35,36. H2S has significant implications for the degradation of infrastructure, reservoir souring, the operational cost of oil production, and the value of crude oil37. Recent studies reported that viruses infecting sulfur-metabolizing microbes could reshape the host’s sulfur metabolism through encoded auxiliary metabolic genes (AMGs)38.

To investigate the potential impact of viruses on microbial reduction of sulfate in oil reservoirs, we identified MAGs possessing sulfur metabolic capabilities, including assimilatory sulfate reduction (ASR), thiosulfate oxidation (TSO), dissimilatory sulfate reduction (DSR), and sulfide oxidation (SO), together with their associated viruses. A total of 484 host MAGs were predicted to possess the pathways for different types of sulfur metabolism (Supplementary Fig. 7c, see Supplementary Discussion for more information). The relative abundances of host MAGs with sulfur metabolic capabilities and their associated viruses were positively correlated (Fig. 6a), and the VHRs declined with the increasing relative abundance of hosts with sulfur metabolic capabilities (Fig. 6b). These results were consistent with the findings observed in the overall MAG population. However, we noticed a parallel increase in the relative abundances of virulent and temperate viruses in tandem with the rise in host relative abundance (Fig. 6c). The difference between virulent and temperate viruses in the rate of increase in relative abundance was not as pronounced as the trend observed for the entire host population (Fig. 5e, Fig. 6c). In addition, virulent viruses displayed a significantly stronger correlation with hosts engaged in sulfur metabolism when compared to temperate viruses (Fig. 6d). Collectively, these findings point to a relatively more substantial contribution by virulent phages to the regulatory dynamics of sulfur-metabolizing communities. Moreover, we found that hosts with DSR capabilities exhibited a relatively stronger correlation with their associated viruses in comparison to other hosts (Fig. 6e). This high correlation could be attributed to the higher prevalence of virulent viruses infecting these hosts compared to other hosts (Supplementary Fig. 7c).
In addition, both virulent and temperate viruses infecting hosts capable of DSR showed a relatively stronger correlation with their hosts compared to other hosts (Fig. 6f, g). These findings collectively suggested that viruses primarily regulated DSR function within oil reservoirs, possibly in a top-down regulatory manner39, where the growth and abundance of SRMs are primarily regulated by the predation of virulent viruses.

Fig. 6: The impact of viruses on the sulfur cycle in oil reservoirs.

a, b Correlations between the relative abundance of hosts with sulfur metabolic capabilities and viruses (a) or VHR (b) were examined using a Spearman’s rank test (two-sided P value). Regression lines are shown in red. c Associations between the relative abundance of hosts with sulfur metabolic capabilities and viruses (temperate and virulent viruses) in different samples. Red points represent temperate viruses and blue points represent virulent viruses; each point represents a single sample. Regression lines are shown in red for temperate and blue for virulent viruses. d Distribution of virus-host correlations with sulfur metabolic capabilities. Orange, blue, and red colors represent the distributions of all viruses, virulent, and temperate viruses, respectively. e Comparison of the distribution of virus-host correlations with different sulfur metabolic capabilities. Orange color represents the distributions of all viruses, while blue, green, purple, and red colors represent viruses infecting different functional sulfur metabolism groups. f, g Distribution of virulent (f) or temperate (g) viruses and host correlations with sulfur metabolic capabilities. Orange color represents the distributions of all virulent (f) or temperate (g) viruses, while blue, purple, and red colors represent virulent (f) or temperate (g) viruses infecting different functional sulfur metabolism groups. Dashed lines indicate the average correlation in the distribution.

To further investigate the influence of viruses on sulfate reduction, we employed oil reservoir production water from the Huabei Oilfields as the initial inoculum and set up two distinct sets of microcosms, namely low-virus-like particles (VLPs) microcosms and high-VLPs microcosms (Supplementary Fig. 12). In brief, the production water underwent centrifugation, filtration, and ultra-concentration processes to acquire microbial pellets, virus-rich filtrate, and virus-poor filtrate, which were then used to establish two types of microcosms. In the low-VLPs microcosms, microbial pellets were diluted with virus-poor filtrate (Microbes−Viruses), while in the high-VLPs microcosms, microbial pellets were diluted with virus-rich filtrate (Microbes+Viruses). As a result, the high-VLPs microcosms featured a higher initial count of VLPs in comparison to the low-VLPs microcosms (see Methods for details). Over the course of 570 days, we monitored sulfide production, and the viral communities were analyzed at two time points (160 days and 570 days) after the start of incubation (i.e., T1 and T2). Detailed information on putative viral contigs and host MAGs can be found in the Supplementary Discussion.

Throughout the incubation period, we observed significantly lower levels of sulfide in the high-VLPs microcosms (Fig. 7a, Supplementary Data 4). Additionally, at the T1 time point, the diversity of both the overall microbial community and host-MAG subcommunity was significantly higher in the high-VLPs microcosms compared to low-VLPs microcosms (Supplementary Fig. 8c).

Fig. 7: Viral impact on dissimilatory sulfate reduction.

a Sulfide production in microcosms. Differences in sulfide production between low-VLPs microcosms and high-VLPs microcosms were assessed using a two-tailed Student’s t-test. Each point shows the mean ± SD of three biological replicates per treatment (n = 3). The red asterisks indicate a significantly higher level of sulfide production in the low-VLPs microcosms compared with the high-VLPs microcosms at the corresponding time point (*P < 0.05, **P < 0.01). Detailed information on sulfide production is provided in Supplementary Data 4. b Changes in microbial and viral community composition (left) and function (right) at the T1 and T2 sampling time points. c Reporter scores evaluating the enrichment of modules between the two microcosm types. d Comparison of the relative abundance of dissimilatory sulfate reduction related genes in the low-VLPs microcosms (n = 3) and high-VLPs microcosms (n = 3) at the T1 sampling time. The gradient dark blue circle represents the average RPKM value of the genes in the low-VLPs microcosms. The gradient dark blue triangle represents the average RPKM value of the genes in the high-VLPs microcosms. Differences in the relative abundance of dissimilatory sulfate reduction related genes between low-VLPs microcosms and high-VLPs microcosms were assessed using a two-tailed Student’s t-test. Significant differences are highlighted by asterisks (*P < 0.05, **P < 0.01). Detailed information on the relative abundance of dissimilatory sulfate reduction related genes is provided in Supplementary Data 5. e Relative abundance variation of all-SRMs in microcosms at the family level. f Relative abundance variation of host-SRMs in microcosms. g Predicted virus-host links between viruses and host SRMs in microcosms. The right two panels represent host taxonomy colored by phylum and family, and the left two panels show viral taxonomy colored by family and viral clusters (vOTUs); connecting lines show associations between host SRMs and viruses.
Detailed information of the link between virus and host SRMs is provided in Supplementary Data 6. h Relative abundance proportion between host-SRMs and other SRMs in microcosms.

To examine the potential influence of initial VLP counts on the composition and function of both the microbial and viral communities, we conducted principal coordinates analysis (PCoA), which revealed distinct dissimilarity patterns in both the composition and function of the microbial community between the two microcosm types (ANOSIM, Rcomposition = 0.85, Pcomposition = 0.001, Rfunction = 0.91, Pfunction = 0.001) (Fig. 7b). Similarly, the composition and function of the viral communities differed between the two microcosm types (ANOSIM, Rcomposition = 0.95, Pcomposition = 0.001, Rfunction = 0.96, Pfunction = 0.001) (Fig. 7b). As the incubation progressed, the differences in diversity and composition of the microbial and viral communities between the two sets of microcosms diminished (Fig. 7b, Supplementary Fig. 8d). Moreover, we compared the functional characteristics related to energy metabolism and hydrocarbon degradation within the low-VLPs and high-VLPs microcosms. Notably, the M00596 module, associated with DSR function, was enriched in the low-VLPs microcosms at T1 (Fig. 7c, Supplementary Data 5). Additionally, the relative abundance of genes involved in DSR was significantly lower in the high-VLPs microcosms than in the low-VLPs microcosms at T1 (Fig. 7d). These results indicated that a high number of viruses was associated with decreased H2S production, and we attribute this decline to a decrease in the relative abundance of genes involved in DSR within the microcosms.

To gain insight into the potential influence of the initial VLP counts on the SRM community, we compared the relative abundance and composition of SRMs between the different sets of microcosms. In total, we identified 89 SRMs (referred to as all-SRMs), including 44 SRMs associated with viruses (referred to as host-SRMs) (Supplementary Fig. 9, Supplementary Data 6). In the high-VLPs microcosms, the SRM subcommunity displayed a higher diversity than in the low-VLPs microcosms, which was similar to the findings observed within the entire microbial community (Fig. 7e, Supplementary Fig. 10a, b). This result could be attributed to the decreased abundance of the dominant SRM (drephostSRM5, Desulfatibacillaceae) in the community, which in turn led to an increase in the abundance of rare SRMs, such as drephostSRM3 (Desulfatiglandaceae), drephostSRM4 (Unclassified Adiutricales), and drephostSRM10 (Syntrophobacteraceae) (Fig. 7f, g).

Collectively, the results suggested that the high number of viruses in the microcosm restructured the sulfate reducing microbial subcommunity. In addition, we compared the total relative abundance of SRMs within the overall microbial community in different sets of microcosms. We found that the total relative abundances of SRMs in low-VLPs microcosms were significantly higher than in high-VLPs microcosms at T1 (P = 0.024). Furthermore, we also found that the majority of the SRMs could be associated with identified viruses (Fig. 7h), suggesting that viral infection targeting SRMs was a significant factor in the reduction in H2S production. The high initial VLP count primarily attenuated H2S production by regulating the composition of SRMs.

Discussion

In this study, we conducted large-scale data mining of metagenomes to construct a catalog of viromes from oil reservoirs. This resource contains a wide range of viral genomic diversity that is unique to oil reservoirs and encompasses diverse and previously uncharacterized viral groups. In addition, we preliminarily verified the potential impact of viruses on the sulfate reducing microbial subcommunity. These findings will contribute to a better understanding of the ecological role of viruses in the deep biosphere, such as regulating microbial mortality, structuring the microbial community, and impacting biogeochemical cycling (Supplementary Fig. 11).

Recent metagenomic and virome surveys have uncovered a range of viruses in both aquatic and terrestrial environments, significantly enhancing our understanding of virome diversity28,40. However, our knowledge of viral communities in oil reservoirs is still limited. Many of the viruses recovered from the oil reservoirs were novel with previously uncharacterized genomes. Moreover, the taxonomic annotation ratio of oil reservoir vOTUs was lower than that of vOTUs from deep-sea cold seep sediment41, hydrothermal vent42, soil40, acid mine drainage sediment29, and human gut43 environments. This low annotation proportion can primarily be attributed to the absence of complete genomes from viral isolates in oil reservoirs and associated environments. Collectively, our findings demonstrate the uniqueness of viruses in oil reservoirs, emphasizing the limited extent of our knowledge regarding viral diversity in this habitat.

Gene-sharing network analysis showed that viruses from oil reservoirs and other ecosystems formed separate cohesive clusters (Fig. 3a), suggesting that these viruses may possess unique metabolic genes to adapt to oil reservoir environments. Oil reservoirs are relatively independent and stable ecosystems that can be isolated for millions of years prior to industrial production practices, leading to the presence of unique microbiomes4. The co-evolution of viruses and microorganisms may lead to the distinct nature of oil reservoir viromes compared to those reported in other environments. This underscores the scarcity of research and datasets on viromes in oil reservoirs. In addition, we found that groundwater and acid mine drainage sediments shared more viral clusters with oil reservoirs than the other ecosystems did; the specific cause of this phenomenon remains elusive.

The associations between viral and host abundance have been described by the Kill-the-Winner and Piggyback-the-Winner theories34,44,45. Density- and frequency-dependent lytic Kill-the-Winner models predict that a high bacterial abundance is associated with a high rate of lytic infections, leading to increased VHRs45. On the other hand, Piggyback-the-Winner theory suggests that temperate viruses can protect their host cells from closely related viruses via superinfection exclusion. Thus, at high host densities, rather than killing their hosts, viruses might switch from a virulent to a temperate life cycle and replicate integrated into their host genomes, resulting in decreased VHRs44,45. This trend has been observed in various ecosystems, from soil to freshwater to human lungs46,47. In oil reservoirs, the associations between viral and host abundance also support the Piggyback-the-Winner theory. The ratio between virulent and temperate viruses decreased with increasing host density (Fig. 5g), suggesting that a temperate lifestyle is a more successful strategy for viral replication at high host densities in oil reservoirs. While this study has shed light on the relationship between viruses and their hosts within oil reservoirs, there may be biases in our virome analysis workflow. First, while the CRISPR spacer and tRNA matches enabled us to uncover numerous relationships between viruses and microorganisms, it is crucial to acknowledge that not all microorganisms possess CRISPR arrays, and CRISPR arrays may not be assembled and binned into MAGs. Thus, the full range of virus-host interactions in the oil reservoirs could not be examined. The viral binning method can offer a more effective approach for fragmented multi-contig viral assemblies, allowing for a more precise clustering of viral and microbial populations and facilitating the direct investigation of virus-host interactions48. Second, the VHRs are underestimated when relying solely on metagenomic data, which is a common problem in current research. To address this limitation in future studies, the absolute number of viruses in the filtrate can be quantified by introducing an internal reference (a certain biomass of known viruses) before filtration to quantify the loss after filtration.

SRMs are a significant microbial group within oil reservoirs36. Our association analysis showed a significant correlation between hosts capable of dissimilatory sulfate reduction and viruses. To further validate this conclusion, we conducted a microcosm experiment and found that viruses decreased sulfide production by reducing the abundance of genes involved in dissimilatory sulfate reduction. Additionally, we observed that the sulfate reducing microbial subcommunity followed the ‘kill the winner’ model. In microcosms with a high initial count of virus-like particles, the relative abundance of dominant SRMs decreased due to viral infection. This release of niche space fostered the growth of other, lower-abundance SRMs. In this way, viruses regulate the diversity and structure of the sulfate reducing microbial subcommunity. In summary, viruses not only inhibit the growth of SRMs but also shape the structure of the sulfate reducing microbial subcommunity. The current study has advanced our understanding of the role of viruses in the sulfur cycle within oil reservoirs. In addition to causing the souring of oil reservoirs, the hydrogen sulfide (H2S) produced by SRMs also influences the cost of oil production and the value of crude oil37. In future oil exploration and production, it may be possible to introduce viruses for the prevention and treatment of souring in oil reservoirs.

In conclusion, this study presents a viral catalogue of oil reservoirs, revealing the rich diversity of viruses in these environments. Furthermore, our study elucidates a wide array of host-virus interactions and provides evidence for the substantial impact of viruses on microbial sulfate reduction within oil reservoirs. These results shed light on the ecological roles of viruses and their hosts in oil reservoirs.

Methods

Collection of metagenomic data sets for oil reservoirs and geochemical analysis

A total of 59 oil reservoir production water samples were collected from five provinces across China. All samples were collected from the wellheads of each production well, where the oil and water mixture fluid was pumped out. Mixture fluid from wellheads was collected directly into clean and sterilized 5 L sampling bottles until the bottles were filled to exclude oxygen. All samples were placed in an icebox, transported to the laboratory immediately, and stored at 4 °C until DNA extraction. For better separation of oil and water from the mixture fluid, all bottles filled with production mixtures were allowed to stand with gravitational precipitation for 12 h at 4 °C. Subsequently, 500 mL of the water phase was collected from each sample for total microbial genomic DNA extraction. Microbial cells were obtained after filtering through 0.22-μm-pore-size polycarbonate membranes (45 mm diameter; Millipore, Bedford, MA, United States). The polycarbonate membranes with the collected microbial cells were cut into small pieces using sterile scissors and placed into sterile centrifuge tubes for DNA extraction (FastDNA® SPIN Kit for Soil, MP Biomedicals, USA), following the manufacturer’s instructions. Extracted DNA was used for library preparation with the NEB Next® Ultra II™ DNA Library Prep Kit for Illumina® (New England Biolabs, USA) and sequenced on the Illumina NovaSeq 6000 platform (150 bp, paired-end reads). In total, 380 billion raw sequence reads were retained for analysis (Supplementary Data 1). For the geochemical analysis, the concentrations of cations and anions in the produced waters were analyzed using an ion chromatograph (SHINE CIC-D160) with an SH-AC-3 column (for cation analysis) and an SH-CC-9 column (for anion analysis). In addition, 123 publicly available oil reservoir metagenome datasets were downloaded from the NCBI Sequence Read Archive (SRA) database in Nov 2021 (Supplementary Data 1).

Collection of virus-like particles (VLPs) and extraction of viral DNA

To estimate whether the metagenome derived viral genome catalogue could cover the viral communities in oil reservoirs, we collected virus-like particles (VLPs) and sequenced viral DNA from 8 additional oil reservoir samples. Briefly, after obtaining filtrates from the water phase, 300 mL of each filtrate was used for the extraction of VLPs. To obtain VLPs, the filtrates were first filtered through tangential flow filtration equipment (TFF) with a 100 kDa tangential flow membrane package (Sartorius, VIVAFLOW 50 100,000 MWCO, Germany), and then continuously concentrated until reaching an ~1 mL volume using 100 kDa centrifugal filter units (Amicon® Ultra-15, Ultracel-100K, Millipore, Germany). Then, 720 μL of concentrated solution was treated with 1000 U/mL DNase I (37 °C, 2 h) (Roche, China) before viral DNA extraction. Total viral DNA was extracted using a phenol-chloroform-isopentanol method. The extracted total viral DNA was first amplified using the REPLI-g Cell WGA & WTA Kit (Qiagen, Germany), and the products were then used to construct sequencing libraries using the NEB Next® Ultra II™ DNA Library Prep Kit for Illumina® (New England Biolabs, USA) following the manufacturer’s recommendations, and sequenced. In total, 74 million raw sequence reads were retained for analysis (Supplementary Data 1).

Processing of metagenomic sequence data and generation of prokaryotic metagenome-assembled genomes (MAGs)

The metagenomic raw reads were examined using FastQC v0.11.9 (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/); low-quality sequences, primers, and adaptors were trimmed using Trimmomatic v0.3949 (parameters: LEADING:2 TRAILING:2 SLIDINGWINDOW:4:20 MINLEN:50). The trimmed reads were independently assembled using MEGAHIT v1.2.950 (parameters: --presets meta-sensitive) and/or SPAdes v3.15.451 (parameters: --meta -k 21,33,55,77,99,127). For each assembly, contigs were binned using the binning module (parameters: --metabat2 --maxbin2 --concoct) and consolidated into a final bin set using the Bin_refinement module (parameters: -c 50 -x 10) within metaWRAP v1.2.152. All produced bin sets were aggregated and de-replicated at 95% average nucleotide identity (ANI) using dRep v3.2.253 (parameters: -comp 50 -con 10 -nc 0.30 -pa 0.9 -sa 0.95). The completeness and contamination of MAGs were assessed using the lineage-specific module within CheckM v1.1.354 with default parameters, yielding 4521 MAGs that met or exceeded the medium-quality thresholds (>50% completeness and <10% contamination), including 2708 estimated to be near-complete (>90% completeness and <5% contamination).

The taxonomy of each MAG was assigned using GTDB-Tk v1.5.055 based on the Genome Taxonomy Database (GTDB, http://gtdb.ecogenomic.org) taxonomy (release 202). The maximum-likelihood phylogenetic trees of MAGs were constructed based on a concatenated dataset of 400 universally conserved marker proteins using PhyloPhlAn v3.0.6456 and visualized using iTOL v557. RPKM (reads per kilobase per million mapped reads) values were used to represent the relative abundances of MAGs. To mitigate the impact of viral abundance on MAG abundance, we removed viral contigs from MAGs before determining RPKM values, which were calculated using CoverM v0.6.1 (https://github.com/wwood/CoverM) (parameters: coverm genome --min-read-percent-identity 0.95 --min-read-aligned-percent 0.75 --contig-end-exclusion 0 -m rpkm).
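
As an illustration of the RPKM normalization named above, the following minimal Python sketch computes the value from its textbook definition. The function name and the example numbers are hypothetical, and the exact handling of read filtering and contig ends inside CoverM is not reproduced here.

```python
def rpkm(mapped_reads, genome_length_bp, total_mapped_reads):
    """Reads per kilobase per million mapped reads (textbook definition).

    mapped_reads: reads mapped to one genome (MAG or vOTU)
    genome_length_bp: length of that genome in base pairs
    total_mapped_reads: all mapped reads in the sample
    """
    if genome_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total read count must be positive")
    kilobases = genome_length_bp / 1_000
    millions_of_reads = total_mapped_reads / 1_000_000
    return mapped_reads / kilobases / millions_of_reads


# Hypothetical example: 500 reads on a 2 Mb genome in a sample of 10 million mapped reads.
example_value = rpkm(500, 2_000_000, 10_000_000)
```

Normalizing by both genome length and sequencing depth is what allows abundances to be compared across genomes of different sizes and samples of different depths.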

Functional annotation of MAGs and phylogenetic analysis of the DsrAB proteins

Open reading frames (ORFs) of these MAGs were predicted with Prodigal v2.6.358 (parameters: -m meta). The predicted ORFs were annotated using eggNOG-mapper v2.0.159 and the eggNOG Orthologous Groups database (version 5.0)60 in diamond mode. Annotated KO numbers were used to infer the pathways encoded in each MAG. MAGs that encoded a complete pathway for sulfur metabolism were used for further analysis. In addition, we used the CANT-HYD61 database with the --cut_nc noise cut-off to annotate genes involved in hydrocarbon degradation (Supplementary Data 7).

For phylogenetic analysis, DsrAB sequences from the MAGs and previously reported reference DsrAB sequences11,62 were used (Supplementary Fig. 7a, b, Supplementary Fig. 9), which helped to distinguish reductive- and oxidative-type DsrAB. The DsrAB sequences were aligned using MUSCLE v3.863 with default parameters. The alignments were then filtered using TrimAL v1.464 (parameters: -cons 50). The concatenated DsrAB tree was constructed using RAxML65 (parameters: -f a -m PROTGAMMAIJTT -p 12345 -x 12345 -N 100). The Newick files with the best tree topology were visualized using iTOL v557.

Viral contig identification, dereplication, virus operational taxonomic unit (vOTU) clustering, and calculation of abundances

Viral contigs were recovered from assembled contigs using VirSorter v2.166 and DeepVirFinder v1.067. Only viral contigs ≥10 kb that met one of the following criteria were retained: (1) viral contigs identified only by VirSorter v2.166 (parameters: --exclude-lt2gene); (2) viral contigs identified only by DeepVirFinder v1.067 (parameters: score ≥ 0.9 and p < 0.05); (3) viral contigs identified by both VirSorter v2.166 and DeepVirFinder v1.067. The identified viral contigs from each sample were clustered into virus operational taxonomic units (vOTUs) at 95% average nucleotide identity (ANI) and 85% alignment fraction of the smallest scaffolds using the scripts (https://bitbucket.org/berkeleylab/checkv/src/master/) provided in CheckV v0.8.168. Next, provirus boundaries were identified in the vOTUs and host contamination was removed using CheckV v0.8.168. RPKM values were used to represent the relative abundances of vOTUs. The RPKM values of vOTUs were calculated using CoverM v0.6.1 (parameters: coverm contig --min-read-percent-identity 0.95 --min-read-aligned-percent 0.75 --contig-end-exclusion 0 -m rpkm). Viral lifestyle was predicted by both VIBRANT v1.2.169 and CheckV v0.8.168, while the remaining vOTUs with at least 90% completeness that displayed no prophage signals or lysogeny-specific genes were considered potential virulent viruses28,33,70,71. In addition, to determine overlaps between oil reservoir vOTUs and viral contigs in the IMG/VR v3 dataset, we used rapid genome clustering to identify vOTUs that share 95% average nucleotide identity (ANI) and 85% alignment fraction of the smallest scaffolds with viral contigs in the IMG/VR v3 dataset based on the scripts provided in CheckV v0.8.1.
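
The retention rule for viral contigs can be sketched as a single predicate. This is a minimal illustration assuming that each contig already carries a yes/no VirSorter2 call and, optionally, a DeepVirFinder score and p-value; the function and argument names are hypothetical and are not part of either tool.

```python
def keep_viral_contig(length_bp, virsorter_hit, dvf_score=None, dvf_pvalue=None):
    """Return True if a contig passes the length and detection criteria.

    A contig is kept when it is at least 10 kb long and is called viral by
    VirSorter2, by DeepVirFinder (score >= 0.9 and p < 0.05), or by both.
    """
    dvf_hit = (
        dvf_score is not None
        and dvf_pvalue is not None
        and dvf_score >= 0.9
        and dvf_pvalue < 0.05
    )
    return length_bp >= 10_000 and (virsorter_hit or dvf_hit)


# Hypothetical contigs: (length, VirSorter2 call, DeepVirFinder score, p-value)
candidates = [
    (12_500, True, None, None),    # VirSorter2 only
    (15_000, False, 0.95, 0.01),   # DeepVirFinder only
    (8_000, True, 0.99, 0.001),    # too short
    (20_000, False, 0.95, 0.20),   # DeepVirFinder p-value too high
]
retained = [c for c in candidates if keep_viral_contig(*c)]
```

Criteria (1) to (3) reduce to a logical OR of the two detectors, applied after the 10 kb length cut-off.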

Viral taxonomic assignments, viral function annotation, and identification of auxiliary metabolic genes (vAMGs)

Open reading frames (ORFs) of vOTUs were predicted with Prodigal v2.6.3 (-p meta -g 11 -f gff). To understand the taxonomy of vOTUs, as suggested previously72, we used PhaGCN2.073 and geNomad v1.974 with the ICTV classification to explore the taxonomic affiliation of vOTUs at the family level. The results from these two tools were considered; for a given genome, (1) it was assigned as ‘unclassified’ if both tools failed to assign it, or it was assigned to different taxa, and (2) it was assigned to the taxonomic level determined by one of the tools if the other failed to assign. Detailed information of the vOTUs quality and taxonomy is provided in Supplementary Data 2. Additionally, to understand the function of vOTUs, the predicted viral proteins were first merged and dereplicated using CD-HIT v4.775 (parameters: -c 0.90 -s 0.8 -n 5 -g 1 -d 0). The dereplicated viral proteins were assigned to the eggNOG Orthologous Groups database (version 5.0) using eggNOG-mapper v2.0.1 (-m diamond) to identify COG functional classifications.
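
The rule for reconciling the two taxonomic classifiers can be written as a small function. The sketch below assumes that each tool's family-level result has been reduced to a string, or to None when the tool made no assignment; the function name and the example family names are illustrative only.

```python
def consensus_family(phagcn_family, genomad_family):
    """Combine two family-level assignments into one label.

    'unclassified' if both tools fail or if they disagree; otherwise the
    family reported by whichever tool made an assignment.
    """
    if phagcn_family is None and genomad_family is None:
        return "unclassified"
    if phagcn_family is None:
        return genomad_family
    if genomad_family is None:
        return phagcn_family
    if phagcn_family == genomad_family:
        return phagcn_family
    return "unclassified"


# Hypothetical assignments for four vOTUs.
examples = [
    consensus_family(None, None),
    consensus_family("Peduoviridae", None),
    consensus_family("Peduoviridae", "Peduoviridae"),
    consensus_family("Peduoviridae", "Kyanoviridae"),
]
```

Treating disagreement as 'unclassified' is the conservative choice: it trades annotation rate for fewer conflicting labels.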

Moreover, in order to identify putative viral auxiliary metabolic genes (AMGs) that may have a role in host metabolism during the infection cycle, we used DRAM-v v1.3.576 to recover putative AMGs from all viruses (n = 6,218). Because DRAM-v requires VirSorter output, we re-ran 6,218 viruses through VirSorter v2.1. For DRAM-v output, as suggested previously28,77,78, to be conservative, we manually scanned the annotation output to improve the confidence in AMG identification, in particular, only putative AMGs with an auxiliary score <4 were retained, and no viral flag (F), transposon flag (T), viral-like peptidase (P), or attachment flag (A) could be present, and putative AMGs that did not have a gene ID or a gene description were also discarded77. In addition, putative AMGs predicted to be involved in organic nitrogen, nucleotide metabolism, and predicted to be glycosyl transferases or ribosomal proteins were removed from downstream consideration, because some viruses can encode their own glycosyl transferases79. Moreover, to avoid false-positive results for selected AMGs caused by possible contamination from host sequences, we searched for the presence of viral hallmark genes or virus-like genes upstream and downstream of the putative AMGs, only putative AMGs located between or alongside two viral hallmark genes or virus-like genes were considered high-confidence viral AMGs and used for further analysis. Finally, genome maps for six viruses encoding AMGs were visualized based on DRAM-v and VirSorter2 annotations, protein structural homology searches were performed using the Phyre2 web portal80 with a confidence threshold of = 98%, coverage threshold of >80%, and identity threshold of >30%.

Statistical analysis of viral community

All statistical analyses were performed in R version 4.0.081. No statistical method was used to predetermine sample size, and no data were excluded from the analyses. The world map was generated with the function map_data(“world”) in the “ggplot2” package82. All histogram plots were created using the geom_histogram function in the “ggplot2” package, all bar plots were created using the geom_bar function in the “ggplot2” package, and all pie plots were created using the pie function. Alpha and beta diversity of viral communities were calculated using the “vegan” v2.5-7 package81,83. Nonmetric multidimensional scaling (NMDS) was conducted based on Bray-Curtis dissimilarities generated from vOTU tables with viral abundances (RPKM) using the “vegdist” function. To test for significant differences in viral community composition between continents, an analysis of similarities (ANOSIM) was performed using the “anosim” function. In addition, to understand the distance-decay relationships of the viral community, pairwise geographic distances between samples were calculated from the latitude and longitude coordinates using the “geosphere” library, and the relationships between geographic distances and viral community similarities (1 − Bray-Curtis dissimilarity) were estimated by ordinary least-squares regressions.
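
The distance-decay calculation can be illustrated in a few lines of Python. The sketch assumes abundance vectors on a shared vOTU order, a haversine great-circle distance with a mean Earth radius of 6371 km (the distance function and radius actually used with “geosphere” are not stated), and a plain least-squares line; all function names are hypothetical, and the analysis itself was run in R.

```python
import math


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    total = sum(a) + sum(b)
    if total == 0:
        raise ValueError("both samples are empty")
    return sum(abs(x - y) for x, y in zip(a, b)) / total


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km (radius is an assumed mean Earth radius)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(h))


def ols_fit(x, y):
    """Slope and intercept of an ordinary least-squares line y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def distance_decay(abundances, coords):
    """Fit similarity (1 - Bray-Curtis) against distance over all sample pairs."""
    dists, sims = [], []
    for i in range(len(abundances)):
        for j in range(i + 1, len(abundances)):
            dists.append(haversine_km(*coords[i], *coords[j]))
            sims.append(1 - bray_curtis(abundances[i], abundances[j]))
    return ols_fit(dists, sims)
```

A negative slope from distance_decay indicates that community similarity declines as samples lie farther apart.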

To determine the contribution of different ecological processes to community assembly, null model analysis was carried out using the framework described by Stegen et al.84. The null model expectation was generated using 999 randomizations in R. Two metrics, the β-nearest taxon index (βNTI) and the Bray-Curtis-based Raup-Crick metric (RCBray), were calculated to divide community assembly into five processes, namely homogeneous selection, variable (heterogeneous) selection, dispersal limitation, homogenizing dispersal, and drift. βNTI > 2 indicates variable selection, and βNTI < −2 indicates homogeneous selection. |βNTI| < 2 and RCBray < −0.95 indicate homogenizing dispersal, |βNTI| < 2 and RCBray > 0.95 indicate dispersal limitation, and |βNTI| < 2 and |RCBray| < 0.95 indicate drift84. Heatmaps were created using the geom_raster function in the “ggplot2” package.
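
The thresholds above translate directly into a decision rule for each pairwise comparison. The following Python sketch is illustrative only; the function name is hypothetical, and values falling exactly on a threshold, which the text does not address, are assigned here to the non-selection and drift branches as an assumption.

```python
def assembly_process(beta_nti, rc_bray):
    """Assign one pairwise comparison to an assembly process.

    Thresholds follow the text: |betaNTI| > 2 means selection; otherwise
    RCBray separates dispersal effects from drift. Boundary values are
    treated as non-selection / drift (an assumption, not from the source).
    """
    if beta_nti > 2:
        return "variable selection"
    if beta_nti < -2:
        return "homogeneous selection"
    if rc_bray < -0.95:
        return "homogenizing dispersal"
    if rc_bray > 0.95:
        return "dispersal limitation"
    return "drift"


def process_fractions(pairs):
    """Fraction of pairwise comparisons assigned to each process."""
    counts = {}
    for beta_nti, rc_bray in pairs:
        label = assembly_process(beta_nti, rc_bray)
        counts[label] = counts.get(label, 0) + 1
    return {label: n / len(pairs) for label, n in counts.items()}


# Hypothetical (betaNTI, RCBray) values for four sample pairs.
fractions = process_fractions([(2.7, 0.1), (-3.1, 0.0), (0.4, 0.99), (0.4, 0.2)])
```

Checking βNTI first mirrors the framework's logic: RCBray is only interpreted when phylogenetic turnover does not deviate from the null expectation.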

Moreover, we used the 59 newly sequenced metagenomes from this work, which had sufficient environmental data, to perform Mantel tests and multiple regression analysis using the multiple regression on matrices (MRM) approach. Mantel tests with 9999 permutations in the “vegan” package were used to examine the correlations between viral community similarity and geographic distance or environmental distance (Spearman correlation). Because there was strong collinearity among particular environmental factors, before applying MRM, we used variable clustering to assess the redundancy of the environmental variables with the ‘varclus’ function in the “Hmisc” package. The variables with higher correlation (Spearman’s ρ > 0.5) were removed from the MRM analysis (including Cl, Na+, Ca2+, depth), and all other variables were kept in the models.

Network analysis

Protein-sharing network analysis of vOTUs was performed by vConTACT v.2.030. Briefly, all vOTUs of the oil reservoirs were compared to vOTUs (≥10 kb) from other ecosystems in previously published data: (1) Wetland sediment (n = 1,075); (2) Stordalen thawing permafrost (n = 1,682)17; (3) Acid mine drainage sediments (n = 5,184)29; (4) Cold seeps (n = 2,490)28; (5) Minnesota peat (n = 3,566)77; and from other ecosystems in IMG/VR v385: (6) Hydrothermal vents (n = 531); (7) Groundwater (n = 1,372); (8) Non-marine Saline and Alkaline (n = 1,756); (9) Thermal springs (n = 268). For each vOTU, ORFs were called using Prodigal v2.6.358, and the predicted protein sequences were used as input for vConTACT v2.0. The protein sequences of the vOTUs were grouped into protein clusters (PCs) using vConTACT v2.0 (parameters: --rel-mode Diamond --pcs-mode MCL --vcs-mode ClusterONE). The degree of similarity between the vOTUs was calculated based on the number of shared PCs. The networks were visualized by Cytoscape v3.5.186 (http://cytoscape.org) using an applied preferred layout model. Additionally, all vOTUs of the oil reservoirs were also compared to the NCBI Prokaryotic Viral RefSeq v201 database using vConTACT v2.0.

Virus-host prediction

Two different in silico methods were used to predict virus-host interactions. (1) tRNA match. ARAGORN v 1.2.3887 was used to identify tRNAs from sequences of vOTUs (parameters: -t). Identified tRNAs were compared to metagenomic contigs using fuzznuc88 from the EMBOSS:6.6.0.0 package with no mismatches allowed. (2) CRISPR spacer match. CRISPR spacers were recovered from metagenomic contigs using metaCRT (modified from CRT1.2)29,89 with default parameters. Extracted spacers were compared to vOTUs using fuzznuc from the EMBOSS:6.6.0.0 package with no mismatches allowed90. Detailed information of the virus-host interactions is provided in Supplementary Data 3.
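
The exact-match criterion for CRISPR spacers can be illustrated with plain string search. This minimal Python sketch is not a reimplementation of fuzznuc: it assumes sequences are already loaded into memory, and it searches both strands, which is an assumption because the strand handling used in the study is not stated. All names and sequences are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def exact_spacer_links(spacers, votus):
    """Link hosts to vOTUs through spacers that match with no mismatches.

    spacers: iterable of (host_id, spacer_sequence)
    votus: dict mapping votu_id to its nucleotide sequence
    Both strands are searched (an assumption of this sketch).
    """
    links = set()
    for host_id, spacer in spacers:
        forward = spacer.upper()
        reverse = reverse_complement(forward)
        for votu_id, sequence in votus.items():
            target = sequence.upper()
            if forward in target or reverse in target:
                links.add((host_id, votu_id))
    return links


# Hypothetical toy data: one spacer, three vOTU sequences.
toy_links = exact_spacer_links(
    [("MAG1", "ACGTTG")],
    {"vOTU_a": "TTTACGTTGAAA", "vOTU_b": "GGGCAACGTCCC", "vOTU_c": "AAAAAAA"},
)
```

Requiring zero mismatches keeps false links rare, at the cost of missing hosts whose spacers target diverged viral variants.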

Microcosm experiments setup

To further study the viral impact on sulfate reduction, microcosm experiments were performed. We used the production water samples collected from the Huabei Oilfields (Block B51-11, 38 °C, primary recovery) as seed banks, and constructed two types of microcosms with different initial counts of virus-like particles (VLPs), namely low-VLPs microcosms and high-VLPs microcosms. As shown in Supplementary Fig. 12, approximately 500 mL of production water was filtered through sterile gauze to remove crude oil and obtain the fraction containing microbes and viruses. The filtrate was then divided into two parts. One part was centrifuged (3500 g at 4 °C for 40 min) to obtain microbial pellets. The other part was filtered through a 0.22-μm-pore-size polycarbonate membrane (45 mm diameter; Millipore, Bedford, MA, United States) to obtain virus-rich filtrate. Afterward, part of the 0.22 μm filtrate was filtered through tangential flow filtration equipment (TFF) with a 100 kDa tangential flow membrane package (Sartorius, VIVAFLOW 50 100,000 MWCO, Germany) to obtain virus-poor filtrate. At this point, the basic components required for the microcosms were acquired.

The two types of microcosms were set up using the above-mentioned basic components as described below. In brief, all microcosms were set up in clean and sterilized serum bottles (internal volume 600 mL) containing 2 g sterilized crude oil, 30 mL autoclaved Brackish medium, and 300 mL virus-rich filtrate or virus-poor filtrate. Amendment with this amount of crude oil (0.6% w/v) follows previous incubations that used a similar concentration91. Specifically, in high-VLPs microcosms, microbial pellets were diluted three times with virus-rich filtrate (Microbes+Viruses). Similarly, in low-VLPs microcosms, microbial pellets were diluted three times with virus-poor filtrate (Microbes−Viruses). As a result, the high-VLPs microcosms featured a higher initial count of VLPs than the low-VLPs microcosms. The Brackish medium contained (g/L): NaCl, 7.0; MgCl2·6H2O, 1.20; CaCl2·2H2O, 0.10; NH4Cl, 0.25; KH2PO4, 0.20; and KCl, 0.5. The pH of the Brackish medium was adjusted to 7.0 to 7.3 by addition of HCl (2 M) or NaOH solution (1 M). The Brackish medium was then autoclaved at 121 °C for 30 min. After cooling to room temperature, the following sterile solutions were added: 330 μL of trace elements stock solution, 990 μL of vitamin stock solution, 330 μL of selenite-tungstate solution, 9.9 mL of NaHCO3 solution (84 g/L), and 330 μL of resazurin solution (1 g/L). The trace elements stock solution contained (g/L): FeSO4·7H2O, 2.10; H3BO3, 0.03; MnCl2·4H2O, 0.10; CoCl2·6H2O, 0.19; NiCl2·6H2O, 0.024; CuCl2·2H2O, 0.002; ZnSO4·7H2O, 0.14; Na2MoO4·2H2O, 0.036; and 12.5 mL HCl (25%, v/v). The vitamin stock solution contained (g/L): p-aminobenzoic acid, 0.04; D (+)-biotin, 0.01; nicotinic acid, 0.1; Ca-D (+)-pantothenate, 0.05; pyridoxine·HCl, 0.15; vitamin B12, 0.005; and thiamine·HCl, 0.10. The selenite-tungstate solution contained (g/L): NaOH, 0.4; Na2SeO3·5H2O, 0.006; and Na2WO4·2H2O, 0.008.

All microcosms were sealed with a butyl rubber stopper and an aluminum seal. To create anaerobic conditions, high-purity nitrogen gas (99.99%) was introduced into each microcosm using a sterile needle and continuously passed through for 1.5 h. Additionally, a small amount of Na2S solution (48 g/L) was added to each microcosm until the color of the indicator in the bottle changed from pink to colorless. Both high-VLPs and low-VLPs microcosms were supplemented with sulfate to a final concentration of 4 g/L at the beginning of the culture. All treatments were set up in triplicate and incubated at 30 °C in the dark. During cultivation, the concentration of sulfides was measured using the Total Sulfide Quantification Kit (LIANHUA, China). Detailed information on sulfide production is provided in Supplementary Data 4.

DNA extraction, library construction, and sequencing of microcosms

After 160 days (T1) and 570 days (T2) of incubation, 330 mL of each culture was filtered through 0.22-μm-pore-size polycarbonate membranes (45 mm diameter; Millipore, Bedford, MA, United States) to obtain microbial cells. Total microbial DNA was extracted and sequenced as described above. Metagenomes of all microcosms were analyzed as described above.

Statistical analyses of microcosm experiments

All statistical analyses were performed in R version 4.0.081,83. Differences in H2S production between microcosms were assessed using a two-tailed t-test. The alpha diversity metrics, including richness and Shannon (Shannon-Wiener) diversity, were calculated using the ‘vegan’ and ‘picante’ packages. For beta diversity, principal coordinates analysis (PCoA) based on Bray-Curtis distance and an analysis of similarities (ANOSIM) were performed in the ‘vegan’ package. The differential enrichment of KEGG modules related to energy metabolism and hydrocarbon degradation was identified according to their reporter scores and reporter features. The final set of differentially enriched KEGG modules was determined as the union of the KEGG modules based on reporter scores and reporter features. Reporter scores and reporter features were calculated as described previously92,93. To identify significant changes in gene abundance between microcosms, KEGG homologs were combined and summed according to the gene cluster table of non-redundant gene sets. We then tested for differences in KO abundance using a two-tailed Student’s t-test.
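
For readers unfamiliar with the test used for the triplicate comparisons, the following Python sketch computes the classical Student's t statistic with a pooled variance. Whether equal variances were assumed in the R analysis is not stated, so the pooled form is an assumption; the function name and the replicate values are hypothetical, and the two-tailed P value would be obtained from a t distribution with the returned degrees of freedom using a statistics library.

```python
import math
import statistics


def student_t(sample_a, sample_b):
    """Pooled-variance Student's t statistic and its degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two replicates")
    df = n_a + n_b - 2
    pooled_var = (
        (n_a - 1) * statistics.variance(sample_a)
        + (n_b - 1) * statistics.variance(sample_b)
    ) / df
    if pooled_var == 0:
        raise ValueError("no within-group variance; t is undefined")
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t_stat = (statistics.mean(sample_a) - statistics.mean(sample_b)) / std_err
    return t_stat, df


# Hypothetical triplicate measurements for two treatments (arbitrary units).
t_stat, df = student_t([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```

With three replicates per treatment the test has only four degrees of freedom, so it detects only fairly large differences between groups.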