Introduction

The candidate phyla radiation (CPR), a recently discovered bacterial lineage, constitutes a substantial fraction of all bacterial diversity1,2. CPR organisms are widely distributed, but current knowledge of them is limited because pure isolate cultures are difficult to obtain3. Recent advances in cultivation-free methods, such as genome-resolved metagenomics and single-cell sequencing, have greatly expanded the known diversity of this radiation4,5,6. CPR members were initially recognized as a monophyletic group within the bacterial domain and were classified under the superphylum Patescibacteria according to the Genome Taxonomy Database (GTDB)7. Recent studies have further subdivided CPR bacteria into at least 80 potentially phylum-level lineages to achieve improved resolution and establish distinct names for each lineage8,9,10. A growing number of metagenomic surveys have detected CPR members in diverse habitats, such as drinking water11, sediments12, soils13, and the human oral cavity3, and CPR organisms are known to be dominant in groundwater environments5.

Disentangling the diversity and function of CPR members is crucial for understanding the biogeochemical roles of CPR lineages in groundwater ecosystems. Previous research has indicated that CPR can be differentiated based on specific host preferences or hydrogeochemical parameters14. For instance, Ca. Uhrbacteria could establish commensal or mutualistic relationships with Ignavibacteria and Betaproteobacteria in agriculturally impacted groundwater10, whereas Patescibacteria might exhibit preferred associations with Omnitrophota, Bacteroidota, and Nitrospirota in oxic and anoxic groundwaters15. Numerous metagenomic studies have demonstrated that the relative abundance and dominance of CPR members can be influenced by environmental fluctuations, such as acetate stimulation in groundwater adjacent to the Colorado River16 and shifts between reducing and oxidizing conditions in groundwater in the Ohio Department11. Importantly, different CPR members have been reported to potentially participate in various nutrient cycling processes, including carbon transformation, denitrification, and sulfate reduction10,11,17. However, the community composition and metabolic functions of CPR bacteria in brackish-saline groundwater remain largely unknown.

The majority of CPR members exhibit small genomes (~0.5–1.5 Mbp) with limited biosynthetic capabilities10,18. Some studies have indicated that CPR bacteria can evolve specific functions while retaining essential functions for growth and reproduction to adapt to groundwater environments5. Insights from metabolic reconstructions suggest that CPR bacteria may play essential roles in consuming carbon compounds derived from plants and bacteria but rely on extracting energy and essential biomolecules (e.g. amino acids, vitamins, and nucleotides) from potential host cells9,19,20. These findings suggest that CPR bacteria possibly exhibit host-dependent lifestyles in various environments, enabling them to obtain nutrients and energy sources from their hosts2,9,21. While it has been shown that the overall bacterial community can vary in composition and metabolic capacity across salinity gradients in groundwater systems22,23,24, the interactions between host-dependent CPR bacteria and their potential bacterial hosts, as well as the metabolic roles that CPR bacteria and their host partners may play in brackish-saline groundwater, remain poorly characterized.

Here, we successfully recovered 399 medium-high quality CPR genomes from 66 brackish-saline groundwater samples collected from the Aquifer beneath Tang-he Wastewater Reservoir (ATWR) (Supplementary Fig. 1 in Supplementary File 1). Based on previous studies on CPR bacteria9,10, we categorized these 399 CPR genomes into 44 previously proposed phyla and 8 candidate novel phyla. We have provided the first integrated biogeography pattern of CPR members and unveiled the crucial contributions of CPR bacteria to maintaining the stability of the microbial community in brackish-saline groundwater. Furthermore, we have predicted the metabolic capacities and ecological roles of CPR lineages by characterizing possible metabolic interconnections with their co-occurring organisms. These findings highlighted the essential roles played by CPR bacteria and their co-varying partners in coping with environmental stressors and driving elemental cycles within complex groundwater systems.

Methods

Groundwater sampling and hydrochemistry measurement

A total of 66 groundwater samples were collected in wet and dry seasons from 39 newly constructed monitoring wells of the Aquifer beneath Tang-he Wastewater Reservoir (ATWR) (38°47′43′′–38°48′58′′N, 115°40′11′′–115°50′33′′E), Hebei Province, China (Supplementary Fig. 1). The detailed sampling sites are provided in Supplementary Table 1, and the surface geology of the study area was described in our previous work25. Groundwater in this area was mainly derived from atmospheric precipitation and was dominated by the Na–Cl–SO4 type23,25. All the samples were collected after purging the well volume and flushing the system for more than 30 min. Physicochemical parameters including pH, conductivity (COND), water temperature (T), and dissolved oxygen (DO) were measured in situ. To collect the biomass, 3000–12,000 L of groundwater was pumped and filtered through 0.01-μm hollow fiber membranes (Toray, Japan). Filters were transported to the laboratory on dry ice and stored at −80 °C until processed. Meanwhile, 10-L well-mixed groundwater samples were collected and stored at 4 °C for measuring physicochemical parameters, including total dissolved solids (TDS), chemical oxygen demand (COD), ammonium nitrogen (NH4+-N), nitrate nitrogen (NO3-N), total nitrogen (TN), chloride (Cl), fluorine (F), sulfate (SO42−), and total organic carbon (TOC), according to the Environmental Quality Standards for Surface Water (GB3838-2002) recommended by the Ministry of Ecology and Environment of China25,26. The sulfur isotopic composition (δ34S–SO4) of groundwater was analyzed on a Thermo Scientific Delta V Plus IRMS with a Flash elemental analyzer after pretreatment, in which sulfate was precipitated as barite with saturated BaCl2 and mixed with excess V2O5 for in-line combustion25.
Metals including sodium (Na), magnesium (Mg), calcium (Ca), potassium (K), aluminum (Al), iron (Fe), titanium (Ti), manganese (Mn), zinc (Zn), barium (Ba), copper (Cu), lead (Pb), arsenic (As), nickel (Ni), tin (Sn), lithium (Li), chromium (Cr), selenium (Se), cobalt (Co), cadmium (Cd), antimony (Sb), beryllium (Be), thallium (Tl), and molybdenum (Mo) were determined using ICP–MS (X Series II, Thermo Fisher Scientific, USA) and ICP-OES (Prodigy, Leeman, USA), as described in our previous studies25,27.

DNA extraction and sequencing

The biomass on the filters was captured using ultra-sonication and centrifugation. Genomic DNA was extracted using the FastDNA Spin Kit Soil (MP Biomedicals, USA) following the manufacturer’s protocol. Genomic DNA was quantified using a NanoDrop Spectrophotometer (NanoDrop Technologies Inc., Wilmington, DE, USA). The metagenomic sequencing was performed on an Illumina Hiseq 3000/4000 platform (Majorbio Company in Shanghai, China), resulting in about 30 Gbp raw reads per sample.

Metagenomic assembly and binning

SeqPrep v1.2 (https://github.com/jstjohn/SeqPrep) was used to remove Illumina adapters. Raw reads were trimmed using Sickle v1.33 (https://github.com/najoshi/sickle) with a quality threshold of 20 and a minimum length of 50 bp (‘-q 20 -l 50’). Further quality filtering was carried out using fastp v0.21.028 with the parameters ‘-l 140 -t 1 -c’ to remove reads shorter than 140 bp, trim the last base of every read, and correct bases in overlapping regions of paired-end reads. Clean reads were assembled with MEGAHIT v1.2.9 using the k-mer parameters ‘--k-min 19 --k-max 139 --k-step 10’29. Assembled contigs over 200 bp were used for the downstream binning analysis by MetaBAT2 v2.12.1 with default parameters30.

Taxonomic assignments and abundance calculation for metagenome-assembled genomes (MAGs)

After binning, the completeness and contamination of MAGs were evaluated using CheckM v1.1.231, and the medium-high quality MAGs (completeness > 70%, contamination < 10%) were kept for further analysis5,10. MAG quality was further assessed using CheckM2 v1.0.232, which can provide accurate genome quality predictions specifically for CPR genomes. The results indicated that the majority of our CPR genomes exhibited completeness levels exceeding 90% (Supplementary Table 2). The preliminary taxonomic assignments for MAGs were performed using GTDB-Tk v1.3.07. The redundant MAGs were removed by dRep v3.2.033 through the ‘dereplicate’ workflow34. We also conducted digital DNA–DNA hybridization tests35 through the online service GGDC (https://ggdc.dsmz.de/ggdc.php)36 to further validate the robustness of the dereplication process performed by dRep (Supplementary Table 3).
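
The quality filter described above can be illustrated with a minimal Python sketch. The thresholds (completeness > 70%, contamination < 10%) come from the text; the `MagQuality` record, the function name, and the three example bins are hypothetical and do not reflect the CheckM output format.

```python
from dataclasses import dataclass


@dataclass
class MagQuality:
    """Hypothetical record holding quality estimates for one MAG."""
    name: str
    completeness: float   # percent
    contamination: float  # percent


def keep_medium_high_quality(mags, min_completeness=70.0, max_contamination=10.0):
    """Keep MAGs with completeness > 70% and contamination < 10% (strict bounds)."""
    return [
        m for m in mags
        if m.completeness > min_completeness and m.contamination < max_contamination
    ]


# Illustrative values only, not data from this study.
example_mags = [
    MagQuality("bin.1", 92.3, 1.1),
    MagQuality("bin.2", 65.0, 0.5),   # fails completeness
    MagQuality("bin.3", 88.0, 12.4),  # fails contamination
]
kept = keep_medium_high_quality(example_mags)
print([m.name for m in kept])
```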

To estimate the relative abundance of these MAGs, clean reads from each sample were first mapped to all non-redundant MAGs using BBMap v38.86 with parameters “minid = 0.95”. Then, the sequencing depth of each contig in each sample was calculated by the script ‘jgi_summarize_bam_contig_depths’ from MetaBAT2 v2.12.130 with default parameters (identity threshold = 97%). Subsequently, the sequencing depth of each MAG in each sample was calculated by the weighted arithmetic mean of the sequencing depths of all its contigs (weighting by the length of each contig). Finally, the relative abundance of a certain MAG in each sample was calculated as ‘sequencing depth of the MAG in this sample/total sequencing depths of all MAGs in this sample’.
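
The two-step calculation above (length-weighted mean depth per MAG, then normalization by the summed depth of all MAGs in the sample) can be sketched as follows. This is an illustration under stated assumptions, not the script used in the study: the input layout (a dictionary mapping each MAG to a list of `(contig_length, contig_depth)` pairs for one sample) and all function names are hypothetical.

```python
def mag_depth(contigs):
    """Length-weighted arithmetic mean depth of one MAG.

    contigs: list of (length_bp, depth) tuples for a single sample.
    """
    total_length = sum(length for length, _ in contigs)
    if total_length == 0:
        raise ValueError("MAG has no contigs with positive length")
    return sum(length * depth for length, depth in contigs) / total_length


def relative_abundance(mag_contigs):
    """Depth of each MAG divided by the summed depth of all MAGs in the sample."""
    depths = {mag: mag_depth(contigs) for mag, contigs in mag_contigs.items()}
    total = sum(depths.values())
    if total == 0:
        raise ValueError("no reads mapped to any MAG in this sample")
    return {mag: depth / total for mag, depth in depths.items()}


# Illustrative values only.
sample = {
    "MAG_A": [(1000, 10.0), (3000, 20.0)],  # (1000*10 + 3000*20) / 4000 = 17.5
    "MAG_B": [(2000, 52.5)],                # 52.5
}
abundance = relative_abundance(sample)
print(abundance)
```

Weighting by contig length keeps short, unevenly covered contigs from dominating the per-MAG estimate.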

CPR identification and a straightforward cluster analysis

A total of 399 MAGs annotated as “Patescibacteria” in the GTDB results were classified as CPR. According to previous studies of CPR bacteria9,10, 16 ribosomal proteins (L2, L3, L4, L5, L6, L14, L15, L16, L18, L22, L24, S3, S8, S10, S17, and S19) were chosen as phylogenetic markers. First, we retained 381 CPR MAGs that contained at least 8 single-copy genes from the 16 ribosomal protein genes for further phylogenetic analysis. Reference genomes were obtained from previously published studies to serve as the backbone9,10. The genes encoding the ribosomal proteins in MAGs and reference CPR genomes were identified using hmmsearch (HMMER v3.337) and extracted. Any multiple-copy ribosomal protein gene in MAGs was excluded. Amino acid sequences corresponding to each ribosomal protein were individually aligned using MAFFT v7.47138 with default parameters, followed by trimming and concatenation. Marker genes absent from a genome were filled with gaps during concatenation, as shown in Supplementary Table 4 of Supplementary File 2. The phylogeny of CPR MAGs was constructed using IQ-TREE v2.1.339 with ModelFinder assigning the best-fit ML model (LG + F + R10) according to the Bayesian information criterion (BIC), supported by 1000 ultrafast bootstrap replicates. Second, the remaining 18 CPR MAGs, which had fewer than 8 ribosomal proteins, were classified by comparing their average nucleotide identities to those of their relatives. Finally, the 399 CPR genomes obtained in this study were affiliated with 44 previously proposed phyla9,10 and 8 candidate novel phyla. Additional phylogenetic analyses with other bacterial phyla as the reference backbone (a selected subset of GTDB representative genomes from each bacterial phylum) were also conducted using a concatenation of 120 bacterial markers based on 5036 informative amino-acid sites under the best-fitting model LG + R10. Both the selection of these 120 markers and the concatenation were performed with GTDB-Tk7.
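
The marker-selection and concatenation rules described above can be sketched in Python. The sketch assumes that each marker has already been aligned and trimmed to a fixed column count; the input layout, the function names, and the toy sequences are hypothetical. It shows only the three rules stated in the text: multi-copy markers are discarded, genomes with fewer than 8 single-copy markers are set aside, and absent markers are padded with gaps.

```python
MARKERS = ["L2", "L3", "L4", "L5", "L6", "L14", "L15", "L16",
           "L18", "L22", "L24", "S3", "S8", "S10", "S17", "S19"]


def single_copy_markers(hits):
    """hits: marker -> list of aligned sequences found in one genome.

    Markers present in more than one copy are dropped.
    """
    return {m: seqs[0] for m, seqs in hits.items()
            if m in MARKERS and len(seqs) == 1}


def concatenate(genomes, alignment_lengths, min_markers=8):
    """Build one concatenated sequence per genome that passes the marker cutoff.

    genomes: genome name -> {marker: [aligned sequences]}
    alignment_lengths: marker -> number of columns in its trimmed alignment
    """
    concatenated = {}
    for name, hits in genomes.items():
        kept = single_copy_markers(hits)
        if len(kept) < min_markers:
            continue  # too few markers; such MAGs were classified by ANI instead
        concatenated[name] = "".join(
            kept.get(m, "-" * alignment_lengths[m]) for m in MARKERS
        )
    return concatenated


# Toy input: every alignment is 3 columns wide (illustrative only).
lengths = {m: 3 for m in MARKERS}
toy = {
    "g1": {**{m: ["MKV"] for m in MARKERS[:8]}, "L18": ["MKV", "MRV"]},
    "g2": {m: ["MKV"] for m in MARKERS[:7]},
}
result = concatenate(toy, lengths)
print(result)
```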

Functional annotation and metabolic prediction

Open reading frames of MAGs were predicted using prodigal v2.6.340 with the ‘-p meta’ parameter, and annotated using eggNOG-mapper v2.0.041 against the eggNOG database 5.0 with parameters ‘-m diamond --seed_ortholog_evalue 1e−5’. Ambiguous annotations or fusion genes were checked and adjusted manually by comparing the results of eggNOG and the online alignment results of the conserved domain database42. Meanwhile, the intron prediction was also conducted for each MAG. All the 16S rRNA genes in MAGs were predicted by barrnap from Prokka43 with default parameters and later manually curated. Then these 16S rRNA genes were searched against the online Rfam44 database to predict the introns inside these genes. Protein coding regions in these introns were also searched according to the abovementioned eggNOG annotation results.

The metabolic potentials of these CPR bacteria were predicted by examining the KO numbers in the eggNOG annotation results. These KO numbers were mapped to the pathways of the Kyoto Encyclopedia of Genes and Genomes (KEGG) database and then screened for key genes related to biogeochemical cycles (carbon, sulfur, nitrogen, and hydrogen), energy conservation, essential biomolecule syntheses, and other pathways of interest. Pathway completeness was calculated as the percentage of KO numbers present in each KEGG module (pathway). The metabolic reconstructions of MAGs were conducted manually by collectively summarizing the core pathways in the MAGs. The crucial metabolic pathways, KO numbers, and gene names used for the downstream analysis are provided in Supplementary Table 5.
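
Pathway completeness as defined above (the percentage of a module's KO numbers found in a genome) can be written as a short sketch. The identifiers below are placeholders rather than real KEGG entries, and the function treats a module as a flat set of KO numbers, which ignores the alternative enzymes that real KEGG module definitions allow.

```python
def module_completeness(module_kos, genome_kos):
    """Percentage of the module's KO numbers that are present in the genome."""
    module = set(module_kos)
    if not module:
        raise ValueError("module has no KO numbers")
    return 100.0 * len(module & set(genome_kos)) / len(module)


# Placeholder identifiers (hypothetical, not real KEGG KO numbers).
demo_module = ["KO_a", "KO_b", "KO_c", "KO_d"]
demo_genome = {"KO_a", "KO_b", "KO_d", "KO_x"}
completeness = module_completeness(demo_module, demo_genome)
print(completeness)  # 3 of 4 module KOs present -> 75.0
```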

Statistical analysis

Non-metric multidimensional scaling (NMDS) and analysis of similarity statistics (ANOSIM) were conducted to explore the differences in CPR members among the prior sampling groups using the vegan package in R45. One-way analysis of variance (one-way ANOVA) was also performed to test the significant differences of specific CPR at the phylum level among groups. The co-occurrence network between bacteria was visualized using Gephi46, based on statistically strong (|r| > 0.80) and significant (FDR-adjusted P < 0.01) Spearman’s correlations.
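
The edge-selection rule for the co-occurrence network (|r| > 0.80 and FDR-adjusted P < 0.01) can be sketched in plain Python. Two assumptions are made: the FDR adjustment is taken to be the Benjamini–Hochberg procedure, which the text does not specify, and the raw P values are supplied from a separate significance test that is not shown. All names and example values are hypothetical.

```python
def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return sxy / (sxx * syy) ** 0.5


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, i in enumerate(order):
        rank = m - position  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_edges(pairs, r_cut=0.80, q_cut=0.01):
    """pairs: list of (node_a, node_b, r, raw_p); keep strong, significant edges."""
    q = bh_adjust([pair[3] for pair in pairs])
    return [(a, b, r) for (a, b, r, _), qi in zip(pairs, q)
            if abs(r) > r_cut and qi < q_cut]


# Illustrative correlations and P values only.
pairs = [
    ("CPR_1", "Bac_1", 0.91, 0.0005),
    ("CPR_1", "Bac_2", 0.85, 0.02),    # strong but not significant after adjustment
    ("CPR_2", "Bac_1", -0.88, 0.001),
    ("CPR_2", "Bac_3", 0.40, 0.0001),  # significant but weak
]
edges = select_edges(pairs)
print(edges)
```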

The driving forces acting on CPR members were explored as follows. First, distance-decay patterns of microbial communities (Bray–Curtis distance matrices) along geographical and environmental distances were investigated using the Mantel test based on Spearman’s correlations with vegan. Second, random forest analysis was carried out to identify the importance of each environmental variable in structuring CPR members26. The significance of the models and cross-validated R2 values were determined using the A3 package (1000 permutations). Similarly, the significance of each predictor on CPR members was assessed based on the increase in the mean square error (IncMSE) using the rfPermute package. Afterward, relationships among seasonal factors, geographical factors, Co-CPR, and CPR were explored by a partial least-squares path model (PLS-PM) using the vegan package.
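
The distance-decay test can be illustrated with a self-contained sketch of a simple (not partial) Mantel test based on Spearman’s correlation. It is not the vegan implementation. The permutation P value uses the common convention (hits + 1) / (permutations + 1), the environmental distance is reduced to the absolute difference in a single variable, and the four-sample data set is invented for illustration.

```python
import random


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    total = sum(u) + sum(v)
    if total == 0:
        return 0.0  # convention chosen here for two empty samples
    return sum(abs(a - b) for a, b in zip(u, v)) / total


def pairwise(rows, dist):
    """Full symmetric distance matrix for a list of samples."""
    n = len(rows)
    return [[dist(rows[i], rows[j]) for j in range(n)] for i in range(n)]


def upper_triangle(matrix, order):
    """Flatten the upper triangle after reordering samples by `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def spearman(x, y):
    """Spearman correlation; assumes neither vector is constant."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / (sxx * syy) ** 0.5


def mantel(d1, d2, permutations=999, seed=0):
    """Return (Spearman Mantel r, one-sided permutation P value)."""
    rng = random.Random(seed)
    identity = list(range(len(d1)))
    fixed = upper_triangle(d2, identity)
    observed = spearman(upper_triangle(d1, identity), fixed)
    hits = 0
    for _ in range(permutations):
        order = identity[:]
        rng.shuffle(order)  # permute the sample labels of the first matrix
        if spearman(upper_triangle(d1, order), fixed) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Invented example: 4 samples x 3 taxa, plus one environmental variable.
communities = [[10, 0, 5], [8, 2, 5], [2, 8, 4], [0, 10, 3]]
tds = [500.0, 900.0, 3000.0, 6000.0]
community_dist = pairwise(communities, bray_curtis)
env_dist = pairwise(tds, lambda a, b: abs(a - b))
r, p = mantel(community_dist, env_dist)
print(round(r, 3), p)
```

With only four samples there are 24 possible label permutations, so the P value here is illustrative only.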

Results

Recovery of 399 CPR and 2007 non-CPR bacterial genomes

A total of 2406 medium-high quality non-redundant metagenome-assembled genomes (MAGs) (completeness > 70%, contamination < 10%, dereplicated at 99% average nucleotide identity) were reconstructed from 66 groundwater samples. The genome dataset included 399 CPR genomes and 2007 non-CPR bacterial genomes according to the preliminary taxonomy assignment by GTDB-Tk7 (Supplementary Tables 2 and 6 in Supplementary File 2). Using a phylogenetic analysis with 16 ribosomal protein genes as the marker set10, we further categorized the 399 CPR MAGs into 44 previously proposed CPR phyla9,10 and identified 8 candidate novel CPR branches (Fig. 1, alignments provided in Supplementary File 3, and detailed statistics of each protein gene provided in Supplementary Table 4). To solidify the phylogenetic placement of these 8 candidate novel CPR, we constructed an additional phylogenetic tree using 120 marker genes47 recommended by GTDB-Tk (Supplementary Fig. 2, alignments provided in Supplementary File 4), which consistently supported their classification as novel lineages. We provisionally designated these new phylum-level lineages as CNCPR1-8 (Fig. 1 and Supplementary Table 2). Further phylogenetic analysis, incorporating reference genomes from every bacterial order in GTDB (Supplementary Fig. 3, alignments provided in Supplementary File 5), supported the notion that CPR bacteria form a monophyletic group2,8,13,48. Notably, the 399 CPR MAGs exhibited small genome sizes, with an average of 868.6 ± 232.2 kbp (Fig. 1b). The genome sizes of most CPR MAGs were smaller than that of the smallest free-living Pelagibacter (1.3 Mbp)49, potentially indicating a symbiotic lifestyle for these CPR bacteria as previously reported9.

Fig. 1: Phylogenetic tree and genome size of candidate phyla radiation (CPR) bacteria.

a The maximum-likelihood tree was inferred from the concatenation of 16 ribosomal proteins and spans a dereplicated set of 381 strain-representative ATWR-MAGs (from a redundant set of 525 ATWR-MAGs) and 283 publicly available reference genomes. The eight phyla colored in red are candidate novel CPR (CNCPR) affiliated with the superphylum Parcubacteria. The numbers in rounded brackets “()” and curly brackets “{}” indicate the numbers of ATWR-MAGs before and after the strain-level de-replication (>99% ANI), respectively. b The size distribution of the obtained CPR MAGs from the ATWR.

We compared the co-occurrence networks of the entire bacterial community and of the subset excluding CPR bacteria, focusing on strong (Spearman |r| > 0.8) and significant (P-value < 0.01) correlations. The entire bacterial network had significantly more nodes and edges and a higher degree, but a lower average path length, graph density, clustering coefficient, and betweenness centralization, than the non-CPR network (Supplementary Fig. 4a). This suggested that CPR bacteria played a crucial role in maintaining the stability and complexity of the overall bacterial community in the face of environmental disturbances50,51. Notably, CPR bacteria, particularly those affiliated with Sungbacteria, Wildermuthbacteria, Nealsonbacteria, Daviesbacteria, Yanofskybacteria, and Uhrbacteria, had the highest average connection degree in the whole bacterial co-occurrence network, reinforcing their significant roles in maintaining community stability and complexity (Supplementary Fig. 4b). Compared with the non-CPR bacterial network, the co-occurrence network between CPR and non-CPR bacteria displayed a higher degree, graph density, and betweenness centralization despite having fewer nodes and edges (Supplementary Fig. 4a), implying that the complex network structure may enhance the resilience of CPR and Co-CPR (non-CPR bacteria that co-occurred with CPR) in their adaptation to environmental stresses50,51.
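
Several of the topological properties compared above follow directly from an edge list. The sketch below computes node count, edge count, average degree (2E/N), and graph density (2E/(N(N−1))) for an undirected network. It assumes unique edges without self-loops and derives the node set from the edges, so isolated nodes are not counted. The example edges are hypothetical.

```python
def network_summary(edges):
    """Basic topology of an undirected network given as (node_a, node_b) pairs."""
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    n, m = len(degree), len(edges)
    return {
        "nodes": n,
        "edges": m,
        "average_degree": 2 * m / n if n else 0.0,
        "density": 2 * m / (n * (n - 1)) if n > 1 else 0.0,
        "degree": degree,
    }


# Hypothetical CPR / non-CPR edges for illustration.
demo_edges = [("CPR_1", "Bac_1"), ("CPR_1", "Bac_2"), ("CPR_2", "Bac_1")]
summary = network_summary(demo_edges)
print(summary["nodes"], summary["edges"],
      summary["average_degree"], summary["density"])
```

Because density divides by N(N−1), a network with fewer nodes can be denser even when it has fewer edges, which is the pattern reported for the CPR and non-CPR network.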

We further identified positive associations between 99 CPR MAGs and 255 non-CPR bacterial MAGs (Fig. 2). Among these CPR members, Daviesbacteria, Nealsonbacteria, Yanofskybacteria, Wildermuthbacteria, CNCPR8, Sungbacteria, and Levybacteria displayed the highest connection degrees, with counts of 171, 102, 99, 97, 65, 56, and 48, respectively. Non-CPR bacterial phyla such as Omnitrophota (degree: 430), Chloroflexota (333), Desulfobacterota (151), Planctomycetota (26), Nitrospirota (23), Acidobacteriota (21), Bacteroidota (14), and Methylomirabilota (14) exhibited strong co-occurrence patterns with the CPR members. Specifically, CPR taxa belonging to Daviesbacteria and Nealsonbacteria tended to co-exist with microbes affiliated with Omnitrophota, Chloroflexota, and Nealsonbacteria, while CNCPR8 exhibited co-occurrence patterns with Omnitrophota, Desulfobacterota, and Chloroflexota. These bacteria that co-occurred with CPR (Co-CPR) could provide important guidance for identifying putative hosts of CPR in saline groundwater environments10.

Fig. 2: Co-occurrence pattern between CPR and non-CPR bacteria.

a Networks displaying significant correlations (Spearman’s r > 0.8, P < 0.01) between CPR and non-CPR bacteria. Each node represents one microbial species, and the node size is proportional to its connection degree. b Connection relationships between CPR and non-CPR bacteria. The inner and outer circles indicate CPR and non-CPR bacteria, respectively. Bar charts showing the connection degree of CPR (c) and non-CPR bacteria (d).

Variations in relative abundances of CPR bacteria along the hydrochemical gradient

Among the retrieved 52 phylum-level CPR lineages (Fig. 3a), Daviesbacteria, Peregrinibacteria, CNCPR4, and Harrisonbacteria showed notably high relative abundances. These appear to be the dominant CPR phyla within this groundwater environment, accounting for 3.02%, 2.58%, 1.60%, and 0.99% of the whole microbial communities, respectively. We further categorized the 66 groundwater samples into three groups based on their salinity levels. Group1 was characterized by the lowest levels (e.g., TDS: 93.85–956.31 mg/L, SO42−: 70.26–579.65 mg/L, Na: 88.85–386.58 mg/L), while samples in Group2 displayed moderate salinity (e.g., TDS: 173.72–4120.69 mg/L, SO42−: 92.75–2761.22 mg/L, Na: 154.73–1168.64 mg/L), and those in Group3 exhibited the highest salinity (e.g., TDS: 2108.60–9529.62 mg/L, SO42−: 1273.50–5724.38 mg/L, Na: 779.25–2023.03 mg/L) (Supplementary Figs. 5–7). The relative abundance and composition of CPR bacteria varied significantly among the three hydrochemical groups (Fig. 3b and Supplementary Fig. 8, ANOSIM R = 0.478, P = 0.001), whereas no significant difference was observed between wet and dry seasons (ANOSIM P > 0.05).

Fig. 3: Spatiotemporal distribution of CPR in the groundwater.

a Stacked bar chart of CPR phyla from all sampling stations in both wet and dry seasons. b Relative abundance of CPR members obtained from the three groundwater groups. “Group1”, “Group2”, and “Group3” indicate the groundwater samples categorized into the lowest, moderate, and highest salinity levels, respectively. c Circular visualization of the CPR phyla differentially distributed among the three groups.

The taxa responsible for the differentiation in CPR members among the three groups were further identified (Fig. 3c and Supplementary Fig. 9). In the low-salinity Group1, three CPR candidate phyla, including Taylorbacteria (relative abundance: 1.36% of the whole microbial communities), Niyogibacteria (0.64%), and CNCPR1 (0.21%) displayed significantly higher relative abundance than other sampling groups. In the moderate-salinity Group2, five phyla including Azambacteria (0.35%), Daviesbacteria (4.18%), and Gottesmanbacteria (0.29%) showed the highest proportion. In the highest-salinity Group3, CPR members associated with Komeilibacteria (1.56%), Uhrbacteria (1.11%), Roizmanbacteria (1.05%), Berkelbacteria (0.71%), Campbellbacteria (0.42%), Brennerbacteria (0.26%), CNCPR3 (0.25%), Gribaldobacteria (0.18%), Buchananbacteria (0.16%), and Katanobacteria (0.14%) were markedly predominant.

Driving forces for variations in relative abundances of CPR

We observed that the relative abundances of CPR bacteria may be more vulnerable to environmental changes than those of non-CPR bacteria in brackish-saline groundwater. Firstly, the relative abundances of CPR bacteria displayed stronger distance–decay relationships along environmental distance (Supplementary Fig. 10b, partial Mantel r = 0.545, P = 0.001) than geographic distance (Supplementary Fig. 10a, partial Mantel r = 0.357, P = 0.001), implying that environmental selection rather than dispersal limitation had a more significant impact on regulating CPR structure from the perspective of the relative abundance52. The relative abundances of non-CPR community, including Co-CPR bacteria and residual non-CPR members (Non-Co-CPR), appeared to be less affected by environmental distance compared to the CPR community (Supplementary Fig. 10c–f). The random forest analysis further revealed that environmental parameters, such as Co, Na, Cl, TDS, and SO42−, explained about 72.1% of the variations in relative abundances of CPR bacteria, which was higher than those observed for Co-CPR (55.3%) and Non-Co-CPR bacteria (58.4%) (Fig. 4a–c). Secondly, the PLS-PM analysis confirmed that environmental factors had a more substantial impact on the relative abundances of CPR bacteria (Fig. 4d–f, path coefficient λ = 0.354) than Co-CPR members (λ = 0.291). Thirdly, the relative abundances of CPR phyla predominantly distributed in high-salinity Group3 were positively associated with Na+, SO42−, Co2+, Ni2+, and Cl (Supplementary Fig. 11), potentially reflecting significant niche differentiation of CPR due to the environmental heterogeneity in saline groundwater. Additionally, the PLS-PM analysis also indicated that the relative abundances of Co-CPR bacteria had a significant direct effect (λ = 0.491) on the relative abundances of the CPR population, highlighting their indispensable role in shaping the variations in relative abundances of CPR.

Fig. 4: Driving forces of CPR and non-CPR bacteria.

Random forest (RF) indicating the potential environmental drivers of variations in the CPR community (a), non-CPR bacteria that co-occurred with CPR (Co-CPR) (b), and non-CPR bacteria that did not co-occur with CPR (Non-Co-CPR) (c). PLS-PM describing the relationships between environmental factors, seasonal groups, geographical factors, non-CPR bacteria that co-occurred with CPR, and CPR (d). Numbers adjacent to each arrow denote partial correlation coefficients (significance codes: ***p ≤ 0.001; **p ≤ 0.01; *p ≤ 0.05). R2 values display the proportion of variance explained for each factor. Bar charts showing the standardized total effect of each factor on the CPR (e) and Co-CPR (f).

Metabolic potentials of CPR bacteria in groundwater

Compared with non-CPR bacteria (Supplementary Fig. 12 and Supplementary Table 7), the 399 non-redundant CPR MAGs exhibited limited metabolic capabilities in terms of biosynthesis, biogeochemical cycles, and energy metabolism (Fig. 5, Supplementary Fig. 13, and pathway completeness in Supplementary Table 8). The most frequently observed complete pathways in CPR genomes included pyrimidine ribonucleotide biosynthesis (UMP to UDP/UTP/CDP/CTP without the de novo part: 54.89% of genomes), de novo purine biosynthesis (de novo to inosine monophosphate: 32.05%; inosine monophosphate to ATP/ADP: 32.58%) and oxidative or non-oxidative PPP (oxidative: 17.54%; non-oxidative: 20.80%). Glycolysis and gluconeogenesis were seldom fully present in these genomes (complete glycolysis/gluconeogenesis in only 0.75%/1.75% of genomes), but a near-complete glycolysis or gluconeogenesis pathway with more than 75% completeness was common (present in 44.61%/57.89% of genomes), suggesting that CPR bacteria might be involved in carbon scavenging and carbon degradation10. CPR bacteria also contained genes encoding enzymes capable of degrading complex carbohydrates such as cellulose and starch, highlighting their contribution to carbon cycling in the study area9.

Fig. 5: Metabolic heatmap inferred for CPR phyla.

The occurrence of genes is calculated based on the number of MAGs regardless of their abundance. Acetyl-CoA acetyl-coenzyme A, a pivotal intermediate in multiple carbon pathways; TCA cycle tricarboxylic acid cycle; PPP pentose phosphate pathway; ETC electron transport chain.

We identified some distinctive metabolic potentials of CPR bacteria in the brackish-saline groundwater (Fig. 5). A considerable proportion of CPR MAGs encoded genes associated with nitrite reduction (nirK/nirS in 23.13% of CPR MAGs, Fig. 5), reflecting the potential involvement of CPR bacteria in the nitrogen cycle, although complete nitrogen reduction has not been observed in CPR members. Genes responsible for reducing sulfur compounds, particularly for converting sulfate to sulfite (cysH/aprA in 4.95% of CPR MAGs), were also identified in specific CPR genomes, indicating the probable involvement of CPR bacteria in sulfur reduction within this sulfate-rich groundwater. Hydrogenases, especially [NiFe] group 3 hydrogenases, were moderately distributed in CPR bacteria (71.42%), possibly indicating the potential of these MAGs to produce or consume hydrogen.

Besides the above metabolic potentials, stress-related genes were also identified in CPR genomes (Fig. 5). For example, heat shock protein genes (especially htpX, a member of the Hsp90 family) were found in CPR bacteria; their products may act as molecular chaperones to facilitate proper protein folding and regulate protein homeostasis under high-salinity conditions53. The presence of genes for the synthesis of the osmoprotectants trehalose, inositol, and proline in some CPR genomes may give these lineages an advantage in adapting to the osmotic pressure caused by high sulfate concentrations54. We also identified an HTH-type transcriptional regulator gene (plcR) in several CPR genomes that may function in bacterial quorum sensing to control cell density, express virulence factors, or participate in host-symbiont communication55. The presence of these genes in CPR suggests that they likely play crucial roles in the survival of CPR bacteria in brackish-saline groundwater. To test this hypothesis, we conducted a comparative analysis of stress-related genes between CPR genomes from the present study and publicly available references from lakes56, acid mine drainage57, and groundwater2,6,10,16,58 (Supplementary Table 9). Notably, genes associated with heat shock protein (htpX) and osmoprotectants (ostA, tps, betB, gldA, proC, glnA, and ggpS) were significantly more abundant in this high-salinity environment than in other environments, indicating their crucial role in the adaptability of CPR bacteria to high-salinity groundwater.

Additional distinctive features of CPR bacteria were further explored. One of the most important traits was the presence of introns, which sets CPR apart from non-CPR bacteria2. Among the 399 CPR MAGs, 73.4% were predicted to harbor 16S rRNA genes, and the absence of 16S rRNA genes in the remaining 26.6% was possibly attributable to genome incompleteness. Of those containing 16S rRNA genes, 15.4% contained a total of 56 introns (Supplementary Table 10), indicating that introns are widely distributed among CPR bacteria. Many of these introns encode RNA-based self-splicing ribozymes (Group I catalytic introns) and/or amino-acid-based LAGLIDADG homing endonucleases, suggesting potential roles in gene expression regulation or ribosomal function. Another notable feature of CPR bacteria is the scarcity of CRISPR-Cas systems in their genomes, with only 1.4% of publicly available CPR MAGs reported to possess CRISPR arrays59. Likewise, only three CPR MAGs in the ATWR were identified with genes encoding cas3, cas5, or cas7, indicating that CRISPR-Cas systems are nearly absent from CPR bacteria in this aquifer. Consistent with a previous study57, we identified low gene occurrences and even gene losses in CPR genomes, particularly in biosynthesis genes responsible for essential amino acids, fatty acids, and nucleotides (Supplementary Table 11). This finding underscores the significance of genomic streamlining in the evolution of CPR genomes, which may facilitate the minimization of cellular complexity and ultimately result in a decrease in genome size60. Meanwhile, the scarcity of CRISPR-Cas systems in CPR could also be a consequence of genomic streamlining for high metabolic efficiency. Interestingly, the lack of CRISPR-Cas systems could position CPR bacteria as “viral decoys” for their hosts, thereby potentially protecting the hosts and reducing their viral load59.

Based on the above functional predictions, we observed that many metabolic genes of CPR MAGs displayed the highest abundances in the high-salinity Group3, suggesting that these metabolic potentials may contribute to the environmental adaptation of CPR bacteria to high-salinity groundwater (Supplementary Table 12 and Fig. 6). The relevant functional genes were mainly related to amino acid biosynthesis (serA, ilvE, leuA, cysK, and metH), fatty acid metabolism (fabGZ and fadBDN), electron transport chain (nuoFG and atpBCDFI), glycolysis/gluconeogenesis (gpi/pgi-pmi, fbp, glk/hk, and gap), nitrogen metabolism (nirK/nirS, nirB/nirD/nrfA, and nosZ), and sulfur metabolism (phsA and asrA). In particular, CPR genomes exhibited the highest gene abundances associated with sulfate reduction steps (the production or consumption of HS−: SO32− → HS−, S2O32− → HS−, and S(n) → HS−) in the highest-salinity Group3 (0.74–2.90%) compared to the medium-salinity Group2 (0.28–1.51%) and the lowest-salinity Group1 (0.35–0.66%) (Supplementary Figs. 14 and 15), indicating an elevated contribution of CPR bacteria to sulfate reduction processes under high-salinity conditions. This observation reflected a significant selective pressure for sulfate reduction on CPR exerted by the highest sulfate concentration in Group3 (Fig. 4 and Supplementary Fig. 6). Pressure-regulating genes of CPR MAGs, including heat shock protein genes (htpX) and osmoprotectant transporter genes (lcdH and sdmt), also showed higher relative abundances in Group3 than in the other groups (Supplementary Table 12 and Fig. 6).

Fig. 6: Genes harbored by CPR bacteria that showed significantly the highest or lowest abundance in Group3, based on one-way ANOVA.

“Group1”, “Group2”, and “Group3” indicate the groundwater samples categorized into the lowest, moderate, and highest salinity levels, respectively.
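As a rough illustration of the one-way ANOVA comparison named in the caption above, the following minimal sketch computes the F statistic for a single gene's relative abundance across the three salinity groups. The function name and the sample values are hypothetical and are not taken from this study. A p-value would additionally require the F distribution (for example via scipy.stats.f_oneway), which is omitted here to keep the sketch free of dependencies.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of sample groups."""
    k = len(groups)                   # number of groups
    n = sum(len(g) for g in groups)   # total number of observations
    if k < 2 or n <= k:
        raise ValueError("need at least two groups and more samples than groups")

    grand_mean = mean([x for g in groups for x in g])
    group_means = [mean(g) for g in groups]

    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Within-group sum of squares: spread of samples around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    if ss_within == 0:
        raise ValueError("no within-group variance; F is undefined")

    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical relative abundances (%) of one gene in each salinity group.
group1 = [1.0, 2.0, 3.0]   # lowest salinity
group2 = [2.0, 3.0, 4.0]   # moderate salinity
group3 = [5.0, 6.0, 7.0]   # highest salinity

f_stat = one_way_anova_f([group1, group2, group3])
print(f"F = {f_stat:.2f}")
```

A large F value only indicates that between-group differences exceed within-group variation. Deciding which group is highest or lowest still requires comparing the group means or applying a post hoc test.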

Potential reciprocal relationships between CPR and Co-CPR

We observed that several metabolic genes of Co-CPR bacteria, such as fbp, htpX, lcdH, pyrI, and asrA, exhibited the lowest relative abundance in Group3, contrasting with their highest abundance in CPR bacteria within the same Group3 (Fig. 6). Conversely, certain genes of CPR MAGs showed the lowest abundance in Group3 but were markedly abundant in Co-CPR members. Examples included nuoHN (electron transport chain), rpe (pentose phosphate pathway), guaB (purine biosynthesis), purN (purine biosynthesis), pyrE (pyrimidine biosynthesis), and mdh (TCA cycle). Given the symbiotic lifestyle of CPR bacteria, these findings suggest a potential metabolic synergy between CPR and Co-CPR bacteria and point to an intricate interplay within the microbial community in salinity-stressed groundwater ecosystems.

Based on the crucial genes for the key steps of biogeochemical cycling (Supplementary Table 5), metabolic reconstructions of CPR and non-CPR bacteria revealed the auxiliary contributions of CPR bacteria to carbon, nitrogen, and sulfur cycling (Fig. 7 and Supplementary Fig. 14). Compared with Non-Co-CPR bacteria, both CPR and Co-CPR played modest roles in biogeochemical cycling, but Co-CPR showed a higher abundance of genes in specific pathways. These included glycolysis [99.42% (Co-CPR) vs. 88.81% (Non-Co-CPR), phosphofructokinase gene pfk], de novo lipogenesis (96.57% vs. 89.26%, S-malonyltransferase gene fabD), nitrite reduction (81.13% vs. 79.48%, both assimilatory and dissimilatory nitrite reductase genes nirA, nirB, and nrfA), assimilatory sulfate/sulfite reduction (90.50%/89.46% vs. 79.37%/83.06%, phosphoadenosine phosphosulfate reductase and sulfite reductase genes cysH, cysJ, and sir), and carbon fixation (81.34% vs. 51.15%, mostly through the Wood–Ljungdahl pathway, as indicated by the key acetyl-CoA synthase gene acsB) (Fig. 7, Supplementary Fig. 14, and Supplementary Table 7). These pathways might represent predicted hotspots of interactions between CPR and Co-CPR (Supplementary Fig. 16). Furthermore, Co-CPR bacteria also exhibited high functional potentials associated with the electron transport chain, pentose phosphate pathway, purine biosynthesis, and TCA cycle under high-salinity conditions, which could further help the co-occurring CPR to participate in the corresponding pathways. Additionally, the higher relative abundance of heat shock protein and osmoprotectant genes in Co-CPR compared to Non-Co-CPR may contribute to the adaptation of CPR through putative metabolite exchanges between CPR members and their co-existing partners.

Fig. 7: The abundance of genes related to sulfur, nitrogen, and carbon cycles for all bacteria, CPR, Co-CPR, and Non-Co-CPR.

Detailed information on these genes in the lowest-salinity Group1 (a), the moderate-salinity Group2 (b), and the highest-salinity Group3 (c). The percentage near each arrow indicates the MAG capacity for catalyzing the corresponding metabolic step. The percentage was calculated as the summed sequencing depth of the genomes that possess the genes for a given step divided by the total sequencing depth of the selected genomes.
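The percentage described in the caption above can be sketched as a depth-weighted fraction. The snippet below is a minimal illustration under stated assumptions, not the authors' pipeline: the function name, the MAG identifiers, and the depth values are hypothetical, and a genome is assumed to "possess the genes" for a step if it carries at least one of the listed marker genes.

```python
from typing import Iterable, Mapping, Set


def step_capacity(depths: Mapping[str, float],
                  gene_sets: Mapping[str, Set[str]],
                  step_genes: Iterable[str]) -> float:
    """Depth-weighted percentage of selected genomes encoding a metabolic step.

    depths     -- sequencing depth of each selected genome (MAG)
    gene_sets  -- genes detected in each genome
    step_genes -- marker genes for the step (assumption: any one suffices)
    """
    markers = set(step_genes)
    total_depth = sum(depths.values())
    if total_depth <= 0:
        raise ValueError("total sequencing depth must be positive")

    # Sum the depth of genomes carrying at least one marker gene for the step.
    depth_with_step = sum(depth for mag, depth in depths.items()
                          if gene_sets.get(mag, set()) & markers)
    return 100.0 * depth_with_step / total_depth


# Hypothetical MAGs, depths, and gene content for illustration only.
depths = {"MAG_A": 30.0, "MAG_B": 10.0, "MAG_C": 60.0}
gene_sets = {
    "MAG_A": {"pfk", "fabD"},
    "MAG_B": {"nirB"},
    "MAG_C": {"pfk"},
}

glycolysis_pct = step_capacity(depths, gene_sets, ["pfk"])
print(f"glycolysis (pfk): {glycolysis_pct:.2f}%")
```

Weighting by sequencing depth means that an abundant genome contributes more to the percentage than a rare one, so the value reflects community-level capacity rather than a simple count of genomes.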

Discussion

Drawing on metagenomic binning, our study reconstructed 399 CPR MAGs and expanded CPR diversity by affiliating them with 44 previously proposed phyla and 8 new phyla in brackish-saline groundwater. We observed pronounced distribution patterns in the relative abundances of CPR bacteria across the heterogeneous groundwater environment. In particular, CPR bacteria not only played essential roles in regulating the stability and complexity of the microbial community, but also exhibited high metabolic potentials associated with sulfur reduction, heat shock proteins, osmoprotectants, and fatty acid metabolism, enabling them to adapt effectively to high-salinity stress.

Our findings revealed that the relative abundance of CPR members was sensitive to variations in groundwater hydrochemistry, particularly salinity characterized by Na⁺ and SO₄²⁻. Salinity emerged as a primary driver, shaping the differentiation in relative abundances and ecological niches of CPR species61,62. With increasing salinity levels, CPR bacteria experienced heightened selective pressure. Adaptive CPR members exhibited a higher relative abundance of pressure-regulating genes, such as osmoprotectant genes, enabling them to regulate osmotic pressure effectively under high-salinity conditions53,54,63. Although the salinity levels in the studied brackish-saline groundwater were not as extreme as those of habitats occupied by extremely halophilic Euryarchaeota and halotolerant ammonia-oxidizing archaea, previous research on these organisms also sheds light on the role of osmoprotectants in coping with high salt concentrations64,65. CPR bacteria also harbored abundant sulfate-reduction genes and fatty acid metabolism genes, facilitating energy and nutrient acquisition in sulfur-rich environments66. Moreover, environmental heterogeneity not only directly influenced the structure of CPR members5 but also indirectly drove variations in the CPR community through Co-CPR bacteria (Fig. 4d–f), which might serve as potential hosts of CPR. Therefore, it could be concluded that the joint influence of environmental heterogeneity and Co-CPR bacteria drove the dynamics of CPR in groundwater, which can be attributed to the limited metabolic capabilities and host-dependent lifestyles of CPR bacteria10,12,67,68,69. However, it is essential to acknowledge that relative abundance in metagenomic analyses can be influenced by species dominance and sequencing depth. Thus, the dynamics of CPR bacteria across various environmental conditions should be further confirmed with absolute abundance measurements in future studies.

CPR bacteria were found to be of great significance in maintaining the stability and complexity of the whole bacterial community against external environmental perturbations (Supplementary Fig. 4). Furthermore, co-occurrence network analysis indicated that non-CPR bacteria belonging to Chloroflexota, Desulfobacterota, and Planctomycetes could be potential hosts for CPR species from Daviesbacteria, Wildermuthbacteria, Taylorbacteria, and CNCPR8. Although the identified potential CPR partners need to be further verified by alternative approaches, such as cryogenic transmission electron microscopy imaging, culture-based methods, and single-cell genomic analyses10,70, these findings expanded the spectrum of potential partners for CPR members and provided significant insights into complex symbiotic associations in underground ecosystems. Notably, the strongest links with CPR bacteria were observed in the phyla Omnitrophota, Chloroflexota, and Desulfobacterota. The latter two were highly related to sulfate reduction, suggesting potential relationships between CPR and sulfate-reducing bacteria in this sulfate-rich environment. These findings suggested that CPR bacteria might actively or passively select different reciprocal partners in response to changing environmental conditions, thereby enhancing their survival and the resilience of the whole bacterial community in stressful groundwater.

Furthermore, the epi-symbiotic lifestyle common among CPR members, facilitated by their attachment to host membranes via pili and the need to acquire essential biomolecules, suggests a potential mechanism for the joint adaptation of CPR and Co-CPR bacteria to high-salinity environments10,56,71,72. We developed a conceptual model to illustrate how CPR bacteria cope with environmental stressors (Fig. 8). On the one hand, CPR bacteria exhibited a unique genetic makeup, encoding many fragmented genes across multiple incomplete pathways. They participated in crucial metabolic processes and regulated responses to salinity stress (“do itself”). In detail, the relative abundance of sulfate reduction genes in CPR genomes correlated positively with sulfate levels in groundwater, potentially promoting their adaptation to sulfate-rich environments. Higher relative abundances of genes related to fatty acid metabolism may provide CPR with energy and nutrients to tolerate environmental stress66. Genes encoding osmoprotectants or heat shock proteins assist CPR bacteria in osmoregulation, maintaining positive cell turgor and growth in groundwater with varying salinity levels53,54. On the other hand, the limited biosynthetic potential of CPR bacteria suggests a high dependency on other organisms for resources (as shown in “outsourcing”)9. Co-CPR bacteria may serve as a promising source of essential biomolecules for CPR members since they predominantly encode pathways for autotrophic carbon fixation. These pathways could produce organic carbon and synthesize complex biomolecules that are lacking in CPR bacteria. In turn, CPR bacteria potentially benefit Co-CPR by complementing several missing steps in certain near-complete pathways, such as the partial TCA cycle (gltA, citrate synthase) or glycolysis/gluconeogenesis (gpm, phosphoglycerate mutase, 3-phosphoglycerate ↔ 2-phosphoglycerate) (Supplementary Fig. 16). This is supported by the gene occurrence patterns observed for CPR and Co-CPR bacteria (Fig. 7, Supplementary Figs. 12 and 14). CPR bacteria may also exchange genes horizontally with their putative hosts72, which could allow these CPR members to evolve more specialized niches with their hosts in groundwater once certain key genes are transferred. The abundant quorum sensing gene (plcR) in several CPR genomes could stimulate metabolic interactions between CPR and relevant co-existing partners to adapt to salinity stress55.

Fig. 8: Proposed diagram of metabolic capacities and incapabilities in CPR, as well as their interactions with Co-CPR.

Key intermediates might be exchanged between these organisms to complete the partial pathways. noPPP non-oxidative pentose phosphate pathway, TCA cycle tricarboxylic acid cycle, WL pathway Wood–Ljungdahl pathway.

Overall, potential collaborations between CPR bacteria and Co-CPR members might play essential roles in driving the transformations of primary elements such as carbon, nitrogen, and sulfur in complex groundwater ecosystems. Although it is difficult to determine whether the co-occurring partners of CPR are real hosts, these findings highlighted the combined contribution of CPR and their potential reciprocal partners to environmental adaptation, ecological stability, and biogeochemical cycling, substantially expanding our understanding of the ecological roles of CPR in groundwater20,73,74.

In summary, we depicted the integrated biogeographic patterns and metabolic adaptations of CPR microorganisms within brackish-saline groundwater ecosystems by capturing 399 non-redundant CPR genomes spanning 44 previously proposed phyla and 8 potential novel phyla. Relative to non-CPR bacteria, CPR members showed spatial distributions and niche differentiation that were significantly influenced by environmental heterogeneity. CPR bacteria, as key components regulating the stability and complexity of the microbial community, exhibited high relative abundances of functional genes associated with sulfate reduction, heat shock proteins, osmoprotectants, and fatty acid metabolism, potentially enabling them to cope with high-salinity stress. More importantly, we revealed the vital roles of the co-occurring partners of CPR bacteria, which not only dominated key metabolic pathways such as the TCA cycle, gluconeogenesis/glycolysis, sulfur metabolism, and nitrogen metabolism but also potentially engaged in metabolic interactions with CPR bacteria, facilitating their adaptation to high-salinity conditions. Together, this study elucidated the remarkable adaptation of environment-sensitive CPR to high-salinity conditions and highlighted the significance of close cooperation between CPR and their co-existing potential partners in mediating ecological stability and driving biogeochemical cycles within dynamic groundwater environments.