Background & Summary

Latitude shapes microbial species migration, distribution, and ecological function through climatic and geographic variation1,2,3. In recent years, the relationships between various microbial taxa and latitudinal gradients have been investigated in numerous regions by amplicon sequencing4,5,6. However, because amplicon sequencing targets only one or a few gene regions, it often fails to capture overall microbial diversity. In contrast, metagenomics provides abundant information about microbial genes, allowing a deeper understanding of microbial distribution patterns and functional potential7,8,9. Despite this advantage, large-scale metagenomic investigations of microbial diversity along latitudinal gradients remain scarce. Understanding how microbial communities respond to environmental changes along latitudinal gradients is critical for comprehending ecosystem functions and assessing microbial adaptation to climate variation.

Rivers play a crucial role in supporting biodiversity and providing a wide range of ecosystem services10,11. Due to the complex hydrological environments, river ecosystems harbor highly diverse microbial communities12. Despite numerous studies investigating microbial diversity in specific rivers using metagenomics9,13,14,15,16, a notable gap remains in large-scale observational studies, particularly those along latitudinal gradients. With the expansion of urbanization and agriculture, river flow regimes and water quality are increasingly impacted and degraded17. However, the metabolic activities of river microbiomes can mitigate these impacts by influencing river water purification, greenhouse gas emissions, nitrogen cycling, and food webs18,19,20. Given that microbes are critical regulators of ecological processes and functions, a deeper assessment of their ecological and biogeochemical contributions across diverse river systems is urgently needed. Therefore, further research is necessary to clarify the distribution patterns of microbial communities in various rivers, elucidating their ecological roles and responses to anthropogenic perturbations.

Here, we present 236 metagenome-assembled genomes (MAGs) reconstructed from channel sediments, riparian bulk soils (at two depths: 0–20 and 40–60 cm), and riparian rhizosphere soils collected from 30 river wetlands (e.g., the Yangtze River and Yellow River) along a latitudinal gradient in China (Table 1; Fig. 1a). All MAGs met the quality thresholds of ≥50% completeness and ≤10% contamination, in accordance with the standard for medium-quality MAGs21. Of these, 48.3% (114 MAGs) exhibited >70% completeness, and 8.1% (19 MAGs) were classified as “near complete”, with >90% completeness and <5% contamination. Additionally, 65.7% (155 MAGs) had low contamination (<5%), and 3.8% (9 MAGs) had no detectable contamination (Table S1). Genome sizes, estimated using CheckM v1.0.1222, ranged from 0.50 Mbp to 8.19 Mbp, with a mean of 2.61 Mbp (Table S1).
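The quality tiers described above can be illustrated with a minimal Python sketch. It assumes completeness and contamination are given as percentages, as reported by CheckM; the function name and the example values are hypothetical and are not taken from Table S1.

```python
def mag_quality_tier(completeness, contamination):
    """Assign a quality tier to one MAG from its completeness and contamination (%)."""
    # "Near complete": >90% completeness and <5% contamination.
    if completeness > 90 and contamination < 5:
        return "near complete"
    # Medium quality: >=50% completeness and <=10% contamination.
    if completeness >= 50 and contamination <= 10:
        return "medium quality"
    # Anything else would not pass the inclusion thresholds.
    return "below threshold"


# Hypothetical example values (completeness %, contamination %).
example_mags = {
    "MAG_A": (95.2, 1.3),
    "MAG_B": (72.0, 6.5),
    "MAG_C": (48.9, 2.0),
}
tiers = {name: mag_quality_tier(c, x) for name, (c, x) in example_mags.items()}
```

Because the near-complete tier uses strict inequalities, a MAG with exactly 90% completeness falls into the medium-quality tier in this sketch.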

Table 1 Geographic locations and the number of species-level MAGs of 30 river sites.
Fig. 1 Map of sampling sites (a) and bioinformatics workflow (b) for MAG reconstruction. Each site includes four samples: channel sediments, riparian rhizosphere soils, and riparian bulk soils (0–20 and 40–60 cm).

These draft genomes were classified as 225 bacteria and 11 archaea based on the Genome Taxonomy Database (GTDB)23 (Fig. 1b). For bacteria, approximately 19 phyla were identified, with the majority belonging to the phyla Pseudomonadota (34.67%), Actinomycetota (20.89%), and Bacteroidota (12.89%). However, only 20% (45 MAGs) could be assigned to currently known taxa at the species level, while 80% (180 MAGs) represent potentially novel taxa (Fig. 2a; Table S2). No significant correlation was observed between contamination and completeness, or between genome size and N50 length, although species-level bacterial MAGs with <80% completeness tended to exhibit higher contamination (Fig. 2c,e). For archaea, the 11 MAGs were classified into the phyla Halobacteriota (18.18%) and Thermoproteota (81.82%), with 90.91% (10 MAGs) representing potentially novel taxa (Fig. 2b; Table S2). Similarly, no significant correlations were detected between contamination and completeness, or between genome size and N50 length, among archaeal MAGs (Fig. 2d,f).

Fig. 2 Overview of the MAGs. Number and distribution of all species-level bacterial (a) and archaeal (b) MAGs at the phylum level. Relationship between N50 length and genome size for species-level bacterial (c) and archaeal (d) MAGs. Relationship between contamination and completeness for species-level bacterial (e) and archaeal (f) MAGs.
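For readers unfamiliar with the N50 length shown in Fig. 2c,d, the following Python sketch illustrates the standard definition: the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length. The function name and the example lengths are hypothetical; this is an illustration of the definition rather than the code used to generate the figure.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (0 for an empty input)."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig at which the running sum reaches half the total is the N50.
        if running >= half_total:
            return length
    return 0


# Hypothetical contig lengths (bp) for one MAG.
example_n50 = n50([2, 3, 4, 5, 6])
```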

Methods

Sampling

In September 2018, channel sediments, riparian bulk soils, and riparian rhizosphere soils were collected from 30 river wetlands along an approximately 3500 km latitudinal transect in eastern China (Fig. 1a). For each wetland, a representative sampling site was established in areas where the riparian wetlands had flat topography and formed relatively wide bands (>10 m). Approximately 200 g of surface channel sediments (0–20 cm) were collected at three random points, 5–10 m from the shore, using a homemade grab sampler. Riparian bulk soils (0–20 and 40–60 cm) were obtained from bare areas using a soil drilling machine (SSD, Zhonghe Technology, Beijing, China). To collect riparian rhizosphere soils, a 1 × 1 m plot was established within a representative plant community at each sampling site. Then, the most dominant plant species within the plot were carefully excavated to a depth of approximately 20 cm, and rhizosphere soil was collected using a sterilized soft brush after gently shaking off loosely attached soils from the plant roots. More detailed information about these river wetlands and sampling methodology has been previously described4.

DNA extraction and metagenomic sequencing

For all soil and sediment samples, DNA was extracted using the PowerSoil® DNA Isolation Kit following the manufacturer’s instructions (MoBio, Carlsbad, California, USA). DNA quantity was measured using the ExKubit dsDNA HS test kit (ExCell Biotech Co., Ltd., Shanghai, China) with a Qubit fluorimeter (Invitrogen, Carlsbad, CA, USA). All high-quality DNA extracts were subsequently sent for metagenomic sequencing.

Metagenomic sequencing libraries were prepared using the TrueLib DNA Library Rapid Prep Kit for Illumina (Vazyme Biotech Co., Ltd., China) according to the manufacturer’s protocol. Index codes were added to the attribute sequences of each sample so that samples could be distinguished during sequencing. Sequencing was conducted on the Illumina NovaSeq 6000 System (Illumina, San Diego, CA, USA), using a paired-end approach with read lengths of 2 × 150 bp (total size 350 bp). Adapter sequences were removed from the raw reads. Low-quality reads, i.e., those in which bases with quality values ≤ 10 exceeded 50% of the read length or in which N bases reached 10% of the read length, were filtered out using fastp v0.23.2 (parameters: default)24. The high-quality reads from each river wetland were co-assembled using MEGAHIT v1.2.9 (parameters: default)25. The quality of the assembled contigs was evaluated with CheckM v1.0.1222 to ensure the integrity and completeness of the assembly.
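The read-filtering criterion stated above can be expressed as a short Python sketch. It illustrates the rule as worded in the text and is not a reimplementation of fastp; it assumes Phred+33 quality encoding and treats an N content of 10% or more as failing, which is one reading of the stated threshold. All function and parameter names are hypothetical.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (assumes Phred+33 encoding)."""
    return [ord(ch) - offset for ch in quality_string]


def passes_read_filter(sequence, quality_string,
                       min_quality=10, max_low_fraction=0.5, max_n_fraction=0.10):
    """Return True if a read passes the low-quality and N-content criteria."""
    length = len(sequence)
    if length == 0:
        return False
    scores = phred_scores(quality_string)
    # Fraction of bases with quality value <= min_quality.
    low_fraction = sum(1 for q in scores if q <= min_quality) / length
    # Fraction of ambiguous (N) bases.
    n_fraction = sequence.upper().count("N") / length
    if low_fraction > max_low_fraction:
        return False
    if n_fraction >= max_n_fraction:
        return False
    return True


# Hypothetical 10-bp reads: 'I' encodes Q40 and '+' encodes Q10 under Phred+33.
good_read = passes_read_filter("ACGTACGTAC", "IIIIIIIIII")
low_quality_read = passes_read_filter("ACGTACGTAC", "++++++IIII")
n_rich_read = passes_read_filter("NCGTACGTAC", "IIIIIIIIII")
```

For paired-end data, a pair would typically be discarded if either mate fails; that pairing logic is omitted here for brevity.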

Genome binning, refinement, and dereplication

The MetaWRAP v1.3.221 pipeline, integrating the MaxBin2 v2.2.6 and MetaBAT2 v2.12.1 metagenomic binning software, was used to recover genome bins based on tetranucleotide frequencies, GC content, and coverage26. The MetaWRAP Bin_refinement module (parameters: -c 50 -x 10) was applied to refine the 3547 initial bins, resulting in 260 refined bins. The contamination and completeness of all bins were estimated using the lineage-specific workflow in CheckM. The refined bins were dereplicated using dRep v2.6.225 (parameters: -sa 0.95 -nc 0.30 -comp 50 -con 10) at 95% average nucleotide identity (ANI). This process yielded 236 non-redundant species-level MAGs. Taxonomic classification of the MAGs was conducted using the classify_wf workflow in GTDB-Tk v2.0.0, based on GTDB release 21427.
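To illustrate two of the sequence-composition features on which binning relies, the sketch below computes tetranucleotide frequencies and GC content for a single contig in Python. It is a simplified illustration under stated assumptions, not the implementation used by MaxBin2 or MetaBAT2: reverse-complement k-mers are not merged, windows containing ambiguous bases are skipped, and the function names are hypothetical.

```python
from collections import Counter
from itertools import product


def tetranucleotide_frequencies(sequence):
    """Return the relative frequency of each of the 256 tetranucleotides in a contig."""
    seq = sequence.upper()
    valid = set("ACGT")
    counts = Counter()
    # Slide a 4-bp window along the contig, skipping windows with ambiguous bases.
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if set(kmer) <= valid:
            counts[kmer] += 1
    total = sum(counts.values())
    all_kmers = ("".join(p) for p in product("ACGT", repeat=4))
    return {k: (counts[k] / total if total else 0.0) for k in all_kmers}


def gc_content(sequence):
    """Return the fraction of G and C among unambiguous bases (0.0 if there are none)."""
    seq = sequence.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    gc = seq.count("G") + seq.count("C")
    return gc / acgt if acgt else 0.0


# Hypothetical toy contig.
freqs = tetranucleotide_frequencies("ACGTACGT")
gc = gc_content("ACGTACGT")
```

Contigs from the same genome tend to share similar composition profiles, which is why such features, together with coverage, can be used to group contigs into bins.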

Data Records

The raw sequence data are available on the NCBI Sequence Read Archive (SRA) associated with BioProject number PRJNA77983228 and accession number SRP34607929. The environmental metadata for sample sites, along with the fasta sequences, genome annotation, and relative abundance information for 225 species-level bacterial MAGs and 11 species-level archaeal MAGs, are publicly available on Figshare30.

Technical Validation

To prevent sample contamination, all sampling tools and containers, including a homemade grab sampler, soil drilling machine, soft brushes, and plastic centrifuge tubes, were thoroughly sterilized prior to fieldwork. To preserve DNA integrity, collected samples were immediately stored at −80 °C until DNA extraction. Genome completeness and contamination were assessed using CheckM, and only MAGs with ≥50% completeness and ≤10% contamination were included in the final analysis. Additionally, to further reduce contamination and improve completeness, MetaWRAP was employed to reassemble and refine the genome bins.