Background & Summary

Freshwater ecosystems play a pivotal role in global biogeochemical cycles, driving the transformation and movement of carbon and nutrients through processes such as sedimentation, atmospheric exchange, and hydrological transport1,2,3,4,5. Microorganisms are central to these processes, making the study of freshwater microbial communities essential for understanding global environmental changes6,7,8,9. Over the past few decades, advances in molecular biological techniques have significantly enhanced our knowledge of microbial diversity and function in freshwater environments10,11,12,13. High-throughput sequencing of 16S rRNA genes has revealed microbial diversity in freshwater ecosystems that was previously undetectable using traditional culture-based methods. These studies have demonstrated that microbial communities are predominantly composed of oligotrophic microorganisms14,15,16, which exhibit spatiotemporal dynamics and microdiversification in response to environmental variability17,18,19. More recently, metagenomic sequencing approaches have uncovered extensive phylogenetic and functional diversity across global freshwater biomes19,20,21,22,23,24,25,26,27, emphasizing the need for more comprehensive datasets spanning diverse geographic and ecological contexts, as exemplified by recent studies on freshwater nitrifiers28,29.

Lake Soyang is a large, deep, and oligotrophic-to-mesotrophic artificial reservoir in South Korea, with a maximum depth of approximately 118 m, making it the deepest lake in the country. It is classified as a warm monomictic lake, characterized by seasonal overturning that begins in early winter and persists until surface warming in spring. Heavy monsoonal rainfall in Korea (July–August) drives substantial inflows into the reservoir’s metalimnion30, triggering post-monsoon phytoplankton blooms through the upward transport of nutrients to the epilimnion, coupled with rising water temperatures31,32. In recent years, we have studied the microbial ecology of Lake Soyang, focusing on the isolation and characterization of diverse bacterial lineages33,34, including the acI actinobacterial clade35. Additionally, viral assemblages in the lake have been examined36,37.

Here, we present microbial metagenomic data obtained from 28 water samples collected at varying depths from a single sampling station (Fig. 1) across two periods to capture seasonal variations. The first sample set, collected from April 2014 to January 2015, comprised eight samples obtained at two depths (1 m and 50 m) at three-month intervals. The second sample set, collected from January to November 2019, also at approximately three-month intervals, consisted of 20 samples obtained at five depths (1, 10, 20, 40, and 90 m). Physicochemical parameters were measured on-site (Table 1), while additional parameters such as chlorophyll a and nutrients were analyzed in the laboratory (Table 2). Total dissolved nitrogen and dissolved organic carbon data were not available for the 2019 samples.
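The sampling design can be checked with a few lines of arithmetic. The sketch below is illustrative only: the depths come from the text, while the time-point labels T1 to T4 are placeholders, because the exact sampling dates are not listed in this section.

```python
from itertools import product

# Depths (m) are taken from the text. The time-point labels are placeholders
# standing in for the four roughly quarterly sampling dates of each period.
DESIGN = {
    "2014-2015": {"depths_m": [1, 50], "time_points": ["T1", "T2", "T3", "T4"]},
    "2019": {"depths_m": [1, 10, 20, 40, 90], "time_points": ["T1", "T2", "T3", "T4"]},
}


def enumerate_samples(design):
    """Return one (period, time_point, depth_m) tuple per expected sample."""
    samples = []
    for period, spec in design.items():
        for time_point, depth in product(spec["time_points"], spec["depths_m"]):
            samples.append((period, time_point, depth))
    return samples


samples = enumerate_samples(DESIGN)
counts = {period: sum(1 for s in samples if s[0] == period) for period in DESIGN}
print(counts, len(samples))
```

Four time points at two depths give eight samples, and four time points at five depths give 20, for the 28 metagenomes in total.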

Fig. 1 Location of the sampling station (a) and the sample processing scheme used in this study (b).

Table 1 Physicochemical characteristics of lake water samples collected in this study.
Table 2 Nutrients and biological parameters of lake water samples collected in this study.

The eight metagenomes collected from April 2014 to January 2015 were sequenced using the Illumina HiSeq 2500 platform, yielding 14.2–21.8 Gbp of raw bases per sample (Table 3). The 20 metagenomes collected in 2019 were sequenced using the Illumina NovaSeq platform, generating 9.3–12.2 Gbp of raw bases per sample (Table 3). Taxonomic classification of high-quality metagenomic reads using the Genome Taxonomy Database38 (GTDB) identified Nanopelagicaceae (Actinomycetia), Pelagibacteraceae (Alphaproteobacteria), and Burkholderiaceae (Gammaproteobacteria) as the dominant microbial taxa in the freshwater metagenomes (Fig. 2). This study presents a new freshwater metagenome dataset alongside physicochemical characteristics, providing a foundation for further investigations into microbial community dynamics and the genomic traits of freshwater microbial assemblages.

Table 3 Metagenome sequencing statistics of 28 samples collected in this study.
Fig. 2 Taxonomic classification of freshwater metagenomic reads obtained from Lake Soyang. The bubbles represent the relative abundance (%) of taxonomic groups at the family level in the samples. Taxonomic classification was performed using Kraken2 based on a database constructed using the GTDB (R207).

Methods

Sample collection

Lake water samples were collected using a Niskin sampler at depths of 1 m and 50 m in front of the dam (37.9474° N, 127.8189° E) across four seasonal intervals between April 2014 and January 2015 (Fig. 1). During the second sampling period (January–November 2019), samples were collected at five depths (1, 10, 20, 40, and 90 m) at the same location and using the same sampling method. At each time point and depth, a single 10 L lake water sample was collected. Immediately after collection, samples were stored in darkness at 4 °C and transported to the laboratory. Temperature, pH, dissolved oxygen, conductivity, and salinity were measured on-site using a YSI 556 MPS Multiprobe System (YSI Incorporated, Yellow Springs, OH, USA). Dissolved oxygen concentrations were also analyzed using the standard Winkler titration method39. Concentrations of ammonium, nitrite, nitrate, phosphate, and silicate were analyzed after filtration through 0.45 μm pore-size membrane filters (Advantec, Toyo Roshi Kaisha, Tokyo, Japan). During the first sampling period, these analyses were conducted by the National Instrumentation Center for Environmental Management (NICEM, South Korea), while during the second sampling period, Hach reagent kits (Hach, Loveland, CA, USA) were used. Dissolved organic carbon (DOC) and total dissolved nitrogen (TDN) were also measured by NICEM during the first sampling period. For chlorophyll a analysis, 1 L of each water sample was filtered through a GF/F glass microfiber filter (0.7 μm pore size, Whatman, Kent, UK). Chlorophyll a concentrations were quantified after acetone extraction40 using a Turner Designs 10-AU fluorometer (Turner Designs, Sunnyvale, CA, USA) during the first sampling period and a UV-VIS spectrophotometer (Model UV-2600, Shimadzu, Japan) during the second sampling period.

For metagenome sequencing, 1 L of each water sample was pre-filtered through a 3.0 μm pore-size mixed cellulose ester membrane filter (Advantec, Toyo Roshi Kaisha, Tokyo, Japan), and the filtrate was then passed through a 0.2 μm pore-size, 47 mm polyethersulfone membrane filter (Pall, NY, USA). Identical filter types and brands were used for all samples throughout the entire sampling period. Filtration was completed within 12 hours of collection, and filters were stored at −80 °C until DNA extraction.

DNA extraction and metagenome sequencing

DNA extraction was performed using the entire 0.2 μm filter for each sample. For samples collected between April 2014 and January 2015, the 0.2 μm pore-size filters were placed in 5 ml screw-cap tubes. Following a manual lysis protocol described in previous studies41,42,43,44, lysozyme solution (5 μl, 10 mg ml−1) was added to the tubes along with 1 ml of lysis buffer (20 mM EDTA, 50 mM Tris, 400 mM NaCl, 0.75 M sucrose), followed by incubation at 30 °C for 30 minutes in a hybridization oven with a rotation speed of 5 rpm. After incubation, proteinase K (5 μl, 20 mg ml−1) and 10% sodium dodecyl sulfate (100 μl) were added, and samples were incubated overnight at 55 °C. The contents of the 5 ml tubes were transferred to 15 ml conical tubes, and genomic DNA was extracted using the DNeasy Blood & Tissue Kit (Qiagen, Maryland, USA) following the manufacturer’s instructions, starting with the addition of RNase A and Qiagen AL buffer. For samples collected between January and November 2019, genomic DNA was extracted from the 0.2 μm polyethersulfone filters using the DNeasy PowerSoil Pro Kit according to the manufacturer’s protocol. We used two different DNA extraction protocols because our laboratory procedures for extracting DNA from membrane filters changed between the two sampling periods. Given the largely consistent taxonomic profiles observed across both sampling periods (Fig. 2), any biases potentially introduced by the different protocols are considered minimal, if present at all.
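For readers adapting the manual lysis protocol, the approximate working concentrations of the added reagents follow from the volumes above. The sketch below is a back-of-the-envelope calculation, not part of the published protocol; it assumes that volumes are additive and that the liquid retained by the filter is negligible.

```python
def final_concentration(stock_conc, added_volume_ul, total_volume_ul):
    """Simple dilution: C_final = C_stock * V_added / V_total."""
    return stock_conc * added_volume_ul / total_volume_ul


# Volumes (ul) and stock concentrations as given in the protocol text.
LYSIS_BUFFER_UL = 1000.0
LYSOZYME_UL, LYSOZYME_STOCK_MG_ML = 5.0, 10.0
PROTEINASE_K_UL, PROTEINASE_K_STOCK_MG_ML = 5.0, 20.0
SDS_UL, SDS_STOCK_PERCENT = 100.0, 10.0

# Step 1: lysozyme added to the lysis buffer (assumes additive volumes).
volume_step1_ul = LYSIS_BUFFER_UL + LYSOZYME_UL
lysozyme_ug_ml = 1000.0 * final_concentration(
    LYSOZYME_STOCK_MG_ML, LYSOZYME_UL, volume_step1_ul
)

# Step 2: proteinase K and SDS added to the same tube.
volume_step2_ul = volume_step1_ul + PROTEINASE_K_UL + SDS_UL
proteinase_k_ug_ml = 1000.0 * final_concentration(
    PROTEINASE_K_STOCK_MG_ML, PROTEINASE_K_UL, volume_step2_ul
)
sds_percent = final_concentration(SDS_STOCK_PERCENT, SDS_UL, volume_step2_ul)

print(f"lysozyme:     {lysozyme_ug_ml:.1f} ug/ml")
print(f"proteinase K: {proteinase_k_ug_ml:.1f} ug/ml")
print(f"SDS:          {sds_percent:.2f} %")
```

Under these assumptions the working concentrations are roughly 50 μg ml−1 lysozyme, 90 μg ml−1 proteinase K, and 0.9% SDS.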

Extracted DNA was used for library preparation with the TruSeq DNA Library Kit, followed by sequencing on either the Illumina HiSeq 2500 platform (250 bp paired-end; CJ Bioscience, Korea) for the 2014–2015 samples or the Illumina NovaSeq platform (150 bp paired-end; CJ Bioscience, Korea) for the 2019 samples. Two sequencing platforms were used for practical reasons. We presume that any bias arising from the use of different Illumina sequencing platforms is negligible. No controls were included for DNA extraction, library construction, or sequencing.

Quality enhancement and taxonomic classification of metagenomic reads

Summary statistics of the raw sequencing data were obtained using SeqKit (v.2.8.0)45 with the “seqkit stats -a” command. Prior to taxonomic profiling, raw Illumina reads from both sampling periods (2014–2015 and 2019) were pre-processed using BBDuk in BBTools46 for adapter trimming, quality filtering, and removal of phiX174 sequences. The command used for trimming and quality filtering was: bbduk.sh ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 qtrim=rl trimq=10 minlen=100. The command for phiX removal was: bbduk.sh ref=phix174_ill.ref.fa.gz k=31 hdist=1. Pre-processed metagenomic reads were taxonomically classified using Kraken2 (v2.1.3)47 with the “--paired” option and the GTDB database (R207)38 provided by Struo2 (ref. 48; http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release207/). The resulting taxonomic abundance data were processed using custom scripts and visualized with the R package ‘tidyverse’49. The processed taxonomic classification tables and the scripts used to generate Fig. 2 are available in the project’s GitHub repository: https://github.com/SuhyunInha/SYMETA.
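The family-level relative abundances in Fig. 2 were produced by the custom scripts in the repository above. The following is not those scripts but a minimal sketch of how such values could be derived, assuming Kraken2 was also run with its “--report” option to write the standard six-column, tab-delimited report. The choice of classified reads as the denominator, the function name, and the toy report (including its read counts and taxon IDs) are illustrative assumptions.

```python
def family_relative_abundance(report_text):
    """Percent of classified reads assigned to each family-level clade.

    Expects the standard Kraken2 report columns: percent, clade reads,
    direct reads, rank code, taxon ID, and (indented) name.
    """
    family_reads = {}
    classified_reads = 0
    for line in report_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        clade_reads = int(fields[1])
        rank_code = fields[3].strip()
        name = fields[5].strip()
        if rank_code == "R":  # root clade: all classified reads
            classified_reads = clade_reads
        elif rank_code == "F":  # family rank only (not F1, F2, ...)
            family_reads[name] = clade_reads
    if classified_reads == 0:
        return {}
    return {
        name: 100.0 * reads / classified_reads
        for name, reads in family_reads.items()
    }


# Toy report with made-up numbers and taxon IDs, for illustration only.
TOY_REPORT = "\n".join([
    "10.00\t100\t100\tU\t0\tunclassified",
    "90.00\t900\t0\tR\t1\troot",
    "90.00\t900\t225\tD\t2\t  d__Bacteria",
    "45.00\t450\t10\tF\t5\t          f__Nanopelagicaceae",
    "22.50\t225\t5\tF\t6\t          f__Pelagibacteraceae",
])

abundance = family_relative_abundance(TOY_REPORT)
for family, percent in abundance.items():
    print(f"{family}\t{percent:.1f}")
```

Using clade-level rather than directly assigned read counts means that reads classified to genera or species within a family are included in that family's total.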

Data Records

The raw Illumina sequencing reads (fastq format) of the metagenomes obtained in this study have been deposited in the European Nucleotide Archive (ENA) under project accession number PRJEB27578 (ref. 50; samples from 2014 and 2015) and in GenBank under project accession number PRJNA1087105 (ref. 51; samples from 2019).
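To locate the deposited read files programmatically, one option is the ENA Portal API file report. The sketch below only builds the query URLs and does not contact the server. It takes the project accessions as PRJEB27578 and PRJNA1087105, assumes the 2019 project is mirrored to ENA through the usual INSDC data exchange, and uses a hypothetical helper name.

```python
from urllib.parse import urlencode

# ENA Portal API endpoint for per-run file listings.
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def ena_filereport_url(project_accession, fields=("run_accession", "fastq_ftp")):
    """Build a file-report URL listing the runs and FASTQ locations of a project."""
    query = urlencode({
        "accession": project_accession,
        "result": "read_run",
        "fields": ",".join(fields),
        "format": "tsv",
    })
    return f"{ENA_FILEREPORT}?{query}"


# Accessions as read from the Data Records text (an assumption, see lead-in).
urls = {acc: ena_filereport_url(acc) for acc in ("PRJEB27578", "PRJNA1087105")}
for accession, url in urls.items():
    print(accession, url)
```

Each URL returns a tab-separated table with one row per sequencing run, which can then be passed to a download tool of choice.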

Technical Validation

All experiments were conducted in a strictly controlled environment to prevent potential sample contamination. Before sampling and experiments, all equipment that directly contacted water samples, including filter units, carboys, bottles, and flasks, was immersed in HCl solution (10% v/v) for at least 24 hours and then rinsed five times with Milli-Q water. After washing, the labware was autoclaved, thoroughly dried, and stored with dust caps or wrapped in foil until use. The Niskin sampler was first washed in the laboratory and then rinsed with lake water at the sampling station at least three times before use. Water samples were transferred to sterile 20 L or 4 L carboys, placed in a cold-storage box, and immediately transported to the laboratory to prevent bacterial overgrowth. No field blanks or negative controls were included during sample collection; however, the risk of contamination was minimized through strict adherence to field protocols. Physicochemical parameters, including temperature, dissolved oxygen, pH, and conductivity, were measured on-site. Sample filtration and additional water chemistry analyses were performed on the same day as sampling. DNA was extracted from the filters using the same batch of the Qiagen DNeasy Blood & Tissue Kit for the 2014–2015 samples and the DNeasy PowerSoil Pro Kit for the 2019 samples. Within each sample set, all DNA extractions were performed on the same day in a dedicated laboratory space.