Background & Summary

The soil environment hosts many microbes with diverse functions1,2, but the high heterogeneity of the soil-associated microbial community makes culturing these microbes in laboratory settings challenging3,4. For example, some soil bacterial taxa grow slowly or have complex nutrient demands, including a dependence on “cross-feeding” in the natural soil environment5,6. These characteristics not only complicate the establishment of optimal growth conditions, but also require precise mimicking of the environment to achieve successful cultivation. The difficulty of culturing some microbes has propelled the growth of metagenomics in both use and popularity7,8. Metagenomics has been used to characterize the identity and functional potential of the complex microbial communities living in the environment9,10.

Metagenome-assembled genomes (MAGs) are assembled from short fragments of DNA and can increase the resolution at which microbial populations in a community are identified11,12. The use of genome-resolved metagenomics has enabled ground-breaking work in exploring potential functions across various communities such as soil13, human and animal guts14, and many more15,16. Soil-associated MAGs are challenging to assemble17,18. One reason is the complexity and biodiversity of the soil community, together with the low representation of soil microorganisms in reference databases19,20. The soil environment contains many microorganisms, including viruses, fungi, archaea, and bacteria, adding to the challenge of assembling microbial genomes19,21. Similarly, some soil-associated microbes are low in abundance; the difficulty of obtaining enough coverage, coupled with limited information on these microbes22,23, makes assembling these microbial populations extremely hard17,24.

Understanding the distribution of bacterial populations across precipitation gradients is crucial for understanding how environmental factors such as moisture influence microbial diversity, function and overall ecosystem health. Microbes play essential roles in nutrient cycling25, organic matter decomposition2,26, and plant health27, all of which can be affected significantly by changes in precipitation28,29. The ability to assemble genomes and analyze them across different precipitation levels enables the identification of adaptive traits and community shifts that respond to water availability, and therefore provides insights into how ecosystems may respond to global climate change. This is vital for predicting and managing the impacts of shifting precipitation patterns on soil health, agricultural productivity, conservation efforts, and overall ecosystem stability13,30.

In this study, we assembled 679 metagenome-assembled genomes (MAGs) using a co-assembly strategy of 18 field soil core samples and 36 enriched monolith samples from two locations in Kansas, USA. The two locations were Hays, Kansas (Western Kansas; WKS), which has less yearly precipitation than Ottawa, Kansas (Eastern Kansas; EKS) (Fig. 1). One of the field core samples did not meet the minimum sequencing requirements, resulting in a total of 53 samples for this study. While the co-assembly strategy has its limitations, we used this method to recover MAGs because (1) it resulted in a higher read depth that supports more robust assembly31 and captures the diversity across the system32; (2) it facilitated our comparison of the MAGs across the samples17,33; and (3) it used differential coverage across the samples, which substantially improved our ability to recover genomes from the metagenomes31,34. The monolith samples enriched the microbial communities of the respective field soil cores, which carried a long precipitation history. This allowed us to successfully assemble soil-associated MAGs. We designated the assembled microbial genomes as MAGs if they had a completion ≥70% and a redundancy <10%. Of all the assembled genomes, 5 MAGs had a completion of 100%; two of these had a redundancy of 5.6%, and the other three had redundancies of 1.4%, 1.4%, and 2.8% (Supplementary Table S1). Of an average of ~42 ± 21 million metagenomic reads per sample, ~39 ± 19 million reads passed quality control criteria and were used for the 4 co-assemblies. The average total length of the MAGs was ~3 ± 1.4 Mbp, with an average of ~464 ± 360 contigs per MAG. The average N50 was ~15,991 ± 20,615 bp, and the average GC content was ~63 ± 8% (Supplementary Table S1).
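Because N50 is reported above as a summary of MAG contiguity, a short sketch of how the statistic is commonly computed may help readers interpret it. This is a minimal illustration using the standard definition (the length of the shortest contig in the set of longest contigs that together cover at least half of the total assembly length); the function name `n50` and the example contig lengths are hypothetical and are not taken from Supplementary Table S1.

```python
def n50(contig_lengths):
    """Return the N50 (in bp) of a collection of contig lengths.

    Contigs are sorted from longest to shortest and summed until the
    running total reaches at least half of the total assembly length;
    the length of the contig that crosses that point is the N50.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig length")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths (bp) for a single genome bin.
example_lengths = [80_000, 50_000, 30_000, 20_000, 10_000, 10_000]
example_n50 = n50(example_lengths)
print(example_n50)
```

Since a few long contigs can dominate the value, N50 tends to vary widely between genomes, which is consistent with a standard deviation that exceeds the mean.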

Fig. 1

Experimental design showing the sample collection and the bioinformatic workflow. Soil cores were collected for both the field and monolith samples; genomic DNA was extracted and sent for shotgun sequencing. The bioinformatic workflow shows the processes used to resolve the non-redundant MAGs and to perform metabolic modeling.

In this study, we resolved a large number of high-quality MAGs to the phyla Acidobacteriota (n = 96), Actinobacteriota (n = 271), and Proteobacteria (n = 105). The other MAGs were resolved to Bacteroidota (n = 1), Chloroflexota (n = 16), CSP1-3 (n = 1), Desulfobacterota (n = 10), Dormibacterota (n = 3), Eremiobacterota (n = 1), Firmicutes (n = 1), Gemmatimonadota (n = 17), Halobacteriota (n = 1), Methylomirabilota (n = 31), Myxococcota (n = 5), Nitrospirota (n = 7), Planctomycetota (n = 1), Tectomicrobia (n = 2), Thermoproteota (n = 42), and Verrucomicrobiota (n = 36) (Fig. 2). A further 32 MAGs were unidentified. There were 457 MAGs annotated to the genus level, with the highest numbers of MAGs annotated to the genera AV55 (n = 28) and Methyloceanibacter (n = 17). A total of 395 MAGs were resolved to the species level.
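As a quick consistency check on the counts listed above, the per-phylum numbers can be tallied against the total of 679 MAGs. The sketch below only re-enters the numbers from the text; the variable names are illustrative.

```python
# Per-phylum MAG counts as listed in the text.
phylum_counts = {
    "Acidobacteriota": 96,
    "Actinobacteriota": 271,
    "Proteobacteria": 105,
    "Bacteroidota": 1,
    "Chloroflexota": 16,
    "CSP1-3": 1,
    "Desulfobacterota": 10,
    "Dormibacterota": 3,
    "Eremiobacterota": 1,
    "Firmicutes": 1,
    "Gemmatimonadota": 17,
    "Halobacteriota": 1,
    "Methylomirabilota": 31,
    "Myxococcota": 5,
    "Nitrospirota": 7,
    "Planctomycetota": 1,
    "Tectomicrobia": 2,
    "Thermoproteota": 42,
    "Verrucomicrobiota": 36,
}
unidentified = 32

assigned = sum(phylum_counts.values())
total = assigned + unidentified

# The three most abundant phyla named at the start of the paragraph.
top_three = sum(
    phylum_counts[p]
    for p in ("Acidobacteriota", "Actinobacteriota", "Proteobacteria")
)

print(assigned, total, round(100 * top_three / total, 1))
```

Under these numbers, 647 MAGs carry a phylum assignment, and the three largest phyla together account for roughly 70% of all 679 MAGs.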

Fig. 2

Phylogenetic tree of bacterial and archaeal domains and phyla. A majority of the MAGs belonged to the Bacteria domain, with Actinobacteriota being the top phylum. All but one of the MAGs annotated to the Archaea domain belonged to the Thermoproteota phylum.

We observed that MAGs had higher detection in the monolith samples than in the field samples, which highlighted the enrichment of the field communities in the monoliths (Fig. 3). Our results also highlighted the influence of location and sampling depth on the detection of the MAGs. Most of the MAGs had higher detection in Eastern Kansas, where the precipitation level is higher than in Western Kansas. Furthermore, we observed higher detection of the MAGs in the shallow region (5 cm) than in the deeper regions (15 cm and 30 cm) (Fig. 3; Supplementary Table S2).

Fig. 3

Heatmap showing the detection of the MAGs across the different locations, sample types, depths, and soil legacies. Detection of MAGs was higher where the precipitation was higher and at the shallower soil depth.

We used Distilled and Refined Annotation of Metabolism (DRAM) to profile and compare the metabolic potentials of the MAGs. We observed that around 80% of the MAGs had enzymes capable of breaking down chitin, and ~75% of the MAGs contained enzymes that can metabolize starch (Fig. 4, Supplementary Table S3). We also observed that 17 MAGs possessed beta-galactan cleaving enzymes, while none of the MAGs had the potential to degrade mucin. It was surprising to see that only ~25% of the MAGs contained the enzymes necessary to convert nitrite to nitric oxide, while only 4 of 679 MAGs could convert nitrogen to ammonia. Based on the metabolic potential profiles, the MAGs appeared more likely to be involved in denitrification than in nitrification. Taken together, we showed that precipitation had a strong impact on the detection of the MAGs, as well as on their metabolic potentials, across the two locations.

Fig. 4

Nitrogen and CAZy metabolism of the non-redundant MAGs. More MAGs had the potential to perform denitrification than nitrification. Overall, more MAGs possessed carbon-associated metabolic pathways than nitrogen-related metabolism.

Our findings provide insights into how the soil microbial community and its essential functions and metabolism35,36 would be affected by changes in precipitation and land-use regime37,38. The availability of the high-resolution MAGs from this study also provides a framework for exploring microbe-microbe interactions and microbial functional shifts under abiotic stresses39,40.

Methods

Experimental design and sample preparation

We used the Giddings probe (Giddings Machine Company, Windsor, CO, USA) to collect field soil samples (n = 3; depth 60 cm, diameter 5 cm) in June 2018 from 2 sites across the precipitation gradient (Fig. 1). The site in Western Kansas (WKS) receives an average precipitation of 533 mm/year, and the site in Eastern Kansas (EKS) receives an average of 1045 mm/year41. Within the 3 plots at each precipitation site, we collected 1 soil core from each of 3 subplots. The soil collection plots represented the land-use legacies: WKS (native/N: 38.84°N, 99.30°W, agricultural/Ag: 38.84°N, 99.31°W, and post-agricultural/PAg: 38.84°N, 99.32°W), and EKS (native/N: 38.18°N, 95.27°W, agricultural/Ag: 38.54°N, 95.25°W, and post-agricultural/PAg: 38.18°N, 95.27°W). Since we were interested in samples representative of the land use rather than in differences within plots, we pooled and homogenized the soil cores to create 1 composite sample per plot. In this report, we were also interested in the gene distribution across the precipitation gradient. We aliquoted 25 g of soil for each depth segment (0–5 cm, 6–30 cm, 31–60 cm) for shotgun sequencing, and the rest of each sample was stored at −80 °C for archiving.

From August to October 2018, we collected monolith samples for the enrichment of soil microbial communities42. We collected intact soil cores (depth 60 cm, diameter 30 cm) from the same sites described above. We dried and stored the monolith samples in the greenhouse (University of Kansas Field Station) until further processing. The monolith experiment was set up in April 2019 and continued for 6 months. Each monolith soil sample was rewetted and randomly assigned to either the “dry” or the “wet” watering treatment group. The amount of water per treatment was determined by averaging 30 years of EKS and WKS annual rainfall data and adding 50% of the total to account for the greenhouse conditions. As a result, the “dry” treatment group received 1000 mm/yr and the “wet” treatment group received 2000 mm/yr. We also mimicked summer rainstorms by applying 450 mm/yr of each treatment in three intense 150 mm watering events. In November 2019, we harvested the monolith samples and segmented them according to depth (0–5 cm, 6–30 cm, 31–60 cm) for shotgun sequencing, and the rest of each sample was stored at −80 °C for archiving.

Field soil samples were homogenized, and total genomic DNA was extracted using the DNeasy PowerSoil Pro Kit following the manufacturer’s protocol (Qiagen, Germantown, MD, USA). We extracted total DNA from 0.160 g of monolith soil samples using an Omega E.Z.N.A. Soil DNA Kit (Omega Biotek, Inc., Norcross, GA, United States) as per the manufacturer’s protocol with a slight modification: we mechanically lysed the cells in a Qiagen TissueLyser II (Qiagen, Hilden, Germany), including bead-beating (20 rev/s, 2 min) and vortexing (3 min), prior to the downstream DNA extraction steps. The extracted DNA from both kits was eluted to a 100-μL final volume. Extracted DNA was sequenced on the Illumina NovaSeq 6000 platform (Illumina, San Diego, CA, USA) at the University of Kansas Medical Center Genome Sequencing Facility. We used an S1 flow cell with a 150 bp paired-end sequencing strategy and Nextera DNA Flex library preparation.

Bioinformatics

We co-assembled reads from 12 “metagenomic sets” (6 from the field and 6 from the monolith soils), grouped by geographical location (WKS, EKS) and soil depth (0–5 cm, 6–30 cm, 31–60 cm). We automated our metagenomics bioinformatics workflow using ‘anvi-run-workflow’43 in anvi’o (v7.1)44,45. The workflows use Snakemake43 to implement numerous tasks. Details of each process are outlined below:

We used the program ‘iu-filter-quality-minoche’ to process the short metagenomic reads and remove low-quality reads46. We used MEGAHIT (v1.2.9)31 to co-assemble quality-filtered short reads into longer contiguous sequences (contigs). After contig assembly, we used ‘anvi-gen-contigs-database’ to compute k-mer frequencies and identify open reading frames (ORFs) using Prodigal (v2.6.3)47; ‘anvi-run-hmms’ to identify sets of bacterial and archaeal single-copy core genes using HMMER (v3.2.1)48; ‘anvi-run-ncbi-cogs’ to annotate ORFs with NCBI’s Clusters of Orthologous Groups (COGs)49; and ‘anvi-run-kegg-kofams’ to annotate ORFs with the KOfam HMM database of KEGG orthologs50. Next, we mapped metagenomic short reads to contigs using Bowtie2 (v2.3.5)51 and profiled the BAM files using ‘anvi-profile’ with a minimum contig length of 1000 bp. Finally, we used ‘anvi-merge’ to combine all profiles into an anvi’o merged profile for all downstream analyses. To construct metagenome-assembled genomes (MAGs), we first used ‘anvi-cluster-contigs’ to group contigs into initial bins using CONCOCT (v1.1.0)52, and then used ‘anvi-refine’ to manually curate the bins based on tetranucleotide frequency and differential coverage across the samples. We marked bins that were more than 70% complete and less than 10% redundant as MAGs. The completion and redundancy values of the MAGs were based on matching single-copy genes in the MAGs against multiple hidden Markov model (HMM) collections: Bacteria_71 (ref. 53), Archaea_76 (ref. 54), and Protista_83. Finally, we used ‘anvi-compute-genome-similarity’ to calculate the average nucleotide identity (ANI) of our MAGs using PyANI (v0.2.9)55, and identified non-redundant MAGs.
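The quality criterion used to promote a curated bin to a MAG can be sketched as a simple filter. This is an illustration only, not the anvi’o implementation: the `Bin` class, the bin names, and the example values are hypothetical, and the thresholds follow the values stated in the Background & Summary (completion ≥70%, redundancy <10%).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bin:
    """A curated genome bin with its single-copy-gene-based estimates."""
    name: str
    completion: float  # percent
    redundancy: float  # percent


# Thresholds as stated in the Background & Summary of this study.
MIN_COMPLETION = 70.0   # inclusive lower bound on completion
MAX_REDUNDANCY = 10.0   # exclusive upper bound on redundancy


def is_mag(genome_bin: Bin) -> bool:
    """Return True if the bin meets both MAG quality thresholds."""
    return (
        genome_bin.completion >= MIN_COMPLETION
        and genome_bin.redundancy < MAX_REDUNDANCY
    )


# Hypothetical bins; the names and values are for illustration only.
bins = [
    Bin("bin_001", completion=100.0, redundancy=5.6),
    Bin("bin_002", completion=69.9, redundancy=1.4),   # fails completion
    Bin("bin_003", completion=85.0, redundancy=10.0),  # fails redundancy
    Bin("bin_004", completion=70.0, redundancy=2.8),
]

mag_names = [b.name for b in bins if is_mag(b)]
print(mag_names)
```

Keeping the two thresholds as named constants makes it easy to re-run the filter under stricter criteria when comparing against other MAG collections.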

We annotated the non-redundant MAGs with Distilled and Refined Annotation of Metabolism (DRAM) to provide a metabolic profile for each of the MAGs56. We used Prodigal (v2.6.3)47 to detect open reading frames (ORFs) and predict their amino acid sequences. We then used DRAM to search all amino acid sequences against KEGG57, UniRef90 (ref. 58), and MEROPS59 using MMseqs2 (ref. 60), retaining the best hits (defined by bit score, with a default minimum threshold of 60). We also used DRAM to run HMM profile searches against the Pfam61 database, and HMMER3 (ref. 62) searches against dbCAN63, requiring a coverage length > 35% of the model and an e-value < 10⁻¹⁵ for a match to be considered a hit40.
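The two hit criteria described above can be written as small predicates. This is a sketch of the stated thresholds, not DRAM’s own code: the function names, the tuple layout of the hits, and the example identifiers are hypothetical, and treating the bit-score threshold as inclusive (≥60) is an assumption.

```python
def best_sequence_hit(hits, min_bit_score=60.0):
    """Return the target ID of the best-scoring hit, or None.

    `hits` is a list of (target_id, bit_score) pairs for one query.
    Hits below the minimum bit score are discarded (threshold assumed
    to be inclusive), and the highest-scoring survivor is kept.
    """
    passing = [hit for hit in hits if hit[1] >= min_bit_score]
    if not passing:
        return None
    return max(passing, key=lambda hit: hit[1])[0]


def is_hmm_hit(aligned_model_length, model_length, e_value,
               min_coverage=0.35, max_e_value=1e-15):
    """Return True if an HMM match passes both stated cutoffs.

    Coverage is the fraction of the HMM model covered by the alignment;
    it must exceed 35%, and the e-value must be below 1e-15.
    """
    coverage = aligned_model_length / model_length
    return coverage > min_coverage and e_value < max_e_value


# Hypothetical hits for one predicted protein.
example_hits = [("target_A", 55.0), ("target_B", 120.5), ("target_C", 61.0)]
best = best_sequence_hit(example_hits)
print(best)
print(is_hmm_hit(aligned_model_length=40, model_length=100, e_value=1e-20))
```

Separating the sequence-search and HMM criteria mirrors the two search modes described in the text and keeps each cutoff adjustable on its own.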

Data Records

The shotgun metagenome reads and the sequences of the MAGs generated in this study are publicly available on the NCBI Sequence Read Archive (SRA) under BioProject PRJNA855256, runs SRR20019782–SRR20019834 (ref. 64). All other analyzed data, in the form of databases and FASTA files, are accessible on GitHub (https://github.com/SonnyTMLee/Recovery-of-679-soil-MAGs)65 and are also available on NCBI GenBank66; see Supplementary Table S1 for the accession numbers of the individual genomes (MAGs).
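For readers who want to retrieve the reads programmatically, the SRA run range given above (read here as SRR20019782 through SRR20019834) can be expanded into individual accessions. The sketch below assumes the range is contiguous and that every accession in it belongs to this BioProject, which should be confirmed against the BioProject record; the helper name `expand_accession_range` is hypothetical.

```python
def expand_accession_range(first, last):
    """Expand an inclusive accession range into a list of accessions.

    Both accessions must share the same alphabetic prefix (e.g. "SRR");
    the numeric part is zero-padded to the width used in `first`.
    """
    prefix = first.rstrip("0123456789")
    if not prefix or not last.startswith(prefix):
        raise ValueError("accessions must share a non-empty prefix")
    start = int(first[len(prefix):])
    end = int(last[len(prefix):])
    if end < start:
        raise ValueError("range end precedes range start")
    width = len(first) - len(prefix)
    return [f"{prefix}{number:0{width}d}" for number in range(start, end + 1)]


runs = expand_accession_range("SRR20019782", "SRR20019834")
print(len(runs), runs[0], runs[-1])
```

Under the contiguity assumption, the range expands to 53 run accessions, which matches the 53 samples retained in this study.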

Technical Validation

All data processing steps and software used in this study are described in the “Methods” section. DNA yield and quality were measured using a NanoDrop One C (NanoDrop Technologies Inc, Wilmington, DE, USA) and a Qubit 4 Fluorometer with the dsDNA BR Assay Kit (Life Technologies, Paisley, UK).