Main

The vast majority of microorganisms are predicted to be undiscovered1. Traditionally, obtaining genomes of previously uncharacterized microbial species involves isolating and cultivating the microorganisms, followed by sequencing2. While this method has successfully yielded thousands of previously undescribed genomes3, culturing can be labour intensive and time consuming4, and most microbes are estimated to be unsuitable for isolation5. In the past decade, genome-centric metagenomics has emerged as an alternative and expedient means of characterizing microbial diversity through recovery of metagenome-assembled genomes (MAGs)6,7. Despite potential issues of contamination and incompleteness (for example, different microbial strains8 or species9), metagenomics allows large-scale recovery of previously undescribed genomes from uncultured microorganisms10,11. So far, the Genome Taxonomy Database (GTDB, release 220) comprises 113,104 prokaryotic species, of which 72.5% are represented exclusively by MAGs12, highlighting the current limitations in culture-based genomics. Therefore, MAGs will be indispensable to obtain genomic coverage of the estimated 2–4 million prokaryotic species inhabiting the biosphere13.

Soil has the potential to greatly increase the number of microbial species in the databases given its enormous microbial diversity14. However, this complexity also makes soil exceptionally challenging for MAG recovery14. Several attempts have been made to improve MAG recovery from soil, such as reducing the complexity of the sample through species enrichment15, cell sorting16 or deep short-read sequencing (for example, over 100 Gbp to several Tbp of sequencing data)17,18. However, none of these approaches have resulted in the cost-effective recovery of high-quality microbial genomes. Hence, developing a solution for the efficient recovery of high-quality MAGs from soil and other microbially complex habitats is considered the ‘grand challenge’ of metagenomics19.

In recent years, long-read sequencing has substantially enhanced our ability to recover high-quality microbial genomes from medium-complexity samples20,21. This has been complemented by the development of bioinformatic methods that improve MAG recovery from challenging samples through the use of deep-learning algorithms22,23 or additional binning features24,25,26. Therefore, multiple sequencing and bioinformatic approaches have now become available for tackling the ‘terrestrial metagenome challenge’.

Here we performed deep long-read Nanopore sequencing (~100 Gbp per sample) of 154 complex environmental samples collected as part of the Microflora Danica project, which aims to genomically catalogue microbial diversity in Denmark27. By developing a bioinformatics workflow that uses state-of-the-art metagenomic binning tools, combined with multicoverage and iterative binning, we obtained over 15,000 species-level MAGs. The great majority (97.9%) of these MAGs represent previously undescribed microbial genera or species, substantially expanding the microbial tree of life.

Results

High-throughput MAG recovery from soils and sediments

Of the 10,683 environmental samples collected during the Microflora Danica sampling campaign27, 154 samples (125 soil, 28 sediment, 1 water) from 15 distinct habitats (Supplementary Table 1, Fig. 1 and Dataset 1) were selected (see Methods for selection criteria) for deep long-read Nanopore sequencing (Fig. 1a) to explore assembly performance across a wide breadth of sample types. A total of 14.4 Tbp of long-read data was generated, with a per-sample median of 94.9 Gbp and an interquartile range (IQR) of 56.3–133.1 Gbp (Fig. 1b). The sequence reads had a median N50 (length cutoff where reads of that size or longer cover at least 50% of the total number of bases in the read dataset) of 6.1 kbp (IQR: 4.6–7.3 kbp) (Fig. 1c) and assembled into a total of 295.7 Gbp of metagenomic contigs, with a median contig N50 (length cutoff where contigs of that size or longer cover at least 50% of the total assembly) of 79.8 kbp (IQR: 45.8–110.1 kbp) per sample. The majority of reads were assembled into contigs, as a median 62.2% (IQR: 53.1–69.8%) of the sequence data was mapped back to the assemblies (Fig. 1d).
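The N50 definition given above can be illustrated with a minimal Python sketch. The function name and the toy lengths are illustrative assumptions and are not taken from the study's code.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (in bp).

    N50 is the length L such that sequences of length >= L together
    contain at least 50% of all bases in the dataset.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)  # longest sequences first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example (hypothetical lengths): 32 bases in total, so the
# two 8-bp sequences already cover half of the dataset.
toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(toy_lengths))
```

The same function applies to read lengths and contig lengths alike, since the definition only depends on the list of lengths.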

Fig. 1: Overview of the sequenced environmental samples.

a, Geographic distribution of the samples, coloured by sample habitat (Microflora Danica habitat descriptor level 1). Samples from habitats that occur once in the dataset are grouped into the ‘Other’ habitat category. b, Per-sample sequencing yield in Gbp. c, Per-sample sequenced read N50 values in kbp. d, Per-sample percentage of sequencing data that was estimated as assembled into contigs. e, Per-sample count of the recovered high- or medium-quality MAGs (see Methods for quality criteria). f, Estimates for per-sample percentage of sequenced data represented by the recovered high- or medium-quality MAGs. For the boxplots, results are shown for the sequenced samples (n = 154); the central line represents the median, the hinges correspond to the 25th and 75th percentiles, and the whiskers extend up to 1.5 times the IQR. g, Schematic overview of the mmlong2 metagenomics workflow. ‘Extra short reads’ refers to the shallow metagenome datasets from the Microflora Danica project27 used for differential coverage binning (see Methods).

To improve MAG recovery from high-complexity environmental samples, we developed mmlong2, a metagenomics workflow that features multiple optimizations for recovering prokaryotic MAGs from extremely complex metagenomic datasets. Briefly, mmlong2 performs metagenome assembly, polishing, removal of eukaryotic contigs and extraction of circular MAGs (cMAGs) as separate genome bins (Fig. 1g). It then performs differential coverage binning (incorporating read mapping information from multisample datasets, Supplementary Dataset 2), ensemble binning (using multiple binners on the same metagenome, Supplementary Fig. 2a) and iterative binning (metagenome gets binned multiple times iteratively, Supplementary Fig. 2b), which all contribute to increased MAG recovery (see Methods for details). Compared with other contemporary metagenomic binning workflows, mmlong2 enables recovery of more MAGs from terrestrial metagenomes, with the trade-off of moderately increased compute times (Supplementary Table 2).

In total, 6,076 high-quality (HQ) and 17,767 medium-quality (MQ) MAGs (23,843 total, Supplementary Dataset 3) were recovered by the mmlong2 workflow from the 154 sequenced samples, including 3,349 (14.0%) MAGs recovered by iterative binning (Supplementary Table 3), with a median of 154 (IQR: 89–204) high- or medium-quality MAGs recovered per sample (Fig. 1e). The obtained MAGs were estimated to account for a median of 24.0% (IQR: 16.7–32.9%) of the sequence data within individual samples (Fig. 1f).

Lower per-sample sequencing yields were observed for samples originating from the two habitat categories of agricultural fields as well as the bogs, mires and fens habitat (Supplementary Fig. 3a), which might be attributed to suboptimal DNA extraction leaving contaminants that compromise the DNA sequencing. In addition, the agricultural field samples had low amounts of sequence data assembled into contigs (median 45.0%, IQR: 39.3–50.1%, Supplementary Fig. 3b) and also the lowest per-sample count of high- or medium-quality MAGs (median 56 MAGs, IQR: 34–89, Supplementary Fig. 3c), whereas coastal habitat samples yielded the highest MAG recovery metrics (Supplementary Fig. 3). To investigate whether the relatively poor MAG yield from agricultural field samples (Supplementary Fig. 3d) was only due to sequencing yield, three agricultural and coastal samples were selected and subsampled to specific sequencing depths (from 20 to 100 Gbp, see Methods for more details). Despite normalization for sequencing effort, the coastal habitat samples still exhibited greater MAG yield (Supplementary Fig. 4a).

Between the two habitat types, there were no substantial differences in non-prokaryotic DNA (Supplementary Fig. 4b) and overall, a comparable number of prokaryotic species was observed in the reads (Supplementary Fig. 4c) or contigs (Supplementary Fig. 4d), without signs of full microbial diversity capture at 100 Gbp sequencing depth. However, k-mer redundancy analysis indicated that for the coastal samples, more species were abundant, compared with the agricultural samples (Supplementary Fig. 4e–g). Furthermore, MAGs from the coastal samples also had lower rates of MAG polymorphism (proxy for microdiversity, Supplementary Fig. 4h). Hence, the relatively poor MAG recovery from the agricultural field samples was influenced by reduced sequencing yields, higher microdiversity and the absence of dominant species (Supplementary Fig. 4i).

The multiple reasons for variation in MAG recovery were assumed to be facilitated by extensive ecological differences between the two soil habitats, as the highly saline and low-nutrient coastal ecosystems select for salt-tolerant organisms28, while the microbial communities from high-nutrient agricultural fields are shaped by agricultural practices29,30. Across the Microflora Danica ~10,000 shallow metagenomes27, the agricultural and coastal samples (Supplementary Fig. 5a) featured distinct microbial community compositions (Supplementary Fig. 5b, analysis of similarities (ANOSIM) R = 0.755, p = 0.001, n = 1,046), while notable differences were also observed between the coastal habitats of salt marshes or meadows and the habitats of sea cliffs, shingle or stony beaches (ANOSIM R = 0.329, p = 0.001, n = 235). Furthermore, phylum-level differences were pronounced (Supplementary Fig. 5c), as the agricultural habitats exhibited greater relative abundances of Firmicutes (Supplementary Fig. 5d) and fewer Proteobacteria (Supplementary Fig. 5e) or Bacteroidota (Supplementary Fig. 5f). Hence, variation in microbial community composition and taxonomic diversity was also expected to affect MAG yield from terrestrial habitats.

Contribution to terrestrial microbiomes

The recovered 23,843 MAGs (Fig. 2a) were dereplicated into 15,640 different species-level MAGs (Fig. 2b), comprising 4,894 HQ and 10,746 MQ MAGs (Fig. 2c–f and Supplementary Fig. 6a,b). Since the MAGs were recovered with long-read Nanopore sequencing, we refer to the dereplicated genome set as the Microflora Danica long-read (MFD-LR) MAG catalogue. The genomic catalogue was inspected for potential Nanopore-associated sequencing errors, and several instances (n = 73, 0.5% of genomes) of coding density values <75% were detected for MAGs with lower coverage (Supplementary Fig. 6c) and reduced guanine-cytosine (GC) content (Supplementary Fig. 6d). However, the reduced coding density was also found to be mostly prevalent in archaeal MAGs with increased rates of long homopolymers (>6 repeating nucleotides, Supplementary Fig. 6e), which in turn were more frequent in MAGs with low GC content (Supplementary Fig. 6f).

Fig. 2: Overview of the MAGs recovered from the sequenced samples.

a, Aggregated MAG counts by MAG contiguity, MAG quality rankings before and after MAG dereplication, together with MAG taxonomy rankings according to GTDB R220 classification. ‘Single-contig’ refers to MAGs that are composed of one contig, which was not reported as circular by the assembler. b, Species-level (>95% ANI) rarefaction curve for HQ and MQ MAGs recovered in this study. The solid line depicts the rarefaction (interpolation), while the dotted line shows the extrapolation of the curve. c, Distribution of dereplicated MAG size values in Mbp, grouped by MAG quality. d, Distribution of dereplicated MAG coverage values. e, Contig N50 for dereplicated MAGs in kbp. f, Coding density values for dereplicated MAGs. For the boxplots, results are shown for the dereplicated HQ (n = 4,894) and MQ (n = 10,746) MAGs; the central line represents the median, the hinges correspond to the 25th and 75th percentiles, the whiskers extend up to 1.5 times the IQR, and individual outliers are shown. g, Percentage coverage of Microflora Danica core genera27 at different sample metadata description levels by the recovered MAGs.

After species-level MAG dereplication, 51.4% (n = 12,255) of the recovered MAGs were singletons. Plotting a rarefaction curve for the MAGs showcased a near-linear (unsaturated) relationship between the number of recovered MAGs and the number of species-level clusters (Fig. 2b and Supplementary Fig. 7). The largest species-level cluster of 39 MAGs was recovered for the Pseudolabrys genus in the order Rhizobiales, and in total, 126 species-level clusters with >10 MAGs per cluster were obtained (Supplementary Fig. 8).

An advantage of long-read generated MAGs is that they mostly include ribosomal RNA (rRNA) operons, enabling direct comparison to the thousands of available 16S rRNA datasets and large-scale databases. Of the recovered dereplicated MAGs, 12,823 (82.0%) included at least one 16S rRNA gene and were taxonomically classified against the Microflora Global 16S rRNA database, which features 16S rRNA gene sequences from the original Microflora Danica project as well as major publicly available 16S rRNA databases27. Overall, 12,460 (97.2%) of these MAGs were classified to the genus level (>94.5% 16S rRNA gene identity), while 10,438 (81.4%) of the MAGs were assigned a species-level match (>98.7% 16S rRNA gene identity). Coverage of Microflora Danica core genera (across ~10,000 metagenomes27) by the MAG dataset in this study varied from 72.3% to 93.0%, depending on the metadata description category (Fig. 2g), and exceeded 90% for all soil habitats (Supplementary Fig. 9).
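The identity thresholds quoted above (>94.5% for a genus-level and >98.7% for a species-level 16S rRNA gene match) can be expressed as a small Python sketch. The function name and return labels are illustrative assumptions; the study's actual classification pipeline is not reproduced here.

```python
GENUS_IDENTITY = 94.5    # percent 16S rRNA gene identity, genus-level threshold
SPECIES_IDENTITY = 98.7  # percent 16S rRNA gene identity, species-level threshold


def rank_from_identity(identity_percent):
    """Return the lowest taxonomic rank supported by a 16S identity value."""
    if identity_percent > SPECIES_IDENTITY:
        return "species"
    if identity_percent > GENUS_IDENTITY:
        return "genus"
    return "no genus-level match"


for identity in (99.2, 96.0, 93.1):
    print(identity, rank_from_identity(identity))
```

Because the species threshold is stricter than the genus threshold, every species-level match is also a genus-level match, which is why the species-level count reported above is a subset of the genus-level count.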

Overall, 183.4% more dereplicated MAGs were recovered from the 154 deeply sequenced terrestrial samples than from the Microflora Danica short-read (MFD-SR) shallow metagenome study27, which sequenced close to 70-fold more samples (10,683 samples at ~5 Gbp each) (15,640 vs 5,518 HQ and MQ MAGs), and 11-fold more (4,894 vs 422) dereplicated HQ MAGs were recovered in this study. Also, more MAGs were recovered in this project than in the recent genome catalogues of the Tibetan Plateau Microbial Catalogue (TPMC31) and the Old Woman Creek wetland microbial genome catalogue (OWC32; Fig. 3a and Table 1). Compared with global genomic catalogues that aggregate vast numbers of previously published sequencing data (including low-complexity samples), such as the Searchable, Planetary-scale mIcrobiome REsource (SPIRE33), the Genomes from Earth’s Microbiome (GEM14), Rare Biosphere Genomes (RBG34) and Soil Microbial Dark Matter Metagenome Assembled Genome (SMAG35), our 154 long-read sequenced samples still produced similar or higher numbers of HQ genomes, despite, for example, SPIRE utilizing almost 100,000 individual samples (Fig. 3a and Table 1). The inferred sequencing costs per HQ MAG recovered were also estimated to be the lowest for the MFD-LR catalogue (Supplementary Table 4).

Fig. 3: Comparison of MAG catalogues from large-scale terrestrial environment studies.

a, Dereplicated MAGs (HQ and MQ) per genome catalogue. Individual counts for groups with >1,000 MAGs are presented when possible. The dereplicated MAG counts were achieved by applying the same genome quality control for all catalogues (see Methods). b, Upset plot for species-level MAG overlap between the catalogues. A genome cluster is marked as HQ if at least one of the genomes in the cluster is an HQ MAG. Groups with <200 MAGs were omitted from the plot. Total counts of gene clusters for rRNA operons (c), defence islands (d) and BGCs (e) predicted in the MAGs, grouped by gene cluster type and coloured by the fraction of clusters estimated as complete (see Methods). f, Classification rates for Microflora Danica shotgun metagenome datasets (9,916 samples above 1 Gbp yield) using different genome databases, grouped by sample type. ‘Public catalogues’ refers to previously described terrestrial genome catalogues (see Methods). For the boxplots, results are shown for the soil (n = 8,179), sediment (n = 1,518) and water (n = 219) sample types; the central line represents the median, the hinges correspond to the 25th and 75th percentiles, and the whiskers extend up to 1.5 times the IQR.

Table 1 Summary of terrestrial prokaryotic genome catalogues

Dereplicating MAGs between catalogues resulted in 138,407 species-level clusters, with most species-level overlaps occurring between the genome catalogues of SPIRE, SMAG and GEM, due to large overlaps in primary data sources (Fig. 3b). For MAGs from this study, most species-level overlaps occurred with the short-read Microflora Danica MAG catalogue (n = 1,423), although 12,750 dereplicated MAGs (and 3,653 HQ MAGs) from this project represent distinct species.

MAGs from this study also featured greater assembly contiguity with a median contig count of 20 (IQR: 10–36), compared with >100 for short-read MAG catalogues (Table 1). Improved genome contiguity suggests enhanced assembly of complex genomic regions and indeed, we observed greatly improved recovery of rRNA genes as part of complete operons (Fig. 3c) and more complete defence gene islands, especially CRISPR-Cas clusters (Fig. 3d). More complete biosynthetic gene clusters (BGCs) were also a feature of the long-read assemblies, and a median of 6.1-fold (IQR: 3.8–14.8) more complete BGCs were observed in the MAGs from this study than other short-read MAG catalogues (Fig. 3e).

The aforementioned genome catalogues were used as reference databases for classifying the ~10,000 shallow metagenome datasets from the Microflora Danica project27. Using the GTDB R220 database alone for read classification resulted in a median species-level classification rate of 3.0% (IQR: 1.8–4.1%), whereas including the short-read MAG catalogues increased the median classification rate to 17.4% (IQR: 14.3–24.4%). Addition of the long-read MAGs from this study resulted in a database of 229,714 non-redundant genomes and increased species classification to a median of 36.6% (IQR: 29.6–42.9%, Fig. 3f), with the greatest improvements occurring for soil samples (Supplementary Fig. 10).

Previously undescribed and expanded microbial lineages

Taxonomic classification using GTDB R220 resulted in average nucleotide identity (ANI)-based species-level assignments for 326 MAGs, which comprise 2.1% of the dereplicated MAGs. To determine the phylogenomic gain and diversity for the remaining 15,314 (97.9%) dereplicated MAGs that could not be assigned a species-level taxonomic label, de novo phylogenetic trees were constructed using MAGs from this study and GTDB R220 species representatives (Fig. 4). MAGs recovered in this study were found to increase the total branch length of the GTDB prokaryotic genome tree by 8.1% (Supplementary Fig. 11), with most of the branch expansion occurring at genus or species level in both the bacterial and archaeal domains (Supplementary Fig. 12). Based on relative evolutionary divergence (RED), this added diversity comprises 1 phylum, 21 orders, 91 families and 1,086 genera (Table 2).

Fig. 4: Distribution of the recovered MAGs across the microbial genome tree of life.

Microbial genome trees with GTDB R220 representative species for Bacteria (120 marker genes) and Archaea (53 marker genes) were built separately with 100 bootstraps and merged into a single tree, spanning both domains. The tree branches are coloured by domain (Bacteria or Archaea). Tree tips of dereplicated MFD-LR MAGs are marked with red dots within the tree. The outer circle highlights tree tips of the 15 phyla with the highest number of genomes. Reproduced from GTDB, CC BY-SA 4.0.

Table 2 Summary for the contribution of MFD-LR MAGs to GTDB R220

The microbial lineages represented by MFD-LR MAGs were widely distributed across terrestrial habitats, with 98.8% (n = 9,930) of MFD-SR samples containing reads classified to MAGs from at least one previously undescribed genus (Supplementary Fig. 13a–c). Reads for genomes of previously uncharacterized families and orders were found in 68.2% (n = 6,849) and 24.7% (n = 2,480) of samples, respectively. Urban soils had the highest frequency of previously undescribed genera (median 35 per sample, IQR: 25–45, Supplementary Fig. 13d), contributing a median of 2.9% (IQR: 2.0–3.8%) of sequenced reads (Supplementary Fig. 13e), with the roadside habitat featuring the most previously undescribed genera (median 41, IQR: 26–55) and families (median 5, IQR: 4–6, Supplementary Fig. 14a). In contrast, uncharacterized genera and families, identified only by the 16S rRNA gene in terrestrial habitats27, were most common in sediment samples (Supplementary Figs. 13f,g and 14b), with 22,437 genera and 1,095 families currently lacking genomic representation.

The MAG representing a previously undescribed phylum was successfully re-assembled into a circular 2.9 Mbp genome with a GC content of 51.3% and a single rRNA operon. The coding density was 91.8%, although 57.7% of the predicted genes (n = 2,473) were hypothetical. We detected species-level matches of the MAG in 7 of the ~10,000 MFD-SR environmental metagenomes, representing four geographic locations (Supplementary Fig. 13a). Six of these metagenomes were from dystrophic lakes (characterized by high organic acid content, low nutrients and low pH), suggesting relatively low environmental prevalence of the lineage and habitat specificity. This was reflected in the genomic potential of the MAG, as metabolic reconstruction indicated that the bacterium is probably motile, Gram-negative and adapted to an anaerobic environment with available dissolved organic carbon. The MAG encoded the potential to ferment glucose to acetate, use ethanolamine as a source of nitrogen and energy, and fix or detoxify formaldehyde using the ribulose monophosphate pathway36 (Supplementary Dataset 4). Due to the considerable phylogenetic distinctiveness of the cMAG, we propose the name Oederibacterium danicum sp. nov. in honour of Georg Christian Oeder, a scientist who led the original Flora Danica project37.

A total of 207 previously undescribed genera and 1,170 species were represented by at least one HQ MAG comprising ≤10 contigs, for which we proposed names (Supplementary Dataset 5) under the SeqCode38. Since the MFD-LR MAGs were recovered from Danish habitats, genus names were derived from Danish towns near the sampling locations, and species names were derived from environmental features of the samples from which the MAGs were obtained (see Methods). For genomes that could be assigned to GTDB lineages with placeholder names, we also proposed higher rank names on the basis of the genus stems under the SeqCode to provide taxonomic congruence (Supplementary Dataset 5).

MAGs recovered in this study spanned 75 of the 217 currently recognized phyla, with 50% or higher increases in species-level MAGs for 10 phyla (Supplementary Figs. 15, 16 and 17a, and Table 5). Notably, MAGs were recovered for underrepresented phyla with placeholder names, such as JAUVQV01, CAKKQC01 and UBP4, all of which featured only 2 species in GTDB R220. Furthermore, the inclusion of species-level MAGs from this study has substantially expanded several highly populated phyla. Actinomycetota increased by 42.1% (from 11,737 to 16,683 species-level genomes), Chloroflexota by 38.5% (from 2,749 to 3,808 genomes) and Acidobacteriota by 134.2% (from 1,891 to 4,429 genomes). Similar increases in microbial lineage genome counts were observed when examining class, order and family ranks (Supplementary Fig. 17b–d and Table 5).

A total of 12,779 dereplicated and previously undescribed species MAGs were classified as 2,052 different known genera. Of these genera, 682 (33.2%) were represented by a single genome in GTDB and inclusion of MAGs recovered in this study expanded the species-level representatives by more than 100% for 1,065 genera (51.9% of existing genera with MFD-LR MAGs). The highest number of recovered genomes for a known genus was for Palsa-744 (Actinomycetota), with an increase of 1,230.8% (from 26 to 346 genomes), whereas the highest increase in a microbial lineage of 5,000% (from 1 to 51 genomes) was observed for the genus RYN-230 (Actinomycetota).
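The percentage increases reported for individual lineages follow from the before and after genome counts. As a minimal Python sketch (the function name is an illustrative assumption), the arithmetic is:

```python
def percent_increase(before, after):
    """Relative increase from `before` to `after`, expressed in percent."""
    return (after - before) / before * 100


# Genome counts quoted in the text for two genera and one phylum.
examples = {
    "Palsa-744": (26, 346),
    "RYN-230": (1, 51),
    "Acidobacteriota": (1891, 4429),
}
for lineage, (before, after) in examples.items():
    print(lineage, round(percent_increase(before, after), 1))
```

Note that an increase of 100% corresponds to a doubling of the genome count, so the 1,230.8% increase for Palsa-744 is roughly a 13-fold expansion.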

This study provides HQ genomes for 158 known microbial families and 612 known genera that were previously represented only by MQ genomes (Table 2). Notable examples of such lineages include the orders of Pacearchaeales (Nanoarchaeota) and Micrarchaeales (Micrarchaeota), which in GTDB R220 are represented by 235 and 189 MQ MAGs, respectively. Similarly, this study provides genomes with complete 16S rRNA genes for 436 known genera (Table 2) that were previously lacking such representation, including the Actinomycetota genera Gaiellasilicea, Gaiella and Desertimonas, which were all expanded more than 10-fold.

Discussion

Here we developed mmlong2, a bioinformatic workflow that capitalizes on high-throughput deep long-read sequencing to recover MAGs from highly complex terrestrial samples. To evaluate performance, we sequenced 154 soil and sediment samples across 15 environmental habitats. Overall, hundreds of MAGs could be recovered from each sample, thereby enabling cost-efficient MAG recovery at scale from soils and sediments. However, MAG recovery varied between habitats, especially with agricultural soils consistently yielding fewer MAGs. We show that the variance in MAG recovery between habitats was influenced by sequencing yield, microdiversity and community composition. Furthermore, non-biological factors can also impact MAG recovery, as terrestrial habitats can feature vastly different chemical compositions and abiotic factors39, which shape microbial communities30,40. Hence, we recommend that researchers take into consideration the unique features of each terrestrial habitat when conducting experimental design for future metagenomics projects (for example, habitat-optimized DNA extraction41, low-biomass-compatible sequencing protocols42). In general, we recommend sequencing at least 60 Gbp per sample, as this ensures access to the genomes of both dominant terrestrial species and low-abundance species as evidenced by no indication of saturation observed in the sequencing depth investigated (up to 100 Gbp). We also note that high-throughput recovery of multipartite or plasmid-containing genomes from terrestrial environments remains challenging, although recent advances in methylation-based binning offer promising improvements26.

Compared with other extensive genome-centric studies of terrestrial habitats14,31,33,35, this study used long reads to recover MAGs from terrestrial samples at scale. The improved long-read MAG contiguity permits higher resolution of complex genomic regions43, such as repeated operons and gene clusters. As the majority of the MAGs were recovered with 16S rRNA genes, most could be linked to the Microflora Global27 and other 16S rRNA gene databases. Since rRNA gene databases are generally more diverse than genome databases44, recovering more MAGs with complete 16S rRNA genes facilitates improved taxonomic classification and improved linkage between genome and 16S rRNA gene databases45. Furthermore, unlike previous terrestrial MAG catalogues, the majority of BGCs and CRISPR-Cas defence islands recovered in this study were estimated to be complete due to improved assembly of the long reads46 and represent the largest collection of complete BGCs from a MAG catalogue so far, which could facilitate the discovery of medically and industrially valuable biochemical compounds47.

Previously undescribed HQ and MQ MAGs were recovered for the great majority of genera reported as constituting the core microbiome of different terrestrial habitats27, thereby enabling further in-depth analysis of functional potential20. The recovered MAGs can also be used to design targeted cultivation strategies to establish pure cultures of select microbial species48. Furthermore, including MAGs from this study in taxonomically classifying the ~10,000 short-read Microflora Danica datasets increased median species-level classification from 17.4% to 36.6%, representing a substantial improvement in the ability to explore complex microbial communities at species level using short-read shotgun metagenomics. The considerable improvement in terrestrial metagenome classification also underscores the need for more localized metagenomics projects to acquire genomes of microbes unique to a particular environment or habitat type49.

Most of the recovered MAGs from this study constitute previously undescribed microbial species or genera, which is a common finding of recent large-scale terrestrial microbiome studies33,35, highlighting that each genome catalogue contributes substantially to characterizing the global microbiome. However, thousands of genera in terrestrial habitats still lack genomic representation, necessitating further genome recovery from complex environments. Although the addition of previously undescribed microbial lineages from this study occurred mainly at species or genus level, hundreds of recognized order or family level lineages were substantially expanded. As many microbial lineages are currently represented by a single placeholder MAG in GTDB, the expansion of these lineages is imperative to fill the gaps in the tree of life. This study also provides HQ MAGs for hundreds of GTDB lineages currently only represented by comparatively fragmented lower-quality MAGs. By proposing Latin names for microbial lineages under the SeqCode38 using contiguous HQ MAGs as nomenclatural types, we help to address the contemporary issue of a rapidly growing number of unnamed microbial taxa in public databases50. As microbial genome databases continuously improve3, the quality and not just the quantity of database additions should be emphasized. Hence, we anticipate this genome catalogue will serve as a valuable resource and template for gaining insights into the microbial ecology of the world’s most complex environments.

Methods

Sample selection

Samples used in this project include terrestrial samples collected as part of the Microflora Danica sampling campaign27. Briefly, bulk soil samples were collected using a weed extractor, which was cleaned with 70% ethanol before sampling, while also taking special care to avoid objects, such as sticks, leaves, grass and insects. The bulk sediment samples were collected using a gravity corer, followed by removal of any collected water or larger debris. A detailed description of the sample collection and processing is provided in the Microflora Danica study27. All environmental samples used in this study were collected and handled in a responsible manner and in accordance with local laws.

Samples for deep, long-read sequencing were selected using the Microflora Danica shallow metagenome 16S rRNA gene observational tables, aggregated to the genus level (on the basis of classification to the Microflora Global 16S rRNA gene reference database) of 10,683 environmental samples27. Initially, samples with >2 Gbp sequencing yield and with sample type of ‘soil’, ‘sediment’ or ‘water’ were selected to ensure that the picked samples are from environmental habitats and that the metagenomic-derived 16S rRNA gene profiles are adequately representative of the sample. Next, genera with at least 0.1% relative abundance and a minimum raw abundance (supporting read count) of 5 were counted, and samples that featured at least 75 of the selected genera with a combined relative abundance of 70% were further selected to omit samples that are mostly dominated by rare species or belong to a low-complexity metagenome. For the remaining samples, genera assigned with de novo taxonomy after classification to the Microflora Global 16S rRNA gene reference database27 and featuring a minimum relative abundance of 0.2% as well as minimum raw abundance of 10 were counted, and 300 samples with the highest number of uncharacterized genera were selected to optimize the likelihood of recovering previously undescribed MAGs. The remaining samples were then manually curated to optimize for microbial diversity between the samples by omitting samples that overlap based on sampling location, or feature high overlapping genus counts with the rest of the selected samples.
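The automated part of the selection procedure described above can be sketched in Python as follows. This is a hedged illustration only: the class names, field names and function names are hypothetical, the "combined relative abundance of 70%" is interpreted as "at least 70%", and the final manual curation step is not modelled.

```python
from dataclasses import dataclass


@dataclass
class Genus:
    name: str
    rel_abundance: float  # percent of the 16S rRNA gene profile
    reads: int            # supporting read count (raw abundance)
    de_novo: bool         # True if the genus only has a de novo taxonomy


@dataclass
class Sample:
    sample_id: str
    sample_type: str
    yield_gbp: float
    genera: list


def passes_complexity_filter(sample, min_genera=75, min_combined=70.0):
    """Require enough well-supported genera that together dominate the profile."""
    kept = [g for g in sample.genera if g.rel_abundance >= 0.1 and g.reads >= 5]
    combined = sum(g.rel_abundance for g in kept)
    return len(kept) >= min_genera and combined >= min_combined


def novel_genus_count(sample):
    """Count de novo genera with >=0.2% relative abundance and >=10 reads."""
    return sum(
        1 for g in sample.genera
        if g.de_novo and g.rel_abundance >= 0.2 and g.reads >= 10
    )


def shortlist(samples, top_n=300, min_genera=75, min_combined=70.0):
    """Apply the yield, sample-type and complexity filters, then rank by novelty."""
    eligible = [
        s for s in samples
        if s.yield_gbp > 2
        and s.sample_type in {"soil", "sediment", "water"}
        and passes_complexity_filter(s, min_genera, min_combined)
    ]
    ranked = sorted(eligible, key=novel_genus_count, reverse=True)
    return ranked[:top_n]
```

The default thresholds mirror the values stated in the text; they are exposed as parameters so that the logic can be exercised on small toy datasets.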

DNA extraction and Nanopore sequencing

DNA from the selected environmental samples was extracted using the DNeasy PowerSoil Pro kit (QIAGEN, 47016), and the quality of the extracted DNA was evaluated using the NanoDrop One spectrophotometer (Thermo Fisher) and the Qubit dsDNA HS kit (Thermo Fisher, Q33231) with a Qubit 3.0 fluorometer (Thermo Fisher) to measure DNA concentration. The DNA was then prepared for sequencing using the SQK-LSK114 Ligation Sequencing kit (Oxford Nanopore), loaded into FLO-PRO114M Nanopore flow cells (Oxford Nanopore) and sequenced in 400 bps sequencing speed mode using either the P2 or the P24 (Supplementary Dataset 1) sequencers (Oxford Nanopore).

Read data processing

The raw Nanopore sequencing data were collected using the MinKNOW software (v.22.07.4-23.04.5, Supplementary Dataset 1, https://community.nanoporetech.com/downloads) and basecalled with Guppy (v.6.2.1-6.5.7, Supplementary Dataset 1, https://community.nanoporetech.com/downloads) in super-accurate mode. Due to irreversible updates to the MinKNOW software, some samples were sequenced with the 4 kHz sampling rate, while others were acquired using the 5 kHz rate (indicated in Supplementary Dataset 1). The sequenced reads were then split with duplex-tools (v.0.2.14, https://github.com/nanoporetech/duplex-tools) and trimmed using Porechop (v.0.2.3)51. Reads with a Phred quality score <7 or length <0.2 kbp were filtered out with NanoFilt (v.2.6.0)52. Summary statistics for the split, trimmed and filtered Nanopore reads were acquired using Nanoq (v.0.10.0)53.
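The read filtering criteria can be sketched in Python as follows. This is an illustration and not NanoFilt's own code: the function names are hypothetical, and averaging per-base error probabilities (rather than raw Phred values) is one common convention for a read-level quality score, assumed here.

```python
import math

def mean_phred(quals):
    """Read-level quality from per-base Phred scores.

    Assumption: scores are converted to error probabilities, averaged,
    and converted back, so low-quality bases weigh more than in a plain
    arithmetic mean of the Phred values.
    """
    probs = [10 ** (-q / 10) for q in quals]
    return -10 * math.log10(sum(probs) / len(probs))

def keep_read(seq, quals, min_q=7.0, min_len=200):
    # Reads with quality <7 or length <0.2 kbp are filtered out
    return len(seq) >= min_len and mean_phred(quals) >= min_q
```

For example, a two-base read with Phred scores 20 and 3 has an arithmetic mean of 11.5 but a probability-averaged score below 7, so it would fail the quality cutoff under this convention.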

MAG recovery with mmlong2

MAGs were recovered from the sequenced samples using the custom-developed mmlong2-lite metagenomics workflow (v.1.0.2, https://github.com/Serka-M/mmlong2-lite). Briefly, mmlong2-lite is a Snakemake (v.7.26.0)54 bioinformatics workflow that takes long reads (Nanopore or PacBio HiFi) and performs metagenome assembly, contig filtering, binning and an initial MAG quality check. For Nanopore datasets, the reads are assembled into metagenomes using Flye (v.2.9.2)55 with the ‘--meta’ and ‘--nano-hq’ options. Furthermore, the ‘-fmc’ flag of mmlong2-lite controls the ‘min_read_cov_cutoff’ option of Flye, which can be increased to filter out more low-coverage contigs and thus shorten the metagenome assembly turnaround time. For this study, the ‘-fmc 8’ option of the workflow was used for read datasets consisting of >50 Gbp of data to speed up the assembly.

The assembled Nanopore-only metagenomes were then polished with 1 round of Medaka (v.1.8.0, https://github.com/nanoporetech/medaka) to reduce the amount of indel errors in the initial assembly. Contigs <3 kbp were filtered out using SeqKit (v.2.4.0)56 and the remaining contigs were then classified with Tiara (v.1.0.3)57 to remove eukaryotic contigs from the assembly.

Before metagenomic binning, the ‘assembly_info.txt’ file output by Flye was used to extract circular contigs above the default length threshold of 250 kbp to be kept as separate bins. The remaining contigs were then used for iterative ensemble binning (Supplementary Fig. 2a) with MetaBAT2 (v.2.15)58, SemiBin2 (v.1.5)59, GraphMB (v.0.1.5)24, and DAS Tool (v.1.1.3)60 with the ‘--search_engine diamond’ setting. For this study, multiple shallow metagenome read datasets (2,819 different samples) from the Microflora Danica study27 were selected on the basis of overlapping genus-aggregated community profiles (specific samples indicated in Supplementary Dataset 2) and used as input for the workflow (with the ‘-cov’ option) to perform multicoverage metagenomic binning for improved MAG recovery. The coverage profiles were generated by mapping the read datasets to the metagenome using Minimap2 (v.2.26)61 and SAMtools (v.1.16.1)62, followed by coverage calculation using the ‘jgi_summarize_bam_contig_depths’ function of MetaBAT2. The concatenated coverage profiles were then provided as input to MetaBAT2 and GraphMB, while for SemiBin2, the mapping files were provided directly.
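The extraction of circular contigs can be illustrated with a short Python sketch that parses the tab-separated ‘assembly_info.txt’ table, in which the first, second and fourth columns hold the contig name, length and circularity flag (‘Y’ or ‘N’). The function name is hypothetical, and treating the 250 kbp threshold as inclusive is an assumption; the workflow's own implementation may differ.

```python
def circular_contigs(lines, min_len=250_000):
    """Return names of circular contigs at or above min_len.

    `lines` is an iterable of lines from Flye's assembly_info.txt
    (columns: seq_name, length, cov., circ., repeat, ...).
    """
    selected = []
    for line in lines:
        # Skip the header (starts with '#') and blank lines
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        name, length, circ = fields[0], int(fields[1]), fields[3]
        # Assumption: the length boundary is treated as inclusive
        if circ == "Y" and length >= min_len:
            selected.append(name)
    return selected
```

Contigs returned by such a function would be set aside as single-contig bins, and only the remaining contigs would be passed on to the binners.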

After recovering the ensemble bins with DAS Tool, CheckM2 (v.1.0.2)63 was used to acquire bin completeness and contamination metrics, followed by selection of bins meeting the requirements for HQ MAGs (>90% completeness, <5% contamination). The unselected contigs were then binned again using the same binners and all HQ or MQ MAGs (>50% completeness, <10% contamination) were selected (Supplementary Fig. 2b). Ensemble binning of the unselected contigs was then repeated a third time, with the genome quality score filtering feature of DAS Tool turned off and a pre-trained binning model used for SemiBin2 (‘global’ model by default, ‘soil’ for samples from this study), followed by retention of all HQ and MQ MAGs. The remaining unselected contigs were then binned one last time with MetaBAT2, and only HQ or MQ MAGs, as estimated by CheckM2, were kept as the final output of the MAG production workflow.

MAG relative abundance and coverage values were computed using CoverM (v.0.6.1)64, while quality metrics were obtained by running Quast (v.5.2.0)65 on the MAGs. Outputs from different tools were then aggregated to acquire a single dataframe with per-genome statistics.

The mmlong2-lite workflow is publicly available at https://github.com/Serka-M/mmlong2-lite and https://zenodo.org/record/8013498. The workflow is also the metagenomic binning component of the full mmlong2 pipeline (https://github.com/Serka-M/mmlong2), which supports automated MAG analyses that were omitted in this project to conserve computing resources.

MAG quality control and inspection

CheckM1 (v.1.2.2)66 was run on all the recovered MAGs using the lineage-specific workflow, and all MAGs with <50% completeness or >10% contamination were omitted. The MAGs were then assigned a quality score using CheckM1 metrics as follows: completeness − (5 × contamination). MAGs with a quality score <30 were omitted. The remaining MAGs were then dereplicated using dRep (v.2.6.2)67 with the following settings: ‘-comp 50’, ‘-con 10’, ‘-sa 0.95’, ‘-nc 0.4’. Furthermore, the MAGs were screened for tRNA genes with tRNAscan-SE (v.2.0.9)68 using bacterial and archaeal models, while rRNA genes were detected with Barrnap (v.0.9, https://github.com/tseemann/barrnap) and Bakta (v.1.9.4)69, which was run in metagenome mode.
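As a worked illustration of the CheckM1-based filter, the quality score and cutoffs can be written as a small Python sketch. The function names are hypothetical, and the inputs are assumed to be completeness and contamination percentages as reported by CheckM1.

```python
def quality_score(completeness, contamination):
    # Quality score as defined in the text: completeness - (5 x contamination)
    return completeness - 5 * contamination

def passes_checkm1_filter(completeness, contamination):
    # MAGs with <50% completeness or >10% contamination are omitted,
    # as are MAGs with a quality score <30
    if completeness < 50 or contamination > 10:
        return False
    return quality_score(completeness, contamination) >= 30
```

For instance, a MAG with 60% completeness and 7% contamination passes the individual cutoffs but has a quality score of 25 and would therefore be omitted.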

Following the minimum information about metagenome-assembled genome (MIMAG) guidelines70, MAGs were classified into HQ MAGs if they exhibited >90% completeness and <5% contamination estimates by CheckM2 (v.1.0.2)63 while also featuring the 16S, 23S and 5S rRNA genes at least once, together with a minimum of 18 unique tRNA genes. MAGs not meeting these criteria but featuring >50% completeness and <10% contamination were classified as MQ MAGs. Only MIMAG HQ and MQ MAGs, as estimated by both CheckM1 and CheckM2, were used in this study (Supplementary Dataset 3).
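The MIMAG classification rule described above can be sketched in Python as follows. The function name, argument names and the ‘LQ’ label for MAGs that meet neither set of criteria are assumptions made for illustration; the sketch applies to a single set of completeness and contamination estimates and does not encode the requirement that both CheckM1 and CheckM2 agree.

```python
def classify_mimag(completeness, contamination,
                   n_16s, n_23s, n_5s, n_unique_trna):
    """Classify a MAG as 'HQ', 'MQ' or 'LQ' (hypothetical label)."""
    # 16S, 23S and 5S rRNA genes must each be present at least once
    has_all_rrna = n_16s >= 1 and n_23s >= 1 and n_5s >= 1
    if (completeness > 90 and contamination < 5
            and has_all_rrna and n_unique_trna >= 18):
        return "HQ"
    # MAGs failing the HQ criteria can still qualify as MQ
    if completeness > 50 and contamination < 10:
        return "MQ"
    return "LQ"
```

Under this rule, a near-complete MAG that lacks a 5S rRNA gene or carries only 17 unique tRNA genes is classified as MQ rather than HQ.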

Unless otherwise specified, MAG completeness, contamination and coding density values, as reported by CheckM2, were used in the plots and text. For cMAGs, dnaapler (v.1.1.0)71 was applied to verify that replication initiator genes are present in all recovered cMAGs.

MAG taxonomic classification

The recovered MAGs were classified with GTDB-Tk (v.2.4.0)72 against the GTDB R220 database using the ‘classify_wf’ workflow. The 16S rRNA sequences, which were extracted from the recovered MAGs, were classified with Usearch (v.11.0.667)73 against the Microflora Global (v.1.0)27 and SILVA (v.138.2)74 16S rRNA gene databases using the ‘-usearch_global -strand both -top_hit_only’ settings. The 16S rRNA taxonomic classification was considered species level if the top hit identity to the reference database sequence was ≥98.7%, genus level if ≥94.5%, family level if ≥86.5% and order level if ≥82.0% top hit identity.
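The identity thresholds used for 16S rRNA gene classification can be expressed as a simple lookup, sketched here in Python. The function name and the `None` return value for hits below the order-level threshold are assumptions made for illustration.

```python
# (minimum top hit identity in percent, taxonomic rank), most specific first
RANK_THRESHOLDS = [
    (98.7, "species"),
    (94.5, "genus"),
    (86.5, "family"),
    (82.0, "order"),
]

def rank_from_identity(percent_identity):
    """Return the most specific rank supported by the top hit identity."""
    for threshold, rank in RANK_THRESHOLDS:
        if percent_identity >= threshold:
            return rank
    # Assumption: hits below 82.0% are left unresolved at these ranks
    return None
```

A top hit with 95.2% identity would therefore be treated as a genus-level classification, since it clears the genus threshold but not the species threshold.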

Comparison of terrestrial habitats

To compare different soil habitats for MAG recovery at normalized sequencing depths, three sequenced samples per habitat for agricultural (MFD00392, MFD05176, MFD08497) and coastal (MFD02416, MFD05684, MFD01721) groups were randomly selected and subsampled to custom depths (20, 40, 60, 80, 100 Gbp) using Rasusa (v.2.0.0)75, followed by MAG recovery with mmlong2-lite (v.1.0.2). Detection of eukaryotic sequences was performed by classifying the reads and assembled contigs with Kaiju (v.1.10.1)76 using the ‘kaiju_db_nr_euk_2023-05-10’ database. Read and contig taxonomic profiling and species detection were performed using Melon (v.0.2.0)77. Variant detection in MAGs for microdiversity assessment was performed with Longshot (v.1.0.0)78. Read k-mer counts were acquired using Jellyfish (v.2.2.10)79, while general read and contig statistics were obtained with Nanoq (v.0.10.0)53 and Cramino (v.0.14.1)80. The code used for performing yield-normalized metagenomics comparisons is available at https://github.com/Serka-M/mmcomp.

For comparing microbial compositions between different habitats, metagenomic datasets from the short-read Microflora Danica study27 were selected to include samples from coastal and agricultural habitats with at least 1 Gbp sequencing yield. To ensure similar sample counts per group, the agricultural samples were randomly subsetted to achieve up to 150 samples per habitat descriptor level 2. Next, the metagenomic-derived 16S rRNA profiles of the selected 1,046 samples were used to build a Bray–Curtis dissimilarity matrix, and the statistical significance of community composition differences was evaluated with the ANOSIM test (two-sided) using 999 permutations via vegan (v.2.6-6.1)81 in R (v.4.4.1)82. Principal coordinate decomposition of the dissimilarity matrix was performed with ape (v.5.8)83.
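For reference, the Bray–Curtis dissimilarity between two abundance profiles is the sum of absolute per-taxon differences divided by the total abundance of both samples. A minimal pure-Python sketch is given below; the function names are hypothetical, the study itself used vegan in R, and the ANOSIM test and principal coordinate analysis are not reproduced here.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two equal-length abundance vectors."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same taxa in the same order")
    total = sum(a) + sum(b)
    if total == 0:
        raise ValueError("at least one profile must be non-empty")
    return sum(abs(x - y) for x, y in zip(a, b)) / total

def dissimilarity_matrix(profiles):
    """Symmetric matrix of pairwise Bray-Curtis dissimilarities."""
    n = len(profiles)
    return [[bray_curtis(profiles[i], profiles[j]) for j in range(n)]
            for i in range(n)]
```

The value ranges from 0 for identical profiles to 1 for profiles that share no taxa.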

Comparison of metagenomic binning workflows

The mmlong2 workflow was compared against similar metagenomic binning pipelines that feature support for Nanopore long-read assembly and binning (Supplementary Table 2). A test run with 100 Gbp of sample MFD02416 was used to recover MAGs with mmlong2-lite (v.1.0.2) using the settings ‘-sem soil’ and ‘-med r1041_e82_400bps_sup_g615’. A test run was also performed with Aviary (v.0.11.0, https://github.com/rhysnewell/aviary) with the following settings: ‘aviary complete’, ‘-z ont_hq’, ‘-s 3000’, ‘-b 250000’, ‘-w recover_mags’, ‘--medaka-model r1041_e82_400bps_sup_g615’, ‘--skip-qc’, ‘--binning-only’. The SqueezeMeta workflow (v.1.6.5)84 was also included by using the settings ‘-a flye’, ‘-m sequential’, ‘-contiglen 3000’, ‘-map minimap2-ont’, ‘-mapping_options "-I 120G -K 5G"’, ‘-binners concoct,metabat2,maxbin’, ‘--nocog’, ‘--nokegg’, ‘--nopfam’, ‘-test 15’. All test runs were performed using 100 CPUs to ensure comparable run times. The recovered MAGs from different workflows were then classified according to MIMAG guidelines, which included processing the MAGs from SqueezeMeta with CheckM2 while reusing the CheckM2 quality scores output by mmlong2 and Aviary. HQ and MQ MAGs from different workflows were dereplicated to compare species-level genome recovery.

Comparison of MAG catalogues

Publicly available MAG catalogues were downloaded from the following studies that featured terrestrial MAGs and at least 1,000 dereplicated MAGs: SPIRE33, RBG34, GEM14, SMAG35, TPMC31, OWC32. MAG quality assessment and quality filtering was performed in the same manner as with the MAGs recovered in this study. For the GEM catalogue, the full MAG dataset was downloaded and dereplication was performed separately to obtain non-redundant MAGs that were recovered in the study. For the SPIRE catalogue, genome entries from the proGenomes database85 were omitted and the MAG catalogue was also dereplicated separately, as multiple instances of species-level redundancy were observed for the catalogue. For RBG, MAGs that were reported as representative of previously undescribed species by the authors were used.

MAGs from all catalogues were then annotated with Bakta (v.1.9.4)69 and screened for defence islands via DefenseFinder (v.1.3.0)86. Screening for secondary metabolites was performed with antiSMASH (v.7.1.0)87 using the following options: ‘--cb-general’, ‘--cb-subclusters’, ‘--cb-knownclusters’, ‘--genefinding-tool prodigal-m’, ‘--asf’, ‘--pfam2go’, ‘--smcog-trees’, ‘--rre’, ‘--tfbs’. The secondary metabolite data were then parsed and aggregated into a dataframe via the ‘tabulate_regions.py’ script from multiSMASH (v.0.3.0, https://github.com/zreitz/multismash), and gene clusters located at the edge of the contig were considered potentially incomplete. Genomes from the catalogues and GTDB R220 were used for building reference databases to classify the Microflora Danica short-read datasets27 (9,916 samples that had >1 Gbp yield) via Sylph (v.0.6.1)88.

Phylogenomic analysis

Automated MAG phylogenetic assessment was performed using a custom pipeline available from https://github.com/aaronmussig/mag-phylogeny. Briefly, marker genes were extracted from the MAGs and aligned with the marker genes of GTDB R220 representative genomes using the ‘infer’ module of GTDB-Tk (v.2.4.0)72. The marker gene alignment was then used to build bacterial and archaeal genome trees via FastTree (v.2.1.11)89 with the ‘WAG’ model and 100 bootstraps. RED values90 for the recovered MAGs were determined with PhyloRank (v.0.1.12, https://github.com/dparks1134/PhyloRank) and phylogenetic classification was performed with the ‘summary_novelty_of_genomes.py’ script of the workflow. GTDB representative species classifications, lineage taxonomies and genome quality rankings were acquired from GTDB R220 metadata files. Phylogenies of highly divergent MFD-LR MAGs were manually inspected and curated.

Recovery and analysis of Oederibacterium danicum genome

The fragmented MAG MFD01231.bin.1.34, representing the Oederibacterium danicum lineage, was re-assembled into a cMAG by initially classifying the contigs of the sample MFD01231 assembly against the National Center for Biotechnology Information (NCBI) nr database (release 9 November 2023) using mmseqs2 (v.14.7e284)91. Reads mapping to contigs that were assigned the NCBI taxonomy of ‘d_Bacteria’ without any subordinate taxa were then extracted using the ‘view’ and ‘bam2fq’ modules of SAMtools, and assembled with Flye using the ‘--meta --extra-params min_read_cov_cutoff=18’ settings. The resulting assembly produced a linear contig, which could be linked to MFD01231.bin.1.34 through 16S rRNA gene sequence matching (rRNA prediction with Barrnap, alignment with Usearch). The reads mapping to the selected linear contig were then extracted using the SAMtools ‘view’ and ‘bam2fq’ modules with the ‘-q 20 -m 1000’ settings, and assembled again via Flye with the ‘--extra-params min_read_cov_cutoff=26’ option. The resulting circular contig (as reported by Flye) was polished with Medaka, and contig circularity was further manually confirmed by inspecting read mappings to both ends of the contig. The initial and re-assembled genomes for Oederibacterium danicum were compared with Quast to check for irregularities.

Metabolic potential of the cMAG for Oederibacterium danicum was inferred through annotation with DRAM (v.1.4.6)92 (Supplementary Dataset 4) and the MicroScope Microbial Genome Annotation and Analysis Platform (v.3.17.3)93. MicroScope classified the MAG as belonging to a Gram-negative population. Motility was inferred on the basis of extensive flagellar motility operons (for example, from K02391 in Supplementary Dataset 4). Nearly all genes involved in glycolysis (8/9) and the pentose phosphate (5/6) pathways were identified. Pyruvate oxidation to acetate was indicated by the presence of a gene for pyruvate:ferredoxin oxidoreductase (PFOR; KEGG orthology number K03737), positioned next to a group 3b NiFe hydrogenase operon94 (for example, K04656 and syntenic region) and a gene for acetyl-CoA synthetase (ACS, K01895). A bacterial microcompartment for ethanolamine use95 was identified on the basis of DRAM and MicroScope analysis (for example, K04021 and syntenic region). Anaerobic metabolism was inferred due to missing key tricarboxylic acid cycle and electron transport chain genes (for example, genes for succinate dehydrogenase and cytochrome c oxidase). In addition, genes for high-affinity cytochrome bd oxidase (K00426/K00425) were located next to a gene for superoxide dismutase (K04564), indicating the potential to combat oxidative stress96. The ribulose monophosphate cycle was complete, including a gene for the key enzyme 3-hexulose-6-phosphate synthase (HPS-PHI, K13831). Several genes of the reductive acetyl-CoA or Wood–Ljungdahl pathway were also identified, but a gene for the key enzyme acetyl-CoA synthase was not detected. Due to the phylogenetic distinctiveness of the MAG, missing genes may result from homologues not currently in databases, or a variation of known C1 processing pathways97.

Naming microbial lineages

HQ MAGs with 10 or fewer contigs that were unclassified in both GTDB R220 and SILVA 138.2 were selected for naming under the SeqCode registry38 (https://registry.seqco.de/). MAGs with 16S rRNA gene matches to undefined or placeholder taxonomic ranks in SILVA 138.2 (for example, incertae sedis) were also included in the naming. Latinized genus names were derived from the names of Danish cities or parishes located within 15 km of the sampling sites. Whenever possible, different city or parish names were used to reduce name redundancy. The Latinized names were then given suffixes indicating a microbial lineage. The suffix -bacter was assigned only to lineages whose genomes contained the shape-determining mreB gene98, while -coccus was used for lineages lacking mreB. Furthermore, the suffix -monas was proposed only for lineages with genomes containing flagellar genes, whereas -plasma was assigned to lineages lacking both mreB and genes associated with a peptidoglycan-containing cell wall (for example, genes for peptidoglycan glycosyltransferases, peptidoglycan-binding protein LysM). If the named lineages featured parent taxa with placeholder names in GTDB, the genus names were used to generate Latinized replacements for the placeholders. However, if a parent taxon with a placeholder name featured any subordinate taxa with Latinized names, the replacements were not proposed due to disagreements in taxonomic opinion. Genera that were proposed as nomenclatural types for phyla were named in honour of the leading scientists of the Flora Danica project37.

For species names, a species epithet reflecting the sample’s environmental conditions (for example, sample type or habitat) was used whenever possible. If metadata-based naming was not possible, or to reduce name redundancy, generic species epithets (for example, danicum, nordicum) were assigned. Explanations for each genus and species name are provided in Supplementary Dataset 5, and the names are scheduled for manual registration under SeqCode after publication.

Statistics and reproducibility

No statistical method was used to predetermine sample size. No data were excluded from the analyses. The sequencing experiments were not randomized. The investigators were not blinded to allocation during experiments and outcome assessment.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.