Genomic GC bias correction improves species abundance estimation from metagenomic data

Holcik, Laurenz; von Haeseler, Arndt; Pflug, Florian G.

doi:10.1038/s41467-025-65530-4

Article
Open access
Published: 26 November 2025

Nature Communications volume 16, Article number: 10523 (2025) Cite this article

Abstract

Metagenomic sequencing measures the species composition of microbial communities and has revealed the crucial role of microbiomes in the etiology of a range of diseases such as colorectal cancer. Quantitative comparisons of microbial communities are, however, affected by GC-content-dependent biases. Here, we present GuaCAMOLE, a computational method to detect and remove GC bias from metagenomic sequencing data. The algorithm relies on comparisons between individual species in a single sample to estimate the sequencing efficiency at levels of GC content, and outputs unbiased species abundances. GuaCAMOLE thus works regardless of the specific amount or direction of GC-bias present in the data and does not rely on calibration experiments or multiple samples. Applying our algorithm to 3435 gut microbiomes of colorectal cancer patients from 33 individual studies reveals that the type and severity of GC bias vary considerably between studies. In many studies, we observe a clear bias against GC-poor species in the abundances reported by existing methods. GuaCAMOLE successfully removes this bias and corrects the abundance of clinically relevant GC-poor species such as F. nucleatum (28% GC) by up to a factor of two. GuaCAMOLE thus contributes to a better quantitative understanding of microbial communities by improving the accuracy and comparability of species abundances across experimental setups.

Introduction

Metagenomic sequencing has enabled the comprehensive and quantitative analysis of taxa abundances in a wide range of microbial communities¹. It has uncovered the importance of microbiomes in health, disease, nutrition and ecology, among other areas, and has revealed a complex interplay between microbial consortia and their hosts^2,3,4.

Metagenomic sequencing relies on comprehensive high-throughput sequencing of all DNA in a sample to quantify the abundance of all present taxa. To prepare a sample for sequencing, the DNA is extracted, purified, fragmented, amplified, and finally outfitted with sequencing adapters. Numerous protocols have been established for these library preparation steps, each differing in methods and materials used^5,6,7. After sequencing, the reads are assigned to taxa, and relative read counts are used as proxies for the taxa’s abundances^8,9. This assumes that reasonably accurate genomes of all species present in the sample are available. Alternatively, unknown genomes can in principle be assembled from individual reads through a process called metagenomic assembly^10,11,12,13. Metagenomic assembly, however, has a bias profile very different from that of read assignment, so we do not consider this case further here.

While metagenomic sequencing is in principle agnostic to the specific taxa in a sample, library preparation can introduce sequence-dependent biases¹⁴. In particular, the GC content (i.e., fraction of G and C bases in the sequence) has been shown to strongly affect sequencing efficiency¹⁵. Metagenomic sequencing is particularly affected because the genomic GC content often differs significantly between species⁵. The magnitude and even direction of this bias, however, vary between different library preparation and sequencing protocols¹⁶. For example, a low GC content can either increase or decrease sequencing efficiency depending on the precise protocol used. As a result, computational correction for GC bias has been challenging¹⁷.

The species on the extreme ends of the genomic GC content range are particularly prone to biases. Amongst these species, we find pathogenic taxa such as F. nucleatum (28% GC content, associated with colorectal cancer) and M. pneumoniae (25% GC content, associated with pneumonia)^18,19,20. With common sequencing protocols, the abundance of these taxa will be underestimated^5,17, and this can affect even comparisons between samples analyzed using the same protocol²¹.

Ideally, GC bias should therefore be removed on a per-sample level. Computational methods to remove GC bias have been developed for various sequencing-based methods, and have been shown to be crucial to avoid skewed results^15,22. These methods, however, assume reads are aligned to a reference genome. For metagenomic samples possibly containing thousands of taxa, creating such an alignment is prohibitively computationally expensive. Instead, metagenomic reads are typically assigned to taxa using k-mer-based methods^8,23,24,25. This makes existing methods inapplicable and requires a novel approach to GC bias correction.

We present the GuaCAMOLE (Guanosine Cytosine Aware Metagenomic Opulence Least Squares Estimation) algorithm for the efficient detection and removal of GC bias from metagenomic samples. GuaCAMOLE is an alignment-free algorithm and instead assigns reads to taxa using Kraken2⁸. The algorithm also does not require calibration data or any a priori assumptions about the quantitative relationship between GC content and bias (such as extremely GC-rich and GC-poor species being more prone to bias), and thus works equally well for all sequencing protocols.

Using both simulations and experimental data¹⁶, we show that GuaCAMOLE uncovers protocol-specific GC bias and improves abundance estimates over existing methods. To show that GC bias correction can be relevant in a clinical setting, we apply GuaCAMOLE to a large number of metagenomic stool samples of colorectal cancer patients^26,27,28. Here, we observe that the type and severity of GC bias vary strongly between studies, and that accounting for GC bias significantly increases the estimated abundances of clinically relevant taxa on both ends of the GC spectrum.

Results

The GuaCAMOLE algorithm processes the raw sequencing reads of a metagenomic sample and outputs bias-corrected abundances for all detected taxa. GuaCAMOLE also infers and outputs GC-dependent sequencing efficiencies, which reflect the probability (relative to the maximum) that a DNA fragment with a certain GC content successfully undergoes all library preparation steps and sequencing. These GC-dependent sequencing efficiencies thus measure the extent of the GC bias present in the raw data. Briefly, GuaCAMOLE works as follows (Fig. 1, see Methods for details): Reads are first assigned to individual taxa using Kraken2⁸, and within each taxon to discrete bins representing the read’s GC content (these bins are subsequently referred to as taxon-GC bins). Reads which cannot be assigned to a specific taxon unambiguously by Kraken2 are redistributed probabilistically to the likeliest taxon using the Bracken algorithm²⁹. Read counts in each taxon-GC-bin are then normalized based on expected read counts computed from the genome lengths and genomic GC content distributions of individual taxa. The resulting quotients only depend on the unknown abundances (one for each taxon) and unknown GC-dependent sequencing efficiency (one per GC-bin). From these quotients, we then compute bias-corrected abundance estimates and the GC-dependent sequencing efficiencies. GuaCAMOLE reports the estimated abundances either as sequence abundances proportional to the total amount of DNA present, or taxonomic abundances proportional to the number of genomes³⁰.

GuaCAMOLE improves accuracy on simulated communities

We first demonstrate that GuaCAMOLE infers the correct abundances and GC-dependent sequencing efficiencies independent of the specific type of GC bias present. We ran GuaCAMOLE on data simulated using three different models of GC bias (see Methods for details): peak efficiency at 50% GC, efficiency increasing with GC content, and efficiency decreasing with GC content (Fig. 2A). For all three simulated datasets, GuaCAMOLE produced virtually unbiased estimates (mean relative error less than 1%) and correctly recovered the GC-dependent sequencing efficiencies used for the simulation. The Bracken estimates showed considerable GC bias in comparison (relative errors 10% to 30% depending on the bias model).

**Fig. 2: Performance on simulated metagenomic data.**

We next confirmed that GuaCAMOLE performs well for metagenomic communities with different complexities and species compositions (Fig. 2B). We simulated sequencing libraries representing communities comprising different numbers of taxa (5, 10, 50, 100, or 400 taxa) with log-normally distributed abundances. Taxa were chosen from the RefSeq database, and selected to either have predominantly low or high genomic GC content (extreme), predominantly GC content around 50% (medium), or a GC content uniformly distributed over the whole range (uniform). For each combination of community complexity and distribution of GC content, we simulated 5 libraries with a GC-dependent sequencing efficiency of 1 − 10 ⋅ (g−0.5)² so that the efficiency at g = 30% and g = 70% GC content was 60%. We then compared the relative estimation errors of GuaCAMOLE with those of the other RefSeq-based algorithms, Bracken and MetaPhlAn4. For libraries comprising 50 taxa or more, we find that GuaCAMOLE consistently shows the lowest mean estimation error (Fig. 2B). As expected, the advantage over other algorithms is the largest for communities that predominantly contain taxa with extreme GC content (extreme). When species mostly have a GC content around 50% (medium), Bracken and GuaCAMOLE perform similarly. For small communities comprising mostly taxa with extremely high or low GC content, the accuracy of GuaCAMOLE is severely reduced. For 8 out of 75 simulated communities (4 comprising 5 taxa, 3 comprising 10 taxa, one comprising 50 taxa), GuaCAMOLE was unable to produce reliable estimates and exited with a warning. The likely cause is that for these communities, the GC content distributions of individual taxa did not overlap sufficiently. Since this occurs mainly when all taxa have extreme GC content, we expect this case to be rare in practice.

Improved accuracy across a range of experimental protocols

Having tested GuaCAMOLE on simulated data, we went on to show that it improves abundance estimates for experimental data produced using different protocols, and that GuaCAMOLE can uncover the GC-dependent sequencing efficiencies of these protocols (Fig. 3). We re-analyzed data published by Tourlousse et al.¹⁶ of a mock community sequenced using 28 different protocols (Table 1) with GuaCAMOLE, Bracken²⁹, MetaPhlAn4⁹, SingleM³¹, Sylph³² and mOTUs³³. The mock community comprises 19 bacterial species representative of human-associated microbiota and was sequenced using 11 different commercially available library preparation kits (labeled A-K below, see Table 1). For each kit, Tourlousse et al. tested up to three PCR amplification regimes: 500 ng input DNA with no PCR amplification (suffix 0), 50 ng input DNA with 4-8 PCR cycles (suffix L), and 1 ng input DNA with 8-15 PCR cycles (suffix H).

**Fig. 3: Performance of GuaCAMOLE for experimental mock community data.**

Table 1 DNA Library Preparation Kits tested by Tourlousse et al.¹⁶

We find that the GC-dependent sequencing efficiencies estimated by GuaCAMOLE differ strongly between different protocols (Fig. 3A). Some protocols show uniform efficiencies, while others show a strong dependence on the GC content. In accordance with the results of Tourlousse et al.¹⁶, we see that the protocols DH, FH, GH, IL, and IH (see Table 1) show the strongest dependency on GC content (Fig. 3A, colored lines). The nature of this dependence differs qualitatively between protocols. While protocols IL and IH show decreasing sequencing efficiency with increasing GC content, DH, FH, and GH show an increase in efficiency for higher GC content.

For the protocols most strongly affected by GC content (DH, FH, GH, IH, IL), GuaCAMOLE reduces the mean relative abundance error drastically compared to the other algorithms (Fig. 3B, colored dots). For other protocols, GuaCAMOLE and Sylph overall show the smallest error, with a considerably larger variation of errors for Sylph than for GuaCAMOLE (Fig. 3B). Looking at individual protocols, GuaCAMOLE and Sylph show the smallest estimation errors for all protocols except protocol JH (Fig. 3C). A common source of GC bias is PCR amplification³⁴, and accordingly the advantage of GuaCAMOLE over the other algorithms increases with the number of PCR cycles (Fig. 3D). However, GuaCAMOLE also offers a clear advantage over the other algorithms for PCR-free protocols A0 and C0 (Fig. 3C).

For GuaCAMOLE, the quantification error per bacterial species shows only a weak residual dependence on genomic GC content (Fig. 3E). In comparison, the quantification error of the other tested algorithms increases significantly for taxa at the extreme ends of the GC content range. The runtime of GuaCAMOLE (including the runtime of Kraken2 itself) is slightly longer than that of most other algorithms (Table S1), but with a difference of at most 2x, GuaCAMOLE remains a practical choice.

The choice of protocol also strongly affects GC-dependent sequencing efficiencies in other datasets. For a human gut mock community³⁵ comprising 18 taxa sequenced on two different sequencing platforms, GuaCAMOLE reveals that the sequencing platform (Illumina HiSeq 2500 rapid and NovaSeq 6000 SP) has a significant influence (Supplementary Fig. S1). Single-species libraries containing only Fusobacterium sp. C1¹⁷ yield similar results (Supplementary Fig. S2).

GC-dependent sequencing efficiency differs widely between studies

To test how much GC-dependent sequencing efficiencies affect real-world studies, we ran GuaCAMOLE on 3435 samples from 33 studies of human gut microbiomes of healthy patients and patients with colorectal cancer (CRC), selected according to sample quality^27,28. 3031 of those samples were usable and reported reliable sequencing efficiencies (less than 50% of reads assigned to taxa classified as false positives). By clustering these 33 studies according to the first three principal components of their average GC-dependent sequencing efficiencies (Figs. S3 and S4), we were able to identify 4 qualitatively different shapes of GC-dependent efficiency curves (Fig. 4A). With an efficiency between 50% and 100% throughout the whole GC spectrum, clusters I (10 studies) and II (16 studies) show a noticeable but limited effect of GC content on sequencing efficiencies. In comparison, the studies in clusters III (3 studies) and IV (4 studies) exhibit a much stronger effect. Whereas in cluster III the sequencing efficiency is reduced only for GC-poor species, both GC-poor and GC-rich species are affected in cluster IV. Overall, we observe that GC-dependent sequencing efficiencies can vary drastically between otherwise similarly designed studies.

**Fig. 4: Performance of GuaCAMOLE for human gut microbiomes.**

Correct abundance estimation of GC-poor and GC-rich species

The distribution of average genomic GC content across detected species agrees between clusters I, II, and IV, but is noticeably shifted towards higher GC content for cluster III (Fig. 4B). This indicates that the loss of sequencing efficiency for GC-poor species for studies in cluster III is severe enough to cause species to be overlooked. For GC-rich species, the effect is reversed. Here, cluster III exhibits the highest sequencing efficiency, and as a result, more species with high GC content are detected in cluster III than in other clusters.

The abundances of species with differing genomic GC content show a similar trend (Fig. 4C, D). For GC-poor species in cluster III, the uncorrected estimates underestimate the abundance up to 10-fold. In clusters I and IV, both GC-rich and GC-poor species are underestimated up to 3-fold.

The taxa particularly strongly affected by GC-bias are Endlipuvirus intestinihominis (NCBI:txid2955861), Kahucivirus intestinalis (NCBI:txid2956048), Fusobacterium nucleatum (NCBI:txid851), and Mycolicibacter (NCBI:txid1073531). F. nucleatum in particular has been associated with colorectal cancer^18,19,20, and is consistently underestimated by Bracken due to GC-bias; on average 1.9-fold in cluster I (96 samples), 1.2-fold in cluster II (81 samples), 10-fold in cluster III (2 samples) and 1.4-fold in cluster IV (1 sample).

GC-bias affects not only the abundance estimates of individual species, but also summary statistics about the composition of microbial communities. In particular, we observe that for most samples the alpha diversity computed from uncorrected samples is noticeably lower than after GC-bias correction (Fig. S5).

False positive removal

False positive taxa are created by reads that are wrongfully assigned by Kraken to taxa not present in the sample. Filtering detected taxa based on read counts is a common way to remove some of these false positives, but this is not always effective³⁶. GuaCAMOLE additionally filters taxa based on how well their observed reads match their reference genomes by comparing observed and expected GC distributions for each taxon (see Methods). Briefly, if observed and expected read counts across a taxon’s GC bins vary more than a pre-defined threshold, the taxon is removed as an outlier, all abundances are recomputed, and another round of outlier detection is initiated (see Fig. S6 for an example). This process is repeated a user-defined number of times, by default 5. This default was set based on the mock community data of Tourlousse et al., where it reduces the number of false-positive taxa from 18 ± 9 to 8 ± 6 and only removes a true positive in two cases (protocols JH and B0). By adjusting these thresholds, users can trade sensitivity (i.e., detecting as many taxa as possible) against specificity (avoiding false positives) and accuracy of efficiency estimates (preventing false-positive taxa from interfering with the sequencing efficiency estimates).

For the simulated mock communities (Fig. 2B), GuaCAMOLE’s outlier removal reduces the number of false-positive taxa from 342 ± 218 to 120 ± 86 (Fig. S7). Altogether, these taxa typically represent only a few percent of all reads (Fig. S8). The more stringent filtering of GuaCAMOLE increases the number of false negatives (taxa which are present but not detected) from 6 ± 10 for Bracken to 13 ± 17 for GuaCAMOLE. MetaPhlAn4 is generally much more specific and less sensitive; it finds 35 ± 54 false-positive taxa but misses 54 ± 68 taxa actually present.

For users desiring maximal sensitivity and accuracy of abundance estimates without the risk of interference by false-positive taxa, GuaCAMOLE offers a mode that computes GC-corrected abundance estimates for all taxa detected by Bracken. In this mode, outlier removal affects only the set of taxa used to estimate GC-dependent efficiencies, and the estimated efficiencies are then used to correct the Bracken abundance estimates for GC-bias (thus effectively borrowing information between taxa with similar GC content). This mode was used for the analysis of the CRC samples presented in Fig. 4.

Discussion

GuaCAMOLE infers both bias-corrected abundances and GC-dependent sequencing efficiencies from a single sample without prior information about the amount or direction of GC-bias present in the data. The algorithm is agnostic about the specific sequencing protocol used and can correctly detect and correct for GC-bias without calibration data or prior knowledge about the expected type of bias. For most sequencing protocols, the bias-corrected abundances reported by GuaCAMOLE are more accurate than those reported by both Bracken and MetaPhlAn4. The advantage provided by GuaCAMOLE increases with the amount of GC bias present, and thus in particular with the amount of PCR amplification done prior to sequencing. Interestingly, we do not observe an advantage of MetaPhlAn4 over Bracken, even though we might expect the marker gene-based approach of MetaPhlAn4 to be less susceptible to bias. In fact, Bracken and MetaPhlAn4 often show a relatively similar quantification error. This further corroborates that the improvement offered by GuaCAMOLE does indeed stem from successful correction of GC bias and not from other algorithmic differences.

In addition to bias-corrected abundances, GuaCAMOLE reports accurate GC-dependent sequencing efficiencies. This is useful as a quality control to check that library preparation and sequencing perform as expected. It also provides a way to estimate the amount of bias that affects taxa that remained unobserved. Finally, it allows different library preparation and sequencing protocols to be compared without the need for mock communities with known abundances.

GC bias can affect the abundance estimates of clinically relevant pathogens such as F. nucleatum, which has been associated with a range of diseases^18,19,20. We observe this in published microbiomes of colorectal cancer patients, where, depending on the study, the abundance of F. nucleatum is often underestimated 2-fold, and can be underestimated up to 10-fold before GC bias correction. Generally, we observe that the under-estimation of GC-poor and GC-rich species can differ widely between different studies of human gut microbiomes. One cause for GC-bias is likely the commonly used Nextera XT library preparation kit^5,17. However, the four qualitatively different shapes of GC-bias we observe in real-world human gut microbiome data suggest that many other common library preparation techniques introduce biases as well.

Even uniform biases can skew comparisons between different samples under some circumstances^21,37. The large qualitative difference between the GC-biases affecting different real-world human gut microbiome studies thus poses a risk for meta-analyses such as refs.^27,28. While the quantitative effect of such differences will depend heavily on the design of the meta-analysis, GuaCAMOLE can help to mitigate these risks by uncovering the GC-bias affecting different studies and by offering a way to correct it.

GuaCAMOLE detects false-positive taxa by checking for outliers within the deviations of observed from expected read counts. This offers more power to detect false-positive taxa than read-count thresholding and ensures that such outliers do not skew the estimated sequencing efficiencies and abundances. However, this false-positive detection assumes reasonably accurate reference genomes. Therefore, taxa that are present but whose reference genomes are inaccurate are at risk of being flagged as false positives and subsequently removed. If this is a concern, the GC-dependent efficiencies reported by GuaCAMOLE can be used to correct the bias present in the abundances estimated by other tools such as Bracken or MetaPhlAn4. For Bracken, GuaCAMOLE already implements this mode of operation as an option.

GuaCAMOLE relies on the overlap between the GC distributions of taxa to estimate sequencing efficiencies. For small communities, particularly if most taxa have an extreme genomic GC content, insufficient overlap can reduce the accuracy of GuaCAMOLE. In such cases, the fraction of reads assigned to taxa flagged as false-positives is typically also high. As a safeguard, GuaCAMOLE thus warns the user and refuses to report estimates if that fraction exceeds 50%.

The runtime of GuaCAMOLE is comparable to that of most other tools, if slightly longer. The longer runtime is to a large degree caused by the need to re-read all sequencing reads to compute their GC content. While GuaCAMOLE is even now fast enough to be practical, a future optimization could be to modify Kraken2 to compute each read’s GC content during taxonomic mapping. Doing so would remove the need for GuaCAMOLE to access the sequencing data and would improve performance considerably.

GuaCAMOLE provides a computational method to detect and correct for GC bias in sequencing protocols. For a wide range of sequencing protocols, GuaCAMOLE substantially improves abundance estimates over alternative methods. Taxa whose abundance estimates are improved include clinically relevant species. GuaCAMOLE is in principle applicable to all types of metagenomic samples, but performs best when reasonably complete reference genomes are available for all species. To facilitate its use by the community and its integration into standard pipelines, GuaCAMOLE is available as an easy-to-use and fast Python package under https://github.com/Cibiv/GuaCAMOLE.

Methods

The GuaCAMOLE algorithm

GuaCAMOLE operates on a pre-defined taxonomy comprising nodes K₁, …, K_n, which are arranged in a tree. We often refer to these nodes simply as taxa. Leaf nodes represent individual species (or strains), whereas internal nodes represent higher taxonomic groups such as genera, families etc. Leaf nodes K_j always have an associated genome G_j; for internal nodes, an associated genome is optional. The taxonomy, together with all associated genomes, is referred to as a database.

GuaCAMOLE estimates the abundances of these taxa from a metagenomic sequencing library containing a number N of sequencing reads. Each read is assumed to stem from one of the taxa in the taxonomy. The GC content of a read is the fraction of bases that are either guanine (G) or cytosine (C). We assign reads into one of b equally spaced bins according to their GC content, and denote the GC content by the index g of the respective bin.
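The binning step described above can be sketched as follows. This is a minimal illustration and not code from the GuaCAMOLE package: the function name `gc_bin`, the default of 100 bins, the 0-based bin index, and the choice to count ambiguous bases such as N as non-GC are all assumptions made for this example.

```python
def gc_bin(read: str, b: int = 100) -> int:
    """Return the 0-based index of the GC bin (out of b equally spaced bins)
    that a read falls into. Hypothetical helper, for illustration only."""
    read = read.upper()
    if not read:
        raise ValueError("cannot compute the GC content of an empty read")
    gc = read.count("G") + read.count("C")
    # Integer arithmetic avoids floating-point rounding at bin boundaries;
    # a GC content of exactly 100% is placed in the last bin.
    return min(gc * b // len(read), b - 1)
```

With `b = 10`, for instance, a read with 50% GC content is assigned to bin 5, and a read consisting only of G and C falls into the last bin, 9.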

GuaCAMOLE assumes that the composition of the sequencing library depends on (i) the abundances a₁, …, a_n of all taxa (zero for all taxa not present in the sample), (ii) the GC-dependent sequencing efficiency η_g, (iii) the genomic GC content distributions f(j, g) of the taxa (defined as the expected fraction of reads from taxon j which fall into GC bin g, normalized such that ∑_g f(j, g) = 1; see the section “The genomic GC content distributions” below), and (iv) the lengths l₁, …, l_n of the taxa’s genomes. In terms of these quantities, GuaCAMOLE assumes that the number O(j, g) of fragments stemming from taxon j with GC content g in the library is

$$O(j,g)=N\cdot \frac{{a}_{j}\cdot {\eta }_{g}\cdot {l}_{j}\cdot f(j,g)}{{\sum }_{g=1}^{b}\mathop{\sum }_{i=1}^{n}{a}_{i}\cdot {\eta }_{g}\cdot {l}_{i}\cdot f(i,g)}.$$

(1)

Note that Eq. (1) defines the abundances a_i and efficiencies η_g only up to a constant factor; we therefore normalize these quantities by demanding that ${\sum }_{j}{a}_{j}=\mathop{\max }_{g}{\eta }_{g}=1$.

GuaCAMOLE estimates abundances and GC-dependent efficiencies by plugging observed fragment counts O(j, g), GC distributions f(j, g), and genome lengths l_j into Eq. (1) and solving for a₁, …, a_n and η₁, …, η_b. Note that this system is typically over-determined: it contains on the order of nb equations for n + b unknowns.
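To make the model of Eq. (1) concrete, the following sketch evaluates its right-hand side with NumPy. The function name `expected_counts` and the array layout (taxa along rows, GC bins along columns) are assumptions chosen for this illustration and are not taken from the GuaCAMOLE source code.

```python
import numpy as np

def expected_counts(N, a, eta, l, f):
    """Evaluate Eq. (1): expected fragment counts O(j, g).

    N   : total number of reads
    a   : abundances, shape (n,)
    eta : GC-dependent efficiencies, shape (b,)
    l   : genome lengths, shape (n,)
    f   : genomic GC distributions, shape (n, b), rows sum to 1
    """
    a, eta, l, f = (np.asarray(x, dtype=float) for x in (a, eta, l, f))
    # Numerator of Eq. (1) for every taxon j and GC bin g.
    weights = a[:, None] * eta[None, :] * l[:, None] * f
    # The denominator is the sum of the numerator over all taxa and bins.
    return N * weights / weights.sum()
```

Because the denominator rescales the counts to sum to N, multiplying all abundances (or all efficiencies) by a constant leaves the result unchanged, which is why these quantities are only defined up to a factor and require the normalization stated above.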

The number of reads per taxon and GC bin

To compute observed read counts O(j, g), reads are first assigned to taxonomic nodes using Kraken2⁸. This yields the number of assigned reads (read pairs for paired-end libraries) M(j) for every node j in the taxonomy. The reads assigned to each node are then further subdivided according to their GC content to obtain M(j, g), the number of reads assigned to taxon j with GC content g.

The counts M(j, g), however, are biased by similarities between genomes of different taxa. When Kraken2 is unable to unambiguously assign a read to a taxon due to such similarities, Kraken2 assigns those reads to the lowest common ancestor (LCA) of all matching taxa. To correct for this systematic bias, we use the same approach as the Bracken algorithm introduced by Lu et al.²⁹. We recall that G_j denotes the genome associated with node K_j in the taxonomy, which always exists for leaf nodes but is optional for internal nodes. Like Bracken, we compute the conditional probabilities P(r ∈ G_j ∣ K_i) that a read which was assigned to taxon K_i actually stems from a descendant K_j of K_i with associated genome G_j in the taxonomic tree (see “The GuaCAMOLE algorithm”). This is done by first computing P(K_i ∣ r ∈ G_j), the probability that a read stemming from genome G_j is assigned to taxon K_i, by finding the taxon assigned to every possible read from genome G_j (of the same length as in the data). The desired probabilities P(r ∈ G_j ∣ K_i) are then found by applying Bayes’ theorem; see ref. ²⁹ for details. The original Bracken algorithm uses P(r ∈ G_j ∣ K_i) to redistribute reads assigned to K_i. GuaCAMOLE follows the same approach, but additionally keeps track of the GC content when redistributing reads. To estimate the number of reads in GC bin g that stem from taxon j, we thus compute

$$\tilde{O}(j,g)=\mathop{\sum}_{i}M(i,g)\cdot P(r\in {G}_{j}| {K}_{i})$$

(2)

where the sum runs over all ancestors K_i of K_j in the taxonomic tree (if K_i is not an ancestor of K_j, P(r ∈ G_j ∣ K_i) = 0). We emphasize that the total number of reads is invariant under the redistribution done by Eq. (2); this is ensured by the fact that ∑_jP(r ∈ G_j ∣ K_i) = 1. In particular, reads redistributed from K_i to a descendant K_j are removed from K_i. We also note that we have assumed here for simplicity that P(r ∈ G_j ∣ K_i) does not depend on the read’s GC content.
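The redistribution of Eq. (2) amounts to a single matrix product once the conditional probabilities are arranged in a matrix. The sketch below uses a hypothetical three-node taxonomy (one genus with two species) and made-up numbers; the name `redistribute` and the convention that species keep their own reads with probability 1 are assumptions of this example.

```python
import numpy as np

def redistribute(M, P):
    """Eq. (2): O_tilde[j, g] = sum_i M[i, g] * P[i, j].

    M[i, g] : reads assigned by the classifier to node i in GC bin g
    P[i, j] : P(r in G_j | K_i); every row sums to 1
    """
    M = np.asarray(M, dtype=float)
    P = np.asarray(P, dtype=float)
    return P.T @ M

# Toy taxonomy: node 0 = genus (no genome), nodes 1 and 2 = its two species.
P = np.array([[0.0, 0.75, 0.25],   # genus-level reads are split 3:1
              [0.0, 1.00, 0.00],   # species-level reads stay where they are
              [0.0, 0.00, 1.00]])
M = np.array([[40.0, 20.0],
              [100.0, 50.0],
              [30.0, 70.0]])
O_tilde = redistribute(M, P)
```

In this toy case the genus ends up with no reads, and the number of reads in each GC bin is unchanged, which illustrates the invariance noted above.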

To make it possible to compare abundances of higher taxonomic levels, such as genera, we then sum the corrected read counts over descendants. The per-taxon and per-GC-bin read counts plugged into Eq. (1) are thus

$$O(j,g)=\tilde{O}(j,g)+{\sum}_{i\in {D}_{j}}\tilde{O}(i,g)$$

(3)

where D_j denotes the descendants of node K_j.
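The summation over descendants in Eq. (3) can be sketched as a recursive traversal of the taxonomy. The representation of the tree as a dictionary of child lists and the function name `aggregate_over_descendants` are assumptions made for this illustration.

```python
import numpy as np

def aggregate_over_descendants(O_tilde, children, root):
    """Eq. (3): per-bin counts of each node plus those of all its descendants.

    O_tilde  : dict mapping node -> sequence of per-GC-bin counts
    children : dict mapping node -> list of direct child nodes
    root     : node at which the traversal starts
    """
    O = {}

    def visit(node):
        total = np.array(O_tilde[node], dtype=float)
        for child in children.get(node, []):
            total = total + visit(child)
        O[node] = total
        return total

    visit(root)
    return O

# Hypothetical example: one genus with two species and two GC bins.
O_tilde = {"genus": [0, 0], "species_A": [130, 65], "species_B": [40, 75]}
children = {"genus": ["species_A", "species_B"]}
O = aggregate_over_descendants(O_tilde, children, "genus")
```

Leaf nodes keep their own counts, whereas the genus receives the sum of both species in every GC bin.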

The genomic GC content distributions

For Eq. (1) to hold, the observed counts O(j, g) must, in theory, arise by sampling from the genomic GC content distributions f(j, g). These distributions must thus take the redistribution of reads into account. To find f(j, g), we first compute the individual GC distributions q(j, g) of the genomes in the taxonomy. Given the fragment length ℓ_f and read length ℓ_r of the experimental data, q(j, g) reflects the fraction of windows of length ℓ_f whose GC content within the parts covered by reads (i.e. the first and last ℓ_r bases for paired-end reads) is g. Here, we use the correct experimental fragment- and read length to avoid systematic errors. We then find the expected GC content distribution of the reads assigned to a specific taxon

$$h(j,g)={\sum}_{i\in {D}_{j}}q(i,g)\cdot P(r\in {G}_{i}| {K}_{j})$$

(4)

where the sum runs over the descendants of K_j (otherwise, P(r ∈ G_i ∣ K_j) = 0). In Eq. (4), we have thus propagated the GC distributions of individual genomes upwards in the taxonomic tree to account for the fact that Kraken2 will assign some reads to taxa at higher taxonomic levels. We now propagate these mixed GC distributions of internal nodes back downwards to find the expected GC distribution after fragment redistribution. Mimicking Eq. (2) we thus compute

$$\tilde{f}(j,g)=\frac{{\sum }_{i}h(i,g)\cdot M(i)\cdot P(r\in {G}_{j}| {K}_{i})}{{\sum }_{i}M(i)\cdot P(r\in {G}_{j}| {K}_{i})}$$

(5)

where the sums run over the ancestors of taxon K_j. Finally, we proceed similarly to Eq. (3) and average the GC distributions of all descendants, weighted by their fragment counts,

$$f(j,g)=\frac{\tilde{f}(j,g)\cdot \tilde{O}(j,g)+{\sum }_{i\in {D}_{j}} \; \, \tilde{f}(i,g)\cdot \tilde{O}(i,g)}{{\sum }_{\gamma=1}^{b}\left(\tilde{f}(j,\gamma )\cdot \tilde{O}(j,\gamma )+{\sum }_{i\in {D}_{j}} \; \; \tilde{f}(i,\gamma )\cdot \tilde{O}(i,\gamma )\right)}.$$

(6)

Since the tails of these distributions are typically noisy, we restrict these distributions to the range between the 2.5% and 97.5% quantiles for every taxon j.
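As a rough illustration of two steps in this subsection, the following sketch computes a genome-level distribution q(j, g) from a sequence and restricts a distribution to its central quantile range. The function names, the equal-width binning, the handling of overlapping read pairs, and the use of None for undefined bins are assumptions made for this example and are not taken from the GuaCAMOLE implementation.

```python
from typing import List, Optional


def window_gc_distribution(genome: str, frag_len: int, read_len: int,
                           n_bins: int) -> List[float]:
    """Fraction of windows of length `frag_len` per GC bin, where GC content is
    measured only on the bases covered by the two reads of a pair."""
    if not 0 < frag_len <= len(genome) or read_len <= 0:
        raise ValueError("need 0 < frag_len <= genome length and read_len > 0")
    # Prefix sums give the GC count of any interval in constant time.
    prefix = [0]
    for base in genome:
        prefix.append(prefix[-1] + (1 if base in "GCgc" else 0))
    covered = min(read_len, frag_len)
    n_windows = len(genome) - frag_len + 1
    counts = [0] * n_bins
    for start in range(n_windows):
        end = start + frag_len
        if 2 * covered >= frag_len:
            # Assumption: overlapping reads cover the whole fragment.
            gc, length = prefix[end] - prefix[start], frag_len
        else:
            gc = (prefix[start + covered] - prefix[start]
                  + prefix[end] - prefix[end - covered])
            length = 2 * covered
        # A GC fraction of exactly 1.0 is placed in the last bin.
        counts[min(int(gc / length * n_bins), n_bins - 1)] += 1
    return [c / n_windows for c in counts]


def restrict_to_quantiles(dist: List[float], lower: float = 0.025,
                          upper: float = 0.975) -> List[Optional[float]]:
    """Mark bins lying entirely outside the [lower, upper] quantile range as undefined."""
    restricted: List[Optional[float]] = []
    cumulative = 0.0
    for mass in dist:
        before, cumulative = cumulative, cumulative + mass
        inside = cumulative > lower and before < upper
        restricted.append(mass if inside else None)
    return restricted


q = window_gc_distribution("GGGGAAAA", frag_len=4, read_len=1, n_bins=2)
trimmed = restrict_to_quantiles([0.01, 0.04, 0.90, 0.04, 0.01])
```

Bins marked as undefined here correspond to the bins that a taxon's GC distribution "does not overlap" when terms are later dropped from Eq. (8).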

Genome lengths

We assign a single genome length ℓ_j to every taxon j, independent of its taxonomic level or number of associated genomes. To do so, we average over the lengths of all assigned genomes of a taxon and all of its descendants. To account for the observed read distribution, we weight each genome length with the prior probability P(K_j) that a random read stems from taxon j as computed by Bracken²⁹.
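The weighted average described above might be computed as in the following sketch. Pairing each genome with the prior probability of its taxon is our reading of the text, and the numbers are invented for illustration.

```python
from typing import Sequence


def weighted_genome_length(lengths: Sequence[float],
                           priors: Sequence[float]) -> float:
    """Average genome length of a taxon and its descendants, weighting each
    genome by the prior probability that a random read stems from its taxon."""
    total = sum(priors)
    if total <= 0:
        raise ValueError("at least one prior probability must be positive")
    return sum(length * p for length, p in zip(lengths, priors)) / total


# Hypothetical example: two genomes of 2 Mbp and 4 Mbp with priors 0.75 and 0.25.
length_j = weighted_genome_length([2.0e6, 4.0e6], [0.75, 0.25])
```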

Estimating abundances

To estimate abundances a₁, …, a_n, we rearrange Eq. (1) into the following expression for the GC-dependent efficiencies η_g in bin g,

$${\eta }_{g}=\frac{C}{N}\cdot \underbrace{\frac{O(j,g)}{{l}_{j}f(j,g)}}_{{{\rm{Obs}}}/{{\rm{Exp}}}\,{{\rm{reads}}}}\cdot \frac{1}{{a}_{j}}.$$

(7)

where $C=\mathop{\sum }_{g=1}^{b}\mathop{\sum }_{i=1}^{h}{a}_{i}\cdot {\eta }_{g}\cdot {l}_{i}\cdot f(i,g)$ is a normalization factor. Note the correspondence to Fig. 1: for a fixed taxon j, η_g is proportional to the obs/exp ratio O(j, g)/(l_j f(j, g)). After scaling with the inverse abundances ${a}_{j}^{-1}$, these ratios become comparable across taxa.
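To make the role of the inverse abundances concrete, the sketch below simulates noise-free counts for two hypothetical taxa and checks that their scaled obs/exp ratios coincide in every GC bin. The count model O(j, g) = (N/C) · a_j · η_g · l_j · f(j, g) is obtained by rearranging Eq. (7); all numerical values, including the stand-in for N/C, are invented.

```python
from typing import Dict, List

eta = [0.5, 1.0, 0.8]   # hypothetical GC-dependent efficiencies
scale = 1e-3            # stands in for the factor N/C

# Hypothetical taxa: abundance a, genome length l, GC distribution f.
taxa = {
    "A": {"a": 0.25, "l": 2.0e6, "f": [0.6, 0.3, 0.1]},
    "B": {"a": 0.75, "l": 4.0e6, "f": [0.1, 0.3, 0.6]},
}


def simulate_counts(a: float, l: float, f: List[float]) -> List[float]:
    """Noise-free counts implied by rearranging Eq. (7)."""
    return [scale * a * e * l * fg for e, fg in zip(eta, f)]


def scaled_obs_exp(obs: List[float], l: float, f: List[float],
                   a: float) -> List[float]:
    """Obs/exp ratio O(j, g) / (l_j f(j, g)) multiplied by the inverse abundance."""
    return [o / (l * fg) / a for o, fg in zip(obs, f)]


ratios: Dict[str, List[float]] = {}
for name, t in taxa.items():
    counts = simulate_counts(t["a"], t["l"], t["f"])
    ratios[name] = scaled_obs_exp(counts, t["l"], t["f"], t["a"])
```

With the true abundances, both taxa yield the same curve, proportional to η_g. With wrong abundances the curves would be shifted relative to each other, which is the disagreement that the estimation procedure minimizes.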

Given abundances a₁, …, a_n, Eq. (7) provides a separate estimate of η_g for every taxon whose genomic GC distribution overlaps g. This allows us to estimate the abundances by maximizing the agreement between these separate estimates of η_g. In terms of the inverse abundances ${a}_{1}^{-1},\ldots,{a}_{n}^{-1}$, this can be expressed as the minimization of the quadratic form

$$G({a}_{1}^{-1},\ldots,{a}_{n}^{-1})={\sum }_{i=1}^{n}{\sum }_{\substack{j=1\\ j\ne i}}^{n}{\sum }_{g=1}^{b}{\left(\frac{O(i,g)}{{l}_{i}f(i,g)}\cdot {a}_{i}^{-1}-\frac{O(j,g)}{{l}_{j}f(j,g)}\cdot {a}_{j}^{-1}\right)}^{2}.$$

(8)

Here, we have dropped the pre-factor C/N from the GC-dependent efficiencies η_g. In practice, we drop all terms from the sum in Eq. (8) that are either undefined or unreliable. A term is undefined if one of the two GC distributions does not overlap bin g (i.e., f(i, g) or f(j, g) is undefined). A term is considered unreliable if the total number of reads assigned to one of the two taxa, including its descendants (i.e. $M(j)+{\sum }_{k\in {D}_{j}}M(k)$ for taxon j, and similarly for i), lies below a user-defined threshold (minimum read threshold, default 500).
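The evaluation of Eq. (8) with undefined and unreliable terms dropped could look like the following sketch. Representing undefined obs/exp ratios by None and passing per-taxon read totals as a list are choices made for this example; the numbers are invented.

```python
from typing import List, Optional, Sequence


def quadratic_form(ratios: Sequence[Sequence[Optional[float]]],
                   inv_abund: Sequence[float],
                   total_reads: Sequence[int],
                   min_reads: int = 500) -> float:
    """Evaluate G from Eq. (8), skipping undefined and unreliable terms.

    ratios[i][g] holds O(i, g) / (l_i f(i, g)), or None where f(i, g) is undefined.
    """
    n = len(ratios)
    reliable = [reads >= min_reads for reads in total_reads]
    value = 0.0
    for i in range(n):
        for j in range(n):
            if i == j or not (reliable[i] and reliable[j]):
                continue
            for r_i, r_j in zip(ratios[i], ratios[j]):
                if r_i is None or r_j is None:
                    continue  # one GC distribution does not overlap this bin
                value += (r_i * inv_abund[i] - r_j * inv_abund[j]) ** 2
    return value


# Hypothetical data: the third taxon has too few reads and is ignored.
ratios: List[List[Optional[float]]] = [
    [0.1, 0.2, None],
    [0.2, 0.4, 0.4],
    [5.0, 5.0, 5.0],
]
total_reads = [10000, 8000, 120]
g_consistent = quadratic_form(ratios, [2.0, 1.0, 1.0], total_reads)
g_inconsistent = quadratic_form(ratios, [1.0, 1.0, 1.0], total_reads)
```

In this toy case the first set of inverse abundances makes the two reliable taxa agree in both shared bins, so G vanishes, whereas equal inverse abundances leave a positive residual.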

Regularization

If the taxa in a sample can be partitioned into two sets A and B such that Eq. (8) contains no term containing an abundance from A and an abundance from B, the relative abundances between sets A and B are undefined. In Fig. 1, this would be represented as two groups of taxa whose GC efficiency curves do not mutually overlap. In this situation, Eq. (8), as stated, does not have a unique minimum. To avoid this, we regularize the quadratic form G by penalizing large differences in efficiency between neighboring GC bins. Using Eq. (7), we express η_g, without the prefactor C/N, as a weighted average λ_g of taxon-specific efficiencies,

$${\lambda }_{g}=\frac{1}{{n}_{g}}{\sum }_{i=1}^{n}\log \left(O(i,g)+1\right)\cdot \frac{O(i,g)}{{l}_{i}f(i,g)}\cdot {a}_{i}^{-1}.$$

(9)

We now define the regularized objective function

$$\tilde{G}({a}_{1}^{-1},\cdots \,,{a}_{n}^{-1})=\frac{1-r}{{n}^{2}}G({a}_{1}^{-1},\cdots \,,{a}_{n}^{-1})+r{\sum }_{k=1}^{b}{\sum }_{l=1}^{b}{e}^{-| k-l| }{\left({\lambda }_{k}-{\lambda }_{l}\right)}^{2}$$

(10)

which is still quadratic since λ_g is linear in ${a}_{1}^{-1},\ldots,{a}_{n}^{-1}$. The regularized program is thus still efficiently solvable. Here, r is a hyperparameter that controls the amount of regularization to apply. Smaller values of r allow more extreme and small-scale variations in sequencing efficiency to be corrected, but increase the risk of incorrect estimates in the case of taxa partitions with non-overlapping GC distributions.
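The structure of Eq. (10) can be sketched as follows for given values of λ_g and G. The function names are hypothetical, and the values of λ_g are passed in directly rather than computed from Eq. (9).

```python
import math
from typing import Sequence


def smoothness_penalty(lam: Sequence[float]) -> float:
    """Second term of Eq. (10): squared differences between bin efficiencies,
    down-weighted exponentially with the distance between the bins."""
    b = len(lam)
    return sum(math.exp(-abs(k - l)) * (lam[k] - lam[l]) ** 2
               for k in range(b) for l in range(b))


def regularized_objective(g_value: float, lam: Sequence[float],
                          n_taxa: int, r: float) -> float:
    """Eq. (10): convex combination of the scaled quadratic form and the penalty."""
    return (1.0 - r) / n_taxa ** 2 * g_value + r * smoothness_penalty(lam)


flat = smoothness_penalty([0.7, 0.7, 0.7])   # a constant curve is not penalized
step = smoothness_penalty([1.0, 2.0])        # two ordered pairs at distance 1
```

Because the weight e^{-|k-l|} decays with the distance between bins, the penalty mostly discourages jumps between neighboring bins while leaving gradual trends across the GC range comparatively cheap.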

To find abundances a₁, …, a_n, we first minimize the regularized objective function $\tilde{G}$ in terms of ${a}_{1}^{-1},\ldots,{a}_{n}^{-1}$ subject to ${\sum }_{i}{a}_{i}^{-1}=1$ using the Python package cvxopt. We then compute a₁, …, a_n, normalized such that ∑_j a_j = 1, and compute the GC-dependent sequencing efficiencies ${\eta }_{g}={\lambda }_{g}/{\max }_{\gamma }{\lambda }_{\gamma }$.
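The following sketch illustrates the minimization for the simplest case only: no regularization (r = 0), all terms of Eq. (8) defined, and no positivity constraints. Instead of cvxopt, it solves the equality-constrained problem through its KKT system with plain Gaussian elimination, which is a substitute technique chosen to keep the example dependency-free and is not the authors' implementation. The input data are invented.

```python
from typing import List, Sequence


def solve_linear(matrix: List[List[float]], rhs: List[float]) -> List[float]:
    """Solve a dense linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: abundances are not identifiable")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def estimate_abundances(ratios: Sequence[Sequence[float]]) -> List[float]:
    """Minimize G (Eq. 8) over inverse abundances x subject to sum(x) = 1."""
    n = len(ratios)
    q = [[0.0] * n for _ in range(n)]   # G(x) = x^T Q x
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            for r_i, r_j in zip(ratios[i], ratios[j]):
                q[i][i] += r_i * r_i
                q[j][j] += r_j * r_j
                q[i][j] -= r_i * r_j
                q[j][i] -= r_i * r_j
    # KKT conditions: 2 Q x + mu * 1 = 0 and 1^T x = 1.
    kkt = [[2.0 * q[i][j] for j in range(n)] + [1.0] for i in range(n)]
    kkt.append([1.0] * n + [0.0])
    inv_abund = solve_linear(kkt, [0.0] * n + [1.0])[:n]
    abund = [1.0 / x for x in inv_abund]
    total = sum(abund)
    return [a / total for a in abund]


# Hypothetical noise-free obs/exp ratios for three taxa and three GC bins.
true_abund = [0.2, 0.3, 0.5]
eta = [0.5, 1.0, 0.8]
ratios = [[a * e for e in eta] for a in true_abund]
estimated = estimate_abundances(ratios)
```

On noise-free input the estimate reproduces the abundances used to generate the ratios. If the taxa split into groups with non-overlapping GC distributions, the KKT system becomes singular, which is the situation the regularization is designed to handle.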

False-positive detection and removal

The set of taxa with a non-zero number O(j, g) of assigned reads often contains taxa that are not actually present in the sample. The reads assigned to such a false-positive taxon are consequently not uniformly random draws from the taxon’s genome, and we hence expect to see some deviation from Eq. (1). To detect false positives, we therefore look for outliers amongst the relative residuals $\xi (j,g)=\left(O(j,g)-\bar{O}(j,g)\right)/\bar{O}(j,g)$, where $\bar{O}(j,g)$ is the expected number of reads computed using Eq. (1). Taxa are removed if the variation $\phi (j)=\mathop{\max }_{g}\xi (j,g)-{\min }_{g}\xi (j,g)$ of their residuals ξ(j, g) exceeds a predefined threshold T. After removing taxa, all abundances are re-computed, the threshold is halved, and another round of false-positive removals is done. We stop after a specified number of rounds (by default 5).
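The iterative removal scheme can be sketched as follows. The callable `fit_expected` is a hypothetical stand-in for re-estimating abundances and computing expected counts with Eq. (1); in the toy example it is replaced by a flat profile, which is not the model used by GuaCAMOLE. The starting threshold is likewise an arbitrary choice.

```python
from typing import Callable, Dict, List

Counts = Dict[str, List[float]]


def residual_spread(observed: List[float], expected: List[float]) -> float:
    """phi(j): range of the relative residuals xi(j, g) across GC bins."""
    xi = [(o - e) / e for o, e in zip(observed, expected)]
    return max(xi) - min(xi)


def remove_false_positives(observed: Counts,
                           fit_expected: Callable[[Counts], Counts],
                           threshold: float,
                           rounds: int = 5) -> Counts:
    """Repeatedly refit, drop taxa whose residual spread exceeds the threshold,
    and halve the threshold for the next round."""
    taxa = dict(observed)
    for _ in range(rounds):
        expected = fit_expected(taxa)   # re-computed after every removal round
        flagged = [t for t in taxa
                   if residual_spread(taxa[t], expected[t]) > threshold]
        for t in flagged:
            del taxa[t]
        threshold /= 2.0
    return taxa


def flat_profile(counts: Counts) -> Counts:
    """Placeholder model: every bin of a taxon is expected to hold its mean count."""
    return {t: [sum(c) / len(c)] * len(c) for t, c in counts.items()}


observed = {"real": [100.0, 102.0, 98.0], "spurious": [300.0, 10.0, 5.0]}
kept = remove_false_positives(observed, flat_profile, threshold=1.0)
```

Halving the threshold makes later rounds stricter, so taxa with moderate deviations are only removed after the most obvious false positives no longer distort the fit.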

Simulated mock community data

We generated synthetic communities by sampling 5, 10, 50, 100 or 400 bacterial genome assemblies from RefSeq. Only assemblies of taxa that are represented with at least one genome in our Kraken2 database were considered, but we allowed the strains to differ. To control the distribution of genomic GC content within generated communities, assemblies were partitioned into N = 12 equally sized GC-bins according to their genomic GC content and sampled in a two-stage process. First, a GC-bin i ∈ {1, …, 12} was drawn with either (i) uniform probability ${P}_{i}^{{{\rm{unif}}}}=1/N$, (ii) probabilities ${P}_{i}^{{{\rm{med}}}}\propto \exp (-{(i-N/2)}^{2}\,18/{N}^{2})$ skewed towards a GC content of 50%, or (iii) probabilities ${P}_{i}^{{{\rm{ext}}}}\propto 1-{P}_{i}^{{{\rm{med}}}}/\mathop{\max }_{j}{P}_{j}^{{{\rm{med}}}}$ skewed towards extreme GC content. From the selected GC bin, an assembly was then drawn randomly. After sampling the genomes present in each community, the corresponding species abundances were sampled from a log-normal distribution. Finally, a paired-end sequencing library was generated for each community using a modified version of InSilicoSeq v2.0.1³⁸, available under https://github.com/Cibiv/InSilicoSeq-GCBias. In our modified version of InSilicoSeq, a bias against reads with extremal GC content is introduced by rejecting reads with probability 10 ⋅ (g−0.5)².
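The three bin-sampling schemes and the rejection rule can be written down as in the sketch below. The index in the exponent is taken to be the bin number i, the log-normal parameters and the random seed are placeholders because the text does not state them, and capping the rejection probability at 1 is our assumption (the stated formula exceeds 1 for GC contents far from 50%).

```python
import math
import random
from typing import List


def gc_bin_probabilities(n_bins: int = 12, mode: str = "unif") -> List[float]:
    """Sampling probabilities for GC bins 1..n_bins under the three schemes."""
    med = [math.exp(-((i - n_bins / 2) ** 2) * 18 / n_bins ** 2)
           for i in range(1, n_bins + 1)]
    if mode == "unif":
        weights = [1.0] * n_bins
    elif mode == "med":
        weights = med
    elif mode == "ext":
        weights = [1.0 - m / max(med) for m in med]
    else:
        raise ValueError("mode must be 'unif', 'med' or 'ext'")
    total = sum(weights)
    return [w / total for w in weights]


def rejection_probability(gc: float) -> float:
    """Probability of rejecting a read with GC fraction `gc`, capped at 1 (assumption)."""
    return min(1.0, 10.0 * (gc - 0.5) ** 2)


rng = random.Random(0)   # placeholder seed
p_med = gc_bin_probabilities(12, "med")
p_ext = gc_bin_probabilities(12, "ext")
# First stage: draw GC bins for a community of five genomes.
bins = rng.choices(range(1, 13), weights=p_ext, k=5)
# Log-normal abundances with placeholder parameters mu = 0 and sigma = 1.
raw = [rng.lognormvariate(0.0, 1.0) for _ in bins]
abundances = [x / sum(raw) for x in raw]
```

Under the "ext" scheme the central bin receives zero weight, so communities drawn this way contain no genomes from that bin.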

Experimental mock community data

For the analysis of the mock community data of Tourlousse et al. ¹⁶ (SRA accession SRS7661134), we used the RefSeq release 220 database containing human, archaeal, viral, plasmid, and bacterial DNA³⁹. We ran GuaCAMOLE (in taxonomic abundance mode), Bracken, and MetaPhlAn4. For Bracken and GuaCAMOLE, we set the read threshold to 500. For GuaCAMOLE, we set the number of false-positive removal rounds to 5 (4 for JH and B0), the read length to 150 bp, and the fragment length to the value observed for each protocol: 200 bp for D0, IL and IH; 250 bp for DH, BL, BH, EH and EL; 325 bp for F0, FL and FH; 350 bp for JH, AH, C0 and CL; 400 bp for AL, A0 and JL; and 300 bp for all other protocols. For MetaPhlAn4, we used the CHOCOPhlAn database v202103 with default parameters and manually corrected the classification of F. prausnitzii to F. duncaniae, since this recent reclassification is not reflected by CHOCOPhlAn v202103. Since Bracken always reports sequence abundances, we adjusted the Bracken-estimated abundances using the same genome length estimates that GuaCAMOLE uses to make them comparable. For Sylph, we profiled the reads using the GTDB database v220⁴⁰. To ensure comparability across all methods, we manually mapped the taxonomic profiles from Sylph, mOTUs3, and SingleM to their corresponding NCBI taxonomy IDs using the provided genus and species names.

Analyzing colorectal cancer microbiome data

We ran GuaCAMOLE on all paired-end human gut microbiome samples with non-zero read length from the curated list of samples from 33 studies found in Table S2 of Murovec et al. ²⁸. For the study of Yachida et al. ²⁶ we included all 645 samples, even those not included in the list of Murovec et al. GuaCAMOLE was run with default parameters except for activating genome length correction and specifying the read length reported in the SRA metadata of each sample. We then averaged the inferred GC-dependent sequencing efficiencies across each of the 33 studies (Fig. S3), performed a principal component analysis on the resulting efficiency curves, and clustered the studies based on the first three principal components using hierarchical clustering (R’s hclust command) with Euclidean distances (Fig. S4). Based on visual inspection of the resulting dendrogram, we split the studies into 4 clusters. For each cluster, we computed the number of species, average corrected abundance, and average abundance correction compared to Bracken for each GC-bin from 25% to 75% (Fig. 4). As “corrected abundances” we used the abundances corrected based on the inferred sequencing efficiency of each taxon and we included taxa flagged by GuaCAMOLE’s false-positive detection. For plotting, abundances and abundance corrections were smoothed with R’s stat_smooth command with method ‘GAM’.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

The sequencing data used in this study are publicly available from the Sequence Read Archive (SRA): the data of Tourlousse et al. ¹⁶ under accession PRJNA650228 and the data of Mori et al. ³⁵ under accession PRJNA650228. The SRA accessions of all samples used in this study, including the curated colorectal cancer (CRC) samples from refs. ²⁷,²⁸, together with the processed data and scripts required to reproduce the main analyses and figures of this publication, are available at https://github.com/Cibiv/GenomicGCBiasCorrectionImprovesAbundanceEstimation.

Code availability

GuaCAMOLE is available at https://github.com/CIBIV/GuaCAMOLE under an open-source license. All analyses in this study were done using the version given in ref. ⁴¹.

References

1. Morgan, J. L., Darling, A. E. & Eisen, J. A. Metagenomic sequencing of an in vitro-simulated microbial community. PLoS One 5, e10209 (2010).
2. Sunagawa, S. et al. Structure and function of the global ocean microbiome. Science 348, 1261359 (2015).
3. Integrative HMP (iHMP) Research Network Consortium. The integrative human microbiome project. Nature 569, 641–648 (2019).
4. Trivedi, P., Leach, J. E., Tringe, S. G., Sa, T. & Singh, B. K. Plant–microbiome interactions: from community assembly to plant health. Nat. Rev. Microbiol. 18, 607–621 (2020).
5. Sato, M. P. et al. Comparison of the sequencing bias of currently available library preparation kits for Illumina sequencing of bacterial genomes and metagenomes. DNA Res. 26, 391–398 (2019).
6. Bowers, R. M. et al. Impact of library preparation protocols and template quantity on the metagenomic reconstruction of a mock microbial community. BMC Genomics 16, 1–12 (2015).
7. Jones, M. B. et al. Library preparation methodology can influence genomic and functional predictions in human microbiome research. Proc. Natl Acad. Sci. 112, 14024–14029 (2015).
8. Wood, D. E., Lu, J. & Langmead, B. Improved metagenomic analysis with Kraken 2. Genome Biol. 20, 1–13 (2019).
9. Blanco-Míguez, A. et al. Extending and improving metagenomic taxonomic profiling with uncharacterized species using MetaPhlAn 4. Nat. Biotechnol. 1–12 (2023).
10. Ghurye, J. S., Cepeda-Espinoza, V. & Pop, M. Metagenomic assembly: overview, challenges and applications. Yale J. Biol. Med. 89, 353–362 (2016).
11. Uritskiy, G. V., DiRuggiero, J. & Taylor, J. MetaWRAP—a flexible pipeline for genome-resolved metagenomic data analysis. Microbiome 6, 158 (2018).
12. Li, D., Liu, C.-M., Luo, R., Sadakane, K. & Lam, T.-W. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 31, 1674–1676 (2015).
13. Nurk, S., Meleshko, D., Korobeynikov, A. & Pevzner, P. A. metaSPAdes: a new versatile metagenomic assembler. Genome Res. 27, 824–834 (2017).
14. Dohm, J. C., Lottaz, C., Borodina, T. & Himmelbauer, H. Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res. 36, e105 (2008).
15. Benjamini, Y. & Speed, T. P. Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Res. 40, e72 (2012).
16. Tourlousse, D. M. et al. Validation and standardization of DNA extraction and library construction methods for metagenomics-based human fecal microbiome measurements. Microbiome 9, 1–19 (2021).
17. Browne, P. D. et al. GC bias affects genomic and metagenomic reconstructions, underrepresenting GC-poor organisms. GigaScience 9, giaa008 (2020).
18. Yu, T. et al. Fusobacterium nucleatum promotes chemoresistance to colorectal cancer by modulating autophagy. Cell 170, 548–563 (2017).
19. Han, Y. W. Fusobacterium nucleatum: a commensal-turned pathogen. Curr. Opin. Microbiol. 23, 141–147 (2015).
20. Yang, Y. et al. Fusobacterium nucleatum increases proliferation of colorectal cancer cells and tumor development in mice by activating toll-like receptor 4 signaling to nuclear factor-κB, and up-regulating expression of microRNA-21. Gastroenterology 152, 851–866.e24 (2017).
21. McLaren, M. R., Nearing, J. T., Willis, A. D., Lloyd, K. G. & Callahan, B. J. Implications of taxonomic bias for microbial differential-abundance analysis. bioRxiv 2022–08 (2022).
22. Love, M. I., Hogenesch, J. B. & Irizarry, R. A. Modeling of RNA-seq fragment sequence bias reduces systematic errors in transcript abundance estimation. Nat. Biotechnol. 34, 1287–1291 (2016).
23. Wood, D. E. & Salzberg, S. L. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 15, 1–12 (2014).
24. Kim, D., Song, L., Breitwieser, F. P. & Salzberg, S. L. Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome Res. 26, 1721–1729 (2016).
25. Dilthey, A. T., Jain, C., Koren, S. & Phillippy, A. M. Strain-level metagenomic assignment and compositional estimation for long reads with MetaMaps. Nat. Commun. 10, 3066 (2019).
26. Yachida, S. et al. Metagenomic and metabolomic analyses reveal distinct stage-specific phenotypes of the gut microbiota in colorectal cancer. Nat. Med. 25, 968–976 (2019).
27. Gupta, V. K. et al. A predictive index for health status using species-level gut microbiome profiling. Nat. Commun. 11, 4635 (2020).
28. Murovec, B., Deutsch, L. & Stres, B. Predictive modeling of colorectal cancer using exhaustive analysis of microbiome information layers available from public metagenomic data. Front. Microbiol. 15 (2024).
29. Lu, J., Breitwieser, F. P., Thielen, P. & Salzberg, S. L. Bracken: estimating species abundance in metagenomics data. PeerJ Comput. Sci. 3, e104 (2017).
30. Sun, Z. et al. Challenges in benchmarking metagenomic profilers. Nat. Methods 18, 618–626 (2021).
31. Woodcroft, B. J. et al. SingleM and Sandpiper: robust microbial taxonomic profiles from metagenomic data (2024).
32. Shaw, J. & Yu, Y. W. Rapid species-level metagenome profiling and containment estimation with sylph. Nat. Biotechnol. 1–12 (2024).
33. Ruscheweyh, H.-J. et al. mOTUs: profiling taxonomic composition, transcriptional activity and strain populations of microbial communities. Curr. Protoc. 1, e218 (2021).
34. Aird, D. et al. Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biol. 12, 1–14 (2011).
35. Mori, H. et al. Assessment of metagenomic workflows using a newly constructed human gut microbiome mock community. DNA Res. 30, dsad010 (2023).
36. Bradford, L. M., Carrillo, C. & Wong, A. Managing false positives during detection of pathogen sequences in shotgun metagenomics datasets. BMC Bioinforma. 25, 372 (2024).
37. McLaren, M. R., Willis, A. D. & Callahan, B. J. Consistent and correctable bias in metagenomic sequencing experiments. eLife 8, e46923 (2019).
38. Gourlé, H., Karlsson-Lindsjö, O., Hayer, J. & Bongcam-Rudloff, E. Simulating Illumina metagenomic data with InSilicoSeq. Bioinformatics 35, 521–522 (2019).
39. O’Leary, N. A. et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 44, D733–D745 (2016).
40. Parks, D. H. et al. GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank-normalized and complete genome-based taxonomy. Nucleic Acids Res. 50, D785–D794 (2022).
41. Holcik, L., Pflug, F. G. & von Haeseler, A. Genomic GC bias correction improves species abundance estimation from metagenomic data. https://doi.org/10.5281/zenodo.17355036, code repository (2025).


Acknowledgements

We thank Simon Haendeler and Fausto Bradke for helpful suggestions regarding algorithm implementation and optimization. For help and support with the high-performance computing (HPC) infrastructure we thank Robert Happel and the Scientific Computing and Data Analysis (SCDA) section of Core Facilities at OIST. Finally we thank all members of the Biological Complexity Unit at OIST, the Center for Integrative Bioinformatics Vienna (CIBIV) and the Vienna Biocenter PhD program for inspiring discussions.

Author information

Authors and Affiliations

Center for Integrative Bioinformatics Vienna (CIBIV), University of Vienna, Vienna, Austria
Laurenz Holcik, Arndt von Haeseler & Florian G. Pflug
Ludwig Boltzmann Institute for Network Medicine, University of Vienna, Vienna, Austria
Arndt von Haeseler
Biological Complexity Unit, Okinawa Institute of Science and Technology, Onna, Okinawa, Japan
Florian G. Pflug

Authors

Laurenz Holcik
Arndt von Haeseler
Florian G. Pflug

Contributions

L.H., F.G.P. and Av.H. designed the algorithm; L.H. implemented, tested and benchmarked the algorithm; F.G.P. and L.H. applied the algorithm to human gut microbiomes; L.H. and F.G.P. created the figures; L.H., Av.H., and F.G.P. drafted and revised the manuscript; Av.H. and F.G.P. supervised the project.

Corresponding author

Correspondence to Florian G. Pflug.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks Svetlana Kutuzova, Simon Rasmussen, and the other anonymous reviewer for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Reporting Summary

Transparent Peer Review file

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article

Holcik, L., von Haeseler, A. & Pflug, F.G. Genomic GC bias correction improves species abundance estimation from metagenomic data. Nat Commun 16, 10523 (2025). https://doi.org/10.1038/s41467-025-65530-4


Received: 19 October 2024
Accepted: 17 October 2025
Published: 26 November 2025
Version of record: 26 November 2025
DOI: https://doi.org/10.1038/s41467-025-65530-4