Processing-bias correction with DEBIAS-M improves cross-study generalization of microbiome-based prediction models

Austin, George I.; Brown Kav, Aya; ElNaggar, Shahd; Park, Heekuk; Biermann, Jana; Uhlemann, Anne-Catrin; Pe’er, Itsik; Korem, Tal

doi:10.1038/s41564-025-01954-4

Article
Published: 27 March 2025

Processing-bias correction with DEBIAS-M improves cross-study generalization of microbiome-based prediction models

Nature Microbiology volume 10, pages 897–911 (2025)Cite this article

2185 Accesses
4 Citations
37 Altmetric
Metrics details

Subjects

Abstract

Every step in common microbiome profiling protocols has variable efficiency for each microbe, for example, different DNA extraction efficiency for Gram-positive bacteria. These processing biases impede the identification of signals that are biologically interpretable and generalizable across studies. ‘Batch-correction’ methods have been used to address these issues computationally with some success, but they are largely non-interpretable and often require the use of an outcome variable in a manner that risks overfitting. We present DEBIAS-M (domain adaptation with phenotype estimation and batch integration across studies of the microbiome), an interpretable framework for inference and correction of processing bias, which facilitates domain adaptation in microbiome studies. DEBIAS-M learns bias-correction factors for each microbe in each batch that simultaneously minimize batch effects and maximize cross-study associations with phenotypes. Using diverse benchmarks including 16S rRNA and metagenomic sequencing, classification and regression, and a variety of clinical and molecular targets, we demonstrate that using DEBIAS-M improves cross-study prediction accuracy compared with commonly used batch-correction methods. Notably, we show that the inferred bias-correction factors are stable, interpretable and strongly associated with specific experimental protocols. Overall, we show that DEBIAS-M facilitates improved modelling of microbiome data and identification of interpretable signals that generalize across studies.

Access through your institution

Buy or subscribe

This is a preview of subscription content, access via your institution

Access options

Access through your institution

Buy this article

Purchase on SpringerLink
Instant access to full article PDF

Buy now

Prices may be subject to local taxes which are calculated during checkout

**Fig. 1: Processing-bias correction with DEBIAS-M.**

**Fig. 2: DEBIAS-M outperforms batch-correction methods in cross-study prediction benchmarks.**

**Fig. 3: DEBIAS-M is robust across dataset characteristics in simulations.**

**Fig. 4: DEBIAS-M infers biases that are stable, interpretable and consistent with experimental processes.**

**Fig. 5: Regression and multi-task learning with DEBIAS-M improves metabolite predictions.**

**Fig. 6: DEBIAS-M improves cross-study prediction of cervical cancer.**

Predicting bacterial phenotypic traits through improved machine learning using high-quality, curated datasets

Article Open access 07 June 2025

Machine learning and deep learning applications in microbiome research

Article Open access 06 October 2022

Benchmarking microbiome transformations favors experimental quantitative approaches to address compositionality and sampling depth biases

Article Open access 11 June 2021

Data availability

All datasets analysed in this study are publicly available. The HIV dataset is available from Synapse (https://www.synapse.org/#!Synapse:syn18406854). The colorectal cancer and melanoma immunotherapy datasets are available through the R curatedMetagenomicData package⁶¹. The cervical neoplasia datasets were compiled from data provided with each publication, with information detailed in Supplementary Tables 4 and 5. The paired vaginal microbiome and metabolite data were obtained from supplementary data 2 of ref. ⁴⁹ (also available from dbGaP, accession phs001739.v1.p1) and supplementary table 1 of ref. ⁴⁸ (also available from MetaboLights, accession number MTBLS702). Available accession codes for all other datasets are listed in Supplementary Tables 1 and 3–6.

Code availability

DEBIAS-M is available on GitHub at https://korem-lab.github.io/DEBIAS-M/ (ref. ²⁸). Code used to generate all analyses and plots can be found on GitHub at https://github.com/korem-lab/DEBIAS-M-Analysis/ (ref. ⁶³).

References

Ramspek, C. L., Jager, K. J., Dekker, F. W., Zoccali, C. & van Diepen, M. External validation of prognostic models: what, why, how, when and where? Clin. Kidney J. 14, 49–58 (2021).
Article PubMed Google Scholar
Orouji, S., Liu, M. C., Korem, T. & Peters, M. A. K. Domain adaptation in small-scale and heterogeneous biological datasets. Sci. Adv. 10, eadp6040 (2024).
Article PubMed PubMed Central Google Scholar
Golob, J. L. et al. Microbiome preterm birth DREAM challenge: crowdsourcing machine learning approaches to advance preterm birth research. Cell Rep. Med. 5, 101350 (2023).
Article PubMed PubMed Central Google Scholar
Wirbel, J. et al. Meta-analysis of fecal metagenomes reveals global microbial signatures that are specific for colorectal cancer. Nat. Med. 25, 679–689 (2019).
Article PubMed PubMed Central CAS Google Scholar
Thomas, A. M. et al. Metagenomic analysis of colorectal cancer datasets identifies cross-cohort microbial diagnostic signatures and a link with choline degradation. Nat. Med. 25, 667–678 (2019).
Article PubMed PubMed Central CAS Google Scholar
McOrist, A. L., Jackson, M. & Bird, A. R. A comparison of five methods for extraction of bacterial DNA from human faecal samples. J. Microbiol. Methods 50, 131–139 (2002).
Article PubMed CAS Google Scholar
Salonen, A. et al. Comparative analysis of fecal DNA extraction methods with phylogenetic microarray: effective recovery of bacterial and archaeal DNA using mechanical cell lysis. J. Microbiol. Methods 81, 127–134 (2010).
Article PubMed CAS Google Scholar
Wesolowska-Andersen, A. et al. Choice of bacterial DNA extraction method from fecal material influences community structure as evaluated by metagenomic analysis. Microbiome 2, 19 (2014).
Article PubMed PubMed Central Google Scholar
Kennedy, N. A. et al. The impact of different DNA extraction kits and laboratories upon the assessment of human gut microbiota composition by 16S rRNA gene sequencing. PLoS ONE 9, e88982 (2014).
Article PubMed PubMed Central Google Scholar
Brooks, J. P. et al. The truth about metagenomics: quantifying and counteracting bias in 16S rRNA studies. BMC Microbiol. 15, 66 (2015).
Article PubMed PubMed Central Google Scholar
Costea, P. I. et al. Towards standards for human fecal sample processing in metagenomic studies. Nat. Biotechnol. 35, 1069–1076 (2017).
Article PubMed CAS Google Scholar
Sinha, R. et al. Assessment of variation in microbial community amplicon sequencing by the Microbiome Quality Control (MBQC) project consortium. Nat. Biotechnol. 35, 1077–1086 (2017).
Article PubMed PubMed Central CAS Google Scholar
Clausen, D. S. & Willis, A. D. Evaluating replicability in microbiome data. Biostatistics 23, 1099–1114 (2022).
Article PubMed Google Scholar
Walker, A. W. & Hoyles, L. Human microbiome myths and misconceptions. Nat. Microbiol. 8, 1392–1396 (2023).
Article PubMed CAS Google Scholar
McLaren, M. R., Willis, A. D. & Callahan, B. J. Consistent and correctable bias in metagenomic sequencing experiments. Elife 8, e46923 (2019).
Article PubMed PubMed Central Google Scholar
Gibbons, S. M., Duvallet, C. & Alm, E. J. Correcting for batch effects in case-control microbiome studies. PLoS Comput. Biol. 14, e1006102 (2018).
Article PubMed PubMed Central Google Scholar
Dai, Z., Wong, S. H., Yu, J. & Wei, Y. Batch effects correction for microbiome data with Dirichlet-multinomial regression. Bioinformatics 35, 807–814 (2019).
Article PubMed CAS Google Scholar
Ling, W. et al. Batch effects removal for microbiome data via conditional quantile regression. Nat. Commun. 13, 5418 (2022).
Article PubMed PubMed Central CAS Google Scholar
Ma, S. et al. Population structure discovery in meta-analyzed microbial communities and inflammatory bowel disease using MMUPHin. Genome Biol. 23, 208 (2022).
Article PubMed PubMed Central Google Scholar
Wang, Y. & Lê Cao, K.-A. PLSDA-batch: a multivariate framework to correct for batch effects in microbiome data. Brief. Bioinform. 24, bbac622 (2023).
Article PubMed PubMed Central Google Scholar
Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–127 (2007).
Article PubMed Google Scholar
Law, C. W., Chen, Y., Shi, W. & Smyth, G. K. voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 15, R29 (2014).
Article PubMed PubMed Central Google Scholar
Mecham, B. H., Nelson, P. S. & Storey, J. D. Supervised normalization of microarrays. Bioinformatics 26, 1308–1315 (2010).
Article PubMed PubMed Central CAS Google Scholar
Ritchie, M. E. et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43, e47 (2015).
Article PubMed PubMed Central Google Scholar
Austin, G. I. & Korem, T. Compositional transformations can reasonably introduce phenotype-associated values into sparse features. Preprint at bioRxiv https://doi.org/10.1101/2024.02.19.581060 (2024).
Poore, G. D. et al. Microbiome analyses of blood and tissues suggest cancer diagnostic approach. Nature 579, 567–574 (2020); retraction 631, 694 (2024).
Article PubMed PubMed Central CAS Google Scholar
Gihawi, A. et al. Major data analysis errors invalidate cancer microbiome findings. mBio 14, e0160723 (2023).
Article PubMed Google Scholar
Austin, G. I. et al. DEBIAS-M. GitHub https://korem-lab.github.io/DEBIAS-M/ (2025).
Tuddenham, S. A. et al. The impact of human immunodeficiency virus infection on gut microbiota α-diversity: an individual-level meta-analysis. Clin. Infect. Dis. 70, 615–627 (2020).
Article PubMed Google Scholar
Feng, Q. et al. Gut microbiome development along the colorectal adenoma–carcinoma sequence. Nat. Commun. 6, 6528 (2015).
Article PubMed CAS Google Scholar
Gupta, A. et al. Association of Flavonifractor plautii, a flavonoid-degrading bacterium, with the gut microbiome of colorectal cancer patients in India. mSystems 4, e00438-19 (2019).
Article PubMed PubMed Central CAS Google Scholar
Hannigan, G. D. et al. Diagnostic potential and interactive dynamics of the colorectal cancer virome. mBio 9, e02248-18 (2018).
Article PubMed PubMed Central Google Scholar
Vogtmann, E. et al. Colorectal cancer and the human gut microbiome: reproducibility with whole-genome shotgun sequencing. PLoS ONE 11, e0155362 (2016).
Article PubMed PubMed Central Google Scholar
Yu, J. et al. Metagenomic analysis of faecal microbiome as a tool towards targeted non-invasive biomarkers for colorectal cancer. Gut 66, 70–78 (2017).
Article PubMed CAS Google Scholar
Zeller, G. et al. Potential of fecal microbiota for early-stage detection of colorectal cancer. Mol. Syst. Biol. 10, 766 (2014).
Article PubMed PubMed Central Google Scholar
Audirac-Chalifour, A. et al. Cervical microbiome and cytokine profile at various stages of cervical cancer: a pilot study. PLoS ONE 11, e0153274 (2016).
Article PubMed PubMed Central Google Scholar
Di Paola, M. et al. Characterization of cervico-vaginal microbiota in women developing persistent high-risk human papillomavirus infection. Sci. Rep. 7, 10200 (2017).
Article PubMed PubMed Central Google Scholar
Lee, J. E. et al. Association of the vaginal microbiota with human papillomavirus infection in a Korean twin cohort. PLoS ONE 8, e63514 (2013).
Article PubMed PubMed Central Google Scholar
Ilhan, Z. E. et al. Deciphering the complex interplay between microbiota, HPV, inflammation and cancer through cervicovaginal metabolic profiling. EBioMedicine 44, 675–690 (2019).
Article PubMed PubMed Central CAS Google Scholar
Onywera, H., Williamson, A.-L., Mbulawa, Z. Z. A., Coetzee, D. & Meiring, T. L. The cervical microbiota in reproductive-age South African women with and without human papillomavirus infection. Papillomavirus Res. 7, 154–163 (2019).
Article PubMed PubMed Central Google Scholar
Tsementzi, D. et al. Comparison of vaginal microbiota in gynecologic cancer patients pre- and post-radiation therapy and healthy women. Cancer Med. 9, 3714–3724 (2020).
Article PubMed PubMed Central CAS Google Scholar
Łaniewski, P., Cui, H., Roe, D. J., Chase, D. M. & Herbst-Kralovetz, M. M. Vaginal microbiota, genital inflammation, and neoplasia impact immune checkpoint protein profiles in the cervicovaginal microenvironment. npj Precis. Oncol. 4, 22 (2020).
Article PubMed PubMed Central Google Scholar
Jiang, L. et al. Influence of pelvic intensity-modulated radiation therapy with concurrent cisplatin-based chemotherapy of cervical cancer on the vaginal microbiome. Front. Oncol. 11, 615439 (2021).
Article PubMed PubMed Central CAS Google Scholar
Whalen, S., Schreiber, J., Noble, W. S. & Pollard, K. S. Navigating the pitfalls of applying machine learning in genomics. Nat. Rev. Genet. 23, 169–181 (2022).
Article PubMed CAS Google Scholar
Ma, S. et al. A statistical model for describing and simulating microbial community profiles. PLoS Comput. Biol. 17, e1008913 (2021).
Article PubMed PubMed Central CAS Google Scholar
Klindworth, A. et al. Evaluation of general 16S ribosomal RNA gene PCR primers for classical and next-generation sequencing-based diversity studies. Nucleic Acids Res. 41, e1 (2013).
Article PubMed CAS Google Scholar
Bukin, Y. S. et al. The effect of 16S rRNA region choice on bacterial community metabarcoding results. Sci. Data 6, 190007 (2019).
Article PubMed PubMed Central CAS Google Scholar
Kindschuh, W. F. et al. Preterm birth is associated with xenobiotics and predicted by the vaginal metabolome. Nat. Microbiol. 8, 246–259 (2023).
Article PubMed PubMed Central CAS Google Scholar
Elovitz, M. A. et al. Cervicovaginal microbiota and local immune response modulate the risk of spontaneous preterm delivery. Nat. Commun. 10, 1305 (2019).
Article PubMed PubMed Central Google Scholar
Bar, N. et al. A reference map of potential determinants for the human serum metabolome. Nature 588, 135–140 (2020).
Article PubMed Google Scholar
Mallick, H. et al. Predictive metabolomic profiling of microbial communities using amplicon or metagenomic sequences. Nat. Commun. 10, 3136 (2019).
Article PubMed PubMed Central Google Scholar
Lee, K. A. et al. Cross-cohort gut microbiome associations with immune checkpoint inhibitor response in advanced melanoma. Nat. Med. 28, 535–544 (2022).
Article PubMed PubMed Central CAS Google Scholar
Wind, T. T. et al. Gut microbial species and metabolic pathways associated with response to treatment with immune checkpoint inhibitors in metastatic melanoma. Melanoma Res. 30, 235–246 (2020).
Article PubMed CAS Google Scholar
Peters, B. A. et al. Relating the gut metagenome and metatranscriptome to immunotherapy responses in melanoma patients. Genome Med. 11, 61 (2019).
Article PubMed PubMed Central Google Scholar
Clausen, D. S. & Willis, A. D. Modeling complex measurement error in microbiome experiments. Preprint at https://doi.org/10.48550/arXiv.2204.12733 2022).
Clausen, D. S. & Willis, A. D. Estimating fold changes from partially observed outcomes with applications in microbial metagenomics. Preprint at https://doi.org/10.48550/arXiv.2402.05231 (2024).
Nearing, J. T., Comeau, A. M. & Langille, M. G. I. Identifying biases and their potential solutions in human microbiome studies. Microbiome 9, 113 (2021).
Article PubMed PubMed Central Google Scholar
Morgan, J. L., Darling, A. E. & Eisen, J. A. Metagenomic sequencing of an in vitro-simulated microbial community. PLoS ONE 5, e10209 (2010).
Article PubMed PubMed Central Google Scholar
Austin, G. I. et al. Contamination source modeling with SCRuB improves cancer phenotype prediction from microbiome data. Nat. Biotechnol. 41, 1820–1828 (2023).
Article PubMed PubMed Central CAS Google Scholar
Callahan, B. J. et al. DADA2: high-resolution sample inference from Illumina amplicon data. Nat. Methods 13, 581–583 (2016).
Article PubMed PubMed Central CAS Google Scholar
Pasolli, E. et al. Accessible, curated metagenomic data through ExperimentHub. Nat. Methods 14, 1023–1024 (2017).
Article PubMed PubMed Central CAS Google Scholar
Mukherjee, S. et al. Twenty-five years of Genomes OnLine Database (GOLD): data updates and new features in v.9. Nucleic Acids Res. 51, D957–D963 (2023).
Article PubMed CAS Google Scholar
Austin, G. I. et al. DEBIAS-M Analysis. GitHub https://github.com/korem-lab/DEBIAS-M-Analysis/ (2025).

Download references

Acknowledgements

We thank members of the Korem group and G. D. Sepich-Poore for useful discussions, and the RECOMB2024 reviewers and attendees for insightful comments. We thank all authors and participants involved in the generation of data used in this study. This work was supported by the Program for Mathematical Genomics at Columbia University (T.K.), R01HD106017 (T.K.), R01HD114715 (T.K.), T15LM007079 (G.I.A.), and U54DK104309 (A.-C.U.).

Author information

Authors and Affiliations

Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY, USA
George I. Austin
Program for Mathematical Genomics, Department of Systems Biology, Columbia University Irving Medical Center, New York, NY, USA
George I. Austin, Aya Brown Kav, Shahd ElNaggar, Jana Biermann, Itsik Pe’er & Tal Korem
Division of Infectious Diseases, Columbia University Irving Medical Center, New York, NY, USA
Heekuk Park & Anne-Catrin Uhlemann
Department of Medicine, Division of Hematology/Oncology, Columbia University Irving Medical Center, New York, NY, USA
Jana Biermann
Herbert Irving Comprehensive Cancer Center, Columbia University Irving Medical Center, New York, NY, USA
Jana Biermann & Tal Korem
Department of Computer Science, Columbia University, New York, NY, USA
Itsik Pe’er
Department of Obstetrics and Gynecology, Columbia University Irving Medical Center, New York, NY, USA
Tal Korem

Authors

George I. Austin
View author publications
Search author on:PubMed Google Scholar
Aya Brown Kav
View author publications
Search author on:PubMed Google Scholar
Shahd ElNaggar
View author publications
Search author on:PubMed Google Scholar
Heekuk Park
View author publications
Search author on:PubMed Google Scholar
Jana Biermann
View author publications
Search author on:PubMed Google Scholar
Anne-Catrin Uhlemann
View author publications
Search author on:PubMed Google Scholar
Itsik Pe’er
View author publications
Search author on:PubMed Google Scholar
Tal Korem
View author publications
Search author on:PubMed Google Scholar

Contributions

G.I.A. and T.K. conceived and designed the study, designed analyses, interpreted the results and wrote the manuscript with input from all authors. G.I.A. conceived and wrote DEBIAS-M and conducted all analyses. S.E., J.B., A.B.K and H.P. acquired, harmonized and processed data as well as assisted with analyses. A.-C.U. and I.P. provided critical comments on the design, method and analysis. T.K. supervised the study.

Corresponding author

Correspondence to Tal Korem.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Microbiology thanks Wodan Ling, Hanbo Dong, Chenlian Fu and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Overview of information used in various microbiome batch-correction prediction benchmarks.

Top legend: Description of information that is typically incorporated in microbiome batch correction, which is 1) the samples themselves; 2) the labels to be predicted in downstream modeling; 3) other covariates; and 4) target labels. a, The primary batch-correction evaluation strategy used in this work for DEBIAS-M, in which the samples from all studies are used during batch correction, but only the labels from the training set are used during batch correction or model training. b, The primary batch-correction evaluation strategy used for methods that require an outcome label for all studies, in which a covariate is provided during batch-correction instead of the target label. c, The batch-correction strategy used in our ‘online’ benchmark in Extended Data Fig. 2a–c, in which no information from the test set is used during batch correction or model training. Once all bias-correction factors and predictive model weights are learned and fixed for the training set, bias correction is performed separately for the test set by adjusting its bias-correction factors to optimize cross-batch similarity. d, An approach that has been used in some previous benchmarks, in which the labels of the test set are used during batch correction itself. This risks ‘information leakage’, and is used in this work only in Extended Data Fig. 3a, b, and, for ConQuR and percnorm, also Extended Data Fig. 6. e, A description of how DEBIAS-M is used as a preprocessing step (Fig. 6 and Extended Data Fig. 9a). DEBIAS-M is applied to samples from all studies, but receives the target label only for studies from the training set. A prediction model is then trained on the training set which was processed by DEBIAS-M, and evaluated on the test set.

Extended Data Fig. 2 DEBIAS-M maintains its performance in log space and without observing the test set during training.

a, The progression of cross-batch similarity loss as a fitted online DEBIAS-M model adapts to samples from a previously unobserved study, by solely minimizing the cross-batch similarity loss. b, the predictive performance of the fitted online DEBIAS-M model throughout the adaptation iterations. Although not directly used during the adaptation itself, the auROC of the model’s prediction on the held-out test increases as the cross-batch similarity increases. c, Box and swarm plots (Box, IQR; line, median; whiskers, nearest point to 1.5*IQR) comparing the auROC of DEBIAS-M (fitted and evaluated using the strategy in Extended Data Fig. 1a) with ‘Online DEBIAS-M’ (fitted and evaluated using the strategy in Extended Data Fig. 1c) on each held-out study, for the benchmarks used in Fig. 2a–c. Online DEBIAS-M demonstrated equivalent predictive performance on held-out studies. p - two-sided Wilcoxon signed-rank test. See Supplementary Tables 1, 3, 4 for information on studies and sample sizes. d-g, Same as Fig. 2, but comparing log-additive DEBIAS-M to batch-correction methods on clr-transformed data. Most methods operate only in relative abundance or count space, and we therefore transformed their outcome with centered log transformation. For a fair comparison, we also clr-transformed the output of regular DEBIAS-M (denoted ‘CLR(DEBIAS-M)’) PLSDA-batch was run on clr-transformed abundances. Voom-SNM is not included as it cannot operate on clr-transformed values and its output is not in non-negative relative abundance or count space. *, Fisher’s multiple comparison of one-sided DeLong tests p < 0.01 vs. DEBIAS-M. See Supplementary Table 7 for exact p values.

Extended Data Fig. 3 Using test set labels during batch correction can drastically increase measured predictive performance in downstream benchmarks.

a-b, The same cross-study colorectal cancer prediction benchmark as in Fig. 2b and Extended Data Fig. 2e, respectively, but percnorm, ConQuR, and Voom-SNM were provided all colorectal cancer labels, including for the test set, during batch correction (as in Extended Data Fig. 1d). The prediction accuracy (auROC) of certain methods inflated drastically beyond the results observed in the primary benchmark (Fig. 2b and Extended Data Fig. 2e), highlighting potential issues with assessing a batch-correction method by measuring the ability of a downstream machine learning model to predict information used during batch correction. This trend is consistent in both relative abundance space (a) and center log ratio (b). Voom-SNM is not run for (b) as its output is neither in non-negative relative abundance nor in count space. See Supplementary Table 2 for information on studies and sample sizes. c, The same cross-study HIV prediction benchmark as in Fig. 2a, comparing the accuracy on each held-out study for ConQuR, percnorm, and Voom-SNM when provided with gender or age as a covariate (see Extended Data Fig. 1b). See Supplementary Table 1 for information on studies and sample sizes. p, one-sided Wilcoxon signed-rank test. Across all panels: Box, IQR; line, median; whiskers, nearest point to 1.5*IQR. One point corresponds to the specified model and dataset’s predictive performance on a held-out study.

Extended Data Fig. 4 DEBIAS-M is robust to dataset characteristics in simulations under its model.

a-d, Box and swarm plots, showing results for the same in silico simulations as in Fig. 3a–d, but evaluating the ability of DEBIAS-M to infer the ground-truth biases (N = 25 experiments per boxplot). DEBIAS-M shows consistent ability across varying sequencing depths (a) and strength of association with phenotype (b). However, it shows improvement with the number of batches (b; especially >2) and batch size (c; especially >40). N = 25 experiments per box. e-f, Box and swarm plots similar to Fig. 3a, except in (e) all samples in each batch have a batch-specific read depth (with a global mean across batches), which simulates rarefaction to a batch-specific threshold; and in (f) each sample has its own read depth, with a different mean for each batch (and a global mean across batches), simulating non rarefaction. DEBIAS-M is robust to sequencing depth in all cases. p: Raw vs. DEBIAS-M one-sided Wilcoxon signed-rank test; DEBIAS-M vs. DEBIAS-M two-sided Mann-Whitney U test. g, The strength of association between simulated phenotype and phenotype predictions obtained using ground-truth weights and biases on the microbiome data (Methods), demonstrating the effect of increasing measurement noise in Fig. 3f. Across panels: Boxplots: line, median; box, IQR; whiskers, nearest point to 1.5*IQR. One point corresponds to one simulation, evaluating DEBIAS-M’s (a-f) or the ground-truth (g) performance on the held-out batch of that simulation.

Extended Data Fig. 5 DEBIAS-M is robust across dataset characteristics using non-simulated data.

a-c, Box and swarm plots (line, median; box, IQR; whiskers, nearest point to 1.5*IQR) showing results for simulations performed by subsetting the colorectal cancer dataset (analyzed in Fig. 2b), evaluating accuracy of classifying cancer presence using the gut microbiome under varying sequencing depth (a), number of samples per study (b) and the number of studies (c) to each held-out study (see Supplementary Table 3 for information on studies and sample sizes). DEBIAS-M significantly outperforms alternative methods in all benchmarks, but demonstrates reduced performance when the number of samples per study drops to 20. *, one-sided Wilcoxon signed-rank p < 0.05 between DEBIAS-M and marked method. See Supplementary Table 7 for exact p values.

Extended Data Fig. 6 DEBIAS-M outperforms alternative methods across dataset characteristics in simulations under its model.

a-f, Similar to Fig. 3a–f, respectively, but showing comparisons to alternative methods that produce outputs in counts or relative abundance space. Each point corresponds to the held-out batch of a single simulation. N = 25 experiments per box. Note that ConQuR and percnorm are receiving the phenotype for the test-set in these analyses (as in Extended Data Fig. 1d). g, Box and swarm plots (line, median; box, IQR; whiskers, nearest point to 1.5*IQR) showing the accuracy (as auROC) of predicting the phenotype evaluated on held-out test batches. In this simulation, ConQuR, percnorm, and Voom-SNM receive a simulated covariate (Methods) during batch correction (as in Extended Data Fig. 1b). As simulations are under the generative model of DEBIAS-M, they illustrate how the assumptions of other methods differ from the multiplicative bias model. *, two-sided Wilcoxon signed-rank p < 0.05 between DEBIAS-M and marked method. See Supplementary Table 7 for exact p values.

Extended Data Fig. 7 DEBIAS-M inference yields biological insights into sequencing bias.

Analyses of a fitted DEBIAS-M model on the collection of HIV studies used in Figs. 2a, 4, with bias-correction factors for species not found in a certain study imputed to the largest observed factor across all datasets. a, Heatmap illustrating the presence (blue) and absence (orange) of each OTU across each of the HIV studies analyzed, displayed using agglomerative clustering (Methods). The OTU detection patterns of the different studies cluster according to the 16S region amplified. b, Adonis PERMANOVA explained variance and p values for the effect of different experimental factors (Supplementary Table 1) on these bias correction factors. c, PCA plot of the bias-correction factors inferred by DEBIAS-M, same as Fig. 4c, but with bias-correction factors for OTUs not found in a certain study imputed to the largest observed factor across all datasets. Color represents extraction kit type and shape the 16S rRNA region used. d, Box and swarm plots (Box, IQR; line, median; whiskers, nearest point to 1.5*IQR) showing the standard deviation of bias-correction factors for each of 17 studies (Supplementary Table 1), comparing those with manual and robotic processing. p, one-sided Mann-Whitney U test. e, Scatterplot showing the bias-correction factors inferred by DEBIAS-M plotted versus the 16S copy number of the same species.

Extended Data Fig. 8 Multi-task DEBIAS-M improves classification and regression of metabolite levels.

a, Same as Fig. 5c, showing 399 metabolites with >50 observations for which MelonnPan⁵¹ does not provide a prediction, using samples from 218 individuals. b, Same as (a), but classifying samples where a metabolite had levels that were higher than the median, for all metabolites. This analysis was performed for 120 metabolites selected by MelonnPan out of 509 metabolites that had values lesser and greater than their median in both batches, such that a classification task could be used to train and evaluate a model. Samples from 218 individuals were used. The multi-task version of DEBIAS-M outperformed both its single-task version (one-sided Wilcoxon signed-rank p = 0.02, as well as MelonnPan (p = 5.6 × 10⁻¹⁴). Box, IQR; line, median; whiskers, nearest point to 1.5*IQR.

Extended Data Fig. 9 DEBIAS-M improves cross-study prediction of melanoma immunotherapy response and works within a single study.

a, Box and swarm plots of auROCs, each evaluating the cross-study generalization performance models using gut microbiome data to predict immunotherapy response in melanoma patients (defined as 12-month progression-free survival⁵²). Each auROC is calculated on a held-out study. See Supplementary Table 6 for information on studies and sample sizes. ‘Log10 Linear’ denotes the pipeline used by Lee et al.⁵². Preprocessing with DEBIAS-M shows a consistent albeit small improvement across all studies, with a particularly strong effect for one study. b, Box and swarm plots of auROCS, each representing the aggregate accuracy over the held-out points for a logistic regression model evaluated under 5-fold cross validation. ‘Cervical carcinoma’, ‘Colorectal cancer’ and ‘HIV’ correspond to the benchmark datasets from Fig. 2d, Fig. 2b, and Fig. 2a, respectively, with information on studies and samples sizes in Supplementary Tables 5, 2 and 1. p, one-sided Wilcoxon signed-rank test comparing between linear and DEBIAS-M single-batch across both datasets. Box, IQR; line, median; whiskers, nearest point to 1.5*IQR.

Extended Data Fig. 10 Methodological consideration for DEBIAS-M.

a-c, An analysis performed on the HIV benchmark (analyzed in Fig. 2a) using a standard implementation of DEBIAS-M as well as one using data with an added pseudocount of one, which allows DEBIAS-M to add non-measured taxa, and in which feature values with relative abundance below a threshold of 10⁻⁴ were replaced with zero, which allows DEBIAS-M to completely remove a taxon. The average pairwise Jaccard index (a) and Bray-Curtis dissimilarity (b) between studies are lower for DEBIAS-M with pseudocount and thresholding, indicating an improved ability to reduce batch effects. However, this does not result in improved accuracy in the benchmark itself (c). d,e, Box and swarm plot of auROCs, each evaluating the generalization performance of DEBIAS-M to a held out study from the HIV (Fig. 2a; d) and colorectal cancer (Fig. 2b; e) benchmarks. In each case, we have applied DEBIAS-M to data aggregated to different taxonomic levels, as available from each of the datasets. DEBIAS-M becomes less effective with more aggregation, and, for colorectal cancer prediction, actually reduces predictive performance compared to uncorrected data after aggregation to family level. f,g, Box and swarm plot of auROCs, each evaluating the generalization performance of DEBIAS-M to a held out study from the HIV (Fig. 2a; f) and colorectal cancer (Fig. 2b; g) benchmarks using different prediction loss functions. Most alternative loss functions are not significantly different from the binary cross-entropy used as default by DEBIAS-M for classification. The L¹ loss shows significant improvement on colorectal cancer prediction (p = 0.02; e), but a significant reduction in performance on the HIV benchmark (p = 0.002; d). h, i, The total runtime of DEBIAS-M on a standard laptop for simulated datasets with an increasing number of batches, each containing 96 samples (h), as well as for 8 batches, each with an increasing number of samples (i). DEBIAS-M runs in <15 minutes (median 187 seconds) even for 64 batches with 6,144 samples in total, and in <43 seconds (median 33 seconds) for 8 batches with 4,096 samples in total. Box, IQR; line, median; whiskers, nearest point to 1.5*IQR; p, two-sided Wilcoxon signed-rank test.

Supplementary information

Supplementary Information

Supplementary Notes 1–3 and Tables 1–7.

Reporting Summary

Supplementary Tables 1–5

Supplementary Table 1 | Study design and experimental metadata collated for HIV dataset. Table 2 | Area under the precision-recall curve for different analyses. Table 3 | Study information for colorectal cancer dataset. Table 4 | Study information for cervical intraepithelial neoplasia dataset. Table 5 | Study information for cervical carcinoma dataset. Table 6 | Study information for immunotherapy dataset. Table 7 | Exact P values displayed in figures.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Austin, G.I., Brown Kav, A., ElNaggar, S. et al. Processing-bias correction with DEBIAS-M improves cross-study generalization of microbiome-based prediction models. Nat Microbiol 10, 897–911 (2025). https://doi.org/10.1038/s41564-025-01954-4

Download citation

Received: 08 February 2024
Accepted: 11 February 2025
Published: 27 March 2025
Issue date: April 2025
DOI: https://doi.org/10.1038/s41564-025-01954-4

This article is cited by

Microbiome data integration via shared dictionary learning
- Bo Yuan
- Shulei Wang
Nature Communications (2025)