Skip to main content

Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

  • Article
  • Published:

Processing-bias correction with DEBIAS-M improves cross-study generalization of microbiome-based prediction models

Abstract

Every step in common microbiome profiling protocols has variable efficiency for each microbe, for example, different DNA extraction efficiency for Gram-positive bacteria. These processing biases impede the identification of signals that are biologically interpretable and generalizable across studies. ‘Batch-correction’ methods have been used to address these issues computationally with some success, but they are largely non-interpretable and often require the use of an outcome variable in a manner that risks overfitting. We present DEBIAS-M (domain adaptation with phenotype estimation and batch integration across studies of the microbiome), an interpretable framework for inference and correction of processing bias, which facilitates domain adaptation in microbiome studies. DEBIAS-M learns bias-correction factors for each microbe in each batch that simultaneously minimize batch effects and maximize cross-study associations with phenotypes. Using diverse benchmarks including 16S rRNA and metagenomic sequencing, classification and regression, and a variety of clinical and molecular targets, we demonstrate that using DEBIAS-M improves cross-study prediction accuracy compared with commonly used batch-correction methods. Notably, we show that the inferred bias-correction factors are stable, interpretable and strongly associated with specific experimental protocols. Overall, we show that DEBIAS-M facilitates improved modelling of microbiome data and identification of interpretable signals that generalize across studies.

This is a preview of subscription content, access via your institution

Access options

Buy this article

Prices may be subject to local taxes which are calculated during checkout

Fig. 1: Processing-bias correction with DEBIAS-M.
Fig. 2: DEBIAS-M outperforms batch-correction methods in cross-study prediction benchmarks.
Fig. 3: DEBIAS-M is robust across dataset characteristics in simulations.
Fig. 4: DEBIAS-M infers biases that are stable, interpretable and consistent with experimental processes.
Fig. 5: Regression and multi-task learning with DEBIAS-M improves metabolite predictions.
Fig. 6: DEBIAS-M improves cross-study prediction of cervical cancer.

Similar content being viewed by others

Data availability

All datasets analysed in this study are publicly available. The HIV dataset is available from Synapse (https://www.synapse.org/#!Synapse:syn18406854). The colorectal cancer and melanoma immunotherapy datasets are available through the R curatedMetagenomicData package61. The cervical neoplasia datasets were compiled from data provided with each publication, with information detailed in Supplementary Tables 4 and 5. The paired vaginal microbiome and metabolite data were obtained from supplementary data 2 of ref. 49 (also available from dbGaP, accession phs001739.v1.p1) and supplementary table 1 of ref. 48 (also available from MetaboLights, accession number MTBLS702). Available accession codes for all other datasets are listed in Supplementary Tables 1 and 36.

Code availability

DEBIAS-M is available on GitHub at https://korem-lab.github.io/DEBIAS-M/ (ref. 28). Code used to generate all analyses and plots can be found on GitHub at https://github.com/korem-lab/DEBIAS-M-Analysis/ (ref. 63).

References

  1. Ramspek, C. L., Jager, K. J., Dekker, F. W., Zoccali, C. & van Diepen, M. External validation of prognostic models: what, why, how, when and where? Clin. Kidney J. 14, 49–58 (2021).

    Article  PubMed  Google Scholar 

  2. Orouji, S., Liu, M. C., Korem, T. & Peters, M. A. K. Domain adaptation in small-scale and heterogeneous biological datasets. Sci. Adv. 10, eadp6040 (2024).

    Article  PubMed  PubMed Central  Google Scholar 

  3. Golob, J. L. et al. Microbiome preterm birth DREAM challenge: crowdsourcing machine learning approaches to advance preterm birth research. Cell Rep. Med. 5, 101350 (2023).

    Article  PubMed  PubMed Central  Google Scholar 

  4. Wirbel, J. et al. Meta-analysis of fecal metagenomes reveals global microbial signatures that are specific for colorectal cancer. Nat. Med. 25, 679–689 (2019).

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  5. Thomas, A. M. et al. Metagenomic analysis of colorectal cancer datasets identifies cross-cohort microbial diagnostic signatures and a link with choline degradation. Nat. Med. 25, 667–678 (2019).

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  6. McOrist, A. L., Jackson, M. & Bird, A. R. A comparison of five methods for extraction of bacterial DNA from human faecal samples. J. Microbiol. Methods 50, 131–139 (2002).

    Article  PubMed  CAS  Google Scholar 

  7. Salonen, A. et al. Comparative analysis of fecal DNA extraction methods with phylogenetic microarray: effective recovery of bacterial and archaeal DNA using mechanical cell lysis. J. Microbiol. Methods 81, 127–134 (2010).

    Article  PubMed  CAS  Google Scholar 

  8. Wesolowska-Andersen, A. et al. Choice of bacterial DNA extraction method from fecal material influences community structure as evaluated by metagenomic analysis. Microbiome 2, 19 (2014).

    Article  PubMed  PubMed Central  Google Scholar 

  9. Kennedy, N. A. et al. The impact of different DNA extraction kits and laboratories upon the assessment of human gut microbiota composition by 16S rRNA gene sequencing. PLoS ONE 9, e88982 (2014).

    Article  PubMed  PubMed Central  Google Scholar 

  10. Brooks, J. P. et al. The truth about metagenomics: quantifying and counteracting bias in 16S rRNA studies. BMC Microbiol. 15, 66 (2015).

    Article  PubMed  PubMed Central  Google Scholar 

  11. Costea, P. I. et al. Towards standards for human fecal sample processing in metagenomic studies. Nat. Biotechnol. 35, 1069–1076 (2017).

    Article  PubMed  CAS  Google Scholar 

  12. Sinha, R. et al. Assessment of variation in microbial community amplicon sequencing by the Microbiome Quality Control (MBQC) project consortium. Nat. Biotechnol. 35, 1077–1086 (2017).

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  13. Clausen, D. S. & Willis, A. D. Evaluating replicability in microbiome data. Biostatistics 23, 1099–1114 (2022).

    Article  PubMed  Google Scholar 

  14. Walker, A. W. & Hoyles, L. Human microbiome myths and misconceptions. Nat. Microbiol. 8, 1392–1396 (2023).

    Article  PubMed  CAS  Google Scholar 

  15. McLaren, M. R., Willis, A. D. & Callahan, B. J. Consistent and correctable bias in metagenomic sequencing experiments. Elife 8, e46923 (2019).

    Article  PubMed  PubMed Central  Google Scholar 

  16. Gibbons, S. M., Duvallet, C. & Alm, E. J. Correcting for batch effects in case-control microbiome studies. PLoS Comput. Biol. 14, e1006102 (2018).

    Article  PubMed  PubMed Central  Google Scholar 

  17. Dai, Z., Wong, S. H., Yu, J. & Wei, Y. Batch effects correction for microbiome data with Dirichlet-multinomial regression. Bioinformatics 35, 807–814 (2019).

    Article  PubMed  CAS  Google Scholar 

  18. Ling, W. et al. Batch effects removal for microbiome data via conditional quantile regression. Nat. Commun. 13, 5418 (2022).

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  19. Ma, S. et al. Population structure discovery in meta-analyzed microbial communities and inflammatory bowel disease using MMUPHin. Genome Biol. 23, 208 (2022).

    Article  PubMed  PubMed Central  Google Scholar 

  20. Wang, Y. & Lê Cao, K.-A. PLSDA-batch: a multivariate framework to correct for batch effects in microbiome data. Brief. Bioinform. 24, bbac622 (2023).

    Article  PubMed  PubMed Central  Google Scholar 

  21. Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–127 (2007).

    Article  PubMed  Google Scholar 

  22. Law, C. W., Chen, Y., Shi, W. & Smyth, G. K. voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 15, R29 (2014).

    Article  PubMed  PubMed Central  Google Scholar 

  23. Mecham, B. H., Nelson, P. S. & Storey, J. D. Supervised normalization of microarrays. Bioinformatics 26, 1308–1315 (2010).

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  24. Ritchie, M. E. et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43, e47 (2015).

    Article  PubMed  PubMed Central  Google Scholar 

  25. Austin, G. I. & Korem, T. Compositional transformations can reasonably introduce phenotype-associated values into sparse features. Preprint at bioRxiv https://doi.org/10.1101/2024.02.19.581060 (2024).

  26. Poore, G. D. et al. Microbiome analyses of blood and tissues suggest cancer diagnostic approach. Nature 579, 567–574 (2020); retraction 631, 694 (2024).

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  27. Gihawi, A. et al. Major data analysis errors invalidate cancer microbiome findings. mBio 14, e0160723 (2023).

    Article  PubMed  Google Scholar 

  28. Austin, G. I. et al. DEBIAS-M. GitHub https://korem-lab.github.io/DEBIAS-M/ (2025).

  29. Tuddenham, S. A. et al. The impact of human immunodeficiency virus infection on gut microbiota α-diversity: an individual-level meta-analysis. Clin. Infect. Dis. 70, 615–627 (2020).

    Article  PubMed  Google Scholar 

  30. Feng, Q. et al. Gut microbiome development along the colorectal adenoma–carcinoma sequence. Nat. Commun. 6, 6528 (2015).

    Article  PubMed  CAS  Google Scholar 

  31. Gupta, A. et al. Association of Flavonifractor plautii, a flavonoid-degrading bacterium, with the gut microbiome of colorectal cancer patients in India. mSystems 4, e00438-19 (2019).

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  32. Hannigan, G. D. et al. Diagnostic potential and interactive dynamics of the colorectal cancer virome. mBio 9, e02248-18 (2018).

    Article  PubMed  PubMed Central  Google Scholar 

  33. Vogtmann, E. et al. Colorectal cancer and the human gut microbiome: reproducibility with whole-genome shotgun sequencing. PLoS ONE 11, e0155362 (2016).

    Article  PubMed  PubMed Central  Google Scholar 

  34. Yu, J. et al. Metagenomic analysis of faecal microbiome as a tool towards targeted non-invasive biomarkers for colorectal cancer. Gut 66, 70–78 (2017).

    Article  PubMed  CAS  Google Scholar 

  35. Zeller, G. et al. Potential of fecal microbiota for early-stage detection of colorectal cancer. Mol. Syst. Biol. 10, 766 (2014).

    Article  PubMed  PubMed Central  Google Scholar 

  36. Audirac-Chalifour, A. et al. Cervical microbiome and cytokine profile at various stages of cervical cancer: a pilot study. PLoS ONE 11, e0153274 (2016).

    Article  PubMed  PubMed Central  Google Scholar 

  37. Di Paola, M. et al. Characterization of cervico-vaginal microbiota in women developing persistent high-risk human papillomavirus infection. Sci. Rep. 7, 10200 (2017).

    Article  PubMed  PubMed Central  Google Scholar 

  38. Lee, J. E. et al. Association of the vaginal microbiota with human papillomavirus infection in a Korean twin cohort. PLoS ONE 8, e63514 (2013).

    Article  PubMed  PubMed Central  Google Scholar 

  39. Ilhan, Z. E. et al. Deciphering the complex interplay between microbiota, HPV, inflammation and cancer through cervicovaginal metabolic profiling. EBioMedicine 44, 675–690 (2019).

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  40. Onywera, H., Williamson, A.-L., Mbulawa, Z. Z. A., Coetzee, D. & Meiring, T. L. The cervical microbiota in reproductive-age South African women with and without human papillomavirus infection. Papillomavirus Res. 7, 154–163 (2019).

    Article  PubMed  PubMed Central  Google Scholar 

  41. Tsementzi, D. et al. Comparison of vaginal microbiota in gynecologic cancer patients pre- and post-radiation therapy and healthy women. Cancer Med. 9, 3714–3724 (2020).

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  42. Łaniewski, P., Cui, H., Roe, D. J., Chase, D. M. & Herbst-Kralovetz, M. M. Vaginal microbiota, genital inflammation, and neoplasia impact immune checkpoint protein profiles in the cervicovaginal microenvironment. npj Precis. Oncol. 4, 22 (2020).

    Article  PubMed  PubMed Central  Google Scholar 

  43. Jiang, L. et al. Influence of pelvic intensity-modulated radiation therapy with concurrent cisplatin-based chemotherapy of cervical cancer on the vaginal microbiome. Front. Oncol. 11, 615439 (2021).

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  44. Whalen, S., Schreiber, J., Noble, W. S. & Pollard, K. S. Navigating the pitfalls of applying machine learning in genomics. Nat. Rev. Genet. 23, 169–181 (2022).

    Article  PubMed  CAS  Google Scholar 

  45. Ma, S. et al. A statistical model for describing and simulating microbial community profiles. PLoS Comput. Biol. 17, e1008913 (2021).

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  46. Klindworth, A. et al. Evaluation of general 16S ribosomal RNA gene PCR primers for classical and next-generation sequencing-based diversity studies. Nucleic Acids Res. 41, e1 (2013).

    Article  PubMed  CAS  Google Scholar 

  47. Bukin, Y. S. et al. The effect of 16S rRNA region choice on bacterial community metabarcoding results. Sci. Data 6, 190007 (2019).

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  48. Kindschuh, W. F. et al. Preterm birth is associated with xenobiotics and predicted by the vaginal metabolome. Nat. Microbiol. 8, 246–259 (2023).

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  49. Elovitz, M. A. et al. Cervicovaginal microbiota and local immune response modulate the risk of spontaneous preterm delivery. Nat. Commun. 10, 1305 (2019).

    Article  PubMed  PubMed Central  Google Scholar 

  50. Bar, N. et al. A reference map of potential determinants for the human serum metabolome. Nature 588, 135–140 (2020).

    Article  PubMed  Google Scholar 

  51. Mallick, H. et al. Predictive metabolomic profiling of microbial communities using amplicon or metagenomic sequences. Nat. Commun. 10, 3136 (2019).

    Article  PubMed  PubMed Central  Google Scholar 

  52. Lee, K. A. et al. Cross-cohort gut microbiome associations with immune checkpoint inhibitor response in advanced melanoma. Nat. Med. 28, 535–544 (2022).

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  53. Wind, T. T. et al. Gut microbial species and metabolic pathways associated with response to treatment with immune checkpoint inhibitors in metastatic melanoma. Melanoma Res. 30, 235–246 (2020).

    Article  PubMed  CAS  Google Scholar 

  54. Peters, B. A. et al. Relating the gut metagenome and metatranscriptome to immunotherapy responses in melanoma patients. Genome Med. 11, 61 (2019).

    Article  PubMed  PubMed Central  Google Scholar 

  55. Clausen, D. S. & Willis, A. D. Modeling complex measurement error in microbiome experiments. Preprint at https://doi.org/10.48550/arXiv.2204.12733 2022).

  56. Clausen, D. S. & Willis, A. D. Estimating fold changes from partially observed outcomes with applications in microbial metagenomics. Preprint at https://doi.org/10.48550/arXiv.2402.05231 (2024).

  57. Nearing, J. T., Comeau, A. M. & Langille, M. G. I. Identifying biases and their potential solutions in human microbiome studies. Microbiome 9, 113 (2021).

    Article  PubMed  PubMed Central  Google Scholar 

  58. Morgan, J. L., Darling, A. E. & Eisen, J. A. Metagenomic sequencing of an in vitro-simulated microbial community. PLoS ONE 5, e10209 (2010).

    Article  PubMed  PubMed Central  Google Scholar 

  59. Austin, G. I. et al. Contamination source modeling with SCRuB improves cancer phenotype prediction from microbiome data. Nat. Biotechnol. 41, 1820–1828 (2023).

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  60. Callahan, B. J. et al. DADA2: high-resolution sample inference from Illumina amplicon data. Nat. Methods 13, 581–583 (2016).

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  61. Pasolli, E. et al. Accessible, curated metagenomic data through ExperimentHub. Nat. Methods 14, 1023–1024 (2017).

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  62. Mukherjee, S. et al. Twenty-five years of Genomes OnLine Database (GOLD): data updates and new features in v.9. Nucleic Acids Res. 51, D957–D963 (2023).

    Article  PubMed  CAS  Google Scholar 

  63. Austin, G. I. et al. DEBIAS-M Analysis. GitHub https://github.com/korem-lab/DEBIAS-M-Analysis/ (2025).

Download references

Acknowledgements

We thank members of the Korem group and G. D. Sepich-Poore for useful discussions, and the RECOMB2024 reviewers and attendees for insightful comments. We thank all authors and participants involved in the generation of data used in this study. This work was supported by the Program for Mathematical Genomics at Columbia University (T.K.), R01HD106017 (T.K.), R01HD114715 (T.K.), T15LM007079 (G.I.A.), and U54DK104309 (A.-C.U.).

Author information

Authors and Affiliations

Authors

Contributions

G.I.A. and T.K. conceived and designed the study, designed analyses, interpreted the results and wrote the manuscript with input from all authors. G.I.A. conceived and wrote DEBIAS-M and conducted all analyses. S.E., J.B., A.B.K and H.P. acquired, harmonized and processed data as well as assisted with analyses. A.-C.U. and I.P. provided critical comments on the design, method and analysis. T.K. supervised the study.

Corresponding author

Correspondence to Tal Korem.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Microbiology thanks Wodan Ling, Hanbo Dong, Chenlian Fu and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Overview of information used in various microbiome batch-correction prediction benchmarks.

Top legend: Description of information that is typically incorporated in microbiome batch correction, which is 1) the samples themselves; 2) the labels to be predicted in downstream modeling; 3) other covariates; and 4) target labels. a, The primary batch-correction evaluation strategy used in this work for DEBIAS-M, in which the samples from all studies are used during batch correction, but only the labels from the training set are used during batch correction or model training. b, The primary batch-correction evaluation strategy used for methods that require an outcome label for all studies, in which a covariate is provided during batch-correction instead of the target label. c, The batch-correction strategy used in our ‘online’ benchmark in Extended Data Fig. 2a–c, in which no information from the test set is used during batch correction or model training. Once all bias-correction factors and predictive model weights are learned and fixed for the training set, bias correction is performed separately for the test set by adjusting its bias-correction factors to optimize cross-batch similarity. d, An approach that has been used in some previous benchmarks, in which the labels of the test set are used during batch correction itself. This risks ‘information leakage’, and is used in this work only in Extended Data Fig. 3a, b, and, for ConQuR and percnorm, also Extended Data Fig. 6. e, A description of how DEBIAS-M is used as a preprocessing step (Fig. 6 and Extended Data Fig. 9a). DEBIAS-M is applied to samples from all studies, but receives the target label only for studies from the training set. A prediction model is then trained on the training set which was processed by DEBIAS-M, and evaluated on the test set.

Extended Data Fig. 2 DEBIAS-M maintains its performance in log space and without observing the test set during training.

a, The progression of cross-batch similarity loss as a fitted online DEBIAS-M model adapts to samples from a previously unobserved study, by solely minimizing the cross-batch similarity loss. b, the predictive performance of the fitted online DEBIAS-M model throughout the adaptation iterations. Although not directly used during the adaptation itself, the auROC of the model’s prediction on the held-out test increases as the cross-batch similarity increases. c, Box and swarm plots (Box, IQR; line, median; whiskers, nearest point to 1.5*IQR) comparing the auROC of DEBIAS-M (fitted and evaluated using the strategy in Extended Data Fig. 1a) with ‘Online DEBIAS-M’ (fitted and evaluated using the strategy in Extended Data Fig. 1c) on each held-out study, for the benchmarks used in Fig. 2a–c. Online DEBIAS-M demonstrated equivalent predictive performance on held-out studies. p - two-sided Wilcoxon signed-rank test. See Supplementary Tables 1, 3, 4 for information on studies and sample sizes. d-g, Same as Fig. 2, but comparing log-additive DEBIAS-M to batch-correction methods on clr-transformed data. Most methods operate only in relative abundance or count space, and we therefore transformed their outcome with centered log transformation. For a fair comparison, we also clr-transformed the output of regular DEBIAS-M (denoted ‘CLR(DEBIAS-M)’) PLSDA-batch was run on clr-transformed abundances. Voom-SNM is not included as it cannot operate on clr-transformed values and its output is not in non-negative relative abundance or count space. *, Fisher’s multiple comparison of one-sided DeLong tests p < 0.01 vs. DEBIAS-M. See Supplementary Table 7 for exact p values.

Extended Data Fig. 3 Using test set labels during batch correction can drastically increase measured predictive performance in downstream benchmarks.

a-b, The same cross-study colorectal cancer prediction benchmark as in Fig. 2b and Extended Data Fig. 2e, respectively, but percnorm, ConQuR, and Voom-SNM were provided all colorectal cancer labels, including for the test set, during batch correction (as in Extended Data Fig. 1d). The prediction accuracy (auROC) of certain methods inflated drastically beyond the results observed in the primary benchmark (Fig. 2b and Extended Data Fig. 2e), highlighting potential issues with assessing a batch-correction method by measuring the ability of a downstream machine learning model to predict information used during batch correction. This trend is consistent in both relative abundance space (a) and center log ratio (b). Voom-SNM is not run for (b) as its output is neither in non-negative relative abundance nor in count space. See Supplementary Table 2 for information on studies and sample sizes. c, The same cross-study HIV prediction benchmark as in Fig. 2a, comparing the accuracy on each held-out study for ConQuR, percnorm, and Voom-SNM when provided with gender or age as a covariate (see Extended Data Fig. 1b). See Supplementary Table 1 for information on studies and sample sizes. p, one-sided Wilcoxon signed-rank test. Across all panels: Box, IQR; line, median; whiskers, nearest point to 1.5*IQR. One point corresponds to the specified model and dataset’s predictive performance on a held-out study.

Extended Data Fig. 4 DEBIAS-M is robust to dataset characteristics in simulations under its model.

a-d, Box and swarm plots, showing results for the same in silico simulations as in Fig. 3a–d, but evaluating the ability of DEBIAS-M to infer the ground-truth biases (N = 25 experiments per boxplot). DEBIAS-M shows consistent ability across varying sequencing depths (a) and strength of association with phenotype (b). However, it shows improvement with the number of batches (b; especially >2) and batch size (c; especially >40). N = 25 experiments per box. e-f, Box and swarm plots similar to Fig. 3a, except in (e) all samples in each batch have a batch-specific read depth (with a global mean across batches), which simulates rarefaction to a batch-specific threshold; and in (f) each sample has its own read depth, with a different mean for each batch (and a global mean across batches), simulating non rarefaction. DEBIAS-M is robust to sequencing depth in all cases. p: Raw vs. DEBIAS-M one-sided Wilcoxon signed-rank test; DEBIAS-M vs. DEBIAS-M two-sided Mann-Whitney U test. g, The strength of association between simulated phenotype and phenotype predictions obtained using ground-truth weights and biases on the microbiome data (Methods), demonstrating the effect of increasing measurement noise in Fig. 3f. Across panels: Boxplots: line, median; box, IQR; whiskers, nearest point to 1.5*IQR. One point corresponds to one simulation, evaluating DEBIAS-M’s (a-f) or the ground-truth (g) performance on the held-out batch of that simulation.

Extended Data Fig. 5 DEBIAS-M is robust across dataset characteristics using non-simulated data.

a-c, Box and swarm plots (line, median; box, IQR; whiskers, nearest point to 1.5*IQR) showing results for simulations performed by subsetting the colorectal cancer dataset (analyzed in Fig. 2b), evaluating accuracy of classifying cancer presence using the gut microbiome under varying sequencing depth (a), number of samples per study (b) and the number of studies (c) to each held-out study (see Supplementary Table 3 for information on studies and sample sizes). DEBIAS-M significantly outperforms alternative methods in all benchmarks, but demonstrates reduced performance when the number of samples per study drops to 20. *, one-sided Wilcoxon signed-rank p < 0.05 between DEBIAS-M and marked method. See Supplementary Table 7 for exact p values.

Extended Data Fig. 6 DEBIAS-M outperforms alternative methods across dataset characteristics in simulations under its model.

a-f, Similar to Fig. 3a–f, respectively, but showing comparisons to alternative methods that produce outputs in counts or relative abundance space. Each point corresponds to the held-out batch of a single simulation. N = 25 experiments per box. Note that ConQuR and percnorm are receiving the phenotype for the test-set in these analyses (as in Extended Data Fig. 1d). g, Box and swarm plots (line, median; box, IQR; whiskers, nearest point to 1.5*IQR) showing the accuracy (as auROC) of predicting the phenotype evaluated on held-out test batches. In this simulation, ConQuR, percnorm, and Voom-SNM receive a simulated covariate (Methods) during batch correction (as in Extended Data Fig. 1b). As simulations are under the generative model of DEBIAS-M, they illustrate how the assumptions of other methods differ from the multiplicative bias model. *, two-sided Wilcoxon signed-rank p < 0.05 between DEBIAS-M and marked method. See Supplementary Table 7 for exact p values.

Extended Data Fig. 7 DEBIAS-M inference yields biological insights into sequencing bias.

Analyses of a fitted DEBIAS-M model on the collection of HIV studies used in Figs. 2a, 4, with bias-correction factors for species not found in a certain study imputed to the largest observed factor across all datasets. a, Heatmap illustrating the presence (blue) and absence (orange) of each OTU across each of the HIV studies analyzed, displayed using agglomerative clustering (Methods). The OTU detection patterns of the different studies cluster according to the 16S region amplified. b, Adonis PERMANOVA explained variance and p values for the effect of different experimental factors (Supplementary Table 1) on these bias correction factors. c, PCA plot of the bias-correction factors inferred by DEBIAS-M, same as Fig. 4c, but with bias-correction factors for OTUs not found in a certain study imputed to the largest observed factor across all datasets. Color represents extraction kit type and shape the 16S rRNA region used. d, Box and swarm plots (Box, IQR; line, median; whiskers, nearest point to 1.5*IQR) showing the standard deviation of bias-correction factors for each of 17 studies (Supplementary Table 1), comparing those with manual and robotic processing. p, one-sided Mann-Whitney U test. e, Scatterplot showing the bias-correction factors inferred by DEBIAS-M plotted versus the 16S copy number of the same species.

Extended Data Fig. 8 Multi-task DEBIAS-M improves classification and regression of metabolite levels.

a, Same as Fig. 5c, showing 399 metabolites with >50 observations for which MelonnPan51 does not provide a prediction, using samples from 218 individuals. b, Same as (a), but classifying samples where a metabolite had levels that were higher than the median, for all metabolites. This analysis was performed for 120 metabolites selected by MelonnPan out of 509 metabolites that had values lesser and greater than their median in both batches, such that a classification task could be used to train and evaluate a model. Samples from 218 individuals were used. The multi-task version of DEBIAS-M outperformed both its single-task version (one-sided Wilcoxon signed-rank p = 0.02, as well as MelonnPan (p = 5.6 × 10−14). Box, IQR; line, median; whiskers, nearest point to 1.5*IQR.

Extended Data Fig. 9 DEBIAS-M improves cross-study prediction of melanoma immunotherapy response and works within a single study.

a, Box and swarm plots of auROCs, each evaluating the cross-study generalization performance models using gut microbiome data to predict immunotherapy response in melanoma patients (defined as 12-month progression-free survival52). Each auROC is calculated on a held-out study. See Supplementary Table 6 for information on studies and sample sizes. ‘Log10 Linear’ denotes the pipeline used by Lee et al.52. Preprocessing with DEBIAS-M shows a consistent albeit small improvement across all studies, with a particularly strong effect for one study. b, Box and swarm plots of auROCS, each representing the aggregate accuracy over the held-out points for a logistic regression model evaluated under 5-fold cross validation. ‘Cervical carcinoma’, ‘Colorectal cancer’ and ‘HIV’ correspond to the benchmark datasets from Fig. 2d, Fig. 2b, and Fig. 2a, respectively, with information on studies and samples sizes in Supplementary Tables 5, 2 and 1. p, one-sided Wilcoxon signed-rank test comparing between linear and DEBIAS-M single-batch across both datasets. Box, IQR; line, median; whiskers, nearest point to 1.5*IQR.

Extended Data Fig. 10 Methodological consideration for DEBIAS-M.

a-c, An analysis performed on the HIV benchmark (analyzed in Fig. 2a) using a standard implementation of DEBIAS-M as well as one using data with an added pseudocount of one, which allows DEBIAS-M to add non-measured taxa, and in which feature values with relative abundance below a threshold of 10−4 were replaced with zero, which allows DEBIAS-M to completely remove a taxon. The average pairwise Jaccard index (a) and Bray-Curtis dissimilarity (b) between studies are lower for DEBIAS-M with pseudocount and thresholding, indicating an improved ability to reduce batch effects. However, this does not result in improved accuracy in the benchmark itself (c). d,e, Box and swarm plot of auROCs, each evaluating the generalization performance of DEBIAS-M to a held out study from the HIV (Fig. 2a; d) and colorectal cancer (Fig. 2b; e) benchmarks. In each case, we have applied DEBIAS-M to data aggregated to different taxonomic levels, as available from each of the datasets. DEBIAS-M becomes less effective with more aggregation, and, for colorectal cancer prediction, actually reduces predictive performance compared to uncorrected data after aggregation to family level. f,g, Box and swarm plot of auROCs, each evaluating the generalization performance of DEBIAS-M to a held out study from the HIV (Fig. 2a; f) and colorectal cancer (Fig. 2b; g) benchmarks using different prediction loss functions. Most alternative loss functions are not significantly different from the binary cross-entropy used as default by DEBIAS-M for classification. The L1 loss shows significant improvement on colorectal cancer prediction (p = 0.02; e), but a significant reduction in performance on the HIV benchmark (p = 0.002; d). h, i, The total runtime of DEBIAS-M on a standard laptop for simulated datasets with an increasing number of batches, each containing 96 samples (h), as well as for 8 batches, each with an increasing number of samples (i). DEBIAS-M runs in <15 minutes (median 187 seconds) even for 64 batches with 6,144 samples in total, and in <43 seconds (median 33 seconds) for 8 batches with 4,096 samples in total. Box, IQR; line, median; whiskers, nearest point to 1.5*IQR; p, two-sided Wilcoxon signed-rank test.

Supplementary information

Supplementary Information

Supplementary Notes 1–3 and Tables 1–7.

Reporting Summary

Supplementary Tables 1–5

Supplementary Table 1 | Study design and experimental metadata collated for HIV dataset. Table 2 | Area under the precision-recall curve for different analyses. Table 3 | Study information for colorectal cancer dataset. Table 4 | Study information for cervical intraepithelial neoplasia dataset. Table 5 | Study information for cervical carcinoma dataset. Table 6 | Study information for immunotherapy dataset. Table 7 | Exact P values displayed in figures.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Austin, G.I., Brown Kav, A., ElNaggar, S. et al. Processing-bias correction with DEBIAS-M improves cross-study generalization of microbiome-based prediction models. Nat Microbiol 10, 897–911 (2025). https://doi.org/10.1038/s41564-025-01954-4

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue date:

  • DOI: https://doi.org/10.1038/s41564-025-01954-4

This article is cited by

Search

Quick links

Nature Briefing

Sign up for the Nature Briefing newsletter — what matters in science, free to your inbox daily.

Get the most important science stories of the day, free in your inbox. Sign up for Nature Briefing