Extended Data Fig. 3: Using test set labels during batch correction can drastically increase measured predictive performance in downstream benchmarks.

a-b, The same cross-study colorectal cancer prediction benchmark as in Fig. 2b and Extended Data Fig. 2e, respectively, except that percnorm, ConQuR, and Voom-SNM were provided with all colorectal cancer labels, including those of the test set, during batch correction (as in Extended Data Fig. 1d). The prediction accuracy (auROC) of certain methods was drastically inflated relative to the primary benchmark (Fig. 2b and Extended Data Fig. 2e), highlighting potential issues with assessing a batch-correction method by measuring how well a downstream machine learning model predicts information that was used during batch correction. This trend is consistent in both relative abundance space (a) and centered log-ratio space (b). Voom-SNM was not run for (b) because its output is neither in non-negative relative abundance space nor in count space. See Supplementary Table 2 for information on studies and sample sizes. c, The same cross-study HIV prediction benchmark as in Fig. 2a, comparing the accuracy on each held-out study for ConQuR, percnorm, and Voom-SNM when provided with gender or age as a covariate (see Extended Data Fig. 1b). See Supplementary Table 1 for information on studies and sample sizes. p, one-sided Wilcoxon signed-rank test. Across all panels: box, IQR; line, median; whiskers, nearest point to 1.5 × IQR. Each point corresponds to the predictive performance of the specified model and dataset on a held-out study.
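
The leakage described for panels a-b can be illustrated with a minimal, purely simulated sketch. Everything in it is an assumption made for illustration: the data are random noise with a per-study offset, the function names (`simulate`, `correct_blind`, `correct_with_labels`, `heldout_auroc`) are hypothetical, the label-aware step is a deliberately simple per-class mean alignment rather than percnorm, ConQuR, or Voom-SNM, and the classifier is a difference-of-centroids score rather than the model used in the benchmark. Because the features carry no signal, any held-out auROC well above 0.5 would come only from labels seen during correction.

```python
import random
from statistics import fmean


def simulate(n_studies=3, n_per_study=60, n_features=300, seed=0):
    """Pure-noise features plus a per-study offset; labels carry no signal."""
    rng = random.Random(seed)
    rows = []  # each row: (study, label, feature vector)
    for study in range(n_studies):
        offset = rng.gauss(0, 2)  # hypothetical study-specific batch effect
        for i in range(n_per_study):
            x = [offset + rng.gauss(0, 1) for _ in range(n_features)]
            rows.append((study, i % 2, x))  # alternating labels, balanced
    return rows


def col_means(vectors):
    """Per-feature mean over a list of feature vectors."""
    return [fmean(col) for col in zip(*vectors)]


def correct_blind(rows):
    """Label-blind correction: subtract each study's per-feature mean."""
    out = []
    for study in sorted({r[0] for r in rows}):
        grp = [r for r in rows if r[0] == study]
        mu = col_means([x for _, _, x in grp])
        out += [(s, y, [v - m for v, m in zip(x, mu)]) for s, y, x in grp]
    return out


def correct_with_labels(rows):
    """Label-aware correction (assumed toy method): within every study, move
    each class mean onto the pooled class mean, using ALL labels, including
    those of the study that is later held out."""
    centered = correct_blind(rows)
    pooled = {y: col_means([x for _, lab, x in centered if lab == y])
              for y in (0, 1)}
    out = []
    for study in sorted({r[0] for r in centered}):
        for y in (0, 1):
            grp = [x for s, lab, x in centered if s == study and lab == y]
            mu = col_means(grp)
            out += [(study, y, [v - m + g for v, m, g in zip(x, mu, pooled[y])])
                    for x in grp]
    return out


def auroc(scores, labels):
    """auROC as the fraction of (case, control) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def heldout_auroc(rows, test_study):
    """Fit a difference-of-centroids score on the training studies only and
    evaluate it on the held-out study."""
    train = [r for r in rows if r[0] != test_study]
    test = [r for r in rows if r[0] == test_study]
    mu1 = col_means([x for _, y, x in train if y == 1])
    mu0 = col_means([x for _, y, x in train if y == 0])
    w = [a - b for a, b in zip(mu1, mu0)]
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for _, _, x in test]
    return auroc(scores, [y for _, y, _ in test])


rows = simulate()
blind = heldout_auroc(correct_blind(rows), test_study=2)
leaky = heldout_auroc(correct_with_labels(rows), test_study=2)
print(f"label-blind correction, held-out auROC: {blind:.2f}")
print(f"label-aware correction, held-out auROC: {leaky:.2f}")
```

In this toy setting the classifier never sees the held-out labels directly. The inflation arises because the label-aware correction imprints the same class-dependent shift on the training and held-out studies, which the downstream model can then pick up.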