Extended Data Fig. 10: Methodological considerations for DEBIAS-M.

a-c, An analysis performed on the HIV benchmark (analyzed in Fig. 2a) using a standard implementation of DEBIAS-M as well as one using data with an added pseudocount of one, which allows DEBIAS-M to add non-measured taxa, and in which feature values with relative abundance below a threshold of 10⁻⁴ were replaced with zero, which allows DEBIAS-M to completely remove a taxon. The average pairwise Jaccard index (a) and Bray-Curtis dissimilarity (b) between studies are lower for DEBIAS-M with pseudocount and thresholding, indicating an improved ability to reduce batch effects. However, this does not result in improved accuracy in the benchmark itself (c). d,e, Box and swarm plots of auROCs, each evaluating the generalization performance of DEBIAS-M to a held-out study from the HIV (Fig. 2a; d) and colorectal cancer (Fig. 2b; e) benchmarks. In each case, we applied DEBIAS-M to data aggregated to different taxonomic levels, as available from each of the datasets. DEBIAS-M becomes less effective with more aggregation and, for colorectal cancer prediction, actually reduces predictive performance compared to uncorrected data after aggregation to the family level. f,g, Box and swarm plots of auROCs, each evaluating the generalization performance of DEBIAS-M to a held-out study from the HIV (Fig. 2a; f) and colorectal cancer (Fig. 2b; g) benchmarks using different prediction loss functions. Most alternative loss functions are not significantly different from the binary cross-entropy used as default by DEBIAS-M for classification. The L1 loss shows a significant improvement on colorectal cancer prediction (p = 0.02; g), but a significant reduction in performance on the HIV benchmark (p = 0.002; f). h,i, The total runtime of DEBIAS-M on a standard laptop for simulated datasets with an increasing number of batches, each containing 96 samples (h), as well as for 8 batches, each with an increasing number of samples (i). 
DEBIAS-M runs in <15 minutes (median 187 seconds) even for 64 batches with 6,144 samples in total, and in <43 seconds (median 33 seconds) for 8 batches with 4,096 samples in total. Box, IQR; line, median; whiskers, nearest point to 1.5 × IQR; p, two-sided Wilcoxon signed-rank test.
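
The pseudocount and thresholding steps in a-c, and the two between-study measures, can be illustrated with a minimal sketch. This is not the DEBIAS-M implementation: the function names are hypothetical, the stage of the pipeline at which thresholding is applied is an assumption, and the Jaccard index is assumed to be computed on taxon presence/absence.

```python
import numpy as np

def add_pseudocount(counts, pseudocount=1.0):
    """Add a pseudocount to every entry so no taxon is exactly zero."""
    return np.asarray(counts, dtype=float) + pseudocount

def to_relative_abundance(counts):
    """Normalize each sample (row) to sum to one. Assumes no all-zero rows."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)

def zero_below_threshold(rel_abund, threshold=1e-4):
    """Replace relative abundances below the threshold with zero."""
    rel = np.asarray(rel_abund, dtype=float).copy()
    rel[rel < threshold] = 0.0
    return rel

def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two non-negative profiles."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return np.abs(u - v).sum() / (u + v).sum()

def jaccard_index(u, v):
    """Jaccard index on presence/absence. Assumes at least one taxon is present."""
    a, b = np.asarray(u) > 0, np.asarray(v) > 0
    return np.logical_and(a, b).sum() / np.logical_or(a, b).sum()

# Toy example: two samples, three taxa (made-up counts, not data from the paper).
counts = np.array([[0, 50, 50],
                   [10, 0, 90]])
rel = to_relative_abundance(add_pseudocount(counts))
rel_thresholded = zero_below_threshold(rel, threshold=1e-4)
```

In this sketch the pseudocount makes every taxon non-zero before correction, while the threshold lets very small values return to exact zeros, which mirrors the two effects described for panels a-c.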