Extended Data Fig. 1: Relative accuracy of WAAFLE, DarkHorse, and MetaCHIP on synthetic LGT and control contigs.
From: Profiling lateral gene transfer events in the human microbiome using WAAFLE

WAAFLE was penalized with a 20% holdout of its search database, while DarkHorse was evaluated using a translated version of the complete database, and MetaCHIP was evaluated without further constraints on its respective input. (a) DarkHorse achieved non-negligible sensitivity (TPR) only for the longest contigs (rightmost column) containing the most “extreme” LGT events (that is, events between pairs of species whose lowest common ancestor (LCA) is at the kingdom or phylum level). WAAFLE’s specificity (shown as FPR) is stratified according to the taxonomic level of the LGT LCA, as in Fig. 1 of the main text (for example, an intragenus false positive is counted as a true negative at the family level; x-axis). This level of stratification was not possible for DarkHorse, so a single FPR value is plotted at “genus” resolution for comparison. DarkHorse offered better specificity than WAAFLE on shorter contigs (where it made relatively few LGT calls) but not on the longest contigs. (b) An additional comparison was performed between WAAFLE and MetaCHIP using a separate synthetic dataset designed for MetaCHIP compatibility. TPR and FPR were computed and plotted as in (a), with TPR calculations restricted to taxonomic ranks assigned to at least 100 LGT LCAs (that is, kingdom, order, and genus). Results are stratified according to the completeness of the metagenomic bins into which LGT and control contigs were grouped. WAAFLE’s sensitivity here was similar to that observed in the preceding evaluations and consistently higher than MetaCHIP’s. While MetaCHIP’s specificity was correspondingly very high, WAAFLE again exhibited a peak FPR of only ~0.5% at the intragenus level, which improved at higher ranks. Notably, WAAFLE’s performance did not depend on bin completeness, while MetaCHIP proved less sensitive to LGT events in less-complete bins.
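
The rank-stratified FPR described in (a) can be illustrated with a minimal Python sketch. The rank list, the function name `fpr_by_rank`, and the toy counts are assumptions made for illustration; this is not WAAFLE's evaluation code and the numbers are not results from the paper. The idea is that a spurious call on a control contig counts as a false positive only at ranks at or below the LCA rank of the call, and as a true negative at any broader rank.

```python
# Hypothetical rank ordering, from most specific to broadest (assumption).
RANKS = ["genus", "family", "order", "class", "phylum", "kingdom"]


def fpr_by_rank(control_calls):
    """Compute a rank-stratified false positive rate on control contigs.

    control_calls holds one entry per control (non-LGT) contig:
    None if no LGT was called, otherwise the LCA rank of the called event.
    """
    if not control_calls:
        raise ValueError("at least one control contig is required")
    total = len(control_calls)
    rates = {}
    for level, rank in enumerate(RANKS):
        # A call is a false positive at this rank only if its LCA rank is
        # at least as broad; narrower calls count as true negatives here.
        false_positives = sum(
            1
            for call in control_calls
            if call is not None and RANKS.index(call) >= level
        )
        rates[rank] = false_positives / total
    return rates


# Toy data (made up): 1000 control contigs with 8 spurious calls.
toy_calls = [None] * 992 + ["genus"] * 5 + ["family"] * 2 + ["phylum"]
toy_rates = fpr_by_rank(toy_calls)
```

With this toy input, the five intragenus calls contribute to the FPR at the genus level only, so the rate can only stay the same or fall as the evaluation rank becomes broader.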