Fig. 4: Analysis of feature importance and novel taxa.
From: Taxometer: Improving taxonomic classification of metagenomics contigs

a Contribution of abundance and TNF features to Taxometer performance, demonstrated on the CAMI2 Airways short-read dataset. Shown is the number of correctly predicted contig labels at each taxonomic level using a score threshold of 0.5. b Simulation analysis of unknown taxa. X-axis: Pearson correlation coefficient between the mean feature vectors of the deleted and the assigned species. Y-axis: ratio between the number of contigs of the deleted species (“deleted”) and the number of contigs of the species that was most prevalent among the incorrect assignments (“assigned”), both counted in the training set. Color indicates the share of correctly missing labels, equal to 1 − FP, where FP is the share of false positives. FP is high when the assigned species was more prevalent in the training set than the deleted species and when TNFs and abundances are highly correlated between the deleted and the assigned species. Source data are provided as a Source Data file.
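The three quantities plotted in panel b can be illustrated with a minimal sketch. This is not the authors' code: the function names, the array layout (one row per contig, one column per feature), and the toy numbers are all assumptions made for illustration, and FP is taken here to be the fraction of the deleted species' contigs that still received a species label.

```python
import numpy as np


def pearson_of_means(deleted_features, assigned_features):
    """X-axis: Pearson r between the mean feature vectors of two species.

    Each argument is a (contigs x features) array (hypothetical layout).
    """
    mean_deleted = np.asarray(deleted_features, dtype=float).mean(axis=0)
    mean_assigned = np.asarray(assigned_features, dtype=float).mean(axis=0)
    return float(np.corrcoef(mean_deleted, mean_assigned)[0, 1])


def contig_ratio(n_deleted, n_assigned):
    """Y-axis: contig count of the deleted species over that of the assigned one."""
    return n_deleted / n_assigned


def share_correctly_missing(n_false_positive, n_total):
    """Color: 1 - FP, with FP assumed to be n_false_positive / n_total."""
    return 1.0 - n_false_positive / n_total


# Toy data, invented for illustration only.
deleted = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
assigned = [[2.0, 4.0, 6.0], [4.0, 6.0, 8.0]]

x = pearson_of_means(deleted, assigned)
y = contig_ratio(n_deleted=50, n_assigned=200)
color = share_correctly_missing(n_false_positive=30, n_total=100)
print(x, y, color)
```

Under these assumptions, a point with a high x value and a low y value (a closely correlated, much more prevalent assigned species) is where the caption reports a high FP, and therefore a low color value.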