Extended Data Fig. 4: Novel protein families can be taxonomically annotated and greatly expand pangenomes of common gut taxa.
From: Discovery of bioactive microbial gene products in inflammatory bowel disease

a, Schematic of MetaWIBELE’s guilt-by-association approach for per-protein-family taxonomic annotation, which leverages co-abundance profiles (MSPs). If reference sequence annotations are consistent within a group of co-varying proteins, their most-specific shared taxonomy can be transferred to other sequences within the family. b, We validated this novel taxonomic annotation method on a 20% holdout set of known proteins. c, To optimize the parameters, we tested different cut-offs for the fraction of protein families between the most and second-most dominant taxon within an MSP, using the holdout set in b. Stringent cut-offs (i.e., requiring more consistently classified taxa) reduced the power of taxonomic assignment at more specific levels (e.g., species or genus) but controlled false positives. Lenient cut-offs (i.e., requiring less consistently classified taxa) gave good sensitivity for species- or genus-level assignment but introduced more spurious assignments. This sensitivity-specificity trade-off is best balanced at our default cut-off value of 0.5. d, Comparison of taxonomic annotations by homology-based and guilt-by-association approaches. e, The top 25 genera with the highest number of newly annotated proteins (Supplementary Table 3). The first row indicates the number of genomes in RefSeq per genus. The second row indicates the mean relative abundance of known (i.e., SC and SU) and novel (i.e., RH and NH) proteins, in which red dots represent the mean of known proteins and blue dots represent the mean of novel proteins. f, Uncharacterized proteins expanded the pangenomes of common gut taxa. Each clade represents one genus. Circle bars show the relative abundance of different categories of protein families. g, Similar representative genera with dominant abundance were identified in HMP2 and MetaHIT. The top 50 genera (with highest mean abundance) were selected for plotting. Box plot boxes indicate quartiles and whiskers show inner fences.
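The guilt-by-association idea in panels a and c can be illustrated with a minimal sketch. This is not MetaWIBELE's implementation: the function names (`shared_taxonomy`, `annotate_msp`), the lineage encoding as tuples, and the reading of the cut-off as the margin between the fractions of the most and second-most dominant taxon are all assumptions made for illustration.

```python
from collections import Counter

def shared_taxonomy(lineages, cutoff=0.5):
    """Return the most specific lineage that dominates a co-abundance group.

    `lineages` holds one entry per protein family: a tuple ordered from
    broad to specific rank, or None if the family has no annotation.
    ASSUMPTION: a taxon is accepted when the fraction of annotated families
    supporting it exceeds that of the runner-up by at least `cutoff`.
    """
    annotated = [tuple(l) for l in lineages if l]
    if not annotated:
        return ()
    depth = max(len(l) for l in annotated)
    # Walk from the most specific rank toward the root.
    for level in range(depth, 0, -1):
        counts = Counter(l[:level] for l in annotated if len(l) >= level)
        ranked = counts.most_common(2)
        top_taxon, top_n = ranked[0]
        second_n = ranked[1][1] if len(ranked) > 1 else 0
        margin = (top_n - second_n) / len(annotated)
        if margin >= cutoff:
            return top_taxon
    return ()

def annotate_msp(msp, cutoff=0.5):
    """Transfer the shared lineage to unannotated families in one group."""
    consensus = shared_taxonomy(list(msp.values()), cutoff)
    return {fam: (lin if lin else consensus) for fam, lin in msp.items()}

# Hypothetical group: three annotated families and one novel family.
msp = {
    "fam1": ("Bacteroidetes", "Bacteroides", "B. fragilis"),
    "fam2": ("Bacteroidetes", "Bacteroides", "B. fragilis"),
    "fam3": ("Bacteroidetes", "Bacteroides", "B. ovatus"),
    "fam4": None,
}
default_result = annotate_msp(msp)               # cut-off 0.5
lenient_result = annotate_msp(msp, cutoff=0.3)   # more permissive
```

In this toy group the species-level margin is 1/3, so the default cut-off of 0.5 falls back to the genus, whereas a lenient cut-off of 0.3 assigns the species. This mirrors the trade-off described in panel c: lenient settings reach more specific ranks but on weaker evidence.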