Extended Data Fig. 2: Uncharacterized protein families have comparable abundance distribution and sequence composition to known proteins. | Nature

From: Discovery of bioactive microbial gene products in inflammatory bowel disease

a, Nominally characterized and uncharacterized protein families were distinguished by homology-based search against UniRef90 (release 2019_01). We defined strong homology following the UniRef90 criterion of ≥90% identity and ≥80% coverage, remote homology as identity from 25% to 90% and coverage from 25% to 80%, and non-homologous proteins as those with <25% identity, <25% coverage or no hit to UniRef90 proteins. Here, we use ‘uncharacterized known proteins’ to refer to UniRef90 proteins that do not have any Gene Ontology annotations in UniProt (release 2019_01). The panel shows the distribution of prevalences and abundances of protein families across the four categories. b, The fractions of novel proteins (proteins with remote homology or without homology to known proteins) are comparable to those of known proteins across samples. c, Bray-Curtis dissimilarities over protein family profiles between samples from different participants, samples from the same participant over time, and technical replicates. Variability among novel proteins was more extreme than among known proteins, but less extreme than among known proteins with rare abundance (bottom 50%). Box plot boxes indicate quartiles and whiskers show inner fences. d, Uncharacterized proteins with comparable abundance to known proteins fit a neutral model of microbiome assembly (Methods). ‘Unclassified taxon’ indicates a group of genes that lack taxonomic information but can be binned into the same MSP based on co-abundance information. e–g, Uncharacterized proteins showed sequence composition similar to that of known proteins. Characterized and uncharacterized proteins had similar distributions of lengths of assembled contigs (e), protein lengths (f) and GC content (g).
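
The homology thresholds given for panel a can be read as a simple decision rule. The sketch below is illustrative only and is not the authors' pipeline: the function name `classify_homology` and its arguments are hypothetical, identity and coverage are assumed to be percentages for the best UniRef90 hit, and any hit that is neither strong nor non-homologous is assumed to count as remote (the caption does not say how, for example, a hit with 95% identity and 50% coverage is handled).

```python
from typing import Optional


def classify_homology(identity: Optional[float], coverage: Optional[float]) -> str:
    """Assign a homology category from the best UniRef90 hit (hypothetical helper).

    identity and coverage are percentages (0-100); pass None when there is no hit.
    """
    # No hit to any UniRef90 protein.
    if identity is None or coverage is None:
        return "non-homologous"
    # Below 25% identity or below 25% coverage.
    if identity < 25.0 or coverage < 25.0:
        return "non-homologous"
    # UniRef90 criterion: at least 90% identity and at least 80% coverage.
    if identity >= 90.0 and coverage >= 80.0:
        return "strong"
    # Assumption: everything in between is treated as remote homology.
    return "remote"
```

Under these assumptions, `classify_homology(50, 60)` falls in the remote category and `classify_homology(None, None)` is non-homologous. The ‘novel’ proteins of panel b would then be the union of the remote and non-homologous categories.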
