Table 1. Metrics computed on the 233,756 protein functional clusters (PFCs) from the sequence similarity network of MAG proteins.

From: Towards omics-based predictions of planktonic functional composition from environmental data

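| Score group | Subgroup | Metric | Value |
| --- | --- | --- | --- |
| PFC size | | Mean | 3.24 |
| PFC size | | Minimum | 2 |
| PFC size | | Maximum | 1,072 |
| Functional scores | Homogeneity | Mean homogeneity score with eggNOG annotations (number of NA values) | 0.94 (35,818) |
| Functional scores | Homogeneity | Mean homogeneity score with KEGG annotations (number of NA values) | 0.99 (113,321) |
| Functional scores | Unknowns quantification: eggNOG annotations | PFCs only composed of annotated proteins (% of total PFCs) | 181,595 (77.7%) |
| Functional scores | Unknowns quantification: eggNOG annotations | PFCs with at least one annotated protein (% of total PFCs) | 197,938 (84.7%) |
| Functional scores | Unknowns quantification: eggNOG annotations | PFCs only composed of unknown proteins (% of total PFCs) | 35,818 (15.3%) |
| Functional scores | Unknowns quantification: KEGG annotations | PFCs only composed of annotated proteins (% of total PFCs) | 91,103 (39.0%) |
| Functional scores | Unknowns quantification: KEGG annotations | PFCs with at least one annotated protein (% of total PFCs) | 120,435 (51.5%) |
| Functional scores | Unknowns quantification: KEGG annotations | PFCs only composed of unknown proteins (% of total PFCs) | 113,321 (48.5%) |
| Taxonomy scores | Homogeneity | PFCs associated with only 1 Phylum (% of total PFCs) (% of PFCs with at least one Phylum annotation) | 221,541 (94.8%) (97.5%) |
| Taxonomy scores | Homogeneity | PFCs associated with only 1 Class (% of total PFCs) (% of PFCs with at least one Class annotation) | 192,095 (82.2%) (96.8%) |
| Taxonomy scores | Homogeneity | PFCs associated with only 1 Order (% of total PFCs) (% of PFCs with at least one Order annotation) | 144,265 (61.7%) (93.8%) |
| Taxonomy scores | Homogeneity | PFCs associated with only 1 Family (% of total PFCs) (% of PFCs with at least one Family annotation) | 100,801 (43.12%) (95.3%) |
| Taxonomy scores | Homogeneity | PFCs associated with only 1 Genus (% of total PFCs) (% of PFCs with at least one Genus annotation) | 21,921 (9.4%) (91.9%) |
| Taxonomy scores | Homogeneity | PFCs associated with only 1 MAG (% of total PFCs) | 7,146 (3.1%) |
| Taxonomy scores | Unknowns quantification: Phylum level | Only proteins from annotated MAGs (% of total PFCs) | 220,839 (94.5%) |
| Taxonomy scores | Unknowns quantification: Phylum level | Only proteins from unannotated MAGs (% of total PFCs) | 6,367 (2.7%) |
| Taxonomy scores | Unknowns quantification: Class level | Only proteins from annotated MAGs (% of total PFCs) | 186,331 (79.7%) |
| Taxonomy scores | Unknowns quantification: Class level | Only proteins from unannotated MAGs (% of total PFCs) | 35,338 (15.1%) |
| Taxonomy scores | Unknowns quantification: Order level | Only proteins from annotated MAGs (% of total PFCs) | 135,046 (57.8%) |
| Taxonomy scores | Unknowns quantification: Order level | Only proteins from unannotated MAGs (% of total PFCs) | 79,921 (34.2%) |
| Taxonomy scores | Unknowns quantification: Family level | Only proteins from annotated MAGs (% of total PFCs) | 88,404 (37.8%) |
| Taxonomy scores | Unknowns quantification: Family level | Only proteins from unannotated MAGs (% of total PFCs) | 128,010 (54.76%) |
| Taxonomy scores | Unknowns quantification: Genus level | Only proteins from annotated MAGs (% of total PFCs) | 13,544 (5.8%) |
| Taxonomy scores | Unknowns quantification: Genus level | Only proteins from unannotated MAGs (% of total PFCs) | 209,892 (89.8%) |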
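
The "% of total PFCs" figures can be cross-checked from the counts themselves. The short sketch below recomputes a few of them against the 233,756 PFCs given in the caption, and checks that, for each annotation database, the PFCs with at least one annotated protein and the PFCs composed only of unknown proteins add up to the total. The function name `pct_of_total` and the dictionary layout are illustrative choices, not part of the original analysis.

```python
# Total number of PFCs, as stated in the table caption.
TOTAL_PFCS = 233_756


def pct_of_total(count: int, total: int = TOTAL_PFCS) -> float:
    """Return `count` as a percentage of `total`, rounded to one decimal place."""
    return round(100 * count / total, 1)


# Functional "unknowns quantification" counts copied from the table.
functional = {
    "eggNOG": {"at_least_one_annotated": 197_938, "only_unknown": 35_818},
    "KEGG": {"at_least_one_annotated": 120_435, "only_unknown": 113_321},
}

for database, counts in functional.items():
    # A PFC either contains at least one annotated protein or none at all,
    # so these two counts should partition the full set of PFCs.
    assert sum(counts.values()) == TOTAL_PFCS
    for label, count in counts.items():
        print(database, label, pct_of_total(count))
```

The same helper applies to the taxonomy rows, for example `pct_of_total(220_839)` for PFCs containing only proteins from MAGs annotated at the phylum level.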

  1. Functional scores are based on the functional annotation of MAG proteins: a functional homogeneity score of 1 means that all proteins in a PFC share the same annotation, while a score of 0 means that all proteins have different annotations (see “Methods” for details). “Unknown proteins” refers both to sequences with no match in the databases (KEGG and/or eggNOG) and to sequences present in the databases but lacking functional and/or taxonomic annotation. Taxonomy scores are based on the taxonomic annotations of MAGs available from Delmont et al. (ref. 21). Accordingly, the 6,367 PFCs containing only proteins from MAGs unannotated at the phylum level consist solely of proteins from the 45 bacterial MAGs of unidentified phylum. Detailed functional and taxonomic annotations for each protein sequence are available online, as are detailed sizes and functional/taxonomy scores for each PFC (see “Data availability”).
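
To make the two endpoints of the homogeneity score concrete, here is a minimal sketch of one score that behaves as described: 1 when all annotated proteins in a PFC share the same annotation, 0 when they all differ. The exact formula is given in the “Methods” and is not reproduced here, so the linear interpolation between the endpoints, the choice to ignore unknown proteins in mixed PFCs, and the name `homogeneity_score` are all assumptions made for illustration. Returning no value for PFCs without any annotated protein mirrors the table, where the number of NA values (35,818 for eggNOG, 113,321 for KEGG) equals the number of PFCs composed only of unknown proteins.

```python
from typing import Optional, Sequence


def homogeneity_score(annotations: Sequence[Optional[str]]) -> Optional[float]:
    """Illustrative homogeneity score for one PFC (assumed formula, not the paper's).

    `annotations` holds one label per protein; None marks an unknown protein.
    """
    # Assumption: unknown proteins are ignored when scoring a mixed PFC.
    known = [label for label in annotations if label is not None]
    if not known:
        # No annotated protein: the score is undefined (NA).
        return None
    if len(known) == 1:
        # Assumption: a single annotated protein counts as fully homogeneous.
        return 1.0
    distinct = len(set(known))
    # 1 distinct label gives 1.0; as many labels as proteins gives 0.0.
    return 1.0 - (distinct - 1) / (len(known) - 1)


print(homogeneity_score(["K00001", "K00001", "K00001"]))  # all identical
print(homogeneity_score(["K00001", "K00002", "K00003"]))  # all different
print(homogeneity_score([None, None]))                    # only unknown proteins
```

The labels such as `K00001` are placeholders standing in for whatever annotation identifiers are compared; any hashable label would work in this sketch.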