Table 1 Summary of datasets used.
Dataset name | Dataset description |
---|---|
GTDB representative set | Read-length DNA sequences from each of the 24,706 Bacterial and Archaeal representative genomes in the GTDB [51] |
GTDB class set | Reduced set of read-length sequences from a representative genome of each class in the GTDB [51] taxonomy |
mi-faser functional set | Functionally annotated reads from 100 metagenomes from evenly distributed environmental packages |
Swiss-Prot functional set | DNA read-length sequences of genes with experimentally validated functions from the Swiss-Prot database |
OG homolog set | Homologous and nonhomologous pairs of gene sequences from 1,000 orthologous groups in the OrthoDB database, defined at multiple taxonomic levels: genus, family, order, class, and phylum |
Oxidoreductase model set | Read-length DNA sequences from genes corresponding to Bacterial and Archaeal oxidoreductases from the manually reviewed entries of the Swiss-Prot database |
Oxidoreductase metagenome set | Sequencing reads from 16 marine metagenomes, rarefied to 20 million sequences each, from latitudes spanning −62 to 76 degrees and two depths (surface and mesopelagic). Mesopelagic depths at four stations corresponded to an oxygen minimum zone (OMZ) |
Reading frame set | Read-length sequences, and labels corresponding to their true frame of translation, for gene coding sequences from one genome selected from each order in the GTDB taxonomy |
Optimal temp set | Read-length sequences from core genes associated with transcription and translation, and labels corresponding to their optimal enzyme temperature, inferred from the manually curated optimal growth temperature of 19,474 genomes |
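
The table states that each marine metagenome was rarefied to a fixed number of sequences but does not say how. As an illustration only, the sketch below shows one common way to rarefy a read set: uniform random subsampling without replacement, implemented as reservoir sampling so the full read set never has to be held in memory. The function name `rarefy`, its parameters, and the choice of reservoir sampling are assumptions made for this example, not the authors' procedure.

```python
import random
from typing import Iterable, List, Optional


def rarefy(reads: Iterable[str], depth: int, seed: Optional[int] = None) -> List[str]:
    """Uniformly subsample `depth` reads without replacement from a stream.

    Hypothetical helper: uses reservoir sampling (Algorithm R), so only
    `depth` reads are kept in memory at any time.
    """
    rng = random.Random(seed)  # local RNG so a seed gives reproducible subsamples
    reservoir: List[str] = []
    for i, read in enumerate(reads):
        if i < depth:
            # Fill the reservoir with the first `depth` reads.
            reservoir.append(read)
        else:
            # Keep the new read with probability depth / (i + 1),
            # replacing a uniformly chosen read already in the reservoir.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < depth:
                reservoir[j] = read
    if len(reservoir) < depth:
        raise ValueError(
            f"only {len(reservoir)} reads available; cannot rarefy to {depth}"
        )
    return reservoir


if __name__ == "__main__":
    # Toy example: 1,000 placeholder reads rarefied to 100.
    toy_reads = (f"read_{i}" for i in range(1000))
    subsample = rarefy(toy_reads, depth=100, seed=0)
    print(len(subsample))
```

For the metagenome set described above, `depth` would be 20,000,000 and `reads` would be an iterator over the records of a sequencing file. Fixing the seed makes the subsample reproducible across runs.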