Table 1 Summary table of datasets used.

From: Deep learning of a bacterial and archaeal universal language of life enables transfer learning and illuminates microbial dark matter

| Dataset name | Dataset description |
| --- | --- |
| GTDB representative set | Read-length DNA sequences from each of the 24,706 Bacterial and Archaeal representative genomes in the GTDB (ref. 51) |
| GTDB class set | Reduced set of read-length sequences from a representative genome of each class in the GTDB (ref. 51) taxonomy |
| mi-faser functional set | Functionally annotated reads from 100 metagenomes from evenly distributed environmental packages |
| Swiss-Prot functional set | DNA read-length sequences of genes with experimentally validated functions from the Swiss-Prot database |
| OG homolog set | Homologous and nonhomologous sequence pairs of gene sequences from 1000 orthologous groups from the OrthoDB database defined at multiple taxonomic levels: genus, family, order, class, and phylum |
| Oxidoreductase model set | Read-length DNA sequences from genes corresponding to Bacterial and Archaeal oxidoreductases from the manually reviewed entries of the Swiss-Prot database |
| Oxidoreductase metagenome set | Sequencing reads from 16 marine metagenomes, rarefied to 20 million sequences each, from latitudes spanning −62 to 76 degrees and two depths: surface and mesopelagic. Mesopelagic depths at 4 stations corresponded to an oxygen minimum zone (OMZ) |
| Reading frame set | Read-length sequences, and labels corresponding to their true frame of translation, for gene coding sequences from one genome selected from each order in the GTDB taxonomy |
| Optimal temp set | Read-length sequences from core genes associated with transcription and translation, and labels corresponding to their optimal enzyme temperature, inferred from the manually curated optimal growth temperature of 19,474 genomes |
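
The oxidoreductase metagenome set is described as rarefied to 20 million sequences per metagenome. The table does not say which tool or procedure was used, so the following is only a minimal sketch of one common way to rarefy: drawing a fixed number of reads uniformly at random, without replacement, using reservoir sampling. The function name `rarefy` and its parameters `depth` and `seed` are hypothetical and chosen for illustration.

```python
import random
from typing import Iterable, List, Optional


def rarefy(reads: Iterable[str], depth: int, seed: Optional[int] = None) -> List[str]:
    """Uniformly subsample `depth` reads without replacement.

    Uses reservoir sampling, so the input can be streamed once and
    never has to fit in memory. This is an illustrative assumption,
    not the procedure used to build the dataset.
    """
    rng = random.Random(seed)
    reservoir: List[str] = []
    for i, read in enumerate(reads):
        if i < depth:
            # Fill the reservoir with the first `depth` reads.
            reservoir.append(read)
        else:
            # Keep the new read with probability depth / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < depth:
                reservoir[j] = read
    if len(reservoir) < depth:
        raise ValueError("fewer reads available than the requested depth")
    return reservoir


# Toy example: rarefy 1,000 placeholder reads to a depth of 50.
toy_reads = (f"read{i}" for i in range(1000))
subsample = rarefy(toy_reads, depth=50, seed=0)
```

For the depth given in the table, `depth` would be 20,000,000 and `reads` would be an iterator over the sequences of one metagenome. Fixing `seed` makes the subsample reproducible.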