Table 1 Comparison of recent gLMs with multi-species and single-species training approaches

Model	Parameters	Sequence length (in bp)	Genomes trained on	Human genome included	Training type
GPN MSA⁹	86,000,000	128	100	Yes	Multi-species
GPN²⁰	65,612,800^*	512	8	No	Multi-species
Evo2¹	40,000,000,000	1,000,000	128,000	Yes	Multi-species
Nucleotide transformer⁶	2,500,000,000	6000	850	Yes	Multi-species
DNABERT-2²¹	117,000,000	877 (BPE)	135	Yes	Multi-species
DNABERT¹³ (k = 6)	110,000,000	512	1	Yes	Single-genome
HyenaDNA²²	1,600,000	1,000,000	1	Yes	Single-genome
GROVER¹⁵	86,511,201^*	2076 (BPE)	1	Yes	Single-genome

This table compares eight gLMs based on their number of parameters, training data composition, and sequence handling capabilities. Parameter numbers are taken from the paper where possible, where indicated by^*, the model's parameter number is calculated from loading the HuggingFace model version. Models are categorized by their training approach (multi-species vs single-genome). For sequence length calculation, measurements are in DNA base pairs (bp). When tokens represent multiple bp, the total input length was calculated by multiplying tokens by bp per token (e.g., nucleotide transformer uses non-overlapping k-mers where k = 6, so 1000 tokens = 6000 bp). DNABERT-2 and GROVER use Byte Pair Encoding (BPE), which has varying length tokens based on the co-occurrence frequency of the characters and a pre-defined vocabulary size. Note that DNABERT-2’s sequence length estimation (877) represents approximately 128 tokens at 6.85 bp average per BPE token (calculated from HuggingFace https://huggingface.co/zhihan1996/DNABERT-2-117M/blob/main/tokenizer.json vocabulary excluding special tokens), and GROVER’s sequence length (2076) represents approximately 510 tokens at 4.07 bp average per BPE token.

Quick links

Search