Table 1 Comparison of recent gLMs with multi-species and single-species training approaches

From: Genomic language models could transform medicine but not yet

| Model | Parameters | Sequence length (in bp) | Genomes trained on | Human genome included | Training type |
| --- | --- | --- | --- | --- | --- |
| GPN-MSA^9 | 86,000,000 | 128 | 100 | Yes | Multi-species |
| GPN^20 | 65,612,800* | 512 | 8 | No | Multi-species |
| Evo 2^1 | 40,000,000,000 | 1,000,000 | 128,000 | Yes | Multi-species |
| Nucleotide Transformer^6 | 2,500,000,000 | 6000 | 850 | Yes | Multi-species |
| DNABERT-2^21 | 117,000,000 | 877 (BPE) | 135 | Yes | Multi-species |
| DNABERT^13 (k = 6) | 110,000,000 | 512 | 1 | Yes | Single-genome |
| HyenaDNA^22 | 1,600,000 | 1,000,000 | 1 | Yes | Single-genome |
| GROVER^15 | 86,511,201* | 2076 (BPE) | 1 | Yes | Single-genome |

  1. This table compares eight gLMs by parameter count, training-data composition and input sequence length. Parameter numbers are taken from the corresponding paper where possible; where indicated by *, the parameter number was calculated by loading the HuggingFace version of the model. Models are categorized by their training approach (multi-species versus single-genome). Sequence lengths are given in DNA base pairs (bp). When a token represents multiple bp, the total input length was calculated by multiplying the number of tokens by the bp per token (e.g., the Nucleotide Transformer uses non-overlapping k-mers with k = 6, so 1000 tokens = 6000 bp). DNABERT-2 and GROVER use Byte Pair Encoding (BPE), in which token length varies according to the co-occurrence frequency of characters and a pre-defined vocabulary size. Note that DNABERT-2's sequence length estimate (877) corresponds to approximately 128 tokens at an average of 6.85 bp per BPE token (calculated from the HuggingFace vocabulary at https://huggingface.co/zhihan1996/DNABERT-2-117M/blob/main/tokenizer.json, excluding special tokens), and GROVER's sequence length (2076) corresponds to approximately 510 tokens at an average of 4.07 bp per BPE token.
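
The bp-per-token estimates in the footnote can be reproduced with a short script. The sketch below is a minimal illustration, assuming the public HuggingFace checkpoint zhihan1996/DNABERT-2-117M and the standard transformers AutoTokenizer API; it averages the character length of the non-special vocabulary entries and multiplies by a token budget. The exact exclusion rules behind the 6.85 bp figure are an assumption here, not the authors' documented procedure.

```python
# Minimal sketch of the bp-per-token estimate described in the table footnote.
# Assumes the public HuggingFace checkpoint "zhihan1996/DNABERT-2-117M" and that
# each vocabulary entry's character length equals its length in bp (i.e. the BPE
# tokens carry no prefix/suffix markers) -- assumptions, not the authors' exact method.
from transformers import AutoTokenizer


def estimate_input_bp(checkpoint: str, max_tokens: int) -> float:
    """Approximate model input length in bp as max_tokens x mean bp per token."""
    tok = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
    special = set(tok.all_special_tokens)  # e.g. [CLS], [SEP], [PAD], [UNK], [MASK]
    lengths = [len(t) for t in tok.get_vocab() if t not in special]
    mean_bp_per_token = sum(lengths) / len(lengths)
    return max_tokens * mean_bp_per_token


if __name__ == "__main__":
    # DNABERT-2: ~128 tokens x ~6.85 bp/token ≈ 877 bp, as reported in the table.
    print(round(estimate_input_bp("zhihan1996/DNABERT-2-117M", max_tokens=128)))
    # For fixed k-mer tokenizers the calculation is direct, e.g. the Nucleotide
    # Transformer's 1000 tokens x 6 bp/token = 6000 bp.
```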