Extended Data Fig. 1: Overview of model architecture, training procedure, datasets, and evaluations for Evo 2. | Nature

From: Genome modelling and design across all domains of life with Evo 2

(a) Data composition of OpenGenome2: total eukaryotic genomes per kingdom (left), total base pairs per training data subset (middle), and a detailed breakdown of the other/augmented training data subset (right). (b) Core input-dependent convolution operators in StripedHyena 2, with a diagram showing how they are composed in the architecture. (c) Scaling ablations on OpenGenome2, showing the loss convergence of multi-hybrid models compared with previous-generation hybrids and Transformers. Models of 7 billion parameters are compared after pretraining on the same 400 billion tokens. (d) Needle-in-a-haystack performance of Evo 2 7B, spanning input contexts of 512 to 1 million tokens. (e) A null distribution of needle-in-a-haystack scores, generated by randomly shuffling the needle sequence across a sweep of haystack lengths and needle positions and computing the resulting retrieval score from a categorical Jacobian analysis (Methods). The distribution of N = 1040 scores is plotted here. At our score cutoff of 0.8, we reject the null hypothesis of no retrieval with a nominal P < 0.001.
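The significance logic in panel (e) can be sketched as an empirical (permutation-style) P-value computed against the null distribution of shuffled-needle scores. The sketch below is illustrative only: the actual categorical-Jacobian retrieval score is defined in the paper's Methods, and the simulated null distribution here (a Beta draw) is an assumption standing in for real shuffled-needle scores, not the published data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for shuffled-needle retrieval scores: the real
# null distribution comes from re-scoring randomly shuffled needles via
# the categorical Jacobian analysis (Methods); a Beta draw is used here
# purely to have N = 1040 null samples to work with.
null_scores = rng.beta(2.0, 8.0, size=1040)

def empirical_p_value(observed_score: float, null: np.ndarray) -> float:
    """One-sided empirical P-value: fraction of null scores at least as
    large as the observed score, with the standard +1 correction so the
    P-value is never exactly zero."""
    return (1 + np.sum(null >= observed_score)) / (1 + null.size)

# With N = 1040 null samples, the smallest attainable P-value is
# 1 / 1041 ~= 0.00096, so a score exceeding every null score can be
# reported as P < 0.001, consistent with the caption's cutoff of 0.8.
p = empirical_p_value(0.8, null_scores)
```

This shows why N = 1040 is just enough to support the nominal P < 0.001 claim: the minimum achievable empirical P-value, 1/(N + 1), only drops below 0.001 once N exceeds 999.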
