Extended Data Fig. 6: SAE overview with training data, metrics, and feature embedding.
From: Genome modelling and design across all domains of life with Evo 2

(a) Composition of the prokaryotic sequences randomly subsampled for SAE training. (b) Composition of the eukaryotic sequences randomly subsampled for SAE training. (c) Feature density histograms for the SAEs show that the layer 26 SAE converged with fewer low-frequency features, with the distribution peaking around a firing frequency of 1e-3, representing sparse yet generalizing features. (d) Activation density for all layer 26 SAE features over the E. coli K12 MG1655 genome (left) or a length-matched segment of human chromosome 17 (right). (e) Mean non-zero activation for all layer 26 SAE features over the E. coli K12 MG1655 genome (left) or a length-matched segment of human chromosome 17 (right). (f) UMAP embedding of layer 26 SAE feature weights, colored by the difference in activation density between eukaryotic and prokaryotic sequences for each feature, with the features presented in Fig. 4 labeled.
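
The per-feature quantities in panels (d) and (e) can be illustrated with a minimal sketch. It assumes that activation density means the fraction of sequence positions at which a feature has a non-zero activation, and that mean non-zero activation averages only over those positions; the caption does not spell out either definition. The function name `feature_stats` and the toy activation matrix are hypothetical and are not taken from the Evo 2 codebase.

```python
from typing import List, Sequence, Tuple


def feature_stats(acts: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Per-feature activation density and mean non-zero activation.

    acts[t][j] is the (non-negative) activation of SAE feature j at
    sequence position t. Both definitions below are assumptions:
      density[j]      = fraction of positions where feature j fires (> 0)
      mean_nonzero[j] = mean activation over the positions where it fires
    """
    n_positions = len(acts)
    n_features = len(acts[0])
    density: List[float] = []
    mean_nonzero: List[float] = []
    for j in range(n_features):
        fired = [row[j] for row in acts if row[j] > 0]
        density.append(len(fired) / n_positions)
        # A feature that never fires gets 0.0 rather than a division by zero.
        mean_nonzero.append(sum(fired) / len(fired) if fired else 0.0)
    return density, mean_nonzero


# Toy example: 4 positions, 3 features (illustrative values only).
toy = [
    [0.0, 2.0, 0.0],
    [0.0, 4.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
density, mean_nonzero = feature_stats(toy)
print(density)       # [0.25, 0.5, 0.0]
print(mean_nonzero)  # [1.0, 3.0, 0.0]
```

Under the same assumptions, the coloring in panel (f) would correspond to subtracting, feature by feature, the density computed on one sequence set from the density computed on the other.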