Extended Data Fig. 7: SAE features reveal semantic, structural, and organizational details of prokaryotic genomes.
From: Genome modelling and design across all domains of life with Evo 2

(a) Diagram depicting the contrastive feature search strategy used to identify and quantify selected features. (b) Mean activations of the phage feature across phage regions annotated by geNomad and matched-length bacterial non-phage sequences across 100 randomly selected GTDB genomes. AUROC was computed over mean activations for each region. (c) (i) Scrambling spacer sequences does not ablate the phage feature activation pattern on spacer sequences. (ii) Using a constant scrambled CRISPR direct repeat sequence ablates phage feature activation for the first two spacers. (iii) Using different scrambled sequences instead of CRISPR direct repeats ablates the phage feature activation pattern. (iv) Natural sequence activation pattern, as in Fig. 4b. (d) Additional examples of sequences on which the phage feature activates but which geNomad does not annotate as phage sequences. (e) Activations of additional features associated with open reading frames (ORFs), plus-strand or minus-strand ORFs ((+) ORF and (–) ORF), and intergenic loci in a 100 kb region in E. coli K12 MG1655. (f) Mean activations for prokaryotic organizational features on different annotation types across the E. coli K12 MG1655 genome. AUROC was computed over mean activations for each region. Consecutive intergenic positions were merged into single regions. The ORF-associated features were evaluated for their ability to predict the presence of either plus-strand or minus-strand ORFs. (g) Mean activations for protein secondary structure features on different secondary structure types across ORFs in the E. coli K12 MG1655 genome. AUROC was computed over mean activations of positions annotated as each secondary structure type per protein. (h) Ablating SAE features with high F1 scores for biological elements increases both the average cross-entropy (CE) and the ratio of ablated to original CE. These results suggest that learned features can be causally relevant downstream.
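The region-level AUROC used in panels (b), (f) and (g) can be illustrated with a minimal sketch. The caption does not give the authors' implementation, so what follows is only one plausible reading: per-position activations of a single feature are averaged over each annotated region, and AUROC is then taken as the probability that a randomly chosen positive region has a higher mean than a randomly chosen negative region (ties counted as one half). The function names `region_means` and `auroc`, the half-open `(start, end)` region format, and all numeric values are hypothetical and chosen for illustration.

```python
from statistics import fmean


def region_means(activations, regions):
    """Mean activation of one feature over each half-open (start, end) region."""
    return [fmean(activations[start:end]) for start, end in regions]


def auroc(pos_scores, neg_scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Equivalent to the normalized Mann-Whitney U statistic; ties count 0.5.
    Quadratic in the number of regions, which is fine for a small sketch.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Toy per-position activations for one feature (hypothetical values).
activations = [0.0, 0.1, 0.9, 0.8, 0.7, 0.0, 0.2, 0.1, 0.6, 0.9]

# Hypothetical annotations: regions labelled phage vs. matched-length controls.
phage_regions = [(2, 5), (8, 10)]
control_regions = [(0, 2), (5, 8)]

pos = region_means(activations, phage_regions)
neg = region_means(activations, control_regions)
score = auroc(pos, neg)
print(f"region-level AUROC: {score:.2f}")
```

Averaging within each region before scoring makes every region count once regardless of its length, which matches the caption's statement that AUROC was computed over mean activations for each region rather than over individual positions.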