Extended Data Fig. 3: Evo 2 understands mutational effects on protein, RNA, and organismal fitness across all domains of life. | Nature

From: Genome modelling and design across all domains of life with Evo 2

(a) Evo 2 predicts mutations to be unlikely in the start codons of protein-coding genes, the first two bases of each codon of the coding region, and the ribosome-binding sites of the 5′ UTR, across 20 prokaryotic and 16 eukaryotic model species. (b) Evo 2 predicts mutations to be unlikely in the stop codons of protein-coding genes and in the first two bases of each codon of the coding region before the stop codon. (c) Evo 2 40B predicts lower likelihoods for deletions in miRNA and snoRNA loci compared to Evo 2 7B. Red points in (c) are the same as those shown in Fig. 2d. The same sequences were analyzed with both models. (d) The translational codon ramp pattern detected across all coding sequences in four species, focusing on the first and last 100 codons of each coding sequence. The local mean tRNA-adaptation index (tAI) was calculated using pre-computed tAI values for each species and then z-score normalized. Data are based on rolling 5-codon averages. (e) The average change in log-likelihood across hundreds of genes and codon positions in each species’ genome. Blue lines indicate synonymous codon mutations with a higher tAI than the reference sequence, while red lines indicate synonymous codon mutations with a lower tAI than the reference sequence. Each codon position was averaged, and then a rolling 5-codon average was applied. (f) Evo 2 predicts stop codons depending on the sequence context and the stop codons present in the genome sequence, responding to an artificially altered stop codon code by predicting the mutations as high effect. Shown are median z-score-standardized median Δlikelihood values for two ciliate genomes across six sequence context lengths. (g) Length-adjusted Evo 2 likelihoods of human mRNA sequences showed a negative correlation with their experimentally measured decay rates. Borzoi was included as a supervised sequence-to-expression model by selecting and averaging RNA expression prediction tracks. (h) Zero-shot prokaryotic gene essentiality prediction, including the base pretrained models and the final checkpoints extended to 1-million-token context for both the 7B and 40B parameter Evo 2 models; compare to Fig. 2j. (i) DepMap human gene essentiality classification performance, measured by AUROC and AUPRC, comparing conservation baselines and language models.
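
The smoothing and normalization described for panel (d) can be illustrated with a minimal sketch. It assumes the rolling 5-codon mean is taken first and the smoothed profile is then z-scored, which is one reading of the legend. The function names, the handling of window edges (only full windows are kept), the use of the population standard deviation, and the toy tAI values are all assumptions made for illustration; none of them come from the paper.

```python
from statistics import mean, pstdev


def rolling_mean(values, window=5):
    """Mean over each full window of `window` consecutive positions."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [mean(values[i:i + window]) for i in range(len(values) - window + 1)]


def z_score(values):
    """Standardize to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot z-score a constant profile")
    return [(v - mu) / sigma for v in values]


def codon_tai_profile(codons, tai_by_codon, window=5):
    """Look up a tAI value per codon, smooth it, then z-score the result.

    `tai_by_codon` stands in for a pre-computed, species-specific table
    (hypothetical here).
    """
    per_codon = [tai_by_codon[c] for c in codons]
    return z_score(rolling_mean(per_codon, window))


# Toy example: the tAI values below are made up, not real measurements.
toy_tai = {"ATG": 0.50, "GCT": 0.30, "GCC": 0.80, "AAA": 0.60, "AAG": 0.90}
toy_cds = ["ATG", "GCT", "GCT", "AAA", "GCC", "AAG", "GCC", "AAG"]
profile = codon_tai_profile(toy_cds, toy_tai)
```

Keeping only full windows shortens the profile by `window - 1` positions. A padded or centered window would be an equally plausible choice, and the legend does not say which was used.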
