Extended Data Fig. 10: Additional data for generative chromatin accessibility. | Nature

From: Genome modelling and design across all domains of life with Evo 2

(a) Simpler patterns were designed in which each peak had uniform width. Good experimental validation of the predicted peaks was observed, especially when the peak width is longer and when all Enformer and Borzoi models agree in their predictions. Accuracy drops to an AUROC of 0.89 with short peaks of 384 bp in the “short wave” (right), especially when the Enformer and Borzoi predictions disagree. Compare to Fig. 6f-h.

(b) Top: Dinucleotide frequencies in randomly proposed sequences after beam search filtering with Enformer and Borzoi still show significant deviation from the baseline mm39 frequency. Middle: Dinucleotide frequencies, generated by a bigram proposal distribution based on the mm39 reference genome, match the baseline mm39 frequency by construction. Bottom: Sequences generated by Evo 2 when prompted with a portion of the mouse genome have natural dinucleotide frequencies, despite this never being directly enforced during inference-time sampling.

(c) Inference-time scaling results based on the AUROC of Enformer/Borzoi-predicted designs comparing a uniformly random proposal, a bigram proposal, and an Evo 2 7B proposal. The uniform proposal consistently lags Evo 2. The bigram proposal appears to have better initial scaling at lower beam search widths, but this plateaus as the beam search width increases. The Evo 2 proposal is the first to reach an AUROC > 0.95 threshold, above which designs tend to have qualitatively clear design success. Individual design runs are plotted as circles and the averages across design runs for each beam search width and each generative model are plotted as bold Xs.

(d) Quantified agreement between the single Enformer and four aligned Borzoi prediction tracks using the Intraclass Correlation Coefficient (ICC) from a two-way mixed-effects model (ICC(2, k); Methods). A value of 1 indicates perfect agreement across all five tracks. Despite never directly optimizing for ensemble agreement, sequences generated by Evo 2 have consistently high ICC values of ~0.95 even at the lowest beam search widths. Both uniform and bigram proposals have much lower ensemble agreement, though the ICC tends to improve as the increasing beam search width filters out poor designs. Individual design runs are plotted as circles and the averages across design runs for each beam search width and each generative model are plotted as bold Xs.

(e) The best scoring designs combining both AUROC and ICC for the “ARC” Morse code pattern across different generative proposals. Despite high AUROC and ICC values, the uniform and bigram proposals are qualitatively worse in terms of desired pattern agreement and Enformer/Borzoi ensemble agreement than the sequence generated by an Evo 2 proposal. We hypothesize that ensemble disagreement corresponds to greater uncertainty in the Enformer/Borzoi accessibility predictions and is consistent with adversarial inputs.

(f) Genomic statistics are plotted for three experimentally validated Morse code designs (from left to right, “ARC”, “EVO2”, and “LO”). Then, these statistics are plotted combining all three designs into a single plot (“Evo 2”) followed by plotting statistics for the same three Morse code designs but generated by a uniform or a bigram proposal. Statistics are separated by regions with a designed peak and without a designed peak. The rightmost column plots the statistics for regions of the Mus musculus genome in experimentally determined DNase-seq peak regions and in regions without peaks. Additional details on these statistics are provided in Methods. Individual plots are shaded gray if there is a significant difference (P < 0.05) between “peak” and “no peak” conditions under a two-sided Welch’s t-test. FFT, Fast Fourier Transform.

(g) Genomic tracks visualizing the sequence statistics plotted in (f) for the “LO” design. AUROC indicates how predictive the statistic is for the binary peak region labels. Spearman r indicates the correlation between the statistic and the experimentally determined ATAC-seq coverage value. FFT, Fast Fourier Transform.

(h) Genomic tracks visualizing TF motif-related statistics along the “LO” design, where the motifs have been restricted to TFs expressed in mESCs.

(i) Distribution of genes plotted by log(1 + TPM) expression values on the x-axis, used to determine a gene expression cutoff for mESCs (log(1 + TPM) > 1). This expression cutoff was used to determine whether TF motifs found in the “LO” design were significantly enriched for TFs expressed in mESCs.
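The dinucleotide comparison in panel (b) can be illustrated with a small sketch. It counts overlapping dinucleotides in a sequence and samples from a first-order (bigram) model fitted to a reference. This is an illustration only, not the authors' code: the function names, the pseudocount, and the uniform choice of the first base are all assumptions, and the toy reference string stands in for mm39.

```python
import random
from collections import Counter

ALPHABET = "ACGT"


def dinucleotide_frequencies(seq):
    """Relative frequency of each overlapping dinucleotide (non-ACGT pairs skipped)."""
    seq = seq.upper()
    counts = Counter(
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in ALPHABET and seq[i + 1] in ALPHABET
    )
    total = sum(counts.values())
    return {
        a + b: (counts[a + b] / total if total else 0.0)
        for a in ALPHABET
        for b in ALPHABET
    }


def fit_bigram_model(reference, pseudocount=1.0):
    """Estimate P(next base | current base) from a reference sequence.

    The pseudocount is an assumption of this sketch; it keeps every
    transition possible even when the reference is short.
    """
    reference = reference.upper()
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for a, b in zip(reference, reference[1:]):
        if a in counts and b in counts[a]:
            counts[a][b] += 1
    return {
        a: [counts[a][b] / sum(counts[a].values()) for b in ALPHABET]
        for a in ALPHABET
    }


def sample_bigram_sequence(model, length, rng):
    """Sample a sequence of `length` >= 1 bases from the fitted bigram model."""
    seq = [rng.choice(ALPHABET)]  # first base drawn uniformly (assumption)
    while len(seq) < length:
        seq.append(rng.choices(ALPHABET, weights=model[seq[-1]])[0])
    return "".join(seq)


# Toy usage: a short stand-in reference, not real genome data.
toy_reference = "ACGTTGCAACGGTTAACCGGTATA" * 20
model = fit_bigram_model(toy_reference)
proposal = sample_bigram_sequence(model, 500, random.Random(0))
proposal_freqs = dinucleotide_frequencies(proposal)
```

Because each base is drawn conditional on the previous one, long samples from such a model approach the reference's dinucleotide frequencies, which is the sense in which the bigram proposal matches the baseline "by construction".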
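The AUROC values reported in panels (a), (c), and (g) score how well a continuous track separates binary peak labels. A minimal sketch follows, assuming per-bin binary labels (designed peak or no peak) and per-bin scores. It uses the rank-sum (Mann-Whitney U) identity with average ranks for ties. The function name and input layout are assumptions for illustration; the paper's Methods define the actual evaluation.

```python
def auroc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    labels: sequence of 0/1 values (1 = bin inside a designed peak).
    scores: sequence of floats (for example, predicted accessibility per bin).
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")

    # Assign 1-based ranks in ascending score order, averaging over ties.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y)
    u_statistic = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u_statistic / (n_pos * n_neg)


# Toy usage: two no-peak bins followed by two peak bins.
toy_auroc = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

A value of 1 means every peak bin outscores every non-peak bin, and 0.5 corresponds to chance-level separation.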
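The ensemble-agreement measure in panel (d) can be sketched with the Shrout and Fleiss ICC(2, k) formula computed from two-way ANOVA mean squares. Here rows are assumed to be genomic bins and columns the five prediction tracks (one Enformer, four Borzoi). That layout, the function name, and the choice of this particular formula are assumptions of the sketch; the Methods of the paper are authoritative for the model the authors fitted.

```python
def icc2k(matrix):
    """ICC(2, k) following Shrout and Fleiss, from two-way ANOVA mean squares.

    matrix: list of n rows (targets, e.g. genomic bins), each holding
    k ratings (e.g. one value per prediction track).
    """
    n = len(matrix)
    k = len(matrix[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in matrix):
        raise ValueError("need a rectangular matrix with n >= 2 and k >= 2")

    grand_mean = sum(sum(row) for row in matrix) / (n * k)
    row_means = [sum(row) / k for row in matrix]
    col_means = [sum(row[j] for row in matrix) / n for j in range(k)]

    ss_rows = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_total = sum((x - grand_mean) ** 2 for row in matrix for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    bms = ss_rows / (n - 1)                # between-target mean square
    jms = ss_cols / (k - 1)                # between-rater mean square
    ems = ss_error / ((n - 1) * (k - 1))   # residual mean square

    return (bms - ems) / (bms + (jms - ems) / n)


# Toy usage: three bins on which five tracks agree exactly.
perfect_agreement = icc2k([[1.0] * 5, [2.0] * 5, [3.0] * 5])
```

When all tracks agree exactly, the rater and residual mean squares vanish and the ratio equals 1, matching the caption's statement that a value of 1 indicates perfect agreement across all five tracks.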
