Fig. 1: Foundational capacities of our language model. | Nature Communications

From: A long-context language model for deciphering and generating bacteriophage genomes

a Overview of the model applications. b In silico mutagenesis analysis to identify essential genes in the bacteriophage genome. c Model loss variation across the lambda phage genome in the mutagenesis analysis. Upper: essential and non-essential genes in the genome. Lower: changes in model loss for 50 bp non-overlapping windows (step size 50 bp) across the genome (blue); moving averages of model loss across 5000 bp windows are shown in red. d Zero-shot prediction of essential genes by calculating the effects of mutations in the gene coding region (blue), start codon (orange), and stop codon (green). Area under the ROC curve (AUROC) scores are reported. e Prediction of mutational effects on protein functions using model embeddings. f Prediction of mutational effects for the deep mutational scanning experiment of the infA gene. Spearman correlation coefficients between the predicted and reported fitness from fivefold cross-validation tests are reported (blue: megaDNA; gray: DeepSequence). n is the number of training samples. g Prediction of the impacts of single-nucleotide polymorphisms (SNPs) in the T7 bacteriophage genome. Spearman correlation coefficients between the predicted and reported fitness from fivefold cross-validation tests are reported. h Prediction of regulatory element activity using model embeddings. i Prediction of translation efficiencies for non-model organisms and high-throughput gene expression libraries. For K. oxytoca, P. protegens, and E. coli DH10B, model performance was evaluated on endogenous genes. Fivefold cross-validation tests were used for all calculations. j Taxonomic classification of unannotated sequences using model embeddings. k UMAP visualization of model embeddings for sequences from bacteriophages, bacteria, and archaea (model middle layer; sample size: n = 5000 per group). For f, g, and i, data are presented as mean values ± SEM from fivefold cross-validation tests (n = 5 folds). Source data are provided as a Source Data file.
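The two summary statistics named in the caption, the moving average in panel c and the mean ± SEM in panels f, g, and i, can be illustrated with a minimal sketch. This is not the authors' code. It assumes that one loss-change value per 50 bp window has already been computed, so a 5000 bp moving average spans 100 consecutive windows. The function names, the synthetic data, and the example fold scores are all hypothetical.

```python
import numpy as np


def moving_average(values, span):
    """Moving average over `span` consecutive values.

    Uses 'valid' mode, so the output has len(values) - span + 1 entries
    and no padding is applied at the ends of the genome.
    """
    values = np.asarray(values, dtype=float)
    kernel = np.ones(span) / span
    return np.convolve(values, kernel, mode="valid")


def mean_and_sem(fold_scores):
    """Mean and standard error of the mean across cross-validation folds."""
    scores = np.asarray(fold_scores, dtype=float)
    sem = scores.std(ddof=1) / np.sqrt(scores.size)
    return scores.mean(), sem


if __name__ == "__main__":
    rng = np.random.default_rng(0)

    # Hypothetical per-window loss changes: one value per 50 bp window.
    window_bp = 50
    loss_change = rng.normal(size=1000)

    # A 5000 bp moving average covers 5000 / 50 = 100 windows (assumption).
    smoothed = moving_average(loss_change, span=5000 // window_bp)
    print(smoothed.shape)

    # Hypothetical Spearman correlations from five folds.
    mean, sem = mean_and_sem([0.61, 0.58, 0.64, 0.60, 0.62])
    print(f"{mean:.3f} ± {sem:.3f}")
```

The sample standard deviation (`ddof=1`) is used for the SEM because the five folds are treated as a sample; whether the figure uses this convention is not stated in the caption.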
