Fig. 2: Language model generates sequences with functional genomic structures. | Nature Communications

Fig. 2: Language model generates sequences with functional genomic structures.

From: A long-context language model for deciphering and generating bacteriophage genomes

Fig. 2

a The workflow schematic. b Comparison of gene length distributions between randomly sampled subsets of predicted genes in generated sequences and training dataset (sample size: n = 2000). Two-sided Kolmogorov–Smirnov test: p value = 0.15. c Comparison of the predicted virus scores for generated sequences (sample size: n = 1024) and the training dataset (sample size: n = 99,429). Median virus scores are indicated by white dots. Black bars denote the interquartile ranges (25th to 75th percentiles). d Predicted taxonomy for the generated sequences predicted as viral. Only taxonomies with >1 sequence are shown. geNomad20 was used to produce results in (c, d). e Functional annotation of a selected sequence fragment (generated sequence #87). f Predicted promoter activity for all the 5′UTRs in the generated sequence #87 (sample size: n = 44), along with the promoter activity of the random sequences with the same length. Promoter activities were calculated using the Promoter Calculator22. Two-sided Kolmogorov–Smirnov test: p value = 6.3 × 10−8. g Proportions of adenine (A) and guanine (G) nucleotides preceding the start codon of all the predicted genes in the generated sequence #87. h Mean predicted pLDDT scores for the ESMFold predicted structures23. We focused on proteins with geNomad markers from generated sequences (sample size: n = 343; median value: 28) against random peptide sequences of the same lengths (sample size: n = 343; median value: 18). A sample generated protein is shown on the right. Two-sided Kolmogorov–Smirnov test: p value = 6.7 × 10−42. i Top 10 predicted functions of proteins derived from the generated sequences, as identified by phold24. Source data are provided as a Source Data file.

Back to article page