Fig. 2: mRNABERT captures multi-level evolutionary homology information.
From: mRNABERT: advancing mRNA sequence design with a universal language model and comprehensive dataset

High-dimensional embeddings are projected into two dimensions using t-SNE. Panels A and B show results for mRNABERT without contrastive learning; the remaining four panels show results for the full mRNABERT model. A, C Vocabulary embeddings from the model. Each point represents a codon or nucleotide, colored by the amino acid that the codon encodes. B, D Codons clustered by amino acid properties. Codons encoding the same amino acid, and those encoding amino acids with similar biochemical properties, tend to lie close together. E Classification of different sequence types, including all regions of mRNA and lncRNA sequences that are highly similar to mRNA. F Species and sequences were randomly sampled from the retained dataset; each point represents a complete mRNA sequence. ARI Adjusted Rand Index, FMI Fowlkes-Mallows Index.
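
The legend reports ARI and FMI without spelling them out beyond their names. As a reading aid, the following minimal sketch shows how the two scores can be computed from a set of reference labels and a set of cluster assignments by counting pairs of points. It is an illustration only: the function names and the toy labels are hypothetical, and the source does not state which implementation or clustering procedure produced the reported values.

```python
from collections import Counter
from math import comb, sqrt


def pair_counts(labels_true, labels_pred):
    """Count point pairs grouped together in both, the reference, and the predicted partition."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    true = sum(comb(c, 2) for c in Counter(labels_true).values())
    pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    return both, true, pred, comb(len(labels_true), 2)


def adjusted_rand_index(labels_true, labels_pred):
    """Rand index corrected for chance: 1.0 is perfect agreement, about 0.0 is random."""
    both, true, pred, total = pair_counts(labels_true, labels_pred)
    if total == 0:
        return 1.0  # fewer than two points: treated as trivially identical
    expected = true * pred / total
    max_index = (true + pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (both - expected) / (max_index - expected)


def fowlkes_mallows_index(labels_true, labels_pred):
    """Geometric mean of pairwise precision and recall, in the range [0, 1]."""
    both, true, pred, _ = pair_counts(labels_true, labels_pred)
    if true == 0 or pred == 0:
        return 0.0
    return both / sqrt(true * pred)


# Hypothetical toy data: four codons labeled by amino acid, and cluster ids
# assigned to their embeddings by some clustering step (not from the paper).
amino_acids = ["Ala", "Ala", "Gly", "Gly"]
cluster_ids = [0, 0, 1, 2]
print(adjusted_rand_index(amino_acids, cluster_ids))
print(fowlkes_mallows_index(amino_acids, cluster_ids))
```

Both scores depend only on which points share a group, not on the label names, so amino acid names can be compared directly against arbitrary cluster ids.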