Fig. 1: Overview of the design and applications of mRNABERT. | Nature Communications


From: mRNABERT: advancing mRNA sequence design with a universal language model and comprehensive dataset


A The mRNABERT model is developed in two stages. In the first stage, the model is pretrained on 18 million mRNA sequences. The dataset is curated and processed with the NCBI ORF finder tool to identify the different regions within each mRNA sequence. The sequences are then tokenized with a custom tokenizer and fed into the model for the MLM task. The mRNABERT architecture comprises 12 transformer blocks and incorporates techniques such as Flash Attention to improve performance. B In the second stage, a selected set of 500,000 CDS sequences and their corresponding amino acid sequences are processed by separate models. Embeddings from these models are projected into a shared dimensional space for a custom contrastive learning task that supports the full training of mRNABERT. C mRNABERT can be adapted to various downstream tasks using different strategies. MLM Masked Language Model, pLM Protein Language Model.
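
The second stage described in panel B (projecting CDS and protein embeddings into a shared space for a contrastive task) can be illustrated with a minimal sketch. The caption does not specify the loss, the projection layers, or the embedding sizes, so everything below is an assumption: a symmetric InfoNCE-style loss stands in for the "custom contrastive learning task", the projections are plain linear maps, and the names `project`, `info_nce`, `w_mrna`, and `w_prot`, as well as the toy dimensions, are hypothetical. Random vectors stand in for the outputs of the mRNA model and the pLM.

```python
import math
import random


def project(vec, weights):
    """Linear projection; `weights` is a list of rows with shape (out_dim, in_dim)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(cds_embs, prot_embs, temperature=0.07):
    """Symmetric InfoNCE loss (an assumed stand-in for the paper's custom loss).

    Row i of `cds_embs` and row i of `prot_embs` form a positive pair;
    every other pairing in the batch is treated as a negative.
    """
    n = len(cds_embs)
    logits = [
        [sum(a * b for a, b in zip(c, p)) / temperature for p in prot_embs]
        for c in cds_embs
    ]

    def cross_entropy_diag(rows):
        # Mean cross-entropy where the correct class of row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(v - m) for v in row))
            total += log_z - row[i]
        return total / n

    transposed = [list(col) for col in zip(*logits)]
    # Average the CDS-to-protein and protein-to-CDS directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy sizes, chosen only for illustration (not taken from the paper).
MRNA_DIM, PROT_DIM, SHARED_DIM, BATCH = 8, 6, 4, 3
rng = random.Random(0)

w_mrna = random_matrix(SHARED_DIM, MRNA_DIM, rng)  # hypothetical projection head
w_prot = random_matrix(SHARED_DIM, PROT_DIM, rng)  # hypothetical projection head

cds_raw = random_matrix(BATCH, MRNA_DIM, rng)   # stand-in for mRNA model outputs
prot_raw = random_matrix(BATCH, PROT_DIM, rng)  # stand-in for pLM outputs

cds_z = [l2_normalize(project(v, w_mrna)) for v in cds_raw]
prot_z = [l2_normalize(project(v, w_prot)) for v in prot_raw]

loss = info_nce(cds_z, prot_z)
print(f"contrastive loss on random embeddings: {loss:.4f}")
```

In this sketch the two encoders may have different output widths, which is why each gets its own projection into the shared dimension before similarities are computed. Minimizing the loss pulls each CDS embedding toward the embedding of the protein it encodes and away from the other proteins in the batch.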
