Fig. 6: RNA functional classification tasks.
From: RiNALMo: general-purpose RNA language models can generalize well on structure prediction tasks

Splice-site prediction. a A pre-mRNA transcript consists of non-coding regions (introns) and coding regions (exons). Introns are located between two exons of a gene. As part of the RNA processing pathway, introns are removed by cleavage at splice sites. These sites are found at the 5′ and 3′ ends of introns and are known as donor and acceptor splice sites, respectively. Most frequently, the 5′ end of an intron begins with the dinucleotide GU, and the 3′ end ends with AG. b The input to RiNALMo is a 400-nucleotide-long RNA sequence from the GS_1 dataset. We use only the CLS embedding, which is passed through a two-layer MLP classification head. The output layer indicates whether a sequence contains a donor/acceptor site. c Classification F1 score for splice-site prediction. Here, we report the average of the donor and acceptor prediction results. ncRNA family classification. d Given an RNA sequence, the goal is to classify its ncRNA family. The procedure is similar to that in (b): original and noisy RNA sequences from the Rfam dataset are input to RiNALMo. We use only the CLS embedding, which is passed through a two-layer MLP classification head. The output layer determines which of the 88 Rfam families the input ncRNA belongs to. e ncRNA family classification accuracy for noiseless and noisy input sequences, and the average accuracy. In (c and e), FT indicates whether we fine-tuned the model ourselves or report results cited directly from the original papers, which used the same train/test split. The best result for each evaluation dataset is shown in bold.
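
The classification setup described in (b) and (d) can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the embedding and hidden sizes, the ReLU activation, the random weights, the position of the CLS token (index 0), and all function names are assumptions made purely for illustration. The caption states only that the CLS embedding is passed through a two-layer MLP head.

```python
import math
import random


def linear(x, weights, bias):
    """Affine map: one output value per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    # Assumed activation; the caption does not name one.
    return [max(0.0, v) for v in x]


def softmax(logits):
    # Subtract the maximum for numerical stability.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    """Random weights stand in for trained parameters (hypothetical)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def classify_from_cls(token_embeddings, head):
    """Apply a two-layer MLP head to the CLS embedding only."""
    cls = token_embeddings[0]  # assumption: CLS token sits at index 0
    (w1, b1), (w2, b2) = head
    hidden = relu(linear(cls, w1, b1))
    return softmax(linear(hidden, w2, b2))


rng = random.Random(0)
EMBED_DIM, HIDDEN_DIM = 8, 4   # toy sizes, not the model's real dimensions
NUM_CLASSES = 2                # site / no site; 88 for the Rfam task in (d)

head = (init_layer(EMBED_DIM, HIDDEN_DIM, rng),
        init_layer(HIDDEN_DIM, NUM_CLASSES, rng))

# Stand-in for language-model output: one CLS vector plus 400 nucleotide vectors.
token_embeddings = [[rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
                    for _ in range(401)]

probs = classify_from_cls(token_embeddings, head)
```

Under these assumptions, switching from splice-site prediction to ncRNA family classification changes only `NUM_CLASSES`; the per-nucleotide embeddings are ignored in both cases.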