Fig. 5: Comparison of protein property prediction tasks.
From: mRNABERT: advancing mRNA sequence design with a universal language model and comprehensive dataset

A, B Prediction of protein melting point (A) and solubility (B). Each point represents a model, plotted by its R² value against its number of parameters (log scale). Models are colored by architectural family (e.g., mRNABERT, ESM, ProtT5). For each model, the central point is the mean R² and the error bars show the standard deviation (s.d.) across n = 5 cross-validation folds. C Transcript abundance prediction across seven species, distinguished by color. Each box plot shows the distribution of R² values from n = 5 cross-validation folds. The center line indicates the median, the box limits indicate the upper and lower quartiles, and the whiskers extend to 1.5 times the interquartile range. Individual data points from each fold are overlaid as dots. In all panels, the cross-validation folds are treated as computational replicates of the evaluation procedure. mRNABERT- denotes models that have not undergone contrastive learning.
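
The summary statistics named in the caption can be sketched as follows. This is a minimal illustration, not the authors' plotting code: the function name `summarize_folds` is hypothetical, the fold values in the usage note are made up for demonstration, and the use of the sample standard deviation (`ddof=1`) is an assumption, since the caption does not say which estimator was used. Whiskers follow the common Tukey convention of ending at the most extreme fold value within 1.5 times the interquartile range of the box.

```python
import numpy as np


def summarize_folds(r2_per_fold, ddof=1):
    """Summarize per-fold R² values the way the caption describes.

    ddof=1 (sample s.d.) is an assumption; the caption does not specify it.
    """
    r2 = np.asarray(r2_per_fold, dtype=float)

    # Panels A and B: central point and error bar.
    mean = r2.mean()
    sd = r2.std(ddof=ddof)

    # Panel C: box plot elements.
    q1, median, q3 = np.percentile(r2, [25, 50, 75])
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr

    # Whiskers end at the most extreme observed values inside the fences.
    whisker_low = r2[r2 >= low_fence].min()
    whisker_high = r2[r2 <= high_fence].max()

    return {
        "mean": mean,
        "sd": sd,
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": whisker_low,
        "whisker_high": whisker_high,
    }
```

For example, calling `summarize_folds([0.60, 0.62, 0.64, 0.66, 0.68])` on five illustrative (not reported) fold scores returns the mean and s.d. that would be drawn in panels A and B, together with the quartiles and whisker ends that would be drawn in panel C.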