Fig. 2: FusOn-pLM.
From: FusOn-pLM: a fusion oncoprotein-specific language model via adjusted rate masking

A Model pipeline. Data preparation: fusion oncoprotein sequences (length L) undergo random masking, in which each amino acid is equally likely to be selected. The masking rate increases from 15% to 40% over the course of each epoch according to a cosine scheduler. The masked sequence serves as the input and the original sequence as the label for the model, a 33-layer ESM-2-650M with a masked language modeling (MLM) head. The final eight layers are unfrozen for fine-tuning. Output: the MLM head produces an attempted reconstruction of the original sequence, which is compared with the label to calculate the loss. FusOn-pLM embeddings, of shape [L, 1280], are extracted from the final layer of the ESM-2-650M encoder stack.

B Test set loss and perplexity (pPL) for various masking strategies. Fixed-rate masking is tested at three rates, and adjusted-rate masking is tested over five ranges. For the top-performing range (15%-40%), three schedulers are tested (cosine, log-linear, stepwise). Source data for this figure are provided in the Source Data file.
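
To make the masking step in panel A concrete, the following is a minimal sketch of an adjusted-rate masking schedule. The caption states only that the rate rises from 15% to 40% within each epoch under a cosine scheduler; the half-cosine ramp below is an assumed form, and the names `cosine_mask_rate`, `mask_sequence`, and `steps_per_epoch` are hypothetical and not taken from the FusOn-pLM code. The example sequence is arbitrary.

```python
import math
import random

MASK_TOKEN = "<mask>"  # mask token string, used here for illustration


def cosine_mask_rate(step, steps_per_epoch, min_rate=0.15, max_rate=0.40):
    """Masking rate at `step`, restarting at min_rate every epoch.

    Assumption: a half-cosine ramp from min_rate to max_rate. The exact
    functional form used by the authors is not given in the caption.
    """
    if steps_per_epoch <= 1:
        return min_rate
    # Position within the current epoch, scaled to [0, 1].
    progress = (step % steps_per_epoch) / (steps_per_epoch - 1)
    ramp = 0.5 * (1.0 - math.cos(math.pi * progress))
    return min_rate + (max_rate - min_rate) * ramp


def mask_sequence(seq, rate, rng):
    """Mask a uniformly random subset of residues.

    Returns (tokens, labels): tokens is the masked input, and labels holds
    the original residue at masked positions and None elsewhere.
    """
    n_mask = min(len(seq), max(1, round(rate * len(seq))))
    positions = set(rng.sample(range(len(seq)), n_mask))
    tokens = [MASK_TOKEN if i in positions else aa for i, aa in enumerate(seq)]
    labels = [aa if i in positions else None for i, aa in enumerate(seq)]
    return tokens, labels


if __name__ == "__main__":
    rng = random.Random(0)
    seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRV"  # arbitrary example
    for step in (0, 50, 99):
        rate = cosine_mask_rate(step, steps_per_epoch=100)
        tokens, _ = mask_sequence(seq, rate, rng)
        print(step, round(rate, 3), tokens.count(MASK_TOKEN))
```

Under this assumed schedule, early batches in an epoch see the lighter 15% masking and later batches approach 40%, so the reconstruction task becomes harder as the epoch progresses.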