Abstract
Fusion oncoproteins, a class of chimeric proteins arising from chromosomal translocations, are major drivers of various pediatric cancers. These proteins are intrinsically disordered and lack druggable pockets, making them highly challenging therapeutic targets for both small molecule-based and structure-based approaches. Protein language models (pLMs) have recently emerged as powerful tools for capturing physicochemical and functional protein features but have yet to be trained on fusion oncoprotein sequences. We introduce FusOn-pLM, a fine-tuned pLM trained on a newly curated, comprehensive set of fusion oncoprotein sequences, FusOn-DB. Employing a unique cosine-scheduled masked language modeling strategy, FusOn-pLM dynamically adjusts masking rates (15%–40%) to optimize feature extraction and representation quality, surpassing baseline embeddings in fusion-specific tasks, including localization, puncta formation, and disorder prediction. FusOn-pLM uniquely predicts drug-resistant mutations, providing insights for therapeutic design that anticipates resistance mechanisms. In total, FusOn-pLM provides biologically relevant representations for advancing therapeutic discovery in fusion-driven cancers.
Introduction
Fusion oncoproteins arise from chromosomal rearrangements that fuse segments of two distinct genes (Fig. 1A)1. The resulting mutants contain unrelated functional domains connected by long regions of disorder2. This flexible configuration promotes constitutive activation or aberrant regulation of the fusion proteins, driving oncogenic transformation and tumor development3. Thousands of unique fusion oncoproteins have been discovered by sequencing patient tumors, and several common culprits such as EWSR1::FLI1 in Ewing’s sarcoma4, PAX3::FOXO1 in alveolar rhabdomyosarcoma4,5, SS18::SSX1 in synovial sarcoma6, and EML4::ALK proteins in non-small-cell lung cancer7 are well characterized in the literature. However, even the best-understood fusion oncoproteins have proven to be elusive drug targets due to their structural instability and absence of defined binding pockets2. Small molecules that do bind fusion oncoproteins, such as those reported for EWSR1::FLI1 (refs. 8,9), do not achieve strict fusion specificity: they also bind the corresponding wild-type head or tail protein, which is often a critical regulator of cellular homeostasis. As such, biologics, such as antibodies, miniproteins, and peptides, represent attractive therapeutic alternatives but necessitate advanced design approaches for specific targeting of these undruggable proteins10,11,12,13.
A FOs are formed by chromosomal rearrangements between two independent genes, the 5’ head gene and 3’ tail gene. Created in BioRender. Chatterjee, P. (2025) https://BioRender.com/p82v208. B ESM-2 training data included the wild-type head and tail proteins involved in FOs, but not FOs themselves. FOs were compared to SwissProt, a representative subset of ESM-2’s training data, via BLAST. The best alignments for each FO are shown (% identity = total identities / length of FO sequence). C AlphaFold2 structures of four well-studied fusion oncoproteins: PAX3::FOXO1, EWSR1::FLI1, EML4::ALK, and SS18::SSX1. Structures are colored by composition (red = head, blue = tail) and pLDDT, AlphaFold2’s primary confidence metric. Each FO has multiple known breakpoints, producing different amino acid sequences. Breakpoint regions (rectangle), per-residue pLDDTs (bar coloring), and average pLDDTs (colored circle) are shown for each sequence. D The percentage of disordered residues per sequence for FOs and their respective heads and tails. Average disorder content is 45.9% for FOs, 33.7% for head proteins, and 32.7% for tail proteins. Only FOs with AlphaFold2 structures available on FusionPDB are included. Source data for this figure are provided in the Source Data file.
Recently, structure-based prediction and design models, such as AlphaFold and RFDiffusion14,15,16, have accelerated the design of biologics targeting pathogenic proteins. These tools, by default, fail to accurately capture the structure of numerous conformationally unstable proteins, limiting their usefulness for fusion oncoprotein targeting17. Meanwhile, protein language models (pLMs), such as ESM-2 and ProtT5, have been trained on millions of protein sequences, from the exceedingly stable to the intrinsically disordered18,19. They capture physicochemical, structural, and functional properties of proteins from their sequence alone, and have even been extended to design novel proteins20,21 and binders22,23,24. However, these models were not trained on fusion oncoprotein sequences, which are functionally and structurally distinct from their wild-type counterparts due to their altered binding sites and unique breakpoint junctions25.
To fill this critical gap, we fine-tune the state-of-the-art ESM-2 pLM on 44,414 fusion oncoprotein sequences collected from the FusionPDB and FOdb databases, collectively termed the new FusOn-DB database2,26. Training on FusOn-DB data, we unfreeze all of the weights of the final eight layers of the ESM-2-650M model and fine-tune these parameters using a masked language modeling (MLM) head. To enhance the model’s ability to learn the unique properties of fusion oncoproteins, we introduce a cosine-scheduled masking strategy, dynamically varying the masking rate from 15% to 40% during training. This approach enables our top-performing model, FusOn-pLM, to capture the distinct structural and functional features of fusion oncoproteins. As evidence, our results demonstrate that FusOn-pLM outperforms baseline embeddings on diverse fusion-specific tasks, including puncta formation propensity and the prediction of intrinsic disorder. Moreover, we showcase its utility in identifying drug-resistant mutations in fusion oncoproteins, highlighting its biological relevance and potential for advancing therapeutic design.
Results
Fusion oncoproteins comprise a distinct and diverse sequence dataset
ESM-2 was pretrained on ~65 million sequences from UniRef50, a database which includes over 9000 wild-type proteins known to act as the head or tail components of fusion oncoproteins27. However, ESM-2 was not trained on the fusions themselves (Fig. 1B)18. By collecting fusion oncoprotein sequences from the FusionPDB and FOdb databases2,26, two complementary resources that provide experimentally validated and computationally predicted fusion proteins with clinical or biological relevance, we assembled FusOn-DB, a comprehensive and non-redundant dataset of 44,414 fusion oncoprotein sequences. Running BLAST between FusOn-DB and SwissProt28 (full results in Supplementary Data S1) revealed a wide distribution of sequence homology. On average, fusion oncoproteins shared 71.0% identity with the top-aligning SwissProt sequence, which corresponded to either the head or tail protein in 87% of cases. Over 12,000 fusion oncoproteins had <60% maximum identity, and over 5000 had <50% maximum identity.
Fusion oncoproteins are also characterized by a high level of structural disorder. AlphaFold2 structures of four highly-studied fusion oncoproteins (PAX3::FOXO1, EWSR1::FLI1, EML4::ALK, and SS18::SSX1) largely exhibit low (50–70) and very low (<50) confidence pLDDT scores, indicating extensive intrinsic disorder (Fig. 1C). These structural trends are consistent across various sequences for the same fusion genes, arising from different breakpoints (Fig. 1C). To quantify the difference in disorder between fusion oncoproteins and wild-type proteins, we used a well-validated threshold to assign disorder labels (pLDDT <68.8 = disordered)17 to each residue in a set of fusion oncoproteins. Fusion oncoproteins were 45.9% disordered on average, while head proteins were 33.7% disordered and tail proteins were 32.7% disordered (Fig. 1D). Similarly to fusion oncoproteins, the gold-standard disorder dataset Disorder-NOX29 had a greater proportion of near-fully disordered proteins than fusion heads and tails. In contrast, fusion oncoproteins had a more right-skewed distribution (Supplementary Fig. 1). In total, these findings highlight the distinct sequence and structural characteristics of fusion oncoproteins, underscoring the need for better representations tailored to their properties.
Cosine-scheduled masking enables accurate fusion oncoprotein sequence recovery
Having curated a diverse dataset of fusion oncoproteins, we sought to fine-tune the standard ESM-2-650M model via an MLM objective (Fig. 2A)18. This classic training approach forces the model to reconstruct masked tokens from sequence context, refining representations to emphasize unique physicochemical properties of fusion oncoproteins. Fixed-rate masking at 15% is the established standard in most BERT-based MLM architectures18,30, but fusion oncoproteins’ intrinsic complexity prompted us to explore higher and variable masking rates. Recent findings from Wettig et al. have demonstrated that increasing masking rates (up to 40%) improves performance by forcing the model to rely more heavily on sequence context for token reconstruction31. Additionally, varying the masking rate during training balances representation learning (improved by lower masking rates) with reconstruction quality (improved by higher masking rates)32. Motivated by these findings, we fine-tuned the final eight layers of ESM-2-650M using a cosine scheduler to dynamically adjust the masking rate from 15% to 40% across each training epoch (Fig. 2A), hypothesizing that this approach would maximize model performance by gradually increasing the difficulty of the reconstruction task.
A Model pipeline. Data preparation: Fusion oncoprotein sequences (length L) undergo random masking, where each amino acid has equal likelihood of selection. The masking rate increases from 15% to 40% throughout each epoch according to a cosine scheduler. The masked sequence is fed as input and the original sequence as label into the model: 33-layer ESM-2-650M with an MLM head. The final eight layers are unfrozen for fine-tuning. Output: the MLM head outputs an attempted reconstruction of the original sequence, which is compared with the label to calculate loss. FusOn-pLM embeddings, of shape [L, 1280], are extracted from the final layer of the ESM-2-650M encoder stack. B Test set loss and perplexity (pPL) for various masking strategies. Fixed-rate masking is tested at three rates, and adjusted-rate masking is tested in five ranges. At the top-performing range (15%-40%), three schedulers are tested (cosine, log-linear, stepwise). Source data for this figure are provided in the Source Data file.
Our results strongly validated this hypothesis. When evaluated on a 15% masked, held-out test set from FusOn-DB, our fine-tuned model consistently outperformed both fixed-rate masking strategies and the non-fine-tuned ESM-2-650M baseline (Fig. 2B). ESM-2-650M, which uses a static 15% masking rate during pre-training18, performed poorly, with a loss of 1.83 and a pseudo-perplexity of 6.24. Pseudo-perplexity (pPL) is a metric adapted from language modeling that evaluates how well a model predicts masked tokens, with lower values indicating better reconstruction performance and overall sequence comprehension. Fine-tuning with fixed masking rates of 15%, 20%, and 25% performed far better than this baseline, but loss and pPL rose progressively with the masking rate, reflecting the difficulty of optimizing both sequence reconstruction and representation learning with static masking. In contrast, cosine-scheduled masking achieved better performance across all tested ranges, with the best results observed for a masking range of 15–40% (loss: 1.28, pPL: 3.61). Further exploration of other adjusted-rate masking schedulers, including log-linear and stepwise strategies, showed that the cosine scheduler remained optimal, achieving the lowest loss and pPL values (1.28 and 3.61, respectively) (Fig. 2B).
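The relationship between the reported loss and pPL values can be illustrated with a short calculation. The sketch below assumes that pPL is the exponential of the mean cross-entropy over masked positions, a common definition that is not stated explicitly here; the function name and the toy probabilities are illustrative only.

```python
import math

def pseudo_perplexity(true_residue_probs):
    """Return (mean loss, pPL) from the probabilities a model assigns
    to the true residue at each masked position.

    Assumption: pPL = exp(mean negative log-likelihood over masked positions).
    """
    nll = [-math.log(p) for p in true_residue_probs]
    mean_loss = sum(nll) / len(nll)
    return mean_loss, math.exp(mean_loss)

# Toy probabilities for four masked positions (hypothetical values).
loss, ppl = pseudo_perplexity([0.5, 0.25, 0.125, 0.5])
print(f"loss={loss:.3f}, pPL={ppl:.3f}")

# Under the same assumption, exponentiating the reported losses gives values
# close to the reported pPLs (6.24 and 3.61); small gaps would be expected
# from rounding of the loss.
for reported_loss in (1.83, 1.28):
    print(reported_loss, round(math.exp(reported_loss), 2))
```

Under this definition, a pPL of 3.61 corresponds roughly to the model choosing among fewer than four equally likely residues at each masked position.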
FusOn-pLM generates fusion oncoprotein-relevant representations
To determine if FusOn-pLM produces relevant embeddings, we sought to evaluate its performance on downstream fusion oncoprotein-specific tasks. We first assessed the embeddings’ ability to accurately predict the formation and localization of puncta, which are critical in driving cancer pathology2. Many fusion oncoproteins have been shown to form puncta via phase separation, and these condensates may localize to the nucleus and/or cytoplasm (Fig. 3A)2. Experimental data describing the puncta formation and localization of 178 fusion oncoproteins were used to train three FusOn-pLM-Puncta models, consisting of FusOn-pLM embeddings fed into a gradient boosting (XGBoost) classifier (Fig. 3B). For puncta formation, FusOn-pLM embeddings outperform ESM-2-650M, ProtT5, and FOdb physicochemical embeddings on four relevant classification metrics across the entire held-out test dataset (Fig. 3C). We observed similar results when predicting localization to the nucleus, the primary location of fusion oncoproteins (Fig. 3D)3. While manually-curated FOdb embeddings perform strongly on cytoplasm localization prediction, FusOn-pLM embeddings prove most effective on critical metrics, such as AUROC (Fig. 3E). In total, these results indicate that FusOn-pLM learns representations capturing key properties encoded in fusion oncoprotein sequences.
A Certain FOs form puncta (condensates) via phase separation. Puncta may localize to the nucleus, cytoplasm, or both. B Three XGBoost classifiers are trained on FusOn-pLM-embedded FOs. One predicts formation of puncta (puncta propensity); one predicts formation of nuclear puncta (nucleus localization); one predicts formation of cytoplasmic puncta (cytoplasm localization). C–E Performance on a held-out test set when predictors are trained on FusOn-pLM, ESM-2-650M, ProtT5-XL-U50, and FOdb embeddings. Created in BioRender. Chatterjee, P. (2025) https://BioRender.com/e57u556. Source data for this figure are provided in the Source Data file.
FusOn-pLM can accurately predict disordered content in wild-type and fusion oncoproteins
Given that fusions are structurally disordered, we hypothesized that FusOn-pLM’s embeddings may encode information pertinent to the properties of intrinsically disordered regions (IDRs). Specifically, we sought to predict: (1) asphericity, which quantifies a protein’s ensemble shape and molecular conformation; (2) end-to-end radius (Re), the average distance between the N-terminal and C-terminal residues; (3) radius of gyration (Rg), the average distance between a protein’s residues and its center of mass; and (4) polymer scaling exponent, which describes an IDR’s behavior when solvated in water33. Individual FusOn-pLM-IDR regressors were trained on non-fusion IDR sequences for each property, using multi-layer perceptron (MLP) heads to predict the property values directly from FusOn-pLM embeddings (Fig. 4A). We demonstrate that FusOn-pLM-IDR models achieve a high coefficient of determination (R2) on all four properties, indicating a strong fit (Fig. 4B). We also find that FusOn-pLM and ESM-2-650M embeddings achieve nearly equivalent performance, signaling that FusOn-pLM did not overfit on fusion oncoproteins and lose ESM-2’s intrinsic ability to represent a wide range of proteins (Supplementary Fig. 2).
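To make three of these ensemble properties concrete, the sketch below computes them for a single conformation from residue coordinates. It is an illustration rather than the labeling procedure of the source dataset: it assumes equal residue masses, uses the root-mean-square form of Rg, and uses a standard gyration-tensor definition of asphericity. Reported labels would be averages over a conformational ensemble.

```python
import math

def center_of_mass(coords):
    n = len(coords)
    return tuple(sum(c[k] for c in coords) / n for k in range(3))

def radius_of_gyration(coords):
    """Root-mean-square distance of residues from the center of mass."""
    com = center_of_mass(coords)
    sq = [sum((c[k] - com[k]) ** 2 for k in range(3)) for c in coords]
    return math.sqrt(sum(sq) / len(coords))

def end_to_end_distance(coords):
    """Distance between the first (N-terminal) and last (C-terminal) residue."""
    return math.dist(coords[0], coords[-1])

def asphericity(coords):
    """1 - 3 * (sum of pairwise eigenvalue products) / (sum of eigenvalues)^2
    of the gyration tensor, computed from tensor invariants so that no
    eigendecomposition is needed. 0 = spherical, 1 = rod-like."""
    com = center_of_mass(coords)
    n = len(coords)
    tensor = [[sum((c[a] - com[a]) * (c[b] - com[b]) for c in coords) / n
               for b in range(3)] for a in range(3)]
    trace = sum(tensor[a][a] for a in range(3))
    trace_of_square = sum(tensor[a][b] * tensor[b][a]
                          for a in range(3) for b in range(3))
    second_invariant = (trace ** 2 - trace_of_square) / 2
    return 1 - 3 * second_invariant / trace ** 2

# Toy conformations (hypothetical coordinates): a straight rod and an octahedron.
rod = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
octahedron = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
print(radius_of_gyration(rod), end_to_end_distance(rod), asphericity(rod))
print(asphericity(octahedron))
```

The rod gives an asphericity of 1 and the symmetric octahedron gives a value near 0, matching the intended interpretation of the measure.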
A FusOn-pLM-IDR models predict asphericity (A), end-to-end radius (Re), radius of gyration (Rg), and polymer scaling exponent (PS) by feeding FusOn-pLM embeddings through an MLP regression head. B FusOn-pLM-IDR predictions vs. true values. The coefficient of determination (R2) between predictions and labels was calculated for each model to assess goodness of fit. C FusOn-pLM-Diso utilizes a Transformer architecture to predict per-residue disorder labels from FusOn-pLM embeddings. D Disorder predictor performance in CAID2 competition when trained on FusOn-pLM vs. ESM-2-650M embeddings34. E FusOn-pLM-Diso performance on test set fusion oncoproteins, based on AlphaFold-pLDDT-derived disorder labels. Data are presented as first and third quartile +/− 1.5*IQR (interquartile range). Median line is indicated in black and circles represent outliers (left). The coefficient of determination (R2) between predictions and labels was calculated for each model to assess goodness of fit (right). F Visualization of FusOn-pLM embedding predictions of disorder propensity on AlphaFold2-predicted structure. Disorder probabilities are shaded according to the legend for interpolation. Source data for this figure are provided in the Source Data file.
Next, we sought to assess FusOn-pLM’s ability to identify IDRs within protein sequences. The FusOn-pLM-Diso model was trained to predict per-residue probabilities of disorder directly from FusOn-pLM embeddings (Fig. 4C). When evaluated on the Disorder-NOX dataset used in the CAID2 competition34, FusOn-pLM-Diso achieved an AUROC of 0.825. Compared with a parallel architecture trained on ESM-2 embeddings (ESM-2-650M-Diso) and fourteen CAID2 competitors, FusOn-pLM-Diso ranked in the top 5 of all models (Fig. 4D)29. We then asked whether FusOn-pLM embeddings could accurately distinguish between structured and disordered residues in fusion oncoproteins specifically. On a set of proteins from FusOn-pLM’s test set, FusOn-pLM-Diso achieved average accuracy, precision, recall, F1, and AUROC metrics all above 0.9. We also observed a strong correlation (R2 = 0.84) between the disorder percentages predicted by FusOn-pLM-Diso and those derived from AlphaFold-pLDDT (Fig. 4E), further supporting the notion that FusOn-pLM embeddings capture the disorder properties of fusion oncoproteins. When visualizing the per-residue disorder probabilities for five well-studied fusion oncoproteins, we observe differential coloring between disordered and structured residues. FusOn-pLM correctly identifies structure in the α-helix and β-sheet-rich regions, coloring these areas dark blue (Fig. 4F). Overall, our results suggest that FusOn-pLM accurately encodes disorder-related information in its embeddings. Given that fusion oncoproteins are characterized by their disordered regions, we reason that FusOn-pLM embeddings more effectively represent fusion oncoproteins.
FusOn-pLM embeddings enable zero-shot discovery of relevant mutations
Fusion oncoproteins themselves are mutants, but they also have the potential to acquire additional mutations which can alter their structure, function, and druggability35. Beyond property and disorder prediction, we sought to establish the biological utility and relevance of FusOn-pLM by performing zero-shot discovery via its MLM head, which can sequentially unmask each position in an input sequence, outputting residue probabilities per unmasked position (Fig. 5A). As with any pLM, within evolutionarily conserved domains, the logits corresponding to the original residue are much higher than for any alternate residue. For example, in the TF::Kinase fusion TRIM24::RET, FusOn-pLM correctly identifies TRIM24’s zinc finger domains and RET’s kinase domain as highly conserved (Fig. 5B). FusOn-pLM also identifies that the EWSR1 activation domain and FLI1 DNA-binding domain in EWSR1::FLI1 are unlikely to mutate (Fig. 5B). In PAX3::FOXO1, the DNA-binding domains of PAX3 are highly conserved, but the truncated DNA-binding domain of FOXO1 (25/75 amino acids) is less strongly conserved (Fig. 5B), corroborated by studies showing FOXO1’s DNA binding activity is not critical for fusion function36,37. This result indicates that FusOn-pLM has implicitly captured the function of fusion oncoproteins, which is further strengthened by the observation of clear differences between TF::TF and Kinase::Kinase fusions in its latent space (Supplementary Fig. 3A).
A FusOn-pLM performs zero-shot mutation discovery via its MLM head through sequential unmasking of individual residues. Potential mutations are ranked by their logit values. B FusOn-pLM logits for the longest EWSR1::FLI1, PAX3::FOXO1, and TRIM24::RET sequences in FusOn-DB. Yellow regions are considered highly conserved domains. C Recovery of mutations found to cause drug resistance in patients with EML4::ALK and BCR::ABL1-driven cancers. D Case study on kinase fusion ETV6::NTRK3 (647 amino acids), which drives various cancers. FusOn-pLM predictions of NTRK3 kinase domain mutations identified in ETV6::NTRK3+ cancer patients with drug resistance are shown in the table. Based on logit values, disordered residues from the head protein ETV6 are indicated. Source data for this figure are provided in the Source Data file.
Although FusOn-pLM may not predict that change is likely within a conserved domain, its logits still provide rank-ordered, possible mutations within these regions. This feature holds promise for discovering potential drug resistance mutations, as small molecule drugs are designed to interact with well-structured, conserved binding pockets like kinase active sites38. Fusion oncoprotein mutations causing drug resistance have been identified in a small number of studies on kinase-containing fusions39,40,41. We sought to determine whether FusOn-pLM prioritizes the resistance-causing mutations discovered in patients with fusion-driven cancers. In EML4::ALK, 14 mutation sites were linked to resistance to at least one of five drugs: Crizotinib, Ceritinib, Alectinib, Brigatinib, and Lorlatinib39. FusOn-pLM successfully predicted at least one true resistance mutation among the top three mutation logits for 12/14 sites (Fig. 5C, Supplementary Data S2). In BCR::ABL1, whose sequence is nearly twice as long as EML4::ALK, 28 mutation sites were linked to imatinib resistance40. FusOn-pLM recovered drug resistance mutations in 13 of these locations (Fig. 5C, Supplementary Data S3). Finally, we selected ETV6::NTRK3 as a case study for recovering known drug resistance mutations and investigating potential mutations away from the active site. FusOn-pLM successfully prioritized two resistance mutations in the NTRK3 kinase domain41, assigned high conservation probability throughout the kinase domain, and predicted the most volatile positions to be in the disordered region from head protein ETV6 (Fig. 5D). In total, these results highlight FusOn-pLM’s potential as a biologically-relevant tool for predicting resistance mutations both within conserved domains and in disordered regions critical to therapeutic outcomes.
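The top-three recovery criterion used above can be sketched as follows. This is a minimal illustration that assumes per-position logits have already been obtained by masking one residue at a time, and that the original residue is excluded before ranking alternatives. The function names, 0-based positions, and toy logits are hypothetical and are not taken from the study's code or data.

```python
def top_k_alternatives(logits, original, k=3):
    """Rank substitute residues at one position by logit, excluding the original residue."""
    alternatives = {aa: value for aa, value in logits.items() if aa != original}
    return sorted(alternatives, key=alternatives.get, reverse=True)[:k]

def recovered_sites(position_logits, sequence, known_mutations, k=3):
    """Return sites where at least one known resistance mutation is in the top k.

    position_logits: {position: {residue: logit}} from single-position masking
    known_mutations: {position: set of resistance-associated residues}
    """
    hits = []
    for pos, resistant in known_mutations.items():
        top = top_k_alternatives(position_logits[pos], sequence[pos], k)
        if resistant & set(top):
            hits.append(pos)
    return hits

# Toy example (hypothetical sequence, logits, and resistance mutations).
sequence = "MKLG"
position_logits = {
    1: {"K": 5.0, "R": 2.0, "E": 1.5, "N": 1.0, "A": -1.0},
    3: {"G": 4.0, "A": 0.5, "S": 0.2, "C": 0.1, "R": -2.0},
}
known_mutations = {1: {"N"}, 3: {"R"}}

hits = recovered_sites(position_logits, sequence, known_mutations)
print(f"recovered {len(hits)}/{len(known_mutations)} sites: {hits}")
```

In this toy case only position 1 is recovered, because the known substitution at position 3 ranks below the top three alternatives.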
Discussion
In this work, we introduce FusOn-pLM, an ESM-2-based pLM fine-tuned to generate fusion oncoprotein-specific embeddings. We further provide a newly-curated, comprehensive dataset, FusOn-DB, consisting of over 44,000 annotated fusion oncoprotein sequences. To our knowledge, no pLM has explicitly sought to learn the unique characteristics of fusion oncoproteins, which differ from most proteins due to their highly disordered nature and altered structural and functional properties. Our benchmarking results establish that via a cosine-scheduled MLM training strategy, FusOn-pLM embeddings outperform those of the original ESM-2-650M model18, the ProtT5 model19, as well as baseline FOdb descriptor embeddings2, on fusion oncoprotein-related tasks, while retaining distinct representations of fusion proteins from their head and tail counterparts (Supplementary Fig. 3B). We further demonstrate that by training on fusion oncoprotein sequences, which represent a large class of IDR-containing proteins, FusOn-pLM embeddings rank highly on the CAID2 benchmark for IDR detection34 and strongly predict IDR properties themselves. Finally, as a demonstration of the model’s biological relevance, we show that FusOn-pLM uniquely enables the prediction of current and future drug-resistant mutations in fusion oncoproteins, highlighting its potential for informing therapeutic strategies and anticipating resistance mechanisms.
While FusOn-pLM represents an important advancement, there are several limitations to address. First, despite leveraging over 44,000 fusion oncoprotein sequences, the diversity of the FusOn-DB dataset may not fully capture all fusion variants, particularly rare or less well-characterized fusions. Additional data, particularly from emerging databases and clinical studies, would further enhance the model’s generalizability. Second, due to GPU memory constraints, proteins longer than 2000 amino acids were excluded during training. While such cases are rare among known fusion oncoproteins, this limitation may exclude certain outliers with repetitive domains or extensive IDRs. Future optimizations in tokenization or memory-efficient architectures could enable the inclusion of these sequences, ensuring comprehensive coverage of fusion oncoprotein diversity. Third, while FusOn-pLM provides strong predictions for intrinsic disorder and drug-resistant mutations, its ability to predict driver mutations or to connect sequence embeddings with regulatory elements such as enhancers or transcription factors remains unexplored35. Future efforts could involve developing models that integrate FusOn-pLM embeddings with regulatory sequence data to elucidate mechanisms underlying oncogenesis42. Most importantly, experimental validation of FusOn-pLM’s predictions, including drug resistance mechanisms and therapeutic design tasks, will be essential to confirm its utility in practical settings.
Recently, our lab has trained ESM-2-based models to generate peptides provided only the sequence of the target protein, facilitating the design of peptide-E3 ubiquitin ligase fusions for the proteasomal degradation of diverse protein substrates22,23,24. As our main objective is to enable the degradation of fusion oncoproteins, our next steps will be to replace ESM-2 embeddings in these models with FusOn-pLM embeddings, enabling fusion-specific degrader design. Since post-translational modifications (PTMs) are also well known to affect the oncogenic activity of fusion oncoproteins43,44,45, we plan to retrain FusOn-pLM with our recent PTM-Mamba pLM46, which effectively tokenizes PTMs, enabling both fusion- and PTM-specific therapeutic design. Finally, by leveraging recent advancements in gene delivery, such as lipid nanoparticles and adeno-associated viral vectors47,48, we envision that fusion-specific biologics may eventually serve as safe and efficacious therapeutics for fusion-positive cancer patients. Overall, the results of our study motivate the use of FusOn-pLM embeddings for downstream fusion oncoprotein design tasks, serving as a major step toward this goal.
Methods
Model training set curation
Model training data was curated from FusionPDB and FOdb to create FusOn-DB, a dataset of 44,414 fusion oncoprotein sequences representing 16,364 unique head::tail fusions. FusionPDB contributed 41,456 unique sequences26, including AlphaFold2 predictions for 3.5K proteins, while FOdb added 4537 unique sequences derived largely from patient data2. After removing duplicates, sequences longer than 2000 amino acids were excluded, leaving 42,141 sequences for training. To create train-validation-test splits with low sequence homology, sequences were clustered using MMSeqs2 with a 30% sequence identity and 80% coverage threshold49. The test set included 250 sequences: 195 with experimental puncta data from FOdb and sequences for four well-studied fusions (EWSR1::FLI1, PAX3::FOXO1, BCR::ABL1, and EML4::ALK). Clusters overlapping these sequences were manually assigned to the test set, with the remaining clusters split into training (33,719 sequences, 80.01%), validation (4214 sequences, 10.00%), and testing (4208 sequences, 9.99%) sets.
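The cluster-aware splitting described above can be sketched as follows. The example assumes cluster assignments are already available (for instance from an MMSeqs2 run, which is not invoked here) and uses a simple greedy assignment. The function name, the greedy rule, and the toy clusters are illustrative assumptions, not the procedure used to produce the reported split sizes.

```python
import random

def split_by_cluster(clusters, forced_test=frozenset(),
                     fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole clusters to train/val/test so no cluster spans two splits.

    clusters: {cluster_id: [sequence ids]}
    forced_test: sequence ids whose clusters must be placed in the test set
    """
    names = ("train", "val", "test")
    total = sum(len(members) for members in clusters.values())
    targets = {name: frac * total for name, frac in zip(names, fractions)}
    splits = {name: [] for name in names}

    remaining = []
    for cid, members in clusters.items():
        if any(m in forced_test for m in members):
            splits["test"].extend(members)  # e.g. benchmark fusions
        else:
            remaining.append(cid)

    rng = random.Random(seed)
    rng.shuffle(remaining)
    for cid in remaining:
        # Place the cluster in the split that is furthest below its target size.
        name = max(names, key=lambda s: targets[s] - len(splits[s]))
        splits[name].extend(clusters[cid])
    return splits

# Toy example: 20 clusters of 5 sequences; one sequence is pinned to the test set.
clusters = {f"c{i}": [f"s{i}_{j}" for j in range(5)] for i in range(20)}
splits = split_by_cluster(clusters, forced_test={"s3_0"})
print({name: len(ids) for name, ids in splits.items()})
```

Because clusters are assigned as units, sequences that share a cluster never appear on both sides of a split, which is the property that keeps homology between training and test data low.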
BLAST and breakpoint mapping
To estimate sequence homology between FusOn-DB and SwissProt, local blastp (v2.16.0) was used. Head and tail gene names from FOdb and FusionPDB were mapped to UniProt IDs using the UniProt ID Mapping tool. Of the 44,414 fusion sequences, 44,257 had both head and tail components mapped, and 157 had one unmapped component (43 head, 114 tail). Both SwissProt and TrEMBL IDs were stored (Supplementary Data S4). For each fusion oncoprotein, three alignments were extracted: the top overall alignment, the top alignment corresponding to the head gene, and the top alignment corresponding to the tail gene. Alignments included all isoforms. Maximum percent identity was calculated as the number of identical amino acids in the alignment divided by the length of the fusion sequence. BLAST alignments were also used to determine breakpoints by identifying the indices corresponding to the top head and tail alignments. Overlapping regions were labeled as breakpoint regions, and specific loci were manually annotated where applicable for visualization purposes.
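The two calculations described above can be illustrated with a small worked example. The sketch assumes that each alignment has already been reduced to a count of identical residues and an inclusive span on the fusion sequence. The function names and the toy alignment values are hypothetical, and no BLAST output is parsed.

```python
def percent_identity(n_identical, fusion_length):
    """Identical residues in the alignment divided by the full fusion length, in percent."""
    return 100.0 * n_identical / fusion_length

def breakpoint_region(head_span, tail_span):
    """Overlap of the fusion-sequence ranges covered by the top head and tail alignments.

    Spans are inclusive (start, end) positions on the fusion sequence.
    Returns None when the two alignments do not overlap.
    """
    start = max(head_span[0], tail_span[0])
    end = min(head_span[1], tail_span[1])
    return (start, end) if start <= end else None

# Toy fusion of 500 residues with one head and one tail alignment (hypothetical values).
fusion_length = 500
head = {"identities": 255, "span": (1, 260)}
tail = {"identities": 248, "span": (250, 500)}

max_identity = max(percent_identity(a["identities"], fusion_length) for a in (head, tail))
print(max_identity)                                   # 51.0
print(breakpoint_region(head["span"], tail["span"]))  # (250, 260)
```

Dividing by the full fusion length rather than the alignment length means that a perfect match to only the head or only the tail still yields an identity well below 100%.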
Benchmarking dataset curation
To evaluate FusOn-pLM, datasets were curated for three benchmarking tasks: puncta formation and localization, IDR ensemble dimensions, and intrinsic disorder prediction. Data for puncta formation and localization were collected from FOdb2, which includes 178 fusion oncoproteins with experimentally validated results. Train-test splits from FOdb were used, with 149 sequences for training and 29 for testing across three tasks: puncta formation propensity, nuclear localization, and cytoplasmic localization. Class distributions were maintained as reported in FOdb2. For IDR ensemble dimensions, 47,114 IDR sequences from synthetic and natural proteins were sourced from a published dataset33. Labels included asphericity, end-to-end radius (Re), radius of gyration (Rg), and polymer scaling exponent. Sequences were clustered using MMSeqs2 with a minimum sequence identity of 30% and split into training (80%), validation (10%), and testing (10%) sets49. Data distributions were normalized as needed, and sequences with multiple labels for the same property were averaged. Final dataset sizes were 47,114 for asphericity, 42,868 for Re, 22,912 for Rg, and 40,637 for the scaling exponent. For disorder prediction, training data included 5304 unique sequences after cleaning and deduplication, with 5264 sequences from IDP-CRF50 and 536 sequences from flDPnn51. The testing dataset comprised 210 gold-standard sequences from the CAID2 Disorder-NOX dataset with per-residue annotations indicating disorder (1) or structure (0)29,34. FusOn-pLM-Diso was trained on the combined dataset and benchmarked on Disorder-NOX. To analyze disorder in fusion oncoproteins, pseudo-labels were generated using AlphaFold-pLDDT scores, where residues with pLDDT <68.8 were labeled as disordered17. Structures for 523 fusion oncoproteins in the FusOn-pLM test set were obtained from FusionPDB26. The BeautifulSoup package (version 4.12.2) in Python was used to scrape FusionPDB for structure download links.
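The pLDDT-based pseudo-labeling rule can be written in a few lines. The sketch below applies the stated threshold (pLDDT < 68.8 = disordered) to a list of per-residue scores; the function names and the toy scores are illustrative.

```python
PLDDT_DISORDER_THRESHOLD = 68.8  # residues strictly below this value are labeled disordered

def disorder_labels(plddt_scores):
    """Per-residue pseudo-labels: 1 = disordered, 0 = structured."""
    return [1 if score < PLDDT_DISORDER_THRESHOLD else 0 for score in plddt_scores]

def percent_disordered(plddt_scores):
    """Percentage of residues in one sequence labeled as disordered."""
    labels = disorder_labels(plddt_scores)
    return 100.0 * sum(labels) / len(labels)

# Toy per-residue pLDDT scores for an 8-residue sequence (hypothetical values).
scores = [92.1, 85.0, 68.8, 68.7, 40.2, 30.0, 71.5, 55.0]
print(disorder_labels(scores))     # [0, 0, 0, 1, 1, 1, 0, 1]
print(percent_disordered(scores))  # 50.0
```

Averaging `percent_disordered` over a set of sequences gives the kind of per-group disorder content reported for fusion oncoproteins, heads, and tails.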
Embedding exploration dataset curation
FusOn-pLM embeddings of transcription factor (TF) and kinase fusions were visualized in 2D plots. To efficiently determine which fusion oncoproteins possessed TF heads and tails or kinase heads and tails, a categorized list of fusion head and tail genes was consulted52. 595 fusion oncoproteins from FusOn-DB (364 TF::TF and 231 Kinase::Kinase) were identified.
Model architecture and training
FusOn-pLM is based on ESM-2-650M, a 33-layer transformer model pre-trained on UniRef50, and was fine-tuned to generate fusion oncoprotein-specific representations. To adapt ESM-2-650M for this task without overfitting, the final eight layers of the model were selectively fine-tuned. Specifically, the key, query, and value weight matrices of the self-attention mechanism in these layers were unfrozen, while earlier layers remained fixed. The multi-head self-attention mechanism is parameterized such that the attention output is computed as a weighted sum of values \(V=W_{v}h\), where the weights are derived from the scaled dot-product of queries \(Q=W_{q}h\) and keys \(K=W_{k}h\). For fine-tuning, the learnable parameters \(W_{q}\), \(W_{k}\), and \(W_{v}\) in the last eight layers were updated, enabling task-specific adaptation to fusion oncoproteins while preserving the general-purpose representations learned during pre-training.
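The selective unfreezing described above amounts to choosing which parameters receive gradient updates. The sketch below expresses that choice as a filter on parameter names. The naming pattern (`encoder.layer.<i>.attention.self.{query,key,value}.weight`, with 0-based layer indices) is an assumption modeled on a common Transformer checkpoint layout and should be checked against the actual model. The function name is hypothetical, and bias terms are left frozen here because only weight matrices are mentioned.

```python
import re

# Assumed parameter-name layout; verify against the real checkpoint before use.
QKV_PATTERN = re.compile(
    r"encoder\.layer\.(\d+)\.attention\.self\.(query|key|value)\.weight$"
)

def is_trainable(param_name, n_layers=33, n_unfrozen=8):
    """True only for query/key/value weight matrices in the last n_unfrozen layers."""
    match = QKV_PATTERN.search(param_name)
    if match is None:
        return False
    layer_index = int(match.group(1))
    return layer_index >= n_layers - n_unfrozen  # layers 25..32 for a 33-layer model

# With a PyTorch model, the filter could be applied as:
#   for name, param in model.named_parameters():
#       param.requires_grad = is_trainable(name)

examples = [
    "esm.encoder.layer.24.attention.self.query.weight",    # frozen: layer too early
    "esm.encoder.layer.25.attention.self.query.weight",    # trainable
    "esm.encoder.layer.32.attention.self.value.weight",    # trainable
    "esm.encoder.layer.32.attention.output.dense.weight",  # frozen: not Q/K/V
]
for name in examples:
    print(name, is_trainable(name))
```

Restricting updates to a small subset of attention weights in the upper layers leaves the lower layers, and hence most of the pre-trained representation, unchanged.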
A cosine-scheduled masking strategy was employed during training to dynamically vary the masking rate.
Let \(x=\left({x}_{1},{x}_{2},\ldots,{x}_{n}\right)\) be the input amino acid sequence of length n. Define M as the set of masked positions such that \({|M|}=\lceil r\cdot n\rceil\), where the masking rate r varies within each training epoch according to a cosine schedule. The masking rate at step t within an epoch of T steps is given by:
\[{r}_{t}={r}_{\min }+\frac{1}{2}\left({r}_{\max }-{r}_{\min }\right)\left(1-\cos \left(\frac{\pi t}{T}\right)\right)\]
where \({r}_{\min }=0.15\) and \({r}_{\max }=0.40\). At the start of each epoch, \({r}_{t}\) is reset to \({r}_{\min }\); it then increases to \({r}_{\max }\) by the end of the epoch and returns to \({r}_{\min }\) at the beginning of the next epoch.
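As an illustration, one half-cosine ramp that matches the endpoints described here (starting at 0.15 and reaching 0.40 by the end of the epoch) can be written in a few lines. The exact functional form is an assumption consistent with the text, and the function name is hypothetical.

```python
import math

R_MIN = 0.15  # lower masking rate stated in the text
R_MAX = 0.40  # upper masking rate stated in the text


def masking_rate(t, T, r_min=R_MIN, r_max=R_MAX):
    """Half-cosine ramp: r_min at step t = 0, rising smoothly to r_max at t = T.

    Calling this with t reset to 0 at each epoch boundary reproduces the
    per-epoch cycling described in the text.
    """
    return r_min + 0.5 * (r_max - r_min) * (1.0 - math.cos(math.pi * t / T))
```

Under this form the rate changes slowly near the start and end of the epoch and fastest in the middle, passing through the midpoint of the range at t = T/2.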
Masked positions are selected uniformly at random from the set \(\{1,2,\ldots,n\}\) without replacement. Mathematically, the selection of M is described as:
\[M\sim \mathrm{Uniform}\left(\left\{S\subseteq \{1,2,\ldots,n\}:{|S|}=\lceil r\cdot n\rceil \right\}\right)\]
All selected positions are replaced with a special mask token. The MLM objective is computed as:
\[{{\mathcal{L}}}_{\mathrm{MLM}}=-\sum _{i\in M}\log P\left({x}_{i}\,|\,{x}_{\setminus M}\right)\]
where \({x}_{i}\) is the true amino acid at position i, and \({x}_{\setminus M}\) represents the sequence with masked tokens excluded.
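The position sampling and the masked-token objective can be sketched without any deep learning framework. Here the model is replaced by a table of per-position probabilities, which is a stand-in for illustration only; the helper names and the 0-indexed positions are assumptions, and the loss is shown as an unnormalized sum over masked positions.

```python
import math
import random

MASK_TOKEN = "<mask>"


def choose_mask_positions(n, r, rng):
    """Sample ceil(r * n) distinct positions uniformly at random (0-indexed)."""
    k = min(n, math.ceil(r * n))
    return sorted(rng.sample(range(n), k))


def apply_mask(tokens, positions):
    """Return a copy of the token list with the chosen positions masked."""
    masked = list(tokens)
    for i in positions:
        masked[i] = MASK_TOKEN
    return masked


def mlm_loss(true_tokens, positions, predicted_probs):
    """Negative log-likelihood of the true residues at the masked positions.

    predicted_probs[i] maps each amino acid to the probability the model
    assigns at position i (a stand-in for real model output).
    """
    return -sum(math.log(predicted_probs[i][true_tokens[i]]) for i in positions)
```

In practice, frameworks usually average this quantity over masked tokens in a batch rather than summing it, which changes the scale of the loss but not its minimizer.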
A visualization of the masking strategy is shown in Fig. 1.
FusOn-pLM was trained on one NVIDIA H100 GPU with 80 GB of VRAM for 30 epochs with a batch size of 8 and a learning rate of 3e-4. The Adam optimizer was utilized with no weight decay. Only fusion oncoproteins of length 2000 or shorter were used for training; shorter sequences were padded to this maximal length.
Fusion oncoprotein property benchmarks
Embedding performance on predicting the propensity of puncta formation, as well as predicting whether puncta form in the nucleus or cytoplasm, was evaluated. Here, sequences from FOdb with conclusive experimental data on puncta formation were utilized for pLM embedding evaluation2. FOdb tested 195 total FOs for puncta formation, but only used the 178 with conclusive results to train the FO-Puncta ML model. Puncta formation and localization predictions were treated as binary classification tasks, where label 0 or 1 represented a lack or presence of puncta formation in a given area. FusOn-pLM embeddings were compared against three others: (1) base wild-type ESM-2-650M embeddings, (2) ProtT5-XL-UniRef50 embeddings19, and (3) FOdb embeddings2, which are 25 physicochemical features manually curated by FOdb for only the 195 proteins. The standard binary cross-entropy loss function was minimized for each task using the XGBoost model with 50 trees via xgboost (version 1.7.5)53. The binary cross-entropy loss is defined as:
\[{{\mathcal{L}}}_{\mathrm{BCE}}=-\frac{1}{N}\sum _{i=1}^{N}\left[{y}_{i}\log \left({p}_{i}\right)+\left(1-{y}_{i}\right)\log \left(1-{p}_{i}\right)\right]\]
where N is the number of samples, \({y}_{i}\) is the true binary label of sample i, and \({p}_{i}\) is the predicted probability of the positive class.
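For reference, the standard binary cross-entropy can be computed directly from labels and predicted probabilities. This is a generic sketch of the textbook definition rather than the XGBoost internals; the clipping constant `eps` is an added assumption to avoid taking the logarithm of zero.

```python
import math


def binary_cross_entropy(labels, probs, eps=1e-12):
    """Mean binary cross-entropy over paired labels (0 or 1) and probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)
```

An uninformative classifier that always predicts 0.5 scores ln 2 (about 0.693) under this loss, which gives a useful baseline when reading training curves. In XGBoost, the `binary:logistic` objective optimizes this same quantity.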
Disorder property benchmark
Disorder properties were evaluated by training regression models that used FusOn-pLM embeddings of IDRs to predict four ensemble features: asphericity, Re, Rg, and polymer scaling exponent33. For each property, a separate FusOn-pLM-IDR regression model was trained. These models fed FusOn-pLM embeddings through a multi-layer perceptron (MLP) network with three fully connected layers (Fig. 4A). The input layer performed dimensionality reduction to hidden dimension 640 and passed the output through a ReLU activation function, followed by layer normalization and dropout regularization with a probability of 0.2. This structure was repeated for two more iterations, shrinking the hidden dimension to 320 and finally culminating in a single neuron: the predicted value of the property. Each model was trained to minimize the mean square error (MSE), and early stopping was implemented to prevent overfitting. The MSE loss function is defined by:
\[\mathrm{MSE}=\frac{1}{N}\sum _{i=1}^{N}{\left({y}_{i}-{\hat{y}}_{i}\right)}^{2}\]
where N is the number of samples, \({y}_{i}\) is the true value of the property for sample i, and \({\hat{y}}_{i}\) is the predicted value.
Models were evaluated on a held-out test set by predicting each property given the sequence embedding alone. The coefficient of determination (\({R}^{2}\)) between predictions and labels was calculated for each model to assess goodness of fit. To maximize \({R}^{2}\), a hyperparameter screen across two batch sizes (32, 64) and five learning rates (1e-5, 1e-4, 3e-4, 1e-3, 3e-3) was performed (Supplementary Data S5). The true values and predicted values were plotted in Matplotlib (version 3.7.2), with an ideal fit line included for reference. The entire process was repeated using ESM-2-650M embeddings rather than FusOn-pLM embeddings (Supplementary Fig. 2).
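The evaluation quantities and the size of the hyperparameter screen can be made concrete with a short sketch. The functions implement the textbook definitions of MSE and the coefficient of determination; the constant and function names are hypothetical, and the grid simply enumerates the batch sizes and learning rates listed above.

```python
from itertools import product

BATCH_SIZES = (32, 64)
LEARNING_RATES = (1e-5, 1e-4, 3e-4, 1e-3, 3e-3)


def mean_squared_error(y_true, y_pred):
    """Average squared difference between labels and predictions."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


def hyperparameter_grid():
    """All (batch size, learning rate) combinations in the screen."""
    return list(product(BATCH_SIZES, LEARNING_RATES))
```

The grid contains ten configurations per property. Note that this definition of the coefficient of determination can be negative for a model that performs worse than predicting the mean label.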
CAID benchmark
FusOn-pLM’s ability to predict intrinsic disorder was evaluated using a per-residue disorder prediction benchmark based on the CAID2 Disorder-NOX dataset29,34. Binary labels indicating whether each residue is disordered (1) or structured (0) were used to train FusOn-pLM-Diso, a per-residue disorder predictor. The predictor employs a multi-head self-attention Transformer architecture, minimizing binary cross-entropy loss. Hyperparameter optimization was performed for the number of attention heads (5, 8, 10), Transformer layers (2, 4, 6), and dropout rates (0.2, 0.5) (Supplementary Data S6). Models were trained for 2 epochs with a learning rate of 5e-5, and optimal hyperparameters were selected by maximizing AUROC. An equivalent model, ESM-2-650M-Diso, was trained using ESM-2-650M embeddings for comparison. Both models were trained and evaluated on the CAID2 Disorder-NOX dataset29,34, with per-residue predictions used for benchmarking. Predicted per-residue disorder probabilities were computed for each input sequence, and binary predictions were made using thresholds selected to optimize classification performance metrics. To extend the analysis to fusion oncoproteins, per-residue disorder predictions were made for sequences with available AlphaFold2 structures14. Percentage disorder was calculated by dividing the number of predicted disordered residues by sequence length. Additionally, predicted per-residue disorder probabilities were mapped onto 3D protein structures for visualization. AlphaFold2’s pLDDT metric was used as a reference for structural disorder to aid in the assessment of predicted regions14.
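The percentage-disorder calculation and the pLDDT-based reference labels can be sketched as follows. The pLDDT cutoff of 68.8 is the value given in the dataset curation section; the probability threshold of 0.5 is a placeholder assumption, since the text states that thresholds were selected to optimize classification metrics without reporting a value here. Function names are hypothetical.

```python
PLDDT_CUTOFF = 68.8  # residues below this pLDDT are pseudo-labeled as disordered


def binarize(probabilities, threshold=0.5):
    """Convert per-residue disorder probabilities to 0/1 calls.

    The default threshold is a placeholder, not the tuned value from the study.
    """
    return [1 if p >= threshold else 0 for p in probabilities]


def percent_disorder(binary_labels):
    """Predicted disordered residues divided by sequence length, as a percentage."""
    return 100.0 * sum(binary_labels) / len(binary_labels)


def plddt_pseudo_labels(plddt_scores, cutoff=PLDDT_CUTOFF):
    """Label a residue disordered (1) when its pLDDT falls below the cutoff."""
    return [1 if score < cutoff else 0 for score in plddt_scores]
```

Comparing `percent_disorder(binarize(probs))` against `percent_disorder(plddt_pseudo_labels(plddt))` for the same protein gives a quick per-sequence check of agreement between the predictor and the structural reference.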
Embedding exploration
To explore how FusOn-pLM embeddings capture the physicochemical and functional properties of fusion oncoproteins, we first conducted a dimensionality reduction analysis on the embeddings of fusion oncoproteins and of their head and tail proteins using Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP)54 via the umap module (version 0.5.6). The FusOn-pLM embeddings of six highly studied fusion oncoproteins (EWSR1::FLI1, PAX3::FOXO1, BCR::ABL1, CIC::DUX4, SS18::SSX1, and EML4::ALK) and their respective head and tail proteins (identified by BLAST search against SwissProt) were transformed by UMAP and plotted (Supplementary Fig. 3B). Additionally, 364 transcription factor (TF) fusions, in which both head and tail were TFs, and 231 kinase fusions, in which both head and tail were kinases, were embedded and plotted in UMAP coordinates (Supplementary Fig. 3A).
Zero-shot mutation prediction
Zero-shot mutation prediction was performed on a set of fusion oncoproteins. For each protein, the sequence was input to FusOn-pLM with its MLM head L times, where L is the protein length. During each iteration, a single <mask> token was introduced at a different position in the sequence, and only this position was predicted (unmasked) by the model. The raw logits for each of the twenty amino acids at the masked position were recorded. These logits were ranked in descending order, creating a list of the most to least likely amino acids predicted at that position. The top three predicted amino acids, based on their logits, were considered the “top 3 mutations.”
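The single-position masking scan can be outlined in a model-agnostic way. In this sketch, `score_fn` is a hypothetical stand-in for a forward pass through the pLM and its MLM head, assumed to return a dictionary of raw logits over the twenty amino acids at the masked position; all names are illustrative.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK_TOKEN = "<mask>"


def masked_variants(sequence):
    """Yield (position, tokens) pairs with exactly one position masked (0-indexed)."""
    for i in range(len(sequence)):
        tokens = list(sequence)
        tokens[i] = MASK_TOKEN
        yield i, tokens


def top_k_amino_acids(logits, k=3):
    """Rank amino acids by raw logit, highest first, and keep the top k."""
    ranked = sorted(logits, key=logits.get, reverse=True)
    return ranked[:k]


def scan(sequence, score_fn, k=3):
    """Run one masked prediction per position and collect the top-k residues.

    score_fn(tokens, position) -> {amino acid: logit} is a placeholder for
    the language model; it is called once per residue, L times in total.
    """
    return {
        i: top_k_amino_acids(score_fn(tokens, i), k)
        for i, tokens in masked_variants(sequence)
    }
```

Because each position requires its own forward pass, the cost of the scan grows linearly with sequence length, which in practice motivates batching the masked variants.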
Heatmaps of the logits for the original amino acid at each position were constructed for representative fusion oncoproteins: EWSR1::FLI1, PAX3::FOXO1, and TRIM24::RET. Functional domains were identified using UniProt annotations for the reviewed SwissProt accession corresponding to the head and tail genes. Residue positions for these domains were converted from their coordinates on the original head or tail protein to their corresponding positions on the fusion protein using string indexing in Python. A binary conservation label was applied to logits, with values <0.7 designated as non-conserved (0) and values >0.7 as conserved (1).
Sequences for EML4::ALK and BCR::ABL1 were generously provided by the authors of Elshatlawy et al.39 and O’Hare et al.40 and were screened through the zero-shot mutation pipeline. Positions corresponding to known drug resistance mutations, as reported in the literature, were evaluated to determine whether one of the top three predicted amino acids matched a reported mutation (“hit”) or did not (“miss”). For positions where the original amino acid was among the top three predicted tokens, an additional token was included in the analysis. Structural models for these sequences were folded in AlphaFold2 and visualized using PyMOL.
Potential mutations in ETV6::NTRK3 were also predicted using the zero-shot prediction pipeline41. Literature-reported mutations in NTRK3 coordinates were converted to the corresponding positions in ETV6::NTRK3 coordinates. For example, NTRK3 G623R and G696A became ETV6::NTRK3 G431R and G504A, respectively. These positions were evaluated as “hit” or “miss” based on whether the top three predicted mutations included the correct token. Structural predictions were obtained from FusionPDB and visualized in PyMOL. Additionally, the top five mutations were identified as those with the smallest logits for the original amino acid.
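The coordinate conversion and hit/miss evaluation can be sketched with simple string handling. The offset of 192 is inferred from the example in the text (NTRK3 positions 623 and 696 correspond to fusion positions 431 and 504) and applies only to that construct; the function names and the 1-indexed prediction table are assumptions for illustration.

```python
import re

MUTATION_PATTERN = re.compile(r"([A-Z])(\d+)([A-Z])")


def parse_mutation(mutation):
    """Split a mutation such as 'G623R' into (wild type, position, mutant)."""
    match = MUTATION_PATTERN.fullmatch(mutation)
    if match is None:
        raise ValueError(f"Unrecognized mutation format: {mutation}")
    return match.group(1), int(match.group(2)), match.group(3)


def to_fusion_coordinates(mutation, offset):
    """Shift a mutation from partner-protein numbering to fusion numbering."""
    wild_type, position, mutant = parse_mutation(mutation)
    return f"{wild_type}{position - offset}{mutant}"


def is_hit(mutation, top_predictions):
    """True when the mutant residue is among the predictions at that position.

    top_predictions maps 1-indexed fusion positions to lists of residues.
    """
    _, position, mutant = parse_mutation(mutation)
    return mutant in top_predictions.get(position, [])
```

For the ETV6::NTRK3 example, `to_fusion_coordinates("G623R", 192)` yields `"G431R"`, and the resulting mutation can then be checked against the top three predicted residues at position 431.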
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
All data needed to evaluate the conclusions are presented in the paper and tables. The FusOn-DB dataset can be found at https://huggingface.co/datasets/ChatterjeeLab/FusOn-DB. Source data are provided with this paper.
Code availability
The code used to develop the model, perform the analyses, and generate the results in this study is publicly available at https://huggingface.co/ChatterjeeLab/FusOn-pLM, under a Creative Commons Attribution Non Commercial No Derivatives 4.0 license. The specific version of the code associated with this publication is accessible via https://doi.org/10.57967/hf/4218 (ref. 55) and is archived at the following Zenodo repository: https://doi.org/10.5281/zenodo.14706684.
References
Rabbitts, T. H. Chromosomal translocations in human cancer. Nature 372, 143–149 (1994).
Tripathi, S. et al. Defining the condensate landscape of fusion oncoproteins. Nat. Commun. 14, 6008 (2023).
Angione, S. D. A. et al. Fusion oncoproteins in childhood cancers: potential role in targeted therapy. J. Pediatr. Pharmacol. Ther. 26, 541–555 (2021).
Delattre, O. et al. Gene fusion with an ETS DNA-binding domain caused by chromosome translocation in human tumours. Nature 359, 162–165 (1992).
Linardic, C. M. PAX3-FOXO1 fusion gene in rhabdomyosarcoma. Cancer Lett. 270, 10–18 (2008).
McBride, M. J. et al. The SS18-SSX fusion oncoprotein hijacks baf complex targeting and function to drive synovial sarcoma. Cancer Cell 33, 1128–1141.e7 (2018).
Soda, M. et al. Identification of the transforming EML4-ALK fusion gene in non-small-cell lung cancer. Nature 448, 561–566 (2007).
Erkizan, H. V. et al. A small molecule blocking oncogenic protein EWS-FLI1 interaction with RNA helicase A inhibits growth of Ewing’s sarcoma. Nat. Med. 15, 750–756 (2009).
Vital, T. et al. MS0621, a novel small-molecule modulator of Ewing sarcoma chromatin accessibility, interacts with an RNA-associated macromolecular complex and influences RNA splicing. Front. Oncol. 13, 1099550 (2023).
Carter, P. J. & Rajpal, A. Designing antibodies as therapeutics. Cell 185, 2789–2805 (2022).
Hie, B. L. et al. Efficient evolution of human antibodies from general protein language models. Nat. Biotechnol. 42, 275–283 (2023).
Ham, J. M., Kim, M., Kim, T., Ryu, S. E. & Park, H. Structure-based De Novo design for the discovery of miniprotein inhibitors targeting oncogenic mutant BRAF. Int. J. Mol. Sci. 25, 5535 (2024).
Vadevoo, S. M. P. et al. Peptides as multifunctional players in cancer therapy. Exp. Mol. Med. 55, 1099–1109 (2023).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Abramson, J. et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 630, 493–500 (2024).
Watson, J. L. et al. De novo design of protein structure and function with RFdiffusion. Nature 620, 1089–1100 (2023).
Piovesan, D., Monzon, A. M. & Tosatto, S. C. E. Intrinsic protein disorder and conditional folding in AlphaFoldDB. Protein Sci. 31, e4466 (2022).
Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).
Elnaggar, A. et al. ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 44, 7112–7127 (2022).
Ferruz, N., Schmidt, S. & Höcker, B. ProtGPT2 is a deep unsupervised language model for protein design. Nat. Commun. 13, 4348 (2022).
Madani, A. et al. Large language models generate functional protein sequences across diverse families. Nat. Biotechnol. 41, 1099–1106 (2023).
Brixi, G. et al. SaLT&PepPr is an interface-predicting language model for designing peptide-guided protein degraders. Commun. Biol. 6, 1081 (2023).
Bhat, S. et al. De novo design of peptide binders to conformationally diverse targets with contrastive language modeling. Sci. Adv. 11, adr368 (2025).
Chen, T. et al. PepMLM: target sequence-conditioned generation of therapeutic peptide binders via span masked language modeling. Preprint at https://arxiv.org/abs/2310.03842 (2024).
Verma, S. K., Witkin, K. L., Sharman, A. & Smith, M. A. Targeting fusion oncoproteins in childhood cancers: challenges and future opportunities for developing therapeutics. J. Natl Cancer Inst. 116, 1012–1018 (2024).
Kumar, H., Tang, L.-Y., Yang, C. & Kim, P. FusionPDB: a knowledgebase of human fusion proteins. Nucleic Acids Res. 52, D1289–D1304 (2024).
Suzek, B. E. et al. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 31, 926–932 (2015).
UniProt Consortium. UniProt: the Universal Protein Knowledgebase in 2023. Nucleic Acids Res. 51, D523–D531 (2023).
Necci, M. et al. Critical assessment of protein intrinsic disorder prediction. Nat. Methods 18, 472–481 (2021).
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. Proceedings of NAACL-HLT 4171–4186 (2019).
Wettig, A., Gao, T., Zhong, Z. & Chen, D. Should you mask 15% in masked language modeling? Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics. 2985–3000 (2023).
Sahoo, S. S. et al. Simple and effective masked diffusion language models. Conference on Neural Information Processing Systems (2024).
Lotthammer, J. M., Ginell, G. M., Griffith, D., Emenecker, R. J. & Holehouse, A. S. Direct prediction of intrinsically disordered protein conformational properties from sequence. Nat. Methods 21, 465–476 (2024).
Del Conte, A. et al. Critical assessment of protein intrinsic disorder prediction (CAID) - Results of round 2. Proteins Struct. Funct. Bioinform. 91, 1925–1934 (2023).
Zhang, R., Dong, L. & Yu, J. Concomitant pathogenic mutations and fusions of driver oncogenes in tumors. Front. Oncol. 10, 544579 (2020).
Asante, Y. et al. PAX3-FOXO1 uses its activation domain to recruit CBP/P300 and shape RNA Pol2 cluster distribution. Nat. Commun. 14, 1–19 (2023).
Crose, L. E. S. et al. Alveolar rhabdomyosarcoma-associated PAX3-FOXO1 promotes tumorigenesis via Hippo pathway suppression. J. Clin. Invest. 124, 285–296 (2014).
Cohen, P., Cross, D. & Jänne, P. A. Kinase drug discovery 20 years after imatinib: progress and future directions. Nat. Rev. Drug Discov. 20, 551–569 (2021).
Elshatlawy, M., Sampson, J., Clarke, K. & Bayliss, R. EML4-ALK biology and drug resistance in non-small cell lung cancer: a new phase of discoveries. Mol. Oncol. 17, 950–963 (2023).
O’Hare, T., Eide, C. A. & Deininger, M. W. N. Bcr-Abl kinase domain mutations, drug resistance, and the road to a cure for chronic myeloid leukemia. Blood 110, 2242–2249 (2007).
Drilon, A. et al. Efficacy of larotrectinib in TRK fusion-positive cancers in adults and children. N. Engl. J. Med. 378, 731–739 (2018).
Vicente-García, C. et al. Regulatory landscape fusion in rhabdomyosarcoma through interactions between the PAX3 promoter and FOXO1 regulatory elements. Genome Biol. 18, 106 (2017).
Yu, L., Davis, I. J. & Liu, P. Regulation of EWSR1-FLI1 function by post-transcriptional and post-translational modifications. Cancers 15, 382 (2023).
Thalhammer, V. et al. PLK1 phosphorylates PAX3-FOXO1, the inhibition of which triggers regression of alveolar Rhabdomyosarcoma. Cancer Res. 75, 98–110 (2015).
Pan, S. & Chen, R. Pathological implication of protein post-translational modifications in cancer. Mol. Aspects Med. 86, 101097 (2022).
Peng, Z., Schussheim, B. & Chatterjee, P. PTM-Mamba: a PTM-aware protein language model with bidirectional gated mamba blocks. Preprint at bioRxiv https://www.biorxiv.org/content/10.1101/2024.02.28.581983v1 (2024). In press.
Hou, X., Zaks, T., Langer, R. & Dong, Y. Lipid nanoparticles for mRNA delivery. Nat. Rev. Mater. 6, 1078–1094 (2021).
Wang, J.-H., Gessler, D. J., Zhan, W., Gallagher, T. L. & Gao, G. Adeno-associated virus as a delivery vector for gene therapy of human diseases. Signal Transduct. Target. Ther. 9, 78 (2024).
Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).
Liu, Y., Wang, X. & Liu, B. IDP−CRF: Intrinsically disordered protein/region identification based on conditional random fields. Int. J. Mol. Sci. 19, 2483 (2018).
Hu, G. et al. flDPnn: Accurate intrinsic disorder prediction with putative propensities of disorder functions. Nat. Commun. 12, 1–8 (2021).
Salokas, K., Weldatsadik, R. G. & Varjosalo, M. Human transcription factor and protein kinase gene fusions in human cancer. Sci. Rep. 10, 14169 (2020).
Buitinck, L. et al. API design for machine learning software: experiences from the scikit-learn project. European Conference on Machine Learning and Principles and Practices of Knowledge Discovery in Databases (2013).
McInnes, L., Healy, J. & Melville, J. UMAP: uniform manifold approximation and projection for dimension reduction. Preprint at https://arxiv.org/abs/1802.03426 (2018).
Vincoff, S. et al. FusOn-pLM. Hugging Face https://doi.org/10.57967/hf/4218 (2024).
Acknowledgements
We thank Mark III Systems and the Duke Computing Cluster for computing support. We further thank Zhangzhi Peng, Yinuo Zhang, and Tianlai Chen for their insights related to the manuscript. We thank Lauren Hong for rendering the FusOn-pLM logo. The work was supported by the National Cancer Institute (Awards #R21CA278468 and #3U54CA231630-01A1S4), the Wallace H. Coulter Foundation, and The Hartwell Foundation.
Author information
Contributions
S.V. designed and implemented masking strategies and trained FusOn-pLM. S.V., S.G., K.K., R.P., and P.V. performed model benchmarking and visualizations. S.V. and P.C. wrote and reviewed the manuscript. P.C. conceived, designed, directed, and supervised the study.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Bargeen Turzo, Gian Gaetano Tartaglia and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Vincoff, S., Goel, S., Kholina, K. et al. FusOn-pLM: a fusion oncoprotein-specific language model via adjusted rate masking. Nat Commun 16, 1436 (2025). https://doi.org/10.1038/s41467-025-56745-6
DOI: https://doi.org/10.1038/s41467-025-56745-6