FusOn-pLM: a fusion oncoprotein-specific language model via adjusted rate masking

Vincoff, Sophia; Goel, Shrey; Kholina, Kseniia; Pulugurta, Rishab; Vure, Pranay; Chatterjee, Pranam

doi:10.1038/s41467-025-56745-6

Article
Open access
Published: 07 February 2025

Sophia Vincoff¹,
Shrey Goel²,
Kseniia Kholina¹,
Rishab Pulugurta¹,
Pranay Vure¹ &
…
Pranam Chatterjee ORCID: orcid.org/0000-0003-3957-8478^1,2,3

Nature Communications volume 16, Article number: 1436 (2025)

Abstract

Fusion oncoproteins, a class of chimeric proteins arising from chromosomal translocations, are major drivers of various pediatric cancers. These proteins are intrinsically disordered and lack druggable pockets, making them highly challenging therapeutic targets for both small molecule-based and structure-based approaches. Protein language models (pLMs) have recently emerged as powerful tools for capturing physicochemical and functional protein features but have yet to be trained on fusion oncoprotein sequences. We introduce FusOn-pLM, a fine-tuned pLM trained on a newly curated, comprehensive set of fusion oncoprotein sequences, FusOn-DB. Employing a unique cosine-scheduled masked language modeling strategy, FusOn-pLM dynamically adjusts masking rates (15%–40%) to optimize feature extraction and representation quality, surpassing baseline embeddings in fusion-specific tasks, including localization, puncta formation, and disorder prediction. FusOn-pLM uniquely predicts drug-resistant mutations, providing insights for therapeutic design that anticipates resistance mechanisms. In total, FusOn-pLM provides biologically relevant representations for advancing therapeutic discovery in fusion-driven cancers.

Introduction

Fusion oncoproteins arise from chromosomal rearrangements that fuse segments of two distinct genes (Fig. 1A)¹. The resulting mutants contain unrelated functional domains connected by long regions of disorder². This flexible configuration promotes constitutive activation or aberrant regulation of the fusion proteins, driving oncogenic transformation and tumor development³. Thousands of unique fusion oncoproteins have been discovered by sequencing patient tumors, and several common culprits such as EWSR1::FLI1 in Ewing’s sarcoma⁴, PAX3::FOXO1 in alveolar rhabdomyosarcoma^4,5, SS18::SSX1 in synovial sarcoma⁶, and EML4::ALK proteins in non-small-cell lung cancer⁷ are well characterized in the literature. However, even the best-understood fusion oncoproteins have proven to be elusive drug targets due to their structural instability and absence of defined binding pockets². Small molecules that do bind fusion oncoproteins, such as those reported for EWSR1::FLI1^8,9, do not achieve strict fusion specificity: they also bind one of the head or tail protein counterparts, which are often critical regulators of cellular homeostasis. As such, biologics, such as antibodies, miniproteins, and peptides, represent attractive therapeutic alternatives but necessitate advanced design approaches for specific targeting of these undruggable proteins^10,11,12,13.

**Fig. 1: Overview of fusion oncoproteins (FOs).**

Recently, structure-based prediction and design models, such as AlphaFold and RFDiffusion^14,15,16, have accelerated the design of biologics targeting pathogenic proteins. These tools, by default, fail to accurately capture the structure of numerous conformationally unstable proteins, limiting their usefulness for fusion oncoprotein targeting¹⁷. Meanwhile, protein language models (pLMs), such as ESM-2 and ProtT5, have been trained on millions of protein sequences, from the exceedingly stable to the intrinsically disordered^18,19. They capture physicochemical, structural, and functional properties of proteins from their sequence alone, and have even been extended to design novel proteins^20,21 and binders^22,23,24. However, these models were not trained on fusion oncoprotein sequences, which are functionally and structurally distinct from their wild-type counterparts due to their altered binding sites and unique breakpoint junctions²⁵.

To fill this critical gap, we fine-tune the state-of-the-art ESM-2 pLM on 44,414 fusion oncoprotein sequences collected from the FusionPDB and FOdb databases, collectively termed the new FusOn-DB database^2,26. Training on FusOn-DB data, we unfreeze the query, key, and value weights of the final eight layers of the ESM-2-650M model and fine-tune these parameters using a masked language modeling (MLM) head. To enhance the model’s ability to learn the unique properties of fusion oncoproteins, we introduce a cosine-scheduled masking strategy, dynamically varying the masking rate from 15% to 40% during training. This approach enables our top-performing model, FusOn-pLM, to capture the distinct structural and functional features of fusion oncoproteins. As evidence, our results demonstrate that FusOn-pLM outperforms baseline embeddings on diverse fusion-specific tasks, including puncta formation propensity and the prediction of intrinsic disorder. Moreover, we showcase its utility in identifying drug-resistant mutations in fusion oncoproteins, highlighting its biological relevance and potential for advancing therapeutic design.

Results

Fusion oncoproteins comprise a distinct and diverse sequence dataset

ESM-2 was pretrained on ~65 million sequences from UniRef50, a database which includes over 9000 wild-type proteins known to act as the head or tail components of fusion oncoproteins²⁷. However, ESM-2 was not trained on the fusions themselves (Fig. 1B)¹⁸. By collecting fusion oncoprotein sequences from the FusionPDB and FOdb databases^2,26, two complementary resources that provide experimentally validated and computationally predicted fusion proteins with clinical or biological relevance, we assembled FusOn-DB, a comprehensive and non-redundant dataset of 44,414 fusion oncoprotein sequences. Running BLAST between FusOn-DB and SwissProt²⁸ (full results in Supplementary Data S1) revealed a wide distribution of sequence homology. On average, fusion oncoproteins shared 71.0% identity with the top-aligning SwissProt sequence, which corresponded to either the head or tail protein in 87% of cases. Over 12,000 fusion oncoproteins had <60% maximum identity, and over 5000 had <50% maximum identity.

Fusion oncoproteins are also characterized by a high level of structural disorder. AlphaFold2 structures of four highly-studied fusion oncoproteins (PAX3::FOXO1, EWSR1::FLI1, EML4::ALK, and SS18::SSX1) largely exhibit low (50–70) and very low (<50) confidence pLDDT scores, indicating extensive intrinsic disorder (Fig. 1C). These structural trends are consistent across various sequences for the same fusion genes, arising from different breakpoints (Fig. 1C). To quantify the difference in disorder between fusion oncoproteins and wild-type proteins, we used a well-validated threshold to assign disorder labels (pLDDT <68.8 = disordered)¹⁷ to each residue in a set of fusion oncoproteins. Fusion oncoproteins were 45.9% disordered on average, while head proteins were 33.7% disordered and tail proteins were 32.7% disordered (Fig. 1D). Similarly to fusion oncoproteins, the gold-standard disorder dataset Disorder-NOX²⁹ had a greater proportion of near-fully disordered proteins than fusion heads and tails. In contrast, fusion oncoproteins had a more right-skewed distribution (Supplementary Fig. 1). In total, these findings highlight the distinct sequence and structural characteristics of fusion oncoproteins, underscoring the need for better representations tailored to their properties.

Cosine-scheduled masking enables accurate fusion oncoprotein sequence recovery

Having curated a diverse dataset of fusion oncoproteins, we sought to fine-tune the standard ESM-2-650M model via an MLM objective (Fig. 2A)¹⁸. This classic training approach forces the model to reconstruct masked tokens from sequence context, refining representations to emphasize unique physicochemical properties of fusion oncoproteins. Fixed-rate masking at 15% is the established standard in most BERT-based MLM architectures^18,30, but fusion oncoproteins’ intrinsic complexity prompted us to explore higher and variable masking rates. Recent findings from Wettig et al. have demonstrated that increasing masking rates (up to 40%) improves performance by forcing the model to rely more heavily on sequence context for token reconstruction³¹. Additionally, varying the masking rate during training balances representation learning (improved by higher masking rates) with reconstruction quality (improved by lower masking rates)³². Motivated by these findings, we fine-tuned the final eight layers of ESM-2-650M using a cosine scheduler to dynamically adjust the masking rate from 15% to 40% across each training epoch (Fig. 2A), hypothesizing that this approach would maximize model performance by gradually increasing the difficulty of the reconstruction task.

Our results strongly validated this hypothesis. When evaluated on a 15% masked, held-out test set from FusOn-DB, our fine-tuned model consistently outperformed both fixed-rate masking strategies and the non-fine-tuned ESM-2-650M baseline (Fig. 2B). ESM-2-650M, which uses a static 15% masking rate during pre-training¹⁸, performed poorly, with a loss of 1.83 and a pseudo-perplexity of 6.24. As a note, pseudo-perplexity (pPL) is a metric adapted from language modeling to evaluate how well a model predicts masked tokens, with lower values indicating better reconstruction performance and overall sequence comprehension. While far better, fine-tuning FusOn-pLM with fixed masking rates of 15%, 20%, and 25% produced progressively higher loss and pPL values, reflecting the difficulty of optimizing both sequence reconstruction and representation learning with static masking. In contrast, cosine-scheduled masking achieved better performance across all tested ranges, with the best results observed for a masking range of 15–40% (loss: 1.28, pPL: 3.61). Further exploration of different adjusted-rate masking schedulers, including log-linear and stepwise strategies, demonstrated that the cosine scheduler still remained optimal, achieving the lowest loss and pPL values (1.28 and 3.61, respectively) (Fig. 2B).
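The reported loss and pPL pairs are related approximately by an exponential. The sketch below assumes the loss is a mean per-token negative log-likelihood in nats (the exact averaging used in the study is not stated here), so small differences from the reported values may reflect rounding of the loss.

```python
import math


def pseudo_perplexity(mean_masked_nll: float) -> float:
    """Exponentiate a mean per-token negative log-likelihood (in nats)."""
    return math.exp(mean_masked_nll)


# Loss values quoted in the text; the pPL printed here is only an approximation
# of the reported figures under the assumption stated above.
for name, loss in [("ESM-2-650M", 1.83), ("FusOn-pLM, cosine 15-40%", 1.28)]:
    print(f"{name}: loss={loss:.2f}, pPL~{pseudo_perplexity(loss):.2f}")
```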

FusOn-pLM generates fusion oncoprotein-relevant representations

To determine if FusOn-pLM produces relevant embeddings, we sought to evaluate its performance on downstream fusion oncoprotein-specific tasks. We first assessed the embeddings’ ability to accurately predict the formation and localization of puncta, which are critical in driving cancer pathology². Many fusion oncoproteins have been shown to form puncta via phase separation, and these condensates may localize to the nucleus and/or cytoplasm (Fig. 3A)². Experimental data describing the puncta formation and localization of 178 fusion oncoproteins were used to train three FusOn-pLM-Puncta models, consisting of FusOn-pLM embeddings fed into a gradient boosting (XGBoost) classifier (Fig. 3B). For puncta formation, FusOn-pLM embeddings outperform ESM-2-650M, ProtT5, and FOdb physicochemical embeddings on four relevant classification metrics across the entire held-out test dataset (Fig. 3C). We observed similar results when predicting localization to the nucleus, the primary location of fusion oncoproteins (Fig. 3D)³. While manually-curated FOdb embeddings perform strongly on cytoplasm localization prediction, FusOn-pLM embeddings prove most effective on critical metrics, such as AUROC (Fig. 3E). In total, these results indicate that FusOn-pLM learns representations capturing key properties encoded in fusion oncoprotein sequences.

**Fig. 3: FusOn-pLM embedding benchmarks on puncta prediction tasks.**

FusOn-pLM can accurately predict disordered content in wild-type and fusion oncoproteins

Given that fusions are structurally disordered, we hypothesized that FusOn-pLM’s embeddings may encode information pertinent to the properties of intrinsically disordered regions (IDRs). Specifically, we sought to predict: 1. Asphericity, which quantifies a protein’s ensemble shape and molecular conformation, 2. End-to-end radius (R_e), the average distance between the N-terminal and C-terminal residue, 3. Radius of gyration (R_g), the average distance between a protein’s residues and its center of mass, and 4. Polymer scaling exponent, which describes an IDR’s behavior when solvated in water³³. Individual FusOn-pLM-IDR regressors were trained on non-fusion IDR sequences for each property, using multi-layer perceptron (MLP) heads to predict the property values directly from FusOn-pLM embeddings (Fig. 4A). We demonstrate that FusOn-pLM-IDR models achieve a high coefficient of determination (R²) on all four properties, indicating a strong fit (Fig. 4B). We also find that FusOn-pLM and ESM-2-650M embeddings achieve nearly equivalent performance, signaling that FusOn-pLM did not overfit on fusion oncoproteins and lose ESM-2’s intrinsic ability to represent a wide range of proteins (Supplementary Fig. 2).

**Fig. 4: FusOn-pLM prediction of IDR properties and regions.**

Next, we sought to assess FusOn-pLM’s ability to identify IDRs within protein sequences. The FusOn-pLM-Diso model was trained to predict per-residue probabilities of disorder directly from FusOn-pLM embeddings (Fig. 4C). When evaluated on the Disorder-NOX dataset used in the CAID2 competition³⁴, FusOn-pLM-Diso achieved an AUROC of 0.825. Compared with a parallel architecture trained on ESM-2 embeddings (ESM-2-650M-Diso) and fourteen CAID2 competitors, FusOn-pLM-Diso ranked in the top 5 of all models (Fig. 4D)²⁹. We then questioned whether FusOn-pLM embeddings could accurately distinguish between structured and disordered residues in fusion oncoproteins, specifically. On a set of proteins from FusOn-pLM’s test set, FusOn-pLM-Diso achieved average accuracy, precision, recall, F1, and AUROC metrics all above 0.9. We also observed a strong correlation (R² = 0.84) between the disorder percentages predicted by FusOn-pLM-Diso and those derived from AlphaFold-pLDDT (Fig. 4E), further supporting the notion that FusOn-pLM embeddings capture the disorder properties of fusion oncoproteins. When visualizing the per-residue disorder probabilities for five well-studied fusion oncoproteins, we observe differential coloring between disordered and structured residues. We establish that FusOn-pLM correctly identifies structure in the α-helix and β-sheet-rich regions, coloring these areas dark blue (Fig. 4F). Overall, our results suggest that FusOn-pLM accurately encodes disorder-related information in its embeddings. Given that fusion oncoproteins are characterized by their disordered regions, we reason that FusOn-pLM embeddings more effectively represent fusion oncoproteins.

FusOn-pLM embeddings enable zero-shot discovery of relevant mutations

Fusion oncoproteins themselves are mutants, but they also have the potential to acquire additional mutations which can alter their structure, function, and druggability³⁵. Beyond property and disorder prediction, we sought to establish the biological utility and relevance of FusOn-pLM by performing zero-shot discovery via its MLM head, which can sequentially unmask each position in an input sequence, outputting residue probabilities per unmasked position (Fig. 5A). As with any pLM, within evolutionarily conserved domains, the logits corresponding to the original residue are much higher than for any alternate residue. For example, in the TF::Kinase fusion TRIM24::RET, FusOn-pLM correctly identifies TRIM24’s zinc finger domains and RET’s kinase domain as highly conserved (Fig. 5B). FusOn-pLM also identifies that the EWSR1 activation domain and FLI1 DNA-binding domain in EWSR1::FLI1 are unlikely to mutate (Fig. 5B). In PAX3::FOXO1, the DNA-binding domains of PAX3 are highly conserved, but the truncated DNA-binding domain of FOXO1 (25/75 amino acids) is less strongly conserved (Fig. 5B), corroborated by studies showing FOXO1’s DNA binding activity is not critical for fusion function^36,37. This result indicates that FusOn-pLM has implicitly captured the function of fusion oncoproteins, which is further strengthened by the observation of clear differences between TF::TF and Kinase::Kinase fusions in its latent space (Supplementary Fig. 3A).

**Fig. 5: Zero-shot mutation prediction.**

Although FusOn-pLM may not predict that change is likely within a conserved domain, its logits still provide rank-ordered, possible mutations within these regions. This feature holds promise for discovering potential drug resistance mutations, as small molecule drugs are designed to interact with well-structured, conserved binding pockets like kinase active sites³⁸. Fusion oncoprotein mutations causing drug resistance have been identified in a small number of studies on kinase-containing fusions^39,40,41. We sought to determine whether FusOn-pLM prioritizes the resistance-causing mutations discovered in patients with fusion-driven cancers. In EML4::ALK, a set of 14 mutation sites were linked to resistance to at least one of five drugs: Crizotinib, Ceritinib, Alectinib, Brigatinib, and Lorlatinib³⁹. FusOn-pLM successfully predicted at least one true resistance mutation among the top three mutation logits for 12/14 sites (Fig. 5C, Supplementary Data S2). In BCR::ABL, whose sequence is nearly twice as long as EML4::ALK, a set of 28 mutation sites were linked to imatinib resistance⁴⁰. FusOn-pLM recovered drug resistance mutations in 13 of these locations (Fig. 5C, Supplementary Data S3). Finally, we selected ETV6::NTRK3 as a case study for recovering known drug resistance mutations and investigating potential mutations away from the active site. FusOn-pLM successfully prioritized two resistance mutations in the NTRK3 kinase domain⁴¹, assigned high conservation probability throughout the kinase domain, and predicted the most volatile positions to be in the disordered region from head protein ETV6 (Fig. 5D). In total, these results highlight FusOn-pLM’s potential as a biologically-relevant tool for predicting resistance mutations both within conserved domains and in disordered regions critical to therapeutic outcomes.
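The "top three mutation logits" criterion can be illustrated with a small sketch. The function names and the toy logits are hypothetical and serve only to show the ranking step; they are not taken from the FusOn-pLM code or its outputs.

```python
from typing import Dict, List, Set


def top_k_mutations(logits: Dict[str, float], original: str, k: int = 3) -> List[str]:
    """Return the k alternate residues with the highest logits at one position."""
    alternates = {aa: score for aa, score in logits.items() if aa != original}
    return sorted(alternates, key=alternates.get, reverse=True)[:k]


def site_recovered(logits: Dict[str, float], original: str,
                   known_resistant: Set[str], k: int = 3) -> bool:
    """True if any known resistance substitution is among the top-k alternates."""
    return any(aa in known_resistant for aa in top_k_mutations(logits, original, k))


# Toy logits for one position (invented values, original residue "L").
position_logits = {"L": 6.0, "M": 2.5, "F": 1.9, "V": 1.2, "Q": -0.4}
print(top_k_mutations(position_logits, original="L"))
print(site_recovered(position_logits, original="L", known_resistant={"M"}))
```

A site counts as recovered when at least one known substitution appears among the top-ranked alternates, which mirrors the per-site tally reported above.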

Discussion

In this work, we introduce FusOn-pLM, an ESM-2-based pLM fine-tuned to generate fusion oncoprotein-specific embeddings. We further provide a newly-curated, comprehensive dataset, FusOn-DB, consisting of over 44,000 annotated fusion oncoprotein sequences. To our knowledge, no pLM has explicitly sought to learn the unique characteristics of fusion oncoproteins, which differ from most proteins due to their highly disordered nature and altered structural and functional properties. Our benchmarking results establish that via a cosine-scheduled MLM training strategy, FusOn-pLM embeddings outperform those of the original ESM-2-650M model¹⁸, the ProtT5 model¹⁹, as well as baseline FOdb descriptor embeddings², on fusion oncoprotein-related tasks, while retaining distinct representations of fusion proteins from their head and tail counterparts (Supplementary Fig. 3B). We further demonstrate that by training on fusion oncoprotein sequences, which represent a large class of IDR-containing proteins, FusOn-pLM embeddings rank highly on the CAID2 benchmark for IDR detection³⁴ and strongly predict IDR properties themselves. Finally, as a demonstration of the model’s biological relevance, we show that FusOn-pLM uniquely enables the prediction of current and future drug-resistant mutations in fusion oncoproteins, highlighting its potential for informing therapeutic strategies and anticipating resistance mechanisms.

While FusOn-pLM represents an important advancement, there are several limitations to address. First, despite leveraging over 44,000 fusion oncoprotein sequences, the diversity of the FusOn-DB dataset may not fully capture all fusion variants, particularly rare or less well-characterized fusions. Additional data, particularly from emerging databases and clinical studies, would further enhance the model’s generalizability. Second, due to GPU memory constraints, proteins longer than 2000 amino acids were excluded during training. While such cases are rare among known fusion oncoproteins, this limitation may exclude certain outliers with repetitive domains or extensive IDRs. Future optimizations in tokenization or memory-efficient architectures could enable the inclusion of these sequences, ensuring comprehensive coverage of fusion oncoprotein diversity. Third, while FusOn-pLM provides strong predictions for intrinsic disorder and drug-resistant mutations, its ability to predict driver mutations or to connect sequence embeddings with regulatory elements such as enhancers or transcription factors remains unexplored³⁵. Future efforts could involve developing models that integrate FusOn-pLM embeddings with regulatory sequence data to elucidate mechanisms underlying oncogenesis⁴². Most importantly, experimental validation of FusOn-pLM’s predictions, including drug resistance mechanisms and therapeutic design tasks, will be essential to confirm its utility in practical settings.

Recently, our lab has trained ESM-2-based models to generate peptides provided only the sequence of the target protein, facilitating the design of peptide-E3 ubiquitin ligase fusions for the proteasomal degradation of diverse protein substrates^22,23,24. As our main objective is to enable the degradation of fusion oncoproteins, our next steps will be to replace ESM-2 embeddings in these models with FusOn-pLM embeddings, enabling fusion-specific degrader design. Since post-translational modifications (PTMs) are also well known to affect the oncogenic activity of fusion oncoproteins^43,44,45, we plan to retrain FusOn-pLM with our recent PTM-Mamba pLM⁴⁶, which effectively tokenizes PTMs, enabling both fusion- and PTM-specific therapeutic design. Finally, by leveraging recent advancements in gene delivery, such as lipid nanoparticles and adeno-associated viral vectors^47,48, we envision that fusion-specific biologics may eventually serve as safe and efficacious therapeutics for fusion-positive cancer patients. Overall, the results of our study motivate the use of FusOn-pLM embeddings for downstream fusion oncoprotein design tasks, serving as a major step toward this goal.

Methods

Model training set curation

Model training data was curated from FusionPDB and FOdb to create FusOn-DB, a dataset of 44,414 fusion oncoprotein sequences representing 16,364 unique head::tail fusions. FusionPDB contributed 41,456 unique sequences²⁶, including AlphaFold2 predictions for 3.5K proteins, while FOdb added 4537 unique sequences derived largely from patient data². After removing duplicates, sequences longer than 2000 amino acids were excluded, leaving 42,141 sequences for training. To create train-validation-test splits with low sequence homology, sequences were clustered using MMSeqs2 with a 30% sequence identity and 80% coverage threshold⁴⁹. The test set included 250 sequences: 195 with experimental puncta data from FOdb and sequences for four well-studied fusions (EWSR1::FLI1, PAX3::FOXO1, BCR::ABL1, and EML4::ALK). Clusters overlapping these sequences were manually assigned to the test set, with the remaining clusters split into training (33,719 sequences, 80.01%), validation (4214 sequences, 10.00%), and testing (4208 sequences, 9.99%) sets.
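A cluster-aware split of this kind can be sketched as follows. This is an illustrative sketch only: the cluster assignments would come from MMSeqs2, while the greedy balancing rule, the function name, and the toy identifiers are assumptions rather than the procedure used in the study.

```python
import random
from collections import defaultdict
from typing import Dict, List, Set


def split_by_cluster(seq_to_cluster: Dict[str, int], forced_test_ids: Set[str],
                     fractions=(0.8, 0.1, 0.1), seed: int = 0) -> Dict[str, List[str]]:
    """Assign whole clusters to train/val/test so no cluster spans two splits."""
    clusters = defaultdict(list)
    for seq_id, cluster_id in seq_to_cluster.items():
        clusters[cluster_id].append(seq_id)

    splits: Dict[str, List[str]] = {"train": [], "val": [], "test": []}
    remaining = []
    for members in clusters.values():
        # Clusters overlapping the benchmark sequences go to the test set.
        if any(m in forced_test_ids for m in members):
            splits["test"].extend(members)
        else:
            remaining.append(members)

    random.Random(seed).shuffle(remaining)
    total = len(seq_to_cluster)
    targets = {name: f * total for name, f in zip(("train", "val", "test"), fractions)}
    for members in remaining:
        # Greedy rule (assumed): fill the split furthest below its target size.
        name = max(splits, key=lambda s: targets[s] - len(splits[s]))
        splits[name].extend(members)
    return splits


toy = {f"seq{i}": i // 2 for i in range(100)}  # 50 toy clusters of 2 sequences
sizes = {k: len(v) for k, v in split_by_cluster(toy, {"seq0"}).items()}
print(sizes)
```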

BLAST and breakpoint mapping

To estimate sequence homology between FusOn-DB and SwissProt, local blastp (v2.16.0) was used. Head and tail gene names from FOdb and FusionPDB were mapped to UniProt IDs using the UniProt ID Mapping tool. Of the 44,414 fusion sequences, 44,257 had both head and tail components mapped, and 157 had one unmapped component (43 head, 114 tail). Both SwissProt and TrEMBL IDs were stored (Supplementary Data S4). For each fusion oncoprotein, three alignments were extracted: the top overall alignment, the top alignment corresponding to the head gene, and the top alignment corresponding to the tail gene. Alignments included all isoforms. Maximum percent identity was calculated as the number of identical amino acids in the alignment divided by the length of the fusion sequence. BLAST alignments were also used to determine breakpoints by identifying the indices corresponding to the top head and tail alignments. Overlapping regions were labeled as breakpoint regions, and specific loci were manually annotated where applicable for visualization purposes.
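The two quantities defined in this section can be written out as a short sketch. The function names are hypothetical, and alignment spans are assumed to be 1-based inclusive positions on the fusion sequence (as BLAST query coordinates are).

```python
from typing import Optional, Tuple


def max_percent_identity(n_identical: int, fusion_length: int) -> float:
    """Identical residues in the alignment divided by the fusion sequence length."""
    if fusion_length <= 0:
        raise ValueError("fusion_length must be positive")
    return 100.0 * n_identical / fusion_length


def breakpoint_region(head_span: Tuple[int, int],
                      tail_span: Tuple[int, int]) -> Optional[Tuple[int, int]]:
    """Overlap between the top head and tail alignments, or None if they do not overlap."""
    start = max(head_span[0], tail_span[0])
    end = min(head_span[1], tail_span[1])
    return (start, end) if start <= end else None


# Toy numbers, invented for illustration.
print(max_percent_identity(355, 500))
print(breakpoint_region((1, 260), (255, 500)))
```

Because the denominator is the full fusion length rather than the alignment length, a fusion whose best hit covers only its head or tail is capped well below 100% identity.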

Benchmarking dataset curation

To evaluate FusOn-pLM, datasets were curated for three benchmarking tasks: puncta formation and localization, IDR ensemble dimensions, and intrinsic disorder prediction. Data for puncta formation and localization were collected from FOdb², which includes 178 fusion oncoproteins with experimentally validated results. Train-test splits from FOdb were used, with 149 sequences for training and 29 for testing across three tasks: puncta formation propensity, nuclear localization, and cytoplasmic localization. Class distributions were maintained as reported in FOdb². For IDR ensemble dimensions, 47,114 IDR sequences from synthetic and natural proteins were sourced from a published dataset³³. Labels included asphericity, end-to-end radius (R_e), radius of gyration (R_g), and polymer scaling exponent. Sequences were clustered using MMSeqs2 with a minimum sequence identity of 30% and split into training (80%), validation (10%), and testing (10%) sets⁴⁹. Data distributions were normalized as needed, and sequences with multiple labels for the same property were averaged. Final dataset sizes were 47,114 for asphericity, 42,868 for R_e, 22,912 for R_g, and 40,637 for the scaling exponent. For disorder prediction, training data included 5304 unique sequences after cleaning and deduplication, with 5264 sequences from IDP-CRF⁵⁰ and 536 sequences from flDPnn⁵¹. The testing dataset comprised 210 gold-standard sequences from the CAID2 Disorder-NOX dataset with per-residue annotations indicating disorder (1) or structure (0)^29,34. FusOn-pLM-Diso was trained on the combined dataset and benchmarked on Disorder-NOX. To analyze disorder in fusion oncoproteins, pseudo-labels were generated using AlphaFold-pLDDT scores, where residues with pLDDT <68.8 were labeled as disordered¹⁷. Structures for 523 fusion oncoproteins in the FusOn-pLM test set were obtained from FusionPDB²⁶. The BeautifulSoup package (version 4.12.2) in Python was used to scrape FusionPDB for structure download links.
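The pLDDT-based pseudo-labeling rule can be sketched in a few lines. The function names and toy scores are illustrative; only the 68.8 threshold and the 1/0 label convention come from the text.

```python
from typing import List, Sequence

PLDDT_DISORDER_THRESHOLD = 68.8  # residues strictly below this are labeled disordered


def disorder_labels(plddt_scores: Sequence[float]) -> List[int]:
    """Per-residue pseudo-labels: 1 = disordered, 0 = structured."""
    return [1 if score < PLDDT_DISORDER_THRESHOLD else 0 for score in plddt_scores]


def percent_disordered(plddt_scores: Sequence[float]) -> float:
    """Share of residues labeled disordered, as a percentage of sequence length."""
    labels = disorder_labels(plddt_scores)
    return 100.0 * sum(labels) / len(labels)


toy_plddt = [90.1, 68.8, 68.7, 35.0]  # invented per-residue scores
print(disorder_labels(toy_plddt), percent_disordered(toy_plddt))
```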

Embedding exploration dataset curation

FusOn-pLM embeddings of transcription factor (TF) and kinase fusions were visualized in 2D plots. To efficiently determine which fusion oncoproteins possessed TF heads and tails or kinase heads and tails, a categorized list of fusion head and tail genes was consulted⁵². In total, 595 fusion oncoproteins from FusOn-DB (364 TF::TF and 231 Kinase::Kinase) were identified.

Model architecture and training

FusOn-pLM is based on ESM-2-650M, a 33-layer transformer model pre-trained on UniRef50, and was fine-tuned to generate fusion oncoprotein-specific representations. To adapt ESM-2-650M for this task without overfitting, the final eight layers of the model were selectively fine-tuned. Specifically, the key, query, and value weight matrices of the self-attention mechanism in these layers were unfrozen, while earlier layers remained fixed. The multi-head self-attention mechanism is parameterized such that the attention output is computed as a weighted sum of values V, where the weights are derived from the scaled dot-product of queries: $Q={W}_{q}h$ and keys: $K={W}_{k}h$. For fine-tuning, the learnable parameters W_q, W_k, and W_v in the last eight layers were updated, enabling task-specific adaptation to fusion oncoproteins while preserving the general-purpose representations learned during pre-training.
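The selective unfreezing described above can be sketched as a filter over parameter names. The naming pattern below follows the Hugging Face ESM implementation and is an assumption about the checkpoint layout; whether biases or the MLM head were also updated is not specified in the text, so this sketch selects only the query, key, and value weight matrices.

```python
import re

NUM_LAYERS = 33    # depth of ESM-2-650M, as stated in the text
NUM_UNFROZEN = 8   # final eight layers are fine-tuned

# Assumed parameter naming (Hugging Face style); adjust to the actual checkpoint.
QKV_PATTERN = re.compile(
    r"encoder\.layer\.(\d+)\.attention\.self\.(query|key|value)\.weight$"
)


def is_trainable(param_name: str) -> bool:
    """True for W_q, W_k, W_v in the last NUM_UNFROZEN layers (0-indexed 25..32)."""
    match = QKV_PATTERN.search(param_name)
    if match is None:
        return False
    return int(match.group(1)) >= NUM_LAYERS - NUM_UNFROZEN


# With a PyTorch model, the filter would typically be applied as:
#   for name, param in model.named_parameters():
#       param.requires_grad = is_trainable(name)
print(is_trainable("esm.encoder.layer.32.attention.self.query.weight"))
```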

Specifically, a cosine-scheduled masking strategy was employed during training to dynamically vary the masking rate.

Let $x=\left({x}_{1},{x}_{2},\ldots,{x}_{n}\right)$ be the input amino acid sequence of length n. Define M as the set of masked positions such that ${|M|}=\lceil r\cdot n\rceil$, where the masking rate r varies within each training epoch according to a cosine schedule. The masking rate at step t within an epoch of T steps is given by:

$${r}_{t}={r}_{\min }+\frac{1}{2}\left({r}_{\max }-{r}_{\min }\right)\left(1-\cos \left(\frac{t\pi }{T}\right)\right)$$

(1)

where ${r}_{\min }=0.15$ and ${r}_{\max }=0.40$. At the start of each epoch, r_t is reset to r_min, increasing to r_max and cycling back to r_min at the beginning of the next epoch.
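Equation (1) translates directly into code. The sketch below is a minimal rendering of the formula; the function name and the choice of 100 steps per epoch are illustrative assumptions.

```python
import math


def masking_rate(t: int, T: int, r_min: float = 0.15, r_max: float = 0.40) -> float:
    """Cosine-scheduled masking rate at step t of an epoch with T steps (Eq. 1)."""
    return r_min + 0.5 * (r_max - r_min) * (1.0 - math.cos(math.pi * t / T))


# The rate starts at r_min, rises slowly, then approaches r_max at the end of the epoch.
steps_per_epoch = 100  # hypothetical value
for t in (0, 25, 50, 75, 100):
    print(t, round(masking_rate(t, steps_per_epoch), 4))
```

Because t is counted within an epoch, calling the function with t = 0 at the start of the next epoch reproduces the reset to r_min.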

Masked positions are selected uniformly at random from the set $\{1,2,\ldots,n\}$ without replacement. Mathematically, the selection of M is described as:

$$M\sim {Uniform}\left(\{1,2,\ldots,n\},\lceil r\cdot n\rceil \right)$$

(2)
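The sampling in Eq. (2) amounts to drawing ⌈r·n⌉ distinct positions. The sketch below shows this step together with mask-token replacement; it uses 0-based indices, and the function names, toy sequence, and `<mask>` token string are illustrative assumptions.

```python
import math
import random
from typing import List, Sequence


def sample_mask_positions(n: int, r: float, rng: random.Random) -> List[int]:
    """Draw ceil(r * n) distinct 0-based positions uniformly at random."""
    k = math.ceil(r * n)
    return sorted(rng.sample(range(n), k))


def apply_mask(tokens: Sequence[str], positions: Sequence[int],
               mask_token: str = "<mask>") -> List[str]:
    """Replace every selected position with the mask token."""
    masked = list(tokens)
    for i in positions:
        masked[i] = mask_token
    return masked


rng = random.Random(0)
sequence = list("MKTAYIAKQR")  # toy 10-residue sequence
positions = sample_mask_positions(len(sequence), 0.25, rng)
print(positions, apply_mask(sequence, positions))
```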

All selected positions are replaced with a special mask token. The MLM objective is computed as:

$${L}_{{MLM}}=-\sum\limits_{i\in M}\log P\left({x}_{i}\mid {x}_{\setminus M}\right)$$

(3)

where x_i is the true amino acid at position i, and ${x}_{\setminus M}$ represents the sequence with masked tokens excluded.
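The objective in Eq. (3) sums the negative log-probability that the model assigns to the true residue at each masked position. A minimal sketch, with hypothetical names and invented toy probabilities, is given below; in practice the sum is often averaged over masked positions, which is an assumption not stated in the text.

```python
import math
from typing import Dict, Sequence


def mlm_loss(true_tokens: Sequence[str], masked_positions: Sequence[int],
             predicted_probs: Dict[int, Dict[str, float]]) -> float:
    """Sum of -log P(x_i | unmasked context) over masked positions i."""
    return -sum(
        math.log(predicted_probs[i][true_tokens[i]]) for i in masked_positions
    )


tokens = "MKT"
masked = [0, 2]
# Toy model outputs: probability over residues at each masked position.
probs = {0: {"M": 0.5, "K": 0.5}, 2: {"T": 0.25, "A": 0.75}}
print(mlm_loss(tokens, masked, probs))
```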

A visualization of the masking strategy is shown in Fig. 2A.

FusOn-pLM was trained on one NVIDIA H100 GPU with 80 GB of VRAM for 30 epochs with a batch size of 8 and a learning rate of 3e-4. The Adam optimizer was utilized with no weight decay. Only fusion oncoproteins of length 2000 or shorter were used for training; short sequences were padded to this maximal length.

Fusion oncoprotein property benchmarks

Embedding performance on predicting the propensity of puncta formation, as well as predicting if puncta form in the nucleus or cytoplasm, was evaluated. Here, sequences from FOdb with conclusive experimental data on puncta formation were utilized for pLM embedding evaluation². FOdb tested 195 total FOs for puncta formation, but only used the 178 with conclusive results to train the FO-Puncta ML model. Puncta formation and localization predictions were treated as binary classification tasks, where label 0 or 1 represented a lack or presence of puncta formation in a given area. FusOn-pLM embeddings were compared against three others: (1) Base wild-type ESM-2-650M embeddings, (2) ProtT5-XL-UniRef50 embeddings¹⁹, and (3) FOdb embeddings², which are 25 physicochemical features manually curated by FOdb for only the 195 proteins. The standard binary cross-entropy loss function was minimized for each task using the XGBoost model with 50 trees via xgboost (version 1.7.5)⁵³. The binary cross-entropy loss is defined as:

$${BCE}\left(y,\hat{y}\right)=-\left(y\log \left(\hat{y}\right)+\left(1-y\right)\log \left(1-\hat{y}\right)\right)$$

(4)
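Equation (4) can be evaluated per example as in the sketch below. The clipping constant `eps` is an added numerical safeguard and an assumption of this sketch, not part of the definition.

```python
import math


def binary_cross_entropy(y: int, y_hat: float, eps: float = 1e-12) -> float:
    """Binary cross-entropy for one label y in {0, 1} and predicted probability y_hat."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)  # avoid log(0)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))


# A confident correct prediction is penalized less than a confident wrong one.
print(binary_cross_entropy(1, 0.9), binary_cross_entropy(1, 0.1))
```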

Disorder property benchmark

Disorder properties were evaluated by training regression models that used FusOn-pLM embeddings of IDRs, to predict four ensemble features: asphericity, R_e, R_g, and polymer scaling exponent³³. For each property, a separate FusOn-pLM-IDR regression model was trained. These models fed FusOn-pLM embeddings through a multi-layer perceptron (MLP) network with three fully connected layers (Fig. 4A). The input layer performed dimensionality reduction to hidden dimension 640 and passed the output through a ReLU activation function, followed by layer normalization and dropout regularization with a probability of 0.2. This structure was repeated for two more iterations, shrinking the hidden dimension to 320 and finally culminating in a single neuron: the predicted value of the property. Each model was trained to minimize the mean square error (MSE), and early stopping was implemented to prevent overfitting. The MSE loss function is defined by:

$$\mathrm{MSE}\left(y,\hat{y}\right)=\frac{1}{n}\sum\limits_{i=1}^{n}{\left({y}_{i}-{\hat{y}}_{i}\right)}^{2}$$

(5)
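The loss in Eq. 5 and the early-stopping rule can be sketched in a few lines of plain Python. The `EarlyStopping` class, its `patience` and `min_delta` parameters, and the validation-loss curve are hypothetical; the text states only that early stopping was used, not how it was configured.

```python
def mse(y_true, y_pred):
    """Mean squared error (Eq. 5)."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

class EarlyStopping:
    """Signal a stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience      # assumed value, not from the paper
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best = val_loss      # improvement: reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Made-up validation losses: improve for three epochs, then plateau.
val_losses = [0.50, 0.40, 0.35, 0.36, 0.37, 0.38, 0.30]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best)
```

In this toy run, training halts before the final epoch is reached, and the best validation loss seen so far is retained for model selection.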

Models were evaluated on a held-out test set by predicting each property given the sequence embedding alone. The coefficient of determination (R²) between predictions and labels was calculated for each model to assess goodness of fit. To maximize R², a hyperparameter screen across two batch sizes (32, 64) and five learning rates (1e-5, 3e-4, 1e-4, 3e-3, 1e-3) was performed (Supplementary Data S5). The true and predicted values were plotted in Matplotlib (version 3.7.2), with an ideal fit line included for reference. The entire process was repeated using ESM-2-650M embeddings rather than FusOn-pLM embeddings (Supplementary Fig. 2).
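To make the evaluation concrete, the following sketch enumerates the ten batch-size and learning-rate combinations named above and computes R² for a toy set of predictions. It assumes the common definition R² = 1 − SS_res/SS_tot; the text does not say which R² variant was used, and the example values are made up.

```python
from itertools import product

def r_squared(y_true, y_pred):
    """Coefficient of determination, assuming R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot

# Hyperparameter grid as listed in the text: 2 batch sizes x 5 learning rates.
batch_sizes = [32, 64]
learning_rates = [1e-5, 3e-4, 1e-4, 3e-3, 1e-3]
grid = list(product(batch_sizes, learning_rates))

# Toy labels and predictions (made up) standing in for one trained model.
y_true = [0.1, 0.4, 0.5, 0.9]
y_pred = [0.2, 0.35, 0.55, 0.8]
print(len(grid), round(r_squared(y_true, y_pred), 3))
```

In a full screen, one model would be trained per grid entry and the configuration with the highest R² retained.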

CAID benchmark

FusOn-pLM’s ability to predict intrinsic disorder was evaluated using a per-residue disorder prediction benchmark based on the CAID2 Disorder-NOX dataset²⁹,³⁴. Binary labels indicating whether each residue is disordered (1) or structured (0) were used to train FusOn-pLM-Diso, a per-residue disorder predictor. The predictor employs a multi-head self-attention Transformer architecture, minimizing binary cross-entropy loss. Hyperparameter optimization was performed for the number of attention heads (5, 8, 10), Transformer layers (2, 4, 6), and dropout rates (0.2, 0.5) (Supplementary Data S6). Models were trained for 2 epochs with a learning rate of 5e-5, and optimal hyperparameters were selected by maximizing AUROC. An equivalent model, ESM-2-650M-Diso, was trained using ESM-2-650M embeddings for comparison. Both models were trained and evaluated on the CAID2 Disorder-NOX dataset²⁹,³⁴, with per-residue predictions used for benchmarking. Predicted per-residue disorder probabilities were computed for each input sequence, and binary predictions were made using thresholds selected to optimize classification performance metrics. To extend the analysis to fusion oncoproteins, per-residue disorder predictions were made for sequences with available AlphaFold2 structures¹⁴. Percentage disorder was calculated by dividing the number of predicted disordered residues by sequence length. Additionally, predicted per-residue disorder probabilities were mapped onto 3D protein structures for visualization. AlphaFold2’s pLDDT metric was used as a reference for structural disorder to aid in the assessment of predicted regions¹⁴.
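The thresholding and percentage-disorder calculation described above can be sketched as follows. The threshold of 0.5 and the probability values are placeholders; the text says only that thresholds were selected to optimize classification metrics, and the function names are hypothetical.

```python
def binarize(probs, threshold=0.5):
    """Convert per-residue disorder probabilities to 0/1 calls."""
    return [1 if p >= threshold else 0 for p in probs]

def percent_disorder(probs, threshold=0.5):
    """Predicted disordered residues divided by sequence length, as a percentage."""
    calls = binarize(probs, threshold)
    return 100.0 * sum(calls) / len(calls)

# Made-up per-residue probabilities for an 8-residue sequence.
probs = [0.92, 0.81, 0.40, 0.10, 0.05, 0.66, 0.73, 0.20]
print(binarize(probs), percent_disorder(probs))
```

Raising the threshold makes the predictor more conservative, so the reported percentage disorder depends directly on the threshold chosen.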

Embedding exploration

To explore how FusOn-pLM embeddings capture the physicochemical and functional properties of fusion oncoproteins, we first conducted a dimensionality reduction analysis on the embeddings of fusion oncoproteins and/or their head and tail proteins using Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP)⁵⁴ via the umap module (version 0.5.6). The FusOn-pLM embeddings of six highly studied fusion oncoproteins (EWSR1::FLI1, PAX3::FOXO1, BCR::ABL1, CIC::DUX4, SS18::SSX1, and EML4::ALK) and their respective head and tail proteins (derived from the BLAST against SwissProt) were transformed by UMAP and plotted (Supplementary Fig. 3B). Additionally, 364 transcription factor (TF) fusions, where both head and tail were TFs, and 231 kinase fusions, where both head and tail were kinases, were embedded and plotted in UMAP coordinates (Supplementary Fig. 3A).

Zero-shot mutation prediction

Zero-shot mutation prediction was performed on a set of fusion oncoproteins. For each protein, the sequence was input to FusOn-pLM with its MLM head L times, where L is the protein length. During each iteration, a single <mask> token was introduced at a different position in the sequence, and the model's prediction was taken at this position only. The raw logits for each of the twenty amino acids at the masked position were recorded. These logits were ranked in descending order, creating a list of the most to least likely amino acids predicted at that position. The top three predicted amino acids, based on their logits, were considered the “top 3 mutations.”
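The masking loop can be sketched without a model by passing in a scoring function. Here `zero_shot_scan` and `toy_logits` are hypothetical names, and `toy_logits` is a stand-in scorer with arbitrary values; in practice that callable would wrap the FusOn-pLM MLM head and return the logits at the masked position.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def zero_shot_scan(sequence, logits_fn, top_k=3):
    """Mask each position once and return the top_k amino acids per position.

    logits_fn(tokens, pos) must return {amino_acid: logit} for the masked
    position; it is a placeholder for a call to the model's MLM head.
    """
    results = []
    for pos in range(len(sequence)):          # L forward passes for length L
        tokens = list(sequence)
        tokens[pos] = "<mask>"                # exactly one masked position
        logits = logits_fn(tokens, pos)
        ranked = sorted(AMINO_ACIDS, key=lambda aa: logits[aa], reverse=True)
        results.append(ranked[:top_k])
    return results

def toy_logits(tokens, pos):
    """Stand-in scorer (not a real model): favors the left neighbor's residue."""
    left = tokens[pos - 1] if pos > 0 else "M"
    return {aa: (2.0 if aa == left else -AMINO_ACIDS.index(aa) / 10.0)
            for aa in AMINO_ACIDS}

top3 = zero_shot_scan("MKV", toy_logits)
print(top3)
```

Because each position requires its own forward pass, the cost of a full scan grows linearly with protein length.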

Heatmaps of the logits for the original amino acid at each position were constructed for representative fusion oncoproteins: EWSR1::FLI1, PAX3::FOXO1, and TRIM24::RET. Functional domains were identified using UniProt annotations for the reviewed SwissProt accession corresponding to the head and tail genes. Residue positions for these domains were converted from their coordinates on the original head or tail protein to their corresponding positions on the fusion protein using string indexing in Python. A binary conservation label was applied to logits, with values <0.7 designated as non-conserved (0) and values >0.7 as conserved (1).
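A minimal sketch of the coordinate conversion by string indexing, together with the binary conservation label, is shown below. The sequences are toy strings, the function names are hypothetical, and the use of `str.find` is an assumption about how the indexing might be done. The text does not say how a value of exactly 0.7 is labeled; this sketch assigns it 0.

```python
def map_domain_to_fusion(parent_seq, fusion_seq, start, end):
    """Map a 1-based inclusive domain [start, end] on a parent protein to the fusion.

    Returns 1-based inclusive fusion coordinates, or None when the domain
    sequence is not retained intact in the fusion.
    """
    domain = parent_seq[start - 1:end]
    idx = fusion_seq.find(domain)             # first exact match only
    if idx == -1:
        return None
    return idx + 1, idx + len(domain)

def conservation_labels(values, threshold=0.7):
    """Label values above the threshold as conserved (1), otherwise 0."""
    return [1 if v > threshold else 0 for v in values]

# Toy head/tail proteins and a fusion joining head[1..6] to tail[4..11].
head = "MAAAKDEFGH"
tail = "QRSTVWYLLNP"
fusion = head[:6] + tail[3:]

print(map_domain_to_fusion(tail, fusion, 5, 8))    # tail domain kept in the fusion
print(map_domain_to_fusion(head, fusion, 8, 10))   # head domain lost at the breakpoint
print(conservation_labels([0.9, 0.3, 0.7]))
```

Short domains can match at more than one place in the fusion, so a real pipeline would likely need to confirm that the match lies on the expected side of the breakpoint.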

Sequences for EML4::ALK and BCR::ABL1 were generously provided by the authors of Elshatlawy et al.³⁹ and O’Hare et al.⁴⁰, respectively, and were screened through the zero-shot mutation pipeline. Positions corresponding to known drug resistance mutations, as reported in the literature, were evaluated to determine whether one of the top three predicted amino acids matched a reported mutation (“hit”) or did not (“miss”). For positions where the original amino acid was among the top three predicted tokens, an additional token was included in the analysis. Structures for these sequences were predicted with AlphaFold2 and visualized using PyMOL.
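One possible reading of the hit/miss rule is sketched below: when the original residue is among the top three, it is dropped and the fourth-ranked residue is added, so three candidate substitutions are always compared against the literature. This interpretation, the function names, and the example ranking are assumptions, not details confirmed by the text.

```python
def candidate_mutations(ranked, original, k=3):
    """Top-k ranked residues; if the original residue is among them,
    take one additional token and drop the original."""
    top = ranked[:k]
    if original in top:
        top = [aa for aa in ranked[:k + 1] if aa != original]
    return top

def score_position(ranked, original, reported):
    """Return 'hit' if any reported substitution is among the candidates."""
    candidates = candidate_mutations(ranked, original)
    return "hit" if set(candidates) & set(reported) else "miss"

# Hypothetical ranking at one position, most to least likely.
ranked = ["T", "I", "A", "M", "S"]
print(score_position(ranked, "T", {"M"}))  # original in top 3, so a 4th token is considered
print(score_position(ranked, "G", {"M"}))  # original not in top 3, so only the top 3 count
```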

Potential mutations in ETV6::NTRK3 were also predicted using the zero-shot prediction pipeline⁴¹. Literature-reported mutations in NTRK3 coordinates were converted to the corresponding positions in ETV6::NTRK3 coordinates. For example, NTRK3 G623R and G696A became ETV6::NTRK3 G431R and G504A, respectively. These positions were evaluated as “hit” or “miss” based on whether the top three predicted mutations included the correct token. Structural predictions were obtained from FusionPDB and visualized in PyMOL. Additionally, the top five mutations were identified as those with the smallest logits for the original amino acid.
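The two example positions differ from their NTRK3 coordinates by the same 192 residues (623 vs. 431 and 696 vs. 504), so the conversion can be sketched as a constant shift. The offset is inferred from those numbers rather than stated by the authors, it applies only to this particular fusion sequence, and `shift_mutation` is a hypothetical helper.

```python
import re

def shift_mutation(mutation, offset):
    """Shift the position in a substitution string such as 'G623R' by `offset`."""
    m = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if m is None:
        raise ValueError(f"unrecognized mutation string: {mutation}")
    ref, pos, alt = m.group(1), int(m.group(2)), m.group(3)
    return f"{ref}{pos + offset}{alt}"

# Offset inferred from the example positions in the text (assumption).
OFFSET = 431 - 623
converted = [shift_mutation(m, OFFSET) for m in ["G623R", "G696A"]]
print(converted)
```

The shift changes only the position, so each substitution keeps its original and mutant residues.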

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

All data needed to evaluate the conclusions are presented in the paper and tables. The FusOn-DB dataset can be found at https://huggingface.co/datasets/ChatterjeeLab/FusOn-DB. Source data are provided with this paper.

Code availability

The code used to develop the model/perform the analyses and generate results in this study is publicly available at https://huggingface.co/ChatterjeeLab/FusOn-pLM, under a Creative Commons Attribution Non Commercial No Derivatives 4.0 license. The specific version of the code associated with this publication is accessible via https://doi.org/10.57967/hf/4218⁵⁵ and is archived at the following Zenodo repository: https://doi.org/10.5281/zenodo.14706684.

References

Rabbitts, T. H. Chromosomal translocations in human cancer. Nature 372, 143–149 (1994).
Tripathi, S. et al. Defining the condensate landscape of fusion oncoproteins. Nat. Commun. 14, 6008 (2023).
Angione, S. D. A. et al. Fusion oncoproteins in childhood cancers: potential role in targeted therapy. J. Pediatr. Pharmacol. Ther. 26, 541–555 (2021).
Delattre, O. et al. Gene fusion with an ETS DNA-binding domain caused by chromosome translocation in human tumours. Nature 359, 162–165 (1992).
Linardic, C. M. PAX3-FOXO1 fusion gene in rhabdomyosarcoma. Cancer Lett. 270, 10–18 (2008).
McBride, M. J. et al. The SS18-SSX fusion oncoprotein hijacks BAF complex targeting and function to drive synovial sarcoma. Cancer Cell 33, 1128–1141.e7 (2018).
Soda, M. et al. Identification of the transforming EML4-ALK fusion gene in non-small-cell lung cancer. Nature 448, 561–566 (2007).
Erkizan, H. V. et al. A small molecule blocking oncogenic protein EWS-FLI1 interaction with RNA helicase A inhibits growth of Ewing’s sarcoma. Nat. Med. 15, 750–756 (2009).
Vital, T. et al. MS0621, a novel small-molecule modulator of Ewing sarcoma chromatin accessibility, interacts with an RNA-associated macromolecular complex and influences RNA splicing. Front. Oncol. 13, 1099550 (2023).
Carter, P. J. & Rajpal, A. Designing antibodies as therapeutics. Cell 185, 2789–2805 (2022).
Hie, B. L. et al. Efficient evolution of human antibodies from general protein language models. Nat. Biotechnol. 42, 275–283 (2023).
Ham, J. M., Kim, M., Kim, T., Ryu, S. E. & Park, H. Structure-based De Novo design for the discovery of miniprotein inhibitors targeting oncogenic mutant BRAF. Int. J. Mol. Sci. 25, 5535 (2024).
Vadevoo, S. M. P. et al. Peptides as multifunctional players in cancer therapy. Exp. Mol. Med. 55, 1099–1109 (2023).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Abramson, J. et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 630, 493–500 (2024).
Watson, J. L. et al. De novo design of protein structure and function with RFdiffusion. Nature 620, 1089–1100 (2023).
Piovesan, D., Monzon, A. M. & Tosatto, S. C. E. Intrinsic protein disorder and conditional folding in AlphaFoldDB. Protein Sci. 31, e4466 (2022).
Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).
Elnaggar, A. et al. ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 44, 7112–7127 (2022).
Ferruz, N., Schmidt, S. & Höcker, B. ProtGPT2 is a deep unsupervised language model for protein design. Nat. Commun. 13, 4348 (2022).
Madani, A. et al. Large language models generate functional protein sequences across diverse families. Nat. Biotechnol. 41, 1099–1106 (2023).
Brixi, G. et al. SaLT&PepPr is an interface-predicting language model for designing peptide-guided protein degraders. Commun. Biol. 6, 1081 (2023).
Bhat, S. et al. De novo design of peptide binders to conformationally diverse targets with contrastive language modeling. Sci. Adv. 11, adr368 (2025).
Chen, T. et al. PepMLM: target sequence-conditioned generation of therapeutic peptide binders via span masked language modeling. Preprint at https://arxiv.org/abs/2310.03842 (2024).
Verma, S. K., Witkin, K. L., Sharman, A. & Smith, M. A. Targeting fusion oncoproteins in childhood cancers: challenges and future opportunities for developing therapeutics. J. Natl Cancer Inst. 116, 1012–1018 (2024).
Kumar, H., Tang, L.-Y., Yang, C. & Kim, P. FusionPDB: a knowledgebase of human fusion proteins. Nucleic Acids Res. 52, D1289–D1304 (2024).
Suzek, B. E. et al. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 31, 926–932 (2015).
UniProt Consortium. UniProt: the Universal Protein Knowledgebase in 2023. Nucleic Acids Res. 51, D523–D531 (2023).
Necci, M. et al. Critical assessment of protein intrinsic disorder prediction. Nat. Methods 18, 472–481 (2021).
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. Proceedings of NAACL-HLT 4171–4186 (2019).
Wettig, A., Gao, T., Zhong, Z. & Chen, D. Should you mask 15% in masked language modeling? Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics. 2985–3000 (2023).
Sahoo, S. S. et al. Simple and effective masked diffusion language models. Conference on Neural Information Processing Systems (2024).
Lotthammer, J. M., Ginell, G. M., Griffith, D., Emenecker, R. J. & Holehouse, A. S. Direct prediction of intrinsically disordered protein conformational properties from sequence. Nat. Methods 21, 465–476 (2024).
Del Conte, A. et al. Critical assessment of protein intrinsic disorder prediction (CAID) - Results of round 2. Proteins Struct. Funct. Bioinform. 91, 1925–1934 (2023).
Zhang, R., Dong, L. & Yu, J. Concomitant pathogenic mutations and fusions of driver oncogenes in tumors. Front. Oncol. 10, 544579 (2020).
Asante, Y. et al. PAX3-FOXO1 uses its activation domain to recruit CBP/P300 and shape RNA Pol2 cluster distribution. Nat. Commun. 14, 1–19 (2023).
Crose, L. E. S. et al. Alveolar rhabdomyosarcoma-associated PAX3-FOXO1 promotes tumorigenesis via Hippo pathway suppression. J. Clin. Invest. 124, 285–296 (2014).
Cohen, P., Cross, D. & Jänne, P. A. Kinase drug discovery 20 years after imatinib: progress and future directions. Nat. Rev. Drug Discov. 20, 551–569 (2021).
Elshatlawy, M., Sampson, J., Clarke, K. & Bayliss, R. EML4-ALK biology and drug resistance in non-small cell lung cancer: a new phase of discoveries. Mol. Oncol. 17, 950–963 (2023).
O’Hare, T., Eide, C. A. & Deininger, M. W. N. Bcr-Abl kinase domain mutations, drug resistance, and the road to a cure for chronic myeloid leukemia. Blood 110, 2242–2249 (2007).
Drilon, A. et al. Efficacy of larotrectinib in TRK fusion-positive cancers in adults and children. N. Engl. J. Med. 378, 731–739 (2018).
Vicente-García, C. et al. Regulatory landscape fusion in rhabdomyosarcoma through interactions between the PAX3 promoter and FOXO1 regulatory elements. Genome Biol. 18, 106 (2017).
Yu, L., Davis, I. J. & Liu, P. Regulation of EWSR1-FLI1 function by post-transcriptional and post-translational modifications. Cancers 15, 382 (2023).
Thalhammer, V. et al. PLK1 phosphorylates PAX3-FOXO1, the inhibition of which triggers regression of alveolar Rhabdomyosarcoma. Cancer Res. 75, 98–110 (2015).
Pan, S. & Chen, R. Pathological implication of protein post-translational modifications in cancer. Mol. Aspects Med. 86, 101097 (2022).
Peng, Z., Schussheim, B. & Chatterjee, P. PTM-Mamba: a PTM-aware protein language model with bidirectional gated mamba blocks. Preprint at bioRxiv https://www.biorxiv.org/content/10.1101/2024.02.28.581983v1 (2024). In press.
Hou, X., Zaks, T., Langer, R. & Dong, Y. Lipid nanoparticles for mRNA delivery. Nat. Rev. Mater. 6, 1078–1094 (2021).
Wang, J.-H., Gessler, D. J., Zhan, W., Gallagher, T. L. & Gao, G. Adeno-associated virus as a delivery vector for gene therapy of human diseases. Signal Transduct. Target. Ther. 9, 78 (2024).
Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).
Liu, Y., Wang, X. & Liu, B. IDP-CRF: Intrinsically disordered protein/region identification based on conditional random fields. Int. J. Mol. Sci. 19, 2483 (2018).
Hu, G. et al. flDPnn: Accurate intrinsic disorder prediction with putative propensities of disorder functions. Nat. Commun. 12, 1–8 (2021).
Salokas, K., Weldatsadik, R. G. & Varjosalo, M. Human transcription factor and protein kinase gene fusions in human cancer. Sci. Rep. 10, 14169 (2020).
Buitinck, L. et al. API design for machine learning software: experiences from the scikit-learn project. European Conference on Machine Learning and Principles and Practices of Knowledge Discovery in Databases (2013).
McInnes, L., Healy, J. & Melville, J. UMAP: uniform manifold approximation and projection for dimension reduction. Preprint at https://arxiv.org/abs/1802.03426 (2018).
Vincoff, S. et al. FusOn-pLM. Hugging Face https://doi.org/10.57967/hf/4218 (2024).

Acknowledgements

We thank Mark III Systems and the Duke Computing Cluster for computing support. We further thank Zhangzhi Peng, Yinuo Zhang, and Tianlai Chen for their insights related to the manuscript. We thank Lauren Hong for rendering the FusOn-pLM logo. The work was supported by the National Cancer Institute (Awards #R21CA278468 and #3U54CA231630-01A1S4), the Wallace H. Coulter Foundation, and The Hartwell Foundation.

Author information

Authors and Affiliations

Department of Biomedical Engineering, Duke University, Durham, NC, USA
Sophia Vincoff, Kseniia Kholina, Rishab Pulugurta, Pranay Vure & Pranam Chatterjee
Department of Computer Science, Duke University, Durham, NC, USA
Shrey Goel & Pranam Chatterjee
Department of Biostatistics and Bioinformatics, Duke University, Durham, NC, USA
Pranam Chatterjee

Authors

Sophia Vincoff
Shrey Goel
Kseniia Kholina
Rishab Pulugurta
Pranay Vure
Pranam Chatterjee

Contributions

S.V. designed and implemented masking strategies and trained FusOn-pLM. S.V., S.G., K.K., R.P., and P.V. performed model benchmarking and visualizations. S.V. and P.C. wrote and reviewed the manuscript. P.C. conceived, designed, directed, and supervised the study.

Corresponding author

Correspondence to Pranam Chatterjee.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks Bargeen Turzo, Gian Gaetano Tartaglia and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Reporting Summary

Supplementary Datasets 1-6

Description of Additional Supplementary Files

Transparent Peer Review file

Source data

Source Data

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Vincoff, S., Goel, S., Kholina, K. et al. FusOn-pLM: a fusion oncoprotein-specific language model via adjusted rate masking. Nat Commun 16, 1436 (2025). https://doi.org/10.1038/s41467-025-56745-6

Received: 03 June 2024
Accepted: 24 January 2025
Published: 07 February 2025
Version of record: 07 February 2025
DOI: https://doi.org/10.1038/s41467-025-56745-6
