Introduction

The field of targeted protein degradation (TPD) represents a burgeoning frontier in therapeutic drug design, offering innovative strategies for the selective elimination of disease-related proteins1. Central to this approach is the ubiquitin–proteasome system (UPS), an endogenous cellular pathway that facilitates the breakdown of proteins through a process that tags them with ubiquitin for subsequent proteolysis2. The specificity and precision of the UPS have been leveraged in the development of Proteolysis Targeting Chimeras (PROTACs), a class of bifunctional molecules that recruit an E3 ubiquitin ligase to a target protein, leading to its degradation3. PROTAC technology has thus opened new avenues for addressing challenging targets in drug discovery4.

The design of effective PROTAC molecules is contingent upon a deep understanding of the ternary complexes they form with their target proteins and E3 ligases5. Such an understanding has historically been derived from experimental structural biology techniques6; however, the advent of sophisticated in silico methods has revolutionized this process7,8,9,10,11,12,13,14,15,16,17,18,19,20,21. Computational modeling now plays a pivotal role in the initial stages of PROTAC design by predicting the formation and stability of these complexes. Curated databases such as PROTAC-DB15,16,18, PROTACpedia, and the PROTAC-DataBank17 compile experimentally validated degrader molecules, warhead–E3 ligand pairs, and ternary complex structures, offering valuable templates for docking and machine learning–based prediction workflows19. Newer algorithms such as PROflow implement iterative refinement strategies to model PROTAC-induced interfaces more accurately than conventional docking approaches22. AI-assisted tools—such as graph neural network frameworks and reinforcement learning models for 3D linker generation—are increasingly applied to optimize degrader geometry and pharmacophoric compatibility21,23,24. Beyond modeling, data integration platforms such as the E3 Atlas, ELiAH, and BioGRID help identify biologically relevant E3–substrate pairs, expanding the degradable target landscape through interactome and ubiquitome analyses13,14,25. Together, these complementary tools underscore the need for hybrid approaches that combine structural prediction, biological context, and dynamic sampling to effectively model ternary complex formation in silico.

In the current study, we employed two state-of-the-art computational techniques, AlphaFold-3 (AF3) and PRosettaC, to model the arrangement of ternary complexes of PROTACs in silico9,11,26. Our objective was to compare the capabilities of these tools in predicting the structures of ternary complexes and to validate their accuracy against 36 known crystal structures. To this end, we have utilized DOCKQ v2 (hereafter simply referred to as DockQ), a validated scoring method, to quantitatively assess the predicted interfaces and overall structural congruence with experimentally determined data27,28.

Given AF3’s enhanced computational capabilities compared to its predecessor, AlphaFold-Multimer, we set out to explore whether the inclusion of additional scaffold proteins would influence model quality, as assessed by DockQ scores. Specifically, we investigated ternary PROTAC complexes targeting VHL or Cereblon by comparing predictions made with and without accessory proteins, namely Elongin B and Elongin C for VHL-based systems and DDB1 for Cereblon-based systems. These scaffold components are known to support proper E3 ligase architecture and may improve structural fidelity in predicted complexes. However, due to residue count limitations on the AF3 server, we were unable to include larger scaffold elements such as cullin-ring ligases or RBX domains, which might have further stabilized the folded conformation during prediction29,30.

Our comparative analyses highlight PRosettaC’s improved performance over AF3 in predicting ternary complex geometries that more closely resemble experimentally resolved structures. While the observed differences are modest in magnitude, they are consistent across multiple systems and suggest that PRosettaC may currently offer more reliable structural predictions for PROTAC-focused modeling. These findings are intended to inform the selection of in silico tools during the early stages of PROTAC development, particularly for applications where accurate ternary complex modeling can impact downstream design decisions. By contributing a systematic benchmarking of two widely used structural prediction platforms, we aim to support the broader community in refining computational strategies for targeted protein degradation.

Methods

PDB query and crystal structure selection

We queried the RCSB Protein Data Bank (PDB) on September 03, 2023 using the Advanced Search/Search API to identify experimentally determined PROTAC-mediated ternary complexes. The Boolean criteria were:

Full Text CONTAINS PHRASE “ternary complex PROTAC” AND Chemical Component Molecular Weight ≥ 450 AND Number of Protein Instances (Chains) per Assembly > 1 AND Structure Determination Methodology = experimental AND Experimental Method = X-RAY DIFFRACTION AND Polymer Entity Type = Protein.

The search returned 2,500 records (PDB entries). A custom results table was generated via the PDB Custom Report service and downloaded in tabular format for downstream screening. The table included the following fields: Entry ID, PubMed ID, DOI, Deposition Date, Release Date, Ligand, Value, PDB ID, Resolution (Å), Journal Name (Abbrev), Title, Publication Year, Structure Title, Sequence, Polymer Entity Sequence Length, Entity Macromolecule Type, Source Organism, Gene Name, Molecular Weight (Entity), Macromolecule Name, Asym ID, Auth Asym ID, Entity ID, Accession Code(s), Database Name, Ligand MW, Ligand ID, Ligand Name, Ligand SMILES.

All retrieved records were examined. First, we programmatically screened the results to confirm (i) the presence of at least one non-polymer ligand meeting the ≥ 450 Da threshold (proxy for a heterobifunctional PROTAC-sized molecule) and (ii) assemblies containing > 1 protein chain. Next, we annotated protein chains to identify putative E3 ubiquitin ligase components (e.g., CRBN, VHL, MDM2; gene-based and macromolecule-name pattern matching) and candidate target proteins distinct from the E3 component. Entries lacking both components in the same assembly or lacking a qualifying ligand were flagged for exclusion.
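The programmatic pre-screen described above can be sketched as follows. This is a minimal illustration, not our production script: the field names and the E3 name patterns are hypothetical simplifications of the custom-report columns listed earlier.

```python
import re

# Illustrative E3-component patterns (gene symbols and common macromolecule names).
E3_PATTERN = re.compile(r"\b(CRBN|VHL|MDM2|cereblon|von Hippel)\b", re.IGNORECASE)

def screen_entry(entry):
    """Return True if a report row passes the automated pre-screen:
    at least one ligand of >= 450 Da, more than one protein chain, and
    both an E3 component and a distinct non-E3 chain in the assembly."""
    has_protac_sized_ligand = any(mw >= 450 for mw in entry["ligand_mw"])
    multichain = entry["n_protein_chains"] > 1
    names = entry["chain_names"]
    e3_chains = [n for n in names if E3_PATTERN.search(n)]
    target_chains = [n for n in names if not E3_PATTERN.search(n)]
    return (has_protac_sized_ligand and multichain
            and bool(e3_chains) and bool(target_chains))

# Toy rows: a VHL/SMARCA2-like assembly passes; a binary non-E3 complex does not.
ok = screen_entry({
    "ligand_mw": [950.1, 18.0],
    "n_protein_chains": 4,
    "chain_names": ["von Hippel-Lindau disease tumor suppressor (VHL)",
                    "SMARCA2 bromodomain", "Elongin B", "Elongin C"],
})
bad = screen_entry({
    "ligand_mw": [300.0],
    "n_protein_chains": 2,
    "chain_names": ["Lysozyme", "Antibody Fab"],
})
```

Entries failing this filter were flagged for exclusion before manual curation.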

All remaining candidates were manually curated. For each, we visually inspected the biological assembly to confirm that the putative PROTAC ligand simultaneously engages (or is positioned to engage) both the E3 ligase and target protein, that ligand chemistry was intact (no missing atoms required for bifunctionality), and that the complex corresponded to an experimentally determined X-ray model. After manual curation, 36 non-redundant, high-confidence PROTAC ternary complex crystal structures were retained along with the corresponding reference publications31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47. These curated structures served as the empirical benchmarks for our computational predictions and were pivotal for subsequent comparative analyses.

AF3 minimal complex and AF3 full complex prediction protocol

Complex models were generated using the latest AlphaFold-3 (AF3) server (https://alphafoldserver.com/), which supports multimeric protein assembly predictions with high structural fidelity. For each of the 36 benchmarked systems, we generated two distinct model variations: (1) a Minimal Complex consisting solely of the target protein and E3 ligase, and (2) a Full Complex, which additionally included accessory proteins known to stabilize the E3 ligase complex, such as Elongin B/C in VHL systems or DDB1 in CRBN systems.

Inputs were prepared by concatenating relevant amino acid sequences without template guidance or manual restraints. Five models were generated per complex using default AF3 multimer settings, with efficient prediction runtimes typically completing within 10–30 min per submission.

Due to input size constraints imposed by the AF3 server, larger scaffold proteins such as Cullin–RING ligase subunits (CUL2, CUL4A) and RING-box domains (RBX1) were excluded; predictions therefore focused explicitly on the minimal functional components relevant to degrader binding and interface evaluation.

PRosettaC modeling technique

PRosettaC, a Rosetta-based protocol designed specifically for modeling PROTAC-induced ternary complexes, served as the second computational strategy in our benchmarking. We refer to PRosettaC predictions explicitly as Ternary Complexes. This method enforces geometric constraints derived from known warhead binding modes, facilitating the structure-guided assembly of ternary complexes involving an E3 ligase, a target protein, and a bifunctional degrader. We used a local implementation of PRosettaC for this work after necessary adjustments based on their GitHub repository (https://github.com/LondonLab/PRosettaC).

Pipeline inputs included experimentally resolved or modeled structures of target proteins and E3 ligases with their respective bound warhead and ligase recruiter. The PROTAC linker was input as a SMILES string, enabling PRosettaC to generate three-dimensional linker conformations compatible with the binding pocket geometries.

To enhance sampling depth beyond the original PRosettaC implementation, the protocol was modified to generate up to 1000 models per system, surpassing the default 200-model limit. The actual number of models generated varied by system based on convergence and constraint compatibility, ranging from 54 to 878 models. Specifically, the following numbers of models were generated per system: 5T35 (n = 199), 6BN7 (n = 216), 6BOY (n = 400), 6HAX (n = 200), 6HAY (n = 200), 6HR2 (n = 878), 6W7O (n = 200), 6ZHC (n = 204), 7JTO (n = 200), 7KHH (n = 400), 7PI4 (n = 400), 7Q2J (n = 400), 7S4E (n = 400), 7Z6L (n = 54), 7Z76 (n = 400), 7Z77 (n = 400), 7ZNT (n = 400), 8BDT (n = 400), 8BDX (n = 244), 8G1P (n = 400), 8G1Q (n = 800), 8PC2 (n = 214), 8QVU (n = 400), 8QW6 (n = 399), and 8QW7 (n = 400).
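For downstream bookkeeping, the per-system counts listed above can be collected into a simple mapping (values transcribed directly from the list in the text):

```python
# PRosettaC models generated per benchmark system (from the Methods text).
models_per_system = {
    "5T35": 199, "6BN7": 216, "6BOY": 400, "6HAX": 200, "6HAY": 200,
    "6HR2": 878, "6W7O": 200, "6ZHC": 204, "7JTO": 200, "7KHH": 400,
    "7PI4": 400, "7Q2J": 400, "7S4E": 400, "7Z6L": 54,  "7Z76": 400,
    "7Z77": 400, "7ZNT": 400, "8BDT": 400, "8BDX": 244, "8G1P": 400,
    "8G1Q": 800, "8PC2": 214, "8QVU": 400, "8QW6": 399, "8QW7": 400,
}

n_systems = len(models_per_system)                     # 25 modeled systems
lo = min(models_per_system.values())                   # 54 (7Z6L)
hi = max(models_per_system.values())                   # 878 (6HR2)
total_models = sum(models_per_system.values())
```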

All generated models were scored using the standard Rosetta energy function and evaluated using DockQ without prefiltering. This comprehensive evaluation ensures that our benchmarking reflects the full conformational diversity produced by PRosettaC.

DockQ scoring methodology

To evaluate the structural accuracy of predicted ternary complexes, we used DockQ v2, the Python-based reimplementation of DockQ, which supports multimeric systems and automatic chain mapping. Its improved efficiency and portability made it well suited for high-throughput evaluation of both AlphaFold-3 (AF3) and PRosettaC predictions.

DockQ scores were computed by comparing each predicted model to the corresponding experimentally resolved crystal structure for all 36 benchmarked systems. For each AF3-generated model, we computed DockQ scores using three evaluation strategies: (i) Full Complex, in which predicted models including accessory proteins were evaluated against the full experimental structure; (ii) Core Complex, in which the same Full Complex models were stripped of accessory proteins before scoring and compared to similarly stripped experimental structures; and (iii) Minimal Complex, in which models predicted using only the E3 ligase and target protein were compared to experimental structures containing only those components.

For PRosettaC-based Ternary Complex predictions, DockQ evaluation was performed on the two protein chains—E3 ligase and target. These predictions were scored against experimental structures similarly stripped to contain only the E3 and target chains.
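Stripping a structure to its E3–target core before scoring amounts to filtering coordinate records by chain ID. A minimal sketch follows; the chain IDs here are hypothetical, and real assemblies require the correct per-entry E3/target chain mapping:

```python
def strip_to_core(pdb_lines, keep_chains):
    """Keep only ATOM/HETATM records whose chain ID (PDB column 22,
    string index 21) is in keep_chains; pass other records through."""
    kept = []
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")) and line[21] not in keep_chains:
            continue  # drop accessory-protein coordinates before scoring
        kept.append(line)
    return kept

# Toy records: chains A (E3) and B (target) are kept; accessory chain C is dropped.
toy = [
    "ATOM      1  N   MET A   1",
    "ATOM      2  N   MET B   1",
    "ATOM      3  N   MET C   1",
    "END",
]
core = strip_to_core(toy, keep_chains={"A", "B"})
```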

Each DockQ score reflects a composite of interface RMSD (iRMSD), ligand RMSD (LRMSD), and the fraction of native contacts (Fnat), producing a normalized value between 0 and 1, where higher values indicate closer structural agreement with the reference interface.
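The composite can be reproduced from its published definition (Basu and Wallner, 2016); note that the scaling constants of 1.5 Å (iRMSD) and 8.5 Å (LRMSD) come from that paper, not from the present study:

```python
def rms_scaled(rmsd, d0):
    """Map an RMSD in Angstroms onto (0, 1] with DockQ's inverse-quadratic scaling."""
    return 1.0 / (1.0 + (rmsd / d0) ** 2)

def dockq(fnat, irmsd, lrmsd):
    """DockQ composite: mean of Fnat and the scaled iRMSD/LRMSD terms."""
    return (fnat + rms_scaled(irmsd, 1.5) + rms_scaled(lrmsd, 8.5)) / 3.0

perfect = dockq(1.0, 0.0, 0.0)    # exact native interface -> 1.0
poor = dockq(0.05, 12.0, 30.0)    # far-from-native interface -> well below 0.23
```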

In addition to static comparisons, DockQ v2 was also used for time-resolved analysis by comparing predicted models to molecular dynamics (MD) trajectories of the corresponding crystal structures. This enabled us to assess transient structural alignment across thousands of simulation frames. For PRosettaC-generated models, which can include up to 1000 structures per system, this provided fine-grained insight into conformational diversity and model accuracy in dynamic solution contexts.

Molecular dynamics setup and simulation protocol

To assess the dynamic behavior and conformational stability of modeled ternary complexes, we performed all-atom MD simulations using GROMACS 2023.1. The CHARMM36-jul2022 force field was employed for all protein and ligand components to ensure compatibility with high-fidelity protein–ligand interactions. Ligand parameters were generated using the CGenFF server, with .str files converted to GROMACS-compatible .itp formats.

(A) System preparation and solvation.

Each complex—whether derived from experimental crystal structures or computational models—was first converted into GROMACS format and assigned appropriate protonation states. Systems were solvated in a TIP3P water box with a 1.0 nm buffer and neutralized with Na⁺ and Cl⁻ ions to achieve an ionic strength of 0.15 M. Topologies were generated using standard CHARMM36 force field protocols.
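The number of ion pairs implied by a 0.15 M target concentration follows from the box volume. The sketch below is a back-of-the-envelope check with an illustrative box size; in practice the counts come from gmx genion after solvation, and neutralizing the system’s net charge adds further ions:

```python
AVOGADRO = 6.02214076e23  # particles per mole

def ion_pairs(box_volume_nm3, conc_molar=0.15):
    """Approximate number of Na+/Cl- pairs for a target salt concentration."""
    litres = box_volume_nm3 * 1e-24  # 1 nm^3 = 1e-24 L
    return round(conc_molar * AVOGADRO * litres)

pairs = ion_pairs(1000.0)  # ~90 pairs for a hypothetical 10 x 10 x 10 nm box
```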

(B) Energy minimization and equilibration.

After solvation, systems underwent steepest descent energy minimization until a maximum force threshold of 1000 kJ/mol/nm was reached. Equilibration was conducted in two phases: a 100 ps NVT (constant Number, Volume, Temperature) ensemble using the modified Berendsen thermostat at 300 K, followed by a 100 ps NPT (constant Number, Pressure, Temperature) ensemble using the Parrinello–Rahman barostat at 1 bar. Position restraints were applied to heavy atoms of the protein and ligand during equilibration to allow solvent relaxation.

(C) Production simulations.

Production runs were performed in the NPT ensemble for 50 ns per system using a 2 fs integration time step and LINCS constraints on all bonds. Long-range electrostatics were treated using the Particle Mesh Ewald (PME) method. Periodic boundary conditions were applied in all directions, and coordinates were saved every 10 ps, yielding 5000 frames per trajectory for downstream analysis.
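These run parameters fix the trajectory bookkeeping; the step and frame arithmetic implied above can be sketched as:

```python
def md_bookkeeping(sim_ns, dt_fs, save_ps):
    """Integrator steps and saved frames for a production run."""
    n_steps = int(sim_ns * 1e6 / dt_fs)    # ns -> fs, divided by the time step
    n_frames = int(sim_ns * 1e3 / save_ps)  # ns -> ps, divided by the save interval
    return n_steps, n_frames

# 50 ns at a 2 fs step, saving every 10 ps (the protocol described above).
steps, frames = md_bookkeeping(sim_ns=50, dt_fs=2, save_ps=10)
```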

All simulations were conducted on GPU-accelerated high-performance computing clusters to ensure efficient sampling. Trajectories were post-processed to correct for periodicity and protein displacement, allowing accurate computation of root-mean-square deviation (RMSD), root-mean-square fluctuation (RMSF), and frame-by-frame DockQ comparisons to predicted and experimental structures.

Statistical analysis

To comprehensively assess the structural accuracy of predicted ternary complexes, we conducted both descriptive and inferential statistical analyses across multiple modeling and evaluation strategies: AF3 Minimal, AF3 Full, AF3 Core, and PRosettaC Ternary. DockQ v2 scores were analyzed at the per-model and per-structure level to evaluate interface quality across 36 crystal benchmark systems.

Descriptive statistics—including median, mean, standard deviation, and interquartile range—were calculated for each method to summarize performance distributions. To assess significance in model performance, paired Wilcoxon signed-rank tests were applied to compare median DockQ values across methods (e.g., PRosettaC Ternary vs AF3 Minimal, AF3 Core vs AF3 Minimal, etc.), with Cohen’s d used to estimate effect sizes.
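A minimal sketch of the paired comparison follows. The DockQ values below are toy numbers, not the benchmark data, and in the actual pipeline the test statistic would come from scipy.stats.wilcoxon:

```python
from statistics import mean, stdev

def paired_cohens_d(a, b):
    """Effect size for paired samples: mean of per-system differences
    divided by their standard deviation."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / stdev(diffs)

# Hypothetical per-structure median DockQ values for two methods.
method_a = [0.31, 0.12, 0.45, 0.08, 0.27, 0.19]  # e.g., PRosettaC Ternary
method_b = [0.10, 0.09, 0.15, 0.07, 0.11, 0.12]  # e.g., AF3 Minimal
effect = paired_cohens_d(method_a, method_b)      # positive -> method_a scores higher
# Significance testing: scipy.stats.wilcoxon(method_a, method_b) in the full pipeline.
```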

Visualization of score distributions was achieved through box plots, swarm plots, scatter plots, and heatmaps of per-complex deltas, allowing intuitive comparison across prediction strategies. Additionally, ranking frequency plots and median-centered bar charts were used to compare relative method performance across systems.

For MD simulations, time-series DockQ trajectories were generated across 5000 MD frames for five benchmark systems. We analyzed per-frame fluctuations in DockQ scores for each AF3 Minimal model (models 0–4), annotating peak scores and comparing temporal trends to assess whether any spontaneous realignment toward the crystal interface occurred. Maximum observed DockQ scores were noted for each trajectory, and graphical overlays were used to highlight these transient structural alignments.
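Annotating the peak score in each per-model trace reduces to an argmax over frames; a sketch with an illustrative toy trace:

```python
def peak_frame(dockq_trace):
    """Return (frame_index, score) of the maximum DockQ along one trajectory trace."""
    idx = max(range(len(dockq_trace)), key=dockq_trace.__getitem__)
    return idx, dockq_trace[idx]

trace = [0.02, 0.04, 0.10, 0.06, 0.03]  # toy frame-by-frame DockQ values
frame, score = peak_frame(trace)         # the transient maximum and where it occurs
```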

In a follow-up dynamic evaluation, we also compared static PRosettaC Ternary models against MD-resolved frames of the crystal structures to detect transient conformational matches, enabling a frame-by-frame similarity analysis that captured model–receptor alignment beyond the static crystal pose.

All statistical analyses were performed using Python libraries including scipy.stats, pandas, and matplotlib, and visualizations were rendered using seaborn and plotly.

Results

Comparative performance of AF3 core and minimal complex predictions

AlphaFold-3 (AF3) is a deep-learning structure-prediction method that couples transformer attention with equivariant geometric modules to output atomic coordinates directly from sequence26. In its multimer mode, the network jointly reasons over multiple chains, optimizing inter-chain contacts, orientations, and backbone geometry within the same differentiable framework to deliver high-resolution protein-complex models from sequence alone.

To assess whether retention and removal of accessory proteins meaningfully impacted AF3 prediction accuracy, we compared Core versus Minimal Complex models across our 36 benchmark systems. In this context, Core models were generated as Full Complex predictions (including accessory proteins) but evaluated on only the E3–target interface, while Minimal models were predicted and evaluated using only the E3 and target proteins.

Overall performance trends

Contrary to initial impressions from Full-versus-Minimal artifacts, Core and Minimal workflows exhibited largely overlapping DockQ distributions. As shown in Figure S1 (scatter plots) and Figure S2 (bar graph), Core models displayed only marginal increases in median DockQ (typically +0.02–0.05) compared to Minimal models, with many systems showing no substantive difference. Both methods generally produced low DockQ values (< 0.2) for the majority of complexes, indicating that accessory proteins conferred minimal genuine improvement at the E3–target interface when scored identically.

Model-level variability

Swarm and box plot visualizations (Figures S3 and S4) confirmed high overlap between individual Core (green) and Minimal (blue) scores. In some cases, such as 5T35 and 7KHH, Core models trended slightly higher, but variance ranges largely coincided, underscoring that benefits of accessory context were often insufficient to produce consistently better interface predictions.

Statistical validation

A paired t-test on per-structure median scores yielded t = 0.842, p = 0.4081, Cohen’s d = 0.168, indicating that differences between Core and Minimal were not statistically significant at conventional thresholds. Therefore, the apparent advantage of including accessory proteins during modeling largely vanished when evaluation focused strictly on the E3–target interface.

Impact of accessory interfaces: detecting and quantifying DockQ inflation

Initially we had scored the Full complexes, which appeared highly accurate. However, as the preceding analysis suggested, those gains were artefactual. We therefore measured the magnitude of accessory-driven inflation in two steps.

(A) Full versus minimal: identifying the artefact

When the models were evaluated with all chains (Full), DockQ scores clustered near 0.9 in roughly half of the benchmark systems. Scatter plots and median traces (Figures S5, S6) revealed consistently high-quality groupings for these complexes. Re-scoring the identical coordinates as Minimal complexes drove the medians down to about 0.2 or below for most structures. The main exceptions were three BRD4-BD1 PROTACs, 8BEB, 7KHH, and 8BDS, whose Minimal evaluations remained comparatively high at about 0.6, though still lower than their corresponding Full scores. Swarm and box plots (Figures S7, S8) underscored this bifurcation: about half of Full models populated the upper range, whereas Minimal models clustered at much lower scores. A heat map of median differences (Full − Minimal, Figure S9) illustrated the same result, recording positive shifts of 0.4–0.9 for about half the systems. A paired t-test confirmed the significance of the effect (t = −5.55, p = 3.0 × 10⁻⁶), demonstrating that the apparent “high quality” of Full predictions arose largely from accessory-protein contacts.

(B) Full versus core: quantifying inflation after stripping accessories

Directly rescoring identical coordinates before and after accessory removal offered a stringent test of inflation. For nearly every complex whose Full evaluation yielded an excellent median DockQ (> 0.8), the corresponding Core score collapsed below 0.2 (Figure S10), with many individual Core values clustering near zero (Figure S11). Median drops (Full − Core) exceeded 0.8 in 15 systems and reached 0.95 in several cases (Figure S12). Three complexes, 8BEB, 7KHH, and 8BDS, declined more modestly, by 0.3–0.4. At the model level the pattern was similar: half of the structures lost more than 0.3, and 40% lost more than 0.8 DockQ units (Figure S13). A histogram of DockQ scores (Figure S14) and the corresponding cumulative distribution function (Figure S15) reinforced the separation: Core complexes peaked at 0–0.1 and reached about 80% cumulative frequency at low DockQ values, while Full complexes reached only about 40% at DockQ < 0.2. Full complexes formed a sharp peak at 0.9–1.0, while Core complexes reached a maximum DockQ of about 0.6. Statistical tests corroborated these findings (t = −12.9, p = 2.8 × 10⁻²⁷; Wilcoxon p = 5.6 × 10⁻²⁰).

In summary, high DockQ scores observed for AF3 Full complexes stemmed almost entirely from contacts with accessory proteins. Once evaluation was limited to the biologically relevant E3–target interface, either by rescoring Full models as Core or by comparing Core with Minimal predictions, performance gains largely vanished. Consequently, AF3 offered only marginal, statistically non-significant improvements for the true binary interaction, and initial “high-quality” assessments were revealed to be artefacts of non-target interfaces.

Comparative performance of PRosettaC and AlphaFold-3 predictions

To extend our benchmarking beyond the AF3 modeling strategies, we evaluated PROTAC complex predictions generated using PRosettaC—a constraint-guided Rosetta-based protocol for ternary assembly. Due to limitations associated with ligand or chain mapping in the original PRosettaC script, we were unable to model 11 of the 36 systems. The reasons for these failures (e.g., unresolvable chain conflicts, linker atom mismatches, or fragment docking crashes) are summarized in the Discussion section. Here, we restricted our analysis to the 25 systems that were successfully modeled using all three methods: PRosettaC, AF3 Minimal, and AF3 Core.

Across these 25 structures, we compared DockQ scores to quantify the accuracy of predicted interfaces relative to experimental crystal structures. PRosettaC predictions yielded a notably broader distribution of DockQ values than either of the AlphaFold-3 strategies (Fig. 1). In several cases—most prominently 6BN7, 6ZHC, 7KHH, and 8QVU—PRosettaC achieved high-scoring outliers or consistently superior medians. AF3 Core tended to yield tighter, low-variance distributions concentrated in the 0.02–0.12 range, while Minimal predictions were more uniform but also typically low-scoring. These findings indicated that PRosettaC’s constraint-guided sampling can access productive binding poses that AlphaFold’s end-to-end deep-learning models did not readily generate, especially for systems with well-defined anchor geometries.

Fig. 1
figure 1

Median-centered DockQ score distributions per method for each reference benchmark system. PRosettaC displayed broader score variability and higher medians in several cases, especially 6BN7, 6ZHC, 7KHH, and 8QVU. AF3 Core and Minimal predictions clustered tightly at relatively low DockQ scores.

To visualize DockQ score coverage more holistically, we constructed cumulative distribution functions (CDFs) for each method (Fig. 2). The AF3 Minimal and Core models concentrated most predictions within the lower DockQ regime (< 0.2), while PRosettaC preserved a longer high-scoring tail. Notably, PRosettaC surpassed the 0.23 “acceptable” DockQ threshold more frequently than the AF3 models (Fig. 3). These data reinforced the notion that constraint-guided docking retained the ability to generate native-like interactions even when median performance was modest.
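This CDF comparison reduces to the fraction of models clearing a given DockQ cutoff per method; a sketch with toy score lists (the 0.23 cutoff is the standard DockQ “acceptable” threshold, and the numbers below are illustrative, not the benchmark data):

```python
def frac_acceptable(scores, threshold=0.23):
    """Fraction of models at or above the 'acceptable' DockQ cutoff."""
    return sum(s >= threshold for s in scores) / len(scores)

# Toy distributions echoing the qualitative trend: a long high-scoring tail
# versus scores concentrated below 0.2.
prosettac_like = [0.05, 0.10, 0.31, 0.55, 0.12, 0.69]
af3_like = [0.03, 0.06, 0.09, 0.11, 0.08, 0.24]
```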

Fig. 2
figure 2

Cumulative distribution function (CDF) of DockQ scores across all predictions. AF3 Minimal and Core models exhibited highly skewed distributions toward low DockQ values, while PRosettaC maintained a longer high-score tail suggesting better predictions in many cases.

Fig. 3
figure 3

Normalized distribution of DockQ scores across the three prediction methods: AF3 Minimal and Core Complex models, and PRosettaC (Rosetta-based docking) Ternary Complex predictions. Each bar represents the fraction of models within a given DockQ bin, normalized within each group. The majority of models across all methods fell below the 0.1 DockQ threshold. PRosettaC showed a broader distribution and slightly higher frequencies in mid-to-high score bins, reflecting improved modeling of native-like interfaces in certain systems.

We next assessed the relative performance of each method through pairwise comparisons of median DockQ scores across benchmark reference structures. As shown in the delta heatmap (Fig. 4), PRosettaC consistently outperformed AF3 Minimal Complex models, with an average ΔDockQ of +0.11, indicating a marked improvement in interface prediction when using Rosetta-based docking. AF3 Core occasionally approached PRosettaC in performance but rarely surpassed it; most complexes, including 7PI4 and 8G1Q, still showed equal or superior performance by PRosettaC. These results highlighted PRosettaC’s strengths in sampling realistic ternary poses, while also illustrating that scaffold-stabilized AF3 predictions could, in select systems, produce moderately competitive models, though they seldom exceeded PRosettaC in interface accuracy.

Fig. 4
figure 4

Heat map of median DockQ score differences. Rows list each head-to-head comparison in the format “Method A—Method B,” while columns show the benchmark complexes (PDB IDs). Cell colors represent the median DockQ of Method A minus Method B: red indicates Method A performed better, blue indicates Method B performed better, and grey denotes negligible difference. The plot highlights that PRosettaC Ternary predictions in most cases exceeded AF3 Minimal results, whereas AF3 Core models rarely outperformed PRosettaC.

To better visualize model-level variability, we plotted all individual DockQ scores per native complex using dot plots and swarm plots (Figs. 5, S16). PRosettaC generated a wide spectrum of poses across most systems, with several high-scoring models that far exceeded AF3 predictions. These standout examples suggested that while PRosettaC can struggle in difficult scenarios, it maintained potential for highly accurate predictions in favorable contexts.

Fig. 5
figure 5

Swarm plot of DockQ scores per model. PRosettaC consistently populated the high-scoring regime (> 0.5), particularly for 6BN7, 6ZHC, and 7KHH, while AF3 models clustered near baseline.

To further illustrate these results, we calculated and visualized the median DockQ scores for the three modeling methods by reference benchmark system in a grouped line plot (Fig. 6). For each PDB reference structure, the modeling methods were ranked and the method with the highest median was considered the “winner” (1st rank). Rankings were summarized in a heatmap Figure S17.
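The per-system ranking is an argmax over per-method medians; a minimal sketch (method names and values here are illustrative):

```python
def rank_methods(medians):
    """Return method names sorted best-first by median DockQ for one system."""
    return sorted(medians, key=medians.get, reverse=True)

# Hypothetical medians for one PDB reference structure.
example = {"PRosettaC Ternary": 0.31, "AF3 Minimal": 0.10, "AF3 Core": 0.12}
ranking = rank_methods(example)
winner = ranking[0]  # the 1st-rank method for this system
```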

Fig. 6
figure 6

Median DockQ score per method for each native complex. PRosettaC outperformed the other methods in nearly half of the targets, with some competition from AF3 Core Complex (e.g., 7KHH, 5T35) and Ternary (e.g., 7Z77, 6HAX).

The stacked bar chart (Figure S18) summarized the overall method performance. PRosettaC was the top-performing method in 48% of the cases. The differences in rankings of AF3 Minimal and Core were not significant, as shown above (Sect. “Results”). These results highlighted PRosettaC’s dominant, though not universal, advantage in ternary complex prediction.

Assessing AlphaFold-3 predictions against dynamic reference ensembles

As an additional methodology to evaluate whether AlphaFold-3 predictions corresponded to conformations of the native E3 ligase-target ternary complex in solution, we performed 50 ns MD simulations of five crystallographic PROTAC complexes. Rather than simulating all AF3 models themselves, we treated the crystal structures as reference trajectories that sample the complex’s intrinsic conformational landscapes. Each static AF3 prediction (Models 1 to 5) was then compared to every MD frame with DockQ, providing a frame-wise measure of the interface differences of the native ensemble compared to those predicted by AF3.

Throughout the simulations, DockQ scores stayed uniformly low for all AF3 models, indicating that the native complexes did not sample conformations close to the AF3 poses. By tracking DockQ on a frame-by-frame basis, we confirmed that none of the predicted interfaces aligned with the reference structures even under dynamic conditions. These results implied that the AF3 predictions occupied regions of conformational space that were largely inaccessible to the experimentally observed complexes under the simulated conditions, confirming our results above based on static DockQ scores.

Reference structures 6HAX and 6HAY, both SMARCA2-bromodomain complexes with the VHL E3 ligase, displayed similar conformational-ensemble trajectories. Although several AF3 models showed small transient DockQ upticks early in the simulation (frames 900–1100), all DockQ scores stayed below 0.1 for the entire run (Figs. 7 and S19).

Fig. 7
figure 7

Frame-by-frame DockQ scores for 6HAX (SMARCA2:VHL) over a 50 ns MD trajectory (5000 frames) as an illustrative example. Each colored trace reports the similarity between a static AF3 Minimal Complex model and successive MD frames of the crystal structure. Although Model #1 exhibited a brief peak at DockQ = 0.10 early in the simulation, the native complex did not sample conformations close to any AF3 model. All traces fluctuated below 0.1 throughout, indicating that crystal-derived dynamics did not traverse the predicted interfaces.

6HR2, a SMARCA4-targeting VHL complex, displayed a stable trajectory as described by DockQ progression for all models. Throughout the simulation DockQ scores never exceeded 0.12 for any of the AF3 models and even declined substantially in the last 20% of the simulation, again confirming our initial static results (Figure S20).

In the BTK-targeting system 8DSO, DockQ scores fluctuated between 0.01 and 0.06 for Models #1 through #4 and transiently increased up to 0.15 for Model #5. However, DockQ scores remained below the 0.23 threshold considered acceptable, consistent with previous results (Figure S21).

Finally, 8QVU, a KRAS-targeting complex, demonstrated the least variation throughout the simulation, with DockQ values against all models tightly clustered between 0.01 and 0.04. These results suggested a stable reference ternary complex, with none of the AF3 predictions resembling the experimentally determined interface in the dynamic setting (Figure S22).

Collectively, these results confirmed that, even after molecular dynamics sampling of the native complexes, AF3 Minimal models did not show improved alignment with the dynamic ensembles sampled by the reference structures. This reinforces the notion that, without explicit structural constraints or multimeric context, AF3 predictions for ternary complexes remain disconnected from experimentally observed poses—likely reflecting limited training data or intrinsic differences between natural protein assemblies and PROTAC-induced ternary complexes.

DockQ scoring of PRosettaC models against MD-simulated crystal ensembles

Previous assessments of PRosettaC model accuracy relied solely on comparisons to a static crystal structure, as deposited in the PDB. Such static structures capture only a single conformation of what is, in reality, a dynamic ensemble of states accessible to proteins in solution. This limitation can obscure potential compatibilities between predicted ternary complexes and transient or low-population conformations of the true complex.

To address this, we performed MD simulations of the experimental crystal structures for four benchmark complexes (6HAX, 6HAY, 6HR2, and 8QVU). Rather than running MD on the PRosettaC models themselves, we generated a trajectory of the experimental reference structure (sampling every 50 frames, corresponding to 0.5 ns intervals across a 50 ns trajectory) to represent an ensemble of biologically accessible conformations. For the 6HAX system, we selected the top 5 PRosettaC models (ranked by DockQ score against the static crystal structure) for detailed comparison against the MD trajectory, while the other complexes were evaluated using their single best-scoring model. These studies aimed to explore whether any of the PRosettaC models aligned with a local ensemble of reference complex conformations accessible via MD simulations.
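The frame-sampling bookkeeping of this protocol can be sketched in a few lines. The trajectory length, frame count, and stride below mirror the numbers stated above (5000 frames over 50 ns, every 50th frame); the per-frame DockQ values and helper names are illustrative placeholders, not the actual benchmark data.

```python
# Sketch of ensemble-based DockQ evaluation (illustrative values only).
# A 50 ns trajectory stored as 5000 frames gives 0.01 ns per frame;
# sampling every 50th frame yields 100 snapshots spaced 0.5 ns apart.

TRAJ_NS = 50.0
N_FRAMES = 5000
STRIDE = 50
NS_PER_FRAME = TRAJ_NS / N_FRAMES  # 0.01 ns

def sampled_times(n_frames=N_FRAMES, stride=STRIDE):
    """Simulation times (ns) of the strided reference frames."""
    return [i * NS_PER_FRAME for i in range(0, n_frames, stride)]

def peak_frames(dockq_by_frame, threshold=0.23):
    """Frames where a model transiently exceeds the DockQ 'acceptable' cutoff."""
    return [(t, q) for t, q in dockq_by_frame if q > threshold]

times = sampled_times()  # 100 sampled time points, 0.0, 0.5, 1.0, ... ns

# Hypothetical per-frame scores for one model against the reference ensemble:
# a transient burst of agreement mid-trajectory, low scores elsewhere.
scores = list(zip(times, [0.10] * 40 + [0.30] * 10 + [0.05] * 50))
print(len(peak_frames(scores)))  # count of frames above the 0.23 threshold
```

In practice each (time, score) pair would come from a DockQ comparison of the static model against the corresponding MD frame; the synthetic burst here simply illustrates how transient compatibility windows are detected.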

This ensemble-based analysis revealed structural compatibilities that were not apparent from static evaluations. While many PRosettaC predictions scored modestly against the single crystal frame, several achieved transient DockQ peaks (> 0.23) when aligned with specific frames from the MD trajectory. These observations suggest that PRosettaC models may capture interface geometries that match transient, but plausible, conformations of the ternary complex.

For example, the KRAS:VHL system (8QVU) showed excellent alignments with many of the PRosettaC models over prolonged durations of the simulation (Figure S23). The best PRosettaC model achieved a DockQ score of 0.69 against the static crystal structure reference (Fig. 5). This high-scoring model maintained strong alignment across numerous frames of the MD simulation (Figure S23), as expected for a favorable model that tracks nearby dynamically sampled reference conformations. This consistency further reinforced the plausibility of the PRosettaC model and suggested that dynamic scoring may offer an additional approach for evaluating the viability of predicted ternary complexes.

Similarly, the SMARCA4:VHL complex (6HR2) showed moderate to high DockQ scores, with many predictions exceeding 0.5 for extended portions of the reference ensemble time course (Figure S24). As before, these results suggested that the best PRosettaC models score well against an ensemble of nearby reference ternary complex conformations. In contrast, systems like 6HAY (SMARCA2:VHL) maintained low scores against all models throughout the entire simulation trajectory (Figure S25), indicating limited model–structure agreement even in the ensemble context.

The SMARCA2:VHL system (6HAX) was particularly interesting: while its top static PRosettaC model scored below the 0.23 threshold (Figs. 1 and 5, S16), transient DockQ spikes exceeding 0.4 were observed against specific frames of the reference ensemble (Fig. 8, Figure S26). These high-scoring bursts indicate that while the static alignment is weak, the model can transiently match conformations sampled by the crystal reference. By scoring the top 5 PRosettaC models against the entire MD trajectory of 6HAX (Fig. 8), we observed that each model exhibited distinct time points and DockQ scores of peak compatibility, supporting the ensemble nature of ternary complex interfaces. This reinforces the idea that evaluating models against dynamic references can reveal “hidden” compatibilities not apparent in static snapshots. Together, these results suggest that ternary complex modeling benefits from dynamic evaluation strategies, which can complement static scoring and guide degrader design by identifying conformational ensembles rather than static states.

Fig. 8
figure 8

Frame-by-frame DockQ scores for the five best PRosettaC models evaluated against a 50 ns MD ensemble of the SMARCA2:VHL ternary complex (PDB 6HAX). The reference trajectory was sampled every 50 frames, and each colored trace shows the DockQ score of a single PRosettaC model against the MD time course of the PDB reference ternary complex. Although most frames produced modest agreement, several stretches exceeded the 0.23 “acceptable” threshold, peaking at 0.45, revealing transient conformations in the native ensemble that align well with the PRosettaC predictions.

Having examined model quality with respect to both static structures and MD-derived ensembles, we next asked how one might identify the most plausible ternary pose a priori, i.e., without access to a reference complex. In practical PROTAC design the scoring function itself must rank candidate models, and for PRosettaC that ranking is driven by Rosetta’s all-atom energy. We therefore evaluated how well Rosetta’s internal scoring function tracks interface accuracy across our benchmark set by analyzing the correlation between Rosetta total score and DockQ across all PRosettaC predictions (Figure S27). This analysis revealed generally poor (in several cases inverse) correlation across most benchmark systems, with many exhibiting low to near-zero R² values. This finding suggested that Rosetta’s energy landscape, while effective for global optimization, does not consistently reflect interface-level accuracy as measured by DockQ. A possible alternative might lie in combining a small, representative set of top-ranked models with short MD simulations and evaluating their agreement with the resulting dynamic ensembles. Such an approach could provide a more discriminating filter when experimental references are unavailable and will be pursued in future work.
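The per-system correlation check can be sketched as a Pearson analysis of (Rosetta total score, DockQ) pairs. The helper and the score pairs below are synthetic illustrations, not data from the benchmark; a strong negative r (low energies paired with high DockQ) would indicate that the energy tracks interface accuracy, while the near-zero values reported above indicate it does not.

```python
# Sketch of the Rosetta-score vs DockQ correlation analysis (synthetic data).

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Hypothetical (Rosetta total score, DockQ) values for one system's models.
rosetta = [-310.2, -305.8, -298.4, -312.9, -301.1]
dockq = [0.12, 0.31, 0.08, 0.05, 0.22]

r = pearson_r(rosetta, dockq)
print(f"r = {r:.2f}, R^2 = {r * r:.2f}")
```

Repeating this per benchmark system and inspecting the distribution of R² values reproduces the analysis summarized in Figure S27.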

Discussion

Developing therapeutics that function through targeted protein degradation poses unique challenges. PROTACs function by co-opting cellular ubiquitination machinery, requiring the formation of a stable ternary complex of a target protein, an E3 ligase, and the degrader molecule28. Modeling this assembly in silico is non-trivial, as it depends on both the global stability of the multi-protein architecture and the precise geometry of linker-mediated interfaces29. Our study presents a systematic comparison of two leading computational strategies—AF3 and PRosettaC—across 36 crystallographically resolved ternary complexes. By evaluating them across static, scaffold-aware, and dynamic simulation contexts, we uncovered trade-offs in modeling assumptions, interface accuracy, and evaluation methodology (Table 1).

Table 1 Comparative summary of ternary complex modeling methods.

AF3 and PRosettaC adopt fundamentally different modeling strategies. AF3 predicts ternary complexes holistically, relying on deep-learning priors that optimize overall structural plausibility rather than explicit physics-based energies. This enables the method to propose plausible global folds without prior structural information. Yet, because it imposes no degrader-specific constraints, AF3 often produces protein orientations that cannot accommodate a physical linker between E3 ligase and target, even when the models carry high confidence scores and appear globally coherent30.

PRosettaC, in contrast, builds from known (or proposed) degrader molecules. It explicitly enforces anchor atom geometries derived from known binding modes and samples linker conformations that connect these anchors. This approach yields highly relevant interface predictions when anchor chemistry is compatible, but is brittle when the anchor distance, orientation, or sampling strategy fails to produce viable geometries. In such cases, PRosettaC can produce unfeasible complexes or fails entirely to generate a model.

This contrast underscores a fundamental trade-off: AF3 offers robustness and broad generalizability, whereas PRosettaC delivers precision through explicit constraints. The choice between them hinges on project goals. Exploratory modeling of novel degrader scaffolds favors AF3, while fine-tuning linker orientations in well-characterized systems benefits from PRosettaC.

Across our benchmark, AF3 delivered impressive DockQ values when Full Complexes were evaluated, yet these high scores mainly reflected contacts with accessory proteins like Elongin B/C, DDB1, and other scaffolds, rather than the degrader-relevant interface. Removing those scaffolds in the Core configuration (E3 ligase+target only) caused the scores to collapse, exposing the weak agreement at the true ternary interface.

This outcome highlights a key limitation of DockQ and similar global-overlap metrics: they are agnostic to biochemical relevance. DockQ rewards any correctly superimposed surface, so extensive but functionally irrelevant contacts can mask errors at the interface that matters for degrader design. Consequently, high DockQ scores should be interpreted with caution; they do not guarantee a chemically actionable pose at the ligase-linker-target junction.

When PRosettaC succeeded, its models respected the expected warhead geometry and scored competitively on global metrics. In several complexes it matched or outperformed AF3 while sampling a narrower search space, showing that constraint-driven docking can surpass unconstrained deep-learning predictions when reliable anchor information is available.

PRosettaC’s failure modes were equally instructive. In 11 of the 36 systems, the protocol did not yield a viable ternary complex. Most failures arose during linker sampling, when the anchor atoms could not be connected within the protocol’s physical tolerances. These cases typically involved unconventional binding topologies: warheads that were folded, flipped, or oriented orthogonally, such that a straightforward distance-based alignment was impossible. Because PRosettaC relies on fixed point-to-point distance constraints, it was ill-equipped to explore these geometries.

This rigidity reflects a broader challenge in degrader modeling: the spatial path between ligase and target is rarely straight. Flexible linkers often adopt convoluted shapes, especially in cases involving deep binding pockets or allosteric surfaces. Addressing these challenges may require next-generation sampling algorithms such as torsion-driven, ensemble-based, or MD-informed conformer generation to more fully explore the solution space.

Standard benchmarks compare models to a single crystallographic conformation, even though a crystal structure captures only one snapshot of a dynamic protein complex. To better assess pose plausibility, we introduced an MD-based framework that treats the crystal structure dynamically, extracting thousands of frames from 50 ns simulations of each ternary complex.

This approach let us test whether PRosettaC models—even those that scored poorly against the static crystal—ever coincided with conformations within a native-like ensemble sampled during MD. The results were striking. In several systems (e.g., 6HR2 and 8QVU), models sustained DockQ values above 0.5 across stretches of the MD trajectory. For 6HAX, the best static score was below 0.2, yet multiple PRosettaC predictions substantially exceeded the 0.23 DockQ acceptability threshold against MD-sampled frames, revealing improved alignment with transient geometries in the native-like ensemble.

This finding highlighted the value of assessing models against even a short, local conformational ensemble rather than a single crystal snapshot. A pose that transiently matches an accessible state in that ensemble can be as informative for drug design as one that perfectly reproduces the crystallographic structure. Although static coordinates remain central to structure-based design, our 50 ns MD comparisons showed that fleeting alignments can reveal biologically relevant conformations, such as induced-fit or low-population states that crystallography alone may miss. Ensemble-based evaluation thus complements traditional static benchmarks and offers a broader perspective for identifying viable degrader poses in flexible protein systems. Taken together, these observations suggest that protein interfaces are best assessed against local conformational ensembles rather than single static structures, and that brief MD sampling could further refine ternary-complex interface geometries. These approaches will therefore likely benefit the development of novel PROTAC molecules.

A recent study by Rovers and Schapira benchmarked three PROTAC-specific modeling platforms: PRosettaC, MOE, and ICM. Unlike general-purpose docking tools, these were evaluated specifically for their ability to generate ternary complexes consistent with experimental structures48. Their results demonstrated that while accurate poses could sometimes be recovered, a majority of predictions deviated significantly from crystallographic conformations. Importantly, their study incorporated molecular dynamics simulations to demonstrate that ternary complexes occupy a range of solution-accessible states, thereby challenging the notion of a single “correct” structure. While their conclusions align with our findings about conformational multiplicity, our study extends this concept by using MD trajectories of the crystal reference itself as the benchmark. This allows us to detect transient but high-fidelity model–frame alignments, highlighting not only the flexibility of ternary complexes but also the potential relevance of alternative poses generated by tools like PRosettaC. Together, these complementary studies reinforce the need for dynamic benchmarking frameworks and underscore the limitations of static evaluations in the context of degrader design.

Our benchmark showed that Rosetta’s all-atom energy, while valuable for global optimization, is a weak surrogate for interface accuracy: DockQ correlations were low or even inverse in many systems (Figure S27). We also observed that PRosettaC’s cluster-based ranking did not solve the problem. Its top-scoring cluster often lacked the model with the best DockQ, meaning cluster energy or size alone can still miss the best models. Together, these findings imply that relying solely on Rosetta energy (whether at the model or cluster level) risks selecting geometrically incorrect ternary complexes, particularly in landscapes where non-productive arrangements are energetically favored. A more robust workflow should layer complementary filters, including interface-aware scores such as DockQ, short MD-ensemble checks, or AI/knowledge-based metrics on top of Rosetta energy and clustering, ensuring that final selections are both low in energy and geometrically consistent with functional ternary geometries.
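One way to realize such a layered workflow is a simple consensus ranking that combines Rosetta energy rank with ensemble agreement rank. The model names, scores, and equal-weight scheme below are hypothetical illustrations of the idea, not a validated selection protocol.

```python
# Minimal sketch of a layered model-selection filter (all values hypothetical).
# Each candidate receives a consensus rank combining its Rosetta total score
# (lower is better) with its best DockQ against a short MD ensemble
# (higher is better); ties and weighting would need tuning in practice.

def consensus_rank(models):
    """models: list of (name, rosetta_score, best_ensemble_dockq) tuples."""
    by_energy = sorted(models, key=lambda m: m[1])             # low energy first
    by_dockq = sorted(models, key=lambda m: m[2], reverse=True)  # high DockQ first
    rank = {m[0]: by_energy.index(m) + by_dockq.index(m) for m in models}
    return sorted(models, key=lambda m: rank[m[0]])

candidates = [
    ("model_1", -312.0, 0.10),  # best energy, poor ensemble agreement
    ("model_2", -310.0, 0.41),  # slightly weaker energy, best ensemble agreement
    ("model_3", -301.0, 0.18),
]
best = consensus_rank(candidates)[0]
print(best[0])  # the consensus pick, balancing energy and ensemble DockQ
```

In this toy example the energetically best model is demoted because it never matches the dynamic ensemble, illustrating how a geometric filter can override a purely energetic ranking.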

Despite the comprehensive scope of this benchmark, several limitations should be acknowledged. First, the use of the web-based AF3 server imposed residue count restrictions that prevented inclusion of full-length E3 ligase complexes, such as CUL2, CRL4A, and RBX1. While the accessory proteins Elongin B/C and DDB1 were included in the AF3 Full Complex models, omitting larger scaffold components may have reduced biological fidelity in multimeric systems and limited the generalizability of the results.

Second, the PRosettaC protocol failed to generate viable models in 11 out of the 36 benchmarked systems. These failures were primarily due to linker sampling constraints, unresolved atom naming inconsistencies, and misalignments of the PROTAC molecule linker. These issues reflect the inherent brittleness of constraint-based modeling when faced with non-canonical anchor geometries or complex topologies.

Third, the DockQ scoring metric, while valuable for quantifying structural alignment, was biased in cases where large scaffold proteins dominated the predicted interface. In such instances, scores often overstated model quality without accurately reflecting degrader-relevant interactions. Although our analysis used a scaffold-stripped evaluation (AF3 Core Complex) to mitigate this issue, the metric generally is agnostic to the functional relevance of the interface unless customized filtering is applied.

Fourth, the benchmarking framework primarily relied on static crystallographic structures as reference points. However, MD simulations revealed that some models—particularly from PRosettaC—aligned well with specific frames in the dynamic ensemble, despite scoring poorly against the static crystal pose. These results suggested that conventional evaluations may underestimate the biological plausibility of certain conformations.

A fifth limitation of this benchmark concerned the composition of the dataset. Of the 36 successfully modeled systems, 22 involved the VHL E3 ligase and 16 featured bromodomain targets such as BRD4, SMARCA2, and SMARCA4. This overrepresentation reflects the structural biases of the Protein Data Bank, where VHL- and bromodomain-based ternary complexes are disproportionately available due to their early success in crystallographic studies and degrader development. While biologically and clinically relevant, this skew may limit the broader applicability of our conclusions. Future benchmarks would benefit from expanded coverage of E3 ligases such as CRBN, DCAF1, or RNF114 and targets beyond bromodomains. Recently solved structures, such as the DCAF1–WDR5 complex, offer valuable opportunities to diversify comparative modeling efforts and better capture the range of degrader-relevant geometries49.

Finally, this study was limited to comparisons against available crystallographic reference structures and did not incorporate experimental validation of model function. While DockQ scores and MD-based ensemble comparisons provided valuable structural insights, they cannot fully capture the biological relevance of predicted complexes. Indeed, some predicted poses may differ from published PDB structures yet still represent functional ternary geometries that support degradation. Future work should incorporate functional readouts, such as cellular degradation assays, to test whether PROTAC molecules designed from specific modeled poses induce target protein degradation. Practical validation would require synthesizing series of PROTAC analogues guided by the predicted geometries and then correlating their degradation efficiencies with model-derived scores. Linking model accuracy to measured cellular activity in this way would strengthen benchmarking efforts and sharpen the practical utility of computational predictions in degrader design.

In conclusion, our results confirmed the need for degrader-specific evaluation strategies. AlphaFold-3 provided speed and flexibility but often arranged the ligase and target in orientations that could not accommodate a linker. PRosettaC yielded accurate, interface-relevant poses when its geometric constraints were met, yet it failed when those constraints could not be satisfied. DockQ, although convenient, can be misleading when scaffold contacts inflate the score or when dynamic flexibility is ignored. Future benchmarking pipelines should therefore add several refinements: (i) ensemble-based similarity scoring that averages over MD frames; (ii) interface-partitioned metrics focused on degrader-relevant contacts; (iii) evaluations of linker strain, burial, and solvent exposure; and (iv) torsion-aware sampling to capture non-canonical geometries. Blending global fold predictors with degrader-specific docking constraints and validating resulting PROTAC series through functional degradation assays offers a promising path forward. As experimental data remain the ultimate arbiter, computational tools must evolve to capture the full complexity of the degradome landscape.