Geometric deep learning of protein–DNA binding specificity

Mitra, Raktim; Li, Jinsen; Sagendorf, Jared M.; Jiang, Yibei; Cohen, Ari S.; Chiu, Tsu-Pei; Glasscock, Cameron J.; Rohs, Remo

doi:10.1038/s41592-024-02372-w

Article
Open access
Published: 05 August 2024

Nature Methods volume 21, pages 1674–1683 (2024)

Abstract

Predicting protein–DNA binding specificity is a challenging yet essential task for understanding gene regulation. Protein–DNA complexes usually exhibit binding to a selected DNA target site, whereas a protein binds, with varying degrees of binding specificity, to a wide range of DNA sequences. This information is not directly accessible in a single structure. Here, to access this information, we present Deep Predictor of Binding Specificity (DeepPBS), a geometric deep-learning model designed to predict binding specificity from protein–DNA structure. DeepPBS can be applied to experimental or predicted structures. Interpretable protein heavy atom importance scores for interface residues can be extracted. When aggregated at the protein residue level, these scores are validated through mutagenesis experiments. Applied to designed proteins targeting specific DNA sequences, DeepPBS was demonstrated to predict experimentally measured binding specificity. DeepPBS offers a foundation for machine-aided studies that advance our understanding of molecular interactions and guide experimental designs and synthetic biology.

Main

Transcription factors play critical roles in various regulatory functions that are essential to all aspects of life¹. Therefore, understanding the mechanisms by which proteins target specific DNA sequences is crucial². Extensive research has uncovered myriad binding mechanisms that lead to specific high-affinity binding, including strong electrostatic interaction of arginine residues in the DNA minor groove³, deoxyribose sugar-phenylalanine stacking⁴, bidentate hydrogen bonds (H-bonds) between guanine (G) and arginine (Arg) in the major groove⁵, and other interactions^{6,7,8}.

Protein–DNA structures are typically⁹ obtained through X-ray crystallography, nuclear magnetic resonance spectroscopy or cryo-electron microscopy experiments and stored in the Protein Data Bank (PDB)¹⁰. Generally, these structures display one bound DNA sequence and the associated physicochemical interactions⁶ but do not encompass the full range of potentially bound DNA sequences. Conversely, this information can be experimentally obtained through protein-binding microarray¹¹, systematic evolution of ligands by exponential enrichment combined with high-throughput sequencing (SELEX–seq)¹², chromatin immunoprecipitation followed by sequencing¹³, high-throughput SELEX¹⁴ or related high-throughput approaches¹⁵. These experiments capture the range of possible bound DNA sequences but do not necessarily provide structural information. In essence, these sets of experiments are complementary, and manual examination is often required to correlate molecular interaction details from structural data with binding specificity data⁶.

Predicting binding specificity for a given protein sequence, across protein families, remains a challenging and unsolved problem, despite progress for specific protein families^{16,17,18,19,20,21,22,23}. Structural changes in the context of binding, along with large mechanistic diversity, contribute to the difficulty^{15,24}. Protein–DNA structures contain valuable information that artificial intelligence can leverage to achieve generalizability across protein families. In this framework, we introduce Deep Predictor of Binding Specificity (DeepPBS). This deep-learning model is designed to capture the physicochemical and geometric contexts of protein–DNA interactions to predict binding specificity, represented as a position weight matrix (PWM)²⁵ based on a given protein–DNA structure (Fig. 1a). DeepPBS functions across protein families (Fig. 2) and acts as a bridge between structure-determining and binding specificity-determining experiments.

**Fig. 1: Schematic illustration of the DeepPBS framework.**

**Fig. 2: Performance of DeepPBS for predicting binding specificity across protein families for experimentally determined structures.**

Input to DeepPBS is not limited to experimental structures (Fig. 1a). The rapid advancement of protein structure prediction methods, including AlphaFold²⁶, OpenFold²⁷ and RoseTTAFold²⁸, along with protein–DNA complex modelers, such as RoseTTAFoldNA (RFNA)²⁹, RoseTTAFold All-Atom³⁰, MELD-DNA³¹ and AlphaFold3 (ref. ³²), has led to an exponential increase in the availability of structural data for analysis. This scenario highlights the growing need for a generalized computational model to analyze protein–DNA structures. We demonstrate how DeepPBS can work in conjunction with structure prediction methods for predicting specificity for proteins without available experimental structures (Fig. 3a–d). In addition, the design of a protein–DNA complex can be improved by optimizing bound DNA using DeepPBS feedback (Fig. 3e–g). We show that this pipeline is competitive with the recent family-specific model rCLAMPS¹⁷ (Fig. 3h,i) while being more generalizable: specifically, DeepPBS is protein family-agnostic, can handle biological assemblies and can predict DNA flanking preferences.

**Fig. 3: Application of DeepPBS on predicted protein–DNA complex structures.**

In terms of interpretability, ‘relative importance’ (RI) scores for different heavy atoms in proteins that are involved in interactions with DNA can be extracted from DeepPBS (Fig. 4). As a case study on an important protein for cancer development, we analyze the p53–DNA interface via these RI scores and relate them with existing literature for validation. Additionally, we show that the DeepPBS scores align well with existing knowledge and can be aggregated to produce reasonable agreement with alanine scanning mutagenesis experiments³³ (Fig. 4h).

**Fig. 4: Visualization of DeepPBS importance scores in p53–DNA interface as a case study, and experimental validation.**

In additional proof-of-principle studies, we apply DeepPBS to in silico-designed protein–DNA complexes targeting specific DNA sequences (Fig. 5), obtained from a recent study that combines structural design with DNA mutagenesis experiments³⁴. Finally, we show that DeepPBS can also be used to analyze molecular simulation trajectories. We demonstrate an example by applying DeepPBS to a molecular dynamics (MD) simulation of Extradenticle (Exd) and Sex combs reduced (Scr) Hox heterodimer in complex with DNA³⁵ with an AlphaFold-based modeled protein linker (Supplementary Section 10, Supplementary Fig. 6 and Supplementary Video 1). DeepPBS is available as a webserver at https://deeppbs.usc.edu.

**Fig. 5: Application of DeepPBS to in silico-designed HTH scaffolds targeting a specific DNA sequence.**

Results

The DeepPBS framework

The DeepPBS framework is illustrated in Fig. 1. Input to DeepPBS (Fig. 1a) is composed of one protein–DNA complex structure, with one or more protein chains bound to a DNA double helix. Potential sources for such structures include experimental data (for example, PDB¹⁰), molecular simulation snapshots or designed complexes. DeepPBS processes the structure as a bipartite graph with distinct spatial graph representations for protein and DNA components. The protein graph is an atom-based graph, with heavy atoms as vertices. Several features are computed on these vertices (Fig. 1b). Further information on protein representation and feature computation is available in Supplementary Section 4. We represent DNA as a symmetrized helix (sym-helix), as detailed in Methods. This representation removes any sequence identity that the DNA possesses, while preserving the shape of the double helix³. Optionally, DNA sequence information can be reintroduced as a feature on the sym-helix points.

DeepPBS performs a series of spatial graph convolutions on the protein graph to aggregate atomic neighborhood information (Fig. 1c). The next crucial component of DeepPBS consists of a set of bipartite geometric convolutions applied from the protein graph to the sym-helix (Fig. 1d). Specific chemical interactions (for example, hydrogen bonds) depend on both location and orientation⁵. DeepPBS learns how the geometric orientation of the sym-helix points is associated with the orientations and chemistry of neighboring protein residues. Four distinct bipartite convolutions are employed for the sym-helix points, corresponding to the major groove, the minor groove and the phosphate and sugar moieties. Major and minor groove convolutions are referred to as ‘groove readout’. This term was chosen over the term ‘base readout’ due to the removal of base identity in the sym-helix. Phosphate and sugar moiety convolutions, combined with DNA shape information, form the ‘shape readout’ (Fig. 1e). The ‘groove readout’ and ‘shape readout’ factors collaboratively determine binding specificity to varying extents for different protein families. At this point, the sym-helix representation enables a straightforward flattening of aggregated features on the three-dimensional sym-helix to the one-dimensional (1D) base pair-level features. By adding DNA shape information and implementing 1D convolutional neural network and prediction layers (Fig. 1e), DeepPBS ultimately predicts binding specificity (Fig. 1f). Further architectural details are described in Supplementary Section 5.
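To make the bipartite structure concrete, the following minimal sketch builds edges from protein heavy atoms to sym-helix points by a plain distance cutoff. It is illustrative only: the function name `bipartite_edges`, the 5 Å default cutoff and the toy coordinates are assumptions, and the actual edge construction and convolution layers are those described in Supplementary Section 5, not this code.

```python
import math

def bipartite_edges(protein_atoms, helix_points, cutoff=5.0):
    """Return (atom_index, point_index, distance) for every atom/point pair
    whose Euclidean distance is at most `cutoff` (assumed value, in angstroms)."""
    edges = []
    for i, atom in enumerate(protein_atoms):
        for j, point in enumerate(helix_points):
            d = math.dist(atom, point)
            if d <= cutoff:
                edges.append((i, j, d))
    return edges

# Toy coordinates (hypothetical): two protein heavy atoms, three sym-helix points.
atoms = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
points = [(3.0, 0.0, 0.0), (10.0, 1.0, 0.0), (30.0, 0.0, 0.0)]
edges = bipartite_edges(atoms, points)
```

Edges run only between the two vertex sets, never within one, which is what makes the graph bipartite; the third sym-helix point receives no edge because no atom lies within the cutoff.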

Lack of an existing published standard dataset for predicting binding specificity across protein families from protein–DNA complex structure data made it necessary for us to build a dataset for cross-validation and benchmarking. Details of this process can be found in Methods.

DeepPBS performance for experimentally determined structures

The DeepPBS ensemble (Methods) was employed to evaluate model performance against a benchmark set, as outlined in Supplementary Section 1. The DeepPBS architecture allows models to be trained on two mechanisms: ‘groove readout’, which does not involve backbone convolutions and excludes shape information, and ‘shape readout’, which does not involve groove convolutions (Fig. 1d,e). Benchmark performances of DeepPBS (which performs both ‘groove readout’ and ‘shape readout’ modes combined) and these two variations are shown in Fig. 2a. The ‘groove readout’ version does better than the ‘shape readout’ version in terms of median performance, while the DeepPBS model improves upon either component in isolation (two-sided t-test P value <0.01; Fig. 2a). Pairwise t-test P values for these variations are available (Supplementary Data 1). A discussion of the outliers in Fig. 2a is provided in Supplementary Section 12.

The dataset was constructed using experimentally determined structures; thus, the co-crystal structure-derived DNA sequence typically serves as a reasonable example of a bound sequence. As expected, integrating sequence information into the sym-helix points (‘DeepPBS with DNA SeqInfo’) enhanced performance (Fig. 2a), significantly closing the gap toward the inherent performance limit in the dataset. The inherent performance limit originates from the fact that for the same protein the binding specificity data presented by two databases^{36,37} used to create the dataset may disagree to some extent (Supplementary Fig. 1c). We computed the distribution of disagreement across all unique PWMs appearing in both databases (Supplementary Section 1). However, from both interpretability and design perspectives, particularly when the bound DNA sequence may not be representative, the ‘DeepPBS’ model is optimal due to its low sensitivity to the DNA sequence in the structure. This fact is evidenced by comparing performances of the ‘DeepPBS’ and ‘DeepPBS with DNA SeqInfo’ models in the context of the PWM–co-crystal-derived DNA alignment score (Supplementary Section 1). Compared with the line fit to the variation with DNA sequence information (slope −0.44 for root mean squared error (RMSE), slope −0.62 for mean absolute error (MAE); Supplementary Fig. 11), the slope of the line fit to the DeepPBS predictions was closer to zero (Fig. 2b and Supplementary Fig. 11).

As an example, we show the DeepPBS ensemble prediction for the NF-κB biological assembly from the benchmark dataset. Although the co-crystal structure-derived DNA sequence was not of the highest binding affinity, as indicated by experimental data from HOCOMOCO³⁷, our prediction circumvented this issue, predicting a binding specificity that was more closely aligned with the experimental data (Supplementary Fig. 5d). Similar trends (Supplementary Fig. 5a–c) can be observed from cross-validation predictions by individual DeepPBS models (Methods). We also included example DeepPBS ensemble predictions (Supplementary Fig. 7) for structures in the PDB that correspond to specific interactions but do not have a PWM in the two binding specificity databases considered (Methods). In addition, example DeepPBS ensemble predictions (Supplementary Fig. 8) for structures of nonspecific protein–DNA binding (for example, SSO7D–DNA interaction³⁸) present in the PDB are presented. These predictions have notably lower information content compared with those in Supplementary Fig. 7.
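For readers unfamiliar with the term, the information content of a PWM column can be sketched as follows. This assumes the standard definition with a uniform background (2 bits minus the Shannon entropy of the column); whether the comparison above uses exactly this variant is an assumption, and the function names are hypothetical.

```python
import math

def column_information(column):
    """Information content (bits) of one PWM column of A, C, G, T probabilities,
    assuming a uniform background: 2 - Shannon entropy."""
    entropy = -sum(p * math.log2(p) for p in column if p > 0)
    return 2.0 - entropy

def pwm_information(pwm):
    """Per-column information content for a PWM given as a list of columns."""
    return [column_information(col) for col in pwm]

# A fully specific column, a two-base column and a nonspecific (uniform) column.
example_pwm = [[1.0, 0.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0], [0.25, 0.25, 0.25, 0.25]]
info = pwm_information(example_pwm)
```

A prediction for a nonspecific binder would be expected to resemble the third column at most positions, giving values near zero.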

DeepPBS captures patterns of family-specific binding modes

Abundances of different protein families in the benchmark set are described in Fig. 2c (Supplementary Fig. 5b for cross-validation set). Family annotations were obtained from the Database of Protein Families (PFAM)³⁹. The dataset encompasses a wide range of DNA-binding protein families. Performance of DeepPBS for various protein families provides several key insights. DeepPBS showed reasonable generalizability across protein families, performing well even for families with relatively fewer structures (Fig. 2d and Supplementary Fig. 5c), such as heat shock factor proteins. This observation suggests that the model is learning the underlying mechanisms of protein–DNA binding rather than overfitting on family-specific patterns.

Further validation is provided by comparing performances of the DeepPBS ‘groove readout’ and ‘shape readout’ models (Fig. 2d and Supplementary Fig. 5c). For families such as zf-C2H2 and zf-C4, the ‘shape readout’ model did not perform as well as the ‘groove readout’ model. This result aligns with the common understanding of the binding mechanism of these families. For example, zf-C2H2 uses zinc finger motifs to scan DNA for suitable base interactions, with minimal DNA bending or conformational change⁴⁰. This binding mode makes the zf-C2H2 family a popular target of protein sequence-based binding specificity prediction and design^{16,18,19,23,41}. Conversely, families like interferon-regulatory factor (IRF) proteins (Fig. 2d and Supplementary Fig. 5c) and T-box proteins (Supplementary Fig. 5c) showed higher performances for the ‘shape readout’ model, consistent with their known binding mechanisms that involve significant conformational changes^{4,42}. For families such as homeodomain (HD) and forkhead (Fig. 2d and Supplementary Fig. 5c), the DeepPBS model outperformed both the ‘groove readout’ and ‘shape readout’ components. This result suggests that the network captures complex higher-order relationships of these components. Pairwise P values for the three readout variations for Fig. 2d and Supplementary Fig. 5c are available in Supplementary Data 1.

Application to in silico-predicted protein–DNA complexes

The DeepPBS framework is not limited to experimental structures. Recent advances in scalable structural prediction approaches, driven by artificial intelligence^{26,28}, offer unprecedented potential. Specifically, models like RFNA²⁹ and MELD-DNA³¹ can be used to predict the structures of protein–DNA complexes from sequence. Such prediction algorithms have paved the way for DeepPBS to be applicable to proteins that lack experimental DNA-bound structure data.

We suggest one potential approach for working with predicted structures in DeepPBS. First, we make an initial guess for the DNA (IG DNA) sequence bound to each protein of interest based on the corresponding protein family. Then, we use RFNA to predict the protein–DNA complex structure, followed by DeepPBS to predict binding specificity. We demonstrate this process (Fig. 3a–c) for three proteins classified as basic helix-loop-helix (bHLH) in JASPAR³⁶. In all three cases, the PDB lacked experimental protein–DNA complex structures. The IG DNA (Supplementary Section 8) has an enhancer box motif (‘CACGTG’) in the center, which is known⁴³ to be a bHLH family target. The first example (UniProt Q4H376; Fig. 3a) is a Max homodimer, for which DeepPBS predicted a specificity closely mirroring that of the IG DNA. The second example (TCF21 dimer, O43680) was more complicated; the central ‘CACGTG’ motif in the IG DNA was erroneously assumed, yet DeepPBS successfully predicted the correct motif as ‘CATATG’ (Fig. 3b). The third example (Fig. 3c, protein OJ1581_H09.2, Q6H878) does not conform to any enhancer box motif. Nevertheless, DeepPBS predicted a binding specificity closely mirroring the experimental data (Fig. 3c).

We ran the DeepPBS pipeline for full-length UniProt protein sequences, each with a unique JASPAR entry and no experimental structure for the complex, across three different families (Supplementary Section 8): bZIP, bHLH and HD families. DeepPBS predictions based on RFNA-predicted structures exhibited an improved MAE (that is, closer to experimental data) compared with the IG DNA baseline (Fig. 3d). An application of DeepPBS to a MELD-DNA-predicted complex of the mouse CREB1 protein is demonstrated in Supplementary Fig. 9b. Thus, DeepPBS can take predicted structures from suboptimal DNA sequences and predict binding specificity close to experimental data.
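The comparison with the IG DNA baseline can be illustrated with a small sketch in which the IG DNA is treated as a one-hot PWM and both it and a predicted PWM are scored against an experimental PWM by MAE. The averaging over all columns and bases, the helper names and the toy matrices are assumptions for illustration; the metric definitions actually used are given in Supplementary Section 2.

```python
BASES = "ACGT"

def one_hot(sequence):
    """Encode a DNA sequence as a PWM whose columns are one-hot over A, C, G, T."""
    return [[1.0 if b == base else 0.0 for base in BASES] for b in sequence]

def pwm_mae(predicted, target):
    """Mean absolute error over all entries of two aligned PWMs of equal length."""
    if len(predicted) != len(target):
        raise ValueError("PWMs must be aligned to the same length")
    total = sum(abs(p - t)
                for pred_col, tgt_col in zip(predicted, target)
                for p, t in zip(pred_col, tgt_col))
    return total / (4 * len(predicted))

# Toy example (made-up values): two-column experimental PWM.
experimental = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
prediction = [[0.6, 0.4, 0.0, 0.0], [0.0, 0.0, 0.9, 0.1]]
baseline_mae = pwm_mae(one_hot("AG"), experimental)
prediction_mae = pwm_mae(prediction, experimental)
```

In this toy case the one-hot baseline is penalized at the first column, where the experimental PWM tolerates two bases, while the softer prediction stays closer to the target.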

We next explored whether DeepPBS prediction could be used as feedback (in a loop) to enhance modeling of the protein complex (and, subsequently, improve DeepPBS prediction). We demonstrated this process for the human TGIF2LY protein (UniProt ID Q8IUE0, unstructured region trimmed; Supplementary Section 8) in Fig. 3e. In round 1, we applied RFNA to this protein sequence alongside the IG DNA sequence for the HD family and then used the predicted complexes as input for DeepPBS. For IG DNA position T15 (Fig. 3e, round 1), DeepPBS predicted a strong preference for G. In the round 1 RFNA output, Arg57 and T15 were involved in one hydrogen bond (H-bond) and one van der Waals interaction. These interactions are theoretically weaker than the possible bidentate H-bonds between a G and Arg57. In round 2, we altered the RFNA input by taking the argmax (the most preferred sequence) from the DeepPBS output (Fig. 3e, round 2). The subsequently folded structure reflected a more robust bidentate H-bond interaction between G15 and Arg57, with the DeepPBS prediction more closely aligning with the experimental data (note positions (round 2) A18, G19 and T14, corresponding to positions 4–6 in MA1572.1; Fig. 3e).
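The feedback loop described above can be sketched as follows. The callables `fold` and `predict` are hypothetical placeholders for a structure predictor (such as RFNA) and for DeepPBS; neither corresponds to a real API, and the sketch assumes the predicted PWM covers every position of the input DNA.

```python
BASES = "ACGT"

def argmax_sequence(pwm):
    """Most preferred base at each PWM column, joined into a DNA sequence."""
    return "".join(BASES[max(range(4), key=lambda k: col[k])] for col in pwm)

def feedback_rounds(protein_seq, dna_seq, fold, predict, n_rounds=2):
    """Alternate folding and specificity prediction, feeding the argmax DNA
    sequence of each prediction into the next round. `fold` and `predict`
    are user-supplied stand-ins (hypothetical), not real library calls."""
    history = []
    for _ in range(n_rounds):
        structure = fold(protein_seq, dna_seq)
        pwm = predict(structure)
        history.append((dna_seq, pwm))
        dna_seq = argmax_sequence(pwm)
    return history

# Stub models for illustration: the 'structure' is just the DNA string, and the
# predictor always returns the same two-column PWM preferring G then A.
toy_pwm = [[0.1, 0.1, 0.7, 0.1], [0.7, 0.1, 0.1, 0.1]]
history = feedback_rounds("PROTEIN", "TA", fold=lambda p, d: d, predict=lambda s: toy_pwm)
```

Here the second round is folded with "GA" rather than the initial "TA", mirroring how a predicted preference for G at a position replaces the initial guess.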

We repeated this DeepPBS prediction process for a total of five rounds, for the set of HD monomer sequences (Supplementary Section 8). The RFNA-predicted confidence metric, the predicted local distance difference test (pLDDT) score, improved over these rounds (Fig. 3f); LDDT⁴⁴ reflects the similarity between the predicted and reference structures for a complex (Supplementary Section 8). To independently evaluate structure quality, we calculated the molecular mechanics and Poisson–Boltzmann surface area⁴⁵ binding energy (Supplementary Section 8). From round 1 to round 3+, the number of stable structures (binding energy <0 kJ mol⁻¹) increased (Supplementary Fig. 9c), while their binding energy distributions shifted toward lower values (Supplementary Fig. 9c). DeepPBS performance improved across the five rounds (Supplementary Fig. 9a). We also refolded the benchmark set data points via RFNA (Supplementary Section 8) and compared the resulting performances, for the full processable set (n = 98) and for a high-confidence set (pLDDT >0.9, n = 31), with the equivalent performance obtained for the experimental structures (Fig. 3g). Performance drops for the refolded structures; we expect this gap to narrow as future structure prediction models become available.

The DeepPBS approach for predicting binding specificity fundamentally differs from that of existing methods, which predict binding specificity solely on the basis of protein sequence information. As a result, direct comparisons with existing family-specific methods that operate exclusively on protein sequence are unfeasible. However, in conjunction with a complex structure prediction method, we can start from protein sequence information alone and predict binding specificity using DeepPBS. This process can be compared with the recent HD family-specific method, rCLAMPS¹⁷ (Supplementary Section 8). rCLAMPS can predict core 6-mer binding specificities for monomer HD proteins. A comprehensive overview of performances is shown in Fig. 3h. Each method outperformed the other on a substantial portion of the data. DeepPBS outperformed rCLAMPS where the pLDDT scores were higher (Fig. 3i). Thus, the DeepPBS pipeline is comparable to rCLAMPS, while having broader applicability across families and biological assemblies as well as not being limited to predicting the DNA core binding region.

Assessing protein residue importance at p53–DNA interface

The DeepPBS architecture permits intentional activation or deactivation of specific edges in the bipartite geometric convolution stage (Fig. 1d and Supplementary Fig. 4). Perturbing a set of edges in this manner will alter the network-predicted result. The mean absolute difference between the original and altered prediction can be used (with proper normalization) as a quantification of the impact of the perturbed set of edges in determining binding specificity (Fig. 1g, Supplementary Fig. 4 and Methods).
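The perturbation idea can be sketched with a toy stand-in for the network. In this sketch, `toy_predict` is a hypothetical function, not DeepPBS: it maps a set of edges to a one-column PWM. The importance of an edge subset is the mean absolute difference between the prediction with all edges and the prediction with that subset deactivated; the normalization mentioned above is omitted here.

```python
def perturbation_importance(predict, edges, edge_subset):
    """Mean absolute difference between the PWM predicted from all edges and
    the PWM predicted with `edge_subset` removed (no normalization applied)."""
    kept = [e for e in edges if e not in edge_subset]
    full = predict(edges)
    altered = predict(kept)
    diffs = [abs(a - b)
             for full_col, alt_col in zip(full, altered)
             for a, b in zip(full_col, alt_col)]
    return sum(diffs) / len(diffs)

def toy_predict(edges):
    """Hypothetical stand-in model: each edge (atom, base_index, weight) adds
    weight to one base of a single PWM column, which is then normalized."""
    scores = [1.0, 1.0, 1.0, 1.0]
    for _atom, base_index, weight in edges:
        scores[base_index] += weight
    total = sum(scores)
    return [[s / total for s in scores]]

all_edges = [("NH1", 2, 4.0), ("CB", 0, 0.0)]
nh1_importance = perturbation_importance(toy_predict, all_edges, [all_edges[0]])
cb_importance = perturbation_importance(toy_predict, all_edges, [all_edges[1]])
```

In the toy model, removing the edge from the hypothetical NH1 atom flattens the predicted preference for G, so it scores higher than the edge from CB, whose removal changes nothing.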

We present results for perturbing edge sets for individual protein heavy atoms, which can also be aggregated to compute residue-level importance. As an example, we examined the protein–DNA interface of p53 (PDB ID: 3Q05), a protein crucial for regulating cancer development and cell apoptosis⁴⁶. The tumor suppressor p53 binds to DNA as a tetramer with two symmetric protein–DNA interfaces^{47,48}. We show the RI scores (with min–max normalization applied) calculated for heavy atoms within 5 Å of the sym-helix (Fig. 4a). Sphere sizes in Fig. 4a denote computed RI scores, with the largest being 1 and smallest 0. Lys120 (ref. ⁴⁹) is involved in both groove readout (H-bond with G) and shape readout-based binding specificity (H-bond with backbone phosphate) (Fig. 4b). The network deems G-Arg280 (ref. ⁴⁹) bidentate H-bonds as another strong driver of binding specificity⁵ (Fig. 4c). Cys277 confers specificity through its thiol sulfur, accepting an H-bond in the major groove⁴⁹ (Fig. 4d). Another important residue according to DeepPBS, Arg248 (ref. ⁵⁰), is present at the minor groove (Fig. 4e). This decision by the model is primarily based on the orientation of arginine relative to the sym-helix, which is devoid of DNA sequence information. Arg248 is attracted through enhanced negative electrostatic potential due to a narrowing of the minor groove where it binds⁴⁷. Among other residues in Fig. 4f, Ser241 is known⁵⁰ to be important for stabilizing Arg248. Ala276 (known for causing apoptosis upon mutation⁵¹) appears as another driver of specificity. This residue has been shown to be a driver of specificity via van der Waals contacts with the methyl group of T in the major groove⁴⁹. The binding specificity prediction of DeepPBS (Fig. 4g) aligns well with known binding patterns of p53, which follows the form RRRC(A/T)(A/T)GYYY (R denotes purine, and Y denotes pyrimidine). The interactions shown here are regarded^{46,52} as significant drivers of p53 binding.
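The min–max normalization applied to the RI scores can be sketched in a few lines. The atom labels and raw values below are made up for illustration, and the handling of the degenerate case where all scores are equal is an assumption.

```python
def min_max_normalize(scores):
    """Rescale a {label: score} mapping so the smallest score becomes 0 and
    the largest becomes 1. If all scores are equal, return zeros (assumed)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {label: 0.0 for label in scores}
    return {label: (value - lo) / (hi - lo) for label, value in scores.items()}

# Made-up raw importance values for three interface atoms.
raw = {"Arg280:NH1": 4.0, "Lys120:NZ": 3.0, "Ala276:CB": 2.0}
ri = min_max_normalize(raw)
```

After rescaling, scores are comparable within one interface (largest 1, smallest 0) but not necessarily across different structures.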

Comparison of residue-level importance with mutagenesis data

We next asked whether DeepPBS-derived importance scores, which reflect the degree to which an interaction determines output binding specificity, can be considered reliable and potentially physically significant. Although high-affinity interactions can be nonspecific^{38,53}, interactions that contribute to high specificity would be expected to maximize binding affinity across different base pair possibilities. Therefore, the DeepPBS importance scores associated with these interactions should display some correlation with the corresponding binding affinities. We can test this hypothesis experimentally by using alanine scanning mutagenesis data (Supplementary Section 1). Sets of such experimental data have been made available through recent contributions⁵⁴ in the field. Utilizing these data⁴², we applied suitable filtering for our context and calculated the log sum aggregated residue level importance scores using DeepPBS (Methods).

A regression plot and Pearson’s correlation coefficient (PCC), as shown in Fig. 4h, illustrate the correspondence between computed values and experimental ΔΔG values for a diverse array of proteins and residues within the protein–DNA interface (Supplementary Table 1). The obtained PCC of 0.60 corroborates our hypothesis. It is noteworthy that the model was not trained to predict these values. These values were only obtained through perturbing the wild-type (WT) structures as input (Supplementary Fig. 4 and Supplementary Table 1). These results highlight the potential of DeepPBS as an economical guide for experimentalists who are selecting alanine scanning mutagenesis experiments to conduct at the protein–DNA interface.
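The residue-level comparison can be sketched as follows. The reading of 'log sum aggregation' as the logarithm of the summed atom-level scores of each residue is an assumption (the exact procedure is in Methods), and all names and numbers are illustrative rather than taken from the study.

```python
import math

def residue_scores(atom_scores):
    """Aggregate {(residue, atom): score} to {residue: log(sum of atom scores)}.
    The log-of-sum form is an assumed reading of 'log sum aggregation'."""
    totals = {}
    for (residue, _atom), score in atom_scores.items():
        totals[residue] = totals.get(residue, 0.0) + score
    return {res: math.log(total) for res, total in totals.items() if total > 0}

def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Made-up atom-level scores for one residue, and toy paired value lists.
aggregated = residue_scores({("Arg280", "NH1"): 0.5, ("Arg280", "NH2"): 0.5})
perfect = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
inverse = pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

In practice the two lists passed to `pearson` would be the aggregated scores and the experimental ΔΔG values for the same residues, in the same order.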

Application to designed scaffolds targeting specific DNA

Recent work³⁴ made significant progress in designing structural models of fully synthetic helix-turn-helix (HTH) protein scaffolds targeting specific DNA sequences. We applied DeepPBS to four synthetically designed proteins, named DBP5, DBP6, DBP9 and DBP35, that target a specific DNA sequence (GCAGATCTGCACATC) (Fig. 5a,e,i,m, respectively). The predicted PWMs are shown (Fig. 5b,f,j,n) and the heavy atom-level RI scores are visualized for the interfaces (Fig. 5c,g,k,o). We explored qualitative agreement of these predictions with experimental results obtained from the study (Fig. 5d,h,l,p; relative binding signal of all possible single base-pair mutations obtained via flow cytometry analysis³⁴ in yeast display competition assays). With a few exceptions, DeepPBS correctly predicted the columns of high specificity (where the mutants show less binding, indicated by darker red). Some of the alternate base preference predictions by DeepPBS appear to agree with the experimental data. For example, for DBP35 position 11, DeepPBS predicts an alternate specific binding possibility to C along with the WT base A, and similarly for DBP35 position 9 and DBP5 position 7. The flanking predictions are also informative, because they test the ability of DeepPBS to produce sensible predictions for unbound DNA regions. For DBP9 and DBP6, the flanking predictions look remarkably uniform, which is consistent with the designed structure having mostly unbound canonical B-DNA structure. This baseline behavior is intuitive and nontrivial in this problem setting (given that there is a DNA sequence present in the design and the model has to avoid overfitting to it). On the other hand, for DBP5 and DBP35, the flanks have a non-canonical shape with a narrow minor groove interaction with a loop region of the protein (obtained from PDB ID 1L3L). The DeepPBS prediction of a mostly A-tract preference (positions 3–8) is consistent with the narrow minor groove preferred by such sequences⁵⁵. DNA shape prediction⁵⁶ for the top base prediction of these columns (AAATTT) is consistent with the shape visualized in the design (Supplementary Fig. 12), showing a significant dip in minor groove width. These examples illustrate the potential for DeepPBS as a computational guide to performing expensive and laborious wet lab experiments.

Discussion

Computationally identifying which DNA sequences a given protein will bind to remains a challenging question. Although proteins from certain DNA-binding families, such as homeodomain^{17,22,57,58} and C₂H₂ zinc finger proteins^{16,17,18,20,40,59}, have been studied extensively in this regard, a generalized model of binding specificity remains elusive. This complexity emanates, in part, from the pivotal role that the protein and DNA conformation or shape play in the context of binding specificity. For example, TBX5 undergoes an α- to 3₁₀-helix conformational change when interacting with DNA. Despite the energy penalty, this transformation, in conjunction with an appropriately matching DNA shape, promotes strong phenylalanine-sugar ring stacking, thereby facilitating binding⁴. Another example is the Trp repressor protein, which exhibits an almost entirely geometry-driven binding specificity. This protein only forms direct and water-mediated H-bonds with the backbone phosphates⁶⁰, and the DNA shape required for optimal binding gives rise to sequence specificity. Capturing such interactions and how they lead to binding specificity with protein information alone is complicated and cannot be understood in a sequence space alone^{24,61}. Furthermore, for many protein families, the protein monomer is insufficient⁴⁹ for binding; a biological assembly, potentially with other interaction partners⁶², is often necessary.

DeepPBS achieves generality across protein families, with the tradeoff of requiring a docked sym-helix, and thus represents a significant step toward solving the larger unsolved problem. As demonstrated in this work, coupling DeepPBS with methods that model protein–DNA complexes advances the prediction of binding specificity across families based solely on protein information.

DeepPBS allows exploration of exciting future possibilities, including the creation of DNA-targeted protein designs that could potentially contribute to therapeutic advancements. DeepPBS could serve as a preliminary screening tool for devised candidate complexes, ensuring their specificity to the intended target DNA sequence before any costly experimental validations. Moreover, recent studies have shown that transcription factor–DNA binding can energetically favor mismatched base pairs⁶³. Given the combinatorial complexity of possible hypotheses, deciding which DNA mismatch experiments to perform to discover more such instances poses a significant challenge. Although there is currently a lack of training data for base-pair mismatches, the DeepPBS architecture, in theory, could facilitate the prediction of mismatched base-pair binding specificity. This approach could assist in deciding which experiments to conduct.

In summary, we have introduced a computational framework that distills the intricate structural nuances of protein–DNA binding and bridges this understanding with binding specificity data, effectively connecting structure-determining and specificity-determining experiments. The DeepPBS architecture allows inspection of family-specific ‘groove readout’ and ‘shape readout’ patterns and their effects on binding specificity. Although structure prediction methods like RFNA²⁹, MELD-DNA³¹ and AlphaFold3 (ref. ³²) can predict a complex from given protein and DNA sequences, they cannot provide insights into binding specificity. The development of these computational methods for structure prediction increases the need for an approach like DeepPBS to derive protein–DNA binding specificity. DeepPBS operates on predicted complexes to yield the binding specificity of the system, thereby guiding the further improvement of modeling techniques for protein–DNA complexes. DeepPBS, despite its generality, exhibits performance comparable to the recently described family-specific method rCLAMPS¹⁷. In addition to modeled complexes for biologically existing systems, DeepPBS is also applicable to in silico synthetically designed proteins that target specific DNA sequences.

DeepPBS-derived RI scores are biologically relevant. When aggregated at the protein residue level, they align with data from alanine scanning mutagenesis experiments. Another advantage of DeepPBS is its speed in predicting binding specificity. Specifically, DeepPBS requires only a single forward pass through the model (with no database search or multiple sequence alignment computation), making it suitable for high-throughput applications such as analyzing MD simulation trajectories (Supplementary Section 10 and Supplementary Fig. 6). In this context, DeepPBS is robust to small dynamical fluctuations and can respond to conformational changes (Supplementary Video 1).

The current version of DeepPBS has inherent limitations. It is tailored for double-stranded DNA and is not yet applicable to single-stranded DNA, RNA or chemically modified bases. However, the model could potentially be extended to accommodate these scenarios, other polymer–polymer interactions and possibly mechanistic mutations. Further limitations, including those arising from the available data, are discussed in Supplementary Section 12. The DeepPBS architecture can be refined and expanded in terms of applications and engineering enhancements. Collectively, these possibilities hint at an exciting future for molecular interaction studies and computationally driven synthetic biology.

Methods

Data sources

The dataset used for training was assembled by integrating protein–DNA structures from the PDB and their corresponding PWMs from JASPAR (2022)³⁶ and HOCOMOCO (V11)³⁷. These two databases were selected for their accessibility, comprehensive collection and nonredundancy. The detailed description can be found in Supplementary Section 1 and Supplementary Fig. 1.

Cross-validation regimen

A fivefold cross-validation set was constructed with 523 data points as described in Supplementary Section 1. Each data point corresponds to a biological assembly containing a protein chain with a corresponding PWM sampled from either JASPAR or HOCOMOCO. To establish a correspondence for loss and metric calculation, the PWM is aligned to the DNA in the structure using an ungapped local alignment process (Supplementary Section 2, ‘Performance Metrics’). For each fold, cross-validation predictions were made by a model trained on the remaining four folds (Supplementary Fig. 5a–c); the same procedure was used for the other model variations shown in Supplementary Fig. 5a. Full details of training can be found in Supplementary Section 6.
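The alignment step can be pictured with a minimal sketch. The actual procedure is described in Supplementary Section 2 and is not reproduced here; the scoring rule (sum of the PWM probabilities of the observed bases over the overlap), the `min_overlap` parameter, the forward-strand-only search and the function name `best_ungapped_offset` are all assumptions made for illustration.

```python
from typing import List, Tuple

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed base order A, C, G, T


def best_ungapped_offset(
    pwm: List[List[float]], seq: str, min_overlap: int = 3
) -> Tuple[int, float]:
    """Slide the PWM along the sequence without gaps.

    Returns (offset, score), where offset is the sequence index aligned to
    PWM position 0 (negative if the PWM overhangs the start of the sequence).
    The score is a hypothetical choice: the summed probability of the
    observed bases over the overlapping positions.
    """
    n, m = len(pwm), len(seq)
    best = None
    for offset in range(-(n - min_overlap), m - min_overlap + 1):
        score, overlap = 0.0, 0
        for i in range(n):
            j = offset + i
            if 0 <= j < m:
                score += pwm[i][BASE_INDEX[seq[j]]]
                overlap += 1
        if overlap >= min_overlap and (best is None or score > best[1]):
            best = (offset, score)
    if best is None:
        raise ValueError("PWM and sequence are too short for the requested overlap")
    return best


# Toy PWM preferring A, then G, then T, aligned to a toy sequence.
toy_pwm = [
    [0.90, 0.03, 0.03, 0.04],
    [0.05, 0.05, 0.85, 0.05],
    [0.10, 0.10, 0.10, 0.70],
]
toy_offset, toy_score = best_ungapped_offset(toy_pwm, "CCAGTCC")
```

In this toy case the best placement puts PWM position 0 on the `A` of `AGT`. A fuller treatment would likely also consider the reverse-complement strand, which this sketch omits.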

Benchmark regimen

Data points not included in the cross-validation folds were resampled to create a separate benchmark dataset (biological assemblies corresponding to 130 protein chains). This sampling followed the same quality criterion described in Supplementary Section 1, with up to five members sampled per cluster. Ensemble average predictions of models trained on cross-validation folds are reported for this set in Fig. 2a–d. Combined preprocessing and inference time for one biological assembly is on the order of seconds (for example, for PDB ID 5X6G, about 15–20 s). The DeepPBS ensemble described here was used for all applications to predicted structures.
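As an illustration of ensemble averaging, the sketch below takes the element-wise arithmetic mean of the PWMs predicted by several models. That the ensemble average is a plain arithmetic mean is an assumption, as is the function name `ensemble_average`; the training details are in Supplementary Section 6.

```python
from typing import List

PWM = List[List[float]]  # N positions x 4 bases (A, C, G, T)


def ensemble_average(predictions: List[PWM]) -> PWM:
    """Element-wise mean of PWMs predicted by different models (assumed scheme)."""
    if not predictions:
        raise ValueError("need at least one prediction")
    n_models = len(predictions)
    n_pos = len(predictions[0])
    if any(len(p) != n_pos for p in predictions):
        raise ValueError("all predictions must cover the same number of positions")
    return [
        [sum(p[i][b] for p in predictions) / n_models for b in range(4)]
        for i in range(n_pos)
    ]


# Two hypothetical single-position predictions from two fold models.
fold_predictions = [
    [[1.0, 0.0, 0.0, 0.0]],
    [[0.0, 1.0, 0.0, 0.0]],
]
averaged = ensemble_average(fold_predictions)
```

Because each input position sums to 1, the averaged position also sums to 1, so the result remains a valid PWM.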

PWM

For the purposes of this study, a PWM is defined as an N × 4 matrix, where N represents the length of the DNA of interest and the four entries at each position correspond to the four DNA bases: adenine (A), cytosine (C), guanine (G) and thymine (T). Each position of the PWM, referred to here as a column, holds the probabilities of the four bases occurring at that position.

$$\mathrm{Col}_{\mathrm{PWM}}=\left[P_{\mathrm{A}},P_{\mathrm{C}},P_{\mathrm{G}},P_{\mathrm{T}}\right]$$

$$P_{\mathrm{A}}+P_{\mathrm{C}}+P_{\mathrm{G}}+P_{\mathrm{T}}=1$$
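The definition above can be made concrete with a short sketch that builds a PWM from per-position base counts and checks the sum-to-one constraint. The helper names `normalize_counts` and `is_valid_pwm`, the tolerance and the example counts are illustrative assumptions, not part of DeepPBS.

```python
from typing import List

BASES = ("A", "C", "G", "T")  # entry order taken from the definition above


def normalize_counts(counts: List[List[float]]) -> List[List[float]]:
    """Turn an N x 4 table of non-negative base counts into a PWM."""
    pwm = []
    for row in counts:
        if len(row) != len(BASES):
            raise ValueError("each position needs exactly four entries (A, C, G, T)")
        total = sum(row)
        if total <= 0:
            raise ValueError("each position needs a positive total count")
        pwm.append([c / total for c in row])
    return pwm


def is_valid_pwm(pwm: List[List[float]], tol: float = 1e-9) -> bool:
    """Check that every position holds four probabilities summing to 1."""
    return all(
        len(row) == len(BASES)
        and all(p >= 0.0 for p in row)
        and abs(sum(row) - 1.0) <= tol
        for row in pwm
    )


# Hypothetical counts for a three-position site.
example_counts = [[8, 1, 1, 0], [0, 0, 10, 0], [2, 3, 2, 3]]
example_pwm = normalize_counts(example_counts)
```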

DNA symmetrization

The DNA representation used is carefully designed with several considerations. First, the DNA sequence in the input complex might not correspond to a high-affinity sequence, particularly in designed structures. Second, an all-atom graph representation, similar to that used for the protein, is not convenient because the model ultimately needs to predict a 1D representation (that is, the PWM) that describes binding specificity. Third, structural data are sparse, and relying on the exact atomic conformation of a bound DNA sequence can make the model overly sensitive and less useful.

Considering these factors, we represent the DNA in a base-symmetrized manner. As shown in Fig. 1c and Supplementary Fig. 2, this is achieved by designing a symmetrization schema in the base-pair frame, which symmetrizes the seven key atomic interaction positions (four in the major groove and three in the minor groove)²⁴. Additionally, four positions are assigned for the sugar and phosphate moieties. For full details of this process, see Supplementary Section 3.

DeepPBS architecture and training details

A detailed description of the DeepPBS architecture can be found in Supplementary Section 5. Training, cross-validation and benchmarking details are available in Supplementary Section 6.

Performance metrics

Performance metrics used in this work are MAE and RMSE, defined as

$$\mathrm{MAE}\left(Y,Y^{\mathrm{pred}}\right)=\frac{1}{N}\sum_{i\in\{0,\ldots,N-1\}}\sum_{b\in[\mathrm{A},\mathrm{C},\mathrm{G},\mathrm{T}]}\left|Y_{ib}-Y_{ib}^{\mathrm{pred}}\right|$$

$$\mathrm{RMSE}\left(Y,Y^{\mathrm{pred}}\right)=\sqrt{\frac{1}{N}\sum_{i\in\{0,\ldots,N-1\}}\sum_{b\in[\mathrm{A},\mathrm{C},\mathrm{G},\mathrm{T}]}\left(Y_{ib}-Y_{ib}^{\mathrm{pred}}\right)^{2}}.$$

N refers to the number of columns in the PWMs being compared. Both metrics follow ‘the lower the better’ principle. They are not independent but have different properties. A further discussion of the metrics is presented in Supplementary Section 7 and Supplementary Fig. 10.
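The two metrics can be written out directly from their definitions. The sketch below uses plain Python lists; the function names and the toy matrices are illustrative, and the normalization is by N (the number of PWM columns), as in the formulas above, rather than by 4N.

```python
import math
from typing import List

PWM = List[List[float]]  # N positions x 4 bases (A, C, G, T)


def mae(y: PWM, y_pred: PWM) -> float:
    """Sum of absolute differences over all entries, divided by N."""
    n = len(y)
    if n == 0 or n != len(y_pred):
        raise ValueError("PWMs must be non-empty and of equal length")
    total = sum(abs(a - b) for ry, rp in zip(y, y_pred) for a, b in zip(ry, rp))
    return total / n


def rmse(y: PWM, y_pred: PWM) -> float:
    """Square root of the summed squared differences divided by N."""
    n = len(y)
    if n == 0 or n != len(y_pred):
        raise ValueError("PWMs must be non-empty and of equal length")
    total = sum((a - b) ** 2 for ry, rp in zip(y, y_pred) for a, b in zip(ry, rp))
    return math.sqrt(total / n)


# Toy comparison: the first position differs, the second is identical.
y_true = [[1.0, 0.0, 0.0, 0.0], [0.25, 0.25, 0.25, 0.25]]
y_hat = [[0.5, 0.5, 0.0, 0.0], [0.25, 0.25, 0.25, 0.25]]
toy_mae = mae(y_true, y_hat)    # (0.5 + 0.5) / 2 = 0.5
toy_rmse = rmse(y_true, y_hat)  # sqrt((0.25 + 0.25) / 2) = 0.5
```

With this normalization and probability-valued columns, MAE lies between 0 and 2 and RMSE between 0 and √2, with the upper bounds reached when every column places all probability on a different base in the two PWMs.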

Bipartite edge perturbation and protein heavy atom importance score calculation

Supplementary Fig. 4 schematically describes the bipartite edge perturbation process for calculating the importance score of a protein heavy atom (say, atom a). Briefly, the prediction is calculated twice: once (say, Y_a) with the edges corresponding to that heavy atom included, and again (say, Y_~a) with the same edges masked. The difference between the two predictions is quantified using the mean absolute difference measure. On their own, these values may not be meaningful, but they can be normalized to the 0–1 range by dividing by the maximum value within a structure. The normalized values, RI scores, signify how much the specificity prediction is influenced by interactions made by the corresponding heavy atom. Depending on the downstream use, RI scores can be aggregated at the residue level using average, max or sum aggregation. Mathematically,

$$\mathrm{RI}_{a}=\frac{\mathrm{MAE}\left(Y_{a},Y_{\sim a}\right)}{\max\limits_{b\in\mathrm{all\;atoms}}\mathrm{MAE}\left(Y_{b},Y_{\sim b}\right)}.$$
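The normalization in this formula can be sketched as follows, assuming the paired predictions (with and without the edges of each atom) have already been obtained from the model. The function name `relative_importance`, the atom labels and the toy predictions are hypothetical, and returning zeros when no perturbation changes the prediction is a choice made here to avoid division by zero.

```python
from typing import Dict, List, Tuple

PWM = List[List[float]]  # N positions x 4 bases (A, C, G, T)


def mae(y: PWM, y_other: PWM) -> float:
    """Sum of absolute differences over all entries, divided by N."""
    n = len(y)
    return sum(abs(a - b) for r1, r2 in zip(y, y_other) for a, b in zip(r1, r2)) / n


def relative_importance(preds: Dict[str, Tuple[PWM, PWM]]) -> Dict[str, float]:
    """Map each atom to MAE(Y_a, Y_~a) divided by the maximum over all atoms."""
    raw = {atom: mae(y, y_masked) for atom, (y, y_masked) in preds.items()}
    top = max(raw.values(), default=0.0)
    if top == 0.0:
        return {atom: 0.0 for atom in raw}
    return {atom: value / top for atom, value in raw.items()}


# Toy single-position predictions; labels and numbers are made up.
full = [[0.7, 0.1, 0.1, 0.1]]
example_predictions = {
    "ARG5:NH1": (full, [[0.3, 0.3, 0.2, 0.2]]),  # masking changes the PWM a lot
    "ARG5:CA": (full, [[0.6, 0.2, 0.1, 0.1]]),   # masking changes it slightly
}
ri_scores = relative_importance(example_predictions)
```

Here the side-chain atom receives the maximum score of 1 and the backbone atom a quarter of that, reflecting the relative sizes of the two perturbation effects.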

Computationally, this process is like measuring the effect of a deactivating mutation, which is why we hypothesized that, at a residue level, these scores could correlate with alanine scanning mutagenesis data. For comparison with alanine scanning mutagenesis experiments (Fig. 4h) at a residue level, the log sum aggregated importance score was calculated. For each atom a of a residue r in the protein–DNA interface, let the calculated RI be RI_a. The residue-level score is then calculated as

$$\mathrm{LogSum\;aggregated\;residue\;importance}\left(r\right)=\log_{2}\left(1+\sum_{a\in r}\mathrm{RI}_{a}\right).$$
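A minimal sketch of this aggregation, assuming atom-level RI scores keyed by hypothetical (residue, atom) labels, is:

```python
import math
from collections import defaultdict
from typing import Dict, Tuple


def logsum_residue_importance(
    ri_by_atom: Dict[Tuple[str, str], float]
) -> Dict[str, float]:
    """Compute log2(1 + sum of RI over the atoms of each residue)."""
    totals: Dict[str, float] = defaultdict(float)
    for (residue, _atom), ri in ri_by_atom.items():
        totals[residue] += ri
    return {residue: math.log2(1.0 + total) for residue, total in totals.items()}


# Made-up RI scores for three interface residues.
example_ri = {
    ("ARG5", "NH1"): 1.0,
    ("ARG5", "NH2"): 0.75,
    ("ARG5", "NE"): 0.75,
    ("ARG5", "CZ"): 0.5,
    ("LYS8", "NZ"): 0.5,
    ("LYS8", "CE"): 0.5,
    ("SER9", "OG"): 0.0,
}
residue_scores = logsum_residue_importance(example_ri)
```

A residue whose atoms all have zero RI maps to 0, and the logarithm compresses the scores of residues with many contributing atoms.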

The structure visualizations presented were produced using PyMOL 2.5.

Description of competitor assay for quantifying designed proteins’ binding specificity

Glasscock et al.³⁴ used a yeast display assay to quantify binding of their designed proteins. The proteins were expressed by integrating the corresponding synthetic oligonucleotide into a yeast surface expression vector. Yeast cells expressing designed proteins on their surface were labeled with biotinylated dsDNA targets, streptavidin–phycoerythrin and anti-c-Myc fluorescein isothiocyanate in a 96-well plate format, after which the binding signal was quantified on an Attune NxT flow cytometer. Adding an excess of nonfluorescent competitor target DNA reduces this binding signal. Thus, single mutations at each position could be scanned through the competitor, producing the data shown in Fig. 5d,h,l,p.

DeepPBS webserver

DeepPBS is available as a webserver at https://deeppbs.usc.edu. The webserver implements the DeepPBS method for predicting a PWM from the structure of a protein–DNA complex. The structure can be uploaded as a PDB file or a macromolecular crystallographic information file (mmCIF). The webserver also provides documentation for users.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

Datasets used for all analysis and associated custom scripts were deposited via figshare at https://doi.org/10.6084/m9.figshare.25678053 (ref. ⁶⁴). Accession codes for discussed structures from the PDB: 1L3L, 7CLI, 2R5Z, 1CIT, 1F4K, 1GJI, 1TC3, 2BSQ, 2C9L, 5ZGN, 1BBX, 1KLN, 1N5Y, 5YUZ, 1QAI, 1XC8, 6T8H, 4TUI, 1DH3, 7OH9 and 1APL. UniProt accession codes for protein sequences discussed (folded with RFNA): Q8IUE0, Q6H878, O43680 and Q4H376. Accession codes for discussed experimental specificity data from JASPAR2022 and HOCOMOCOv11: MA1897.1, MA1568.1, MA1031.1, MA1572.1, MA0112.2, MA0112.3, ESR1_HUMAN.H11MO.0 and NFKB2_HUMAN.H11MO.0.B. Mutagenesis experiment data used are available from the SAMPDI website (http://compbio.clemson.edu/media/download/SAMPDI_dataset.xlsx). MELD-DNA modeled complex data were taken from Zenodo at https://doi.org/10.5281/zenodo.7501937 (ref. ⁶⁵). Source data are provided with this paper.

Code availability

Installable source code, pretrained models, associated guidelines and various custom scripts can be found via GitHub at https://github.com/timkartar/DeepPBS. The implementation is also available via a Code Ocean capsule at https://doi.org/10.24433/CO.0545023.v2. In addition, DeepPBS is accessible as a webserver through https://deeppbs.usc.edu.

References

1. Spitz, F. & Furlong, E. E. M. Transcription factors: from enhancer binding to developmental control. Nat. Rev. Genet. 13, 613–626 (2012).
2. Zhao, Y., Granas, D. & Stormo, G. D. Inferring binding energies from selected binding sites. PLoS Comput. Biol. 5, e1000590 (2009).
3. Rohs, R. et al. The role of DNA shape in protein–DNA recognition. Nature 461, 1248–1253 (2009).
4. Stirnimann, C. U., Ptchelkine, D., Grimm, C. & Müller, C. W. Structural basis of TBX5–DNA recognition: the T-box domain in its DNA-bound and -unbound form. J. Mol. Biol. 400, 71–81 (2010).
5. Helene, C. Specific recognition of guanine bases in protein–nucleic acid complexes. FEBS Lett. 74, 10–13 (1977).
6. Rohs, R. et al. Origins of specificity in protein–DNA recognition. Annu. Rev. Biochem. 79, 233–269 (2010).
7. Schildbach, J. F., Karzai, A. W., Raumann, B. E. & Sauer, R. T. Origins of DNA-binding specificity: role of protein contacts with the DNA backbone. Proc. Natl Acad. Sci. USA 96, 811–817 (1999).
8. Seeman, N. C., Rosenberg, J. M. & Rich, A. Sequence-specific recognition of double helical nucleic acids by proteins. Proc. Natl Acad. Sci. USA 73, 804–808 (1976).
9. Garvie, C. W. & Wolberger, C. Recognition of specific DNA sequences. Mol. Cell 8, 937–946 (2001).
10. Berman, H. M. et al. The Protein Data Bank. Nucleic Acids Res. 28, 235–242 (2000).
11. Berger, M. F. & Bulyk, M. L. Universal protein-binding microarrays for the comprehensive characterization of the DNA-binding specificities of transcription factors. Nat. Protoc. 4, 393–411 (2009).
12. Slattery, M. et al. Cofactor binding evokes latent differences in DNA binding specificity between Hox proteins. Cell 147, 1270–1282 (2011).
13. Park, P. J. ChIP-seq: advantages and challenges of a maturing technology. Nat. Rev. Genet. 10, 669–680 (2009).
14. Jolma, A. et al. DNA-binding specificities of human transcription factors. Cell 152, 327–339 (2013).
15. Slattery, M. et al. Absence of a simple code: how transcription factors read the genome. Trends Biochem. Sci. 39, 381–399 (2014).
16. Persikov, A. V. & Singh, M. De novo prediction of DNA-binding specificities for Cys₂His₂ zinc finger proteins. Nucleic Acids Res. 42, 97–108 (2014).
17. Wetzel, J. L., Zhang, K. & Singh, M. Learning probabilistic protein–DNA recognition codes from DNA-binding specificities using structural mappings. Genome Res. 32, 1776–1786 (2022).
18. Persikov, A. V., Osada, R. & Singh, M. Predicting DNA recognition by Cys₂His₂ zinc finger proteins. Bioinformatics 25, 22–29 (2009).
19. Aizenshtein-Gazit, S. & Orenstein, Y. DeepZF: improved DNA-binding prediction of C2H2-zinc-finger proteins by deep transfer learning. Bioinformatics 38, ii62–ii67 (2022).
20. Meseguer, A. et al. On the prediction of DNA-binding preferences of C2H2-ZF domains using structural models: application on human CTCF. NAR Genom. Bioinform. 2, lqaa046 (2020).
21. Molparia, B., Goyal, K., Sarkar, A., Kumar, S. & Sundar, D. ZiF-Predict: a web tool for predicting DNA-binding specificity in C₂H₂ zinc finger proteins. Genom. Proteom. Bioinform. 8, 122–126 (2010).
22. Christensen, R. G. et al. Recognition models to predict DNA-binding specificities of homeodomain proteins. Bioinformatics 28, i84–i89 (2012).
23. Yanover, C. & Bradley, P. Extensive protein and DNA backbone sampling improves structure-based specificity prediction for C₂H₂ zinc fingers. Nucleic Acids Res. 39, 4564–4576 (2011).
24. Chiu, T. P., Rao, S. & Rohs, R. Physicochemical models of protein–DNA binding with standard and modified base pairs. Proc. Natl Acad. Sci. USA 120, e2205796120 (2023).
25. Stormo, G. D. Modeling the specificity of protein–DNA interactions. Quant. Biol. 1, 115–130 (2013).
26. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
27. Ahdritz, G. et al. OpenFold: retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization. Nat. Methods https://doi.org/10.1038/s41592-024-02272-z (2024).
28. Baek, M. et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871–876 (2021).
29. Baek, M., Mchugh, R., Anishchenko, I., Baker, D. & Dimaio, F. Accurate prediction of nucleic acid and protein–nucleic acid complexes using RoseTTAFoldNA. Nat. Methods 21, 117–121 (2024).
30. Krishna, R. et al. Generalized biomolecular modeling and design with RoseTTAFold All-Atom. Science 384, eadl2528 (2024).
31. Esmaeeli, R., Bauzá, A. & Perez, A. Structural predictions of protein–DNA binding: MELD-DNA. Nucleic Acids Res. 51, 1625–1636 (2023).
32. Abramson, J. et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 630, 493–500 (2024).
33. Morrison, K. L. & Weiss, G. A. Combinatorial alanine-scanning. Curr. Opin. Chem. Biol. 5, 302–307 (2001).
34. Glasscock, C. J. et al. Computational design of sequence-specific DNA-binding proteins. Preprint at bioRxiv https://doi.org/10.1101/2023.09.20.558720 (2023).
35. Joshi, R. et al. Functional specificity of a hox protein mediated by the recognition of minor groove structure. Cell 131, 530–543 (2007).
36. Castro-Mondragon, J. A. et al. JASPAR 2022: the 9th release of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 50, D165–D173 (2022).
37. Kulakovskiy, I. V. et al. HOCOMOCO: towards a complete collection of transcription factor binding models for human and mouse via large-scale ChIP-seq analysis. Nucleic Acids Res. 46, D252–D259 (2018).
38. Agback, P., Baumann, H., Knapp, S., Ladenstein, R. & Härd, T. Architecture of nonspecific protein–DNA interactions in the Sso7d–DNA complex. Nat. Struct. Biol. 5, 579–584 (1998).
39. Mistry, J. et al. Pfam: the protein families database in 2021. Nucleic Acids Res. 49, D412–D419 (2021).
40. Persikov, A. V. & Singh, M. An expanded binding model for Cys₂ His₂ zinc finger protein–DNA interfaces. Phys. Biol. 8, 035010 (2011).
41. Ichikawa, D. M. et al. A universal deep-learning model for zinc finger design enables transcription factor reprogramming. Nat. Biotechnol. 41, 1117–1129 (2023).
42. Escalante, C. R., Yie, J., Thanos, D. & Aggarwal, A. K. Structure of IRF-1 with bound DNA reveals determinants of interferon regulation. Nature 391, 103–106 (1998).
43. de Martin, X., Sodaei, R. & Santpere, G. Mechanisms of binding specificity among bHLH transcription factors. Int. J. Mol. Sci. 22, 9150 (2021).
44. Mariani, V., Biasini, M., Barbato, A. & Schwede, T. lDDT: a local superposition-free score for comparing protein structures and models using distance difference tests. Bioinformatics 29, 2722–2728 (2013).
45. Genheden, S. & Ryde, U. The MM/PBSA and MM/GBSA methods to estimate ligand-binding affinities. Expert Opin. Drug Discov. 10, 449–461 (2015).
46. Joerger, A. C. & Fersht, A. R. Structural biology of the tumor suppressor p53. Annu. Rev. Biochem. 77, 557–582 (2008).
47. Kitayner, M. et al. Diversity in DNA recognition by p53 revealed by crystal structures with Hoogsteen base pairs. Nat. Struct. Mol. Biol. 17, 423–429 (2010).
48. Petty, T. J. et al. An induced fit mechanism regulates p53 DNA binding kinetics to confer sequence specificity. EMBO J. 30, 2167–2176 (2011).
49. Kitayner, M. et al. Structural basis of DNA recognition by p53 tetramers. Mol. Cell 22, 741–753 (2006).
50. Reaz, S., Mossalam, M., Okal, A. & Lim, C. S. A single mutant, A276S of p53, turns the switch to apoptosis. Mol. Pharm. 10, 1350–1359 (2013).
51. Barakat, K., Issack, B. B., Stepanova, M. & Tuszynski, J. Effects of temperature on the p53–DNA binding interactions and their dynamical behavior: comparing the wild type to the R248Q mutant. PLoS ONE 6, e27651 (2011).
52. Vousden, K. H. & Prives, C. Blinded by the light: the growing complexity of p53. Cell 137, 413–431 (2009).
53. Peterson, S. N., Dahlquist, F. W. & Reich, N. O. The role of high affinity non-specific DNA binding by Lrp in transcriptional regulation and DNA organization. J. Mol. Biol. 369, 1307–1317 (2007).
54. Ovek, D. et al. Artificial intelligence based methods for hot spot prediction. Curr. Opin. Struct. Biol. 72, 209–218 (2022).
55. Stefl, R., Wu, H., Ravindranathan, S., Sklenář, V. & Feigon, J. DNA A-tract bending in three dimensions: solving the dA₄T₄ vs. dT₄A₄ conundrum. Proc. Natl Acad. Sci. USA 101, 1177–1182 (2004).
56. Li, J., Chiu, T. P. & Rohs, R. Predicting DNA structure using a deep learning method. Nat. Commun. 15, 1243 (2024).
57. Dror, I., Zhou, T., Mandel-Gutfreund, Y. & Rohs, R. Covariation between homeodomain transcription factors and the shape of their DNA binding sites. Nucleic Acids Res. 42, 430–441 (2014).
58. Noyes, M. B. et al. Analysis of homeodomain specificities allows the family-wide prediction of preferred recognition sites. Cell 133, 1277–1289 (2008).
59. Persikov, A. V. et al. A systematic survey of the Cys₂His₂ zinc finger DNA-binding landscape. Nucleic Acids Res. 43, 1965–1984 (2015).
60. Otwinowski, Z. et al. Crystal structure of trp represser/operator complex at atomic resolution. Nature 335, 321–329 (1988).
61. Zhou, T. et al. Quantitative modeling of transcription factor binding specificities using DNA shape. Proc. Natl Acad. Sci. USA 112, 4654–4659 (2015).
62. Nair, S. K. & Burley, S. K. X-ray structures of Myc-Max and Mad-Max recognizing DNA. Cell 112, 193–205 (2003).
63. Afek, A. et al. DNA mismatches reveal conformational penalties in protein–DNA recognition. Nature 587, 291–296 (2020).
64. Mitra, R. DeepPBS data. figshare https://doi.org/10.6084/m9.figshare.25678053.v1 (2024).
65. GoldEagle93. PDNALab/MELD-DNA: release for Zenodo. Zenodo https://doi.org/10.5281/zenodo.7501938 (2023).

Acknowledgements

This work was supported by an Andrew J. Viterbi Fellowship in Computational Biology and Bioinformatics (to R.M.), a Washington Research Foundation postdoctoral fellowship (to C.J.G.), the Human Frontier Science Program (grant RGP0021/2018 to R.R.) and the National Institutes of Health (grant R35GM130376 to R.R.). We acknowledge L. Manna for setup and maintenance of the DeepPBS webserver and thank all Rohs lab members for support and valuable feedback.

Author information

Jared M. Sagendorf
Present address: Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA

Authors and Affiliations

Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
Raktim Mitra, Jinsen Li, Jared M. Sagendorf, Yibei Jiang, Ari S. Cohen, Tsu-Pei Chiu & Remo Rohs
Department of Biochemistry, University of Washington, Seattle, WA, USA
Cameron J. Glasscock
Institute for Protein Design, University of Washington, Seattle, WA, USA
Cameron J. Glasscock
Department of Chemistry, University of Southern California, Los Angeles, CA, USA
Remo Rohs
Department of Physics and Astronomy, University of Southern California, Los Angeles, CA, USA
Remo Rohs
Thomas Lord Department of Computer Science, University of Southern California, Los Angeles, CA, USA
Remo Rohs

Authors

Raktim Mitra
Jinsen Li
Jared M. Sagendorf
Yibei Jiang
Ari S. Cohen
Tsu-Pei Chiu
Cameron J. Glasscock
Remo Rohs

Contributions

R.M., J.M.S. and R.R. conceived the project idea with input from T.P.C. R.M., J.M.S. and J.L. designed the model. R.M. and J.M.S. performed data preprocessing. R.M., with input from J.L. and J.M.S., performed model training and benchmarking. R.M., J.L. and T.P.C. developed all application ideas. R.M., J.L. and Y.J. carried out all applications and data analysis. A.S.C., with input from R.M., designed and built the web-based implementation. C.J.G. provided data for validation and application on predicted and synthetic designs. R.M., J.L., Y.J. and R.R. wrote the paper. All authors read and commented on the paper. R.R. supervised the project.

Corresponding author

Correspondence to Remo Rohs.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks Gregory Poon and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editor: Arunima Singh, in collaboration with the Nature Methods team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1–12, discussion and Table 1.

Reporting Summary

Peer Review File

Supplementary Data 1

Cluster-wise description of cross-validation folds, two-sided t-test results between DeepPBS variations and two-sided t-test results between readout modes.

Supplementary Data 2

Source data for supplementary figures.

Supplementary Video 1

Concurrent view of changes in network prediction as MD simulation of Exd-Scr–DNA complex progressed, along with corresponding changes in heavy atom importance score.

Source data

Source Data Fig. 1

Statistical source data.

Source Data Fig. 2

Statistical source data.

Source Data Fig. 3

Statistical source data.

Source Data Fig. 4

Statistical source data.

Source Data Fig. 5

Statistical source data.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Mitra, R., Li, J., Sagendorf, J.M. et al. Geometric deep learning of protein–DNA binding specificity. Nat Methods 21, 1674–1683 (2024). https://doi.org/10.1038/s41592-024-02372-w

Received: 13 August 2023
Accepted: 14 June 2024
Published: 05 August 2024
Issue date: September 2024
DOI: https://doi.org/10.1038/s41592-024-02372-w
