Multiple overlapping binding sites determine transcription factor occupancy

Khetan, Shubham; Carroll, Brent S.; Bulyk, Martha L.

doi:10.1038/s41586-025-09472-3

Article
Published: 03 September 2025

Nature (2025)

Abstract

Transcription factors (TFs) regulate gene expression by interacting with DNA in a sequence-specific manner. High-throughput in vitro technologies, such as protein-binding microarrays^1,2,3,4,5,6 and HT-SELEX (high-throughput systematic evolution of ligands by exponential enrichment)^7,8, have revealed the DNA-binding specificities of hundreds of TFs. However, they have limited ability to reliably identify lower-affinity DNA binding sites, which are increasingly recognized as important for precise spatiotemporal control of gene expression^9,10,11,12,13,14,15,16,17,18,19. Here, to address this limitation, we developed protein affinity to DNA by in vitro transcription and RNA sequencing (PADIT-seq), with which we comprehensively assayed the binding preferences of six TFs to all possible ten-base-pair DNA sequences, detecting hundreds of novel, lower-affinity binding sites. The expanded repertoire of lower-affinity binding sites revealed that nucleotides flanking high-affinity DNA binding sites create overlapping lower-affinity sites that together modulate TF genomic occupancy in vivo. We propose a model in which TF binding is not determined by individual binding sites, but rather by the sum of multiple, overlapping binding sites. The overlapping binding model explains how competition between paralogous TFs for shared high-affinity binding sites is determined by flanking nucleotides that create differential numbers of overlapping, lower-affinity binding sites. Critically, the model transforms our understanding of noncoding-variant effects, revealing how single nucleotide changes simultaneously alter multiple overlapping sites to additively influence gene expression and human traits, including diseases.

**Fig. 1: PADIT-seq detects hundreds of lower-affinity interactions that are undetected by uPBM and HT-SELEX.**

**Fig. 2: Nucleotides flanking high-affinity binding sites create overlapping lower-affinity binding sites.**

**Fig. 3: ChIP-nexus footprinting provides direct evidence for independent TF binding to consecutive, overlapping binding sites.**

**Fig. 4: Competition between the paralogous TFs Pho4 and Cbf1 is determined by differential numbers of overlapping binding sites.**

**Fig. 5: Noncoding variants alter TF binding by perturbing multiple overlapping binding sites.**

Data availability

PADIT-seq data have been deposited in the Gene Expression Omnibus (GEO) database under accession number GSE250601. MITOMI K_d data for EGR1 were downloaded from the supporting information of ref. ^41. The UniPROBE database (http://thebrain.bwh.harvard.edu/pbms/UniPROBE_staging/browse.php) was used to access uPBM data^74,75. HT-SELEX data were downloaded using the ENA accession code ERP001826. ChIP–seq and ChIP-nexus data analysed in the paper were downloaded using accession codes GSE81356, GSE89457, GSE85628, GSE67482 and GSE207001. BET-seq data were downloaded using the accession code GSE111936. Competition gcPBM data were downloaded from the supplementary materials of ref. ^64. Processed SNP-SELEX data were downloaded from the supplementary data of ref. ^65. Allele-specific EGR1 ChIP–seq data were downloaded from the ADASTRA database: https://adastra.autosome.org/mabel. Processed MPRA data were downloaded from supplementary tables of refs. ^70,71,72,73.

Code availability

Code and processed data for generating the figures are available at https://github.com/BulykLab/PADIT-seq.

References

Bulyk, M. L., Gentalen, E., Lockhart, D. J. & Church, G. M. Quantifying DNA–protein interactions by double-stranded DNA arrays. Nat. Biotechnol. 17, 573–577 (1999).
Mukherjee, S. et al. Rapid analysis of the DNA-binding specificities of transcription factors with DNA microarrays. Nat. Genet. 36, 1331–1339 (2004).
Berger, M. F. et al. Compact, universal DNA microarrays to comprehensively determine transcription-factor binding site specificities. Nat. Biotechnol. 24, 1429–1435 (2006).
Berger, M. F. & Bulyk, M. L. Universal protein-binding microarrays for the comprehensive characterization of the DNA-binding specificities of transcription factors. Nat. Protoc. 4, 393–411 (2009).
Badis, G. et al. Diversity and complexity in DNA recognition by transcription factors. Science 324, 1720–1723 (2009).
Weirauch, M. T. et al. Determination and inference of eukaryotic transcription factor sequence specificity. Cell 158, 1431–1443 (2014).
Jolma, A. et al. Multiplexed massively parallel SELEX for characterization of human transcription factor binding specificities. Genome Res. 20, 861–873 (2010).
Jolma, A. et al. DNA-binding specificities of human transcription factors. Cell 152, 327–339 (2013).
Driever, W., Thoma, G. & Nüsslein-Volhard, C. Determination of spatial domains of zygotic gene expression in the Drosophila embryo by the affinity of binding sites for the bicoid morphogen. Nature 340, 363–367 (1989).
Gaudet, J. & Mango, S. E. Regulation of organogenesis by the Caenorhabditis elegans FoxA protein PHA-4. Science 295, 821–825 (2002).
Rowan, S. et al. Precise temporal control of the eye regulatory gene Pax6 via enhancer-binding site affinity. Genes Dev. 24, 980–985 (2010).
Crocker, J. et al. Low affinity binding site clusters confer hox specificity and regulatory robustness. Cell 160, 191–203 (2015).
Farley, E. K. et al. Suboptimization of developmental enhancers. Science 350, 325–328 (2015).
Zandvakili, A., Campbell, I., Gutzwiller, L. M., Weirauch, M. T. & Gebelein, B. Degenerate Pax2 and Senseless binding motifs improve detection of low-affinity sites required for enhancer specificity. PLoS Genet. 14, e1007289 (2018).
Tanay, A. Extensive low-affinity transcriptional interactions in the yeast genome. Genome Res. 16, 962–972 (2006).
Segal, E., Raveh-Sadka, T., Schroeder, M., Unnerstall, U. & Gaul, U. Predicting expression patterns from regulatory sequence in Drosophila segmentation. Nature 451, 535–540 (2008).
Giorgetti, L. et al. Noncooperative interactions between transcription factors and clustered DNA binding sites enable graded transcriptional responses to environmental inputs. Mol. Cell 37, 418–428 (2010).
Horton, C. A. et al. Short tandem repeats bind transcription factors to tune eukaryotic gene expression. Science 381, eadd1250 (2023).
Lim, F. et al. Affinity-optimizing enhancer variants disrupt development. Nature 626, 151–159 (2024).
Bartlett, A. et al. Mapping genome-wide transcription-factor binding sites using DAP-seq. Nat. Protoc. 12, 1659–1672 (2017).
Stormo, G. D., Zuo, Z. & Chang, Y. K. Spec-seq: determining protein-DNA-binding specificity by sequencing. Brief. Funct. Genom. 14, 30–38 (2015).
Fordyce, P. M. et al. De novo identification and biophysical characterization of transcription-factor binding sites with microfluidic affinity analysis. Nat. Biotechnol. 28, 970–975 (2010).
Isakova, A. et al. SMiLE-seq identifies binding motifs of single and dimeric transcription factors. Nat. Methods 14, 316–322 (2017).
Meng, X., Brodsky, M. H. & Wolfe, S. A. A bacterial one-hybrid system for determining the DNA-binding specificity of transcription factors. Nat. Biotechnol. 23, 988–994 (2005).
Stringham, J. L., Brown, A. S., Drewell, R. A. & Dresch, J. M. Flanking sequence context-dependent transcription factor binding in early Drosophila development. BMC Bioinformatics 14, 298 (2013).
Levo, M. et al. Unraveling determinants of transcription factor binding outside the core binding site. Genome Res. 25, 1018–1029 (2015).
Dror, I., Golan, T., Levy, C., Rohs, R. & Mandel-Gutfreund, Y. A widespread role of the motif environment in transcription factor binding across diverse protein families. Genome Res. 25, 1268–1280 (2015).
Chaudhari, H. G. & Cohen, B. A. Local sequence features that influence AP-1 cis-regulatory activity. Genome Res. 28, 171–181 (2018).
Cohen, D. M., Lim, H.-W., Won, K.-J. & Steger, D. J. Shared nucleotide flanks confer transcriptional competency to bZip core motifs. Nucleic Acids Res. 46, 8371–8384 (2018).
Le, D. D. et al. Comprehensive, high-resolution binding energy landscapes reveal context dependencies of transcription factor binding. Proc. Natl Acad. Sci. USA 115, E3702–E3711 (2018).
Yang, M. G., Ling, E., Cowley, C. J., Greenberg, M. E. & Vierbuchen, T. Characterization of sequence determinants of enhancer function using natural genetic variation. eLife 11, e76500 (2022).
Reiter, F., de Almeida, B. P. & Stark, A. Enhancers display constrained sequence flexibility and context-specific modulation of motif function. Genome Res. 33, 346–358 (2023).
Rudnizky, S. et al. Single-molecule DNA unzipping reveals asymmetric modulation of a transcription factor by its binding site sequence and context. Nucleic Acids Res. 46, 1513–1524 (2018).
Gordân, R. et al. Genomic regions flanking E-box binding sites influence DNA binding specificity of bHLH transcription factors through DNA shape. Cell Rep. 3, 1093–1104 (2013).
Aditham, A. K., Shimko, T. C. & Fordyce, P. M. BET-seq: binding energy topographies revealed by microfluidics and high-throughput sequencing. Methods Cell Biol. 148, 229–250 (2018).
Jung, C. et al. True equilibrium measurement of transcription factor-DNA binding affinities using automated polarization microscopy. Nat. Commun. 9, 1605 (2018).
Aditham, A. K., Markin, C. J., Mokhtari, D. A., DelRosso, N. & Fordyce, P. M. High-throughput affinity measurements of transcription factor and DNA mutations reveal affinity and specificity determinants. Cell Syst. 12, 112–127.e11 (2021).
Götzke, H. et al. The ALFA-tag is a highly versatile tool for nanobody-based bioscience applications. Nat. Commun. 10, 4403 (2019).
Hussey, B. J. & McMillen, D. R. Programmable T7-based synthetic transcription factors. Nucleic Acids Res. 46, 9842–9854 (2018).
Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).
Geertz, M., Shore, D. & Maerkl, S. J. Massively parallel measurements of molecular interaction kinetics on a microfluidic platform. Proc. Natl Acad. Sci. USA 109, 16540–16545 (2012).
Toko, H. et al. Csx/Nkx2-5 is required for homeostasis and survival of cardiac myocytes in the adult heart. J. Biol. Chem. 277, 24735–24743 (2002).
Moskowitz, I. P. G. et al. The T-Box transcription factor Tbx5 is required for the patterning and maturation of the murine cardiac conduction system. Development 131, 4107–4116 (2004).
Zhou, X. & O’Shea, E. K. Integrated approaches reveal determinants of genome-wide binding and function of the transcription factor Pho4. Mol. Cell 42, 826–836 (2011).
Ogawa, N. & Oshima, Y. Functional domains of a positive regulatory protein, PHO4, for transcriptional control of the phosphatase regulon in Saccharomyces cerevisiae. Mol. Cell. Biol. 10, 2224–2236 (1990).
Cai, M. & Davis, R. W. Yeast centromere binding protein CBF1, of the helix-loop-helix protein family, is required for chromosome stability and methionine prototrophy. Cell 61, 437–446 (1990).
Payne, J. L. & Wagner, A. The robustness and evolvability of transcription factor binding sites. Science 343, 875–877 (2014).
Jaeger, S. A. et al. Conservation and regulatory associations of a wide affinity range of mouse transcription factor binding sites. Genomics 95, 185–195 (2010).
Grant, C. E., Bailey, T. L. & Noble, W. S. FIMO: scanning for occurrences of a given motif. Bioinformatics 27, 1017–1018 (2011).
Rube, H. T. et al. Prediction of protein–ligand binding affinity from sequencing data with interpretable machine learning. Nat. Biotechnol. 40, 1520–1527 (2022).
Sheth, R. et al. Distal limb patterning requires modulation of cis-regulatory activities by HOX13. Cell Rep. 17, 2913–2926 (2016).
Sun, Z. et al. EGR1 recruits TET1 to shape the brain methylome during development and upon neuronal activity. Nat. Commun. 10, 3892 (2019).
Anderson, D. J. et al. NKX2-5 regulates human cardiomyogenesis via a HEY2 dependent transcriptional network. Nat. Commun. 9, 1373 (2018).
Ang, Y.-S. et al. Disease model of GATA4 mutation reveals transcription factor cooperativity in human cardiogenesis. Cell 167, 1734–1749.e22 (2016).
Alexandari, A. M. et al. De novo distillation of thermodynamic affinity from deep learning regulatory sequence models of in vivo protein–DNA binding. Preprint at bioRxiv https://doi.org/10.1101/2023.05.11.540401 (2023).
Markstein, M., Markstein, P., Markstein, V. & Levine, M. S. Genome-wide analysis of clustered Dorsal binding sites identifies putative target genes in the Drosophila embryo. Proc. Natl Acad. Sci. USA 99, 763–768 (2002).
Lifanov, A. P., Makeev, V. J., Nazina, A. G. & Papatsenko, D. A. Homotypic regulatory clusters in Drosophila. Genome Res. 13, 579–588 (2003).
Gotea, V. et al. Homotypic clusters of transcription factor binding sites are a key component of human promoters and enhancers. Genome Res. 20, 565–577 (2010).
Rohs, R. et al. The role of DNA shape in protein–DNA recognition. Nature 461, 1248–1253 (2009).
Li, J., Chiu, T.-P. & Rohs, R. Predicting DNA structure using a deep learning method. Nat. Commun. 15, 1243 (2024).
He, Q., Johnston, J. & Zeitlinger, J. ChIP-nexus enables improved detection of in vivo transcription factor binding footprints. Nat. Biotechnol. 33, 395–401 (2015).
De Masi, F. et al. Using a structural and logics systems approach to infer bHLH-DNA binding specificity determinants. Nucleic Acids Res. 39, 4553–4563 (2011).
Grove, C. A. et al. A multiparameter network reveals extensive divergence between C. elegans bHLH transcription factors. Cell 138, 314–327 (2009).
Zhang, Y., Ho, T. D., Buchler, N. E. & Gordân, R. Competition for DNA binding between paralogous transcription factors determines their genomic occupancy and regulatory functions. Genome Res. 31, 1216–1229 (2021).
Yan, J. et al. Systematic analysis of binding of transcription factors to noncoding variants. Nature 591, 147–151 (2021).
Coetzee, S. G., Coetzee, G. A. & Hazelett, D. J. motifbreakR: an R/Bioconductor package for predicting variant effects at transcription factor binding sites. Bioinformatics 31, 3847–3849 (2015).
Landrum, M. J. et al. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res. 42, D980–D985 (2014).
Lettice, L. A. et al. Disruption of a long-range cis-acting regulator for Shh causes preaxial polydactyly. Proc. Natl Acad. Sci. USA 99, 7548–7553 (2002).
Abramov, S. et al. Landscape of allele-specific transcription factor binding in the human genome. Nat. Commun. 12, 2751 (2021).
Tewhey, R. et al. Direct identification of hundreds of expression-modulating variants using a multiplexed reporter assay. Cell 165, 1519–1529 (2016).
Khetan, S. et al. Functional characterization of T2D-associated SNP effects on baseline and ER stress-responsive β cell transcriptional activation. Nat. Commun. 12, 5242 (2021).
Abell, N. S. et al. Multiple causal variants underlie genetic associations in humans. Science 375, 1247–1254 (2022).
McAfee, J. C. et al. Systematic investigation of allelic regulatory activity of schizophrenia-associated common variants. Cell Genom. 3, 100404 (2023).
Newburger, D. E. & Bulyk, M. L. UniPROBE: an online database of protein binding microarray data on protein–DNA interactions. Nucleic Acids Res. 37, D77–D82 (2009).
Hume, M. A., Barrera, L. A., Gisselbrecht, S. S. & Bulyk, M. L. UniPROBE, update 2015: new tools and content for the online database of protein-binding microarray data on protein–DNA interactions. Nucleic Acids Res. 43, D117–D122 (2015).
Shahein, A. et al. Systematic analysis of low-affinity transcription factor binding site clusters in vitro and in vivo establishes their functional relevance. Nat. Commun. 13, 5273 (2022).
Yin, Y. et al. Impact of cytosine methylation on DNA binding specificities of human transcription factors. Science 356, eaaj2239 (2017).
Slattery, M. et al. Cofactor binding evokes latent differences in DNA binding specificity between Hox proteins. Cell 147, 1270–1282 (2011).
Riley, T. R. et al. SELEX-seq: a method for characterizing the complete repertoire of binding site preferences for transcription factor complexes. Methods Mol. Biol. 1196, 255–278 (2014).
Hammal, F., de Langen, P., Bergon, A., Lopez, F. & Ballester, B. ReMap 2022: a database of human, mouse, Drosophila and Arabidopsis regulatory regions from an integrative analysis of DNA-binding sequencing experiments. Nucleic Acids Res. 50, D316–D325 (2022).
Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
Siepel, A. et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15, 1034–1050 (2005).
Foulk, M. S., Urban, J. M., Casella, C. & Gerbi, S. A. Characterizing and controlling intrinsic biases of lambda exonuclease in nascent strand sequencing reveals phasing between nucleosomes and G-quadruplex motifs around a subset of human replication origins. Genome Res. 25, 725–735 (2015).
Hon, J., Martínek, T., Zendulka, J. & Lexa, M. pqsfinder: an exhaustive and imperfection-tolerant search tool for potential quadruplex-forming sequences in R. Bioinformatics 33, 3373–3379 (2017).

Acknowledgements

The authors thank S. Gisselbrecht for help with calculating PBM 9-mer E-scores for EGR1; K. Struhl, L. Mariani and X. K. Liu for critical reading of the manuscript; and members of the Bulyk laboratory for helpful discussions. This work was supported by grants from the National Institutes of Health (R21 HG010200 and R01 HG010501 to M.L.B.; K99 HG013675 to S.K.) and the American Heart Association (24POST1192017 to S.K.).

Author information

Authors and Affiliations

Division of Genetics, Department of Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA, USA
Shubham Khetan, Brent S. Carroll & Martha L. Bulyk
Department of Pathology, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA, USA
Martha L. Bulyk

Contributions

M.L.B. and S.K. conceived and designed the research project. S.K. performed PADIT-seq experiments, and S.K. and B.S.C. performed custom PBM experiments. S.K. conducted the analyses and prepared the figures. M.L.B. supervised the research. M.L.B. and S.K. wrote the manuscript. All authors approved the final version of the manuscript.

Corresponding author

Correspondence to Martha L. Bulyk.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature thanks Matthew Weirauch and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer review reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 PADIT-seq demonstrates high reproducibility and correlates with orthogonal binding assays.

(a-b) First experiment with HOXD13, EGR1, and ‘NoDBD’ controls (R1-3). (a) Heatmap showing pairwise Pearson correlations between replicates with unsupervised row and column clustering. (b) PCA plot explaining ~93% of variation. (c) Second experiment with Pho4, Cbf1, TBX5, NKX2.5, and additional ‘NoDBD’ controls (R4-7). Previous ‘NoDBD’ controls (R1-3) are included for comparison. PCA plot explaining ~71% of variation. (d-e) PADIT-seq activities for HOXD13 (d) and EGR1 (e) from the all-10mers library and the small-scale library are compared. Red TFBS are active in both libraries. Black TFBS are not active in either library. Blue TFBS are active only in the small-scale library. (f) PADIT-seq activity from the small-scale library and MITOMI-derived dissociation constants (K_d) for EGR1 are compared. Red TFBS are active, whereas black TFBS are inactive. (g) Comparison of binding preferences measured by uPBM Z-scores (x-axis) and PADIT-seq (y-axis) for 4 human TFs, HOXD13, NKX2.5, TBX5, and EGR1, and 2 S. cerevisiae TFs, Pho4 and Cbf1. Active k-mers are colored red, and inactive k-mers are colored black. (inset) AUROC comparing the ability of uPBM E-scores (orange) and Z-scores (purple) to discriminate between PADIT-seq active and inactive k-mers.

Extended Data Fig. 2 Custom PBM confirms PADIT-seq active k-mers represent genuine TF binding sites.

(a) Schematic of custom PBM design showing PADIT-seq active k-mers (8-mers for HOXD13 and 9-mers for EGR1) embedded within constant flanking sequences. FLAG-tagged HOXD13 or EGR1 binding was detected using Alexa Fluor 488-conjugated anti-FLAG antibodies. (b) Scatter plots comparing PADIT-seq activity (y-axis) and custom PBM signal (x-axis) for HOXD13 at three protein concentrations (300 nM, 500 nM, and 800 nM). Points are colored by PADIT-seq false discovery rate (FDR): red (FDR < 0.01), cyan (0.01 ≤ FDR < 0.05), and orange (0.05 ≤ FDR < 0.10). (c) Corresponding analysis for EGR1 at the same three protein concentrations. (d) ROC curves comparing the predictive performance of uPBM E-scores (orange) and PWM scores from FIMO (blue) in distinguishing between PADIT-seq active and inactive k-mers. In cases where PWM models were longer than PADIT-seq active k-mers, we scanned all possible relative positions (registers) of the PWM model against each k-mer and retained the highest affinity score for our analysis. (e) ROC curves comparing the predictive performance of uPBM E-scores (orange) and ProBound-predicted affinities (blue) in distinguishing between PADIT-seq active and inactive k-mers. (insets) Due to the large imbalance between active and inactive k-mers, even seemingly small false positive rates translate to substantial numbers of false predictions. For example, a 5% FPR corresponds to 1,556 false positives for HOXD13 and even a 1% FPR corresponds to 1,305 false positives for EGR1.
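
The arithmetic behind the inset note can be sketched as follows. This is only an illustration: the function name is hypothetical, and the sizes of the inactive k-mer sets are assumptions back-calculated from the numbers quoted in the legend (1,556 / 0.05 and 1,305 / 0.01), not values taken from the deposited data.

```python
def expected_false_positives(fpr: float, n_negatives: int) -> int:
    """False positives expected when a classifier runs at a given FPR."""
    return round(fpr * n_negatives)


# Assumed sizes of the inactive (negative) k-mer sets, inferred from the legend.
hoxd13_inactive = round(1556 / 0.05)  # inactive HOXD13 8-mers (assumption)
egr1_inactive = round(1305 / 0.01)    # inactive EGR1 9-mers (assumption)

hoxd13_fp = expected_false_positives(0.05, hoxd13_inactive)
egr1_fp = expected_false_positives(0.01, egr1_inactive)
```

Because inactive k-mers vastly outnumber active ones, a false positive rate that looks small still yields more than a thousand wrong calls.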

Extended Data Fig. 3 Lower affinity binding sites increase TF genomic occupancy at ChIP-seq and ChIP-nexus peaks.

(a) ROC curves comparing the performance of PADIT-seq, PWM FIMO, and ProBound in distinguishing ChIP-seq peaks from background genomic regions for HOXD13, NKX2.5, TBX5 and EGR1. For HOXD13, in addition to random, length-matched background genomic intervals, false positives were also determined with the background sequences defined to be embryonic forelimb bud ATAC-seq peaks not overlapping HOXD13 ChIP-seq peaks. For Pho4 and Cbf1, ROC curves compare the performance of PADIT-seq and PWM FIMO in distinguishing ChIP-nexus peaks from background genomic regions (ProBound motifs were not available). The numbers of foreground and background genomic intervals were equal for all 6 TFs. (b) The sum of PADIT-seq activities of all the active k-mers in ChIP-seq and ChIP-nexus peaks is plotted against the corresponding read counts normalized to peak length for each TF. Pearson correlation coefficients and significance values are shown. (c) Pearson correlation coefficient between normalized ChIP-seq and ChIP-nexus read counts and PADIT-seq predictions when varying the number of top active k-mers included. Red horizontal line indicates the maximum correlation achieved using PADIT-seq. Blue horizontal line shows the correlation achieved using PWM FIMO log-likelihood scores summed across peaks, which yielded higher correlation coefficients than using maximum PWM scores alone.
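
The per-peak score in panel (b) is described as a sum of activities over all active k-mers in a peak. A minimal sketch of such a score is given below. It assumes a lookup table keyed by the lexicographically smaller of each k-mer and its reverse complement, with inactive k-mers simply absent. The table layout, function names and toy values are illustrative assumptions and may differ from the code in the authors' repository.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def summed_activity(seq: str, activity: dict[str, float], k: int) -> float:
    """Slide a k-wide window along seq and add up activities of active k-mers."""
    total = 0.0
    for i in range(len(seq) - k + 1):
        # k-mers missing from the table are treated as inactive (score 0).
        total += activity.get(canonical(seq[i:i + k]), 0.0)
    return total


# Toy table with k = 4 (hypothetical values, for illustration only).
toy_activity = {"ACGT": 2.0, "AACG": 1.0}
score = summed_activity("AACGTT", toy_activity, k=4)
```

In the toy sequence, three overlapping windows each contribute to the total, which is the behaviour the overlapping binding model relies on.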

Extended Data Fig. 4 ChIP-seq and ChIP-nexus peaks have significantly more consecutive active k-mers, irrespective of how background genomic sequences were defined.

(a) Histograms showing the distribution of consecutive active k-mers in peaks (red) for six TFs: HOXD13, NKX2.5, TBX5, and EGR1 (ChIP-seq), and Pho4 and Cbf1 (ChIP-nexus). Background regions were generated by selecting genomic sequences flanking each ChIP peak. Statistical significance was determined by two-sided Wilcoxon rank sum tests. For EGR1, the large effect size resulted in a P value below computational precision. (b) The vast majority of binding sites are detected with high statistical confidence. Across all six TFs, 56–76% of active k-mers are found at FDR < 0.01.

Extended Data Fig. 5 ChIP-seq and ChIP-nexus peaks are significantly enriched for consecutive active k-mers irrespective of FDR threshold.

Distribution of consecutive active k-mers within ChIP-seq and ChIP-nexus peaks (red) compared to random, length-matched genomic regions (black) for each TF at three different FDR thresholds, 1% (left panels), 5% (middle panels), and 10% (right panels). Statistical significance was determined by two-sided Wilcoxon rank sum tests.
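
Counting consecutive active k-mers at different FDR thresholds can be sketched as below. The FDR table, the function names and the choice to report the longest run per sequence are illustrative assumptions, and the toy table lists k-mers in the orientation in which they appear in the sequence.

```python
def active_kmers(fdr_by_kmer: dict[str, float], threshold: float) -> set[str]:
    """k-mers called active at the chosen FDR threshold."""
    return {kmer for kmer, fdr in fdr_by_kmer.items() if fdr < threshold}


def longest_active_run(seq: str, active: set[str], k: int) -> int:
    """Length of the longest stretch of consecutive (1-bp offset) active k-mers."""
    best = run = 0
    for i in range(len(seq) - k + 1):
        if seq[i:i + k] in active:
            run += 1
            best = max(best, run)
        else:
            run = 0  # an inactive k-mer breaks the run
    return best


# Hypothetical FDR values for three overlapping 4-mers.
toy_fdr = {"AACG": 0.005, "ACGT": 0.03, "CGTT": 0.08}
runs = {t: longest_active_run("AACGTTA", active_kmers(toy_fdr, t), k=4)
        for t in (0.01, 0.05, 0.10)}
```

Relaxing the threshold can only add k-mers to the active set, so the longest run never gets shorter as the FDR cut-off increases.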

Extended Data Fig. 6 Minor groove width (MGW) and propeller twist (ProT) at the extended recognition sequences bound by TFs are distinct from flanking genomic regions.

(a-b) Predicted MGW (a) and ProT (b) are shown for genomic regions containing consecutive active k-mers within ChIP-seq peaks (red) and their 4-bp flanking regions (blue): HOXD13 (13–15 bp with consecutive active 8-mers), NKX2.5 (11–13 bp with consecutive active 8-mers), TBX5 (10–12 bp with consecutive active 8-mers), and EGR1 (11–13 bp with consecutive active 9-mers). Adjusted P values < 0.05 from paired two-sided Wilcoxon rank sum tests are indicated by *.

Extended Data Fig. 7 ChIP-nexus footprint expansion persists after controlling for core motif positioning.

(a) For Pho4, the increased number of significant cuts and expanded footprint size observed for 5 overlapping binding sites is not because CACGTG occurs at more variable registers. Strand-specific distribution of ChIP-nexus 5′ ends (red: positive strand; black: negative strand) averaged across peaks containing exactly 5 consecutive overlapping active 8-mers for Pho4. (b) For Cbf1, an expanded footprint size was observed even after constraining the position of CACGTG at the third register. Strand-specific distribution of ChIP-nexus 5′ ends (red: positive strand; black: negative strand) averaged across peaks containing exactly 3 (top) or 5 (bottom) consecutive overlapping active 8-mers with CACGTG constrained to be exclusively at the third register. Peaks with 4 consecutive overlapping binding sites are not included because no significant cuts above background were observed due to low statistical power, which makes it difficult to objectively determine the size of footprints. (a-b) Dashed blue lines indicate the boundaries of the consecutive overlapping binding sites. Letters indicate cuts occurring significantly above background in the flanking 15-bp genomic regions (permutation test adjusted P < 0.05). Sequence logos above each plot show the relative frequency at which the 4 nucleotides occur at each position in the genomic sequences containing the indicated category of ChIP-nexus peaks. For these sequence logos, the y-axis represents information content ranging from a minimum of 0 to a maximum of 2.
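
The 0 to 2 range of the logo y-axis follows from the standard definition of per-position information content for a four-letter alphabet, 2 bits minus the Shannon entropy of the column. The sketch below uses that textbook definition without a small-sample correction. Whether the authors' plotting software applies such a correction is not stated, so treat this as an assumption.

```python
import math
from collections import Counter


def information_content(column: str) -> float:
    """Information content in bits of one aligned column of A/C/G/T letters."""
    counts = Counter(column)
    n = len(column)
    # Shannon entropy of the observed nucleotide frequencies.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return 2.0 - entropy  # log2(4) = 2 bits is the maximum for DNA


invariant = information_content("AAAA")  # one nucleotide only
uniform = information_content("ACGT")    # all four equally frequent
two_way = information_content("AACC")    # two nucleotides at 50% each
```

A fully conserved position reaches the maximum of 2 bits, whereas a position with all four nucleotides at equal frequency carries none.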

Extended Data Fig. 8 Overlapping binding sites additively increase TF occupancy in vitro.

(a) The fraction of HT-SELEX reads (y-axis) with consecutive overlapping PADIT-seq active k-mers after 0–4 rounds of selection (x-axis). (b) Box plots showing uPBM signal intensities for 60-bp probes (n ≈ 42,000) categorized by the number of consecutive overlapping active k-mers. The bounds of the box plots define the 25th, 50th and 75th percentiles, and whiskers extend to the furthest data points within 1.5× the interquartile range. Two-sided Wilcoxon tests for all pairwise comparisons have an adjusted P < 0.05 (not indicated). (c) HOXD13, NKX2.5, TBX5, Pho4, and Cbf1 10-mer PADIT-seq activity versus mean uPBM E-scores of constituent 8-mers, for 10-mers centered on a fixed 8-mer (green). EGR1 PADIT-seq activity of 10-mers containing ‘GCGTGGGTG’ (green) versus uPBM E-scores of constituent 9-mers. (d) For all the PADIT-seq active HOXD13, NKX2.5, TBX5, Pho4, and Cbf1 8-mers: distribution of Pearson correlation coefficients between 10-mer PADIT-seq activities and mean uPBM E-scores of constituent 8-mers. For all the PADIT-seq active EGR1 9-mers: correlations between 10-mer PADIT-seq activity and uPBM E-scores of constituent 9-mers.
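The two quantities used in this legend, the number of consecutive overlapping active k-mers in a sequence and the mean score of the 8-mers that make up a 10-mer, can be illustrated with a minimal Python sketch. The function names, the toy sequences and the toy scores are hypothetical and are not taken from the study; the sketch also assumes that the set of active k-mers already contains both orientations and that "consecutive" means successive 1-bp registers.

```python
def kmers(seq, k):
    """Return all k-mers of seq in left-to-right order, one per register."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def longest_active_run(seq, active, k):
    """Length of the longest run of active k-mers at successive 1-bp registers.

    Assumption: `active` is a set of k-mers called active (toy data here).
    """
    best = run = 0
    for kmer in kmers(seq, k):
        run = run + 1 if kmer in active else 0
        best = max(best, run)
    return best


def mean_constituent_score(seq, score, k=8):
    """Mean score over the constituent k-mers of a longer sequence.

    Assumption: `score` maps every constituent k-mer to a number
    (for example an E-score); a 10-mer has three constituent 8-mers.
    """
    parts = kmers(seq, k)
    return sum(score[p] for p in parts) / len(parts)


# Toy example (hypothetical sequences and scores).
toy_10mer = "ACGTACGTAC"
toy_active = {"ACGTACGT", "CGTACGTA"}
toy_score = {"ACGTACGT": 0.5, "CGTACGTA": 0.25, "GTACGTAC": 0.0}

run_length = longest_active_run(toy_10mer, toy_active, k=8)
mean_score = mean_constituent_score(toy_10mer, toy_score, k=8)
```

In this toy case the first two of the three 8-mers are active, so the longest run is 2, and the mean of the three toy scores is 0.25.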

Extended Data Fig. 9 PADIT-seq outperforms MotifBreakR in predicting variant effects on TF binding and gene expression.

(a-b) PADIT-seq outperforms MotifBreakR in predicting SNP effects on TF binding. Representative variants, rs62523478 (a) and rs2914146 (b), whose effects on HOXD13 and EGR1 binding, respectively, cannot be explained by PWM models. (Right) Box plots show custom PBM signals from 16 probes for the reference and alternate alleles. (Left) PADIT-seq k-mers tiled across the reference and alternate alleles. Red points represent k-mers that are active in both alleles; green points represent k-mers that are uniquely active in the allele with higher TF binding. The bounds of the box plots define the 25th, 50th and 75th percentiles, and whiskers extend to the furthest data points within 1.5× the interquartile range. (c) Comparison of PADIT-seq and MotifBreakR predictions with ADASTRA EGR1 ChIP-seq allelic skew measurements. (d-f) Noncoding variants alter multiple overlapping EGR1 binding sites to influence gene expression. (d) Allelic skew in MPRA activity of 149 SNPs with differential EGR1 PADIT-seq activity. Shapes represent the different studies from which allelic skew in MPRA activity was obtained. Red shapes correspond to concordant directions of effect between differential EGR1 PADIT-seq activity and MPRA allelic skew. (e) Euler diagram of variants with MPRA allelic skew predicted by MotifBreakR to alter EGR1 binding (green) and with differential PADIT-seq activity (blue). (Bottom) Box plots comparing MotifBreakR (left) and PADIT-seq (right) effect sizes. The bounds of the box plots define the 25th, 50th and 75th percentiles, and whiskers extend to the furthest data points within 1.5× the interquartile range. * indicates two-sided Wilcoxon rank sum test P value < 0.05. (f) Number of active overlapping 9-mers altered by the 149 variants with differential PADIT-seq activity. Brown variants are predicted by MotifBreakR to alter EGR1 binding, whereas blue variants are not.
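Tiling k-mers across the two alleles of a SNP, as described for panels a, b and f, can be sketched as follows. This is an illustrative sketch only: the function name `tile_snp`, the toy sequence and the toy active set are hypothetical, and activity is treated as a binary call, which may be simpler than the quantitative comparison used in the study. A single-nucleotide change falls inside up to k overlapping k-mers, so one variant can gain or lose several sites at once.

```python
def tile_snp(ref_seq, pos, alt_base, active, k):
    """Classify every k-mer window that overlaps a SNP.

    ref_seq  : reference sequence around the variant
    pos      : 0-based position of the variant within ref_seq
    alt_base : alternate nucleotide
    active   : set of k-mers called active (assumed to hold both orientations)
    Returns (shared, ref_only, alt_only) counts of active windows.
    """
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    first = max(0, pos - k + 1)           # leftmost window covering pos
    last = min(len(ref_seq) - k, pos)     # rightmost window covering pos
    shared = ref_only = alt_only = 0
    for i in range(first, last + 1):
        ref_active = ref_seq[i:i + k] in active
        alt_active = alt_seq[i:i + k] in active
        if ref_active and alt_active:
            shared += 1
        elif ref_active:
            ref_only += 1
        elif alt_active:
            alt_only += 1
    return shared, ref_only, alt_only


# Toy example with k = 3 (hypothetical data): C>T at position 4.
toy_counts = tile_snp("AAACCCGGG", 4, "T", {"ACC", "CCC", "TCG"}, k=3)
```

Here the three windows covering the variant are ACC, CCC and CCG on the reference allele and ACT, CTC and TCG on the alternate allele, so the toy variant loses two active k-mers and gains one.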

Extended Data Fig. 10 ‘Weavability’ of binding sites is an inherent property of TFs from different DBD classes.

(a) Schema to demonstrate the logic of network construction. All incoming and outgoing edges to and from the HOXD13 active 8-mer ‘ACTTTACT’ are shown. Active 8-mers are colored red, inactive 8-mers are colored black. Edges occur between active k-mers only. (b) (left) Network representation of HOXD13 active 8-mers and reverse complements (n = 3,536), connected by directed edges (arrows not shown). 3,446 out of 3,536 nodes (97.5%) form the largest, single connected component. (right) HOXD13 PADIT-seq activity of active 8-mers is plotted against the total number of incoming and outgoing edges per node. The bounds of the box plots define the 25th, 50th and 75th percentiles, and whiskers extend to the furthest data points within 1.5× the interquartile range. * Adjusted P < 0.05, two-sided Wilcoxon tests. (c) (left) Network representation of randomly selected 8-mers and reverse complements (n = 3,536) connected by directed edges (arrows not shown). Only 7 out of 3,536 nodes (0.2%) form the largest, single connected component. (right) HOXD13 PADIT-seq activity of the 3,536 random 8-mers is plotted against the total number of incoming and outgoing edges per node. (d-h) Network representations and activity distributions for NKX2.5, TBX5, EGR1, Pho4, and Cbf1. For each TF, the fraction of active k-mers in the largest, single connected component is indicated. (i) Among the top 500 uPBM 8-mers and reverse complements, the fraction of nodes in the largest, single connected component is plotted for 200 TFs from 9 different families of DBDs. 1,000 random samplings of 500 8-mers and reverse complements are also shown, and were used to perform the permutation test.
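The network statistic reported in this legend, the fraction of nodes in the largest connected component, can be sketched in plain Python. The edge rule used here is an assumption made for illustration and is not stated in the legend: a directed edge runs from k-mer u to k-mer v when the last k−1 bases of u equal the first k−1 bases of v (a 1-bp shift). Components are taken to be weakly connected, that is, edge direction is ignored. All names and toy k-mers are hypothetical.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA sequence over the alphabet ACGT."""
    return seq.translate(_COMPLEMENT)[::-1]


def largest_component_fraction(active_kmers):
    """Fraction of nodes in the largest weakly connected component.

    Nodes are the active k-mers plus their reverse complements.
    Assumed edge rule: u -> v if u[1:] == v[:-1] (1-bp shifted overlap).
    """
    nodes = set(active_kmers) | {revcomp(s) for s in active_kmers}

    # Index nodes by their (k-1)-base prefix to find successors quickly.
    by_prefix = defaultdict(list)
    for node in nodes:
        by_prefix[node[:-1]].append(node)

    # Store each edge in both directions; direction is ignored for components.
    neighbours = defaultdict(set)
    for u in nodes:
        for v in by_prefix.get(u[1:], []):
            if v != u:  # skip self-loops such as TTTT -> TTTT
                neighbours[u].add(v)
                neighbours[v].add(u)

    # Depth-first search over every unvisited node.
    seen = set()
    largest = 0
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        size = 0
        while stack:
            current = stack.pop()
            size += 1
            for nxt in neighbours[current]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        largest = max(largest, size)
    return largest / len(nodes)


# Toy example with 4-mers (hypothetical data).
toy_fraction = largest_component_fraction(["ACGG", "CGGA", "TTTT"])
```

With these toy 4-mers there are six nodes after adding reverse complements; ACGG connects to CGGA and TCCG connects to CCGT, while TTTT and AAAA stay isolated, so the largest component holds two of the six nodes. Running the same function on randomly drawn k-mers of the same number gives the kind of null comparison described for panels c and i.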

Supplementary information

Supplementary Figure 1

Conservation analysis of ChIP–seq peaks containing varying numbers of overlapping, PADIT-seq active k-mers.

Reporting Summary

Peer Review File

Supplementary Tables 1–6

Supplementary Table 1: List of primers used. Supplementary Table 2: List of gBlock and vector DNA sequences. Supplementary Table 3: Small-scale library design and sequence. Supplementary Table 4: PADIT-seq library sequencing statistics. Supplementary Table 5: PADIT-seq active and inactive k-mers. Supplementary Table 6: Processed custom PBM data.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Khetan, S., Carroll, B.S. & Bulyk, M.L. Multiple overlapping binding sites determine transcription factor occupancy. Nature (2025). https://doi.org/10.1038/s41586-025-09472-3

Received: 15 June 2025
Accepted: 28 July 2025
Published: 03 September 2025
DOI: https://doi.org/10.1038/s41586-025-09472-3