  • Analysis

Evaluation of methods for modeling transcription factor sequence specificity

Abstract

Genomic analyses often involve scanning for potential transcription factor (TF) binding sites using models of the sequence specificity of DNA-binding proteins. Many approaches have been developed to model and learn a protein's DNA-binding specificity, but these methods have not been systematically compared. Here we applied 26 such approaches to in vitro protein binding microarray data for 66 mouse TFs belonging to various families. For nine TFs, we also scored the resulting motif models on in vivo data, and found that the best in vitro–derived motifs performed similarly to motifs derived from the in vivo data. Our results indicate that simple models based on mononucleotide position weight matrices trained by the best methods perform similarly to more complex models for most TFs examined, but fall short in specific cases (<10% of the TFs examined here). In addition, the best-performing motifs typically have relatively low information content, consistent with widespread degeneracy in eukaryotic TF sequence preferences.
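
To make two terms from the abstract concrete, the following is a minimal Python sketch of how a mononucleotide position weight matrix (PWM) can be used to scan a sequence, and how a motif's information content can be computed. It is an illustration under stated assumptions, not the evaluation code used in the study: the example matrix, the uniform background of 0.25 per base, the pseudocount, and the function names `to_log_odds`, `best_site` and `information_content` are all hypothetical.

```python
import math

BASES = "ACGT"

# Hypothetical 4-position motif (rows = positions, columns = A, C, G, T).
# The probabilities are invented for illustration, not taken from the study.
EXAMPLE_PPM = [
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.10, 0.70, 0.10],
    [0.25, 0.25, 0.25, 0.25],  # fully degenerate position
    [0.10, 0.70, 0.10, 0.10],
]


def to_log_odds(ppm, background=0.25, pseudocount=1e-3):
    """Convert a position probability matrix into a log2-odds PWM.

    Assumes a uniform background; the pseudocount avoids log(0).
    """
    pwm = []
    for row in ppm:
        total = sum(row) + 4 * pseudocount
        pwm.append(
            [math.log2(((p + pseudocount) / total) / background) for p in row]
        )
    return pwm


def score_window(pwm, window):
    """Sum the per-position log-odds scores; positions are independent."""
    return sum(pwm[i][BASES.index(base)] for i, base in enumerate(window))


def best_site(pwm, sequence):
    """Return (score, offset) of the best window on the forward strand."""
    width = len(pwm)
    scored = [
        (score_window(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(scored)


def information_content(ppm):
    """Total information content in bits, relative to a uniform background."""
    bits = 0.0
    for row in ppm:
        entropy = -sum(p * math.log2(p) for p in row if p > 0)
        bits += 2.0 - entropy  # 2 bits is the maximum per DNA position
    return bits


if __name__ == "__main__":
    pwm = to_log_odds(EXAMPLE_PPM)
    score, offset = best_site(pwm, "TTAGACTT")
    print(f"best window starts at offset {offset} with score {score:.2f}")
    print(f"information content: {information_content(EXAMPLE_PPM):.2f} bits")
```

The sketch scans only the forward strand and accepts only the characters A, C, G and T; a practical scanner would also score the reverse complement. A degenerate motif, in the sense used in the abstract, is one whose rows are closer to uniform, which lowers the value returned by `information_content`.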

Figure 1: Evaluation criteria used in this study.
Figure 2: Comparison of algorithm performance by TF.
Figure 3: Comparison of algorithm performance on in vivo data.
Figure 4: Characteristics of Klf9 motifs produced by the eight PWM-based algorithms evaluated in this study.

Accession codes

Accessions

Gene Expression Omnibus

Acknowledgements

We thank H. van Bakel and M. Albu for database assistance, and members of the Hughes laboratory for helpful discussion. M.T.W. was supported by fellowships from the Canadian Institutes of Health Research (CIHR) and the Canadian Institute for Advanced Research (CIFAR) Junior Fellows Genetic Networks Program. This work was supported in part by the Ontario Research Fund and Genome Canada through the Ontario Genomics Institute, and the March of Dimes (T.R.H.). Funding was also provided by Operating Grant MOP-77721 from CIHR to T.R.H. and M.L.B., and grant no. R01 HG003985 from the US National Institutes of Health/National Human Genome Research Institute to M.L.B., as well as US National Institutes of Health grants R01HG003008 and U54CA121852 and a John Simon Guggenheim Foundation Fellowship to H.J.B. M.A., K.L., H.L. and M.L. were supported by the Academy of Finland (project 260403) and EU ERASysBio ERA-NET. Y.O., C.L. and R.S. were funded by the European Community's Seventh Framework Programme under grant agreement no. HEALTH-F4-2009-223575 for the TRIREME project, and by the Israel Science Foundation (grant no. 802/08). Y.O. was supported in part by a fellowship from the Edmond J. Safra Bioinformatics Program at Tel Aviv University. J.G., I.G., S.P. and J.K. were supported by grant XP3624HP/0606T by the Ministry of Culture of Saxony-Anhalt. A.M. was supported by US National Science Foundation (NSF) grant PHY-1022140. C.C. was supported by NSF grant PHY-0957573. J.B.K. was supported by the Simons Center for Quantitative Biology at Cold Spring Harbor Laboratory.

Author information

Contributions

M.T.W. and T.R.H. wrote the manuscript. T.R.H., M.T.W., M.L.B. and A.V. conceived of the study. M.T.W. did the majority of the computational analyses. M.A., Y.Z. and T.R.R. did additional computational analyses. A.C. and S.T. performed the PBM experiments. T.R.H., M.T.W., G.S. and R.N. designed and carried out the DREAM5 TF challenge. The DREAM5 Consortium and M.A. participated in the DREAM5 TF challenge. R.N., J.S.-R., T.C. and M.T.W. designed and created the prediction server. M.L.B., G.S., Q.D.M. and H.J.B. provided critical feedback on the manuscript.

Corresponding author

Correspondence to Timothy R Hughes.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Text and Figures

Supplementary Notes 1–9, Supplementary Tables 1–8 and Supplementary Figures 1–4 (PDF 9689 kb)

Supplementary Table 1

Information on transcription factors and associated experiments (XLSX 34 kb)

Supplementary Table 3

Full evaluations for all algorithms, by TF (XLSX 92 kb)

Supplementary Table 6

Improvement of secondary over primary motifs, for each TF (XLSX 48 kb)

Supplementary Table 7

Full comparison to ChIP-seq and ChIP-exo data (XLSX 26 kb)

Supplementary Table 8

Information on plasmids used for PBMs in this study (XLSX 46 kb)

Supplementary Code 1

Final set of PWMs for each transcription factor (ZIP 33 kb)

Supplementary Code 2

Algorithm source code (ZIP 4133 kb)

About this article

Cite this article

Weirauch, M., Cote, A., Norel, R. et al. Evaluation of methods for modeling transcription factor sequence specificity. Nat Biotechnol 31, 126–134 (2013). https://doi.org/10.1038/nbt.2486

  • DOI: https://doi.org/10.1038/nbt.2486
