  • Review Article

Large-scale database searching using tandem mass spectra: Looking up the answer in the back of the book

Abstract

Database searching is an essential element of large-scale proteomics. Because these methods are widely used, it is important to understand the rationale of the algorithms. Most algorithms are based on concepts first developed in SEQUEST and PeptideSearch. Four basic approaches are used to determine a match between a spectrum and sequence: descriptive, interpretative, stochastic and probability-based matching. We review the basic concepts used by most search algorithms, the computational modeling of peptide identification and current challenges and limitations of this approach for protein identification.
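
As a rough illustration of the general idea of matching a spectrum against database sequences, the sketch below filters candidate peptides by precursor mass and ranks them by a simple shared-peak count. This score is a deliberately simplified stand-in and is not the scoring used by SEQUEST, PeptideSearch or any other tool. The function names, the tolerances, the toy database and the noise-free synthetic spectrum are all assumptions made for this example.

```python
# Monoisotopic residue masses in daltons (standard values, rounded).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def peptide_mass(seq):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER


def theoretical_peaks(seq):
    """Singly charged b and y ion m/z values for every backbone cleavage."""
    total = peptide_mass(seq)
    prefix = 0.0
    peaks = []
    for aa in seq[:-1]:
        prefix += RESIDUE_MASS[aa]
        peaks.append(prefix + PROTON)          # b ion (N-terminal fragment)
        peaks.append(total - prefix + PROTON)  # complementary y ion
    return peaks


def shared_peak_count(observed, seq, tol=0.5):
    """Count theoretical peaks that have an observed peak within tol (Da)."""
    return sum(
        any(abs(o - t) <= tol for o in observed)
        for t in theoretical_peaks(seq)
    )


def search(observed, precursor_mass, database,
           precursor_tol=1.0, fragment_tol=0.5):
    """Rank database peptides whose mass fits the precursor, best first."""
    candidates = [p for p in database
                  if abs(peptide_mass(p) - precursor_mass) <= precursor_tol]
    scored = [(shared_peak_count(observed, p, fragment_tol), p)
              for p in candidates]
    return sorted(scored, reverse=True)


# Toy database (hypothetical): the target, an isobaric permutation, two decoys.
database = ["IYEVEGMR", "IYEVGEMR", "LGEYGFQNALIVR", "AAAAK"]

# Synthetic, noise-free "observed" spectrum generated from the target itself.
observed = theoretical_peaks("IYEVEGMR")
precursor = peptide_mass("IYEVEGMR")

for score, peptide in search(observed, precursor, database):
    print(score, peptide)
```

Real spectra contain noise, missing fragments and multiply charged ions, which is why practical algorithms use richer scoring than a raw peak count.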

Figure 1: Overview of the protein identification process.
Figure 2: Simplified representation of an MS/MS spectrum for the peptide IYEVEGMR.
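
The peak positions in a simplified spectrum such as the one described for Figure 2 can be approximated by computing the singly charged b and y fragment ions of the peptide. The following is a minimal sketch, assuming an unmodified peptide, monoisotopic masses and charge +1 only. The function name `fragment_ions` is hypothetical, and the figure itself is not available here, so this is not a reproduction of it.

```python
# Monoisotopic residue masses in daltons (standard values, rounded).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # added to y ions (C-terminal fragment keeps the OH and an H)
PROTON = 1.007276   # charge carrier for singly protonated ions


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values.

    b_ions[i] holds the first i+1 residues; y_ions[i] holds the last i+1.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(masses)):
        b_ions.append(sum(masses[:i]) + PROTON)
        y_ions.append(sum(masses[-i:]) + WATER + PROTON)
    return b_ions, y_ions


b_ions, y_ions = fragment_ions("IYEVEGMR")
for i, (b, y) in enumerate(zip(b_ions, y_ions), start=1):
    print(f"b{i} = {b:9.4f}    y{i} = {y:9.4f}")
```

Because isoleucine and leucine have identical residue masses, this calculation cannot distinguish them, so a peptide beginning with L instead of I would give the same fragment list.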

Acknowledgements

The authors would like to acknowledge funding from the US National Institutes of Health (R01 MH067880, DK067598-01, ES012021 and RR11823-09).

Author information

Corresponding author

Correspondence to John R Yates III.

Ethics declarations

Competing interests

J.R.Y. is an inventor on the SEQUEST patent owned by the University of Washington and licensed by the University to a commercial company.

About this article

Cite this article

Sadygov, R., Cociorva, D. & Yates, J. Large-scale database searching using tandem mass spectra: Looking up the answer in the back of the book. Nat Methods 1, 195–202 (2004). https://doi.org/10.1038/nmeth725

  • DOI: https://doi.org/10.1038/nmeth725
