Abstract
Database searching is an essential element of large-scale proteomics. Because these methods are widely used, it is important to understand the rationale behind the algorithms. Most algorithms are based on concepts first developed in SEQUEST and PeptideSearch. Four basic approaches are used to determine a match between a spectrum and a sequence: descriptive, interpretative, stochastic and probability-based matching. We review the basic concepts used by most search algorithms, the computational modeling of peptide identification, and the current challenges and limitations of this approach for protein identification.
Acknowledgements
The authors would like to acknowledge funding from the US National Institutes of Health (R01 MH067880, DK067598-01, ES012021 and RR11823-09).
Ethics declarations
Competing interests
J.R.Y. is an inventor on the SEQUEST patent owned by the University of Washington and licenced by the University to a commercial company.
About this article
Cite this article
Sadygov, R., Cociorva, D. & Yates, J. Large-scale database searching using tandem mass spectra: Looking up the answer in the back of the book. Nat Methods 1, 195–202 (2004). https://doi.org/10.1038/nmeth725
DOI: https://doi.org/10.1038/nmeth725


