Supplementary Figure 3: Expectation-maximization-derived false discovery rate estimates are robust.
From: TagGraph reveals vast protein modification landscapes from large tandem mass spectrometry datasets

a) Randomized starting model guesses for expectation-maximization (EM)-based training of the hierarchical Bayes model rapidly converged and yielded highly consistent probability estimates. b) Five-fold cross-validation demonstrated that training the EM-optimized hierarchical Bayes model on different subsets of spectra did not substantially affect spectrum scores. Each model was trained and tested on distinct sets of spectra. Pairwise comparisons between the cross-validation analyses are shown in the matrix as follows. The diagonal represents the overall score distribution (−60 < EM score < 60), analogous to Fig. 2a, with the number of confidently assigned spectra (EM score > 2) indicated (green box). The lower-left scatter plots compare EM score values for each pair of cross-validation sets, noting the correlation (R2) for the entire score range (−60 < EM score < 60). No spectra were found with conflicting (that is, positive and negative) scores, but some cross-validation pairs differed in the very high EM score sub-population (>20). This caused some deviation from perfect correlation (R2 > 0.90). In the range −20 < EM score < 20 (yellow box, upper-right scatter plots), which contains over 94% of confident identifications, the correlation was markedly closer to unity (R2 > 0.99). Further details of the cross-validation procedure can be found in Supplementary Note 4D. c) The A375 dataset was searched against a randomized (Markov chain length = 4) protein sequence database as previously described (Elias, J. E. & Gygi, S. P. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat. Methods 4, 207–14 (2007)), and the results were compared with search results generated from the standard human proteome database used in Figs. 1 and 2. The peptide-spectrum matches from the standard and Markov database searches were divided into two bins: those with EM-estimated probabilities greater than 0.5 (positive EM scores; blue) and those with EM probabilities less than 0.5 (negative EM scores; gray).
Three attributes and corresponding thresholds were selected because they were expected to depend little on the type of dataset searched: high spectrum score; long matching substring length; and large modification mass (see Supplementary Note 4B for definitions). Discriminant scores comparing the high- and low-EM distributions above each indicated threshold (yellow boxes) were calculated (Supplementary Note 4E). Discriminant values above 1 were deemed to indicate strong differences between the high- and low-EM distributions; values in green and red are consistent with appropriate and inappropriate search spaces, respectively. d) Discriminant scores derived from the attributes shown in Supplementary Fig. 4b reliably indicate low- and high-confidence analyses across multiple searches, including the A375 dataset searched against the Markov-modeled proteome (word lengths of 1–4; red), the A375 dataset searched against the standard human proteome (light green), and several arbitrarily selected datasets from the Kim et al. proteome (dark green, from left to right: Adult Kidney/bRP/Velos; Adult Kidney/Gel/Elite; Adult Liver/bRP/Elite; Adult Liver/bRP/Velos; Adult Monocytes/bRP/Elite; Adult Monocytes/bRP/Velos; Adult Platelets/Gel/Elite; Adult Retina/Gel/Elite; Fetal Brain/bRP/Elite; Fetal Brain/Gel/Velos). e) Despite major model differences between true and randomized database searches, the EM model can still produce high-scoring results from randomized searches (red), which stem from high-quality underlying de novo sequences. Searches against standard databases, however, tend to have greater separation between correct and incorrect distributions (green).
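
The pairwise comparison described in panel b can be illustrated with a minimal Python sketch. It assumes that two cross-validation models have each assigned an EM score to the same list of spectra. That pairing is an assumption made for illustration, and the actual procedure is the one given in Supplementary Note 4D. The function names (r_squared, r_squared_in_range, conflicting_indices) are hypothetical and are not part of TagGraph. R2 is taken here to be the squared Pearson correlation, which is also an assumption.

```python
from math import fsum


def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two points")
    n = len(xs)
    mean_x = fsum(xs) / n
    mean_y = fsum(ys) / n
    sxx = fsum((x - mean_x) ** 2 for x in xs)
    syy = fsum((y - mean_y) ** 2 for y in ys)
    sxy = fsum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("scores have zero variance")
    return (sxy * sxy) / (sxx * syy)


def r_squared_in_range(xs, ys, lo, hi):
    """R2 restricted to spectra whose scores from both models lie in (lo, hi)."""
    kept = [(x, y) for x, y in zip(xs, ys) if lo < x < hi and lo < y < hi]
    return r_squared([x for x, _ in kept], [y for _, y in kept])


def conflicting_indices(xs, ys):
    """Indices of spectra scored positive by one model and negative by the other."""
    return [i for i, (x, y) in enumerate(zip(xs, ys)) if x * y < 0]


# Hypothetical EM scores for the same five spectra from two models.
model_a = [-30.0, 1.0, 2.0, 3.0, 50.0]
model_b = [-30.0, 2.0, 4.0, 6.0, 10.0]

full_range = r_squared_in_range(model_a, model_b, -60, 60)
core_range = r_squared_in_range(model_a, model_b, -20, 20)
conflicts = conflicting_indices(model_a, model_b)
```

In this toy input the two models disagree only on the very high-scoring spectrum, so the restricted range (−20 < EM score < 20) gives a higher R2 than the full range, mirroring the pattern described for the yellow-boxed scatter plots.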