Supplementary Figure 9: Success of full-length peptide identifications in the three peptide sets generated in this study
From: Building ProteomeTools based on a complete synthetic human proteome

We measured the success of each synthesis as the fraction of peptides in a pool that could be identified by LC-MS/MS (the fragmentation modes are indicated in each plot; all HCD collision energies were combined). Apart from a 1% peptide FDR, no additional score cutoff was applied. For the ‘proteotypic’ set (top panel), recoveries were generally very high (average ~95%) and decreased only for very long peptides (high pool numbers), presumably because it becomes increasingly difficult to obtain a full-length peptide. For the ‘missing gene’ set (middle panel), recoveries were lower (average ~80%), likely because of lower success in the LC-MS/MS analysis (e.g. solubility, ionization efficiency or fragmentation efficiency). This was expected, given that these peptides were predicted from the protein sequences regardless of any prior observation in biological sources. The recovery of the ‘SRMAtlas’ set (bottom panel) was lower still (average ~65%), possibly, among other factors, because these peptides had been synthesized ~6 years before our analysis and because this set contains peptides representing N-linked glycosylation sites after PNGase F digestion, which we did not account for in the database search.
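The per-pool recovery described above can be illustrated with a minimal Python sketch. It is not the pipeline used in the study. The function name `pool_recovery`, the input layout, and the example sequences are all hypothetical, and the identifications are assumed to have already been filtered at 1% peptide FDR. Only exact matches to a synthesized sequence are counted, which stands in for a full-length identification.

```python
from collections import defaultdict


def pool_recovery(synthesized, identified):
    """Return the fraction of synthesized peptides identified in each pool.

    synthesized: dict mapping pool id -> set of intended peptide sequences
                 (hypothetical layout, chosen for illustration).
    identified:  iterable of (pool id, peptide sequence) pairs, assumed to
                 be already filtered at 1% peptide FDR.
    """
    found = defaultdict(set)
    for pool, peptide in identified:
        # Count only exact matches to an intended (full-length) sequence;
        # repeated identifications of the same peptide count once.
        if peptide in synthesized.get(pool, ()):
            found[pool].add(peptide)
    return {
        pool: len(found[pool]) / len(peptides)
        for pool, peptides in synthesized.items()
        if peptides  # skip empty pools to avoid division by zero
    }


# Made-up example data, not taken from the study.
synthesized = {
    "pool_1": {"PEPTIDEK", "SAMPLER", "ANOTHERK", "LASTPEPK"},
    "pool_2": {"AAAK", "CCCR"},
}
identified = [
    ("pool_1", "PEPTIDEK"),
    ("pool_1", "SAMPLER"),
    ("pool_1", "ANOTHERK"),
    ("pool_1", "PEPTIDEK"),  # duplicate identification
    ("pool_2", "AAAK"),
    ("pool_2", "TRUNCK"),    # not an intended sequence, ignored
]
recovery = pool_recovery(synthesized, identified)
```

With this toy input, three of the four peptides in `pool_1` and one of the two in `pool_2` are recovered. Averaging such per-pool fractions over a peptide set would give a set-level figure comparable in kind to the averages quoted in the caption.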