Extended Data Fig. 2: Proteogenomics and dynamic range of transcript and protein expression.
From: Mass-spectrometry-based draft of the Arabidopsis proteome

a, Number of identified N-terminal (NT) or C-terminal (CT) peptides of proteins in either unmodified or phosphorylated form. b, Frequency of amino acids following the initiator methionine in N-terminal peptides with (−X) or without (M−X) cleavage of the initiator methionine. X denotes the amino acid after the start codon. c, Frequency of protein N-terminal acetylation for amino acids in b. Because trypsin was used for protein digestion, the frequencies for Arg and Lys residues could not be determined (n.d.). d, Distribution of peptide-based sequence coverage of proteins in individual tissues and for the combined dataset (tissue abbreviations as in Fig. 1). Boxes contain 50% of the data and show the median as a black line. The top and bottom quartile ranges are shown as whiskers. The number of proteins is indicated for each tissue. e, Pie charts showing the percentage of proteins identified by <3, 3–10 or >10 peptides either allowing shared (razor) peptides or restricting to unique peptides only. f, Left, number of protein isoforms detected at the transcript and protein level compared with the number of all annotated isoforms in Araport11. Right, number of multiple isoforms of the same gene distinguished at the peptide level. g, Validation of protein isoform and sORF identification by comparing the tandem mass spectra from the tissue atlas to those of synthetic peptide reference standards. The normalized spectral contrast angle (SA) was used as a similarity metric (Methods). Candidate isoforms and sORFs were considered valid if the spectral contrast angle of the spectra was >0.7. These data are reported in Supplementary Data 3. h, Amino acid sequence and mirror plots of tandem mass spectra for two peptides of the sORF BIP138_4. The spectra pointing upwards were collected from tissue digests; those pointing downwards were collected from synthetic peptides. 
The normalized spectral contrast angle and Pearson’s correlation coefficient (r) were used as similarity metrics (Methods) and indicate that both high-scoring spectra (n = 1 acquired spectrum) are nearly identical, thus validating the identification of this sORF as an expressed protein. i, Dynamic range of transcript abundance (grey) and proportion of transcripts that were also identified at the protein level projected into this plot (blue). OM, orders of magnitude. Note that fewer proteins were detected for lower-abundance transcripts. j, Dynamic range of protein abundance and proportion of proteins with phosphorylation evidence. Protein abundance spans six orders of magnitude, whereas transcript abundance spans only four (i). In addition, note that phosphorylation was detected across the entire protein abundance range. k, Percentage of all annotated kinases (K), phosphatases (P), transcription factors (TF) and transcription regulators (TR) detected at the transcript, protein or phosphoprotein levels. Numbers below the x axis denote the number of genes in each of these protein classes in the A. thaliana genome.
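The similarity metrics used in panels g and h are defined in the Methods, which are not reproduced here. As a minimal sketch, the Python below assumes the commonly used form of the normalized spectral contrast angle, SA = 1 − 2·arccos(cos θ)/π, where cos θ is the normalized dot product of two fragment-ion intensity vectors. It also assumes that the tissue and synthetic spectra have already been aligned to the same ordered set of fragment ions. All function names are hypothetical; only the >0.7 validity threshold comes from the legend.

```python
import math

# Validity threshold stated in the legend for panel g.
SA_THRESHOLD = 0.7


def normalized_spectral_contrast_angle(a, b):
    """SA = 1 - 2*arccos(cos_theta)/pi for two aligned intensity vectors.

    Assumed definition (not taken from the legend). For non-negative
    intensities, 1.0 means identical relative intensities and 0.0 means
    the spectra share no fragment ions.
    """
    if len(a) != len(b):
        raise ValueError("intensity vectors must be aligned to the same ions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: an empty spectrum matches nothing.
        return 0.0
    # Clamp to [-1, 1] to guard against floating-point rounding before acos.
    cos_theta = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return 1.0 - 2.0 * math.acos(cos_theta) / math.pi


def pearson_r(a, b):
    """Pearson's correlation coefficient between two aligned intensity vectors."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two aligned vectors with at least two values")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_a * var_b)


def is_valid_match(tissue, synthetic, threshold=SA_THRESHOLD):
    """Apply the legend's criterion: a candidate is valid if SA > threshold."""
    return normalized_spectral_contrast_angle(tissue, synthetic) > threshold
```

For example, `is_valid_match(tissue_intensities, synthetic_intensities)` would flag a candidate isoform or sORF peptide as validated under these assumptions, with both arguments being hypothetical lists of matched fragment-ion intensities.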