Fig. 2: Mucin-domain candidacy algorithm for confident assignment of mucin-domain glycoproteins.

A Known mucins in HeLa lysate enrichment. HeLa lysate was subjected to the enrichment procedure described in Fig. 1, and known mucin-domain glycoproteins (MUC1, MUC13, MUC16, DAF, and SDC1) are labeled. Source data are in Supplementary Data 3. Significance testing was performed using a two-tailed t-test with 250 randomizations to correct for multiple comparisons, an FDR of 0.01, and an S0 value of 2.
B Mucin-domain glycoprotein candidate annotation. A mucin-domain candidacy algorithm was created to assign Mucin Scores, which indicate confidence that a given protein contains a mucin domain. First, predicted O-GalNAc sites were generated with the NetOGlyc4.0 tool, curated lists of phosphosites were downloaded from PhosphoSitePlus and UniProt, and cellular localization GO terms were downloaded. The mucin-domain candidacy algorithm then removed predicted O-GalNAc sites overlapping with known phosphosites, calculated the ratio of threonine to serine residues (T/S-ratio), evaluated protein subcellular localization, and checked the frequency and density of predicted O-GalNAc sites. These metrics were used to calculate a Mucin Score, which could then be used to evaluate mucinome enrichment. The entire human proteome was processed with the mucin-domain candidacy algorithm; using manually curated benchmarks, 357 proteins were assigned as containing mucin domains (~2% of the human proteome). The cell image is from the UniProt database (ref. 52) and is licensed under CC BY 4.0.
C Mucinome of HeLa lysate. The results in A were processed with the mucin-domain candidacy algorithm, and mucin-domain glycoproteins are labeled according to their Mucin Score: red signifies a score of >2 (high confidence), orange 1.5–2 (medium confidence), and yellow 1.2–1.5 (low confidence). Known mucin-domain glycoproteins labeled in A remain labeled in green. Source data are in Supplementary Data 3.
Significance testing was performed using a two-tailed t-test with 250 randomizations to correct for multiple comparisons, an FDR of 0.01, and an S0 value of 2.
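
The candidacy steps in B and the confidence tiers in C can be illustrated with a minimal Python sketch. It is not the published algorithm: the caption does not give the scoring formula, so `toy_mucin_score`, its `window` and `density_weight` parameters, the cap on the T/S-ratio, and the `ProteinRecord` fields are all hypothetical placeholders. Only the order of the steps and the tier cut-offs (>2, 1.5–2, 1.2–1.5) come from the caption, and the handling of scores falling exactly on a cut-off is an assumption.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass
class ProteinRecord:
    """Hypothetical container for the per-protein inputs named in panel B."""
    sequence: str
    predicted_oglycosites: set[int]      # 1-based positions, e.g. from NetOGlyc4.0
    known_phosphosites: set[int]         # e.g. from PhosphoSitePlus / UniProt
    is_extracellular_or_membrane: bool   # summarised from localization GO terms


def filter_sites(rec: ProteinRecord) -> set[int]:
    """Drop predicted O-GalNAc sites that overlap known phosphosites."""
    return rec.predicted_oglycosites - rec.known_phosphosites


def ts_ratio(sequence: str) -> float:
    """Ratio of threonine to serine residues (T/S-ratio)."""
    t, s = sequence.count("T"), sequence.count("S")
    if s == 0:
        return float("inf") if t else 0.0
    return t / s


def max_sites_in_window(sites: set[int], window: int) -> int:
    """Largest number of sites falling within any stretch of `window` residues."""
    ordered = sorted(sites)
    best, left = 0, 0
    for right, pos in enumerate(ordered):
        while pos - ordered[left] >= window:
            left += 1
        best = max(best, right - left + 1)
    return best


def toy_mucin_score(rec: ProteinRecord, window: int = 50,
                    density_weight: float = 10.0) -> float:
    """Placeholder score (ASSUMPTION): combines localization, T/S-ratio and
    site density. The real Mucin Score formula is not stated in the caption."""
    sites = filter_sites(rec)
    if not sites or not rec.is_extracellular_or_membrane:
        return 0.0
    density = max_sites_in_window(sites, window) / window
    capped_ts = min(ts_ratio(rec.sequence), 2.0)  # cap is an arbitrary choice
    return round(capped_ts * density * density_weight, 2)


def confidence_tier(score: float) -> str:
    """Tiers from panel C; treatment of exact boundary values is assumed."""
    if score > 2:
        return "high"
    if score >= 1.5:
        return "medium"
    if score >= 1.2:
        return "low"
    return "none"
```

For example, passing a record's `toy_mucin_score` to `confidence_tier` returns one of `"high"`, `"medium"`, `"low"`, or `"none"`, corresponding to the red, orange, and yellow labels in C or to an unlabeled protein.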