Fig. 3: The CoPheeKSA algorithm and evaluation of prediction results.
From: Deciphering the dark cancer phosphoproteome using machine-learned co-regulation of phosphosites

a Overview of the ground-truth data and features used in the XGBoost machine learning algorithm for CoPheeMap-powered kinase-substrate association (KSA) prediction (CoPheeKSA). b Areas under the receiver operating characteristic curve (AUROCs) of the KSA classifiers trained with different feature combinations (n = 10 AUROCs per class). P-values were derived from a two-sided t-test with Bonferroni correction. ns: 0.05 < p ≤ 1, **0.001 < p ≤ 0.01, ***0.0001 < p ≤ 0.001, ****p < 0.0001. For boxplots, the centerline indicates the median, box limits indicate the upper and lower quartiles, and whiskers extend to 1.5 times the interquartile range. PSSM: position-specific scoring matrix. c Venn diagram comparing KSAs predicted by CoPheeKSA with the known KSAs used as ground-truth positives. d Distributions of Kinase Library percentile scores for different groups of kinase-substrate pairs. e Scatter plot comparing STRING scores for the associations between the host protein of a phosphosite and two sets of kinases. Each dot represents a unique site; the y-axis shows the STRING score for the kinase with the highest CoPheeKSA prediction score, and the x-axis shows the STRING score for the kinase with the highest Kinase Library percentile score. The red and blue dots highlight sites where the two methods predicted different top kinases. f Cumulative distributions of STRING scores for associations between the host protein of a phosphosite and the top kinase predicted by the Kinase Library. The KSAs were separated into two groups for comparison: predicted by CoPheeKSA (red) and not predicted by CoPheeKSA (blue). Source data are provided as a Source Data file.
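The significance annotations in panel b can be read as a two-step mapping: raw p-values are Bonferroni-adjusted, then binned into the labels listed in the caption. The following is a minimal sketch of that mapping, not the authors' code. The function names `bonferroni` and `significance_label` are hypothetical, the raw p-values are assumed to come from a separate two-sided t-test, and two choices are assumptions because the caption does not define them: the single-star band (0.01 < p ≤ 0.05) and the placement of p exactly equal to 0.0001.

```python
def bonferroni(p_values):
    """Bonferroni adjustment: multiply each raw p-value by the number of
    tests and cap the result at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def significance_label(p):
    """Map an adjusted p-value to the annotation used in the caption."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p < 0.0001:
        return "****"
    if p <= 0.001:
        # Assumption: p == 0.0001 is placed here; the caption leaves it undefined.
        return "***"
    if p <= 0.01:
        return "**"
    if p <= 0.05:
        # Assumption: this band is not listed in the caption; "*" follows
        # the common convention.
        return "*"
    return "ns"


# Hypothetical raw p-values for three pairwise comparisons.
raw = [0.01, 0.2, 0.00001]
labels = [significance_label(p) for p in bonferroni(raw)]
```

With these made-up inputs, the adjusted values fall into the "*", "ns" and "****" bins respectively, which shows how a raw p-value of 0.01 loses two stars once three comparisons are accounted for.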