Fig. 6: Explainable AI analysis of substrate-defining features for APP, NOTCH2, and ITGB1.

a–c CPP-SHAP ranking plots showing the top 15 features explaining the substrate prediction scores of the high-confidence (HC) substrates APP (a) and NOTCH2 (b), as well as of the low-confidence (LC) non-substrate ITGB1 (c). The substrate prediction scores ± standard deviation (see Methods “Aggregation of prediction results”) for APP, NOTCH2, and ITGB1 are given, followed by their color-highlighted prediction scores (based on dataset 1 and TMHMM annotation) explained by SHAP. Indicated are the scale subcategories, residue positions of part-split combinations, differences in the mean feature value (compared to OTHERS), and the feature impact (based on TMHMM, dataset 1 training). Features are ranked according to their positive (blue) or negative (red) impact. Σ indicates the sum of the importance of all top 15 features. d–f CPP-SHAP profiles showing the cumulative feature impact per residue for the TMD-JMD sequence of APP (d), NOTCH2 (e, murine sequence), and ITGB1 (f). The feature impact was obtained based on dataset 1 with TMHMM annotation (see Methods “Combining CPP with SHAP”). g Comparison of discriminative power (substrates vs non-substrates) for different similarity measures, exemplified for APP, NOTCH2, and ITGB1. Arrow thickness corresponds to similarity strength. h Scatterplot showing the normalized similarity to the closest (i.e., most similar or most strongly correlating) substrate from SUBEXPERT for all HC non-substrates (blue) and new HC substrates (red). A pair of connected dots represents the normalized similarity values for a particular protein based on the TMD-JMD sequence (gray) or CPP feature correlation (black). The closest substrate can differ between the two measures, as exemplified by the new HC substrate ADAM7. Min-max normalization was performed on the human N-out proteome dataset. The dashed black line indicates the discrimination border based on CPP features. Source data are provided as a Source Data file.
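The normalized similarity to the closest substrate shown in panel h can be illustrated with a minimal sketch. It assumes that, for each query protein, the closest substrate is the one with the highest similarity value, and that min-max normalization uses the minimum and maximum observed across a reference set of similarity values (here standing in for the human N-out proteome dataset). The function names `closest_substrate` and `min_max_normalize` and all numbers are hypothetical and are not taken from the original analysis.

```python
import numpy as np


def closest_substrate(similarity):
    """Return the highest similarity per query protein and the index of that substrate.

    similarity: 2D array, rows = query proteins, columns = known substrates.
    """
    similarity = np.asarray(similarity, dtype=float)
    return similarity.max(axis=1), similarity.argmax(axis=1)


def min_max_normalize(values, reference):
    """Scale values to [0, 1] using the min and max of a reference set (assumed scheme)."""
    reference = np.asarray(reference, dtype=float)
    ref_min, ref_max = reference.min(), reference.max()
    if ref_max == ref_min:
        raise ValueError("reference values are constant; normalization is undefined")
    return (np.asarray(values, dtype=float) - ref_min) / (ref_max - ref_min)


# Hypothetical similarity values: 2 query proteins against 3 known substrates.
sequence_similarity = np.array([[0.2, 0.8, 0.5],
                                [0.1, 0.3, 0.9]])
best_value, best_index = closest_substrate(sequence_similarity)

# Hypothetical reference values standing in for the proteome-wide distribution.
reference_values = np.array([0.4, 0.9, 0.6])
normalized = min_max_normalize(best_value, reference_values)
```

Applying the same two steps separately to a sequence-based similarity matrix and to a CPP feature correlation matrix would give the two connected values per protein; because `argmax` is computed independently for each measure, the closest substrate can differ between them.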