Fig. 3: Benchmarking of CPP and dPULearn against state-of-the-art methods.

a Bar chart comparing three feature engineering algorithms, each applied without (None) or with a data expansion method. The feature engineering algorithms were scale values averaged over the entire TMD-JMD sequence without splits (scale-based), deep learning-based ProtTrans5 embeddings (ProtT5), and CPP. The data expansion methods were the Synthetic Minority Over-sampling Technique (SMOTE) and deterministic Positive-Unlabeled (PU) Learning (dPULearn). Support vector machine models were evaluated by leave-one-out cross-validation. b Heatmap showing optimization of the number of CPP features used for model training and non-substrate identification by dPULearn. The optimized result is indicated by a bold square. Source data are provided as a Source Data file.