Fig. 1: A step-by-step approach to the network-topology-based identification of predictive oncotherapeutical biomarkers.
From: MarkerPredict: predicting clinically relevant predictive biomarkers with machine learning

a A 5-step flowchart of the MarkerPredict process. The first two steps cover the network topology analysis. After the identification of IDP-target pairs, the proteins are annotated to establish the final input dataset. Machine learning models are then trained and used to classify the unlabelled data. As a final step, the predicted predictive biomarkers are reviewed for their potential medical use. b Steps 1-2: Sankey diagram of the triangles identified in the ReactomeFI (ref. 37) network. 3.04% of the triangles contained DisProt-defined IDPs or targets. From the number of DisProt IDP and target members in triangles, the proportion of IDP-target triangles expected by random chance can be calculated. Compared with this expected value, the observed proportion shows that IDP-target triangles are overrepresented in every network. The DisProt enrichment ratio is 11.91 in the ReactomeFI (highlighted), 5.66 in the CSN (ref. 35) and 4.86 in the SIGNOR (ref. 36) network. For AlphaFold IDPs (defined as pLDDT < 50), it is 5.48, 6 and 1.7; for IUPred long score > 0.5, it is 6.1, 5.88 and 3.74; and for IUPred short score > 0.5, it is 3.98, 7.41 and 3, respectively. c Step 3: The biomarker properties of the neighbours of targets in triangles that include cancer drug targets. The majority of neighbours (86.6% to 96.3%) were biomarkers according to the CIViCmine database. Among predictive biomarker neighbours, a considerable proportion are established predictive biomarkers of a drug whose target shares a triangle with the protein in question. d Step 4: Receiver operating characteristic (ROC) curves of one hundred runs of 5-fold cross-validation with the XGBoost (ref. 29) model trained on the combined data of the CSN, SIGNOR and ReactomeFI networks and of all 3 IDP databases and prediction methods. The model reached high performance, with an average area under the curve (AUC) of 0.99 ± 0.01 (marked in red). Other validation models also showed high performance (see Supplementary Table S3).
e Step 5: The Biomarker Probability Score (BPS), a rank score calculated from prediction probabilities (for the definition see Fig. 2), versus the original labels of the training dataset. This panel shows the BPS calculated with the models trained on all 3 IDP databases and prediction methods. The original labels are shown with grey shading, in order of increasing BPS values (marked in red). A strong correlation is visible, with a few differing labels around average BPS values.
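The enrichment ratio in panel b compares the observed proportion of IDP-target triangles with the proportion expected by chance. The caption does not state the exact null model, so the following is only a minimal sketch under an assumed one: a triangle is treated as a uniformly random triple of the proteins that occur in any triangle, and it counts as an IDP-target triangle if it contains at least one IDP and at least one target. The function name `idp_target_enrichment` and the toy protein labels are hypothetical and not taken from MarkerPredict.

```python
from math import comb


def idp_target_enrichment(triangles, idps, targets):
    """Return (observed, expected, ratio) for IDP-target triangles.

    ASSUMPTION (not from the paper): the chance expectation is the
    probability that a uniformly random triple of triangle-member
    proteins contains at least one IDP and at least one target.
    """
    triangles = [frozenset(t) for t in triangles]
    if not triangles:
        raise ValueError("at least one triangle is required")
    idps, targets = set(idps), set(targets)

    # Proteins that occur in at least one triangle.
    nodes = set().union(*triangles)
    n = len(nodes)
    a = len(idps & nodes)               # IDPs among triangle members
    b = len(targets & nodes)            # targets among triangle members
    u = len((idps | targets) & nodes)   # members that are IDP or target

    # Observed proportion of triangles with an IDP and a target.
    hits = sum(1 for t in triangles if t & idps and t & targets)
    observed = hits / len(triangles)

    # Inclusion-exclusion over random triples:
    # all - (no IDP) - (no target) + (neither IDP nor target).
    total = comb(n, 3)
    favourable = total - comb(n - a, 3) - comb(n - b, 3) + comb(n - u, 3)
    expected = favourable / total

    ratio = observed / expected if expected > 0 else float("nan")
    return observed, expected, ratio


# Toy example with hypothetical proteins A-F: A is an IDP, B is a target.
toy_triangles = [("A", "B", "C"), ("A", "B", "D"), ("C", "D", "E"), ("D", "E", "F")]
obs, exp, ratio = idp_target_enrichment(toy_triangles, idps={"A"}, targets={"B"})
print(f"observed {obs:.2f}, expected {exp:.2f}, ratio {ratio:.2f}")
# prints "observed 0.50, expected 0.20, ratio 2.50"
```

A ratio above 1 indicates overrepresentation relative to this assumed null. A degree-preserving or permutation-based null would give different values, so the numbers from this sketch are not comparable to those reported in the caption.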