Fig. 2: A machine learning model to infer domain insertion sites in proteins. | Nature Methods

From: Rational engineering of allosteric protein switches by in silico prediction of domain insertion sites

a, Schematic of the machine learning pipeline for protein insertion site prediction. b–d, Boxplots showing prediction scores for true positive labels (green) and negative labels (other positions, unknown; blue) on a test set. The performance of models trained with different encoding strategies (b), on different dataset splits (random; Interpro, based on domain classes; single, one representative example per class) (c) or using positional masking (d) is shown (see Supplementary Note 1 for details). e, Boxplot of insertion scores predicted by the model variant trained on the ‘single’ representative protein dataset split, grouped by secondary (Sec.) structure. The calculation is based on secondary structure predictions for the entire test set. b–e, Boxes represent the interquartile range (IQR) and the median is represented by a horizontal line. Whiskers extend to 1.5 times the IQR or to the smallest or largest predicted value. n = 1,382 protein sequences with 1,382 known positive insertion sites and 325,510 unknown sites. f,g, Exemplary predictions from the test set. The natural insertion sites are marked in green and the insert domain is colored accordingly in the protein structures. f, Phosphoglycerate kinase (PDB ID 4NG4); g, Rvb1/Rvb2 heterohexamer (RuvB-like 1, PDB ID 5OAF). h, The insertion score for the bacterial transcription factor AraC is indicated for each amino acid position by a black line. Green positions indicate experimentally validated insertion-tolerant sites (ref. 10). The domains and secondary structure elements of AraC are annotated. i, AUROC plot of the insertion site prediction for AraC. j, Insertion scores mapped onto the AlphaFold2-predicted AraC protein structure. In h,j, allosteric insertion sites previously validated in experiments (I113 and S170) are indicated. AUC, area under the curve; AUROC, area under the receiver operating characteristic curve.
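The boxplot statistics in b–e and the AUROC in i can be illustrated with a minimal sketch. It is not the authors' code. The function names `box_stats` and `auroc` and the toy scores are hypothetical. The whiskers are assumed to follow the common Tukey convention (the most extreme data point within 1.5 times the IQR of the box), and the AUROC is computed as the fraction of positive/negative pairs in which the positive site receives the higher score.

```python
from statistics import quantiles


def box_stats(values):
    """Quartiles and Tukey-style whiskers for one boxplot group (assumed convention)."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers stop at the most extreme data point that lies inside the fences.
    lower = min(v for v in values if v >= lo_fence)
    upper = max(v for v in values if v <= hi_fence)
    return {"q1": q1, "median": median, "q3": q3, "lower": lower, "upper": upper}


def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy per-position insertion scores; label 1 marks a known insertion site (made-up data).
scores = [0.9, 0.2, 0.8, 0.1, 0.4, 0.3]
labels = [1, 0, 1, 0, 0, 1]

print(box_stats([1, 2, 3, 4, 100]))
print(auroc(scores, labels))
```

The pairwise loop in `auroc` is quadratic in the number of sites, so it suits a single protein such as AraC; for a test set of the size given above, a rank-based implementation would be the more practical choice.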