Fig. 4: Multi-site mutagenesis and ML-guided specificity engineering.
From: Data-driven protease engineering by DNA-recording and epistasis-aware machine learning

a Randomized residues (cluster A: 30-32, cluster B: 217-219, stick model) are close to P1’ (yellow) in TEVs (stick model). Structure: PDB 1LVB (ref. 55). b TEVp libraries randomizing either cluster alone (A, B) or both clusters concomitantly (AB) were tested against all P1’-variants of TEVs (X = any amino acid). c P1’-specificity profiles for the libraries in (b). Variants are ranked by mean activity across all tested substrates, from high (left) to low (right). The canonical substrate ENLYFQS is highlighted (bold). Gray boxes indicate missing data. d Variants with distinct specificity were re-tested with the DNA recorder to obtain full P1’-profiles. One variant with a strong preference for W (Trp-spec.) and three variants with relaxed specificity (promiscuous) are compared to parent TEVp I. Characteristic TEVs variants (arrows, bold) were chosen for in vitro assays. Gray boxes indicate missing data. e In vitro activity of variants from (d) on characteristic P1’-variants. f MLP architecture. A multi-output classifier was used to model activity on 20 P1’-variants simultaneously using data from library AB (BLOSUM: Blocks Substitution Matrix). The architecture was optimized via systematic hyperparameter search and makes exhaustive prediction of large sequence-activity maps feasible. A random subset of predicted specificity profiles from cluster AB is shown. g Predictive performance of different ML models trained on varying dataset sizes from library AB. AUROC: Area Under the Receiver Operating Characteristic curve. AUPRC: Area Under the Precision-Recall Curve. F1: F1 score (harmonic mean of precision and recall). MCC: Matthews Correlation Coefficient. MLP: multilayer perceptron. ESM: evolutionary scale modeling. GBT: gradient-boosted trees. LR: logistic regression. SVC: support vector classifier. KNN: k-nearest neighbor classifier. Color indicates the mean value over n = 5 model replicates with different seeds. h Efficiency of random versus MLP-guided screens using a test set of 1000 library AB variants. Hit rates are the fraction of variants that are active (left) or inactive (right) on the TEVs variant bearing the indicated P1’ amino acid. Right panel: only candidates retaining activity on canonical TEVs are displayed. Bars represent the mean of n = 5 replicate models trained with different random seeds; error bars indicate SD. Source data are provided as a Source Data file.
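
The hit-rate comparison in panel h can be illustrated with a minimal sketch. This is not the authors' code: the function names (`hit_rate`, `random_screen`, `guided_screen`), the toy labels, and the toy prediction scores are all hypothetical, and the sketch only assumes that a "hit" is a variant labeled 1 for the substrate of interest and that a guided screen picks the variants with the highest predicted score.

```python
import random
import statistics

def hit_rate(labels, picked):
    """Fraction of picked variants whose label is 1 (a 'hit')."""
    if not picked:
        raise ValueError("no variants picked")
    return sum(labels[i] for i in picked) / len(picked)

def random_screen(n_variants, n_pick, seed):
    """Pick n_pick variant indices uniformly at random, without replacement."""
    return random.Random(seed).sample(range(n_variants), n_pick)

def guided_screen(scores, n_pick):
    """Pick the n_pick variant indices with the highest predicted score."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:n_pick]

# Synthetic toy data, for illustration only (not from the paper):
# 1 = variant counts as a hit on one P1' substrate, 0 = it does not.
labels = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
# Hypothetical model scores for the same ten variants.
scores = [0.9, 0.2, 0.1, 0.8, 0.3, 0.4, 0.05, 0.7, 0.6, 0.15]

guided = hit_rate(labels, guided_screen(scores, 3))

# Random screens repeated over several seeds, summarized as mean and SD.
random_rates = [hit_rate(labels, random_screen(len(labels), 3, s)) for s in range(5)]
random_mean = statistics.mean(random_rates)
random_sd = statistics.stdev(random_rates)

# Expected hit rate of a random pick equals the fraction of hits in the set.
baseline = sum(labels) / len(labels)
```

For the right-hand panel, the same calculation would apply with labels marking inactive variants, after first restricting the candidates to those that retain activity on the canonical substrate.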