Fig. 2: Residue amino acid preferences can explain combinatorial mutation effects for multiple proteins and enable predicting the function of unobserved combinatorial variants.
From: Protein design using structure-based residue preferences

a–d Ten binding residues of the antitoxin ParD3 (AT: L48, D52, I53, R55, L56, F74, R78, E80, A81, R82; n = 10,658) were randomized; they are shown space-filled on PDB ID: 5CEG and highlighted on the antitoxin sequence (bottom) (a). The library was transformed into cells containing wild-type toxin ParE3, and the growth of individual antitoxin variants was followed by high-throughput sequencing over two timepoints to calculate the normalized log read ratio (growth rate, GR) for each variant (b). Antitoxin variants that bind and neutralize the toxin show higher growth rates. The distribution of measured growth rate values for all antitoxin variants, wild-type antitoxin, and truncated antitoxins is shown in (c), and the reproducibility of growth rate values between two biological replicates in (d). e–g The logistic regression model learns the 20*N per-residue mutation effects, which are passed through a sigmoid function (orange) to predict the 20^N combinatorial variants (e). Logistic regression (g) fits the observed combinatorial variants better than linear regression does (f). The top row shows the fit to the 3-position randomized antitoxin library, and the bottom row the fit to the 10-position randomized antitoxin library. h, i The logistic regression model enables prediction of a held-out 20% of the random combinatorial variants (h) and classification of half-maximal neutralization (i). j Total variance explained and held-out correlation of the site-wise logistic regression model across 8 combinatorial variant datasets. k A subset of the observed combinatorial variants is sufficient to infer the site-wise preference parameters that explain the remaining held-out combinatorial variant effects. Source data are provided as a Source Data file.
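
The site-wise model described in panels e–h can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each variant is one-hot encoded into 20*N features, that growth rates have been rescaled to the range [0, 1], and that the per-residue effects are fitted by plain gradient descent on a cross-entropy loss. The function names (`one_hot`, `fit`, `predict`, `demo`), the hyperparameters, and the synthetic data are all hypothetical and chosen only for illustration.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(variants):
    """Encode equal-length sequences as an (n_variants, 20*N) 0/1 matrix."""
    n_pos = len(variants[0])
    X = np.zeros((len(variants), 20 * n_pos))
    for row, seq in enumerate(variants):
        for pos, aa in enumerate(seq):
            X[row, 20 * pos + AA_INDEX[aa]] = 1.0
    return X


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def predict(X, weights, bias):
    """Sum the per-residue effects of each variant, then apply the sigmoid."""
    return sigmoid(X @ weights + bias)


def fit(X, y, lr=0.5, n_steps=2000, l2=1e-4):
    """Fit 20*N site-wise effects by gradient descent (assumed optimizer).

    y holds growth rates rescaled to [0, 1] (an assumption of this sketch).
    """
    n, d = X.shape
    weights, bias = np.zeros(d), 0.0
    for _ in range(n_steps):
        err = predict(X, weights, bias) - y  # cross-entropy gradient w.r.t. logit
        weights -= lr * (X.T @ err / n + l2 * weights)
        bias -= lr * err.mean()
    return weights, bias


def demo(n_pos=3, n_variants=2000, seed=0):
    """Synthetic check: fit on 80% of variants, correlate on the held-out 20%."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, 20, size=(n_variants, n_pos))
    variants = ["".join(AMINO_ACIDS[i] for i in row) for row in idx]
    X = one_hot(variants)
    true_w = rng.normal(0.0, 1.0, size=20 * n_pos)  # made-up site-wise effects
    y = sigmoid(X @ true_w)                         # noise-free synthetic "GR"
    n_train = int(0.8 * n_variants)
    w, b = fit(X[:n_train], y[:n_train])
    held_out_pred = predict(X[n_train:], w, b)
    return np.corrcoef(held_out_pred, y[n_train:])[0, 1]
```

Calling `demo()` returns the Pearson correlation between predicted and simulated values on the held-out 20%. Because the synthetic data are generated by the same additive-plus-sigmoid form, this only checks that the fitting procedure recovers the parameters; it says nothing about how well the model fits real measurements.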