Fig. 4: Evaluation of function and diversity of CoVES-designed GFP variants.
From: Protein design using structure-based residue preferences

a–c Performance of the GFP surrogate fitness function, trained on a random 90% of the data and evaluated on the held-out 10% test data. a Correlation between predicted and observed fluorescence values. The Pearson correlation coefficient, r, is indicated. b True positive rate (TPR) vs. false positive rate (FPR) curve for classifying variants as above half-maximal fluorescence. c Positive predictive value, i.e., the fraction of variants that truly fluoresce above half-maximal among all GFP variants predicted to do so, for the held-out 10% test data. d, e Diversity and function of sequences sampled from unsupervised models. Each dot represents a set of sequences sampled at a given temperature with a fixed number of randomized positions, summarized by the fraction of sequences predicted to show above half-maximal fluorescence and the average number of mutations. d A low sampling temperature (CoVES T = 0.5, ESM-IF T = 1.0, proteinMPNN T = 1.5) is used while the number of randomly mutated positions (out of 233 GFP residues) is increased. e The maximum number of mutated positions is set to 10, and the sampling temperature is increased. Only combinatorial variants composed of individual variants on which the surrogate fitness function was trained and tested in panels a–c are shown. Source data are provided as a Source Data file.
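The quantities reported in panels a–c can be illustrated with a minimal sketch. This is not the authors' code: the function names, the toy fluorescence values, and the choice of 0.5 as the half-maximal threshold (half of an assumed reference fluorescence of 1.0) are all hypothetical, and the same threshold is applied to both predicted and observed values as a simplifying assumption.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences (panel a)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

def half_max_rates(predicted, observed, half_max, score_cutoff):
    """Return (TPR, FPR, PPV) for one cutoff on the predicted score.

    A variant is truly positive if its observed fluorescence exceeds
    half_max, and predicted positive if its predicted fluorescence
    exceeds score_cutoff. Sweeping score_cutoff traces a TPR vs. FPR
    curve (panel b); PPV is the quantity in panel c.
    """
    tp = fp = tn = fn = 0
    for p, o in zip(predicted, observed):
        truly_positive = o > half_max
        called_positive = p > score_cutoff
        if called_positive and truly_positive:
            tp += 1
        elif called_positive:
            fp += 1
        elif truly_positive:
            fn += 1
        else:
            tn += 1
    nan = float("nan")
    tpr = tp / (tp + fn) if (tp + fn) else nan
    fpr = fp / (fp + tn) if (fp + tn) else nan
    ppv = tp / (tp + fp) if (tp + fp) else nan
    return tpr, fpr, ppv

# Hypothetical toy data, scaled so the assumed reference fluorescence is 1.0.
observed = [0.1, 0.2, 0.9, 1.0, 0.7, 0.3]
predicted = [0.2, 0.6, 0.8, 0.9, 0.4, 0.1]

r = pearson_r(predicted, observed)
tpr, fpr, ppv = half_max_rates(predicted, observed, half_max=0.5, score_cutoff=0.5)
```

Calling `half_max_rates` over a range of `score_cutoff` values yields the (FPR, TPR) pairs of a curve like the one in panel b; the reference that defines "maximal" fluorescence would need to follow the paper's own definition.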