Fig. 3: Overall performance on single-site and multi-site mutants.
From: Enhancing efficiency of protein language models with minimal wet-lab data through few-shot learning

a Average model performance on single-site mutants across all 87 datasets, evaluated by Spearman correlation. Error bars represent the standard deviation across five random splits. SaProt (FSFP) is significantly better than all baselines; the largest P value across all training sizes is 0.0079 (two-sided Mann–Whitney U test). Analogous results measured by NDCG are shown in Supplementary Fig. 3a.
b Summary of how often each PLM achieves the best test Spearman correlation on a dataset for single-site mutants; colors indicate the strategy applied to the best-performing PLM.
c Average model performance on multi-site mutants across 11 datasets, evaluated by Spearman correlation. Error bars represent the standard deviation across five random splits. SaProt (FSFP) is significantly better than all baselines; the largest P value across all training sizes is 0.016 (two-sided Mann–Whitney U test). Analogous results measured by NDCG are shown in Supplementary Fig. 3b.
d Similar to (b), but counting the best performance on multi-site mutants.
Source data are provided as a Source Data file.
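The two statistics named in this caption can be illustrated with a minimal, standard-library-only sketch. It is not the authors' evaluation code. The function names and the example scores are hypothetical, and the caption does not state how the samples for the Mann–Whitney U test were formed; the sketch assumes each method contributes one score per random split (five per method). Under that assumption, the smallest two-sided exact P value attainable is 2/252 ≈ 0.0079.

```python
from itertools import combinations
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def mann_whitney_exact(a, b):
    """Exact two-sided Mann-Whitney U P value by enumerating all
    assignments of the pooled ranks to the first group."""
    pooled = ranks(list(a) + list(b))
    n, m = len(a), len(b)

    def u_stat(indices):
        return sum(pooled[i] for i in indices) - n * (n + 1) / 2

    observed = u_stat(range(n))
    center = n * m / 2  # expected U under the null hypothesis
    all_u = [u_stat(idx) for idx in combinations(range(n + m), n)]
    extreme = sum(abs(u - center) >= abs(observed - center) for u in all_u)
    return extreme / len(all_u)


# Hypothetical per-split Spearman scores for two methods (not from the paper).
method_a = [0.50, 0.52, 0.51, 0.53, 0.49]
method_b = [0.40, 0.42, 0.41, 0.43, 0.39]

print(spearman([1, 2, 3, 4], [10, 20, 30, 40]))  # perfectly monotonic: 1.0
print(mann_whitney_exact(method_a, method_b))    # 2/252, about 0.0079
```

Enumerating all 252 rank assignments is feasible only because the groups are tiny; for larger samples a library routine with a normal approximation would be the usual choice.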