Fig. 5: Impact of sequence diversity on data efficiency and model coverage.
From: Accuracy and data efficiency in deep learning models of protein expression

A We trained CNNs on datasets of constant size and increasing sequence diversity. We trained a total of 27 models by successively aggregating fractions of randomly chosen mutational series into a new training dataset; the total training-set size was kept constant at 5800 sequences. Training on aggregated sequences achieves good accuracy for mutational series included in the training set, but poor predictions for series not included in the training data, suggesting that CNNs generalize poorly across unseen regions of the sequence space. Accuracy is reported as the \(R^2\) computed on 10% of sequences held out from each mutational series. We excluded two series from training to test the generalization performance of the last model.

B Bubble plot of the \(R^2\) values averaged across all mutational series for each model. Labels indicate the model number from panel A, and insets show schematics of the region of sequence space employed for training; for clarity, we have omitted model 1 from the plot. Improved sequence diversity leads to gains in predictive accuracy across larger regions of the sequence space; we observed similar trends for other random choices of series included in the training set (Supplementary Fig. S12). The decreasing number of training sequences per series reflects better data efficiency, thanks to an increasingly diverse set of training sequences. To quantify sequence diversity, we counted the occurrences of unique overlapping 5-mers across all sequences in each training set and defined diversity as \(1/\sum_{i=1}^{100} c_i\), where \(c_i\) is the count of the i-th most frequent 5-mer.
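A minimal sketch of how this diversity metric might be computed, assuming DNA sequences are given as plain strings; the function name, parameter defaults, and toy sequences below are illustrative, not taken from the paper:

```python
from collections import Counter

def sequence_diversity(sequences, k=5, top_n=100):
    """Diversity as defined in the caption: the reciprocal of the
    summed counts of the top_n most frequent overlapping k-mers
    across all sequences in a training set."""
    counts = Counter()
    for seq in sequences:
        # Count every overlapping k-mer in the sequence.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    # Sum the counts of the top_n most frequent k-mers.
    top_counts = [c for _, c in counts.most_common(top_n)]
    return 1.0 / sum(top_counts)

# Hypothetical usage on a toy training set:
train = ["ATGCGTACGT", "ATGCGTTTGT", "CCGTACGTAA"]
print(sequence_diversity(train))
```

Under this definition, a training set dominated by a few recurring 5-mers yields a large denominator and hence low diversity, whereas a set whose 5-mers are spread more evenly scores higher.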