Fig. 2: Accuracy of non-deep machine learning models.
From: Accuracy and data efficiency in deep learning models of protein expression

A We trained models using datasets of variable size and with different strategies for DNA encoding. Sequences were converted to numerical vectors with five DNA encoding strategies (Table 1), plus an additional mixed encoding consisting of binary one-hot augmented with the biophysical properties of Fig. 1A; in all cases, one-hot encoded matrices were flattened into vectors of dimension 384. We considered four non-deep models trained on an increasing number of sequences from five mutational series with different phenotype distributions (Fig. 1B). B Impact of DNA encoding and data size on model accuracy. Overall, we found that random forest regressors and binary one-hot encodings provide the best accuracy; we validated this optimal choice across the whole sequence space by training more than 5000 models in all mutational series (Supplementary Fig. S5). Phenotype distributions have a minor impact on model accuracy thanks to the use of stratified sampling for training. Model accuracy was quantified by the coefficient of determination (R²) between predicted and measured sfGFP fluorescence, computed on ~400 test sequences held out from training and validation. The reported R² values are averages across five training repeats with resampled training and test sets (Monte Carlo cross-validation). In each training repeat, we employed the same test set for all models and encodings. The full cross-validation results (Supplementary Fig. S4) show robust performance and little overfitting, particularly for the best-performing models. C Exemplar predictions on held-out sequences for three models from panel B (marked with stars); the models shown were trained on 25% of mutational series 44 (bimodal fluorescence distribution; Fig. 1C) using 4-mer ordinal encoding. Details on model training and hyperparameter optimization can be found in the Methods, Supplementary Fig. S3, and Supplementary Tables S2–S3.
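The evaluation protocol described in panels A and B can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline. It shows binary one-hot encoding flattened to a single vector, the coefficient of determination, and Monte Carlo cross-validation in which every model is scored on the same test set within a repeat. The function names, the `MeanBaseline` placeholder model, and the use of plain random sampling (the caption describes stratified sampling) are all hypothetical simplifications. The 96-nt sequence length is inferred from 384 = 4 × 96 and is also an assumption.

```python
import random
from statistics import mean

ALPHABET = "ACGT"  # channel order is an arbitrary choice for this sketch


def one_hot_flatten(seq):
    """Binary one-hot encode a DNA sequence and flatten it to one vector."""
    vec = []
    for base in seq.upper():
        vec.extend(1 if base == letter else 0 for letter in ALPHABET)
    return vec  # length 4 * len(seq), e.g. 384 for a 96-nt sequence


def r2_score(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot.

    Assumes y_true is not constant (otherwise SS_tot is zero).
    """
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def monte_carlo_splits(n_samples, n_train, n_test, repeats=5, seed=0):
    """Yield disjoint (train_idx, test_idx) pairs, resampled in each repeat.

    Plain random sampling is used here; stratification is omitted.
    """
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = rng.sample(range(n_samples), n_train + n_test)
        yield idx[:n_train], idx[n_train:]


class MeanBaseline:
    """Hypothetical placeholder model that predicts the training mean."""

    def fit(self, X, y):
        self.mu_ = mean(y)
        return self

    def predict(self, X):
        return [self.mu_] * len(X)


def evaluate(model_factories, X, y, n_train, n_test, repeats=5, seed=0):
    """Average test R^2 per model over the repeats.

    Within a repeat, all models are fitted and scored on the same split.
    """
    scores = {name: [] for name in model_factories}
    for train, test in monte_carlo_splits(len(X), n_train, n_test, repeats, seed):
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        X_te, y_te = [X[i] for i in test], [y[i] for i in test]
        for name, make in model_factories.items():
            model = make().fit(X_tr, y_tr)
            scores[name].append(r2_score(y_te, model.predict(X_te)))
    return {name: mean(vals) for name, vals in scores.items()}
```

Any object exposing `fit` and `predict` (for example, a random forest regressor from a machine learning library) could be passed in place of `MeanBaseline`, e.g. `evaluate({"baseline": MeanBaseline}, X, y, n_train=40, n_test=20)`. Sharing the split across models within a repeat means that differences in average R² reflect the model and encoding rather than the particular test sequences drawn.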