Supplementary Figure 10: Downsampling analysis determines the extent of data needed to train models of varying complexity. | Nature Biotechnology

From: Deciphering eukaryotic gene-regulatory logic with 100 million random promoters

Model performance for models trained on sub-samples of the available training data (x axis) and learning different sets of parameters (colors; parameters are cumulative from top to bottom of the legend). Top: prediction accuracy on the held-out validation data (Pearson’s r², y axis). Bottom: relative performance on validation versus training data (ratio of validation r² to training r², y axis); a ratio below 1.00 indicates that the model is overfit. Dotted lines: number of training examples needed to eliminate overfitting (validation r² / training r² > 0.999). Solid points: number of training examples at which the maximal validation performance is achieved.
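The two markers described in the caption can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the sub-sample sizes, and the r² values below are hypothetical and chosen only for illustration. The sketch also assumes that sizes are sorted in ascending order and that the dotted line corresponds to the smallest sub-sample size at which the validation-to-training ratio first exceeds 0.999.

```python
import numpy as np


def pearson_r2(y_true, y_pred):
    """Squared Pearson correlation between measured and predicted values."""
    r = np.corrcoef(y_true, y_pred)[0, 1]
    return float(r ** 2)


def summarize_downsampling(sizes, train_r2, val_r2, threshold=0.999):
    """Return the validation/training r2 ratio per sub-sample size, plus
    (a) the smallest size whose ratio exceeds `threshold` (assumed meaning
        of the dotted line; None if no size qualifies), and
    (b) the size with the highest validation r2 (solid point).
    Assumes `sizes` is sorted in ascending order."""
    sizes = np.asarray(sizes)
    train_r2 = np.asarray(train_r2, dtype=float)
    val_r2 = np.asarray(val_r2, dtype=float)

    ratio = val_r2 / train_r2
    above = np.flatnonzero(ratio > threshold)
    n_no_overfit = int(sizes[above[0]]) if above.size else None
    n_best = int(sizes[np.argmax(val_r2)])
    return ratio, n_no_overfit, n_best


# Hypothetical values for one model, NOT taken from the figure.
sizes = [1_000, 10_000, 100_000, 1_000_000]
train_r2 = [0.90, 0.80, 0.75, 0.7400]
val_r2 = [0.30, 0.60, 0.74, 0.7399]

ratio, n_no_overfit, n_best = summarize_downsampling(sizes, train_r2, val_r2)
print("validation/training r2 ratio:", np.round(ratio, 4))
print("examples needed to eliminate overfitting:", n_no_overfit)
print("examples at maximal validation r2:", n_best)
```

In this made-up series the two markers fall at different sub-sample sizes, which shows why the caption reports them separately: validation performance can plateau before the training and validation scores fully converge.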
