Fig. 3: Simmering outperforms ensembled early stopping and dropout on the CIFAR-10 (ref. 25) (panel a) and Portuguese-English TED talk transcript translation (ref. 27) (panel b) datasets.
From: Sufficient is better than optimal for training neural networks

Simmering’s ensemble prediction (rectangular marker) achieves both the highest accuracy and the largest ensembling improvement (rectangular marker vs. round markers), with the latter indicating that the advantage of simmering extends beyond ensembling alone. In contrast, the early-stopped ensemble accuracy (rectangular marker) does not exceed that of its ensemble members (round markers) on either training task. For the CIFAR-10 dataset, we employed the ConvNet architecture (ref. 26) and trained all non-simmering cases via stochastic gradient descent. The early stopping ensemble consists of 100 independently optimized early-stopped models, with an average training duration of 14.56 epochs. Dropout and ab initio simmering each trained for 20 epochs, and the models corresponding to the last 2000 weight updates contributed to the simmering ensemble. We used dropout’s inference-mode prediction as its ensemble prediction (ref. 38), and aggregated the early stopping and simmering ensembles via majority voting. For the translation task, we trained a reduced version (described in Supplementary Methods) of the Transformer architecture presented in ref. 18 with a pre-trained BERT tokenizer (ref. 35), and assessed accuracy via teacher-forced token prediction accuracy. We fixed the learning rate for all cases and trained the non-simmering cases with the Adam optimizer (ref. 16). The early stopping ensemble consists of 10 independently trained models, with an average training time of 53.1 epochs, aggregated via majority voting. We optimized a model with dropout for 60 epochs and used its inference-mode prediction as its ensemble prediction. The simmering ensemble exceeded the test accuracy of all other cases after only 21 training epochs, using a majority-voted ensemble prediction from the 200 models sampled during the last epoch. Accuracy convergence curves for both training tasks are shown in Supplementary Figs. 1 and 2, and further implementation details for the comparisons are given in Supplementary Methods. In sum, for equivalent training time, ab initio simmering produces more accurate predictions than other ensembled overfitting mitigation techniques on the CIFAR-10 dataset; on the natural language processing task, simmering both accelerates training and exceeds the accuracy of the other overfitting mitigation techniques.
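As a concrete illustration of the aggregation step described above, the sketch below collects model snapshots over the final weight updates of training and combines their class predictions by majority vote. This is a minimal sketch, not the paper’s implementation: the names `model`, `train_loader`, `optimizer`, `loss_fn`, and `num_keep` are illustrative, and it assumes a PyTorch classifier whose forward pass returns class logits.

```python
import copy
import torch

def collect_snapshots(model, train_loader, optimizer, loss_fn, num_keep=200):
    """Run one training epoch and keep a deep copy of the model after each
    of the last `num_keep` weight updates (cf. the 200-model simmering
    ensemble sampled during the final epoch)."""
    snapshots = []
    num_steps = len(train_loader)
    for step, (x, y) in enumerate(train_loader):
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
        if step >= num_steps - num_keep:
            snapshots.append(copy.deepcopy(model).eval())
    return snapshots

def majority_vote(snapshots, x):
    """Majority-voted class prediction across ensemble members."""
    with torch.no_grad():
        # Each member casts one vote: its argmax class for each example.
        votes = torch.stack([m(x).argmax(dim=-1) for m in snapshots])
    # torch.mode over the member axis returns the most frequent vote.
    return votes.mode(dim=0).values
```

The early stopping ensembles would be aggregated the same way, except that each member is an independently trained model rather than a snapshot from a single training trajectory. Similarly, the translation metric can be sketched as token-level accuracy under teacher forcing, where the decoder consumes the gold target prefix and each output token is scored against the gold next token; the `model(src, tgt_in)` signature and `pad_id` below are assumptions for illustration, not the paper’s code.

```python
def teacher_forced_accuracy(model, src, tgt, pad_id):
    """Fraction of non-padding target tokens predicted correctly when the
    decoder is fed the gold target prefix (teacher forcing)."""
    with torch.no_grad():
        logits = model(src, tgt[:, :-1])   # decoder input: gold prefix
        pred = logits.argmax(dim=-1)       # greedy per-token prediction
        gold = tgt[:, 1:]                  # gold next tokens
        mask = gold != pad_id              # score only real tokens
        return (pred[mask] == gold[mask]).float().mean().item()
```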