Fig. 4: Results on a stationary ant-locomotion problem.

a, The four reinforcement-learning algorithms performed on this problem much as they did on the non-stationary problem (compare with Fig. 3c). b,c, A closer look inside the networks reveals a pattern similar to that seen in supervised learning (compare with Fig. 2c,d). d, The absolute values of the weights of the networks increased steadily under standard and tuned PPO, whereas they decreased and stayed small under L2 regularization with or without continual backpropagation. These results are averaged over 30 runs; the solid lines represent the mean and the shaded regions represent the 95% bootstrapped confidence interval.
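
For readers unfamiliar with bootstrapped intervals, the following is a minimal sketch of how a mean and a 95% confidence band could be computed from 30 runs at a single time step. The caption does not say which bootstrap variant was used, so the percentile method here is an assumption, and the function name `bootstrap_ci`, the resample count and the synthetic data are hypothetical and chosen only for illustration.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=2000, level=0.95, seed=0):
    """Return (mean, lower, upper) using a percentile bootstrap of the mean.

    Assumption: percentile bootstrap; the source does not specify the variant.
    """
    rng = random.Random(seed)
    n = len(values)
    # Resample the runs with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    alpha = (1.0 - level) / 2.0
    # Read the interval bounds off the sorted resample means.
    lower = means[int(alpha * n_resamples)]
    upper = means[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return statistics.fmean(values), lower, upper


# Synthetic stand-in for one metric measured across 30 runs (not real data).
data_rng = random.Random(1)
runs = [data_rng.gauss(100.0, 10.0) for _ in range(30)]

mean, lo, hi = bootstrap_ci(runs)
print(f"mean={mean:.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```

Repeating this calculation at every time step would give a solid line (the mean) and a shaded band (the interval) of the kind the caption describes.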