Extended Data Fig. 5: Further results in stationary reinforcement-learning problems.

a, Similar to Fig. 4, the performance of standard PPO drops over time. However, unlike in Fig. 4, the performance of PPO with L2 regularization also worsens over time in Hopper-v3. In contrast, PPO with continual backpropagation and L2 regularization keeps improving over time. b, Comparison of continual backpropagation and ReDo on Ant-v3. The performance of PPO with ReDo and L2 regularization worsens over time, whereas PPO with continual backpropagation and L2 regularization keeps improving over time. c, PPO with standard Adam produces much larger updates in the policy network than PPO with proper Adam (β1 = β2 = 0.99), which explains why PPO with proper Adam performs much better than standard PPO. d, Comparison of two forms of utility in continual backpropagation: a running estimate of the instantaneous utility versus the instantaneous utility alone. Both variations perform similarly. All results are averaged over 30 runs; the solid lines show the mean and the shaded regions correspond to 95% bootstrapped confidence intervals.
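
To make the contrast in panel c concrete, the sketch below shows the two optimizer settings in PyTorch: standard Adam with its usual betas versus "proper" Adam with β1 = β2 = 0.99. The network architecture and learning rate here are illustrative placeholders, not the values used in these experiments.

```python
import torch
import torch.nn as nn

# Placeholder policy network; sizes are illustrative only.
policy_net = nn.Sequential(nn.Linear(64, 256), nn.Tanh(), nn.Linear(256, 8))

# Standard Adam as commonly used with PPO: beta1 = 0.9, beta2 = 0.999.
# The mismatch between the two decay rates can yield very large parameter
# updates when the gradient distribution shifts abruptly.
standard_adam = torch.optim.Adam(policy_net.parameters(), lr=3e-4,
                                 betas=(0.9, 0.999))

# "Proper" Adam as referred to in panel c: beta1 = beta2 = 0.99, keeping the
# first- and second-moment estimates on the same timescale.
proper_adam = torch.optim.Adam(policy_net.parameters(), lr=3e-4,
                               betas=(0.99, 0.99))
```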
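
For panel d, the following is a minimal sketch of the two utility variants compared there, assuming a contribution-style utility (activation magnitude times summed outgoing-weight magnitude) and an assumed decay rate of 0.99 for the running estimate; the exact utility definition used in continual backpropagation may differ in detail.

```python
import numpy as np

def instantaneous_utility(h, w_out):
    """Instantaneous utility of each hidden unit: |activation| times the
    summed magnitude of its outgoing weights (contribution-style utility)."""
    return np.abs(h) * np.abs(w_out).sum(axis=1)

def update_running_utility(u, h, w_out, decay=0.99):
    """Exponentially decayed running estimate of the instantaneous utility.
    decay=0.99 is an assumed value, not taken from these experiments."""
    return decay * u + (1 - decay) * instantaneous_utility(h, w_out)

# Example usage with random activations and outgoing weights.
rng = np.random.default_rng(0)
h = rng.standard_normal(256)            # hidden activations for one step
w_out = rng.standard_normal((256, 8))   # outgoing weights of the hidden layer
u = np.zeros(256)                       # running utility estimate
u = update_running_utility(u, h, w_out)
```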