Fig. 3

Learning curves show the average episodic task reward over the course of training. All results are averaged over 5 seeds, and shaded regions indicate the standard deviation across seeds. The algorithm is evaluated every 10 epochs, and the plotted curves are smoothed.
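
As a rough illustration of how curves of this kind are typically produced, the sketch below aggregates per-seed evaluation rewards into a mean curve with a standard-deviation band. The caption does not specify the smoothing function, the standard-deviation estimator, or whether the first evaluation falls at epoch 0 or epoch 10, so the trailing moving average, the population standard deviation, and the epoch indexing used here are assumptions. The names `aggregate_seeds` and `moving_average` are hypothetical, and the reward values are made up for illustration only.

```python
from statistics import mean, pstdev

EVAL_INTERVAL = 10  # epochs between evaluations, as stated in the caption


def aggregate_seeds(runs):
    """Return per-evaluation mean and standard deviation across seeds.

    `runs` is a list of equal-length reward sequences, one per seed.
    """
    columns = list(zip(*runs))  # one tuple of rewards per evaluation point
    means = [mean(col) for col in columns]
    # Population standard deviation: an assumption, since the caption does
    # not say which estimator was used.
    stds = [pstdev(col) for col in columns]
    return means, stds


def moving_average(values, window=3):
    """Smooth with a trailing moving average (an assumed smoothing choice)."""
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        smoothed.append(mean(values[start:i + 1]))
    return smoothed


# Made-up rewards for 5 seeds at 4 evaluation points (illustrative only).
runs = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 3.0, 4.0, 5.0],
    [3.0, 4.0, 5.0, 6.0],
    [4.0, 5.0, 6.0, 7.0],
    [5.0, 6.0, 7.0, 8.0],
]

means, stds = aggregate_seeds(runs)
# Assumes the first evaluation happens after EVAL_INTERVAL epochs.
epochs = [EVAL_INTERVAL * (i + 1) for i in range(len(means))]
curve = moving_average(means, window=2)

# Bounds of the shaded region around the smoothed mean curve.
lower = [m - s for m, s in zip(curve, stds)]
upper = [m + s for m, s in zip(curve, stds)]
```

The `epochs`, `curve`, `lower`, and `upper` lists can then be passed to any plotting library to draw the line and its shaded band. Smoothing only the mean while leaving the band width unsmoothed is one of several reasonable choices; the standard deviation could be smoothed with the same window.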