Figure 3

Planning. (A) Example of online “planning”, during the observation of the world (gray lines) the model is used to predict \(n_{fut}\) steps in the future. (B) Average (dashed line), standard error (shading) and 80th percentile (solid line) of the reward as a function of the number of interactions with the environment, with (green) and without planning (black, policy gradient only). (C) Average final reward as a function \(n_{fut}\).