Fig. 2
From: Continual deep reinforcement learning with task-agnostic policy distillation

Overview of our task-agnostic policy distillation framework. (a) The task-agnostic phase abstracts a process that alternates between maximizing intrinsic rewards and distillation, following the same alternating pattern as the progress and compress framework. (b) The task-agnostic phase is applied first, before alternating between the progress (P) and compress (C) phases. In the Atari domain, each task is randomly selected from the Meta-Environment and thus corresponds to one game environment. In the C phase, the policy most recently learned by the active column (green) is distilled into the knowledge base (KB, blue) via the KL loss between the active column and the KB, while the KB's previously learned parameters are protected with Elastic Weight Consolidation (EWC). In the P phase, features learned from previous tasks are reused via lateral connections when learning new tasks. \(r^e_k\) and \(r^i\) denote the extrinsic reward of task \(k\) and the task-independent intrinsic reward, respectively; \(h\) denotes a hidden layer.
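
For reference, a minimal sketch of the compress-phase objective implied by the caption, assuming the standard progress and compress formulation (KL distillation from the active column into the KB plus an EWC penalty); the symbols \(\pi^{AC}\), \(\pi^{KB}\), \(\theta^{KB}\), \(\theta^{KB,*}\), \(F_i\), and \(\lambda\) are illustrative and not defined in the figure:

\[
\mathcal{L}_{\mathrm{C}}\!\left(\theta^{KB}\right) \;=\;
\mathbb{E}_{s}\!\left[\,\mathrm{KL}\!\left(\pi^{AC}(\cdot \mid s)\;\big\|\;\pi^{KB}(\cdot \mid s;\theta^{KB})\right)\right]
\;+\;\frac{\lambda}{2}\sum_{i} F_i \left(\theta^{KB}_i - \theta^{KB,*}_i\right)^{2},
\]

where \(\theta^{KB,*}\) are the KB parameters retained from the previous compress step and \(F_i\) is the corresponding Fisher information used by EWC.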