Fig. 2: Reinforcement learning agents with recurrence jointly learn ego- and allocentric tasks.
From: Hippocampus supports multi-task reinforcement learning under partial observability

a Classical hippocampal trisynaptic circuitry: entorhinal cortex (EC), dentate gyrus (DG), and hippocampal CA3 and CA1 layers. b Schematics of reinforcement learning (RL) agents with a hippocampal-like architecture, modelled as deep Q-networks (DQN) and used to learn the goal-driven tasks described in Fig. 1. In our models, the DG receives a simplified (partially observable) map of the environment, which is processed by the CA3-CA1 pathway; CA1 then projects to the reward system to compute the Q-value of state-action pairs, Q(s, a). We consider two main models: (i) with a recurrent CA3 (hcDRQN, top) or (ii) with CA3 as a feedforward network (hcDQN, bottom). Both models consist of two hidden layers (CA3 and CA1) and an output action-readout layer. c Minigrid environment showing the 3 × 3 and full view sizes (orange outline). d Performance of all models on allocentric (left) and egocentric (middle) tasks. For comparison with modern machine learning solutions to multi-task learning, we also consider two popular algorithms: elastic weight consolidation (ML-EWC) and synaptic intelligence (ML-SI). Data are presented as mean values ± SEM (n = 240). p_allo (bottom to top) = 6.844e−167, 2.304e−92, 2.273e−83, 1.293e−69; p_ego (bottom to top) = 8.004e−02, 3.616e−01, 1.940e−224, 4.395e−183. e Task performance of RL agents as environment observability is progressively increased. Both models achieve the same performance under full observability, whereas only the hcDRQN agent can learn the tasks under partial observability. Data are presented as mean values ± SEM over 5 different initial conditions. f Learning curves for both hcDQN and hcDRQN, showing that the former fails to learn allocentric tasks. Vertical dashed lines mark task-switch points. Data are presented as mean values ± SEM over 5 different initial conditions. g Task performance aligned to the point of the task switch (left). Performance drop and recovery after task switching for both hcDRQN and hcDQN (right). Data are presented as mean values ± SEM (n = 25). p_allo,drop = 1.18e−05, p_allo,rec = 2.71e−03, p_ego,drop = 0.431, p_ego,rec = 0.903. **: p < 0.01, ***: p < 0.001, ****: p < 0.0001; ns indicates not significant (independent two-sample, two-sided t-tests across models). Icons used in panel c are released by OpenMOJI under a Creative Commons Attribution-ShareAlike 4.0 license. Source data are provided as a Source Data file.
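
For readers who want a concrete picture of the panel-b architecture, the following PyTorch sketch mirrors the described pathway: a DG layer embeds the partially observable map, CA3 is either recurrent (hcDRQN) or feedforward (hcDQN), and CA1 projects to a linear readout of Q(s, a). Layer sizes, activations, and the choice of a GRU for the recurrent CA3 are illustrative assumptions, not the paper's exact hyperparameters.

```python
import torch
import torch.nn as nn

class HippocampalDQN(nn.Module):
    """Sketch of the hcDQN/hcDRQN agents in panel b: DG input layer,
    two hidden layers (CA3, CA1), and an action readout producing
    Q(s, a). Sizes are illustrative, not the paper's values."""

    def __init__(self, obs_dim, n_actions, hidden=128, recurrent=True):
        super().__init__()
        self.recurrent = recurrent
        self.dg = nn.Linear(obs_dim, hidden)        # DG: embeds the partial map
        if recurrent:                               # hcDRQN: recurrent CA3
            self.ca3 = nn.GRU(hidden, hidden, batch_first=True)
        else:                                       # hcDQN: feedforward CA3
            self.ca3 = nn.Linear(hidden, hidden)
        self.ca1 = nn.Linear(hidden, hidden)        # CA1
        self.q_head = nn.Linear(hidden, n_actions)  # readout: Q(s, a)

    def forward(self, obs, h=None):
        # obs: (batch, time, obs_dim) for the recurrent agent,
        #      (batch, obs_dim) for the feedforward one.
        x = torch.relu(self.dg(obs))
        if self.recurrent:
            x, h = self.ca3(x, h)  # hidden state carries memory across steps
        else:
            x = torch.relu(self.ca3(x))
        x = torch.relu(self.ca1(x))
        return self.q_head(x), h
```

The only structural difference between the two agents is the CA3 block; the recurrent hidden state is what lets hcDRQN integrate observations over time under partial observability.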
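Panels c and e vary the size of the agent's egocentric view window. Assuming the standard MiniGrid/Gymnasium API (the environment id and seed below are placeholders, not necessarily the paper's exact task), a restricted 3 × 3 view can be requested like this:

```python
import gymnasium as gym
import minigrid  # importing minigrid registers the MiniGrid environments

# Restrict the agent's egocentric view to a 3 x 3 window (panel c, orange outline).
env = gym.make("MiniGrid-Empty-8x8-v0", agent_view_size=3)
obs, info = env.reset(seed=0)
print(obs["image"].shape)  # (3, 3, 3): the partially observable map fed to DG
```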
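The asterisks and p-values in panels d and g come from independent two-sample, two-sided t-tests across models. A minimal SciPy sketch of that comparison, with randomly generated placeholder scores standing in for the Source Data (n = 25 per model, as in panel g):

```python
import numpy as np
from scipy.stats import ttest_ind

# Placeholder per-run task scores for the two agents; in the paper
# these values come from the provided Source Data file.
rng = np.random.default_rng(0)
hcdrqn_scores = rng.normal(0.9, 0.05, size=25)
hcdqn_scores = rng.normal(0.5, 0.05, size=25)

# Independent two-sample, two-sided t-test across models.
t_stat, p_value = ttest_ind(hcdrqn_scores, hcdqn_scores)
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")
```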