Fig. 1

From: Optimizing agent behavior over long time scales by transporting value

Task setting and Reconstructive Memory Agent. a The three-phase task structure. In phase 1 (P1), there is no reward, but the agent must seek information or trigger an event. In phase 2 (P2), the agent performs a distractor task that delivers reward. In phase 3 (P3), the agent can acquire a distal reward, depending on its behavior in P1. At each time step, the RL agent takes in observations \({{\bf{o}}}_{t}\), produces actions \({{\bf{a}}}_{t}\), and passes memory state to the next time step. b The Passive Visual Match task: the agent passively observes a colored square on the wall in P1 (gray here), consumes apples in P2, and must select, from a lineup in P3, the square it observed in P1. The agent and colored square are indicated by the yellow and red arrows, respectively. c The Reconstructive Memory Agent (RMA) takes in observations, \({{\bf{o}}}_{t}\), encodes them, \({{\bf{e}}}_{t}\), compresses them into a state variable \({{\bf{z}}}_{t}\), and decodes from \({{\bf{z}}}_{t}\) the observations and a value prediction \({\hat{V}}_{t}\). The state variable is also passed to an RNN controller \({{\bf{h}}}_{t}\) that can retrieve (or read) memories \({{\bf{m}}}_{t}\) from the external memory \({M}_{t}\) using content-based addressing with search keys \({{\bf{k}}}_{t}\). \({{\bf{z}}}_{t}\) is inserted into the external memory at the next time step, and the policy \({\pi }_{t}\) stochastically produces an action \({{\bf{a}}}_{t}\) as a function of \(({{\bf{z}}}_{t},{{\bf{m}}}_{t},{{\bf{h}}}_{t})\) (only \({{\bf{z}}}_{t}\) shown). d The RMA solves the Passive Visual Match task, outperforming a comparable agent without the reconstruction objective and decoders (LSTM+Mem) and an agent without external memory (LSTM). An agent that randomly chooses in P3 would achieve a score of \(3.25\). Learning curves show standard error about the mean, computed over five independent runs. e The RMA uses its attentional read weights at time step 526 in P3 to retrieve the memories stored on the first few time steps of the episode in P1, when it was facing the colored square, to select the corresponding square and acquire the distal reward, worth ten points.
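Panel c describes a single dataflow per time step: encode, compress, reconstruct, read memory, and act. The sketch below illustrates that dataflow for one RMA-style step in PyTorch. It is a minimal illustration, not the authors' architecture: the class name `RMACell`, the dimensions (`obs_dim`, `z_dim`, `h_dim`, `n_slots`, `n_actions`), and the single-layer linear modules are all hypothetical simplifications, and the memory write is folded into the end of the current step rather than deferred to the next one as in panel c.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMACell(nn.Module):
    """One illustrative RMA-style time step (all sizes hypothetical)."""

    def __init__(self, obs_dim=64, z_dim=32, h_dim=128, n_slots=100, n_actions=4):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, z_dim)                 # o_t -> e_t
        self.compress = nn.Linear(z_dim + h_dim + z_dim, z_dim)  # (e_t, h_{t-1}, m_{t-1}) -> z_t
        self.decoder = nn.Linear(z_dim, obs_dim)                 # z_t -> reconstructed o_t
        self.value = nn.Linear(z_dim, 1)                         # z_t -> V_hat_t
        self.controller = nn.LSTMCell(z_dim, h_dim)              # RNN controller h_t
        self.key = nn.Linear(h_dim, z_dim)                       # h_t -> search key k_t
        self.policy = nn.Linear(z_dim + z_dim + h_dim, n_actions)  # pi(a | z_t, m_t, h_t)
        self.n_slots = n_slots

    def forward(self, o_t, h, c, m_prev, M, t):
        e_t = torch.relu(self.encoder(o_t))                          # encode observation
        z_t = self.compress(torch.cat([e_t, h, m_prev], dim=-1))     # compress to state variable
        o_hat = self.decoder(z_t)                                    # reconstruction target
        v_hat = self.value(z_t)                                      # value prediction
        h, c = self.controller(z_t, (h, c))                          # update controller state
        k_t = self.key(h)                                            # content-based search key
        w = F.softmax(F.cosine_similarity(M, k_t, dim=-1), dim=0)    # attentional read weights
        m_t = (w @ M).unsqueeze(0)                                   # retrieved memory m_t
        M = M.clone()
        M[t % self.n_slots] = z_t.detach().squeeze(0)                # store z_t in external memory
        logits = self.policy(torch.cat([z_t, m_t, h], dim=-1))
        a_t = torch.distributions.Categorical(logits=logits).sample()  # stochastic policy
        return a_t, o_hat, v_hat, (h, c, m_t, M)

# Example of stepping the cell once with zero-initialized state.
cell = RMACell()
h = torch.zeros(1, 128); c = torch.zeros(1, 128)
m = torch.zeros(1, 32); M = torch.zeros(100, 32)
a, o_hat, v_hat, (h, c, m, M) = cell(torch.randn(1, 64), h, c, m, M, t=0)
```

Under this reading, the LSTM+Mem ablation in panel d corresponds to dropping the reconstruction loss on `o_hat` (and its decoder) while keeping the external memory, and the LSTM ablation corresponds to removing the memory read and write entirely.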
