Fig. 5: Actor–critic framework applied to a navigation task akin to the Morris water maze using in-software-emulated memristors.

a, Illustration of the task objective. An artificial agent is placed at a random location in the maze (starting position; blue) and navigates through the maze (trajectory; blue) until it reaches the reward (hidden platform; red). To make the task more challenging, a U-shaped obstacle (grey) forces the agent to reach the reward from above (ref. 10). At each step in the maze, the agent can select one out of eight actions (inset). The two-dimensional continuous state space is mapped to a grid of 11 × 11 overlapping place cells (black dots denote the centres of the place cells). The closer the agent is to a place cell, the larger that cell's input activation, given by Gaussian radial basis functions (RBFs; green circles correspond to one standard deviation of the Gaussian function). b, Mean and standard deviation of the number of steps per episode across 100 random seeds, learned with in-software-emulated memristors (light blue), compared with the case in which the update noise is set to the minimum of all memristors (dark blue). The black dotted line indicates the mean and standard deviation achievable by a noiseless, ideal learner. Note that a random-walk policy would take the agent 450 steps on average to find the reward. c, Mean policy map after training on in-software-emulated memristors. Each arrow points towards the average direction taken by the agent. The colour and length of the arrows are scaled according to the probability of taking the shown direction. From any initial position, actions leading to the target area are learned with probabilities that increase closer to the reward, as expected from ref. 10. d, Mean state value map after training. The closer a state is to the reward area, the higher its learned value.
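The following is a minimal sketch of the state encoding and action set described in panel a, plus the arrow computation of the policy map in panel c. The maze extent, the RBF width SIGMA, and the helper names are assumptions for illustration; only the 11 × 11 place-cell grid, the Gaussian RBF activation, and the eight-way action set come from the caption.

```python
# Sketch of the place-cell encoding (panel a) and policy-map arrows (panel c).
# GRID comes from the caption; MAZE_MIN/MAZE_MAX and SIGMA are assumed values.
import numpy as np

GRID = 11                      # 11 x 11 overlapping place cells (panel a)
MAZE_MIN, MAZE_MAX = 0.0, 1.0  # assumed normalised maze coordinates
SIGMA = 0.1                    # assumed RBF width (one standard deviation)

# Centres of the place cells on a regular grid, shape (121, 2)
centres = np.stack(np.meshgrid(
    np.linspace(MAZE_MIN, MAZE_MAX, GRID),
    np.linspace(MAZE_MIN, MAZE_MAX, GRID),
), axis=-1).reshape(-1, 2)

def place_cell_activation(pos):
    """Gaussian RBF activation of each place cell for agent position `pos`.

    The closer the agent is to a cell centre, the larger that cell's activation.
    """
    sq_dist = np.sum((centres - np.asarray(pos)) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * SIGMA ** 2))   # shape (121,)

# Eight possible actions: unit steps in the eight compass directions (inset of panel a)
angles = np.arange(8) * 2.0 * np.pi / 8.0
ACTIONS = np.stack([np.cos(angles), np.sin(angles)], axis=-1)   # shape (8, 2)

def mean_policy_arrow(action_probs):
    """Average direction implied by the action probabilities at one location,
    as visualised by the arrows of the policy map (panel c); the arrow length
    scales with how concentrated the probabilities are on one direction."""
    return np.asarray(action_probs) @ ACTIONS

# Example: encode an arbitrary position and compute its mean policy arrow
activation = place_cell_activation([0.3, 0.7])
arrow = mean_policy_arrow(np.full(8, 1.0 / 8.0))   # uniform policy gives a zero arrow
```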