Fig. 4: Bio-inspired actor–critic RL demonstrated on a proof-of-concept T-maze navigation task. | Nature Machine Intelligence

From: Actor–critic networks with analogue memristors mimicking reward-based learning

a, An agent must navigate through a T-maze environment made of nine states labelled 0–8 to reach a reward (state 6). Place cells represent the position of the agent, each corresponding to a distinct state in the environment (one-hot encoded). Two actor neurons encode forwards/backwards movements (all states except 4) or left/right movements (state 4). The single critic neuron computes an estimate of the value of a given state. The ideal policy and value maps are illustrated at the bottom. The synaptic weights connecting place cells to actor neurons (θij) and place cells to the critic neuron (wj) are implemented by memristors. For each run, two of the nine critic weights are implemented in hardware, whereas the remaining critic weights and all actor weights are emulated in software. Five distinct configurations were therefore investigated to measure all critic weights in hardware: (w0, w1), (w2, w3), (w4, w5), w6 and (w7, w8), where each wj corresponds to a distinct device. The hardware memristors are updated in an online manner and perform the in-memory calculation of the desired update Δwdes. b, Measured critic weights w0–w8 (orange) over 200 episodes compared with the ideal software case (blue). The software curves correspond to the average of 1,000 distinct runs, with light blue indicating two standard deviations. Each critic weight was implemented by a different memristor. c, Number of steps per episode required to reach the reward as a function of the episode number, extracted as the mean value over all the configurations shown in b. The learned policy approaches the optimal one (six steps between the initial position and the reward). d, Mean values and error bars of the critic weights after 200 episodes for runs with memristors emulated in software (blue) compared with the experimental values (yellow crosses). The emulated values are the average of 1,000 distinct runs, with the error bars indicating two standard deviations. e, Histogram of the error between the measured (Δwdes,meas) and theoretically expected (Δwdes,exp) values for the in-memory calculation of the weight update. Δwdes,meas is obtained using equation (8) (Methods). f, Comparison between the experimental update error (top) and the update noise (bottom) extracted from the corresponding potentiation and depression curves. The update error is smaller than or equal to the extracted update noise. The resulting accuracy allows exact tuning of the analogue memristor weights, confirming the feasibility of in-memory/online learning. The dashed lines in the top histogram denote the hardware error limits shown in e. Panel a created with BioRender.com.
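The scheme in panel a is a one-step actor–critic with temporal-difference (TD) learning over one-hot place-cell features, which in software reduces to a tabular algorithm. The sketch below is a minimal illustration of that general scheme, not the paper's implementation: the exact maze transitions, reward magnitude, discount factor and learning rates (alpha, beta, gamma) are our assumptions, and softmax action selection stands in for whatever exploration strategy the authors use.

```python
import numpy as np

N_STATES, N_ACTIONS = 9, 2      # states 0-8, two actor neurons
START, REWARD_STATE = 0, 6      # reward in the left arm (state 6)

# Hypothetical maze layout: stem 0-1-2-3-4, left arm 4-5-6, right arm 4-7-8.
FWD = {0: 1, 1: 2, 2: 3, 3: 4, 5: 6, 7: 8, 8: 8}   # action 0: forward/left
BWD = {0: 0, 1: 0, 2: 1, 3: 2, 5: 4, 7: 4, 8: 7}   # action 1: backward/right

def transition(s, a):
    if s == 4:                   # junction: choose left or right arm
        return 5 if a == 0 else 7
    return FWD[s] if a == 0 else BWD[s]

def train(episodes=200, alpha=0.2, beta=0.2, gamma=0.9, seed=0, max_steps=200):
    rng = np.random.default_rng(seed)
    w = np.zeros(N_STATES)                   # critic weights wj (one value per place cell)
    theta = np.zeros((N_ACTIONS, N_STATES))  # actor weights, analogous to the θij
    steps_log = []
    for _ in range(episodes):
        s, steps = START, 0
        while s != REWARD_STATE and steps < max_steps:
            logits = theta[:, s]
            pi = np.exp(logits - logits.max())
            pi /= pi.sum()                   # softmax action probabilities
            a = rng.choice(N_ACTIONS, p=pi)
            s_next = transition(s, a)
            r = 1.0 if s_next == REWARD_STATE else 0.0
            v_next = 0.0 if s_next == REWARD_STATE else w[s_next]
            delta = r + gamma * v_next - w[s]          # TD error
            w[s] += alpha * delta                      # critic update
            grad_log_pi = -pi
            grad_log_pi[a] += 1.0                      # grad of log pi(a|s) for one-hot input
            theta[:, s] += beta * delta * grad_log_pi  # actor update
            s, steps = s_next, steps + 1
        steps_log.append(steps)
    return w, theta, steps_log
```

With these assumed parameters, the per-episode step count in steps_log should drop toward the six-step optimum over training (mirroring panel c), while w should approach a value map that increases toward the reward (mirroring panels b and d).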
