Fig. 1: Bio-inspired RL. | Nature Machine Intelligence

From: Actor–critic networks with analogue memristors mimicking reward-based learning

a, Sketch of a T-maze experiment. A biological agent (a mouse) navigates a maze to find a reward (cheese). The T-maze consists of a stem and two lateral arms, with a reward at the end of one arm. The agent is placed at the bottom of the maze and can take a sequence of actions: moving forwards or backwards and, at the junction of the T-maze, turning right or left. Initially, the action choice is random (black trajectory), but the agent learns to choose actions that maximize the future reward (blue trajectory). b, Sketch of the Morris water maze experiment. A biological agent (a mouse) is placed at a random location in the pool and strives to escape from the cold water onto a hidden platform (reward). The mouse is free to move in any direction. It starts by swimming in circuitous paths (black trajectory) before learning, over many trials, to approach the platform on a nearly direct path (blue trajectory). c, Artificial agent with actor–critic TD learning (schematic). Place cells (state neurons; left) represent the momentary state s_t (position) of the agent and are connected via a weight matrix θ to actor neurons, which encode the probabilities π_i of choosing action i. The place cells are also connected via the weight vector w to a critic neuron representing the value V(s) of the current state. The TD error (δ) is a reward prediction error, analogous to the dopamine signal in biology (refs. 8,11). It is used to update the weights of the actor–critic network (w, θ). d, Illustration of a memristor used as an artificial synapse in the actor–critic network. Memristors implement both the weights w and θ: they perform online weight updates, compute the actions and are used for the in-memory calculation of the weight updates Δw and Δθ. e, Flowchart of the proposed bio-inspired actor–critic RL algorithm, split into a software part and a hardware part. The hardware part employs the memristors to perform all operations highlighted in d.
The software part mostly emulates the role of the environment by taking steps and returning rewards. Dedicated software calculations are kept to a minimum and comprise only sampling the actions, converting the weight updates into the corresponding number of voltage update pulses, and computing the radial basis functions (RBFs) of the input representation. The combination of an online learning algorithm with continuous in-memory weight-update calculation enables error-correcting weight updates: errors caused by device non-idealities are compensated for in the next learning iteration. Panels a and b created with BioRender.com.
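The actor–critic TD loop summarized in panels c–e can be sketched in software. The following is a minimal illustration only, not the paper's memristor implementation: it uses a hypothetical one-dimensional environment, Gaussian place-cell features φ(s) playing the role of the RBF input representation, a softmax actor with weights θ, a linear critic with weights w, and the TD error δ driving the updates Δw = αδφ(s) and Δθ = αδ∇_θ log π. All names, constants and the toy environment are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D track: positions in [0, 1], reward on reaching the right edge.
CENTERS = np.linspace(0.0, 1.0, 10)   # place-cell (RBF) centres
SIGMA = 0.12                          # place-cell width
N_ACTIONS = 2                         # 0 = step left, 1 = step right
GAMMA, ALPHA = 0.9, 0.1               # discount and learning rate (illustrative)

def rbf(s):
    """Place-cell activations: normalized Gaussian RBFs of position s."""
    phi = np.exp(-((s - CENTERS) ** 2) / (2 * SIGMA ** 2))
    return phi / phi.sum()

w = np.zeros(len(CENTERS))                    # critic weights: V(s) = w . phi(s)
theta = np.zeros((N_ACTIONS, len(CENTERS)))   # actor weights: action logits

def policy(phi):
    """Softmax over actor logits theta @ phi -> action probabilities pi_i."""
    logits = theta @ phi
    p = np.exp(logits - logits.max())
    return p / p.sum()

def step(s, a):
    """Toy dynamics: move left/right by 0.1; reward 1 at the right edge."""
    s2 = np.clip(s + (0.1 if a == 1 else -0.1), 0.0, 1.0)
    reward = 1.0 if s2 >= 1.0 else 0.0
    return s2, reward, s2 >= 1.0

for episode in range(300):
    s, done = 0.0, False
    while not done:
        phi = rbf(s)
        pi = policy(phi)
        a = rng.choice(N_ACTIONS, p=pi)
        s2, r, done = step(s, a)
        # TD error delta: reward-prediction error (dopamine-like signal)
        v_next = 0.0 if done else w @ rbf(s2)
        delta = r + GAMMA * v_next - w @ phi
        # Online weight updates Delta_w and Delta_theta, driven by delta
        w += ALPHA * delta * phi
        grad = -pi[:, None] * phi[None, :]   # d(log pi(a|s)) / d(theta)
        grad[a] += phi
        theta += ALPHA * delta * grad
        s = s2

pi_mid = policy(rbf(0.5))  # action probabilities at the centre of the track
```

After training, the actor at the centre of the track strongly prefers the rewarded direction (right). In the hardware version described by the caption, the matrix–vector products and the updates Δw and Δθ would instead be carried out in-memory by the memristor arrays, with the software converting each update into a number of voltage pulses.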