Actor–critic networks with analogue memristors mimicking reward-based learning

Portner, Kevin; Zellweger, Till; Martinelli, Flavio; Bégon-Lours, Laura; Bragaglia, Valeria; Weilenmann, Christoph; Jubin, Daniel; Falcone, Donato Francesco; Hermann, Felix; Hrynkevych, Oscar; Stecconi, Tommaso; La Porta, Antonio; Drechsler, Ute; Olziersky, Antonis; Offrein, Bert Jan; Gerstner, Wulfram; Luisier, Mathieu; Emboras, Alexandros

doi:10.1038/s42256-025-01149-w

Article
Open access
Published: 09 December 2025

Kevin Portner ORCID: orcid.org/0000-0002-1175-9122¹^na1,
Till Zellweger ORCID: orcid.org/0000-0001-9402-8917¹^na1,
Flavio Martinelli ORCID: orcid.org/0009-0007-1514-0718²,
Laura Bégon-Lours ORCID: orcid.org/0000-0003-2520-3317^1,3,
Valeria Bragaglia³,
Christoph Weilenmann¹,
Daniel Jubin³,
Donato Francesco Falcone³,
Felix Hermann³,
Oscar Hrynkevych¹,
Tommaso Stecconi³,
Antonio La Porta³,
Ute Drechsler³,
Antonis Olziersky³,
Bert Jan Offrein³,
Wulfram Gerstner²,
Mathieu Luisier ORCID: orcid.org/0000-0002-2212-7972¹ &
…
Alexandros Emboras¹

Nature Machine Intelligence volume 7, pages 1939–1953 (2025)

Abstract

Advancements in memristive devices have given rise to a new generation of specialized hardware for bio-inspired computing. However, most of these implementations draw only partial inspiration from the architecture and functionalities of the mammalian brain. Moreover, the use of memristive hardware is typically restricted to specific elements within the learning algorithm, leaving computationally expensive operations to be executed in software. Here we demonstrate reinforcement learning through an actor–critic temporal difference algorithm implemented on analogue memristors, mirroring the principles of reward-based learning in a neural network architecture similar to the one found in biology. Memristors are used as multipurpose elements within the learning algorithm: they act as synaptic weights that are trained online, they calculate the weight updates associated with the temporal difference error directly in hardware and they determine the actions to navigate the environment. Owing to these features, weight training can take place entirely in memory, eliminating data movement. We test our framework on two navigation tasks—the T-maze and the Morris water maze—using analogue memristors based on the valence change memory effect. Our approach represents the first step towards fully in-memory and online neuromorphic computing engines based on bio-inspired learning schemes.

Main

With its ability to adapt to new situations, process large amounts of data and generalize from past experiences, the human brain is a beacon of efficiency and computational power. Adjustments in brain connectivity are governed by ‘learning rules’¹ that can be classified as two- or three-factor rules². Although two-factor rules such as Hebbian learning or spike-timing-dependent plasticity^1,3,4 are useful to learn representations of features based on the statistics of the input stream^5,6,7, they lack a mechanism to incorporate neuromodulatory feedback signals that condition learning on reward, punishment or novelty^8,9. By contrast, three-factor rules such as reward-modulated spike-timing-dependent plasticity (R-STDP)¹⁰ not only respond to input statistics but also incorporate an additional modulatory signal, allowing synaptic plasticity to be regulated based on behavioural context. Specifically, three-factor rules rely on Hebbian signals locally available at each synapse combined with a global broadcast signal (third factor)^2,8,9. Three-factor learning rules are closely linked to reinforcement learning (RL)¹⁰, a machine learning approach in which artificial agents acquire knowledge by interacting with their surroundings, striving to maximize cumulative rewards¹¹, similar to biological agents¹². Software implementations of deep RL^13,14 have excelled at complex games, for example, the game of Go¹⁵, and at navigation-related tasks such as autonomous robots, drones and cars^14,16,17. However, the computational demands and power consumption of such software-based RL systems remain substantial, particularly when deployed in real-time, resource-constrained environments. Moreover, the necessity of deep RL to rely on the backpropagation algorithm makes it not only biologically implausible¹⁸ but also difficult to implement in energy-efficient hardware.

Implementing three-factor learning directly in hardware offers an alternative towards more efficient and scalable learning systems^19,20. Such neuromorphic computing approaches^21,22,23,24 have been realized using both complementary metal–oxide–semiconductor (CMOS) and memristive technologies. Although CMOS-based platforms are more mature in terms of integration²⁵, they typically rely on digital architectures with physically separated memory and processing units, resulting in energy and latency bottlenecks in large-scale systems²⁴. By contrast, memristors offer unique capabilities such as analogue, tunable synaptic weights and in-memory matrix–vector multiplications via crossbar arrays^{26,27,28,29,30}, making them particularly attractive in energy-efficient, bio-inspired computing applications. However, to fully exploit these advantages, it is crucial that as many operations as possible are carried out directly on the memristive hardware itself^31,32.

So far, only a few memristor-based implementations of three-factor learning have been reported^33,34,35,36, all limited to individual devices. Sarwat et al. implemented reward-based three-factor learning on memristors by combining light and voltage signals³³. However, as the devices display only digital switching, they could not replicate the analogue nature of biological synapses, preventing the fine-grained synaptic modulation essential for three-factor learning. An alternative to this approach is memristor-based circuits implementing R-STDP^34,35,36. However, R-STDP has been shown to be less powerful than, for example, temporal difference (TD) learning rules^10,11, which are optimally suited to navigation tasks^37,38. Most importantly, an in situ approach in which memristive weights are trained online (that is, trained during runtime²⁷) is still lacking.

Here we propose a framework for memristor-based TD learning. Specifically, we use the so-called TD error as a third factor, analogous to the reward prediction error delivered by the neurotransmitter dopamine in the mammalian brain^8,39,40. The TD error signal is calculated in the critic module of an actor–critic network, which has been proposed as a fundamental architecture for reward-based learning in the brain^{41,42,43,44,45}. Notably, and contrary to previous demonstrations of RL on memristors, our combination of a biologically plausible network architecture and three-factor learning enables the execution of all critical operations in hardware. Our memristors (1) serve as synaptic weights that are trained online and in memory; (2) they determine the next actions of the network; and (3) they compute the weight updates associated with the TD error. As such, our approach minimizes external recourse to software and allows full in-memory training, which enhances the processing speed and efficiency of artificial neural networks⁴⁶. We demonstrate the applicability of our framework in two common neuroscience navigation tasks, the T-maze and the Morris water maze^47,48,49. Our implementation relies on analogue valence change memories (VCM) consisting of a dielectric HfO₂ layer combined with a conductive metal oxide (CMO)^50,51,52,53; however, it is independent of the memristor type.

Results

Memristor-based actor–critic TD learning

As a testbed for our bio-inspired three-factor learning framework on memristors, we consider the T-maze and Morris water maze navigation tasks (Fig. 1a,b, respectively). Both mazes are widely used to gain insight into the spatial learning and memory of animals^47,48,49. In both experiments, a biological agent (mouse) initially moves randomly (trial 1), but gradually learns to take an efficient (direct) path to a predefined reward (trial n). The improvement in the mouse’s spatial navigation, trial by trial, is an example of reward-based learning^37,38. Here we formulate both navigation tasks in a biologically plausible actor–critic network (Fig. 1c)^42,43,44,45.

The network has two main parts, an actor and a critic component. The critic computes the value of each spatial position or state¹¹, which measures its proximity to the reward, whereas the actor network represents the learned action choices. The actor–critic architecture exhibits multiple similarities with the functions of certain brain areas^54,55,56,57 (Supplementary Fig. 1) and with widely used computational neuroscience models^37,38,58,59. At time t, the agent’s position in the two-dimensional environment is given by the state s_t = (s_x, s_y), where s_x and s_y are Cartesian coordinates. This position is encoded by n place cells⁶⁰ with activities (x₁; x₂…x_n) = x_t. Place cells are implemented as radial basis functions (RBFs) and form a fixed input layer to the actor–critic network (Methods). Given the agent’s position in the place-cell representation, a single actor–critic layer is sufficient to learn complex navigation tasks^61,62, thereby reducing the size and complexity of typical RL networks.

In our network, the connection between any place cell j and the critic neuron (a single one in our model) is characterized by a weight w_j. The activity of the critic neuron, V(s_t) = ∑_j w_j x_j, assigns a value V(s_t) to each state s_t. In the actor network, each neuron i encodes a different action a_i, such as moving forwards, left or right (Methods provides details on the action selection). The weights, labelled θ_ij, connect the place cells j (pre-synaptic neurons) to the actions i (post-synaptic neurons). The weights of both networks are adjusted by a neo-Hebbian three-factor rule^2,9,63,64, which depends on the activity of the pre-synaptic (j) and post-synaptic (i) neuron via a ‘Hebbian’ coincidence detection ‘H’ (Methods) and a third factor called 3^rd:

$$\Delta {\theta }_{ij}=\alpha \times {3}^{rd}\times {H}^{act}(i,j),\,\Delta {w}_{j}=\alpha \times {3}^{rd}\times {H}^{cri}(j),$$

(1)

where α is a learning rate parameter and the superscripts act and cri refer to actor and critic, respectively. The third factor 3^rd is related to phasic dopamine signals in the brain^8,54 and represents the TD error δ_t (ref. ¹¹):

$${3}^{rd}={\delta }_{t}=r({s}_{t})+\gamma \times V({s}_{t+1})-V({s}_{t}).$$

(2)

Here r(s_t) is the reward received at time t, the parameter γ ≤ 1 is a discount factor, V(s_t) is the value of the state s at time t and V(s_{t+1}) is the value of the next state at time t + 1 after an action has been taken. The discount factor sets the importance of distant rewards relative to immediate ones¹³. This architecture and learning rule differ from a previously reported memristor-based actor–critic network⁶⁵ employing backpropagation and gradient descent algorithms, two techniques that are less biologically plausible than those implemented here¹⁸.
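As a minimal sketch of equations (1) and (2) for the critic, the snippet below computes the TD error and applies one three-factor update. The Hebbian term is defined in the Methods; here it is assumed, for illustration only, that H^cri(j) equals the pre-synaptic place-cell activity x_j, a common choice for a linear critic. All function names and the toy numbers are hypothetical.

```python
def critic_value(w, x):
    """V(s) = sum_j w_j * x_j for critic weights w and place-cell activities x."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x))


def td_error(reward, v_next, v_now, gamma):
    """Equation (2): delta_t = r(s_t) + gamma * V(s_{t+1}) - V(s_t)."""
    return reward + gamma * v_next - v_now


def critic_update(w, x, delta, alpha):
    """Equation (1) for the critic, ASSUMING H^cri(j) = x_j (not stated in this section)."""
    return [w_j + alpha * delta * x_j for w_j, x_j in zip(w, x)]


# Toy example with three one-hot place cells (values are illustrative only).
w = [0.0, 0.5, 1.0]
x_now = [0.0, 1.0, 0.0]   # agent currently in state 1
x_next = [0.0, 0.0, 1.0]  # agent moves to state 2
delta = td_error(0.0, critic_value(w, x_next), critic_value(w, x_now), gamma=0.9)
w_new = critic_update(w, x_now, delta, alpha=0.1)
```

With one-hot inputs, only the weight of the currently active place cell changes, which is the situation exploited by the hardware learning loop described later in the text.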

In our actor–critic learning scheme, analogue memristors act as the artificial synapses θ_ij and w_j (Fig. 1d). The memristor weights are not only employed to statically encode the learned actions and values but they also (i) compute the actions, (ii) determine the weight updates based on the TD error (in-memory weight update calculation) and (iii) are updated in an online fashion according to equation (1) (online learning). Importantly, the same memristors are employed to perform all these tasks, highlighting their multifunctional role. Moreover, our implementation does not require any weight read-outs, which reduces data movement. The flowchart in Fig. 1e summarizes a single time step of the algorithm, segmented into its software and hardware components. Some operations are still performed in software but are limited to the interaction of the agent with its environment, a few computationally inexpensive operations (that is, action sampling, conversion of weight update into pulses) and the evaluation of RBFs to obtain place-cell representations.

Executing the majority of the algorithm in memory, as here, is known to enhance the processing speed and lower the energy consumption of learning processes on crossbar arrays⁴⁶. By contrast, conventional RL implementations on memristive hardware only partially exploit in-memory computing. Typically, (1) the action and the weight update computation—both complex operations—are carried out in software and (2) the memristor weights are continuously read-out in case of write–verify schemes^66,67,68. Supplementary Table 1 compares our TD learning hardware implementation with previous demonstrations of RL algorithms on memristors. As the TD error is calculated at each step, errors due to imperfect weight updates on memristors are trained away in the next iteration, providing error-correcting capabilities. With our online learning strategy, the errors are automatically incorporated into the in-memory weight update calculation. The learning loop is detailed in the ‘In-memory learning loop’ section.

Analogue memristor synapses as active components of actor–critic networks

Our actor–critic framework relies on analogue memristors as hardware components consisting of a W/TiN/CMO/HfO₂/TiN stack and operating on the VCM effect (Fig. 2a). Both HfO₂ and CMO layers are involved in the switching process: a conductive filament grows through the HfO₂ layer, whereas the CMO acts as an oxygen reservoir layer^52,69,70. This CMO–HfO₂ bilayer film offers better analogue switching characteristics than Ti–HfO₂ stacks⁵².

**Fig. 2: Fabrication, characterization and analysis of the analogue, VCM-type memristors used in this study.**

The fabrication procedure, which employs processes compatible with CMOS and back-end-of-line (BEOL) technologies, closely follows the method presented in ref. ⁵², differing only by the use of electron-beam lithography instead of optical lithography steps. A detailed description of the fabrication process is provided in Extended Data Fig. 1. Our memristors have an active area of 600 × 600 nm² (Fig. 2b). The focused-ion-beam cross-section image (Fig. 2c) displays the deposited material stack.

Under direct current (d.c.) operation, the HfO₂–CMO bilayer exhibits reproducible resistive switching properties, as demonstrated by the resistance–voltage (R–V) characteristics (Fig. 2d). Both transitions from the high-resistance state to the low-resistance state and from the low-resistance state to the high-resistance state are gradual. Such a behaviour originates from the modulation of oxygen vacancies within the CMO layer, and from the confinement of the electric field and temperature within the HfO₂–CMO bilayer⁷¹. This confinement, in turn, enables controlled and continuous changes in electrical conductivity^52,72. We also investigated synaptic potentiation and depression by stimulating our devices with identical pulse trains. Figure 2e reports ten full potentiation/depression cycles together with their mean values. The measurement data demonstrate a gradual and controlled switching during pulsed operations, resulting in multiple reproducible non-volatile states that represent the actor (θ_ij) and critic (w_j) weights in our actor–critic network (Fig. 1c).

We quantified the noise level in the potentiation and depression curves for multiple devices by subtracting the measurement data from their respective means (Fig. 2f). The obtained histogram serves as a measure of the update noise within our devices. The noise can be attributed to the read operation⁶⁸, cycle-to-cycle variability and the measurement setup (Methods and Supplementary Fig. 3). Although protons and water molecules could be incorporated during layer deposition and are known to play a crucial role in the switching operation⁷³, no detailed analysis was performed, as endurance measurements (not shown here) demonstrated reliable operation up to 10⁸ programming cycles. The same applies to interfacial reactions, such as the possible oxidation of the TiN electrodes.

In-memory learning loop

Our actor–critic RL framework supports learning, that is, the adaptation of θ_ij and w_j, on both hardware and in-software-emulated memristors. The training of critic weights (actor weights work analogously) is depicted in Fig. 3a in the case of one-hot encoding: only one place cell is active at a time (Supplementary Note 1), which is effective if the state space is limited.

The learning process begins with the calculation of the desired weight update in hardware (Fig. 3a(i)), denoted as Δw_des. This is done in memory by performing an in situ vector–vector multiplication using two critic memristors (w_i and w_{i+1}) along with a fixed-value resistor (w_fixed) that has a weight equal to 1. Since in one-hot encoding each critic weight stores the value V(s) of one state, w_i and w_{i+1} are chosen such that they correspond to V(s_t) (current state) and V(s_{t+1}) (next state) and are, thus, labelled w_t and w_{t+1}, respectively. In other words, the same memristors store the critic weights and are used to calculate the weight update. By applying the voltages U₁ = α × r(s_t), U₂ = α × γ and U₃ = −α to w_fixed, w_{t+1} and w_t, respectively, the desired weight update is obtained for the memristor storing w_t:

$$\Delta {w}_{des}=\alpha\times r({s}_{t})+\alpha \times \gamma \times V({s}_{t+1})-\alpha \times V({s}_{t}).$$

(3)

The update is determined by measuring the total output current I_tot resulting from the vector–vector multiplication. The detailed derivation is provided in the Methods and Supplementary Note 2, whereas the general case of beyond one-hot encoding with higher-dimensional input activations is discussed in Supplementary Note 7.
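The vector–vector multiplication behind equation (3) can be emulated in a few lines. This is a software sketch only: conductances are treated as dimensionless weights, Ohm's law and current summation are idealized, and the function name `in_memory_update` is hypothetical.

```python
def in_memory_update(w_t, w_next, reward, alpha, gamma, w_fixed=1.0):
    """Emulate I_tot = U1*w_fixed + U2*w_{t+1} + U3*w_t (idealized, noise-free)."""
    voltages = [alpha * reward, alpha * gamma, -alpha]  # U1, U2, U3 from the text
    weights = [w_fixed, w_next, w_t]                    # devices each voltage is applied to
    # Each device contributes voltage * conductance; the currents add on a shared line.
    return sum(u * g for u, g in zip(voltages, weights))


# Illustrative numbers: V(s_t) = 0.5, V(s_{t+1}) = 1.0, no reward at this step.
dw_des = in_memory_update(w_t=0.5, w_next=1.0, reward=0.0, alpha=0.1, gamma=0.9)
```

The result equals α times the TD error of equation (2), so no separate read-out of w_t or w_{t+1} is needed to obtain the update.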

In the second part of the learning algorithm (Fig. 3a(ii)), the desired weight update Δw_des is converted into the number of update pulses Δp = Δw_des × N, assuming a linear potentiation and depression of the memristor conductance. Here Δw_des corresponds to the extracted output current I_tot = Δw_des, and N is the total number of pulses, which is a constant (200 in our case). This conversion is currently done in software, but could be directly implemented on chip^24,74,75.
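The conversion Δp = Δw_des × N can be sketched as follows. Rounding to the nearest integer and reading the sign as potentiation (positive) or depression (negative) are assumptions made for this illustration; the text only states the linear relationship and N = 200.

```python
N_PULSES = 200  # total number of pulses across the conductance range, as stated in the text


def pulses_from_update(dw_des, n_pulses=N_PULSES):
    """Convert a desired weight update into a signed pulse count (rounding is an assumption)."""
    return int(round(dw_des * n_pulses))


dp_up = pulses_from_update(0.04)      # positive: potentiation pulses
dp_down = pulses_from_update(-0.015)  # negative: depression pulses
```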

As the third step of the learning loop, the conductance of the targeted memristor is updated by applying Δp pulses (Fig. 3a(iii)). The assumed linear ‘Δw versus Δp’ relationship during the calculation of Δp introduces an update error ϵ₁ because of the nonlinearity of our memristors’ potentiation/depression curves (Fig. 2e). This error ϵ₁ is particularly pronounced near the devices’ minimum and maximum conductance values (saturation regions), where deviations from linear weight updates are the largest (Extended Data Fig. 2). A second error ϵ₂ arises because of update noise, as introduced in Fig. 2f. Both errors are automatically taken into account and, therefore, compensated for by the in-memory weight update calculation in the next iteration of the learning loop (Fig. 3a(iv)), leading to an error correction mechanism (Methods and Supplementary Note 4). The learning loop for in-software-emulated memristors is discussed in Supplementary Note 3.
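The error-correcting effect of recomputing the update from the actual device state can be illustrated with a toy model. Everything below is a hypothetical sketch: the saturating device model, its `nonlinearity` parameter and the fixed `target` (standing in for r + γ × V(s_{t+1})) are assumptions, update noise is omitted for determinism, and none of it reproduces the measured potentiation/depression curves.

```python
def apply_pulses(w, dp, n_pulses=200, nonlinearity=3.0):
    """Toy device: each pulse moves w by a step that shrinks near the bounds 0 and 1."""
    for _ in range(abs(dp)):
        if dp > 0:
            w += (1.0 - w) * nonlinearity / n_pulses  # saturating potentiation
        else:
            w -= w * nonlinearity / n_pulses          # saturating depression
    return min(1.0, max(0.0, w))


def train(target, closed_loop, alpha=0.1, n_pulses=200, iterations=300):
    """Drive one weight towards `target` using the linear pulse conversion."""
    w_device = 0.0   # actual state of the nonlinear device
    w_assumed = 0.0  # state an ideal linear device would have reached
    for _ in range(iterations):
        # Closed loop: the update is computed from the device itself (in memory).
        # Open loop: the update is computed from the assumed linear weight.
        w_seen = w_device if closed_loop else w_assumed
        dp = int(round(alpha * (target - w_seen) * n_pulses))
        w_device = apply_pulses(w_device, dp, n_pulses)
        w_assumed += dp / n_pulses
    return w_device


closed_err = abs(train(0.8, closed_loop=True) - 0.8)
open_err = abs(train(0.8, closed_loop=False) - 0.8)
```

In this toy setting the closed-loop weight settles near the target despite the nonlinear steps, whereas the open-loop weight keeps the accumulated deviation, which mirrors the qualitative argument made above.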

Although ϵ₁ impacts the weight update, the assumption of a linear update offers substantial advantages over more complex schemes⁷⁶. In particular, it avoids read-outs of the current weights to determine the position within the potentiation/depression model curves and eliminates the need to store the true weight update curves of each memristor. Hence, faster and more energy-efficient in situ weight update processes are possible²⁷.

The calculation of Δw_des in hardware assumes a linear ‘resistance versus voltage’ relationship and includes the measurement noise. We, thus, empirically verify the accuracy of the in-memory weight update evaluation through a comparison between the measured (Δw_des,meas) and expected (Δw_des,exp) weight updates (Fig. 3b) for different weight combinations of the two memristors involved, namely, w_t and w_{t+1}. In all combinations tested, we found good agreement between the experimental measurements and theoretical expectations, with an error below 3%.

To further examine the error-correcting properties of our approach, we consider the standard deviation of the trained critic weights across 1,000 independent training runs in a simulated actor–critic RL task (Fig. 3c; Methods). The incorporation of the error terms ϵ₁ and ϵ₂ into the calculation of the desired weight update is responsible for the error-correcting feedback loop. This results in reduced variability in the learned weights, as indicated by the lower standard deviation compared with the case in which the error terms are not compensated for: the errors do not accumulate over several episodes, but are trained away at every iteration. Furthermore, with error correction, the error bars become narrower, corresponding to a smaller spread in the weights.

Learning in discrete space using analogue memristors

The aforementioned analogue memristors are used to solve the T-maze navigation task (Fig. 1a) in discrete space. It involves an agent navigating through the maze and adapting its policy to locate a reward. The environment consists of nine states labelled 0–8 with the reward located in the left corner of the T-maze (state 6). The limited state space allows for the use of one-hot encoding to represent the agent’s current position in the maze. This means that only one place cell in Fig. 1c is active at a time, which directly corresponds to the agent’s location. This encoding eliminates the need for RBFs, and the activation x_t is equal to the state s_t. Each place cell is then connected to two actor neurons that encode the possible actions and one critic neuron that computes a value estimate of the current location. The actions can either be ‘moving forwards/backwards’ (all states except state 4) or ‘moving left/right’ (state 4).
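A possible discrete T-maze environment is sketched below. The text fixes the nine states, the reward at state 6, the junction at state 4 and the two actions per state; the exact layout (stem 0 to 4, left arm 5 and 6, right arm 7 and 8), the behaviour at dead ends and the reward magnitude of 1 are assumptions made here for illustration.

```python
REWARD_STATE = 6  # left corner of the maze, as stated in the text
JUNCTION = 4      # the only state with left/right actions

# ASSUMED layout: stem 0-1-2-3-4, left arm 4-5-6, right arm 4-7-8.
FORWARD = {0: 1, 1: 2, 2: 3, 3: 4, 5: 6, 7: 8}
BACKWARD = {1: 0, 2: 1, 3: 2, 5: 4, 6: 5, 7: 4, 8: 7}


def step(state, action):
    """Return (next_state, reward, done). Action 0 is forwards/left, action 1 is backwards/right."""
    if state == JUNCTION:
        next_state = 5 if action == 0 else 7
    elif action == 0:
        next_state = FORWARD.get(state, state)   # dead ends keep the agent in place (assumption)
    else:
        next_state = BACKWARD.get(state, state)
    done = next_state == REWARD_STATE
    return next_state, (1.0 if done else 0.0), done


# Follow the direct path from the start: four steps forwards, left, forwards.
state, steps, done = 0, 0, False
for action in [0, 0, 0, 0, 0, 0]:
    state, reward, done = step(state, action)
    steps += 1
```

Under this assumed layout the direct path takes six steps, which agrees with the optimal trajectory length reported for Fig. 4c.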

The concepts of online learning, in-memory weight update calculation and error correction presented in Fig. 3 are combined to realize bio-inspired learning in an actor–critic network. The latter comprises 27 synaptic weights, including 9 × 2 = 18 actor weights (θ_ij) and 9 × 1 = 9 critic weights (w_j). Each of these weights is implemented by a different hardware memristor (Methods). For every run, two out of the nine critic weights are represented by physical memristors and updated in hardware via online training. The same two hardware devices additionally implement the in-memory weight update calculation introduced in Fig. 3a. The outcome is the learned policy and the corresponding value map (Fig. 4a).

**Fig. 4: Bio-inspired actor–critic RL demonstrated on a proof-of-concept T-maze navigation task.**

The measurements of the learned critic weights w₀–w₈ as a function of the episode number are displayed in Fig. 4b, together with the software runs that use ideal synaptic weights (continuous, linear and no noise). The measured curves follow the software runs, with the agreement being better for critic weights associated with states closer to the reward. Deviations between the measurements and ideal runs can be attributed to fluctuations related to the nonlinear potentiation/depression curves (error term ϵ₁) and update noise (error term ϵ₂). This non-ideal behaviour is the most evident at the start of training and for weights that remain small during training (that is, w₀ – w₂), where ϵ₁ is the most pronounced. However, since these errors are fed back into the in-memory weight update calculation in subsequent iterations, they are gradually corrected over time.

We also tested the implementation of all actor weights in hardware using 18 different memristors (Supplementary Note 6). Extended Data Figure 3 presents the potentiation and depression curves of the memristors utilized for the hardware critic and actor weights. Overall, in-memory training, including online weight updates and weight update calculations, consumes 28.2 μJ of energy (Supplementary Note 8).

The agent initially finds the reward through random exploration, subsequently through the reinforcement of successful trajectories. With an increasing number of episodes, the agent predominantly exploits the stereotypical trajectory it has learned to reach the target. This behaviour can also be observed in Fig. 4c, which illustrates the number of steps required to reach the reward as a function of the episode number. At the beginning of the learning experiment, the number of steps is high because the actor weights are still small, resulting in random action choices. As the learning experiment progresses, the number of steps converges to the optimal trajectory of six steps (Fig. 4c, black dotted line). Even after the correct path has been learned (that is, after around 50 episodes), the finite temperature of the softmax action selection ensures continued exploration via random actions, thereby addressing the exploration–exploitation dilemma (Supplementary Fig. 4)¹¹ and inducing slight fluctuations in the number of steps.

The values of the trained critic weights after the last episode are shown in Fig. 4d. The experimental data are compared with results from runs using only in-software-emulated memristors. All measurements fall within the error bars of the simulated values, proving that the in-software-emulated memristors closely replicate the learning of their physical counterparts.

Importantly, in our measured runs, the weight updates (Δw_des) are directly calculated in hardware, allowing for the entire learning loop to remain in memory. Hence, the algorithm operates without explicit read-outs of the weight values at any point, minimizing off-device computations. We only measured the critic weights for visualization purposes during training.

The in-memory weight update calculations introduce only minimal deviations from the targeted values, as shown in Fig. 4e, which reports the difference between the measured (Δw_des,meas) and expected (Δw_des,exp) in-memory weight update values. The absolute error stays below 0.04, consistent with the reference measurement shown in Fig. 3b. That the error in Δw_des is small becomes even more evident when it is compared with its counterpart introduced by the memristor weight update processes (Fig. 4f, top). Finally, we relate this update error to the overall update noise (Fig. 4f, bottom). The latter is extracted from the corresponding potentiation and depression characteristics of the measured memristors. The update error is smaller than or equal to the extracted update noise, suggesting that weight updates within the analogue programming process of our devices are very accurate, establishing the feasibility of online learning for our memristors.

Learning in continuous space using in-software-emulated memristors

To demonstrate that our framework can be applied to more complex problems and scaled up to larger memristor arrays, we set up the two-dimensional Morris water maze experiment (Fig. 1b)^10,37,38,47 and perform simulations exclusively on the previously validated in-software-emulated memristors (Methods). In this task, the agent is randomly placed at a starting location in a water maze. Its goal consists of reaching a reward in the form of a hidden platform (Fig. 5a). Once this happens, a reward signal is released and the episode ends.

**Fig. 5: Actor–critic framework applied to a navigation task akin to the Morris water maze using in-software-emulated memristors.**

Figure 5b displays the number of required steps to reach the reward versus the episode number. Despite the presence of initially large fluctuations probably caused by the update error of the in-software-emulated memristors, the number of steps gradually converges to a mean value close to its optimum of around 6.5 steps obtained through software runs (black dotted line). We attribute this noise resilience to the error correction mechanism of our approach. Reducing the update noise of each device to the minimum value among all 27 memristors, while keeping the same potentiation/depression curves, allows the agent to more rapidly attain the reward and drastically reduce the number of episodes for convergence (Fig. 5b, dark blue line). A VCM technology with better controlled noise level, such as that demonstrated in ref. ⁷⁷, would, therefore, lead to a faster convergence of the learning process, as indicated by the simulation results shown in Extended Data Fig. 4. Moreover, additional simulations revealed that more linear weight update curves are beneficial (Extended Data Fig. 5 and Supplementary Fig. 5).

To gain further insight into the learning process of in-software-emulated memristors, we examine their synaptic weights for the actor (policy map) and critic (value map) networks after training. Figure 5c displays the mean policy map at every possible position of the agent. The vectors represent the most likely action to be chosen at each location in the maze, whereas their length and colour correspond to the probability of that action. Regardless of the starting position, the learned actions lead the agent directly towards the reward. As expected, the action probability is higher for positions close to the reward¹⁰. Although the maze is symmetric, the action vector field shows slight asymmetries, which are attributed to the nonlinearity of the weight update curves and to the update noise. The corresponding state values at each position are plotted as a colour map (Fig. 5d).

These software results indicate that our framework can potentially be implemented on a physical VCM crossbar array to perform the Morris water maze navigation task. A possible realization is proposed in Supplementary Note 7. Its energy consumption for operations executed on memristors is estimated to be 20 times (39 times) lower than that of a standard memristor (GPU) RL implementation using the backpropagation algorithm (Extended Data Table 1 and the ‘Energy consumption and latency estimation of a crossbar-level implementation’ section).

Discussion

We implemented actor–critic TD learning on analogue memristors to mimic biological reward-based learning. In contrast to other memristor-based systems (Methods and Supplementary Table 1), our approach takes advantage of a local three-factor learning rule, which allows for full in-memory training with memristors acting as synaptic weights and directly calculating weight updates. We demonstrated learning on our analogue computing platform by first implementing the T-maze navigation task. For that purpose, we made use of the controllable and gradual switching properties of our devices and showed that the learned weights are close to the ideal values obtained in software. A quantification of the error in the in-memory weight update calculation proved that our approach is highly accurate. Although our framework was tested with HfO₂–CMO bilayer memristors, it is compatible with any memristive hardware. As an outlook for future work, we applied our framework to a complex biologically plausible learning task with a two-dimensional continuous state space, inspired by the Morris water maze experiment. Overall, our results lay the groundwork for full in-memory operations of neuromorphic chips and for the realization of computing engines that are built similar to their biological counterparts. For example, we envision that our framework could be utilized for real-time navigation in autonomous robots. First, our memristors should be integrated into crossbar arrays, allowing for larger-scale demonstrations of TD learning, where all actor and critic weights are stored and updated simultaneously in hardware. Our scheme could also be extended with eligibility traces^9,11, a biologically inspired technique to accelerate the convergence time of RL experiments.

Methods

Place cells

The position of the RL agent is encoded by n place cells⁶⁰ with activities (x₁; x₂…x_n) = x_t, which serve as the input layer to the actor–critic network shown in Fig. 1c. Their construction and functionality are described elsewhere^60,78. We adopt the same principles here to encode the spatial information. Specifically, in continuous spatial environments, an effective input representation of the environment is achieved through a fixed layer using RBFs, where each place cell is active in a specific region. The use of place cells is instrumental in reducing the size and complexity of actor–critic networks. Given the position of the agent in the place-cell representation, a single subsequent layer is sufficient to learn complex navigation tasks. This contrasts with deep RL networks, which require the training of potentially many hidden layers to achieve useful input representations⁷⁹.

Action selection and Hebbian term

The actor network assigns a synaptic weight θ_ij to the connection of place cells j (pre-synaptic) to the actor neurons i (post-synaptic), where each neuron i represents a different action a_i. The activity of an action neuron i is given by

$${h}_{i}=\sum _{j}{\theta }_{ij}{x}_{j},$$

(4)

where x_j denotes the pre-synaptic activity (that is, activity of the place cells). h_i determines the probability of selecting action a_i in the momentary state s_t through the softmax policy π(i∣s_t) (ref. ¹¹):

$$\pi (i| {s}_{t})=\frac{\exp ({h}_{i}/T)}{{\sum }_{k}\exp ({h}_{k}/T)},$$

(5)

where the sum in the denominator runs over all possible actions k and T is the softmax temperature parameter. The latter determines the balance between exploration (executing random actions) and exploitation (applying the learned actions)¹¹, with a higher value resulting in increased exploration. In our actor–critic framework, actions are learned dynamically and become increasingly certain over time as the actor weights grow. The temperature parameter, in turn, ensures continued exploration even if the actor network favours a particular action. Together, these two mechanisms help prevent the overexploitation of suboptimal trajectories.
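Equations (4) and (5) can be illustrated with a minimal Python sketch. The weight values, the three-cell input and the function names below are illustrative assumptions and are not taken from our implementation; only the formulas follow the text.

```python
import math
import random

def action_activities(theta, x):
    """h_i = sum_j theta[i][j] * x[j], as in equation (4)."""
    return [sum(t_ij * x_j for t_ij, x_j in zip(row, x)) for row in theta]

def softmax_policy(h, temperature):
    """pi(i|s_t) from equation (5); the maximum is subtracted for numerical stability."""
    h_max = max(h)
    exps = [math.exp((h_i - h_max) / temperature) for h_i in h]
    total = sum(exps)
    return [e / total for e in exps]

def select_action(probabilities, rng):
    """Sample one action index according to the policy."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]

# Illustrative values (assumed, not measured): 3 place cells, 2 actions.
theta = [[0.2, 0.0, 0.1],
         [0.0, 0.5, 0.1]]
x = [0.0, 1.0, 0.0]  # one-hot encoding of the current state

h = action_activities(theta, x)
pi_cold = softmax_policy(h, temperature=0.3)  # low T: closer to greedy
pi_hot = softmax_policy(h, temperature=3.0)   # high T: closer to uniform
action = select_action(pi_cold, random.Random(0))
```

In this sketch, the action with the larger activity is preferred at both temperatures, but the preference is weaker at the higher temperature, which corresponds to increased exploration.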

The Hebbian term H(i, j) in equation (1) is a combination of signals that are locally available to the synapse, namely, the pre-synaptic activity x_j and the post-synaptic activity h_i. It is defined in our model as

$${H}^{act}(i,j)=\left\{\begin{array}{ll}(1-{h}_{i}){x}_{j}(t),\quad &\,\text{for}\,\,{i}^{* }=i\\ -{h}_{i}{x}_{j}(t),\quad &\,\text{for}\,\,{i}^{* }\ne i\end{array}\right.,\,{H}^{cri}(j)={x}_{j}(t),$$

(6)

where i* is the post-synaptic action neuron that fired following the chosen action.
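The following sketch evaluates the Hebbian terms of equation (6) and combines them with a learning rate and a TD error. It assumes that the actor update has the same multiplicative form (learning rate × third factor × Hebbian term) as the critic update; all function names and numerical values are hypothetical.

```python
def hebbian_actor(h, x, chosen):
    """H^act(i, j) from equation (6):
    (1 - h_i) * x_j for the chosen neuron i*, and -h_i * x_j for all others."""
    return [[((1.0 - h_i) if i == chosen else -h_i) * x_j for x_j in x]
            for i, h_i in enumerate(h)]

def hebbian_critic(x):
    """H^cri(j) = x_j."""
    return list(x)

def actor_update(alpha, td_error, h, x, chosen):
    """Assumed three-factor form: alpha * TD error * H^act(i, j)."""
    return [[alpha * td_error * H_ij for H_ij in row]
            for row in hebbian_actor(h, x, chosen)]

def critic_update(alpha, td_error, x):
    """Three-factor form for the critic: alpha * TD error * H^cri(j)."""
    return [alpha * td_error * H_j for H_j in hebbian_critic(x)]

# Illustrative values (assumed): 2 action neurons, 3 place cells, one-hot input.
h = [0.2, 0.7]
x = [0.0, 1.0, 0.0]
d_theta = actor_update(alpha=0.2, td_error=0.5, h=h, x=x, chosen=1)
d_w = critic_update(alpha=0.2, td_error=0.5, x=x)
```

With a one-hot input, only the synapses connected to the active place cell receive a non-zero update, which reflects the locality of the rule.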

Experimental setups

The d.c. characterization of the memristors was performed with the B2912A source measure unit from Keysight. The bottom electrode (TiN) was grounded, whereas the top electrode (W) was biased with a positive or negative voltage. Neither a current compliance nor an external series resistor was used during the d.c. measurements as the current passing through the device was self-limited by the active layers of the memristor. The electrical measurements of the dynamic characterization were conducted using the 33500B arbitrary waveform generator from Keysight in combination with the RTE1102 oscilloscope from Rohde & Schwarz and a 10-kΩ series resistance. The conductance states of the potentiation and depression curves were determined via the voltage drop across the resistor. For the hardware weight update calculation and the weight updates, two 33500B arbitrary waveform generators from Keysight were combined with the DHPCA-100 amplifier from FEMTO and the RTE1102 oscilloscope from Rohde & Schwarz. More details about the different experimental setups are given in Supplementary Figs. 3 and 7. All the memristor weight updates were performed using identical pulses: 2.5 V with 1.5-μs width for potentiation, and –2.7 V with 10-μs width for depression, with 200 pulses spanning the full conductance range.

Derivation of the in-memory weight update calculation

The formulas for the in-memory weight update calculation used in the T-maze task are summarized in this section. The learning rule for the critic weight can be rewritten as a scalar product:

$$\begin{array}{l}\Delta w({s}_{t})=\alpha \times {3}^{rd}\times {H}^{cri}(j)\\=\alpha \times \left(r({s}_{t})+\gamma \times V({s}_{t+1})-V({s}_{t})\right)=\left(\begin{array}{c}\alpha \times r({s}_{t})\\ \alpha \times \gamma \\ -\alpha \end{array}\right)\cdot \left(\begin{array}{c}1\\ V({s}_{t+1})\\ V({s}_{t})\end{array}\right)\end{array}$$

(7)

Here α represents the learning rate, r(s_t) is the reward at state s_t, γ is the discount factor, V(s_t+1) and V(s_t) are the value estimates and H^cri(j) is the Hebbian term of the critic. The latter is equal to 1 (that is, H^cri(j) = 1) in the case of one-hot encoding, as only one entry of the input vector x_t is non-zero. As shown in Fig. 3a, this scalar product can be implemented with two memristors w_t+1 and w_t from the critic network and one resistor w_fixed, which are wired together in one row. In this manner, the first vector of the weight update can be mapped to the input voltages U₁ = α × r(s_t), U₂ = α × γ and U₃ = − α and the second vector to the weights w_fixed = 1, w_t+1 = V(s_t+1) and w_t = V(s_t).

$$\Delta w({s}_{t})=\left(\begin{array}{c}\alpha \times r({s}_{t})\\ \alpha \times \gamma \\ -\alpha \end{array}\right)\cdot \left(\begin{array}{c}1\\ V({s}_{t+1})\\ V({s}_{t})\end{array}\right)=\left(\begin{array}{c}{U}_{1}\\ {U}_{2}\\ {U}_{3}\end{array}\right)\cdot \left(\begin{array}{c}{w}_{fixed}\\ {w}_{t+1}\\ {w}_{t}\end{array}\right)$$

(8)

Since the reward term α × r(s_t) is a feedback signal from the environment of the navigation task, it is implemented by the applied voltage U₁ and requires w_fixed to be equal to one. A resistor is, thus, chosen to represent this constant term, but it could also be implemented using another memristor with a fixed conductance. In practice, the memristor conductances need to be converted to normalized weights, which results in adjusted input voltages U₁–U₃. A detailed explanation of the mathematical derivation of these voltages is provided in Supplementary Note 2.
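As a numerical check of equations (7) and (8), the short sketch below compares the scalar product of the voltage and weight vectors with the directly computed update α × TD error. It uses idealized normalized weights and omits the conductance-to-weight conversion of Supplementary Note 2; the function names and example values are assumptions chosen for illustration.

```python
def td_error(reward, gamma, v_next, v_current):
    """TD error r(s_t) + gamma * V(s_{t+1}) - V(s_t)."""
    return reward + gamma * v_next - v_current

def update_via_scalar_product(alpha, reward, gamma, v_next, v_current):
    """Critic update written as the scalar product of equation (8)."""
    voltages = [alpha * reward, alpha * gamma, -alpha]  # U1, U2, U3
    weights = [1.0, v_next, v_current]                  # w_fixed, w_{t+1}, w_t
    return sum(u * w for u, w in zip(voltages, weights))

# Illustrative values (assumed): alpha and gamma as in the T-maze runs,
# arbitrary value estimates for the current and next state.
alpha, gamma = 0.2, 0.9
reward, v_next, v_current = 1.0, 0.5, 0.3

direct = alpha * td_error(reward, gamma, v_next, v_current)
in_memory = update_via_scalar_product(alpha, reward, gamma, v_next, v_current)
```

Both expressions give the same update (0.23 for these values), because the scalar product is only a regrouping of the terms of the TD error.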

Error correction mechanism

In navigation tasks, when moving between states in its environment, an agent strives to choose actions that maximize the amount of reward it collects. A difference between the actual reward (the immediate reward the agent receives) and the expected reward (the predicted reward the agent anticipates if it follows its current strategy) leads to a non-zero TD error 3^rd (equation (2)), which updates the actor weights θ_ij and the critic weights w_j according to equation (1). This iterative adjustment of the weights drives the agent’s learning process towards a near-optimal set of state values V(s_t) and policy π(i∣s_t).

Our actor–critic RL framework calculates the weight updates Δw_des directly in hardware through a subnetwork of two critic memristors (w_t and w_t+1) along with a fixed-value resistor (w_fixed) (Fig. 3a). Two sources of errors are introduced during the actual update: an error ϵ₁ because of the nonlinear dependence of the weight update on the number Δp of applied voltage pulses and an error ϵ₂ because of the inherent noise in the memristor updates. Neither ϵ₁ nor ϵ₂ is known during the hardware update, and both are, therefore, contained in the new weight after the update. However, as the new weights directly represent the value estimates of the current (V(s_t)) and next (V(s_t+1)) state through w_t and w_t+1, respectively, both error terms are taken into account during the subsequent iteration and, therefore, compensated through the in-memory calculation of the next desired weight update (Fig. 3a(iv)). They are, thus, trained away by the algorithm¹¹, leading to an error correction mechanism. Similar mechanisms are present in other online training algorithms on memristors. However, these implementations require an external computation of the weight update to account for these error terms (that is, the gradient of the loss function in backpropagation is computed in software), preventing full in-memory training. By contrast, in our approach, the weight updates are computed in hardware according to equation (3) and implemented by the scheme shown in Fig. 3a. A mathematical description of the error correction mechanism is provided in Supplementary Note 4.
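The origin of the two error terms can be made concrete with a toy device model. The sketch below assumes a hypothetical saturating exponential potentiation curve and Gaussian update noise; neither the curve shape nor the parameter values are fitted to our devices, and depression is simplified to moving backwards along the same curve.

```python
import math
import random

N_PULSES = 200      # pulses spanning the full conductance range (see Experimental setups)
NONLINEARITY = 3.0  # hypothetical curvature of the toy potentiation curve

def weight_after_pulses(p):
    """Toy saturating curve mapping a pulse count p in [0, N_PULSES] to a weight in [0, 1]."""
    return (1.0 - math.exp(-NONLINEARITY * p / N_PULSES)) / (1.0 - math.exp(-NONLINEARITY))

def apply_update(p, dw_desired, noise_std, rng):
    """Translate a desired weight change into a pulse count under a linear assumption,
    then apply it to the nonlinear, noisy toy device.
    Returns the new pulse position, the realized weight and the errors eps1 and eps2."""
    dp = round(dw_desired * N_PULSES)            # linear mapping: weight change -> pulses
    p_new = min(max(p + dp, 0), N_PULSES)        # stay inside the conductance range
    realized = weight_after_pulses(p_new) - weight_after_pulses(p)
    eps1 = realized - dw_desired                 # error from the nonlinear curve
    eps2 = rng.gauss(0.0, noise_std)             # error from update noise
    w_new = weight_after_pulses(p_new) + eps2
    return p_new, w_new, eps1, eps2

rng = random.Random(0)
# Same desired update, applied in the steep and in the saturating part of the curve.
_, _, eps1_steep, _ = apply_update(p=0, dw_desired=0.1, noise_std=0.01, rng=rng)
_, _, eps1_flat, _ = apply_update(p=180, dw_desired=0.1, noise_std=0.01, rng=rng)
```

In this toy model, the same desired update overshoots in the steep region of the curve and undershoots near saturation, so ϵ₁ changes sign with the device state and cannot be removed by a fixed correction factor.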

The error correction mechanism can compensate for non-idealities such as update noise or conductance drift. As such, it can also adapt to potential device degradation that occurs over long timescales. If conductance values change over time, the TD error is no longer equal to zero, naturally triggering retraining and thereby mitigating other hardware non-idealities as well. However, this requires that devices remain reprogrammable after degradation, that is, no permanent device failure has occurred.

We also investigated the impact of read accuracy (Fig. 3b) during the hardware weight update calculation on convergence. Specifically, we analysed how variations in this accuracy affect the convergence in the Morris water maze task (Supplementary Fig. 6). The simulation results show that our measured read accuracy has a negligible impact on the convergence and performs similarly to the ideal case with perfect accuracy.

Evaluation of the error correction mechanism

The error correction mechanism was tested by solving the T-maze navigation task illustrated in Fig. 4 with in-software-emulated memristors and by inspecting the resulting standard deviation of the critic weights w_j as a function of the episode number. Specifically, we compared the case in which the errors ϵ₁ (resulting from linear updates on nonlinear potentiation/depression curves) and ϵ₂ (update noise) were included in the in-memory weight update calculation of the next iteration (with feedback) to the case in which they were omitted (no feedback). In the case where ϵ₁ and ϵ₂ were not fed back, errors accumulated over successive iterations of the learning algorithm, resulting in a larger spread (higher standard deviation) of the learned weights. We conducted 1,000 distinct simulated runs and extracted the mean along with error bars representing two standard deviations.
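The difference between the two cases can be reproduced qualitatively with a deliberately simplified toy model, sketched below. It is not the T-maze evaluation described above: it trains a single critic weight towards a fixed reward, models only Gaussian update noise (ϵ₂) and uses assumed values for the noise level and the number of updates.

```python
import random
import statistics

def final_weights(feedback, runs=1000, steps=100, alpha=0.2,
                  reward=1.0, noise_std=0.02, seed=0):
    """Toy single-weight critic whose target value equals the reward (V(s_{t+1}) = 0).

    feedback=True:  each update is computed from the weight stored on the noisy device.
    feedback=False: each update is computed from a noise-free copy, so the noise
                    added to the device is never seen by the learning rule.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(runs):
        w_device = 0.0  # weight stored on the emulated device
        w_ideal = 0.0   # noise-free bookkeeping copy
        for _ in range(steps):
            w_read = w_device if feedback else w_ideal
            dw = alpha * (reward - w_read)              # desired TD update
            w_ideal += alpha * (reward - w_ideal)       # ideal trajectory
            w_device += dw + rng.gauss(0.0, noise_std)  # noisy hardware update
        results.append(w_device)
    return results

with_feedback = final_weights(feedback=True)
no_feedback = final_weights(feedback=False)
std_with = statistics.stdev(with_feedback)
std_without = statistics.stdev(no_feedback)
```

With feedback, each noise contribution is partially corrected by the following updates, so the spread of the final weight stays bounded. Without feedback, the noise terms add up like a random walk and the spread grows with the number of updates.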

Implementation of the T-maze experiment

The algorithm used for the T-maze experiment is based on the equations of TD learning presented in the ‘Analogue memristor synapses as active components of actor–critic networks’ section. The TD error (equation (2)) adapts the actor and critic synapses based on the learning rules given in equation (1) with the adjustable learning parameters α, γ and T. The reward function r(s_t) is equal to 1 for state 6 (where a reward is present) and 0 otherwise. In all the runs, we set the discount factor γ to 0.9. Moreover, the optimal parameters for the learning rate and softmax temperature are determined through a grid search of simulated runs using in-software-emulated memristors (Supplementary Note 5), which yields α = 0.2 and T = 0.3, respectively. For all the experiments, the reward was placed in the left corner of the T-maze (state 6). Although the task involved a static reward, our actor–critic framework is also capable of learning in dynamic environments in which the reward location changes slowly over time, either smoothly or abruptly. In such cases, the actor and critic weights would slowly adapt through updates driven by the TD error. The speed of relearning could be increased further by the use of a global signal conveying uncertainty or surprise^37,80,81.

As mentioned in the main text, the actor–critic network comprises 27 synaptic weights in total, including 9 × 2 = 18 actor weights (θ_ij) and 9 × 1 = 9 critic weights (w_j). Each of these weights is implemented by a different hardware memristor. In each run, two out of the nine critic weights are represented by physical devices and updated in hardware via online training. Due to experimental constraints (Supplementary Fig. 7), including the availability of only four probe needles on the probe station and a limited number of output channels on the arbitrary waveform generators, we were restricted to operating and training two memristors in parallel per run. The behaviour of the other critic and actor weights is, thus, emulated in software. Each of them relies on the fitted characteristics of a distinct memristor, including its potentiation/depression curves, cycle-to-cycle variability, nonlinearity and update noise (Extended Data Fig. 6 shows the measured potentiation and depression curves of all 27 in-software-emulated memristors). The same two hardware devices additionally implement the in-memory weight update calculation, as introduced in Fig. 3a.

Implementation of the Morris water maze experiment

As the simulated environment is continuous, Gaussian RBFs are used as the input layer of our actor–critic network (Fig. 1c). They create a representation of the current agent’s location in the maze, which is encoded by the activation x_t of 121 overlapping RBFs that are centred at evenly spaced grid points in an 11 × 11 layout. The components of x_t become larger as the agent moves closer to the corresponding grid point. The representation of positions through an RBF input layer makes it possible to solve the complex water maze navigation task in continuous space by learning actions and state values in a single subsequent layer, thereby substantially reducing the required neural network size compared with multilayer networks^11,37,38. Our choice of a fixed RBF grid with evenly spaced grid points is sufficient for the types of task analysed in this work, where the reward location is static and the environment obstacles are placed uniformly across the space. However, if it is not known a priori where higher spatial resolution is needed, a more flexible place-cell representation would be advantageous. For example, self-organizing maps or similar unsupervised algorithms could be employed, as they typically rely on local learning rules^58,61,82, and are, therefore, fully compatible with our in situ, local learning framework.
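A minimal sketch of such an RBF input layer is given below. The 11 × 11 grid follows the text, whereas the unit-square arena, the RBF width of 0.1 and the function names are assumptions made only for this illustration.

```python
import math

def make_centres(n_per_side=11, size=1.0):
    """Evenly spaced grid of RBF centres on a square arena (arena shape is assumed)."""
    step = size / (n_per_side - 1)
    return [(ix * step, iy * step)
            for iy in range(n_per_side) for ix in range(n_per_side)]

def place_cell_activities(position, centres, width):
    """Gaussian RBF activity of every place cell for a given agent position."""
    px, py = position
    return [math.exp(-((px - cx) ** 2 + (py - cy) ** 2) / (2.0 * width ** 2))
            for cx, cy in centres]

centres = make_centres()  # 121 centres in an 11 x 11 layout
x_far = place_cell_activities((0.30, 0.50), centres, width=0.1)
x_near = place_cell_activities((0.45, 0.50), centres, width=0.1)
x_at = place_cell_activities((0.50, 0.50), centres, width=0.1)
centre_cell = 5 * 11 + 5  # index of the cell centred at (0.5, 0.5)
```

The activity of the cell centred at (0.5, 0.5) increases as the agent approaches that grid point and reaches its maximum of 1 when the agent is located exactly on it.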

To navigate through the maze, the agent chooses among eight possible actions (Fig. 5a). Following the actor–critic network shown in Fig. 1c, each place cell is connected to one critic neuron and eight action neurons. In total, the actor–critic network comprises 1,089 synaptic weights, including 121 × 1 = 121 critic weights and 121 × 8 = 968 actor weights. The behaviour of all these weights is emulated in software, with all the weights initialized to zero, which showed the fastest convergence (Supplementary Fig. 8). We use the same 27 potentiation/depression measurements as in the T-maze (Extended Data Fig. 6) as the basis for the in-software-emulated memristors. Although the number of weight update curves is much smaller than the total number of synaptic weights, device-to-device variability is captured by randomly assigning these measured curves to the network weights. For each device, the emulated weight updates incorporate cycle-to-cycle variability, nonlinearity and update noise. Compared with the T-maze case in which distinct cycles were chosen at each iteration, here cycle-to-cycle variability and update noise are combined within a single error term σ. For each memristor, this parameter is extracted by overlaying all ten measured potentiation/depression cycles (similar to Fig. 2f). By varying σ, we can investigate the effect of the total update error on our simulation runs. Moreover, we analysed the impact of actor weight initialization and granularity (that is, the number of pulses between minimum and maximum conductance) on the convergence speed (Supplementary Fig. 8).

Extension to deep networks

In our navigation framework, a single RBF-based input layer is sufficient to encode a representation of the environment. This representation is rich enough for learning actions and state values in a single subsequent layer, making deep RL unnecessary^58,61,62. A representation with approximate RBFs could be the result of a generic preprocessing pipeline, for example, with a deep convolutional neural network that serves as a foundation model and transforms arbitrary input images, or other sensor data, into high-level representations^61,83,84,85. All weights of the preprocessing pipeline could be mapped onto memristors, with each layer implemented as a crossbar array. Only the last layer—the actor–critic one—would be trained in situ on a specific task, using our three-factor learning rule and in-memory weight update scheme.

One limitation of the proposed approach is the limited adaptability to new environments due to the fixed input layer(s). The application of three-factor learning rules with local plasticity to the case of self-supervised representation learning provides a way to extend our approach to deep neural networks^84,86,87. These biologically inspired learning rules rely on layer-specific loss functions and eliminate the need for the backpropagation of error signals. To illustrate this, we have used the local three-factor rule, named CLAPP^84,88, in simulation in a deep network comprising six layers. The deep network was pretrained on the STL10 database. We then kept the weights fixed and applied inputs from simulated views of a three-dimensional T-maze environment with images on the walls (Supplementary Fig. 9). The representation layer (layer 6 of the deep network) was rich enough that it could be used as input to our simulated (one-layer) actor–critic network, which learns the navigation task in fewer than 20 trials. However, these rules are currently an active field of research and it is too early to attempt an implementation on memristor-based architectures.

Comparison with other RL algorithms and local learning rules based on backpropagation approximations

The actor–critic TD learning algorithm lends itself particularly well to an in-memory implementation compared with the most-common RL algorithms such as Q-learning, SARSA or Monte Carlo methods¹¹. Whereas Monte Carlo methods are not compatible with online learning¹¹, Q-learning is an online, although off-policy method, which prevents efficient in-memory weight updates. SARSA theoretically allows for a similar hardware implementation as TD learning with actor–critic networks, but the latter directly learns and updates the action policy over time, a feature that makes it both resistant to function approximation errors and better suited to complex environments¹¹. Since our bio-inspired algorithm employs RBFs to represent the agent’s location, a single subsequent layer combined with a local learning rule is sufficient to learn both actions and state values, realizing complex navigation tasks^38,58,61. Owing to the local learning rule, only individual weight updates on a small subset of all memristor devices are performed.

Within our developed framework, hardware memristors are not only employed as synaptic weights for online learning but also for the calculation of weight updates. Compared with existing in-memory weight update calculations, where updates are solely based on the sign of the weight change and thus imprecise^31,32, our method computes exact weight updates. When updating the memristors, no additional error mitigation schemes such as write–verify algorithms are necessary as opposed to other weight update schemes^89,90, thereby simplifying the control circuitry^27,31,91. Hence, the proposed approach minimizes off-device computations and avoids weight read-outs so that the main task of the software reduces to environment interactions.

Our methodology contrasts with modern deep RL methods such as deep Q-networks and proximal policy optimization (PPO) that rely on error backpropagation across multiple layers¹¹ and are, therefore, less biologically plausible¹⁸ than our actor–critic TD learning approach, where both actions and state values are learned using a single layer. We note that deep RL methods¹⁴ train all layers on a given task or set of tasks¹³. However, in our approach, we assume that a good representation of the environment can be achieved independently of the task, using, as preprocessing, a foundation model trained with modern self-supervised learning algorithms^92,93,94,95 on large datasets. In line with existing foundation models, we expect that a representation built by the foundation model is useful for many different tasks. Most importantly, although deep RL algorithms have demonstrated strong performance in many deep RL tasks¹³, they go beyond what is needed to solve navigation tasks^38,58,61. In Extended Data Fig. 7, we directly compare our actor–critic TD learning algorithm with PPO and R-STDP implementations on the Morris water maze task, using the same RBF input representation and an identical network structure consisting of a single layer. While the software implementations of actor–critic TD learning and PPO achieve similar performance, the memristor emulation performs slightly worse due to the presence of non-idealities in the weight updates, and R-STDP does not converge at all. Unlike TD learning, where weight updates happen whenever a non-zero TD error (a reward prediction error) is present, updates in R-STDP only take place when the reward is reached.

As an alternative to directly implementing local three-factor learning rules, yet avoiding the biological limitations of backpropagation, several approximations of the backpropagation algorithm have emerged in recent years¹⁸. A notable example is the proposed memristor-based architecture employing direct feedback alignment⁹⁶. Although these methods are compelling, it is important to highlight a key distinction: in our framework, the TD error acts as a scalar, one-dimensional global error signal, in contrast to the high-dimensional error signals used in both backpropagation and direct feedback alignment. This scalar error enables fully local learning by eliminating the need for network-wide error propagation (as required in direct feedback alignment) and allows the same modulatory signal to be broadcast uniformly to all synapses, unlike the synapse-specific feedback used in approaches such as that in ref. ⁹⁶.

Energy consumption and latency estimation of a crossbar-level implementation

The energy consumption and latency of the actor–critic TD learning algorithm were calculated, focusing specifically on the operations that can be performed in hardware to highlight the potential of a crossbar implementation of our framework (Extended Data Table 1). We compared three different cases: ‘this work’, ‘hybrid’ and ‘software’, where ‘hybrid’ refers to other works that employ memristors within the RL algorithm and ‘software’ to an implementation without memristors. Each algorithmic operation performed in ‘software’ is assumed to be executed on an NVIDIA A100 40-GB GPU. The operations performed in ‘hardware’ are assumed to be implemented on a crossbar array, namely, the one proposed in Supplementary Note 7. For both GPU and memristor operations, we consider a ‘standard’ case and an ‘optimal’ case, as well as a ‘compute’ scenario specifically for the GPU. The GPU is ideally fully utilized (‘optimal’ case), which results in the lowest latency and energy consumption, whereas ‘standard’ is a more realistic utilization scenario, such as that in ref. ⁹⁷. ‘Compute’ provides a reference for the energy consumed solely by computation, excluding any overhead from fetching or storing weights in memory. As a basis for the energy and latency calculations, we employ the values presented in Table 2 and Supplementary Table S1 of ref. ⁹⁷. For the memristor implementations, we consider a standard case using the pulse widths employed in this work, as well as an optimal case with a 60-ns pulse width for all the operations, similar to what has been demonstrated in the past for the same HfO₂–CMO cells⁹⁸ (Supplementary Note 8 provides more details on the effect of the pulse width on energy consumption).
Furthermore, we assume all memristors to be in the low-resistance state of 50 μS, and that each weight update consists of three reset pulses of 10 μs (standard case) or four set pulses of 60 ns (optimal case), each representing the worst-case scenario in terms of energy consumption during the water maze emulation. For all vector–matrix and vector–vector calculations, we consider the same mapping that we used in the hardware calculation of Δw in the T-maze, which results in a maximum voltage of 0.1 V applied to a memristor. Here we assume that 0.1 V is applied to all memristors during the vector–matrix and vector–vector calculations.
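The order of magnitude of the per-device energies implied by these assumptions can be estimated with the rough sketch below. It treats each memristor as an ohmic element with a constant conductance of 50 μS during the pulse (E = V² × G × t) and, as a further assumption not stated above, reuses the 2.5 V and –2.7 V amplitudes of the experimental pulses for both cases. It is not the calculation behind Extended Data Table 1.

```python
def pulse_energy(voltage, conductance, duration):
    """Energy dissipated in an ohmic element: E = V^2 * G * t (constant G is assumed)."""
    return voltage ** 2 * conductance * duration

G_LRS = 50e-6  # low-resistance state of 50 uS, as assumed in the text

# Standard case: three reset pulses of 10 us; -2.7 V amplitude taken from the setup.
standard_write = 3 * pulse_energy(-2.7, G_LRS, 10e-6)

# Optimal case: four set pulses of 60 ns; the 2.5 V amplitude is an assumption here.
optimal_write = 4 * pulse_energy(2.5, G_LRS, 60e-9)

# Power dissipated per device while 0.1 V is applied during a vector-matrix product.
read_power = 0.1 ** 2 * G_LRS
```

Under these simplifications, a weight update costs about 11 nJ per device in the standard case and about 75 pJ in the optimal case, while a device dissipates about 0.5 μW during the in-memory calculations.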

In the analysis of energy consumption and latency, we did not include the contribution of the peripheral circuitry. Analogue-to-digital and digital-to-analogue converters are typically the main contributors to the energy consumption of memristor-based systems⁹⁹. To minimize their negative impact, our framework avoids converting data between the digital and analogue domains by computing as many components of the algorithm as possible in memory. This is expected to further reduce the energy consumption and latency compared with other memristor-based systems.

Data availability

The figure files used in this study are available via Zenodo at https://doi.org/10.5281/zenodo.15740718 (ref. ¹⁰⁰), whereas the datasets for the T-maze and Morris water maze experiments are available via GitHub at https://github.com/ztill/TD_learning_on_memristors/.

Code availability

The source code for the TD learning framework is available via Zenodo at https://doi.org/10.5281/zenodo.17315889 (ref. ¹⁰¹) and via GitHub at https://github.com/ztill/TD_learning_on_memristors/.

References

Caporale, N. & Dan, Y. Spike timing–dependent plasticity: a Hebbian learning rule. Annu. Rev. Neurosci. 31, 25–46 (2008).
Frémaux, N. & Gerstner, W. Neuromodulated spike-timing-dependent plasticity, and theory of three-factor learning rules. Front. Neural Circuits 9, 85 (2016).
Markram, H., Gerstner, W. & Sjöström, P. J. Spike-timing-dependent plasticity: a comprehensive overview. Front. Synaptic Neurosci. 4, 2 (2012).
Rathi, N., Agrawal, A., Lee, C., Kosta, A. K. & Roy, K. Exploring spike-based learning for neuromorphic computing: prospects and perspectives. In Proc. IEEE International Electron Devices Meeting (IEDM) 902–907 (IEEE, 2021).
Olshausen, B. A. & Field, D. J. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381, 607–609 (1996).
Kempter, R., Gerstner, W. & van Hemmen, J. L. Hebbian learning and spiking neurons. Phys. Rev. E 59, 4498–4514 (1999).
Zenke, F., Agnes, E. & Gerstner, W. Diverse synaptic plasticity mechanisms orchestrated to form and retrieve memories in spiking neural networks. Nat. Commun. 6, 6922 (2015).
Schultz, W., Dayan, P. & Montague, P. R. A neural substrate of prediction and reward. Science 275, 1593–1599 (1997).
Gerstner, W., Lehmann, M., Liakoni, V., Corneil, D. & Brea, J. Eligibility traces and plasticity on behavioral time scales: experimental support of neohebbian three-factor learning rules. Front. Neural Circuits 12, 53 (2018).
Frémaux, N., Sprekeler, H. & Gerstner, W. Reinforcement learning using a continuous time actor-critic framework with spiking neurons. PLoS Comput. Biol. 9, e1003024 (2013).
Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction (MIT Press, 2018).
Neftci, E. O. & Averbeck, B. B. Reinforcement learning in artificial and biological systems. Nat. Mach. Intell. 1, 133–143 (2019).
Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).
Li, Y. Deep reinforcement learning: an overview. Preprint at https://arxiv.org/abs/1701.07274 (2017).
Silver, D. et al. Mastering the game of Go without human knowledge. Nature 550, 354–359 (2017).
Kaufmann, E. et al. Champion-level drone racing using deep reinforcement learning. Nature 620, 982–987 (2023).
AlMahamid, F. & Grolinger, K. Reinforcement learning algorithms: an overview and classification. In Proc. IEEE International Conference on Big Data (Big Data) 1–7 (IEEE, 2021).
Lillicrap, T. P., Santoro, A., Marris, L., Akerman, C. J. & Hinton, G. Backpropagation and the brain. Nat. Rev. Neurosci. 21, 335–346 (2020).
Xia, Q. & Yang, J. J. Memristive crossbar arrays for brain-inspired computing. Nat. Mater. 18, 309–323 (2019).
Gokmen, T. & Vlasov, Y. Acceleration of deep neural network training with resistive cross-point devices: design considerations. Front. Neurosci. 10, 333 (2016).
Mead, C. Neuromorphic electronic systems. Proc. IEEE 78, 1629–1636 (1990).
Furber, S. Large-scale neuromorphic computing systems. J. Neural Eng. 13, 051001 (2016).
Christensen, D. V. et al. 2022 roadmap on neuromorphic computing and engineering. Neuromorph. Comput. Eng. 2, 022501 (2022).
Sebastian, A., Le Gallo, M., Khaddam-Aljameh, R. & Eleftheriou, E. Memory devices and applications for in-memory computing. Nat. Nanotechnol. 15, 529–544 (2020).
Orchard, G. et al. Efficient neuromorphic signal processing with Loihi 2. In Proc. IEEE Workshop on Signal Processing Systems (SiPS) 254–259 (IEEE, 2021).
Mehonic, A. et al. Memristors—from in-memory computing, deep learning acceleration, and spiking neural networks to the future of neuromorphic and bio-inspired computing. Adv. Intell. Syst. 2, 2000085 (2020).
Yu, S. Neuro-inspired computing with emerging nonvolatile memorys. Proc. IEEE 106, 260–285 (2018).
Li, Y., Wang, Z., Midya, R., Xia, Q. & Yang, J. J. Review of memristor devices in neuromorphic computing: materials sciences and device challenges. J. Phys. D: Appl. Phys. 51, 503002 (2018).
Yang, J. J., Strukov, D. B. & Stewart, D. R. Memristive devices for computing. Nat. Nanotechnol. 8, 13–24 (2013).
Emboras, A. et al. Opto-electronic memristors: prospects and challenges in neuromorphic computing. Appl. Phys. Lett. 117, 230501 (2020).
Zhang, W. et al. Edge learning using a fully integrated neuro-inspired memristor chip. Science 381, 1205–1211 (2023).
Gao, B. et al. Memristor-based analogue computing for brain-inspired sound localization with in situ training. Nat. Commun. 13, 2026 (2022).
Sarwat, S. G., Moraitis, T., Wright, C. D. & Bhaskaran, H. Chalcogenide optomemristors for multi-factor neuromorphic computation. Nat. Commun. 13, 2247 (2022).
Sun, Y. et al. Ferroelectric polarized in transistor channel polarity modulation for reward-modulated spike-time-dependent plasticity application. J. Phys. Chem. Lett. 13, 10056–10064 (2022).
Shi, C., Lu, J., Wang, Y., Li, P. & Tian, M. Exploiting memristors for neuromorphic reinforcement learning. In Proc. IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS) 1–4 (IEEE, 2021).
Vlasov, D. et al. Memristor-based spiking neural network with online reinforcement learning. Neural Netw. 166, 512–523 (2023).
Foster, D., Morris, R. & Dayan, P. Models of hippocampally dependent navigation using the temporal difference learning rule. Hippocampus 10, 1–16 (2000).
Arleo, A. & Gerstner, W. Spatial cognition and neuro-mimetic navigation: a model of hippocampal place cell activity. Biol. Cybern. 83, 287–299 (2000).
Amo, R. et al. A gradual temporal shift of dopamine responses mirrors the progression of temporal difference error in machine learning. Nat. Neurosci. 25, 1082–1092 (2022).
Schultz, W. Dopamine reward prediction error coding. Dialogues Clin. Neurosci. 18, 23–32 (2016).
Averbeck, B. & O’Doherty, J. P. Reinforcement-learning in fronto-striatal circuits. Neuropsychopharmacology 47, 147–162 (2022).
Joel, D., Niv, Y. & Ruppin, E. Actor–critic models of the basal ganglia: new anatomical and computational perspectives. Neural Netw. 15, 535–547 (2002).
Barto, A. G. in Models of Information Processing in the Basal Ganglia 215–232 (MIT Press, 1995).
Houk, J. C., Adams, J. L. & Barto, A. G. in Models of Information Processing in the Basal Ganglia 249–270 (MIT Press, 1995).
Takahashi, Y., Schoenbaum, G. & Niv, Y. Silencing the critics: understanding the effects of cocaine sensitization on dorsolateral and ventral striatum in the context of an actor/critic model. Front. Neurosci. 2, 282 (2008).
Kim, D. et al. An overview of processing-in-memory circuits for artificial intelligence and machine learning. IEEE J. Emerg. Sel. Top. Circuits Syst. 12, 338–353 (2022).
Morris, R. G. Spatial localization does not require the presence of local cues. Learn. Motiv. 12, 239–260 (1981).
D’Hooge, R. & De Deyn, P. P. Applications of the Morris water maze in the study of learning and memory. Brain Res. Rev. 36, 60–90 (2001).
Wenk, G. L. Assessment of spatial memory using the T maze. Curr. Protoc. Neurosci. 4, 8.5B (1998).
Wu, W. et al. Improving analog switching in HfOx-based resistive memory with a thermal enhanced layer. IEEE Electron Device Lett. 38, 1019–1022 (2017).
Stecconi, T. et al. Filamentary TaOx/HfO₂ ReRAM devices for neural networks training with analog in-memory computing. Adv. Electron. Mater. 8, 2200448 (2022).
Falcone, D. F. et al. Physical modeling and design rules of analog conductive metal oxide-HfO₂ ReRAM. In Proc. IEEE International Memory Workshop 1–4 (IEEE, 2023).
Stecconi, T. et al. Role of conductive-metal-oxide to HfOx interfacial layer on the switching properties of bilayer TaOx/HfOx ReRAM. In Proc. IEEE 52nd European Solid-State Device Research Conference (ESSDERC) 297–300 (IEEE, 2022).
Watabe-Uchida, M., Eshel, N. & Uchida, N. Neural circuitry of reward prediction error. Annu. Rev. Neurosci. 40, 373–394 (2017).
Dhawale, A. K., Wolff, S. B., Ko, R. & Ölveczky, B. P. The basal ganglia control the detailed kinematics of learned motor skills. Nat. Neurosci. 24, 1256–1269 (2021).
Kelley, A. & Domesick, V. The distribution of the projection from the hippocampal formation to the nucleus accumbens in the rat: an anterograde- and retrograde-horseradish peroxidase study. Neuroscience 7, 2321–2335 (1982).
Trouche, S. et al. A hippocampus-accumbens tripartite neuronal motif guides appetitive memory in space. Cell 176, 1393–1406 (2019).
Sheynikhovich, D., Chavarriaga, R., Strosslin, T., Arleo, A. & Gerstner, W. Is there a geometric module for spatial orientation? Insights from a rodent navigation model. Psychol. Rev. 116, 540–566 (2009).
Geerts, H., Chersi, F., Stachenfeld, K. & Burgess, N. A general model of hippocampal and dorsal striatal learning and decision making. Proc. Natl Acad. Sci. USA 117, 31427–31437 (2020).
O’Keefe, J. & Dostrovsky, J. The hippocampus as a spatial map: preliminary evidence from unit activity in the freely-moving rat. Brain Res. 34, 171–175 (1971).
Strösslin, T., Sheynikhovich, D., Chavarriaga, R. & Gerstner, W. Robust self-localisation and navigation based on hippocampal place cells. Neural Netw. 18, 1125–1140 (2005).
Wu, Y., Wang, H., Zhang, B. & Du, K.-L. Using radial basis function networks for function approximation and classification. Int. Sch. Res. Not. 2012, 324194 (2012).
Lisman, J., Grace, A. & Duzel, E. A neoHebbian framework for episodic memory; role of dopamine-dependent late LTP. Trends Neurosci. 34, 536–547 (2011).
Roelfsema, P. & Holtmaat, A. Control of synaptic plasticity in deep cortical networks. Nat. Rev. Neurosci. 19, 166–180 (2018).
Wu, N., Vincent, A., Strukov, D. & Xie, Y. Memristor hardware-friendly reinforcement learning. Preprint at https://arxiv.org/abs/2001.06930 (2020).
Wang, Z. et al. Reinforcement learning with analogue memristor arrays. Nat. Electron. 2, 115–124 (2019).
Bianchi, S. et al. A self-adaptive hardware with resistive switching synapses for experience-based neurocomputing. Nat. Commun. 14, 1565 (2023).
Lin, Y. et al. Uncertainty quantification via a memristor Bayesian deep neural network for risk-sensitive reinforcement learning. Nat. Mach. Intell. 5, 714–723 (2023).
Sekar, D. C. et al. Technology and circuit optimization of resistive RAM for low-power, reproducible operation. In Proc. IEEE International Electron Devices Meeting (IEDM) 28.3.1–28.3.4 (IEEE, 2014).
Cüppers, F. et al. Exploiting the switching dynamics of HfO₂-based ReRAM devices for reliable analog memristive behavior. APL Mater. 7, 091105 (2019).
Falcone, D. F. et al. Analytical modelling of the transport in analog filamentary conductive-metal-oxide/HfOx ReRAM devices. Nanoscale Horiz. 9, 775–784 (2024).
Galetta, M. et al. Compact model of conductive-metal-oxide/HfOx analog filamentary ReRAM devices. In Proc. IEEE European Solid-State Electronics Research Conference (ESSERC) 749–752 (IEEE, 2024).
Valov, I. & Tsuruoka, T. Effects of moisture and redox reactions in VCM and ECM resistive switching memories. J. Phys. D: Appl. Phys. 51, 413001 (2018).
Yao, P. et al. Fully hardware-implemented memristor convolutional neural network. Nature 577, 641–646 (2020).
Shafiee, A. et al. ISAAC: a convolutional neural network accelerator with in-situ analog arithmetic in crossbars. ACM SIGARCH Comput. Archit. News 44, 14–26 (2016).
Portner, K. et al. Analog nanoscale electro-optical synapses for neuromorphic computing applications. ACS Nano 15, 14776–14785 (2021).
Rao, M. et al. Thousands of conductance levels in memristors integrated on CMOS. Nature 615, 823–829 (2023).
Lowe, D. & Broomhead, D. Multivariable functional interpolation and adaptive networks. Complex Syst. 2, 321–355 (1988).
Gupta, J. K., Egorov, M. & Kochenderfer, M. J. Cooperative multi-agent control using deep reinforcement learning. In Proc. International Conference on Autonomous Agents and Multiagent Systems Workshops 66–83 (Springer, 2017).
Yu, A. & Dayan, P. Uncertainty, neuromodulation, and attention. Neuron 46, 681–692 (2005).
Xu, A., Modirshanechi, A., Lehmann, M., Gerstner, W. & Herzog, M. Novelty is not surprise: human exploratory and adaptive behavior in sequential decision-making. PLOS Comput. Biol. 17, e1009070 (2021).
Kohonen, T. Physiological interpretation of the self-organizing map algorithm. Neural Netw. 6, 895–905 (1993).
Sünderhauf, N., Shirazi, S., Dayoub, F., Upcroft, B. & Milford, M. On the performance of ConvNet features for place recognition. In Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 4297–4304 (IEEE, 2015).
Illing, B., Ventura, J., Bellec, G. & Gerstner, W. Local plasticity rules can learn deep representations using self-supervised contrastive predictions. Adv. Neural Inf. Process. Syst. 34, 30365–30379 (2021).
Novo, A., Lobon, F., Garcia de Marina, H., Romero, S. & Barranco, F. Neuromorphic perception and navigation for mobile robots: a review. ACM Comput. Surveys 56, 246 (2024).
Halvagal, M. S. & Zenke, F. The combination of Hebbian and predictive plasticity learns invariant object representations in deep sensory networks. Nat. Neurosci. 26, 1906–1915 (2023).
Kaiser, J., Mostafa, H. & Neftci, E. Synaptic plasticity dynamics for deep continuous local learning (DECOLLE). Front. Neurosci. 14, 424 (2020).
Delrocq, A. et al. Critical periods support representation learning in a model of cortical processing. Preprint at bioRxiv https://doi.org/10.1101/2024.12.20.629674 (2024).
Bayat, F. M. et al. Implementation of multilayer perceptron network with highly uniform passive memristive crossbar circuits. Nat. Commun. 9, 2331 (2018).
Kim, H., Mahmoodi, M., Nili, H. & Strukov, D. B. 4k-memristor analog-grade passive crossbar circuit. Nat. Commun. 12, 5198 (2021).
Yao, P. et al. Face classification using electronic synapses. Nat. Commun. 8, 15199 (2017).
van den Oord, A., Li, Y. & Vinyals, O. Representation learning with contrastive predictive coding. Preprint at https://arxiv.org/abs/1807.03748 (2019).
Bardes, A., Ponce, J. & LeCun, Y. VICReg: variance-invariance-covariance regularization for self-supervised learning. Preprint at https://arxiv.org/abs/2105.04906 (2022).
Chen, T., Kornblith, S., Norouzi, M. & Hinton, G. A simple framework for contrastive learning of visual representations. In Proc. 37th International Conference on Machine Learning 1597–1607 (PMLR, 2020).
Garrido, Q., Chen, Y., Bardes, A., Najman, L. & LeCun, Y. On the duality between contrastive and non-contrastive self-supervised learning. Preprint at https://doi.org/10.48550/arXiv.2206.02574 (2023).
Payvand, M., Fouda, M. E., Kurdahi, F., Eltawil, A. & Neftci, E. O. Error-triggered three-factor learning dynamics for crossbar arrays. In Proc. IEEE 2nd AICAS—2020 IEEE International Symposium on Artificial Intelligence Circuits and Systems 218–222 (IEEE, 2020).
Weilenmann, C. et al. Single neuromorphic memristor closely emulates multiple synaptic mechanisms for energy efficient neural networks. Nat. Commun. 15, 6898 (2024).
Lombardo, D. G. F. et al. Read noise analysis in analog conductive-metal-oxide/HfOx ReRAM devices. In Proc. 2024 Device Research Conference (DRC) 1–2 (IEEE, 2024).
Aguirre, F. et al. Hardware implementation of memristor-based artificial neural networks. Nat. Commun. 15, 1974 (2024).
Portner, K. & Zellweger, T. Actor-critic networks with analogue memristors mimicking reward-based learning. Zenodo https://doi.org/10.5281/zenodo.15740718 (2025).
Zellweger, T., keportner & flavio-martinelli. ztill/TD_learning_on_memristors: first release actor-critic TD-learning on memristors. Zenodo https://doi.org/10.5281/zenodo.17315889 (2025).

Acknowledgements

All authors acknowledge funding from the ALMOND project supported by the SNSF Sinergia program (grant number 198612). Funding from the Werner Siemens Foundation is acknowledged by K.P., T.Z., C.W., M.L. and A.E. This work was also supported in part by the Swiss State Secretariat for Education, Research, and Innovation (SERI) through the SwissChips research project. It was carried out at the Binnig and Rohrer Nanotechnology Center (BRNC) and at ETH Zurich. We thank the Cleanroom Operations Team of the BRNC for their help and support. We thank M. Stiefel for capturing the focused-ion-beam images.

Funding

Open access funding provided by Swiss Federal Institute of Technology Zurich.

Author information

These authors contributed equally: Kevin Portner, Till Zellweger.

Authors and Affiliations

Integrated Systems Laboratory, ETH Zurich, Zurich, Switzerland
Kevin Portner, Till Zellweger, Laura Bégon-Lours, Christoph Weilenmann, Oscar Hrynkevych, Mathieu Luisier & Alexandros Emboras
School of Life Sciences and School of Computer and Communication Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
Flavio Martinelli & Wulfram Gerstner
IBM Research Europe—Zurich, Rüschlikon, Switzerland
Laura Bégon-Lours, Valeria Bragaglia, Daniel Jubin, Donato Francesco Falcone, Felix Hermann, Tommaso Stecconi, Antonio La Porta, Ute Drechsler, Antonis Olziersky & Bert Jan Offrein

Contributions

K.P., T.Z., M.L. and A.E. conceived the project. K.P., L.B.-L. and V.B. fabricated the devices. F.M. developed the RL simulations. K.P. and T.Z. performed the experiments. T.Z. designed the hardware implementation of the RL algorithm and developed the measurement setup. C.W. and O.H. helped to develop the measurement setup. D.J. helped with the device measurements. D.F.F., F.H., T.S., A.L.P., U.D. and A.O. contributed to the device fabrication. K.P., T.Z., M.L. and A.E. analysed the data and co-wrote the manuscript. W.G., F.M. and B.J.O. contributed to the writing of the manuscript. F.M. and W.G. contributed to the development of the concept. F.M., L.B.-L., V.B. and C.W. contributed to the data analysis. All authors contributed to the editing of the paper. M.L. and A.E. supervised the project.

Corresponding authors

Correspondence to Kevin Portner or Till Zellweger.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Tifenn Hirtzlin, Ilia Valov and Zhongrui Wang for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Detailed fabrication flow of the analogue memristors with Scanning Electron Microscopy (SEM) pictures.

(a) Deposition of the CMOS- and BEOL-compatible layers on a Si/SiO₂ substrate. The TiN and HfO₂ layers are deposited by a Plasma-Enhanced Atomic Layer Deposition (PEALD) process at 300 °C, without breaking the vacuum to prevent oxidation of the TiN. The Conductive Metal Oxide (CMO), TiN and W layers are then deposited by sputtering. During the transfer from the ALD chamber to the sputtering system, the layer stack experiences a brief exposure to the atmosphere, during which moisture and protons may be incorporated into the device stack⁷³. (b) Etching of the top three layers. The etch defines the active area. (c) Encapsulation of the active area by SiN. This material is deposited via a Plasma-Enhanced Chemical Vapour Deposition (PECVD) process at 300 °C. (d) Etching of the bottom two layers. The HfO₂ and TiN are etched to define the bottom electrode (BE). (e) Encapsulation of the BE by SiN. (f) Via opening to provide access to the memristor device. For the top electrode (TE) via, SiN is etched, while for the bottom electrode (BE), both SiN and HfO₂ are etched. (g) Sputtering and patterning of a W TE. An SEM picture of the final device is shown. The inset provides an angled view.

Extended Data Fig. 2 Estimation of the error ϵ₁ arising from the difference between idealised (perfectly linear) and actual (non-linear) weight updates.

(a) Measured and fitted potentiation and depression curves of ten different cycles from a single memristor device. As mentioned in the main text, an error term ϵ₁ is introduced due to the difference between the ideal (linear) potentiation/depression (black line) and non-ideal measurement curves (dark and light blue). This error ϵ₁ is extracted by subtracting the measurement data (b) and the fits (c) from the ideal linear curve for all ten cycles. In both cases, the error is directly related to the non-linearity of the potentiation/depression and is largest at the beginning of the potentiation (light blue) and depression curves (dark blue), as indicated by the red circles. This implies that weight updates when the memristor conductance is close to the beginning of the potentiation and depression curves result in a larger error ϵ₁. Because the weights in our actor-critic reinforcement learning tasks are generally more often increased than decreased (their initial value is zero), the problem is more severe in the case of potentiation than depression (see for example Fig. 4b in the main text). The first update of each weight is therefore the most impacted one.
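
To make the origin of ϵ₁ more concrete, the following minimal Python sketch compares an ideal linear potentiation ramp with a non-linear one. The saturating-exponential curve shape, the parameter `beta`, the normalised conductance range and the pulse count are illustrative assumptions, not the fitted model or the measured values of this work.

```python
import numpy as np

def potentiation_curve(n_pulses, beta, g_min=0.0, g_max=1.0):
    """Assumed curve shape: saturating exponential; beta -> 0 gives a linear ramp."""
    x = np.arange(n_pulses + 1) / n_pulses
    if abs(beta) < 1e-9:
        shape = x
    else:
        shape = (1.0 - np.exp(-beta * x)) / (1.0 - np.exp(-beta))
    return g_min + (g_max - g_min) * shape

n_pulses = 50                                        # hypothetical pulse count
ideal = potentiation_curve(n_pulses, beta=0.0)       # perfectly linear reference
actual = potentiation_curve(n_pulses, beta=3.0)      # hypothetical non-linear device

# Curve-level error: ideal minus actual conductance at each pulse index.
curve_error = ideal - actual

# Update-level error: ideal step size minus actual step size for each pulse.
step_error = np.diff(ideal) - np.diff(actual)
worst_pulse = int(np.argmax(np.abs(step_error)))
print("pulse with the largest update error:", worst_pulse)
```

Under this assumed curve shape the per-update deviation is largest for the first pulse, which is consistent with the regions marked by the red circles.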

Extended Data Fig. 3 Measured potentiation and depression curves of the memristor devices used for the hardware runs of the T-maze navigation task (Fig. 4 of the main text and Supplementary Note 7).

The 9 critic weights are denoted as w_j and the 18 actor weights as θ_ij.

Extended Data Fig. 4 Effect of the noise level on the convergence speed of the Morris water maze navigation task.

(a) Extracted update noise for both potentiation and depression curves of all in-software-emulated memristors. A single parameter σ, corresponding to the standard deviation of all 10 overlapped potentiation or depression measurement curves, was extracted for each measured device. (b) Number of steps per episode plotted against the number of episodes for different noise levels, compared to the memristor emulation in the paper. The curves show the mean of 100 distinct simulation runs with an applied running average of 10 episodes to improve comparability between the runs. The in-software-emulated memristors exhibit noise levels between 2.9% and 10%, with a mean of 6.7%.

Extended Data Fig. 5 Effect of non-linearity on the convergence speed of the Morris water maze navigation task.

(a) Extracted non-linearity parameter β for both potentiation and depression curves from our 27 experimental devices, used in the corresponding software models. (b) Number of steps per episode plotted against the number of episodes for different non-linearity parameters β, compared to the memristor emulation in the paper. Each curve represents the average of 100 independent simulation runs, with an applied running average of 10 episodes to improve comparability between the runs.
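
The averaging described in panel (b) can be sketched as follows. This is a minimal Python illustration that assumes the steps per episode are stored as a runs-by-episodes array; the function name, the array layout and the synthetic learning curves are hypothetical.

```python
import numpy as np

def mean_learning_curve(steps_per_episode, window=10):
    """Average over independent runs, then apply a running average over episodes."""
    runs = np.asarray(steps_per_episode, dtype=float)   # shape: (n_runs, n_episodes)
    mean_curve = runs.mean(axis=0)
    kernel = np.ones(window) / window
    # mode="valid" keeps only positions where the full window fits.
    return np.convolve(mean_curve, kernel, mode="valid")

# Synthetic stand-in: 100 runs of 200 episodes with a decaying step count.
rng = np.random.default_rng(1)
episodes = np.arange(200)
runs = 20.0 + 180.0 * np.exp(-episodes / 30.0) + rng.normal(0.0, 10.0, size=(100, 200))

smoothed = mean_learning_curve(runs, window=10)
print(smoothed.shape)
```

With `mode="valid"` the smoothed curve is shorter than the input by `window - 1` episodes, so no edge padding enters the plotted values.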

Extended Data Fig. 6 Measured memristor potentiation and depression curves used as a reference to parametrise the memristor behaviour emulated in software.

These in-software-emulated memristors are used in the T-maze and Morris water maze navigation tasks (Figs. 4 and 5 of the main text). Each subplot consists of 10 cycles. The 9 critic weights are denoted as w_j and the 18 actor weights as θ_ij.

Extended Data Fig. 7 Actor–critic TD learning (floating-point and memristor emulation) compared to PPO and rate-based R-STDP.

The performance of the actor-critic TD learning algorithm is comparable to PPO in the floating-point software implementation, whereas convergence slows down for the memristor emulation due to update noise (Extended Data Fig. 4). R-STDP does not converge, as updates effectively only happen when the reward is reached and not when a non-zero TD error is present. The memristor model represents the mean of 100 independent simulation runs, while the PPO results are averaged over five random seeds. A moving average over 10 episodes was applied to improve comparability between runs.
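
The difference between a TD-error-driven update and a purely reward-gated one can be illustrated with the standard TD error δ = r + γV(s′) − V(s). The Python sketch below uses a hypothetical three-state chain, made-up critic values and an assumed discount factor of 0.9; the reward-gated signal is a simplified stand-in for the R-STDP behaviour described above, not the rule that was simulated.

```python
def td_error(reward, v_next, v_current, gamma=0.9):
    """Temporal difference error: delta = r + gamma * V(s') - V(s)."""
    return reward + gamma * v_next - v_current

# Hypothetical critic values on a chain s0 -> s1 -> goal, after the critic
# has already learned that s1 lies close to the reward.
values = {"s0": 0.0, "s1": 0.5, "goal": 0.0}

# The transition s0 -> s1 yields no reward, yet the TD error is non-zero,
# so actor and critic weights can still be updated on this step.
delta = td_error(reward=0.0, v_next=values["s1"], v_current=values["s0"])

# A signal gated by the reward alone stays at zero on the same step.
reward_gated_signal = 0.0

print(delta, reward_gated_signal)
```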

Extended Data Table 1 Energy consumption and latency of the actor-critic TD learning algorithm for three different scenarios: (i) a crossbar implementation of our framework (‘This work’), (ii) common approaches of using memristors within RL applications (‘Hybrid’) and (iii) a full software implementation executed on a GPU

Supplementary information

Supplementary Information

Supplementary Figs. 1–17, Notes 1–8, Table 1 and references.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Portner, K., Zellweger, T., Martinelli, F. et al. Actor–critic networks with analogue memristors mimicking reward-based learning. Nat Mach Intell 7, 1939–1953 (2025). https://doi.org/10.1038/s42256-025-01149-w

Received: 20 February 2025
Accepted: 31 October 2025
Published: 09 December 2025
Version of record: 09 December 2025
Issue date: December 2025
DOI: https://doi.org/10.1038/s42256-025-01149-w

This article is cited by

Fully analogue reinforcement learning with memristors
- Yue Zhang
- Xiaojuan Qi
- Zhongrui Wang
Nature Machine Intelligence (2025)