Fig. 3: In-memory learning loop.
From: Actor–critic networks with analogue memristors mimicking reward-based learning

a, The learning procedure starts with the in-memory calculation of the desired weight update Δw_des through a subnetwork of three weights: w_fixed, w_{t+1} and w_t. On the basis of the current location of the agent, the voltages U_3 and U_2 are applied to the two critic devices storing w_t = V(s_t) and w_{t+1} = V(s_{t+1}). The value of Δw_des is then translated into the corresponding number Δp of update pulses, and the weight update is applied to the memristor storing w_t. The update introduces two sources of error: an error ϵ_1 caused by the nonlinear dependence of the weight update on the number Δp of applied voltage pulses, and an error ϵ_2 caused by the inherent noise in the memristor updates. In the subsequent iteration, the next desired weight update is again calculated in memory from the weights actually stored, so both error terms are taken into account and compensated (feedback). The update loop is therefore error-correcting.
b, Experimentally measured in-memory weight updates (Δw_{des,meas}) as a function of the expected software weight updates (Δw_{des,exp}). Δw_{des,meas} was read out from the subnetwork in a for various weights w_t and w_{t+1}, which are dimensionless values in the range 0–1, whereas w_fixed has a fixed value of 1. Δw_{des,exp} was calculated with equation (3) using values of V(s_{t+1}) and V(s_t) in the range 0–1, whereas r(s_t) has a fixed value of 1. The measured values of Δw_des are in good agreement with the calculated values, with their absolute difference never exceeding 0.03 (or 3%). Note that for both experiments and theory, a value of α = 0.2 was used, which, according to equation (3), implies that the weight updates cannot be larger than 0.2.
c, Mean critic weights, with error bars showing the standard deviation, compared between the cases with and without the error-correcting feedback as a function of the episode number. The results are extracted from 1,000 distinct simulated runs of an actor–critic RL scenario. The plot highlights the impact of the error-correction mechanism on the variability of the learned weights.
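
The error-correcting behaviour described in a and c can be illustrated with a small toy simulation. This is only a sketch under stated assumptions, not the simulation used for the figure: a single weight is driven towards a fixed target (standing in for the learning target of equation (3)), and the nonlinearity model, the noise model and all parameter values other than α = 0.2 (`target`, `pulse_step`, `nonlinearity`, `pulse_noise`, the number of runs) are hypothetical choices made for illustration. With feedback, the desired update is computed from the weight actually stored on the device; without feedback, it is computed from an ideal software weight, so the device errors accumulate uncorrected.

```python
import random
import statistics


def run_episode_loop(feedback, episodes=50, alpha=0.2, target=0.8,
                     pulse_step=0.01, nonlinearity=0.3, pulse_noise=0.005,
                     seed=0):
    """Return the final device weight of one toy run (all models are assumptions)."""
    rng = random.Random(seed)
    w_device = 0.0    # weight actually stored on the simulated device
    w_software = 0.0  # ideal weight; only consulted when feedback is off
    for _ in range(episodes):
        # With feedback, the desired update is computed from the device itself.
        w_seen = w_device if feedback else w_software
        dw_des = alpha * (target - w_seen)
        dp = round(dw_des / pulse_step)  # signed number of update pulses
        # Assumed eps_1: the per-pulse step shrinks as the stored weight grows.
        effective_step = pulse_step * (1.0 - nonlinearity * w_device)
        # Assumed eps_2: Gaussian write noise that grows with the pulse count.
        noise = rng.gauss(0.0, pulse_noise) * abs(dp) ** 0.5
        w_device = min(1.0, max(0.0, w_device + dp * effective_step + noise))
        w_software += dw_des
    return w_device


def compare(runs=300):
    """Mean and standard deviation of the final weight, keyed by feedback flag."""
    stats = {}
    for feedback in (True, False):
        finals = [run_episode_loop(feedback, seed=s) for s in range(runs)]
        stats[feedback] = (statistics.mean(finals), statistics.stdev(finals))
    return stats


if __name__ == "__main__":
    for feedback, (mean, std) in compare().items():
        label = "with feedback" if feedback else "without feedback"
        print(f"{label}: mean = {mean:.3f}, std = {std:.3f}")
```

In this toy model the runs with feedback should end closer to the target and with a smaller spread than the runs without it, which is the qualitative trend that c describes. The absolute numbers depend entirely on the assumed device model and carry no experimental meaning.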