Abstract
Analog computing-in-memory devices leverage fundamental physical laws for computation, greatly enhancing energy efficiency. However, the stochastic characteristics of analog devices conflict with the deterministic weight update of the backpropagation algorithm (BP), limiting training performance. To overcome the algorithm-device mismatch, we propose an error-aware probabilistic update method (EaPU) that updates the weights based on a specified probability derived from device writing noise. Compared to BP, EaPU reduces the number of weight updates to <1‰ with minimal performance loss. Furthermore, we validate EaPU experimentally on a 180 nm memristor system for image denoising and super-resolution and simulate its performance on ResNet and Vision Transformers. Results confirm that EaPU training yields over 60% accuracy improvement, with ~50.54× and 13.23× lower training energy (and ~35.51× and 11.26× lower inference energy) compared to BP-based memristor training and MADEM, respectively. Moreover, EaPU-based memristor hardware reduces training energy by nearly 6 orders of magnitude compared to graphics processing units. Here, we present a promising approach to precisely and efficiently train analog device-based deep neural networks.
Introduction
Deep neural networks (DNNs) have brought great success in extensive industrial applications, such as image classification1,2, object detection3,4, and natural language processing5,6. Large-scale DNNs impose a substantial computing workload, requiring advanced hardware platforms to accelerate the computational tasks7. Analog computing-in-memory (ACIM) architecture, which performs vector-matrix multiplication (VMM) based on physical laws, has been regarded as a promising solution to address the limitations of the von Neumann architecture and has achieved significant energy efficiency8. Typically, memristive ACIM leverages the tunable nonvolatile conductance of memristors as the weight and performs VMMs with Ohm’s law and Kirchhoff’s law on crossbar arrays, showing significant efficiency improvement9,10.
Owing to the nonideal characteristics inherent in memristive devices, arrays, and peripheral circuits11, there are inevitably precision and efficiency limitations in memristor-based DNNs (shown in Fig. 1a), causing degradation in network performance. When memristors are utilized for calculation, the first step is device programming. It loads network weights from software onto hardware. In this process, writing noise (εcell) is inevitably introduced. εcell is composed of programming noise and device relaxation. In ACIM hardware, limited resolutions in devices and external circuits always result in discrete conductance states9,12, thus a tolerance range is typically set to efficiently program the device conductance to a desired continuous state13,14, leading to programming noise. The relaxation phenomenon after programming significantly contributes to retention failure and increases the error εcell9,15,16. Therefore, εcell always exists, causing a mismatch between the actual weight and the required weight. Afterward, in the implementation of VMM on memristor arrays, there exists reading noise, IR drop, and external circuit nonidealities. These nonidealities in the VMM calculation result in a residual error (εresidual). Therefore, εcell and εresidual lead to precision limitations (quantitative indices for εcell and εresidual are introduced in “Methods”). Moreover, in device programming, spatiotemporal variations εvar17,18 bring about the challenge of obtaining common programming parameters, reducing the weight update efficiency.
a Limitations of memristor-based calculations. Precision limitations of VMM cause the difference between the desired output and actual output, which then worsens the results of ACIM. Efficiency limitations cause more operations, resulting in more time and energy costs during the update process. b Distinction between the analog and algorithm-based training processes. The actual update magnitudes (ΔGdesired + εcell) in real analog devices are much different from the required update magnitudes (ΔGdesired) due to the writing error εcell, making the update process more stochastic and the histogram of actual update magnitudes much flatter (Actually, |εcell| ≫ |ΔGdesired| and SDmem ≫ SDnn, analyzed in “Results”). εcell causes different actual update magnitudes in analog training, shown as the variation of the weight matrix, which leads to deviations from the desired update process and unstable training.
Hardware-aware training11,19, which involves precise modeling and employs software for error-aware learning before applying the trained parameters to memristor systems, represents one approach to addressing nonidealities. However, training on graphics processing units (GPUs) incurs high training power consumption and latency compared to memristor systems20, along with substantial modeling costs. Additionally, the separation of training and inference hardware may lead to issues such as modeling biases and the impact of unmodeled parameters. Memristor-based training18,21, an in situ pathway for error-aware learning, can effectively leverage the high energy efficiency and low latency of memristor systems to reduce energy consumption and improve training efficiency. Meanwhile, since training is performed directly on the memristor system, it can counteract nonidealities through the learning process and remains unaffected by modeling biases.
It has been confirmed that memristor-based in situ training allows the neural network to tolerate nonideal characteristics of the device20,21,22,23, resulting in a high-performance model. There exist several training strategies, including global learning and local learning. Global training strategies, such as the backpropagation algorithm (BP), have made great progress18,21. However, BP-based training encounters a mismatch between the analytically calculated training information and the imprecisely programmed conductance of analog devices20. Owing to the programming noise and relaxation effects in memristors, εcell introduces significant deviations of the actual conductance update magnitudes (ΔGdesired + εcell) from the desired update magnitudes (ΔGdesired) in memristor arrays, resulting in a stochastic conductance update process and thus an unstable training process, as depicted in Fig. 1b. Even though the writing precision of analog devices is around 1 μS9,19,24,25, a mismatch between analog devices and the algorithm still exists (detailed analysis in “Results”), reducing training performance. Moreover, achieving such high precision relies on complicated circuit design or intensive software processing techniques, which substantially reduce energy and area efficiency. Recently, local learning techniques have been adopted to deal with the imprecision in memristive neural networks20. However, owing to the lack of global information, local learning techniques still lag behind in training DNNs on large datasets26. Moreover, local learning techniques typically require additional operations (such as different phases or additional branches)26,27, increasing energy and time consumption. Therefore, it is necessary to explore an efficient training technique that overcomes the algorithm-device mismatch while preserving global information.
In this work, to address the algorithm-device discrepancy, we propose an error-aware probabilistic update method (EaPU) that employs probabilistic update magnitudes to align with the stochastic updating process of memristors. With EaPU, only 0.86‰ of the parameters (for ResNet152) are updated, dramatically reducing the number of update operations. For example, EaPU achieves an average update pulse count Nup (see “Methods”) of 6 × 10−3 in memristors, over 104 times less than that of conventional write-verify methods (average update pulse count of 66) at the same precision. When combined with non-write-verify methods18, EaPU further achieves an average update pulse count Nup of 8.6 × 10−4, 103 times less than that of the original scheme (average update pulse count of 1). This sparse update characteristic is equivalent to enhancing the device’s endurance and extending its service life. Moreover, EaPU is highly compatible with BP, so training with EaPU captures global features while facilitating error-aware learning. For noiseless models, EaPU achieves negligible performance loss in comparison with the original BP. For noisy analog devices, EaPU is experimentally validated on memristor hardware in multi-point regression problems, including denoising and super-resolution tasks. Training with EaPU on memristor hardware achieves better experimental results, with >80% improvement in the structural similarity index (SSIM) in most cases, compared to related BP-based memristive training approaches (see “Methods”). Furthermore, the feasibility of EaPU in large-scale models, such as ResNet1,28 with CIFAR10 and CIFAR10029, SRResNet30 with ImageNet31, and Swin Transformer32 with ImageNet100, is verified via simulation. The simulation results confirm >60% improvements in accuracy when noise is present.
Moreover, evaluated at the 180 nm technology node, EaPU demonstrates a training energy cost reduction of 50.54× and 13.23× over the classical BP-based memristor training method18 and the advanced MADEM20, respectively (for inference, the reductions are 35.51× and 11.26×, respectively). Besides, when scaled down to 16 nm, the memristor hardware with EaPU offers an energy advantage of nearly six orders of magnitude compared with GPUs. Last but not least, we further test the retention characteristics of the network after EaPU training, confirming its long-term stability. Comparisons with training strategies from other distinct approaches demonstrate that EaPU exhibits good and balanced performance in learning efficiency, training stability, training performance, algorithm compatibility, energy-accuracy efficiency, and latency-accuracy efficiency.
Results
Algorithm-device mismatch and EaPU method
To overcome the unstable training processes originating from εcell, it is necessary to conduct an in-depth analysis of εcell. The relationship between device conductance and actual weight is typically set to a linear correspondence8,33,34. Then the relationship between the conductance update magnitude ΔG and the weight update magnitude ΔW is given by

$$\Delta G=\frac{{G}_{\max }}{{W}_{\max }}\Delta W\qquad (1)$$

where Gmax is the maximum programmable device conductance range and Wmax is the maximum absolute synaptic weight value. We define Rwg as the ratio between Wmax and Gmax, then

$$\Delta {W}_{{\mbox{desired}}}={R}_{{\mbox{wg}}}\cdot \Delta {G}_{{\mbox{desired}}}\qquad (2)$$

$$\Delta {W}_{{\mbox{actual}}}={R}_{{\mbox{wg}}}\cdot \left(\Delta {G}_{{\mbox{desired}}}+{\varepsilon }_{{\mbox{cell}}}\right)\qquad (3)$$

where Eqs. (2) and (3) show the ideal and actual update magnitudes in memristors, respectively. By analyzing Eqs. (1) and (3), it can be observed that a narrow conductance range (small Gmax) leads to a small ΔGdesired, further exacerbating the negative impact of εcell. Due to the small learning rate (typically smaller than 1e-4)35,36,37, the |ΔGdesired| of DNNs is always smaller than |εcell|. For example, we gather and analyze the weight update magnitudes ΔWdesired by training ResNet341,28 with CIFAR1029 using a large learning rate of 1e-3 (introduced in Supplementary Fig. 2). It turns out that the weight update magnitudes are extremely small (~10−5). According to Eq. (1), even if Gmax is in the range of thousands of μS, |ΔGdesired| is always less than 0.1 μS. Recently, researchers have achieved a programming precision of 1 μS9,19,24,25. However, an |εcell| of 1 μS is still much larger than the update magnitudes |ΔGdesired| (<0.1 μS), causing learning information loss. Some excellent high-precision methods38,39 may reduce εcell further, but their reliance on external software processing or precise circuits increases area and energy costs and reduces flexibility. It would be more advantageous to explore a flexible and low-cost training approach. According to Eq. (3), the large εcell makes ΔWactual much larger than ΔWdesired, leading to severe loss of learning information in analog learning (see detailed experiments in the “Simulation results on advanced networks” section). To overcome the contradiction between a large εcell and a small ΔGdesired, expanding ΔGdesired is an efficient choice. A small Rwg indicates a wide conductance range, yielding a large ΔGdesired and more ideal device characteristics. However, a wide conductance range means high conductance, causing high energy costs and complex circuit design. Moreover, most existing memristors operate in a narrow conductance range13,20,40.
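The scale mismatch described above can be checked with a few lines of arithmetic. The values below are illustrative, taken from the ranges quoted in the text (the conductance range, the ~10−5 weight update magnitude, and the ~2 μS write-noise SD):

```python
# Scale check: desired conductance update vs. write noise (illustrative values).
w_max = 1.0        # maximum absolute synaptic weight (normalized)
g_max = 80.0       # programmable conductance range in uS, so R_wg = w_max / g_max
delta_w = 1e-5     # typical per-step weight update magnitude (~10^-5, see text)
sd_cell = 2.0      # measured SD of the write noise eps_cell, in uS

delta_g = delta_w * g_max / w_max   # Eq. (1): desired conductance update, in uS
ratio = sd_cell / delta_g           # how strongly the noise dominates the update

print(delta_g)  # about 8e-4 uS: well below the ~1 uS programming precision
print(ratio)    # the write noise is ~2500x the desired update
```

Even with this comparatively wide 80 μS range, the desired conductance step is three orders of magnitude below the write noise, which is the core of the algorithm-device mismatch.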
Therefore, we propose a robust and cost-effective roadmap to extend the weight update magnitudes with negligible network performance degradation.
Enlarging the update magnitudes while maintaining algorithm performance typically requires a probabilistic method that preserves the statistical mean value of the updates41,42. The variability of εcell represents the stochasticity of the memristor update, and the standard deviation of εcell (\({{\mbox{SD}}}_{{\varepsilon }_{{\mbox{cell}}}}\)) serves as its quantitative description. We propose the EaPU (shown in Fig. 2a) to adjust the update magnitudes (\(\Delta\)W, see “Methods”) with a threshold \(\Delta\)Wth (typically \({{\mbox{SD}}}_{{\varepsilon }_{{\mbox{cell}}}}\) × Rwg, analyzed in the “Hyperparameter \(\Delta\)Wth” section), thereby reducing the impact of update noise. The key idea of EaPU is to extend the target update magnitudes with a certain probability. If the absolute value of the original update magnitude |\(\Delta\)W| exceeds the threshold \(\Delta\)Wth, the update magnitude \(\Delta\)Wn remains \(\Delta\)W. Otherwise, if |\(\Delta\)W| is below \(\Delta\)Wth, \(\Delta\)Wn is set to sign(\(\Delta\)W)·\(\Delta\)Wth with probability |\(\Delta\)W|/\(\Delta\)Wth, or stays at 0, meaning no update. The parameter update formula of EaPU can be expressed as follows:

$$\Delta {W}_{n}=\left\{\begin{array}{ll}\Delta W, & |\Delta W|\ge \Delta {W}_{{\mbox{th}}}\\ {\mbox{sign}}(\Delta W)\cdot \Delta {W}_{{\mbox{th}}}, & |\Delta W| < \Delta {W}_{{\mbox{th}}},\ {\mbox{with probability}}\ p=|\Delta W|/\Delta {W}_{{\mbox{th}}}\\ 0, & {\mbox{otherwise}}\end{array}\right.\qquad (4)$$
where \(\Delta\)Wth is the threshold and p is the probability of rounding to \(\Delta\)Wth. EaPU offers plug-and-play compatibility with popular optimizers (introduced in “Methods”). For example, we can extend the optimizer Adam43 to AdamEaPU (summarized in Supplementary Note 1) by transforming the original update magnitudes into the desired update magnitudes with Formula (4) to adapt to the features of memristors. Experimental results (shown in Supplementary Fig. 3) demonstrate that AdamEaPU and SGDEaPU work well in practice and compare favorably to the original BP (91.83% for AdamEaPU vs 91.76% for Adam and 98.40% for SGDEaPU vs 98.50% for SGD, respectively). Furthermore, when the device is idealized, \(\Delta\)Wth can be reduced to near zero so that AdamEaPU degrades to Adam. Even under ideal conditions, EaPU can still substantially reduce the number of updated parameters due to the probabilistic update.
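As an illustration, the update rule of Formula (4) can be sketched in a few lines of NumPy. The function name and signature are illustrative, not part of our hardware implementation; the point is that the expected update equals the original \(\Delta\)W, so the statistical mean of learning is preserved:

```python
import numpy as np

def eapu_update(delta_w, delta_w_th, rng):
    """Error-aware probabilistic update (EaPU) of Formula (4), sketch.

    Sub-threshold updates (|dW| < dW_th) are stochastically rounded to
    sign(dW) * dW_th with probability |dW| / dW_th, or dropped (set to 0);
    larger updates pass through unchanged. Since
    E[rounded] = sign(dW) * dW_th * (|dW| / dW_th) = dW,
    the update is unbiased in expectation.
    """
    delta_w = np.asarray(delta_w, dtype=float)
    p = np.clip(np.abs(delta_w) / delta_w_th, 0.0, 1.0)   # rounding probability
    bern = rng.random(delta_w.shape) < p                  # Bernoulli trial
    rounded = np.sign(delta_w) * delta_w_th * bern        # sign(dW)*dW_th or 0
    return np.where(np.abs(delta_w) >= delta_w_th, delta_w, rounded)
```

Because most sub-threshold entries are rounded to zero, most cells need no programming pulse at all, which is the source of EaPU's sparse update characteristic.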
a Schematic diagram of EaPU. EaPU is introduced to increase the algorithmic update magnitudes to address the discrepancy between larger analog update noise and smaller algorithmic update magnitudes. The updated histogram using EaPU (blue histogram) has become larger and more discrete compared with the original updated histogram (orange histogram), making the desired update magnitudes comparable to the update noise εcell (brown histogram). b Error-aware probabilistic training for learning tasks. The EaPU is used to enlarge the desired update magnitudes \(\Delta\)W and MP is used to implement efficient updates. During the training process, each layer follows the same method illustrated in the figure.
Error-aware probabilistic training
With the suggested EaPU methods, the training procedure on memristors is shown in Fig. 2b. The memristor performs the most expensive VMM, while digital computing handles the rest. The EaPU is utilized to compute the weight update magnitudes during error backpropagation. The EaPU method can be implemented by introducing a Bernoulli distribution (as shown in Supplementary Fig. 4). Furthermore, since the Bernoulli distribution can be realized using advanced random number generators44,45, this validates the simplicity and efficiency of hardware implementation for EaPU. Additionally, to achieve an efficient update process, we propose a universal memorial programming method (MP, details in “Methods”) to process the spatiotemporal variations. MP maintains the updated simplicity of the two-pulse scheme18 and the programming precision of the write-verify method46. It should be noted that the hardware-level weight update on memristors is not limited to MP, but other update methods can be applied. With error-aware probabilistic training, we can achieve in situ learning on memristors.
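A minimal sketch of one in situ update step, assuming a simple Gaussian model of the write noise εcell (expressed in weight units as \({{\mbox{SD}}}_{{\varepsilon }_{{\mbox{cell}}}}\) × Rwg); the function name and the noise model are illustrative stand-ins for the device behavior:

```python
import numpy as np

def train_step(w_dev, grad, lr, dw_th, sd_cell_w, rng):
    """One illustrative in situ update step: EaPU enlarges sub-threshold
    updates via a Bernoulli draw, then every programmed cell picks up
    write noise (modeled here as Gaussian in weight units)."""
    dw = -lr * grad
    # EaPU: stochastic rounding of sub-threshold updates (Formula 4)
    p = np.clip(np.abs(dw) / dw_th, 0.0, 1.0)
    bern = rng.random(dw.shape) < p
    dw_n = np.where(np.abs(dw) >= dw_th, dw, np.sign(dw) * dw_th * bern)
    # Only cells with nonzero dw_n are actually written; only those
    # accumulate write noise.
    programmed = dw_n != 0
    noise = rng.normal(0.0, sd_cell_w, dw.shape) * programmed
    return w_dev + dw_n + noise, programmed.mean()   # new weights, update ratio
```

The second return value is the per-step update ratio: unselected cells keep their conductance and accumulate no write noise, which is what the sparse-update statistics reported later measure.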
Hyperparameter \(\Delta\)Wth
To conduct the hardware experiments, we fabricate the one-transistor-one-resistor (1T1R) TiN/TaOx/HfOy/TiN memristor array and develop a corresponding computing system (introduced in “Methods” and Supplementary Fig. 11) with the National University of Defense Technology. We program the device to different target conductance ranges to obtain εcell. With a programming tolerance of 1 μS, the programming noise of the device (programmed conductance point in Supplementary Fig. 12a) is less than 1 μS, while εcell (relaxed conductance point in Supplementary Fig. 12a) turns out to be much larger than the programming noise owing to relaxation. This confirms that the relaxation error in the device is the main contributor to εcell. With an increase in the programming tolerance, the programming noise becomes more significant (shown in Supplementary Fig. 13). These measurements show that \({{\mbox{SD}}}_{{\varepsilon }_{{\mbox{cell}}}}\) ranges from 2 to 3 μS, and the standard deviation of the relaxation error is around 2 μS in our device. We simulate ResNet34 with varying εcell and \(\Delta\)Wth values (details in Supplementary Fig. 14) and find that the \(\Delta\)Wth that yields the best results is consistently around the value of \({{\mbox{SD}}}_{{\varepsilon }_{{\mbox{cell}}}}\). Therefore, we choose the \({{\mbox{SD}}}_{{\varepsilon }_{{\mbox{cell}}}}\) of our devices as \(\Delta\)Wth (\(\Delta\)Wth = \({{\mbox{SD}}}_{{\varepsilon }_{{\mbox{cell}}}}\) × Rwg, \({{\mbox{SD}}}_{{\varepsilon }_{{\mbox{cell}}}}\) = 2 μS, Rwg = 1/80 μS−1 for our device) in the experiments to enhance the training performance. Moreover, there exists a wide range of \(\Delta\)Wth values that achieve satisfactory performance, allowing us to select an empirical \(\Delta\)Wth for network training. Furthermore, the simulated results with high \({{\mbox{SD}}}_{{\varepsilon }_{{\mbox{cell}}}}\) (up to 16 μS, shown in Supplementary Fig. 14d) confirm that even though \({{\mbox{SD}}}_{{\varepsilon }_{{\mbox{cell}}}}\) is 4 orders of magnitude larger than the original update magnitudes used in BP, EaPU still achieves an accuracy of approximately 80%. These impressive results evidence the robustness of EaPU.
Hardware experiment for memristive autoencoder
To verify the feasibility of the EaPU method on memristors, we first conduct hardware experiments using an autoencoder47,48, which is a typical multi-point regression problem. In the experiment, the Modified National Institute of Standards and Technology (MNIST) dataset49 is used to train a denoising autoencoder, following the probabilistic training procedure. Figure 3a illustrates the feature map structure of the autoencoder. The trainable parameters of the autoencoder network are mapped using the dense mapping scheme, as illustrated in Fig. 3b with different color blocks. Existing mapping methods always use two or more cells to implement a signed weight8,22,38, reducing the utilization rate and increasing the energy costs. To optimize the utilization of memristor cells, the signed weight is encoded as a conductance difference between a reference cell with a fixed conductance in the reference column and a trainable cell with variable conductance, called the reference column mapping method (introduced in “Methods”). By employing this approach, only 162 cells (135 + 27) are required to represent the 135 parameters of the autoencoder network, enhancing the energy efficiency of the training and inference process.
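A minimal sketch of the reference-column encoding follows. The sign convention, function names, and the fixed reference conductance are illustrative assumptions; Rwg = 1/80 μS−1 is the device value from the text:

```python
import numpy as np

R_WG = 1.0 / 80.0   # weight-per-conductance ratio, in uS^-1 (device value)

def to_conductance(w, g_ref):
    """Encode signed weights as the conductance difference between a
    trainable cell and a fixed reference cell shared by its row
    (sign convention illustrative)."""
    return g_ref + w / R_WG          # trainable-cell conductance, in uS

def to_weight(g_cell, g_ref):
    """Decode: W = R_wg * (G_cell - G_ref)."""
    return (g_cell - g_ref) * R_WG

# Cell budget for the autoencoder: one trainable cell per parameter plus
# one reference cell per row, instead of 2 cells per signed weight.
n_params, n_rows = 135, 27
cells_used = n_params + n_rows       # 162, as in the text (vs 270 for 2 cells/weight)
```

A single reference column thus amortizes the cost of representing negative weights across all columns of the array.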
a Feature map structure of the autoencoder. The first convolutional layer is the encoder, while the last two layers form the decoder. The ReLU activation in the interlayer is omitted in the figure. b Optical micrographs of the 32 × 32 1T1R array with color blocks to show the different layers of the memristor-based autoencoder. c Loss curves during the in situ training process. At the end of the training, the losses of the experiment, simulation, and software are nearly indistinguishable. d Variation of SSIM over 5 epochs. Ex situ training point is the inference result after programming trained weights to memristors, where the weights are trained on a GPU using the memristive model. e Comparison of denoising results among different methods. The abbreviations denote the different training strategies rather than solely the update method. Training with EaPU achieves better results. f Denoising results for color images. The color image was split into three channels (R, G, B) and each channel was separately passed through the trained autoencoder to obtain the denoised image.
The training flow of the autoencoder follows the error-aware probabilistic training, and the training configurations are introduced in “Methods.” During the training process, the loss over the training steps is presented in Fig. 3c, and the SSIM over the training epochs is shown in Fig. 3d. To obtain the simulation results, the autoencoder with memristor features is simulated, taking into account the nonidealities of memristors. The hardware experiment (SSIM: 0.896) and the noise-free software simulation (SSIM: 0.869) achieve similar performance, confirming the ability of the memristor autoencoder to learn the various nonidealities during the training procedure. Memristor noise might improve the training robustness and performance33, which explains the slightly better experimental results. Supplementary Fig. 15 shows the traditional peak signal-to-noise ratio (PSNR, another metric) of the autoencoder, where the experimental result (PSNR: 19.99) achieves nearly the same PSNR as the simulation (PSNR: 19.84). To verify the effectiveness of EaPU, we implement the error-aware probabilistic training, the sign-based update method with BP, the write-verify update method with BP, and the two-pulse scheme with BP separately on the memristor system to train an autoencoder and compare them, as shown in Fig. 3e and Supplementary Fig. 16. The experimental results show that better results (SSIM: 0.896, >80% SSIM improvement over the other training methods) are achieved with EaPU, owing to the matching of the update process.
Supplementary Fig. 17a, b confirms the denoising ability of the trained memristor-based autoencoder by displaying the denoising results. We apply uniform noise and bicubic interpolation noise to the images, as shown in Supplementary Fig. 17c, further confirming the denoising capability of the trained autoencoder on other types of noise. Subsequently, we apply the trained autoencoder to denoise an image sourced from the SVHN (Street View House Numbers) dataset50, mitigating various forms of noise such as compression artifacts and image editing distortions, and obtain the denoised image in Fig. 3f. Moreover, Supplementary Fig. 18 presents the data compression result achieved by the trained memristor-based autoencoder (compression rate of 64.67%). Furthermore, an important application of autoencoders is anomaly detection, introduced in Supplementary Fig. 19, achieving a classification accuracy of 96.56% in distinguishing MNIST from Fashion MNIST images.
Hardware experiment for memristive super-resolution networks
In this part, a larger noise-sensitive task, super-resolution (SR)51,52, is chosen to validate the effectiveness of EaPU on memristor hardware. Figure 4a illustrates the feature-map-based architecture of the super-resolution network with an upscaling factor of 2× (SRNet ×2). Similar to the autoencoder, the dense mapping scheme is employed in the super-resolution network, as shown in Fig. 4b. As the transposed convolutional layer of SRNet ×2 results in a large weight size after dense mapping, we divide its weight into two parts, namely layer2-1 and layer2-2, for separate mapping. Besides, through the reference column mapping method, 396 cells (369 + 27) are required to represent the 369 parameters rather than 738 (369 × 2) cells, reducing nearly half the number of cells.
a Feature-map-based structure of SRNet ×2. b Optical micrographs of the 1T1R array with colored blocks to illustrate the partitions that implement the trainable parameters of the two convolutional layers and a transposed convolutional layer. c Loss curve of the network training process. d SSIM results of software, simulation, and experiment over 4 epochs. Ex situ training point is the inference result after programming trained weights to memristors. e Comparison of training results among different methods in memristors, where training with EaPU achieves the best result except for the ideal result. The abbreviations in this figure are the same as in Fig. 3e (Ex ex situ training, PU EaPU, SU sign-based update, WU write-verify update, TP two-pulse update). f The two-times image super-resolution results using the trained memristor-based SRNet ×2. g The two-times image super-resolution results using memristor-based SRNet ×2 for the concatenated images “886997” and “Einstein.” h Average update pulse count Nup of various programming methods. EaPU + MP achieves the precision of the write-verify method with a 90-fold reduction in average pulse count (0.7 vs 66). Moreover, EaPU + FP achieves a 10-fold reduction in average pulse count compared to the fast-programming scheme (0.1 vs 1).
The training process of the SRNet ×2 follows the error-aware probabilistic training, and the training configurations are introduced in “Methods.” Figure 4c illustrates the loss curve of the training process over 400 training steps, and Fig. 4d shows the variation of SSIM with epochs, confirming that the experimental results (SSIM: 0.933) closely align with the noise-free software-trained results (SSIM: 0.958). We further compare the error-aware probabilistic training with alternative training methods for SRNet ×2, as illustrated in Fig. 4e and Supplementary Fig. 20, achieving better results with EaPU (SSIM: 0.933, >0.6 SSIM improvement over the classical two-pulse scheme with BP). This result confirms the robustness and effectiveness of EaPU in noisy networks. Besides, Fig. 4f, g displays the results of using the trained memristor-based SRNet ×2 for two-times image super-resolution, and Supplementary Fig. 21 shows further tasks on SRNet ×4.
To discuss the efficiency of error-aware probabilistic training, we gather the update pulse numbers of different programming strategies during the aforementioned training process (shown in Fig. 4h and Supplementary Fig. 22). Owing to the probabilistic updates of cells, only around 6% of the parameters are updated during EaPU training while the remaining parameters are unchanged, as shown in Supplementary Fig. 22b, c, reducing the programming complexity. Thus, the combination of EaPU and MP significantly reduces the number of cells requiring updates and results in an Nup of less than 1 (~0.7) per device, reducing Nup by 10 times compared to MP alone and 90 times compared to the write-verify method (WU). When the fast-programming scheme (FP), including the sign-based update (SU) and two-pulse scheme (TP), is integrated with EaPU, the Nup (0.1) decreases by a factor of ten compared to the fast-programming scheme alone. When training deep networks, Nup decreases further, yielding great energy consumption advantages, which are analyzed in detail below.
Simulation results on advanced networks
Given that previous experiments have confirmed EaPU’s good training performance and efficiency on memristor hardware, we further conduct simulations on advanced networks to verify the feasibility and effectiveness of EaPU in deep networks.
We simulate ResNet34 (34 layers) with the CIFAR10 dataset and the SRResNet ×4 network30 (37 layers, details in Supplementary Fig. 23) with the ImageNet dataset to verify the enhancements and robustness of EaPU. To analyze the impact of εcell, we implement simulated experiments with the original BP and EaPU, shown in Fig. 5a. To be precise, the two-pulse scheme, write-verify update, and sign-based update can all be regarded as BP learning algorithms with update noise. The dashed lines in Fig. 5a depict the training results of BP for ResNet34 and SRResNet, highlighting the impact of large εcell on the original BP algorithm. The arrow in Fig. 5a further confirms the substantial improvement offered by the proposed EaPU method (for example, EaPU achieves an improvement of 73% in accuracy and 0.84 in SSIM for ResNet34 and SRResNet, respectively; the \({{\mbox{SD}}}_{{\varepsilon }_{{\mbox{cell}}}}\) is 2 μS and Rwg is 1/80 μS−1). The robustness of EaPU to large εcell relaxes the requirements for device performance. Meanwhile, the simulation with different Rwg (shown in Fig. 5b) further confirms the robustness and insensitivity of EaPU to the conductance range, achieving an improvement of 70% in accuracy and 0.79 in SSIM for ResNet34 and SRResNet, respectively, compared to the original algorithm when Rwg is 1/20 μS−1. Owing to the great robustness to εcell and the insensitivity to Rwg, EaPU can be effectively applied to analog devices with various update noises and conductance ranges, reducing the hardware constraints of training. Furthermore, we supplement simulations of ResNet34 with respect to unresponsive devices (shown in Supplementary Fig. 24), confirming that the network can still achieve a training accuracy of 85% in the presence of a 10% stuck-off ratio and a 2% stuck-on ratio.
a Training results of ResNet34 and SRResNet with different εcell and the improvement of the suggested EaPU. As εcell increases, the performance of the original algorithm declines dramatically, while our suggested EaPU achieves near-invariable performance. b Training process of ResNet34 and SRResNet with different Rwg. A larger Rwg indicates a narrower conductance range, and Rwg = 0 denotes the noiseless training results. With \({{\mbox{SD}}}_{{\varepsilon }_{{\mbox{cell}}}}\) kept constant, the performance of the original algorithm degrades rapidly as Rwg increases, whereas training with EaPU maintains stable performance. c Training results of simulated ResNet152 and Swin Transformer. Although the training results of the nonideal model (\({{\mbox{SD}}}_{{\varepsilon }_{{\mbox{cell}}}}\) is 2.4 μS and Rwg is 1/80 μS−1) decline by around 3% or 4% when using EaPU, they are much better than the original algorithm with the nonideal model (61.33% vs 1.1% and 90.22% vs 1.44%). d Update ratio of ResNet152 during the training process. During the training process, the update ratio remains smaller than 1.4‰. e Update ratio of all the layers in ResNet152. The legend represents the number of training iterations. f The reduction in Rup and Nup achieved by EaPU. Adopting EaPU reduces Nup by over three orders of magnitude (with pre-trained weights, a further reduction of four orders of magnitude can be achieved), which contributes to lowering training energy consumption.
As to very deep networks or complex datasets, we simulate ResNet152 with the CIFAR100 dataset and Swin Transformer32 networks with the ImageNet100 dataset and achieve the experimental outcomes presented in Fig. 5c and Supplementary Fig. 25. These results align with those in Fig. 5a, showing that the original BP algorithm struggles to train effectively in the presence of significant weight update noise, while EaPU effectively addresses this issue and the accuracies are improved by 60.23% and 88.78% for ResNet152 and Swin Transformer, respectively. Moreover, as shown in Fig. 5a, c, the training results of the noiseless model with EaPU can achieve negligible performance loss (<1%) in comparison with the original BP. The aforementioned hardware tests have confirmed the sparse update characteristic of the EaPU strategy. Given that the sparse update characteristic is of greater practical significance for deep networks, we statistically analyze the parameter update ratios during the training of ResNet152 and Swin Transformer. During the training process of ResNet152, we observe that the ratio of parameter updates (shown in Fig. 5d) follows a pattern of first increasing and then decreasing, with such updates generally remaining below 1.4‰ (the number of updated parameters is reduced by 99.86% at least), while the training accuracy continues to rise throughout the process. Moreover, the average update ratio Rup reaches 0.86‰ during the training process (shown in Fig. 5e). This is equivalent to increasing the number of maximum updates of the memristor system by three orders of magnitude, further extending the service life of memristors and alleviating the constraints imposed by endurance. The Supplementary Fig. 26 further validates the effectiveness of the sparse update characteristic of EaPU. During the transfer learning process of Swin Transformer, the parameter update ratio and Rup decrease to less than 0.15‰ and 0.04‰ (shown in Supplementary Fig. 27), respectively. 
The statistics of the Swin Transformer confirm that using pre-trained weights can greatly reduce the proportion of parameter updates. The ultra-low Rup results in an order-of-magnitude reduction in Nup (shown in Fig. 5f), thereby significantly lowering the energy consumption of the update process (see “Discussion”).
We further test and simulate the long-term retention characteristics of the network after training. Given that these characteristics are closely related to device properties, we discuss two device systems: RRAM and PCM. First, we measure the conductance changes of 4000 programmed RRAM devices over time on our hardware, as shown in Supplementary Fig. 28. The conductance of RRAM devices does drift over time; however, the conductance distribution undergoes only a mean shift, which can be restored to the original distribution by adding a mean offset, as shown in Supplementary Fig. 28f. We further test the accuracy over time of a fully connected network (64-32-24-10) under different methods on RRAM hardware, as shown in Supplementary Fig. 29, confirming that the EaPU-trained model, without any compensation, can address the drift issue in RRAM and maintain accuracy for more than 60 h. Then, we conduct the corresponding simulation in the PCM system using AIHWKit53,54. Due to the complex drift characteristics of PCM, GDC plays an important and indispensable role11,19. By introducing GDC to the EaPU-trained ResNet32 model, the inference accuracy of the network can be effectively maintained, as shown in Supplementary Fig. 30.
Discussion
Due to its insensitivity to Rwg, EaPU can achieve comparable accuracy on analog devices with much narrower conductance ranges, reducing the hardware constraints for in situ training. As shown in Supplementary Table 2 of Supplementary Note 2, EaPU with an Rwg of 1/40 μS−1 achieves better accuracy than the original BP algorithm with an Rwg of 1/4000 μS−1, reducing the energy consumption (shown in Supplementary Note 2.3). Owing to its ability to learn efficiently within the narrow training conductance range, the inference energy consumption of the EaPU-trained model is reduced to 35.51× lower than that of the fast-programming scheme (700 μJ vs. 19.71 μJ) and 11.26× lower than that of the advanced MADEM (222 μJ vs. 19.71 μJ), as shown in Fig. 6a. Furthermore, EaPU yields an ultra-low update characteristic (with Rup being 0.86‰ in ResNet152), which significantly reduces the update frequency (with Nup being 6 × 10−3 in the EaPU + MP configuration) and thereby lowers the energy consumption of the update process (more than 100× lower than related training methods, shown in Supplementary Table 3 of Supplementary Note 2). Owing to low inference and update energy consumption, EaPU-based training exhibits great training advantages, with an energy consumption reduction of 50.54× compared to the fast-programming scheme (and 13.23× compared to the advanced MADEM), as shown in Fig. 6b. Furthermore, a comparison between EaPU-based systems and GPUs is presented in Supplementary Note 3. In Flash memory systems, where write energy consumption and latency are substantial, EaPU achieves over three orders of magnitude reduction in update energy consumption and an 18.69× decrease in update latency. This ultimately results in a two-orders-of-magnitude reduction in training energy consumption and an 18.19× decrease in training latency, confirming the value of extending EaPU to other nonvolatile devices.
Meanwhile, training with high-energy-efficiency memristor systems yields better performance, enabling nearly a 6-order-of-magnitude reduction in training energy consumption and nearly a 2-order-of-magnitude reduction in training latency compared with GPUs.
a Inference energy consumption comparison. b Training energy consumption comparison. In both inference and training, EaPU + MP achieves the lowest energy consumption, attaining a several dozen-fold improvement in energy efficiency compared to the others. c Training process of ResNet34 with naïve SGD. The CIFAR10 dataset is used for ResNet34 training. The hyperparameters are optimized for TTv2 and extended to c-TTv2. d Training process of ResNet34 with an advanced optimizer. Here, TTv2+momentum and c-TTv2+momentum apply momentum to layers such as batch normalization (batchnorm) layers. By comparison, EaPU enables a stable learning process while allowing advanced training techniques to further enhance training performance (~30% improvement over naïve SGD with EaPU).
We further compare EaPU with other distinct memristor training approaches using AIHWKit, including Tiki-Taka v2 (TTv2)55, chopped-TTv2 (c-TTv2)56, and hardware-aware training (Hwa)11. EaPU, TTv2, and c-TTv2 are memristor-based in situ training methods, and all exhibit favorable learning capability on the LeNet5 task; however, EaPU converges faster, as shown in Supplementary Fig. 31. The two-pulse scheme demonstrates the fastest convergence on LeNet5; nevertheless, its sensitivity to update noise limits training accuracy. For deeper networks such as ResNet34, inappropriate hyperparameters can cause TTv2 and c-TTv2 to suffer from learning instability (as shown in Supplementary Fig. 32). Fine-grained hyperparameter tuning is therefore necessary for TTv2 and c-TTv2, which incurs optimization costs. By adjusting the threshold of the H matrix, the unstable learning problem can be effectively addressed (as shown in Supplementary Fig. 33). However, TTv2 and c-TTv2 require a certain degree of accumulated gradient magnitude and can only perform weight updates after reaching the threshold. This results in no weight updates over a certain number of iterations, which reduces learning efficiency (as shown in Fig. 6c). TTv2 and c-TTv2 employ digital modules for the cumulative computation of H arrays, which endows them with momentum-like properties and enhances their learning capability (see Supplementary Fig. 34 for further optimization). With additional optimization and fine parameter tuning, they may achieve training performance comparable to that of naïve SGD. However, they fail to match the accuracy of momentum-based training methods (as shown in Fig. 6d) and struggle to accommodate more advanced optimizer strategies (e.g., Adam). In contrast, EaPU is compatible with such advanced optimizer strategies, thereby achieving better training performance.
Furthermore, in simulations related to EaPU, we do not employ any optimization methods other than the one detailed in the “Hyperparameter ΔWth” section. Hwa is an approach that trains the noise model in software (on GPUs or CPUs) to improve noise robustness and then programs the trained weights onto analog hardware56; thus, it is mainly used for inference applications and cannot leverage the energy and latency advantages offered by in situ memristor training. Meanwhile, even in the presence of update noise, EaPU in a memristor system can achieve learning capability comparable to that of Hwa in a digital system, as shown in Supplementary Fig. 35a. Compared with Hwa in digital systems, EaPU in memristor training systems exhibits substantial advantages in energy consumption and latency (analyzed in Supplementary Note 3), with order-of-magnitude accuracy gains per unit of energy and per unit of latency (as shown in Supplementary Fig. 35b, c). This confirms the ultra-high energy-accuracy and latency-accuracy efficiency of EaPU-based training. Thus, under energy and latency constraints, EaPU can complete multiple orders of magnitude more training iterations than Hwa, thereby enabling higher learning performance.
In summary, EaPU training achieves robustness to update noise and insensitivity to conductance ranges, compensating for the limitations of in situ memristor training. It can thus handle various nonideal characteristics, supporting the precise and efficient training of DNNs (such as ResNet152 and the Swin Transformer) on analog devices. Meanwhile, owing to the probabilistic update characteristic of EaPU, the number of updated parameters during training is significantly reduced (down to 0.86‰ for ResNet152 and 0.04‰ for the Swin Transformer), which effectively lowers the energy consumption of the update process (a three-order-of-magnitude reduction in Flash memory) and increases the maximum number of training iterations the memristor system can sustain. These properties make EaPU promising for extension to various nonvolatile device systems. Due to the ultra-low operating conductance range and Nup, EaPU achieves a several dozen-fold reduction in training and inference energy consumption compared with related memristor training strategies. Coupled with the ultra-high energy efficiency of memristors, EaPU is expected to achieve a nearly six-order-of-magnitude reduction in energy consumption compared with GPUs. Furthermore, EaPU-trained models exhibit good retention characteristics; in particular, no additional compensation strategies are required for RRAM systems. Comparisons with other distinct training strategies further confirm EaPU’s good and balanced performance in learning efficiency, training stability, training performance, algorithm compatibility, energy-accuracy efficiency, and latency-accuracy efficiency, further enhancing the performance of analog training. The EaPU method holds promise for combination with other distinct learning approaches to achieve further improvements, thereby advancing progress in the field of memristor training.
Methods
Quantitative index for memristor error
Memristor precision limitation can be described as the calculation deviation of VMM8. VMM error εVMM is utilized to quantify the variance between the actual output yactual and desired output yfp, and can be described as follows:
In this formula, yfp = WdesiredX, where Wdesired is the desired target conductance for memristors and X describes the calculated input. εVMM can be decomposed as the linear error εlinear from incorrect programming, and residual error εresidual from calculation errors8, including reading noise, IR drop, and external circuit nonidealities. εlinear and εresidual are defined as:
where Wactual is the actual conductance on memristors. Herein, we use the writing noise εcell (εcell = Wactual − Wdesired) to represent εlinear for obtaining measured metrics. Following Eqs. (5), (7), and (8), Supplementary Fig. 1 displays a visual representation of the differences among εVMM, εresidual, and εcell. Here, we mainly illustrate the constraint factors of devices during the training process, while factors influencing the long-term retention process (such as long-term drift, discussed in the “Simulation results on advanced networks” section) are not included.
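The rendered equations are not reproduced in this text. The following is a plausible reconstruction of Eqs. (5), (7), and (8), consistent with the surrounding definitions (yfp = WdesiredX and εcell = Wactual − Wdesired); note that the published forms may additionally include normalization:

```latex
\varepsilon_{\mathrm{VMM}}      = y_{\mathrm{actual}} - y_{\mathrm{fp}}, \\
\varepsilon_{\mathrm{linear}}   = \left(W_{\mathrm{actual}} - W_{\mathrm{desired}}\right)X
                                = \varepsilon_{\mathrm{cell}}\,X, \\
\varepsilon_{\mathrm{residual}} = \varepsilon_{\mathrm{VMM}} - \varepsilon_{\mathrm{linear}}
                                = y_{\mathrm{actual}} - W_{\mathrm{actual}}X .
```

Under this reconstruction, εVMM decomposes additively into the programming-induced εlinear and the calculation-induced εresidual.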
Average update pulse count Nup
Nup refers to the average number of update pulses applied per device per iteration during the training process, which facilitates energy consumption calculation and evaluation. Assuming that the total number of update pulses applied in each iteration is Nall and the total number of devices is n, then \({N}_{{\mbox{up}}}={N}_{{\mbox{all}}}/n\).
Meanwhile, Nup can also be derived from the average number of update pulses per updated device per iteration (Ncell, counting only the devices that undergo updates) and the average update ratio (Rup), as \({N}_{{\mbox{up}}}={N}_{{\mbox{cell}}}\times {R}_{{\mbox{up}}}\).
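As an illustration (not from the paper's code; the device count and pulse numbers below are hypothetical), the two equivalent ways of computing Nup can be sketched as:

```python
# Illustrative sketch (not from the paper's code): two equivalent ways of
# computing the average update pulse count N_up defined above.
def n_up_from_totals(n_all_pulses: float, n_devices: int) -> float:
    """N_up = (total update pulses applied in one iteration) / (total devices)."""
    return n_all_pulses / n_devices

def n_up_from_ratio(n_cell: float, r_up: float) -> float:
    """N_up = (average pulses per *updated* device, N_cell) x (update ratio R_up)."""
    return n_cell * r_up

# Hypothetical numbers: 1e6 devices, R_up = 0.86 per mille (as for ResNet152),
# and ~7 pulses per updated device.
n_devices = 1_000_000
r_up = 0.86e-3
n_cell = 7.0
n_all = n_cell * r_up * n_devices            # total pulses in this iteration
assert abs(n_up_from_totals(n_all, n_devices) - n_up_from_ratio(n_cell, r_up)) < 1e-12
```

With these assumed numbers, Nup ≈ 6 × 10−3, matching the order of magnitude quoted in the “Discussion”.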
Related training approaches and their limitations
Typical training approaches combine a learning algorithm, such as backpropagation, with an update scheme. To implement conductance updates in analog devices, various conductance update schemes for memristive training have been proposed, such as the two-pulse scheme18,34, the sign-based scheme10,57, and the write-verify scheme21. The two-pulse scheme employs a pair of set and reset pulses to achieve a linear conductance response; however, its requirement for highly linear and symmetric devices to decrease εcell limits its applicability18,33. Sign-based update methods apply a voltage whose polarity depends on the sign of the update magnitude to the memristors. They are sensitive to the operation voltage, necessitating careful operation to prevent large conductance changes57. These two methods offer high efficiency but sacrifice some programming precision. Traditional write-verify methods, in contrast, combine a few pulses to adjust the conductance and check whether the current conductance meets the target. Though traditional write-verify strategies effectively decrease εcell21,24, their low-parallelism operation and the spatiotemporal variations εvar make them highly time- and energy-consuming. Therefore, existing update methods face an imbalance between precision and efficiency, which limits training performance. The proposed EaPU overcomes this imbalance, achieving robust training and better performance.
Definition for ΔW, ΔWth, and ΔWn
During the training process, the naïve stochastic gradient descent (SGD) update step can be described as \({W}_{ij}^{t+1}={W}_{ij}^{t}-\eta {\sum }_{b=1}^{B}{x}_{i}^{b}{\delta }_{j}^{b}\),
where t is the iteration number, \(\eta\) is the learning rate, and b is the index within a mini-batch of size B. \({x}_{i}^{b}\) is the activation at the input layer, and \({\delta }_{j}^{b}\) is the error computed at the output layer. Thus, ΔW denotes the update magnitude \(-\eta {\sum }_{b=1}^{B}{x}_{i}^{b}{\delta }_{j}^{b}\). In more complex update strategies, the update magnitudes may also include components such as momentum and second-order gradients. ΔWth is defined as the threshold in the EaPU training process, which can simply be set to \({{\mbox{SD}}}_{{\varepsilon }_{{\mbox{cell}}}}\) (\({{\mbox{SD}}}_{{\varepsilon }_{{\mbox{cell}}}}\times\) Rwg, specifically). ΔWn, the target update magnitude processed by EaPU, can be used directly in the memristor update process.
Transformation of the original optimizer to the suggested optimizer
Composing the original optimizer with EaPU requires only a few steps:
Step 0: define required learning hyperparameters, such as learning rate α, threshold ΔWth;
Step 1: initialize the trainable parameters θ;
Step 2: calculate the update magnitudes \(\Delta\)W through backpropagation;
Step 3: calculate the target update magnitudes \(\Delta\)Wn through Formula (4);
Step 4: update parameters.
By repeating Steps 2–4, we finally obtain converged trainable parameters.
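The steps above can be sketched as follows. The probabilistic thresholding in Step 3 is an assumed stand-in for Formula (4), which is not reproduced in this text: the raw update is stochastically rounded to integer multiples of the noise-derived threshold, so a sub-threshold update is applied only with probability |ΔW|/ΔWth.

```python
import numpy as np

rng = np.random.default_rng(1)

def eapu_step(theta, grad, lr=0.1, dw_th=0.01):
    """One EaPU-wrapped naive-SGD iteration (illustrative sketch only).
    Step 2: raw update magnitudes dW from backpropagation."""
    dw = -lr * grad
    # Step 3: target update magnitudes dW_n. As an assumed stand-in for the
    # paper's Formula (4), stochastically round dw to integer multiples of the
    # noise-derived threshold dw_th: a sub-threshold update is applied with
    # probability |dw| / dw_th and is zero otherwise (unbiased on average).
    scaled = dw / dw_th
    low = np.floor(scaled)
    dw_n = (low + (rng.random(np.shape(dw)) < scaled - low)) * dw_th
    # Step 4: apply the (mostly zero, hence sparse) target updates.
    return theta + dw_n

theta = np.full(1000, 1.0)                   # Step 1: initialize parameters
for _ in range(200):                         # repeat Steps 2-4
    theta = eapu_step(theta, grad=theta)     # toy quadratic loss, gradient = theta
assert abs(theta.mean()) < 0.1               # parameters settle near the minimum
```

Because every applied write is a multiple of ΔWth, updates smaller than the writing noise are either skipped or rounded up, which is what makes the update stream sparse.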
Memorial programming method
The memorial programming method (MP) is modified from the two-pulse scheme18, retaining the original scheme’s update simplicity and the write-verify21 method’s programming precision. As shown in Supplementary Fig. 5, MP is composed of a binary-search procedure and a step-by-step update procedure, where the binary search is employed to load pre-trained weights or initial weights from existing initialization methods to achieve enhanced training outcomes. The binary-search procedure, detailed in Supplementary Fig. 6, addresses the issues caused by device-to-device variation (which precludes uniform programming parameters, shown in Supplementary Fig. 7) and nonlinearity (which poses a challenge for linearity-dependent programming methods, shown in Supplementary Fig. 8) by requiring only a positive correlation between the conductance and the gate control voltage. Since the update magnitudes of parameters during network training are generally small, we can use the voltage position from the previous training step as the initial voltage position for the current training step and update the parameters by the step-by-step update procedure (the flow diagram is illustrated in Supplementary Fig. 9). Cycle-to-cycle variation (shown in Supplementary Fig. 10) allows the same control voltage to correspond to a range of conductances, which improves the efficiency of the step-by-step update procedure, as only a few pulses are needed to reach the target conductance. As the position search is performed only in the first training step, the time cost of the binary search becomes negligible over hundreds, thousands, or more training steps, greatly reducing the complexity of using MP. Comparing the programming complexity in Supplementary Table 1 shows that MP has the same complexity as the two-pulse scheme during network training.
Fabrication and integration of 1T1R array
The transistor array was fabricated in a commercial foundry using the 180 nm technology node (SMIC 1P6M). The metal layers M1 to M5 and the vias V1 to V5 were manufactured in the foundry. Subsequently, the resistive layer comprising TiN/TaOx/HfOy/TiN and the metal layer M6 were integrated in a laboratory cleanroom. The process involved the deposition of a 30-nm TiN bottom electrode using physical vapor deposition. Following this, 8-nm HfOy and 45-nm TaOx were grown with the atomic layer deposition method. The top TiN electrode was then stacked to 30 nm using physical vapor deposition. Finally, the metal layer was fabricated via sputtering under a suitable vacuum environment. In the 1T1R cell configuration, transistors were employed to mitigate the sneak path problem and implement precise conductance tuning, while memristors implemented nonvolatile storage. The computing system was designed to control the 1T1R array and conduct experiments on neural networks. Further details about the measuring system can be found in Supplementary Fig. 11.
Reference column mapping method
In memristor-based neural networks, the relationship between the real-domain training parameter W and the memristor training parameter G is \(W={R}_{{\mbox{wg}}}\times G\),
where Rwg is a positive number and G is a signed matrix. Since the conductance of a memristor is always positive, representing weights with conductances requires taking a difference. In our experiment, we represent G as \(G={G}_{{\mbox{trainable}}}-{G}_{{\mbox{zero}}}\),
where Gzero is the bias that keeps Gtrainable positive. Gzero is a fixed matrix in our experiments and is represented by a memristor column (called the reference column). After obtaining ΔG, we only need to update the memristor parameters corresponding to Gtrainable, reducing the number of parameter updates by half. Using a single memristor cell to represent signed weights means that Gmax is only half of a single memristor’s conductance range, making the impact of εcell more pronounced.
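A minimal sketch of this mapping, assuming the relation W = Rwg × G (consistent with Rwg being given in μS−1) and hypothetical values for Rwg and Gzero:

```python
import numpy as np

# Illustrative sketch; the relation W = Rwg * G is assumed, and the numeric
# values of Rwg and G_zero below are hypothetical.
r_wg = 1 / 40            # uS^-1, as in the narrow-conductance-range configuration
g_zero = 100.0           # uS, fixed reference-column conductance (assumed value)

def weights_to_conductance(w: np.ndarray) -> np.ndarray:
    """Map signed weights to positive trainable conductances G_trainable = G + G_zero."""
    g_signed = w / r_wg                      # G = W / Rwg (signed matrix)
    g_trainable = g_signed + g_zero
    assert np.all(g_trainable > 0), "G_zero must be large enough to keep G positive"
    return g_trainable

def conductance_to_weights(g_trainable: np.ndarray) -> np.ndarray:
    """Recover W = Rwg * (G_trainable - G_zero); the subtraction is the reference column."""
    return r_wg * (g_trainable - g_zero)

w = np.array([-1.5, 0.0, 2.0])
g = weights_to_conductance(w)                # roughly [40., 100., 180.] uS
assert np.allclose(conductance_to_weights(g), w)
```

Because Gzero is fixed, only the Gtrainable column needs reprogramming during training, which is the source of the halved update count noted above.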
Training configurations of the autoencoder
During the training process, Gaussian noise with a standard deviation of 0.3 is added to the original image to form the input image, while the original image is set as the target image. The mean square error is used to calculate the loss and drive the backward pass. The training consists of 1100 mini-batches, and we choose 5 points at intervals of 200 mini-batches to validate the trained results. The mini-batch size is 64.
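A minimal sketch of this data setup, assuming 28 × 28 grayscale images normalized to [0, 1] (the image size is an assumption, not stated here):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_denoising_batch(images: np.ndarray, sigma: float = 0.3):
    """Build one (input, target) pair: noisy image in, clean original out."""
    noisy = images + rng.normal(0.0, sigma, size=images.shape)
    return noisy, images

def mse_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean square error used to compute the loss for the backward pass."""
    return float(np.mean((pred - target) ** 2))

batch = rng.random((64, 28, 28))             # mini-batch size 64, image size assumed
noisy, target = make_denoising_batch(batch)
# For an identity "network", the loss is simply the injected noise power (~sigma^2)
assert abs(mse_loss(noisy, target) - 0.3 ** 2) < 0.01
```

The trained autoencoder should drive this loss well below the σ² ≈ 0.09 baseline of the identity mapping.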
Training configurations of SRNet ×2
During the training process, we employ the bicubic method to downscale the original images to half their size as the input data, while retaining the original images as the target images. The L1 loss is used to compute the loss value for backpropagation. The training involves 400 mini-batches across 4 epochs, with each epoch containing 100 mini-batches, and the mini-batch size is 256.
Data availability
The datasets used for the experiments and simulations in this study are publicly available29,49,50. The MNIST dataset is available at http://yann.lecun.com/exdb/mnist/. The SVHN dataset is available at http://ufldl.stanford.edu/housenumbers/. The CIFAR datasets are available at https://www.cs.toronto.edu/~kriz/cifar.html. Source data are provided with this paper.
Code availability
The code used to simulate the model with EaPU in this study is available at https://github.com/LJinchang/Experiment_EaPU58. The code that supports the experiments on memristors relies on the custom-built measurement system and is available from the corresponding authors upon request.
References
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 770–778 (IEEE, 2016).
Huang, G., Liu, Z., Van Der Maaten, L. & Weinberger, K. Q. Densely connected convolutional networks. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 4700–4708 (IEEE, 2017).
Ren, S., He, K., Girshick, R. & Sun, J. Faster R-CNN: towards real-time object detection with region proposal networks. In Proc. of the 29th International Conference on Neural Information Processing Systems Vol. 1, 91–99 (MIT Press, 2015).
Redmon, J. & Farhadi, A. Yolov3: an incremental improvement. Preprint at arXiv https://doi.org/10.48550/arXiv.1804.02767 (2018).
Brown, T.B. et al. Language models are few-shot learners. In Proc. of the 34th International Conference on Neural Information Processing Systems Vol. 33, 1877–1901 (Curran Associates Inc., 2020).
Achiam, J. et al. GPT-4 technical report. Preprint at arXiv https://doi.org/10.48550/arXiv.2303.08774 (2023).
Jouppi, N. et al. TPU v4: an optically reconfigurable supercomputer for machine learning with hardware support for embeddings. In Proc. of the 50th Annual International Symposium on Computer Architecture 82 (Association for Computing Machinery, 2023).
Le Gallo, M. et al. A 64-core mixed-signal in-memory compute chip based on phase-change memory for deep neural network inference. Nat. Electron. 6, 680–693 (2023).
Wan, W. et al. A compute-in-memory chip based on resistive random-access memory. Nature 608, 504–512 (2022).
Zhang, W. et al. Edge learning using a fully integrated neuro-inspired memristor chip. Science 381, 1205–1211 (2023).
Rasch, M. J. et al. Hardware-aware training for large-scale and diverse deep learning inference workloads using in-memory computing-based accelerators. Nat. Commun. 14, 5282 (2023).
Feng, Y. et al. Memristor-based storage system with convolutional autoencoder-based image compression network. Nat. Commun. 15, https://doi.org/10.1038/s41467-024-45312-0 (2024).
Yao, P. et al. Fully hardware-implemented memristor convolutional neural network. Nature 577, 641–646 (2020).
Yang, J. et al. Resistive memory-based neural differential equation solver for score-based diffusion model. Preprint at arXiv https://doi.org/10.48550/arXiv.2404.05648 (2024).
Chen, Y. Y. et al. Understanding of the endurance failure in scaled HfO2-based 1T1R RRAM through vacancy mobility degradation. In Proc. 2012 International Electron Devices Meeting 20.23.21–20.23.24 (IEEE, 2012).
Xu, X. et al. 40× retention improvement by eliminating resistance relaxation with high temperature forming in 28 nm RRAM chip. In Proc. 2018 IEEE International Electron Devices Meeting 20.21.21–20.21.24 (IEEE, 2018).
Laube, S. M. & TaheriNejad, N. Device variability analysis for memristive material implication. Preprint at arXiv https://doi.org/10.48550/arXiv.2101.07231 (2021).
Li, C. et al. Efficient and self-adaptive in-situ learning in multilayer memristor neural networks. Nat. Commun. 9, 2385 (2018).
Joshi, V. et al. Accurate deep neural network inference using computational phase-change memory. Nat. Commun. 11, 2473 (2020).
Yi, S.-i, Kendall, J. D., Williams, R. S. & Kumar, S. Activity-difference training of deep neural networks using memristor crossbars. Nat. Electron. 6, 45–51 (2022).
Wang, R. et al. Implementing in-situ self-organizing maps with memristor crossbar arrays for data mining and optimization. Nat. Commun. 13, 2289 (2022).
Kiani, F., Yin, J., Wang, Z., Yang, J. J. & Xia, Q. A fully hardware-based memristive multilayer neural network. Sci. Adv. 7, eabj4801 (2021).
Shi, T. et al. Stochastic neuro-fuzzy system implemented in memristor crossbar arrays. Sci. Adv. 10, eadl3135 (2024).
Rao, M. et al. Thousands of conductance levels in memristors integrated on CMOS. Nature 615, 823–829 (2023).
Feng, Y. et al. Improvement of state stability in multi-level resistive random-access memory (RRAM) array for neuromorphic computing. IEEE Electron Device Lett. 42, 1168–1171 (2021).
Scellier, B., Ernoult, M., Kendall, J. & Kumar, S. Energy-based learning algorithms for analog computing: a comparative study. In Proc. of the 37th International Conference on Neural Information Processing Systems 2295 (Curran Associates Inc., 2023).
Nøkland, A. & Eidnes, L. H. Training neural networks with local error signals. In Proc. International conference on machine learning 4839–4850 (PMLR, 2019).
He, K., Zhang, X., Ren, S. & Sun, J. Identity mappings in deep residual networks. In European conference on computer vision. 630–645 (Springer, 2016).
Krizhevsky, A. Learning multiple layers of features from tiny images. (University of Toronto, 2012).
Ledig, C. et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 4681–4690 (IEEE, 2017).
Russakovsky, O. et al. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 115, 211–252 (2015).
Liu, Z. et al. Swin transformer: hierarchical vision transformer using shifted windows. In Proc. IEEE/CVF International Conference on Computer Vision 10012–10022 (IEEE, 2021).
Wang, Z. et al. In situ training of feed-forward and recurrent convolutional memristor networks. Nat. Mach. Intell. 1, 434–442 (2019).
Wang, Z. et al. Reinforcement learning with analogue memristor arrays. Nat. Electron. 2, 115–124 (2019).
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) 4171–4186. https://doi.org/10.18653/v1/N19-1423 (Association for Computational Linguistics, 2019).
Liu, Z. et al. A ConvNet for the 2020s. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 11976–11986 (IEEE, 2022).
Rombach, R., Blattmann, A., Lorenz, D., Esser, P. & Ommer, B. High-resolution image synthesis with latent diffusion models. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 10684–10695 (IEEE, 2022).
Song, W. et al. Programming memristor arrays with arbitrarily high precision for analog computing. Science 383, 903–910 (2024).
Sharma, D. et al. Linear symmetric self-selecting 14-bit kinetic molecular memristors. Nature 633, 560–566 (2024).
Cai, F. et al. A fully integrated reprogrammable memristor–CMOS system for efficient multiply–accumulate operations. Nat. Electron. 2, 290–299 (2019).
Bernstein, J., Wang, Y.-X., Azizzadenesheli, K. & Anandkumar, A. signSGD: Compressed optimisation for non-convex problems. In Proc. International Conference on Machine Learning 560–569 (PMLR, 2018).
Wen, W. et al. TernGrad: ternary gradients to reduce communication in distributed deep learning. In Proc. of the 31st International Conference on Neural Information Processing Systems. Vol. 30 1508–1518 (Curran Associates Inc., 2017).
Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. Preprint at arXiv https://doi.org/10.48550/arXiv.1412.6980 (2014).
Jiang, H. et al. A novel true random number generator based on a stochastic diffusive memristor. Nat. Commun. 8, 882 (2017).
Kim, G. et al. Self-clocking fast and variation tolerant true random number generator based on a stochastic mott memristor. Nat. Commun. 12, 2906 (2021).
Shi, T. et al. Memristor-based feature learning for pattern classification. Nat. Commun. 16, https://doi.org/10.1038/s41467-025-56286-y (2025).
Michelucci, U. An introduction to autoencoders. Preprint at arXiv https://doi.org/10.48550/arXiv.2201.03898 (2022).
Choi, C. et al. Reconfigurable heterogeneous integration using stackable chips with embedded artificial intelligence. Nat. Electron. 5, 386–393 (2022).
Deng, L. The MNIST Database of Handwritten Digit Images for Machine Learning Research. IEEE Signal Process. Mag. 29, 141–142 (2012).
Netzer, Y. et al. Reading digits in natural images with unsupervised feature learning. In Proc. NIPS workshop on deep learning and unsupervised feature learning Vol. 7 (Granada, 2011).
Wang, X., Xie, L., Dong, C. & Shan, Y. Real-ESRGAN: training real-world blind super-resolution with pure synthetic data. In Proc. IEEE/CVF International Conference on Computer Vision 1905–1914 (IEEE, 2021).
Shi, W. et al. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 1874–1883 (IEEE, 2016).
Rasch, M. J. et al. A flexible and fast PyTorch toolkit for simulating training and inference on analog crossbar arrays. In Proc. 2021 IEEE 3rd International Conference on Artificial Intelligence Circuits and Systems (AICAS) 1–4 (IEEE, 2021).
Le Gallo, M. et al. Using the IBM analog in-memory hardware acceleration kit for neural network training and inference. APL Mach. Learn. 1, https://doi.org/10.1063/5.0168089 (2023).
Gokmen, T. Enabling training of neural networks on noisy hardware. Front. Artif. Intell. 4, 699148 (2021).
Rasch, M. J., Carta, F., Fagbohungbe, O. & Gokmen, T. Fast and robust analog in-memory deep neural network training. Nat. Commun. 15, 7133 (2024).
Yao, P. et al. Face classification using electronic synapses. Nat. Commun. 8, 15199 (2017).
Liu, J. et al. Error-aware probabilistic training for memristive neural networks. EaPU, https://doi.org/10.5281/zenodo.17338135 (2025).
Acknowledgements
This work was supported by the National Key R&D Program of China under Grant No. 2021YFB3601200, the National Natural Science Foundation of China under Grant Nos. U20A20220, U22A6001, 61821091, 61888102, and 61825404, the Key R&D Program of Zhejiang (No. 2022C01048), the Strategic Priority Research Program of the Chinese Academy of Sciences under Grant XDB44000000.
Author information
Authors and Affiliations
Contributions
T.S. and J.C.L. conceived the concept and designed the experiments. J.C.L. S.T., R.Z., and H.M. performed the electrical measurements. J.C.L. contributed to the neural network simulation. J.C.L., J.L., T.S., and Q.L. analyzed the experimental data, and J.C.L., J.L., B.L., Y.T., T.S., and Q.L. wrote the manuscript. All authors discussed the results and commented on the manuscript at all stages. Q.L. and T.S. supervised the research.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Corey Lammie and the other, anonymous, reviewer for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Source data
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Liu, J., Lu, J., Tang, S. et al. Error-aware probabilistic training for memristive neural networks. Nat Commun 16, 11494 (2025). https://doi.org/10.1038/s41467-025-66240-7