Abstract
This work presents a hardware-algorithm co-designed framework for neuromorphic computing, enabling efficient supervised learning in spike-based neural architectures. First, synaptic updates are reformulated as low-rank outer products of forward spike vectors and backward error gradients via singular value decomposition (SVD), enabling direct parallelization on 1T1R arrays. Second, a stochastic computing scheme replaces conventional sequential updates with probabilistic pulse-driven modulation, achieving one-step full-matrix synaptic updates. Third, gradient stabilization techniques mitigate training instability in deep SNNs by addressing silent neuron and gradient explosion issues. Evaluated on the ASL-DVS dynamic gesture recognition task, the framework maintains 84.7% accuracy with hardware-realistic 1T1R characteristics, while drastically reducing hardware update steps. This demonstrates a synergistic hardware-algorithm co-design where SVD-based approximation enables parallelization, stochastic computing achieves one-step updates, and gradient stabilization ensures trainability, advancing practical neuromorphic intelligence for edge sensing systems.
Introduction
Neuromorphic computing, inspired by the brain’s architecture and computational principles, offers a paradigm shift for processing sensory data streams. Its core tenets include event-driven, sparse computation and inherent parallelism, enabling efficient handling of spatiotemporal information prevalent in the real world, such as the asynchronous event streams generated by Dynamic Vision Sensors (DVS). Unlike conventional frame-based data, DVS asynchronously captures brightness changes with microsecond resolution, producing highly sparse spatiotemporal event streams. Among various neuromorphic models, spiking neural networks (SNNs), which communicate through temporal spike events, stand out due to their direct compatibility with event-driven neuromorphic hardware architectures1. The sparse nature of spikes significantly reduces redundant computations, crucial for low-power real-time processing, while neuronal dynamics directly model temporal correlations within event streams, providing efficient solutions for dynamic tasks like gesture recognition2,3,4. Critically, the parallelism and efficiency of weight update strategies within these neuromorphic architectures directly determine the energy efficiency and latency of hardware systems5.
Training algorithms with spikes primarily fall into two categories: indirect and direct training. Indirect training converts pre-trained artificial neural networks (ANNs) into SNNs, leveraging mature ANN frameworks6,7. However, this paradigm fundamentally operates as static mapping of offline data8, which exhibits inherent incompatibility with the spatiotemporal nature of DVS event streams. Specifically, indirect training fails to capture temporal information in event data and often loses dynamic features of spike events during conversion, leading to compromised recognition performance. In contrast, unsupervised direct training methods like Spike-Timing-Dependent Plasticity (STDP) have been widely explored for their hardware-friendly local synaptic rules9,10. Nevertheless, the absence of a global credit assignment mechanism in STDP limits its scalability for complex tasks11. Therefore, developing supervised training methods that combine temporal sensitivity, learning efficacy, and hardware efficiency is imperative for DVS-driven applications.
Spike-based Backpropagation (spikeBP), integrating backpropagation’s credit assignment with spike timing coding, offers a promising solution (Fig. 1a). Its strengths lie in two aspects: first, the well-established gradient framework ensures training stability in deep SNNs; second, spike-driven sparse computation substantially reduces hardware overhead for synaptic updates12,13,14,15. However, when deploying spikeBP on neuromorphic hardware, two critical bottlenecks emerge: (1) silent-neuron-induced gradient loss and critical-slope-induced gradient explosion hinder training efficiency in deep networks16,17, and (2) element-wise synaptic updates are incompatible with the highly parallel architecture of memristor crossbars (Fig. 1b). Notably, in traditional deep neural networks (DNNs), memristor crossbars achieve parallel weight updates through outer products of forward signal vectors and backward error vectors; deterministic or stochastic encoding of these vectors enables natural analog multiplication via memristor conductance modulation18,19,20. However, spikeBP’s synaptic updates involve nonlinear interactions between presynaptic and postsynaptic spike timings, preventing direct mapping to vector outer products and forming a core obstacle for hardware optimization (Fig. 1c).
a ASL-DVS event processing pipeline: Dynamic hand gesture (left) is encoded into spatiotemporal spikes and processed by SNN (middle). Layers (blue/red neurons) encode update pulse trains to a memristor crossbar (right), enabling parallel hardware updates. b Limitations of conventional DNN frameworks: Analog activations with off-chip outer product computation and row/column-wise weight updates. c Proposed SNN parallel framework: Spike train inputs (LIF neurons) with one-step in-memory crossbar updates via voltage pulses, fusing outer product computation with synaptic modulation.
In this work, we propose a hardware-algorithm co-design approach to address these challenges, modifying the spikeBP algorithm for compatibility with parallel neuromorphic hardware based on one-transistor one-memristor (1T1R) arrays. First, singular value decomposition (SVD) is employed to approximate synaptic weight matrices as outer products of forward spike vectors and backward error vectors, enabling direct mapping to parallel multiply-accumulate operations. Second, stochastic computing techniques transform outer product operations into probabilistic pulse superposition, achieving one-step full-matrix updates with reduced hardware latency. Additionally, gradient clipping and forced firing mechanisms are designed to mitigate silent neuron and surge issues in large-scale network training. Crucially, we validate the proposed modifications through both algorithmic evaluation and 1T1R memristor device characterization integrated into the stochastic update scheme. Experiments demonstrate that our co-design achieves 92% accuracy under ideal simulation and 84.7% accuracy with hardware-realistic 1T1R characteristics on the ASL-DVS dynamic gesture recognition task, validating both algorithmic efficacy in event-driven scenarios and practical feasibility for parallel neuromorphic implementation.
Results
Neural dynamics and learning challenge
The core challenge in neuromorphic hardware implementation stems from the fundamental mismatch between spike-based neural dynamics and parallel computing architectures. We adopt the Spike Response Model (SRM)12 for its analytical tractability while maintaining equivalence to Leaky Integrate-and-Fire (LIF) neurons under rectangular postsynaptic potentials. This ensures compatibility with standard neuromorphic hardware implementations9.
The membrane potential dynamics follow:
\({u}_{j}(t)={\sum }_{i}{w}_{ij}\,\varepsilon (t-{t}_{i})\)
where uj is the membrane potential, wij are the synaptic weights, and ε is the presynaptic potential kernel.
The spike timing-dependent weight update in traditional spikeBP follows:
\(\Delta {w}_{ij}\propto -{\delta }_{j}\,\varepsilon ({t}_{j}-{t}_{i})\)
where ti and tj are the pre- and postsynaptic spike times, and δj is the backpropagated error.
This element-wise update presents two hardware bottlenecks: (1) ε(tj − ti) couples pre- and postsynaptic events (Fig. 2a, b), preventing explicit decomposition into separable vectors; (2) it inherently suffers from the learning challenges noted in the “Introduction” section: silent-neuron-induced gradient loss and critical-slope-induced gradient explosion. Our key insight is that, while exact decoupling is impossible, the low-rank structure of the spike timing matrix enables an efficient approximation. As visualized in Fig. 2c, the rectified timing differences \({\boldsymbol{T}}={[({t}_{j}-{t}_{i})\vee 0]}_{i\times j}\) exhibit (1) linear manifolds along presynaptic dimensions and (2) shift invariance relative to postsynaptic baselines. This geometric regularity motivates a rank-1 approximation via SVD, while the gradient stability challenges are addressed through our Ensemble of Surrogate Gradients (Section “Ensemble of Surrogate Gradients”).
a SRM neuron dynamics: Presynaptic spikes x1, x2, x3 are weighted by synapses w1, w2, w3, integrated as dendritic potential u = ∑xiwi, and transformed to somatic output f(u). The inset illustrates spike generation when membrane potential crosses threshold θ. b Temporal encoding of PSP amplitudes: Postsynaptic spike timing tj samples amplitudes from presynaptic traces (rows 1–3) at tj, forming the matrix ε(tj − ti). c Low-rank geometric structure: Left: Presynaptic timings ti form a linear manifold in j-dimensional space. Right: Rectified timing differences tj − ti shift rows along axes while preserving linearity. d SVD validation: Data points (blue) of different inputs cluster along the principal singular vectors (purple line) with rank-1 decomposition.
Approximate synaptic update matrix as outer product of forward and backward vectors
To enable hardware-friendly parallel synaptic updates, we first reformulate the weight update matrix in spikeBP into an outer product form compatible with memristor crossbars. In analog neural networks, synaptic updates are expressed as the outer product of forward activations x and backward errors δ21:
\(\Delta {\boldsymbol{W}}\propto {\boldsymbol{x}}{{\boldsymbol{\delta }}}^{T},\qquad \Delta {w}_{ij}\propto {x}_{i}{\delta }_{j}\)
enabling parallel updates by applying x and δ as row/column voltage pulses (Fig. 1c). Two vectors of write voltages, proportional to xi and δj, are generated and applied to the row and column ends of the memristor crossbar, respectively. Owing to this multiplicative effect, each memristor element Gij in the crossbar receives a programming voltage proportional to xiδj, so the conductance change ΔGij is also proportional to xiδj according to the physical properties of memristor conductance tuning. In this way, the whole synaptic weight matrix is updated in a one-step manner, harnessing the full parallelism of the crossbar22,23.
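As a minimal illustration of this one-step update, the following NumPy sketch emulates the crossbar behavior under an assumed linear conductance response; the function name, learning rate, and array shapes are illustrative and not part of the hardware design.

```python
import numpy as np

def outer_product_update(G, x, delta, eta=1e-3):
    """One-step crossbar-style update: every element G[i, j] changes in
    proportion to x[i] * delta[j], emulating the superposition of row and
    column voltage pulses under an assumed linear conductance response."""
    return G + eta * np.outer(x, delta)

# Illustrative usage with arbitrary shapes and values.
rng = np.random.default_rng(0)
G = rng.uniform(0.1, 1.0, size=(4, 3))    # conductance matrix (arbitrary units)
x = rng.random(4)                         # forward (row) vector
delta = rng.standard_normal(3)            # backward error (column) vector
G_new = outer_product_update(G, x, delta)
assert np.allclose(G_new - G, 1e-3 * np.outer(x, delta))
```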
However, spikeBP introduces a critical divergence: updates depend on the nonlinear coupling of spike timing differences ε(tj − ti) (Fig. 2a), defined as24:
\(\varepsilon ({t}_{j}-{t}_{i})=\frac{({t}_{j}-{t}_{i})\vee 0}{\tau }\,{e}^{1-({t}_{j}-{t}_{i})/\tau }\)
where ε(tj − ti) encodes both the temporal interaction (tj − ti) and its alpha-shaped postsynaptic potential (PSP), with τ the PSP time constant. As shown in Fig. 2b, ε(tj − ti) forms a non-factorizable matrix in which each element depends on the presynaptic (ti) and postsynaptic (tj) spike times, preventing direct vector outer product decomposition.
To resolve this, we decompose ε(tj − ti) into separable components. First, the alpha function’s exponential structure allows splitting into presynaptic- and postsynaptic-dependent terms:
\(\varepsilon ({t}_{j}-{t}_{i})=[({t}_{j}-{t}_{i})\vee 0]\cdot \frac{e}{\tau }\,{e}^{-{t}_{j}/\tau }\,{e}^{{t}_{i}/\tau }\)
Here, ε(tj − ti) combines rectified timing differences (tj − ti) ∨ 0 and exponential decay terms. While the exponential components naturally decouple into pre- and postsynaptic vectors, the rectified matrix \({\boldsymbol{T}}={[({t}_{j}-{t}_{i})\vee 0]}_{i\times j}\) requires dimensionality reduction to align with hardware parallelism.
The geometric intuition behind this reduction is illustrated in Fig. 2c. Each row of T corresponds to a presynaptic neuron’s timing differences across postsynaptic neurons. In the absence of postsynaptic shifts (tj = 0), T reduces to \({[-{t}_{i}]}_{i\times j}\), forming a linear manifold in the j-dimensional space (left panel). Introducing postsynaptic timings tj shifts each row by − ti along respective axes (right panel), yet preserves an approximately linear structure due to the dominance of presynaptic timing ti. This near-linearity justifies approximating T via rank-1 SVD:
where p (scaled left singular vector) and q (right singular vector) capture the principal variance, and qb accounts for residual biases.
Combining SVD with exponential terms, the synaptic update rule becomes:
with the presynaptic spike vector x, backpropagated error vector δ, presynaptic bias vector xb, and error bias vector δb defined as:
Here, the bias vectors xb and δb compensate for the approximations inherent in the rank-1 SVD, ensuring the fidelity of the decomposed update rule.
As visualized in Fig. 2d, the rectified timing differences T cluster tightly around the principal axis, confirming that the rank-1 approximation captures the main variance (eigenvalue ratio). This ensures minimal accuracy loss while enabling outer product-based updates. By applying x and δ as voltage pulses to 1T1R arrays, the entire synaptic matrix is updated in one step, achieving parallelism comparable to DNNs.
The average relative reconstruction error of this approximation across the dataset is ~4.3%, which justifies the trade-off between accuracy and hardware efficiency, as evidenced by the minimal accuracy drop in network performance (Fig. 5).
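The rank-1 approximation itself can be reproduced with a few lines of NumPy, as sketched below; the spike times are synthetic placeholders, so the resulting error differs from the ~4.3% reported for the actual dataset.

```python
import numpy as np

rng = np.random.default_rng(1)
t_pre = rng.uniform(0, 10, size=64)     # presynaptic spike times t_i (synthetic)
t_post = rng.uniform(5, 15, size=32)    # postsynaptic spike times t_j (synthetic)

# Rectified timing-difference matrix T[i, j] = max(t_j - t_i, 0).
T = np.maximum(t_post[None, :] - t_pre[:, None], 0.0)

# Rank-1 approximation: T ~= p q^T, with p the scaled left singular vector
# and q the right singular vector, as in the decomposition above.
U, S, Vt = np.linalg.svd(T, full_matrices=False)
p = S[0] * U[:, 0]
q = Vt[0, :]
T1 = np.outer(p, q)

rel_err = np.linalg.norm(T - T1) / np.linalg.norm(T)
print(f"relative rank-1 reconstruction error: {rel_err:.3f}")
```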
One-step implementation via stochastic computing
Building upon the SVD-based outer product approximation of synaptic updates, we further propose a stochastic computing scheme to enable one-step parallel updates on 1T1R arrays. The core challenge lies in mapping the multiplicative relationship ΔGij ∝ xiδj to the physical superposition of voltage pulses. Traditional deterministic pulse schemes require designing 64 distinct pulse patterns for 6-bit precision, which imposes prohibitive hardware overheads18. To address this, our method comprises three stages: SVD decomposition, probabilization, and stochastic encoding (Fig. 3a–c), leveraging statistical properties of random pulses to approximate outer-product operations.
a SVD-based decomposition: Synaptic update matrix approximated as outer product of forward spike vectors x and backward error vectors δ. b Probabilization: Normalization and truncation of x and δ to [−1, 1] and [0, 1], with heatmaps d illustrating truncation effects. c Stochastic encoding in 1T1R crossbar: Independent random pulse trains with probabilities proportional to \({x}_{i}^{prob}\) (word lines) and \({\delta }_{j}^{prob}\) (bit lines). Overlapping pulses (purple) trigger updates. e Example pulse sequences representing probabilities. f Update conditions: Only simultaneous gate and drain pulses induce conductance change. g Ideal simulation: Theoretical weight updates (red) versus stochastic coding results (blue) under linear memristor assumption.
Probabilization with variable cutting
Given the unbounded dynamic range of SVD-derived vectors in Eqs. (7) and (8), we normalize them to probability-compatible ranges:
where scaling factors sx, sδ, sδb, sxb are determined from layer-wise statistics. This preserves gradient distributions while constraining values to [−1, 1] for signed terms and [0, 1] for xb. The truncation effect for xb is explicitly visualized in Fig. 3d (bottom heatmap).
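A minimal sketch of this probabilization step is shown below; the percentile-based scale factor and the vector sizes are illustrative assumptions, standing in for the layer-wise statistics used in practice.

```python
import numpy as np

def probabilize(v, lo=-1.0, hi=1.0, percentile=99.0):
    """Scale by a layer-wise statistic, then truncate to the target range.
    The percentile-based scale factor is an illustrative choice."""
    scale = np.percentile(np.abs(v), percentile) + 1e-12
    return np.clip(v / scale, lo, hi)

rng = np.random.default_rng(2)
x = 3.0 * rng.standard_normal(128)          # SVD-derived forward vector (unbounded)
delta = 0.5 * rng.standard_normal(64)       # SVD-derived error vector
x_b = np.abs(rng.standard_normal(128))      # bias term, kept non-negative

x_prob = probabilize(x)                     # signed term, truncated to [-1, 1]
delta_prob = probabilize(delta)             # signed term, truncated to [-1, 1]
xb_prob = probabilize(x_b, lo=0.0, hi=1.0)  # bias term, truncated to [0, 1]
```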
1T1R stochastic encoding
The normalized \({x}_{i}^{prob}\) and \({\delta }_{j}^{prob}\) are encoded as independent stochastic bitstreams (Fig. 3e). Gate pulses are applied to rows with the probability \({x}_{i}^{prob}\), while drain pulses are applied to columns with the probability \({\delta }_{j}^{prob}\). Crucially, conductance changes occur exclusively during simultaneous gate and drain pulses (Vg&Vd state in Fig. 3f). Bipolar updates are implemented through a differential pair architecture: Each synaptic weight wij is physically represented as \({G}_{ij}^{+}-{G}_{ij}^{-}\), with positive updates applying SET pulses to \({G}_{ij}^{+}\) and negative updates (\({x}_{i}^{prob} < 0\)) to \({G}_{ij}^{-}\).
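The coincidence-gated update on a differential pair can be sketched as follows, assuming each overlapping pulse adds a fixed step ΔG0 (the ideal-device assumption of the next section); handling the update sign through the sign of the product xiδj is one illustrative interpretation of the differential scheme.

```python
import numpy as np

def stochastic_update(Gp, Gn, x_prob, delta_prob, length=50, dG0=1e-4, rng=None):
    """Coincidence-gated stochastic update of a differential pair (G+, G-).
    Row (gate) pulses fire with probability |x_i|, column (drain) pulses with
    probability |delta_j|; a fixed step dG0 is applied only where both pulses
    overlap (the '11' state), and the sign of x_i * delta_j selects which
    device of the pair is potentiated."""
    rng = np.random.default_rng() if rng is None else rng
    sign = np.sign(np.outer(x_prob, delta_prob))
    for _ in range(length):
        rows = rng.random(x_prob.size) < np.abs(x_prob)          # gate pulse train
        cols = rng.random(delta_prob.size) < np.abs(delta_prob)  # drain pulse train
        coincide = np.outer(rows, cols)                          # '11' coincidences
        Gp += dG0 * coincide * (sign > 0)
        Gn += dG0 * coincide * (sign < 0)
    return Gp, Gn

# Over many pulses, E[Gp - Gn] grows in proportion to x_prob_i * delta_prob_j.
```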
Ideal simulation validation
Figure 3g validates the stochastic encoding scheme under ideal memristor assumptions, where each coincident pulse induces a fixed conductance change ΔG0 and the bitstream length is 50. The theoretical update surface (red) follows ΔG ∝ P(x) ⋅ P(δ), while the stochastic simulation (blue) shows close alignment with minor discretization errors at low probabilities. Crucially, the ensemble average preserves the multiplicative relationship \({\mathbb{E}}[\Delta G]\propto P(x)\cdot P(\delta )\). Remarkably, even with minimal pulse sequences (sequence length = 1), statistical averaging across training epochs maintains learning efficacy.
This temporal accumulation effect allows ultra-short programming cycles without compromising convergence, a key advantage for event-driven systems. The physical realization and experimental validation of this scheme will be detailed in the next section.
1T1R memristor array characterization
To physically validate the proposed stochastic update scheme, we fabricated and characterized a 1-kb (32 × 32) 1T1R memristor array with TiN/TaOx/HfO2/TiN heterostructure devices. A micrograph of the fabricated array is shown in Fig. 4a, where word lines (WL) and source lines connect to transistor gates and sources, respectively, and bit lines (BL) connect to memristor top electrodes. The 1T1R configuration provides essential selection capability for parallel programming while suppressing sneak currents. (For full characterization data, including LTD behavior and endurance tests, see Supplementary Fig. 1)
a Micrograph of 1-kb (32 × 32) 1T1R array. b Conductance modulation: LTP induced by 50 “11” pulses (10 μs) and conductance stability under non-update conditions (“00”, “01”, “10”) at three conductance levels. c Stochastic update implementation: Crossbar rows receive gate pulses (Vg = 1.2 V) with probability P(x), columns receive drain pulses (Vd = 0.8 V) with probability P(δ). Update occurs only during coincident “11” pulses. d Measured conductance trajectory under stochastic programming. e Theoretical (red) vs. measured (blue) conductance changes across probability space.
Controlled conductance modulation
Figure 4b demonstrates reliable conductance modulation under various pulse conditions. Long-term potentiation (LTP) was achieved using 50 consecutive “11” pulses (10 μs width), where gate voltage (Vg = 1.2 V) activates the transistor and drain voltage (Vd = 0.8 V) induces SET switching. Crucially, we verified immunity to unintended updates: At three representative conductance levels (8 kΩ, 12 kΩ, and 20 kΩ), non-update pulse patterns (“01”: drain-only pulse, “10”: gate-only pulse, “00”: no pulses) produced negligible conductance changes, confirming selective update only during coincident “11” events.
Stochastic update implementation
The physical implementation of our probabilistic update scheme, based on the characterized properties of our 1T1R array, is illustrated in Fig. 4c. The schematic corresponds to the array architecture shown in the micrograph (Fig. 4a): (1) rows receive gate pulses (Vg = 1.2 V) on the G1–Gn lines (blue, corresponding to word lines, WL) with occurrence probability \(P({x}_{i}^{prob})\); (2) columns receive drain pulses (Vd = 0.8 V) on the D1–Dn lines (red, corresponding to bit lines, BL) with occurrence probability \(P({\delta }_{j}^{prob})\). Memristor conductance changes occur exclusively when both row and column pulses coincide (“11” state), implementing the multiplicative relationship \(\Delta {G}_{ij}\propto P({x}_{i}^{prob})\cdot P({{\delta }_{j}}^{prob})\).
Stochastic update validation
Figure 4d shows a representative conductance trajectory under stochastic encoding (P(x) = 0.4, P(δ) = 0.7). The stepwise increases correspond to “11” pulse occurrences, demonstrating the cumulative nature of probabilistic updates. Statistical characterization across the probability space (Fig. 4e) reveals close agreement between the theoretical expectations (red surface) and the statistically averaged measured conductance changes (blue surface). Despite inherent device variability, the ensemble behavior preserves the multiplicative relationship essential for outer product approximation, with \({\mathbb{E}}[\Delta G]\propto P(x)\cdot P(\delta )\).
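To examine how such device behavior can be folded back into the ideal simulation of Fig. 3g, a saturating nonlinear update model of the following form is commonly used; the functional form and all parameter values here are illustrative assumptions, not the fitted characteristics of our devices (Supplementary Fig. 2).

```python
import numpy as np

def set_pulse(G, G_min=2e-5, G_max=1.2e-4, alpha=0.05, sigma=0.10, rng=None):
    """Apply one coincident ('11') SET pulse with a saturating, nonlinear
    conductance increment plus cycle-to-cycle variability. The linear-saturation
    form and all parameter values are illustrative placeholders, not fitted
    device data."""
    rng = np.random.default_rng() if rng is None else rng
    dG = alpha * (G_max - G) * (1.0 + sigma * rng.standard_normal(np.shape(G)))
    return np.clip(G + dG, G_min, G_max)

# Replacing the fixed dG0 step in the ideal simulation with set_pulse() lets the
# stochastic-update scheme be re-evaluated under device-realistic behavior.
```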
ASL-DVS gesture recognition
Evaluated on ASL-DVS with a VGG16 SNN, the original spikeBP achieves 97.1% accuracy. SVD approximation introduces marginal degradation (96.8%), while probabilization and stochastic encoding reduce accuracy to 92% (Fig. 5). When incorporating measured 1T1R device characteristics, the accuracy stabilizes at 84.7%. This trade-off enables one-step full-matrix updates, reducing synaptic update latency while preserving event-driven processing capabilities.
Test accuracy of a VGG16 SNN under progressive hardware-compatible modifications. Light-colored bands around each curve represent ±1 standard deviation across 5 training trials, reflecting training stability. While SVD introduces minimal accuracy loss, probabilization and stochastic encoding trade marginal degradation for parallel updates, critical for event-driven hardware deployment, with the complete system achieving 84.7% accuracy under device constraints.
Discussion
Our work establishes a hardware-algorithm co-design framework that reconciles the temporal sensitivity of SNNs with the parallelism constraints of neuromorphic hardware. By integrating SVD and stochastic computing, the proposed spikeBP variant achieves one-step synaptic updates on 1T1R arrays, reducing latency while maintaining functionality. This contrasts with conventional ANN-based approaches that discard temporal spike correlations or STDP methods lacking global optimization. The SVD approximation, which introduces an ~4.3% temporal reconstruction error, results in only a minor network accuracy drop (96.8% vs. 97.1%). This effective trade-off enables direct mapping to analog in-memory computing architectures, which is critical for energy-efficient event processing.
The stochastic encoding scheme bridges algorithmic gradients to device physics. It translates gradients into probabilistic pulse coincidence, circumventing the precision bottlenecks of deterministic pulse designs. This scheme achieves 84.7% accuracy on the ASL-DVS task when incorporating measured device characteristics. We conducted a controlled analysis to dissect the sources of this accuracy degradation. Our analysis shows that deterministic nonlinearity in conductance modulation is a primary bottleneck. This nonlinearity, which we fitted from device data (Supplementary Fig. 2), reduces accuracy to 87.0%. Device-level variability accounts for the remaining decrease to 84.7%.
This result underscores that the performance gap stems primarily from the non-ideal characteristics of the 1T1R devices. While our stochastic computing scheme averages out some of the stochastic noise over multiple pulses and updates, the inherent device-to-device variability and non-linearity ultimately limit the precision of the synaptic weights. We anticipate that future advancements in memristor technology, focusing on improved linearity and uniformity, will close this accuracy gap.
This trade-off balances computational fidelity with hardware feasibility, a necessity for large-scale SNN deployment. Notably, our method preserves temporal coding capabilities essential for DVS applications, unlike ANN-SNN conversions that statically map frame-based features. Physical validation through 1T1R device characterization confirms the feasibility of the proposed parallel update mechanism. Our framework demonstrates that SNN training can be both temporally precise and hardware-efficient, advancing toward real-world event-driven intelligence.
Method
Dataset
We evaluate our model on the ASL-DVS dataset25, a large-scale event-based dataset for American Sign Language recognition. It comprises 24 classes (letters A–Y, excluding J). In our work, we utilize a version of the dataset containing a total of 113,645 samples, which we split into 96,899 samples for training and 16,746 for testing. The data was recorded using a DAVIS240c event camera, with each sample representing a spatiotemporal event stream ~100 ms in duration, generated from dynamic hand gestures. This dataset presents a challenging task for event-based classifiers due to the subtle differences between certain gestures, making it a suitable benchmark for evaluating the temporal processing capabilities of our proposed SNN.
Ensemble of Surrogate Gradients (ESG)
This work implements forward propagation based on the SRM, where presynaptic spikes are filtered by the kernel function ε into presynaptic traces xi, weighted by the synapses wij, and integrated into the postsynaptic neuron’s dendritic potential uj = ∑wijxi. When uj exceeds the threshold θ, the neuron emits a spike via the non-differentiable function f(u). To align with hardware implementation requirements, the SRM adopts an integral form equivalent to the LIF neuron model26, utilizing existing LIF neuron circuit architectures to achieve event-driven low-power computation (as shown in Fig. 2a).
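For concreteness, the following sketch simulates this forward pass for a single neuron in discrete time; the alpha-shaped kernel, time constant, window length, and threshold are illustrative assumptions rather than the exact hardware parameters.

```python
import numpy as np

def srm_forward(spike_times, w, theta=1.0, T=100, tau=20.0):
    """Discrete-time forward pass of a single SRM/LIF-style neuron: each
    presynaptic spike is filtered by an alpha-shaped kernel, weighted by its
    synapse, and summed into the dendritic potential u(t); the first threshold
    crossing gives the output spike time (T if the neuron stays silent).
    Kernel shape, tau and theta are illustrative choices."""
    t = np.arange(T, dtype=float)
    u = np.zeros(T)
    for t_i, w_i in zip(spike_times, w):
        s = np.maximum(t - t_i, 0.0)                  # time elapsed since spike
        u += w_i * (s / tau) * np.exp(1.0 - s / tau)  # alpha-shaped PSP
    crossings = np.nonzero(u >= theta)[0]
    return (int(crossings[0]) if crossings.size else T), u

t_out, u = srm_forward(spike_times=[5.0, 12.0, 30.0], w=[0.6, 0.5, 0.4])
```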
The error function is defined as the squared difference between the output spike timings tj and target timings \({t}_{j}^{a}\): \(E={\sum }_{j}{({t}_{j}-{t}_{j}^{a})}^{2}\). Following gradient descent, the synaptic weight update rule is:
where the backpropagated error δi is computed as:
Here, \(\frac{\partial {t}_{j}}{\partial {u}_{j}\,({t}_{j})}\) quantifies the sensitivity of spike timing to membrane potential and is critical for gradient stability. In the traditional SpikeProp algorithm, this term is approximated from the instantaneous rate of membrane potential change at the threshold crossing:
\(\frac{\partial {t}_{j}}{\partial {u}_{j}({t}_{j})}\approx -{\left({\left.\frac{\partial {u}_{j}(t)}{\partial t}\right|}_{t={t}_{j}}\right)}^{-1}\)
where the negative sign arises from the physical mechanism that a membrane potential increase (Δuj > 0) accelerates spike timing (Δtj < 0). However, this approximation introduces two hardware deployment challenges:
1. Silent Neuron Problem: When uj fails to reach θ, \(\frac{\partial {t}_{j}}{\partial {u}_{j}({t}_{j})}\) becomes undefined, causing gradient loss.
2. Gradient Surge Problem: If uj barely crosses θ (\({\frac{\partial {u}_{j}(t)}{\partial t}| }_{t={t}_{j}}\approx 0\)), \(\frac{\partial {t}_{j}}{\partial {u}_{j}({t}_{j})}\) diverges to infinity, triggering gradient explosion.
Existing methods mitigate these issues through weight constraints16, adaptive learning rates27, or rectified postsynaptic potential functions15, but they struggle to balance training stability in large-scale networks with hardware compatibility. For example, regularization techniques28 require computing first-order derivatives of membrane potential, increasing hardware timing control complexity, while neuron model modifications29 rely on non-standard circuits, limiting generalizability.
To address these challenges, we propose the Ensemble of Surrogate Gradients (ESG), which employs piecewise approximations to reconcile hardware constraints with temporal event processing through two mechanisms: Forced Firing and Surge Suppression.
Forced Firing operates as follows: For silent neurons, the required membrane potential rise rate \(\frac{\theta -{u}_{j}^{\max }}{T-{t}_{j}^{\max }}\) within the remaining time window \(T-{t}_{j}^{\max }\) can be estimated using the peak potential \({u}_{j}^{\max }\) and its timing \({t}_{j}^{\max }\). However, real-time monitoring of \({u}_{j}^{\max }\) and \({t}_{j}^{\max }\) is impractical in hardware. ESG simplifies this to a statistical ensemble average \(\frac{\theta }{T}\), assuming a linear rise from zero to θ within a fixed window T. This simplification avoids real-time tracking while covering diverse neuron dynamics through an “ensemble” averaging concept.
Surge Suppression sets a lower bound k (determined from ASL-DVS dataset statistics) to cap gradient magnitudes when \({\frac{\partial {u}_{j}(t)}{\partial t}| }_{t={t}_{j}} < k\), preventing explosions.
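A minimal sketch of how these two rules could act on the surrogate term \(\frac{\partial {t}_{j}}{\partial {u}_{j}({t}_{j})}\) is shown below; the function name and the numerical values of θ, T, and k are illustrative, not the dataset-derived constants used in our experiments.

```python
import numpy as np

def esg_dt_du(fired, du_dt_at_spike, theta=1.0, T=100.0, k=0.05):
    """Ensemble-of-Surrogate-Gradients approximation of dt_j/du_j.
    Forced Firing: silent neurons (no spike) are assigned the ensemble-average
    slope theta / T, as if the potential rose linearly from 0 to theta within T.
    Surge Suppression: slopes below the lower bound k are floored at k so that
    the surrogate -1 / slope stays bounded. theta, T, k are illustrative."""
    slope = np.where(fired, du_dt_at_spike, theta / T)
    slope = np.maximum(slope, k)
    return -1.0 / slope

# Silent neuron, barely-firing neuron, and normally firing neuron.
print(esg_dt_du(fired=np.array([False, True, True]),
                du_dt_at_spike=np.array([0.0, 1e-4, 0.2])))
```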
For the ASL-DVS dynamic gesture recognition task, a VGG16-based SNN adopts a hybrid training strategy: Fixed front layers extract features from event streams, while the last two layers are fine-tuned via ESG and stochastic updating. ESG’s hardware-friendliness is reflected in the determination of k and T through statistical and predefined methods, eliminating the need for real-time monitoring of membrane potential change rates and peak values.
The efficacy of ESG is evidenced by the 97.1% baseline accuracy (Fig. 5, “Original”), which provides a stable foundation for subsequent SVD approximation and stochastic encoding optimizations.
The necessity of the ESG mechanism is empirically validated through an ablation study (Supplementary Fig. 3). Under hardware-realistic conditions with device non-idealities, disabling ESG leads to severe training failure and a significant drop in final accuracy. This result confirms that hardware imperfections exacerbate gradient instability issues, and that the ESG mechanism is essential for achieving stable convergence in a parallel hardware implementation.
The 1T1R array fabrication and measurement
The 1T1R memristor array was fabricated using standard 130 nm CMOS technology, comprising 1024 (32 × 32) devices with integrated control circuits30. The memristor heterostructure consists of TiN/TaOx/HfO2/TiN, deposited in sequence: TiN bottom electrode, HfO2 switching layer, TaOx interface layer, and TiN top electrode. Device patterning defined 0.5 μm × 0.5 μm cells through lithography, followed by SiO2 dielectric deposition and CMP planarization. Final interconnects were formed via aluminum metallization and etch processes.
Electrical characterization employed a Keysight B1530 test system. Gate pulses (20 μs) were applied to WL, while drain pulses (10 μs) were applied to BL. Pulse synchronization ensured simultaneous “11” state application for reliable SET switching during coincidence events. This configuration directly implements the stochastic update scheme described in the “One-step implementation via Stochastic Computing” section.
Data availability
The ASL-DVS dataset analyzed during the current study is publicly available from the original authors' repository: https://drive.google.com/drive/folders/1tK5OY3pkjppYwAnLF8bnxGdaFbYEA8iY?usp=sharing.
References
Schuman, C. D. et al. Opportunities for neuromorphic computing algorithms and applications. Nat. Comput. Sci. 2, 10–19 (2022).
Gallego, G. et al. Event-based vision: a survey. IEEE Trans. Pattern Anal. Mach. Intell. 44, 154–180 (2020).
Izhikevich, E. M. Simple model of spiking neurons. IEEE Trans. Neural Netw. 14, 1569–1572 (2003).
Neftci, E. O., Mostafa, H. & Zenke, F. Surrogate gradient learning in spiking neural networks: bringing the power of gradient-based optimization to spiking neural networks. IEEE Signal Process. Mag. 36, 51–63 (2019).
Sebastian, A., Le Gallo, M., Khaddam-Aljameh, R. & Eleftheriou, E. Memory devices and applications for in-memory computing. Nat. Nanotechnol. 15, 529–544 (2020).
Pérez-Carrasco, J. A. et al. Mapping from frame-driven to frame-free event-driven vision systems by low-rate rate coding and coincidence processing–application to feedforward convnets. IEEE Trans. Pattern Anal. Mach. Intell. 35, 2706–2719 (2013).
Stöckl, C. & Maass, W. Optimized spiking neurons can classify images with high accuracy through temporal coding with two spikes. Nat. Mach. Intell. 3, 230–238 (2021).
Diehl, P. U., Zarrella, G., Cassidy, A., Pedroni, B. U. & Neftci, E. Conversion of artificial recurrent neural networks to spiking neural networks for low-power neuromorphic hardware. In Proc. IEEE International Conference on Rebooting Computing (ICRC) 1–8 (IEEE, 2016).
Davies, M. et al. Loihi: a neuromorphic manycore processor with on-chip learning. IEEE Micro 38, 82–99 (2018).
Schemmel, J. et al. A wafer-scale neuromorphic hardware system for large-scale neural modeling. In Proc. IEEE International Symposium on Circuits and Systems (ISCAS) 1947–1950 (IEEE, 2010).
Bengio, Y., Lee, D.-H., Bornschein, J., Mesnard, T. & Lin, Z. Towards biologically plausible deep learning. arXiv preprint https://doi.org/10.48550/arXiv.1502.04156 (2015).
Bohte, S. M., Kok, J. N. & La Poutre, H. Error-backpropagation in temporally encoded networks of spiking neurons. Neurocomputing 48, 17–37 (2002).
Taherkhani, A. et al. A review of learning in biologically plausible spiking neural networks. Neural Netw. 122, 253–272 (2020).
Zenke, F. & Ganguli, S. Superspike: supervised learning in multilayer spiking neural networks. Neural Comput. 30, 1514–1541 (2018).
Zhang, M. et al. Rectified linear postsynaptic potential function for backpropagation in deep spiking neural networks. IEEE Trans. Neural Netw. Learn. Syst. 33, 1947–1958 (2021).
Takase, H. et al. Obstacle to training spikeprop networks-cause of surges in training process-. In Proc. International Joint Conference on Neural Networks 3062–3066 (IEEE, 2009).
Shrestha, S. B. & Song, Q. Robustness to training disturbances in spikeprop learning. IEEE Trans. Neural Netw. Learn. Syst. 29, 3126–3139 (2017).
Burr, G. W. et al. Experimental demonstration and tolerancing of a large-scale neural network (165 000 synapses) using phase-change memory as the synaptic weight element. IEEE Trans. Electron Devices 62, 3498–3507 (2015).
Xu, Z. et al. Parallel programming of resistive cross-point array for synaptic plasticity. Procedia Comput. Sci. 41, 126–133 (2014).
Gokmen, T., Onen, M. & Haensch, W. Training deep convolutional neural networks with resistive cross-point devices. Front. Neurosci. 11, 538 (2017).
LeCun, Y., Touresky, D., Hinton, G. & Sejnowski, T. A theoretical framework for back-propagation. In Proc. 1988 Connectionist Models Summer School, Vol. 1, 21–28 (Morgan Kaufmann, San Mateo, CA, USA, 1988).
Gokmen, T. & Vlasov, Y. Acceleration of deep neural network training with resistive cross-point devices: design considerations. Front. Neurosci. 10, 333 (2016).
Agarwal, S. et al. Achieving ideal accuracies in analog neuromorphic computing using periodic carry. In Proc. Symposium on VLSI Technology, T174–T175 (IEEE, 2017).
Comsa, I. M. et al. Temporal coding in spiking neural networks with alpha synaptic function. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 8529–8533 (IEEE, 2020).
Bi, Y. et al. Graph-based object classification for neuromorphic vision sensing. In Proc. IEEE International Conference on Computer Vision (ICCV) (IEEE, 2019).
Burkitt, A. N. A review of the integrate-and-fire neuron model: I. homogeneous synaptic input. Biol. Cybern. 95, 1–19 (2006).
McKennoch, S., Liu, D. & Bushnell, L. G. Fast modifications of the spikeprop algorithm. In The 2006 IEEE International Joint Conference on Neural Network Proceedings, 3970–3977 (IEEE, 2006).
Lee, J. H., Delbruck, T. & Pfeiffer, M. Training deep spiking neural networks using backpropagation. Front. Neurosci. 10, 508 (2016).
Hong, C. et al. Training spiking neural networks for cognitive tasks: a versatile framework compatible with various temporal codes. IEEE Trans. Neural Netw. Learn. Syst. 31, 1285–1296 (2019).
Li, J. et al. Memristive floating-point Fourier neural operator network for efficient scientific modeling. Sci. Adv. 11, eadv4446 (2025).
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Nos. U24A20303, 92164204, and 62374063) and the Science and Technology Major Project of Hubei Province (No. 2022AEA001).
Author information
Contributions
D.Z. and Y.H. conceived the research idea and supervised the project. D.Z. designed the methodology, implemented the algorithms, performed simulations, analyzed results, and wrote the original manuscript. Y.X. and Y.L. designed and conducted the memristor device characterization experiments. J.F. prepared Figures 2 and 3. Y.Z., B.G., Z.Y., and X.M. provided critical insights on neuromorphic system design and applications. V.Z., Z.Y., and H.T. contributed to technical discussions and results validation. Y.H. and X.M. acquired funding and supervised the research direction. All authors reviewed and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Zhang, D., Zhou, Y., Zhao, V. et al. Modified spike backpropagation design towards highly parallelable hardware implementation. npj Unconv. Comput. 3, 4 (2026). https://doi.org/10.1038/s44335-025-00046-0