Abstract
Training of neural networks (NNs) has emerged as a major consumer of both computational and energy resources. Quantum computers have been proposed as a route to facilitate such training, but no experimental evidence has been presented so far. Here we demonstrate that quantum annealing platforms, such as D-Wave, can enable fast and efficient training of classical NNs, which are then deployable on conventional hardware. From a physics perspective, NN training can be viewed as a dynamical phase transition: the system evolves from an initial spin glass state to a highly ordered, trained state. This process involves eliminating numerous undesired minima in its energy landscape. The advantage of annealing devices is their ability to rapidly find multiple deep states. We find that this quantum training achieves superior performance scaling compared to classical backpropagation methods, with a clearly higher scaling exponent (1.01 vs. 0.78). The exponent may be further increased by up to a factor of 2 with a fully coherent quantum platform using a variant of the Grover algorithm. Furthermore, we argue that even a modestly sized annealer can be beneficial for training a deep NN by being applied sequentially to a few layers at a time.
Introduction
Neural networks (NNs)1,2,3,4,5,6,7,8 have become one of the most transformative technologies, accelerating progress across multiple fields. A key factor in the success of neural networks7,8,9,10,11,12,13 is the development of efficient training methods. Techniques such as stochastic gradient descent14, backpropagation15, pre-training5, ReLU activation16, and residual connections17 have paved the way for training deep NNs effectively, leading to groundbreaking AI applications. Yet, this progress comes at extraordinary costs, as training state-of-the-art models now routinely requires massive computational and energy resources. Moreover, extrapolating current trends suggests even steeper future costs18. This motivates the search for fundamentally different training paradigms.
Meanwhile, quantum technologies have achieved critical milestones in recent years19,20,21,22,23,24. Of particular interest are quantum annealers, analog quantum computing devices engineered to exploit quantum coherent evolution to navigate glassy energy landscapes25,26,27,28,29,30,31,32,33,34,35. Modern quantum annealing architectures already approach 10⁴ qubits33,35, with demonstrated advantages in optimization35,36,37,38 and quantum dynamics simulations23,33.
Previous studies have explored the synergy between neural networks and quantum effects, including the use of quantum annealers for training restricted Boltzmann machines and deep belief networks39,40,41,42, and the development of variational and circuit-based quantum neural networks43,44,45,46,47. Broader perspectives on quantum machine learning can be found in comprehensive reviews48. In this paper, we adopt a different perspective, which takes advantage of quantum annealers’ capability to rapidly map out low-energy landscapes of (e.g., classical) spin glass models. Here we show how one may take advantage of this capability to accelerate NN training. Once trained, such NNs can be deployed on classical hardware, combining the advantages of quantum and classical computing.
The approach is based on the observation that the training process drives NNs through a phase transition from an initial glassy state to a highly ordered trained state49,50. Quantum dynamics facilitates such a transition due to its inherent ability to escape local minima33,38. It thus allows one to efficiently explore the hierarchical basin structure of the low-energy landscape38, which appears to be key to accelerating NN training. Implemented on the D-Wave quantum annealer32,33, the quantum-assisted training algorithm demonstrates clear practical scaling advantages compared to classical training methods. Moreover, we show that a variant of the Grover search algorithm51, known as the amplitude amplification protocol52, implemented on a fully coherent quantum annealer, can further accelerate NN training by potentially doubling the corresponding scaling exponent.
Methods
Neural network training
Here we explain the principal idea behind the NN training strategy, amenable to implementation on a quantum annealer. To be specific, consider a NN trained to recognize handwritten images of the digits 0-9. We adopted the architecture and image dataset (a subset of MNIST) described in ref. 53. The network consists of three layers: one input layer, one hidden layer, and one output layer, see Fig. 1. The input layer contains 28² = 784 units, denoted by xa, where 0 ≤ xa ≤ 1 and a = 1, 2, …, 784, representing grayscale pixel values from handwritten images. The hidden layer consists of 120 units (implemented as qubits), denoted by \({s}_{i}^{h}\), where i = 1, 2, …, 120 and h designates the hidden layer, taking values of ± 1 after a measurement of the qubits’ z-component. The output layer contains 40 units (also implemented as qubits), denoted by \({s}_{\alpha }^{o}\), where α = 1, 2, …, 40 and o stands for the output, also taking values of ± 1 upon z-measurement.
The input layer consists of 784 neurons, representing pixels of 28 × 28 MNIST images. The hidden and output layers, comprising 120 and 40 neurons respectively, are encoded as qubits on the quantum annealer. Each neuron in the output layer represents a digit class (0–9), with 4-fold error-correcting redundancy. An output-layer qubit measured as +1 indicates the corresponding digit class.
Connections from classical units in the input layer to quantum units in the hidden layer are implemented through local bias fields, hi:
where the matrix Wia represents the coupling strengths between them. Quantum units in the hidden and output layers are connected through couplings Jiα in the bare NN Hamiltonian,
where \({Z}_{i}^{h},{Z}_{\alpha }^{o}\) are Pauli-Z matrices acting on qubits \({s}_{i}^{h}\) and \({s}_{\alpha }^{o}\), respectively, and \({b}_{i}^{h}\) and \({b}_{\alpha }^{o}\) are bias parameters. Note that there are no couplings within the same layer. The Hamiltonian of the NN loaded with an input image, x, is given by:
where the argument, [x], emphasizes its dependence on an input training or test image, x.
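Schematically, these definitions may be summarized as follows (a reconstruction of the forms of Eqs. (1)–(3) from the surrounding text; the overall signs and normalizations are our conventions, not reproduced from the original display equations):

```latex
% Assumed forms of Eqs. (1)-(3); signs and normalizations are our convention.
\begin{align}
  {h}_{i}[x] &= \sum_{a=1}^{784} {W}_{ia}\,{x}_{a},
  &&\text{(1) input-to-hidden bias fields}\\
  {H}_{0} &= -\sum_{i,\alpha} {J}_{i\alpha}\,{Z}_{i}^{h}{Z}_{\alpha}^{o}
             -\sum_{i} {b}_{i}^{h}{Z}_{i}^{h}
             -\sum_{\alpha} {b}_{\alpha}^{o}{Z}_{\alpha}^{o},
  &&\text{(2) bare NN Hamiltonian}\\
  H[x] &= {H}_{0} - \sum_{i} {h}_{i}[x]\,{Z}_{i}^{h},
  &&\text{(3) Hamiltonian loaded with image } x
\end{align}
```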
Once the neural network has been trained, it operates in the following way. An input image, x, to be inferred is encoded into the bias fields, hi[x], as in Eq. (1). One then looks for a 160-qubit configuration, \(\{{s}_{i}^{h},{s}_{\alpha }^{o}\}\), providing a ground (or low-energy) state of the Hamiltonian H[x]. Specifically, the output part, \({s}_{\alpha }^{o}\), of the resulting ground state (spin configuration) is used to predict the image class, ỹ = 0, 1, …, 9. To this end the 40 output units are grouped into 10 groups by summing every four consecutive output values, forming a 10-component vector. The class predictor, ỹ, is given by the position of the maximum component of this vector. This 4-fold redundancy of the output layer serves as a simple majority-rule error-correcting mechanism.
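As an illustration of this readout step, a minimal sketch of the majority-rule decoding of the 40 output spins into a digit class (variable names and the tie-handling convention are illustrative, not taken from the actual implementation):

```python
import numpy as np

def decode_output(s_out):
    """Map 40 output spins (+/-1) to a digit class 0-9 via the 4-fold redundancy.

    Returns -1 (unclassified) if the maximum group sum is not unique,
    mirroring the gray "class -1" points in the figures.
    """
    s_out = np.asarray(s_out).reshape(10, 4)   # group every four consecutive outputs
    group_sums = s_out.sum(axis=1)             # 10-component vector
    best = group_sums.max()
    if (group_sums == best).sum() > 1:         # no single largest component
        return -1
    return int(np.argmax(group_sums))          # position of the maximum = predicted class

# Example: spins favoring digit 7
spins = -np.ones(40)
spins[28:32] = 1
print(decode_output(spins))                    # -> 7
```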
As explained below, the fully trained neural network spin Hamiltonian, H[x], is not a glass. Rather, its energy landscape contains a single deep basin of attraction (in case the input image resembles one of the training images). Therefore, finding its ground state is a computationally easy task, which may be accomplished by, e.g., classical simulated annealing. On the contrary, the training stage involves dealing with a glassy energy landscape. It can thus significantly benefit from using a quantum annealer, as explained below.
First we describe a training routine, known as equilibrium propagation54, which will be subsequently modified to take advantage of quantum capabilities. To train the NN, it introduces the nudge Hamiltonian, HN[x, y], which depends on a training image, x, and its a priori known class y = 0, 1, . . . , 9, as
where nα[y] is a nudge bias encoding the class, y, of each training image, x, as nα[y] = 1 for 4y < α ≤ 4(y + 1), and nα[y] = −1 otherwise.
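This nudge-bias encoding can be summarized in a short sketch (a minimal illustration of the definition above; the function name is ours):

```python
import numpy as np

def nudge_bias(y):
    """n_alpha[y]: +1 on the four output units assigned to class y, -1 elsewhere."""
    n = -np.ones(40)
    n[4 * y:4 * (y + 1)] = 1     # covers 4y < alpha <= 4(y+1) in the 1-based labeling of the text
    return n

print(nudge_bias(7))             # +1 on the four digit-7 output units, -1 otherwise
```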
The training process involves multiple updates of the network parameters. Initially, all parameters Wia, Jiα, \({b}_{i}^{h}\), \({b}_{\alpha }^{o}\) are chosen randomly. In our implementation, Wia is drawn from a uniform distribution over \(\left[-1/\sqrt{784},1/\sqrt{784}\right]\); Jiα is drawn from a uniform distribution over \(\left[-1/\sqrt{120},1/\sqrt{120}\right]\). The biases \({b}_{i}^{h}\), \({b}_{\alpha }^{o}\) are initialized to zero. In each update step, a training image, x, belonging to a class y, is randomly selected from the dataset. One then looks for two low-energy spin configurations: the first, \(\{{s}_{i}^{h},{s}_{\alpha }^{o}\}\), of the system Hamiltonian H[x], and the second, \(\{{s}_{i}^{h,N},{s}_{\alpha }^{o,N}\}\), of the nudge Hamiltonian HN[x, y]. The second step may be circumvented, as the strong nudge fixes \({s}_{\alpha }^{o,N}={n}_{\alpha }[y]\), while \({s}_{i}^{h,N}\) is then simply determined by the sign of the local field as in Eq. (7). The parameters are updated based on the differences between these two spin configurations. Minimization of the loss function leads to the following update rules54:
where δW, δJ, δh, and δo are small positive learning rates.
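Since the display of Eq. (5) is not reproduced here, the following sketch assumes the standard equilibrium-propagation contrastive form for Ising-type energies, in which each parameter moves toward its correlation in the nudged configuration and away from that in the free one (variable names are ours):

```python
import numpy as np

def equilibrium_update(params, x, free, nudged, lr):
    """One contrastive parameter update (assumed standard equilibrium-propagation
    form; not a verbatim transcription of Eq. (5))."""
    W, J, b_h, b_o = params                       # shapes: (120, 784), (120, 40), (120,), (40,)
    s_h, s_o = free                               # low-energy spins of H[x]
    s_hN, s_oN = nudged                           # low-energy spins of H_N[x, y]
    dW, dJ, dh, do = lr                           # small positive learning rates

    W   += dW * np.outer(s_hN - s_h, x)                           # input-to-hidden couplings
    J   += dJ * (np.outer(s_hN, s_oN) - np.outer(s_h, s_o))       # hidden-to-output couplings
    b_h += dh * (s_hN - s_h)                                      # hidden biases
    b_o += do * (s_oN - s_o)                                      # output biases
    return W, J, b_h, b_o
```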
If the low-energy spin configuration sampled from H[x] is identical (or very close) to that from HN[x, y], the network already produces the correct output. In this case the parameters are not updated. However, if the spin configuration from H[x] differs from the one sampled from HN[x, y], the update rules penalize this difference by increasing the energy of the incorrect spin configuration. As a result, the likelihood of sampling this spin configuration or its neighbors in future trials decreases. Through repeated updates, the energy landscapes of H[x] and of HN[x, y] gradually align. Since the latter is engineered to force the output layer units to predict the proper class, y, the former inherits the same trait, without having the nudge bias, Eq. (4) (not known a priori for test images). Once this is achieved, the neural network is considered to be trained. Presenting the NN with all images from the training set, along with the corresponding adjustments of the parameters, Eq. (5), is called a training epoch. The number of such epochs needed to fully train the NN may range from dozens to thousands.
Understanding NN training mechanism
To better understand how the NN develops its ability to classify images, we utilized D-Wave’s quantum annealer to sample low-energy spin configurations of the (classical) Hamiltonians (3) and (4) using a quantum annealing protocol (specified below). We used a training set (1000 images, 100 per class) to train the NN for 13 epochs. At various stages of the training process we employed a test set (100 images total, 10 per class) to visualize the energy landscape of the NN Hamiltonian, (3). To this end we sampled 100 low-energy spin configurations for each test image, x, i.e., 10,000 low-energy spin configurations in total. We then applied Multidimensional Scaling (MDS) using the scikit-learn library55, projecting the high-dimensional spin configurations (a hypercube with dimensionality N = 160) onto a two-dimensional plane. The MDS algorithm preserves the pairwise Hamming distances as closely as possible in the 2D representation.
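The visualization step can be reproduced along the following lines (a minimal sketch with placeholder data and our own hyperparameter choices, not the exact settings used):

```python
import numpy as np
from sklearn.manifold import MDS

# Placeholder set of sampled configurations with entries +/-1
# (the study uses 10,000 configurations of 160 spins).
rng = np.random.default_rng(0)
spins = rng.choice([-1, 1], size=(500, 160))

# For +/-1 variables, the Hamming distance is (N - s_i . s_j) / 2.
N = spins.shape[1]
dist = (N - spins @ spins.T) / 2

# 2D embedding preserving the pairwise Hamming distances as closely as possible.
xy = MDS(n_components=2, dissimilarity="precomputed",
         random_state=0).fit_transform(dist)
```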
Figure 2 shows the resulting 2D visualizations for 9 representative instances during the training history. Each point represents one of 10⁴ low-energy spin configurations, color-coded by its output class (0-9) (with gray points, labeled as class -1, indicating unclassified spin configurations in cases where the output vector does not have a single largest component). The evolution of the low-energy landscape illustrates the development of the NN from a random initialization stage toward a highly structured mature stage. As training progresses, the landscape develops 10 distinct, well-separated low-energy basins, representing the 10 classes. Correspondingly, the NN’s recognition accuracy on the test set increases from 6% to 92%. It can be further increased with a few extra training epochs.
The development is visualized in (a) to (i) through low-energy states of H[x], sampled by the D-Wave annealer. Each subplot corresponds to a different training epoch, with the test accuracy indicated. For each image in the test set (100 images total, 10 per class), 100 low-energy spin configurations were sampled using fixed network weights at the given epoch. These 10⁴ spin configurations are projected from the 160-dimensional hypercube onto 2D using multidimensional scaling (MDS), preserving Hamming distances. Points are color-coded by their output class (0−9); gray points (class −1) indicate unclassified spin configurations. The late stages clearly demonstrate the appearance of energy basins, with 9 of them needing to be cut off for any given training image, x.
Figure 3 gives a more detailed view of this process. The first row displays five examples of training images of the digit 7. Each image is fed into the NN and, as previously explained, low-energy spin configurations of the Hamiltonian H[x] with those images, x, belonging to the class y = 7, are sampled and plotted via the 2D MDS visualization. Dots are again color-coded according to their output class, \(\tilde{y}\). Notice that the latter may or may not be equal to 7, representing correct or incorrect recognition, respectively. The second and third rows display low-energy spin configurations for a poorly- and a well-trained NN. Each column corresponds to the same input image shown in the first row. In the poorly trained network (second row), one observes a mixed distribution of spin configurations associated with multiple output classes. As training progresses, the low-energy spin configurations become more closely spaced and their output class is almost surely \(\tilde{y}=7\).
a Five representative 28 × 28 test images of digit 7. b Corresponding low-energy spin configurations sampled from a poorly trained network’s energy landscape. Points are color-coded by their inferred output class (\(\tilde{y}\)), showing a scattered distribution across several classes. c Same test on a well-trained network exhibits tight clustering and uniform output labels (\(\tilde{y}=7\)), showing that NN has developed a distinct energy basin for the digit 7. The legend applies to all the panels in (b) and (c).
The lesson is that the training process can be viewed as a phase transition from a glassy phase (no explicit structure) to an ordered phase (a single deep basin for any image, x). (It is extremely important that images with distinct output classes produce such deep basins in far-apart regions of the hypercube.) Such a transition causes the low-energy spin configurations of the NN Hamiltonian loaded with an image, x, Eq. (3), to be concentrated within a narrow basin, having the same output class, \(\tilde{y}=y\), where y is the class of the image x. It achieves this goal by energetically penalizing all states with \(\tilde{y}\ne y\). The question is thus whether and how a quantum annealer can accelerate such a training transition.
Quantum training algorithm
The main advantage of quantum hardware is its ability to produce low-energy states (as low as 0.1% above the ground state energy38) of Ising-like Hamiltonians extremely fast (e.g., within microseconds in the case of D-Wave). Another useful feature is its ability to localize the search to specific parts of the hypercube. The latter capability is provided by the cyclic annealing protocol35,36,38. Unlike the more traditional forward annealing, it biases the search toward a vicinity of a chosen (so-called reference) state.
The equilibrium propagation algorithm, described above, looks for a low-energy spin configuration, \(\{{s}_{i}^{h},{s}_{\alpha }^{o}\}\), which needs to be sufficiently far from the nudge spin configuration, \(\{{s}_{i}^{h,N},{s}_{\alpha }^{o,N}\}\), to generate evolution of the NN parameters according to Eq. (5).
Before presenting the quantum extension of Equilibrium Propagation, it is useful to clarify the role of the so-called wrong spin configurations. For a given training image x with correct label y, the nudge Hamiltonian HN[x, y] enforces an output consistent with y. In contrast, the free Hamiltonian H[x] may yield low-energy spin configurations with incorrect outputs \(\tilde{y}\ne y\). These are the wrong spin configurations, located in spurious basins in the energy landscape which compete with the correct one. The training updates are precisely driven by the differences between the free spin configurations \(\{{s}_{i}^{h},{s}_{\alpha }^{o}\}\) and the nudge spin configuration \(\{{s}_{i}^{h,N},{s}_{\alpha }^{o,N}\}\). Whenever a wrong spin configuration is encountered, the update rules penalize it by raising its energy relative to the correct spin configuration, gradually suppressing its future occurrence.
With a quantum annealer, one can efficiently obtain not just a single low-energy spin configuration but a collection of m ≫ 1 spin configurations, \(\{{s}_{i}^{h,\gamma },{s}_{\alpha }^{o,\gamma }\}\) with γ = 1, 2, …, m, for a given training image x. By engineering the initial conditions of cyclic annealing, these configurations can be biased to explore wrong basins. To penalize all such configurations simultaneously, one substitutes
into Eq. (5). We refer to the resulting training scheme given by Eqs. (5)–(6) as the quantum propagation training algorithm.
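As a sketch of how Eq. (6) modifies the update step, the routine below replaces the single free configuration with averages over the m annealer samples, following the same assumed contrastive form as above; whether the two-spin term uses the product of averages (as done here) or the averaged product is our reading of Eq. (6):

```python
import numpy as np

def quantum_propagation_update(params, x, samples, nudged, lr):
    """Quantum-propagation step: substitute the sample averages of Eq. (6) into
    the (assumed) contrastive update of Eq. (5)."""
    W, J, b_h, b_o = params
    s_h_samples, s_o_samples = samples            # shapes: (m, 120) and (m, 40)
    s_hN, s_oN = nudged
    dW, dJ, dh, do = lr

    s_h_bar = s_h_samples.mean(axis=0)            # (1/m) sum_gamma s_i^{h,gamma}
    s_o_bar = s_o_samples.mean(axis=0)            # (1/m) sum_gamma s_alpha^{o,gamma}

    W   += dW * np.outer(s_hN - s_h_bar, x)
    J   += dJ * (np.outer(s_hN, s_oN) - np.outer(s_h_bar, s_o_bar))
    b_h += dh * (s_hN - s_h_bar)
    b_o += do * (s_oN - s_o_bar)
    return W, J, b_h, b_o
```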
Figure 2 shows exactly where the configurations to be penalized are located: they are in the 9 basins with wrong output classes, \(\tilde{y}\ne y\). Indeed, Fig. 4 illustrates how performance improves as the number of sampled spin configurations, m, increases during a fixed training epoch. All model parameters are initialized identically for each data point. A significant decrease in the error rate is observed for m ≲ 10, which plateaus for larger m. To further optimize the training, one should therefore look for m = 9 low-energy spin configurations, one within each of the 9 wrong basins. To this end one should initialize the cyclic annealing from reference spin configurations, which are the nudge spin configurations of the 9 wrong basins. The latter are given by
where x is the currently presented image with the class y and \({\tilde{y}}^{\gamma }\ne y\) are the 9 wrong output classes, γ = 1, …, 9.
Training error rate vs. the number of sampled spin configurations, m, during the first epoch. The vertical red dashed line marks m0 = 10, the number of MNIST classes. A notable improvement in accuracy is observed for m ≲ m0. The saturation beyond this point suggests the formation of m0 class-labeled basins in the low-energy space.
Results
Benchmark
We benchmark quantum propagation against classical backpropagation. To ensure fairness, both methods employ identical neural network architectures and best-practice settings. Classical training uses real-valued hidden and output units and ReLU activations in the hidden layer. Both methods use simple gradient descent to update parameters. Figure 5 shows the training error rate (1 − accuracy) as a function of the number of training epochs on a log-log plot. It exhibits an algebraic relation,
where z is the scaling exponent indicating training efficiency. Blue triangles are data from the equilibrium propagation training, Eq. (5); the corresponding exponent is z = 0.64. This is worse than the conventional backpropagation technique (green squares, z = 0.78). Red dots represent quantum propagation (m = 20), with exponent z = 1.01. The significantly larger exponent z = 1.01 for the quantum technique indicates superior performance and efficiency compared to both backpropagation and equilibrium propagation. Importantly, the scaling exponent z serves as a more meaningful performance metric than the raw accuracy alone, as it is unaffected by the computational overhead per parameter update. For a typical number of epochs (100 to 500), a conventional NN training process would require 3 to 4 times the resources to match the quantum propagation performance. Additional testing under varied classical conditions did not match quantum propagation’s efficiency. Specifically, switching to sigmoid activations reduces z to 0.56, and increasing the number of hidden units tenfold (from 120 to 1200) only marginally improves z to 0.84.
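Assuming Eq. (8) has the algebraic form error ∝ (epochs)⁻ᶻ, the exponent can be extracted from a straight-line fit on the log-log plot, as sketched below (the training curve shown is placeholder data, not the measured one):

```python
import numpy as np

# Placeholder training curve: error rate recorded at a few epochs (not real data).
epochs = np.array([10, 20, 50, 100, 200, 500])
error = np.array([0.30, 0.15, 0.06, 0.03, 0.015, 0.006])

# Assuming error ~ (epochs)^(-z), z is minus the slope of log(error) vs. log(epochs).
slope, intercept = np.polyfit(np.log(epochs), np.log(error), deg=1)
z = -slope
print(f"scaling exponent z = {z:.2f}")
```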
Quantum coherent training
Equation (6) resembles an expectation value of the corresponding Zh,o operators in a many-body quantum state \(\left\vert \psi \right\rangle\), which is a coherent superposition of m bit-string product states. Indeed, if
then Eq. (6) takes the form
At first glance, this does not provide any advantage, since to calculate the expectation values, one should run an annealing protocol multiple times and perform measurements of the corresponding z-components. This is exactly what Eq. (6) prescribes to begin with. Yet, there may be a significant benefit hidden here as explained below.
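For concreteness, the forms of Eqs. (9) and (10) assumed in this discussion are written below (our reconstruction from the text; the uniform-superposition normalization and the orthogonality of the m sampled bit strings are assumptions):

```latex
% Assumed forms of Eqs. (9) and (10), valid when the m sampled bit strings are distinct.
\begin{align}
  \left\vert \psi \right\rangle &= \frac{1}{\sqrt{m}}\sum_{\gamma=1}^{m}
      \big\vert {s}_{i}^{h,\gamma},{s}_{\alpha}^{o,\gamma}\big\rangle ,
  &&\text{(9)}\\
  \frac{1}{m}\sum_{\gamma=1}^{m}{s}_{i}^{h,\gamma}
      = \big\langle\psi\big\vert {Z}_{i}^{h}\big\vert\psi\big\rangle ,
  &\qquad
  \frac{1}{m}\sum_{\gamma=1}^{m}{s}_{\alpha}^{o,\gamma}
      = \big\langle\psi\big\vert {Z}_{\alpha}^{o}\big\vert\psi\big\rangle .
  &&\text{(10)}
\end{align}
```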
Consider a state \(\left\vert \psi \right\rangle\), reached upon completion of a quantum annealing run, given by \(\left\vert \psi \right\rangle ={U}_{{{\rm{QA}}}}[x]\left\vert \psi (0)\right\rangle\), where the initial state, \(\left\vert \psi (0)\right\rangle\), may be, e.g., the x-polarized product state, \(\left\vert \psi (0)\right\rangle ={\prod }_{i,\alpha }\left\vert {+}_{i},{+}_{\alpha }\right\rangle\). Here UQA[x] is a unitary quantum evolution operator describing forward annealing with the time-dependent Hamiltonian:
where \({X}_{j(\alpha )}^{h(o)}\) are Pauli-X operators acting on all qubits, and the annealing schedule satisfies s(0) = 0 and s(tf) = 1. The final state, \(\left\vert \psi \right\rangle =\left\vert \psi ({t}_{f})\right\rangle\), is a superposition of the two orthogonal components
where state \(\big\vert {\psi }_{y}\big\rangle\) has output qubits pointing to the correct class, y, of the presented image, x, and state \(\big\vert {\psi }_{\tilde{y}}\big\rangle\) has output qubits pointing to all 9 incorrect classes, \(\tilde{y}\ne y\). The corresponding amplitudes satisfy \(| {A}_{y}{| }^{2}+| {A}_{\tilde{y}}{| }^{2}=1\).
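The annealing Hamiltonian and the decomposition of the final state referred to above, Eqs. (11) and (12), are assumed to take the standard forms (the schedule prefactors and the explicit sums are our convention):

```latex
% Assumed forms of Eqs. (11) and (12); schedule prefactors are a convention.
\begin{align}
  H(t) &= -\big(1-s(t)\big)\Big(\sum_{j}{X}_{j}^{h}+\sum_{\alpha}{X}_{\alpha}^{o}\Big)
          + s(t)\,H[x],
  &&\text{(11) forward-annealing Hamiltonian}\\
  \left\vert \psi({t}_{f})\right\rangle &= {A}_{y}\big\vert {\psi }_{y}\big\rangle
          + {A}_{\tilde{y}}\big\vert {\psi }_{\tilde{y}}\big\rangle ,
  &&\text{(12) decomposition of the final state}
\end{align}
```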
The goal of the training process is to make \(| {A}_{\tilde{y}}|\) as small as possible. As a result, during advanced training epochs \(| {A}_{\tilde{y}}| \ll 1\). This is good news for operating the NN, but very bad news for its further training. Indeed, the “correct” component, \(\big\vert {\psi }_{y}\big\rangle\), is identical (or very close) to the ground state, \(\big\vert {s}_{i}^{h,N},{s}_{\alpha }^{o,N}\big\rangle\), of the nudge Hamiltonian. As a result, it does not lead to any improvement of the NN parameters, see Eq. (5). It is thus the “wrong” component, \(\big\vert {\psi }_{\tilde{y}}\big\rangle\), measured with the probability \(| {A}_{\tilde{y}}{| }^{2}\ll 1\), which contains all the information about the energy basins to be cut off. Therefore the training process may be significantly accelerated if one can use the \(\big\vert {\psi }_{\tilde{y}}\big\rangle\) component, instead of \(\left\vert \psi \right\rangle\), in Eqs. (10) and (5). Cyclic annealing with the reference states of Eq. (7) is not efficient in late epochs, since the basin defined by the correct class, y, is so attractive that almost every cycle ends up in it, even if the initial reference state is chosen in a basin with \(\tilde{y}\ne y\).
Importantly, this can be achieved with the amplitude amplification procedure52,56,57,58, a generalization of the Grover search algorithm51, which generates a rotation
Engineering such a rotation requires order \(| {A}_{\tilde{y}}{| }^{-1}\gg 1\) applications of UQA[x] and \({U}_{{{\rm{QA}}}}^{{\dagger} }[x]\) operations, along with a control operation performed on ancilla coupled to the y-output qubits (\({U}_{{{\rm{QA}}}}^{{\dagger} }[x]\) operation is achieved by running the annealing in the reversed time direction, t → tf − t in Eq. (11)). This should be contrasted with the order \(| {A}_{\tilde{y}}{| }^{-2}\gg | {A}_{\tilde{y}}{| }^{-1}\) annealing runs, required by the straightforward application of Eqs. (10), (12) to observe the \(\big\vert {\psi }_{\tilde{y}}\big\rangle\) component.
As a result, the fully quantum coherent annealing protocol, which incorporates amplitude amplification52, may significantly further accelerate the training. Namely, it can potentially double the scaling exponent, z, in Eq. (8). We have not yet implemented such a fully coherent protocol, since the D-Wave device is not expected to maintain coherence over multiple forward and backward runs of the annealing protocol. A smaller version of the NN may be tested on a trapped-ion platform57, which exhibits a much longer coherence time.
Training deep neural networks
Finally, we discuss the possibility of training deep NNs with L layers using a modest-sized quantum annealer. The neurons in neighboring layers l and l + 1 are connected through couplings \({J}_{{i}_{l}{i}_{l+1}}\) in the bare Hamiltonian of a deep NN,
where \({Z}_{{i}_{l}}^{l}\) are Pauli-Z matrices acting on qubits \({s}_{{i}_{l}}^{l}\) in layer l, and \({b}_{{i}_{l}}^{l}\) are bias parameters. The last layer, l = L, serves as the output layer. To train such a deep NN, one again looks for two low-energy spin configurations: \(\{{s}_{{i}_{l}}^{l}\}\) of the system Hamiltonian, H[x], and \(\{{s}_{{i}_{l}}^{l,N}\}\) of the nudge Hamiltonian, HN[x, y].
Due to the limited size of quantum annealers, one employs an active-layer sweep procedure. It calls for freezing all qubits along their total z-fields, except those in two (or more, depending on the capacity of the annealer) active layers. Starting from randomly initialized parameters, on the forward pass one allows layers l = 1, 2 to be unfrozen and uses the annealer to find their low-energy configuration. Next, with the updated values from layer l = 1, one freezes all layers except l = 2, 3, and repeats the annealing process for those two layers. Proceeding layer by layer, one sequentially updates the qubits up to the output layer L.
Then the backward sweep is performed: first, one activates layers l = L, L − 1 only, and samples their m low-energy spin configurations from H[x] and one from HN[x, y]. These are used to update the parameters in layers l = L, L − 1 via the quantum propagation rules (5), (6). One then activates layers l = L − 1, L − 2 and repeats the update, proceeding down to layer l = 1. One full forward and backward pass constitutes a single update of the network parameters.
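Structurally, the sweep can be summarized as follows (a sketch only: the `sample` callable and the `layers[l].update` method are placeholders for the cyclic-annealing call and the quantum-propagation update, not an actual D-Wave API):

```python
def layer_sweep_update(layers, x, y, sample, m=10):
    """One full forward + backward pass of the active-layer sweep (a sketch).

    `layers` is assumed to map layer indices l = 1..L to objects holding the
    couplings and biases feeding layer l and exposing an `update` method that
    applies the quantum-propagation rules, Eqs. (5)-(6).
    `sample(active, nudged, num_reads)` stands in for a cyclic-annealing call
    returning low-energy spin configurations of H[x] (or of H_N[x, y] when
    nudged=True) with only the layers in `active` unfrozen and all other qubits
    frozen along their local z-fields.
    """
    L = max(layers)                                   # layer L is the output layer

    # Forward sweep: unfreeze two successive layers at a time, relax them on the
    # annealer, then move one layer up toward the output.  The sampler is assumed
    # to record the relaxed spins as the frozen values for subsequent calls.
    for l in range(1, L):
        sample(active=(l, l + 1), nudged=False, num_reads=1)

    # Backward sweep: for each pair of successive layers, sample m free and one
    # nudged configuration, then update the parameters between the two layers.
    for l in range(L, 1, -1):
        free = sample(active=(l - 1, l), nudged=False, num_reads=m)
        nudged = sample(active=(l - 1, l), nudged=True, num_reads=1)[0]
        layers[l].update(x, y, free, nudged)
    return layers
```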
In this way, one only needs an annealer capable of accommodating two successive layers of a deep NN to take advantage of its ability to rapidly search for multiple low-energy spin configurations. This observation may allow already existing annealers to accelerate the training of practical NNs.
Conclusions
We have shown that even a modest-size quantum annealer can significantly accelerate neural network training. The quantum propagation routine, suggested and implemented here with the D-Wave platform, exhibits a scaling exponent larger than both equilibrium propagation and backpropagation. Moreover, we argued that a fully coherent annealer may further increase the exponent by up to a factor of two. Finally, we discussed a strategy to benefit from quantum annealers with qubit numbers significantly smaller than the total number of neurons.
This work represents an initial step towards implementing quantum devices to accelerate NN training. The favorable scaling observed in our proof-of-principle experiments highlights the potential of quantum devices to surpass conventional training methods. Future directions naturally include larger datasets, deeper architectures, and systematic comparisons with state-of-the-art classical approaches on more challenging tasks. Equally exciting is the prospect of next-generation quantum annealers with longer coherence times, larger qubit counts, and enhanced connectivity, which may enable fully coherent training protocols. Ultimately, one may wonder: if quantum devices prove more effective at navigating complex energy landscapes and training neural networks for increasingly demanding tasks, could they open the door to forms of intelligence beyond what is practically attainable with purely classical methods?
Data availability
The data that support the findings of this study are provided in Supplementary Data.
Code availability
The code that supports the findings of this study is available from the corresponding author upon request.
References
McCulloch, W. S. & Pitts, W. A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys. 5, 115–133 (1943).
Rosenblatt, F. The perceptron: a probabilistic model for information storage and organization in the brain. Psychol. Rev. 65, 386–408 (1958).
Hopfield, J. J. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl Acad. Sci. USA 79, 2554–2558 (1982).
Hinton, G. E., Sejnowski, T. J. & Ackley, D. H. Boltzmann machines: Constraint satisfaction networks that learn (Carnegie-Mellon University, Department of Computer Science, Pittsburgh, 1984).
Hinton, G. E., Osindero, S. & Teh, Y.-W. A fast learning algorithm for deep belief nets. Neural Comput. 18, 1527–1554 (2006).
Bengio, Y., Lamblin, P., Popovici, D. & Larochelle, H. Greedy Layer-Wise Training of Deep Networks. In Advances in Neural Information Processing Systems, vol. 19 (MIT Press, 2006).
Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems, vol. 25 (Curran Associates, Inc., 2012).
LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
Goodfellow, I. J. et al. Generative adversarial nets. Adv. Neural. Inf. Process. Syst. 27 (2014).
Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016).
Vaswani, A. et al. Attention is all you need. Adv. Neural. Inf. Process. Syst. 30 (2017).
Brown, T. et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems, vol. 33, 1877–1901 (Curran Associates, Inc., 2020).
OpenAI. GPT-4 Technical Report (2024). arXiv:2303.08774.
Robbins, H. & Monro, S. A stochastic approximation method. Ann. Math. Stat. 22, 400–407 (1951).
Rumelhart, D. E., Hinton, G. E. & Williams, R. J. Learning representations by back-propagating errors. Nature 323, 533–536 (1986).
Glorot, X., Bordes, A. & Bengio, Y. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 315–323 (JMLR Workshop and Conference Proceedings, 2011).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778 (2016).
Kaplan, J. et al. Scaling laws for neural language models. arXiv:2001.08361 (2020).
Kim, Y. et al. Evidence for the utility of quantum computing before fault tolerance. Nature 618, 500–505 (2023).
Bluvstein, D. et al. Logical quantum processor based on reconfigurable atom arrays. Nature 626, 58–65 (2024).
Google Quantum AI and Collaborators. Quantum error correction below the surface code threshold. Nature 638, 920–926 (2025).
Gao, D. et al. Establishing a new benchmark in quantum computational advantage with 105-qubit Zuchongzhi 3.0 Processor. Phys. Rev. Lett. 134, 090601 (2025).
King, A. D. et al. Beyond-classical computation in quantum simulation. Science 388, 199–204 (2025).
Aboumrad, W. et al. Accelerating large-scale linear algebra using variational quantum imaginary time evolution. arXiv:2503.13128 (2025).
Kadowaki, T. & Nishimori, H. Quantum annealing in the transverse Ising model. Phys. Rev. E 58, 5355–5363 (1998).
Brooke, J., Bitko, D., Rosenbaum, T. F. & Aeppli, G. Quantum annealing of a disordered magnet. Science 284, 779 (1999).
Farhi, E. et al. A quantum adiabatic evolution algorithm applied to random instances of an NP-complete problem. Science 292, 472–475 (2001).
Santoro, G. E., Martoňák, R., Tosatti, E. & Car, R. Theory of quantum annealing of an Ising spin glass. Science 295, 2427–2430 (2002).
Johnson, M. W. et al. Quantum annealing with manufactured spins. Nature 473, 194–198 (2011).
Jiang, S., Britt, K. A., McCaskey, A. J., Humble, T. S. & Kais, S. Quantum annealing for prime factorization. Sci. Rep. 8, 17667 (2018).
Criado, J. C. & Spannowsky, M. Qade: solving differential equations on quantum annealers. Quant. Sci. Technol. 8, 015021 (2022).
King, A. D. et al. Coherent quantum annealing in a programmable 2,000 qubit Ising chain. Nat. Phys. 18, 1324–1328 (2022).
King, A. D. et al. Quantum critical dynamics in a 5000-qubit programmable spin glass. Nature (2023).
Bernaschi, M., González-Adalid Pemartín, I., Martín-Mayor, V. & Parisi, G. The quantum transition of the two-dimensional Ising spin glass. Nature 631, 749–754 (2024).
Zhang, H., Boothby, K. & Kamenev, A. Cyclic quantum annealing: searching for deep low-energy states in 5000-qubit spin glass. Sci. Rep. 14, 30784 (2024).
Wang, H., Yeh, H.-C. & Kamenev, A. Many-body localization enables iterative quantum optimization. Nat. Commun. 13, 5503 (2022).
Bauza, H. M. & Lidar, D. A. Scaling advantage in approximate optimization with quantum annealing. Phys. Rev. Lett. 134, 160601 (2025).
Zhang, H. & Kamenev, A. Computational complexity of three-dimensional Ising spin glass: Lessons from D-wave annealer. Phys. Rev. Res. 7, 033098 (2025).
Adachi, S. H. & Henderson, M. P. Application of quantum annealing to training of deep neural networks. arXiv:1510.06356 (2015).
Job, J. & Adachi, S. Systematic comparison of deep belief network training using quantum annealing vs. classical techniques. arXiv:2009.00134 (2020).
Dixit, V., Selvarajan, R., Alam, M. A., Humble, T. S. & Kais, S. Training restricted Boltzmann machines with a D-wave quantum annealer. Front. Phys. 9 (2021).
Higham, C. F. & Bedford, A. Quantum deep learning by sampling neural nets with a quantum annealer. Sci. Rep. 13, 3939 (2023).
McClean, J. R., Boixo, S., Smelyanskiy, V. N., Babbush, R. & Neven, H. Barren plateaus in quantum neural network training landscapes. Nat. Commun. 9, 4812 (2018).
Lloyd, S. & Weedbrook, C. Quantum generative adversarial learning. Phys. Rev. Lett. 121, 040502 (2018).
Dendukuri, A. et al. Defining quantum neural networks via quantum time evolution. arXiv:1905.10912 (2020).
Senokosov, A., Sedykh, A., Sagingalieva, A., Kyriacou, B. & Melnikov, A. Quantum machine learning for image classification. Mach. Learn.: Sci. Technol. 5, 015040 (2024).
Barney, R., Lakhdar-Hamina, D. & Galitski, V. Natural quantization of neural networks. arXiv:2503.15482 (2025).
Biamonte, J. et al. Quantum machine learning. Nature 549, 195–202 (2017).
Huang, H. Statistical Mechanics of Neural Networks (Springer Nature Singapore, 2021).
Barney, R., Winer, M. & Galitski, V. Neural networks as spin models: from glass to hidden order through training. arXiv:2408.06421 (2024).
Grover, L. K. A fast quantum mechanical algorithm for database search. In Proceedings of the twenty-eighth annual ACM symposium on Theory of computing, 212–219 (1996).
Brassard, G., Hoyer, P., Mosca, M. & Tapp, A. Quantum amplitude amplification and estimation. Quantum Comput. Quantum Inf. AMS Contemp. Math. 305, 53–74 (2002).
Laydevant, J., Marković, D. & Grollier, J. Training an Ising machine with equilibrium propagation. Nat. Commun. 15, 3671 (2024).
Scellier, B. & Bengio, Y. Equilibrium Propagation: Bridging the Gap between Energy-Based Models and Backpropagation. Front. Comput. Neurosci. 11 (2017).
Pedregosa, F. et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Yan, B. et al. Fixed-point oblivious quantum amplitude-amplification algorithm. Sci. Rep. 12, 14339 (2022).
Kikuchi, Y., Mc Keever, C., Coopmans, L., Lubasch, M. & Benedetti, M. Realization of quantum signal processing on a noisy quantum computer. npj Quantum Inf. 9, 1–12 (2023).
V, K. & Santhanam, M. S. Amplitude amplification and estimation using the atom-optics kicked rotor. Phys. Rev. A 111, 032601 (2025).
Acknowledgements
We are grateful to M. Amin, V. Galitski, and A. King for useful discussions. This work was supported by the NSF grant DMR-2338819.
Author information
Authors and Affiliations
Contributions
H.Z. and A.K. conceived the idea. H.Z. performed experiments. H.Z. and A.K. discussed the data and wrote the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Communications Physics thanks the anonymous reviewers for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Zhang, H., Kamenev, A. Quantum sequel of neural network training. Commun Phys 8, 478 (2025). https://doi.org/10.1038/s42005-025-02384-8