Introduction

Continual learning, also known as incremental learning or lifelong learning, aims to empower artificial intelligence with strong adaptability to the non-stationary real world1,2,3. It is a fundamental feature of natural intelligence, yet poses a notorious challenge for artificial intelligence based on deep neural networks. A major obstacle that hinders continual learning is catastrophic forgetting, where adaptation to a new task generally leads to a largely reduced performance on old tasks4,5. This dilemma reflects a delicate trade-off between learning plasticity and memory stability: different sequential tasks correspond to different distributions, and maintaining plasticity would compromise stability in general1. In recent years, numerous efforts have been devoted to tackling this problem and the field of continual learning has been expanding rapidly6,7,8,9,10, with potential applications including medical diagnosis11, autonomous driving12, and financial markets13.

In parallel, the field of quantum computing has also made striking progress recently, with the experimental demonstration of quantum supremacy14,15,16 and error correction codes17,18,19,20 marked as the latest breakthroughs. The interplay between quantum computing and machine learning gives rise to a new research frontier of quantum machine learning21,22,23,24. Different quantum learning algorithms25,26,27,28,29,30 have been proposed and some of them have been demonstrated in proof-of-principle experiments with current noisy intermediate-scale quantum (NISQ) devices31,32. However, to date most quantum learning models have been designed for a specific predefined task with a static data distribution and no experiment on quantum learning of multiple tasks sequentially has been reported. For quantum artificial intelligence systems to accommodate dynamic streams of data in the real world, the capability of continual learning is indispensable and crucial. To this end, a recent theoretical work has extended continual learning to the quantum domain33. It is found that similar to classical learning models based on neural networks, quantum learning systems based on variational quantum circuits would suffer from catastrophic forgetting as well. In addition, a uniform strategy, namely the elastic weight consolidation (EWC) method34, has also been proposed to overcome this problem and achieve quantum continual learning. Despite this stimulating theoretical progress, experimental demonstration of quantum continual learning with NISQ devices is challenging and remains uncharted hitherto. To accomplish this, one faces at least two apparent difficulties: (i) constructing an experimentally feasible quantum classifier with sufficient expressivity to accommodate multiple tasks with diverse non-stationary data distributions and (ii) obtaining Fisher information required for implementing the EWC method in the presence of inevitable experimental noise.

In this paper, we overcome these difficulties and report the first experimental demonstration of quantum continual learning with a fully programmable superconducting quantum processor (Fig. 1a). We construct a quantum classifier with more than two hundred variational parameters, by using an array of 18 transmon qubits featuring average simultaneous single- and two-qubit gate fidelities greater than 99.96% and 99.68% respectively. We demonstrate that, without EWC regulation, such a quantum classifier exhibits catastrophic forgetting when incrementally learning three tasks, including classifying real-life images and recognizing quantum phases (Fig. 1b). However, by employing the EWC method, we can achieve a proper balance between memory stability for previous tasks and learning plasticity for new tasks, thus attaining quantum continual learning (Fig. 1c,d). In addition, we compare the continual learning performance of quantum classifiers with that of classical classifiers in sequentially handling an engineered quantum task and a classical task. We demonstrate that the quantum classifier can incrementally learn the two tasks with an overall accuracy up to 95.8%, exceeding the best overall accuracy of 81.3% achieved by the classical classifier with a comparable number of parameters. This manifests quantum enhancement in continual learning scenarios.

Fig. 1: Experimental quantum continual learning.

a An 18-qubit quantum classifier running on the superconducting processor; the transmon qubits used are marked in orange. b Training data for three consecutive learning tasks. \({{\mathcal{T}}}_{1}\) concerns the classification of images depicting “T-shirt” and “ankle boot” sampled from the Fashion-MNIST dataset35. \({{\mathcal{T}}}_{2}\) involves identifying images labeled as “Hand” and “Breast” from the magnetic resonance imaging dataset36. \({{\mathcal{T}}}_{3}\) is about recognizing quantum states in a symmetry-protected topological (SPT) phase and an antiferromagnetic (ATF) phase. c Illustration of elastic weight consolidation (EWC). EWC aims to balance memory stability for the previous task with learning plasticity for the new task. Memory stability is preserved by penalizing the deviation of each parameter θ from its optimal value θ⋆ for the previous task in proportion to the parameter’s importance, which is measured by the Fisher information. d Conceptual diagram of catastrophic forgetting and continual learning. In a continual learning scenario, catastrophic forgetting refers to the dramatic performance drop on previous tasks after learning a new one. Continual learning is achieved when the learning system maintains good performance on previous tasks while learning a new one.

Results

Framework and experimental setup

We first introduce the general framework for quantum continual learning33. We consider a continual learning scenario involving three sequential tasks, denoted as \({{\mathcal{T}}}_{k}\) (k = 1, 2, 3). As shown in Fig. 1b, \({{\mathcal{T}}}_{1}\) concerns classifying clothing images labeled as “T-shirt” and “ankle boot” from the Fashion-MNIST dataset35, \({{\mathcal{T}}}_{2}\) concerns classifying medical magnetic resonance imaging (MRI) scans labeled as “Hand” and “Breast”36, and \({{\mathcal{T}}}_{3}\) involves classifying quantum states in antiferromagnetic and symmetry-protected topological phases. The learning process consists of three stages for sequentially learning these tasks. For the k-th task, we define the following cross-entropy loss function

$$\begin{array}{rcl}{L}_{k}({\boldsymbol{\theta }}) & = & \frac{1}{{N}_{k}}\mathop{\sum }\limits_{i=1}^{{N}_{k}}L\left(h\left({{\boldsymbol{x}}}_{k,i};{\boldsymbol{\theta }}\right),{{\bf{a}}}_{k,i}\right)\\ & = & -\frac{1}{{N}_{k}}\mathop{\sum }\limits_{i=1}^{{N}_{k}}({{\bf{a}}}_{k,i}^{0}\log {{\bf{g}}}_{k,i}^{0}+{{\bf{a}}}_{k,i}^{1}\log {{\bf{g}}}_{k,i}^{1}),\end{array}$$
(1)

where Nk is the number of training samples for the k-th task, xk,i denotes the i-th training sample, \({{\bf{a}}}_{k,i}=({{\bf{a}}}_{k,i}^{0},{{\bf{a}}}_{k,i}^{1})\) denotes the ground-truth label of xk,i in one-hot encoding, \(h\left({{\boldsymbol{x}}}_{k,i};{\boldsymbol{\theta }}\right)\) denotes the hypothesis function of the quantum classifier parameterized by θ, and \({{\bf{g}}}_{k,i}=({{\bf{g}}}_{k,i}^{0},{{\bf{g}}}_{k,i}^{1})\) denotes the probabilities that the quantum classifier assigns label 0 and label 1, respectively. The performance of the quantum classifier is evaluated on the test dataset for \({{\mathcal{T}}}_{k}\). In our experiment, we first train the quantum classifier with the above loss function for each task sequentially. After each learning stage, the quantum classifier performs well on the current task but experiences a dramatic performance drop on the previous ones, demonstrating the phenomenon of catastrophic forgetting in quantum learning.
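For concreteness, once the assignment probabilities \({{\bf{g}}}_{k,i}\) have been estimated from circuit measurements, the loss in Equation (1) is evaluated classically. A minimal NumPy sketch (our own illustration, not code from the experiment):

```python
import numpy as np

def cross_entropy_loss(probs, labels, eps=1e-12):
    """Average cross-entropy of Equation (1) over a task's training set.

    probs  : (N, 2) array of measured probabilities (g^0, g^1) for each sample
    labels : (N, 2) array of one-hot ground-truth labels (a^0, a^1)
    eps    : numerical floor to avoid log(0)
    """
    probs = np.clip(probs, eps, 1.0)
    return -np.mean(np.sum(labels * np.log(probs), axis=1))

# Example with three samples classified by the quantum circuit
probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
labels = np.array([[1, 0], [0, 1], [1, 0]])
print(cross_entropy_loss(probs, labels))
```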

A salient strategy that can overcome catastrophic forgetting in quantum learning systems is the EWC method33,34, which preserves memories for previous tasks by penalizing parameter changes according to the importance of each parameter. To demonstrate its effectiveness, in the k-th stage some regularization terms are added to the cross-entropy loss for \({{\mathcal{T}}}_{k}\), yielding a modified loss function

$${L}_{k}^{\,{\rm{EWC}}}({\boldsymbol{\theta }})={L}_{k}({\boldsymbol{\theta }})+\mathop{\sum }\limits_{t=1}^{k-1}\frac{{\lambda }_{k,t}}{2}\mathop{\sum }\limits_{j}{F}_{t,j}{\left({\theta }_{j}-{\theta }_{t,j}^{\star }\right)}^{2},$$
(2)

where λk,t controls the regularization strength for \({{\mathcal{T}}}_{t}\) in the k-th stage; \({{\boldsymbol{\theta }}}_{t}^{\star }\) is the parameter vector obtained after the t-th stage; and Ft,j denotes the Fisher information measuring the importance of the j-th parameter, i.e., how strongly small changes to this parameter affect the performance on \({{\mathcal{T}}}_{t}\). A schematic illustration of the main idea of quantum continual learning is shown in Fig. 1c, d.
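In code, the EWC regularization simply adds quadratic penalties anchored at the optima of earlier stages. A hedged NumPy sketch of Equation (2), assuming the Fisher diagonals, anchor parameters, and regularization strengths from previous stages are available:

```python
import numpy as np

def ewc_loss(theta, base_loss, history):
    """EWC-regularized loss of Equation (2).

    theta     : (P,) current variational parameters
    base_loss : cross-entropy L_k(theta) of the current task (a float)
    history   : list of dicts for previous tasks t < k, with keys
                'lam' (lambda_{k,t}), 'fisher' (F_t, shape (P,)),
                'theta_star' (theta_t^*, shape (P,))
    """
    penalty = 0.0
    for h in history:
        penalty += 0.5 * h['lam'] * np.sum(h['fisher'] * (theta - h['theta_star']) ** 2)
    return base_loss + penalty
```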

Our experiments are conducted on a flip-chip superconducting quantum processor (Fig. 1a), which possesses 121 transmon qubits arranged in a two-dimensional array with tunable nearest-neighbor couplings. We choose 18 qubits (marked in orange in Fig. 1a) to implement a variational quantum classifier with a circuit depth of 20 and 216 trainable variational parameters (Fig. 2). To achieve better learning performance, we push the average simultaneous two-qubit gate fidelity above 99.68% by optimizing the device fabrication and control processes. We mention that the gradients and Fisher information required for updating the quantum classifier are obtained by directly measuring observables in the experiment based on the “parameter-shift rule”37. Supplementary Section IIA provides more details about the characterization of the device.

Fig. 2: Quantum circuit classifier with 18 superconducting qubits for learning three sequential tasks.

The circuit consists of four blocks of operations with a total of 216 variational parameters. Each block applies three consecutive single-qubit rotation gates to all qubits, followed by two layers of CNOT gates acting on adjacent qubits. The quantum classifier adopts the interleaved block encoding strategy to encode classical data and naturally handles quantum data (in the form of quantum states) as input. For each input sample, the classifier determines the prediction label based on the local observable \(\langle {\widehat{\sigma }}_{9}^{z}\rangle\): label 0 and label 1 for \(\langle {\widehat{\sigma }}_{9}^{z}\rangle \ge 0\) and \(\langle {\widehat{\sigma }}_{9}^{z}\rangle < 0\), respectively.

Demonstration of catastrophic forgetting

To demonstrate catastrophic forgetting in quantum learning, we sequentially train the quantum classifier on the three tasks with the loss function defined in Equation (1). Our experimental results are displayed in Fig. 3a. The learning process comprises three stages. In the first stage, the quantum classifier is trained to learn \({{\mathcal{T}}}_{1}\). After 20 epochs of parameter updating, the prediction accuracy for classifying clothing images reaches 99%.

Fig. 3: Experimental results for continually learning three tasks.

a, b Prediction accuracy for the three sequential tasks at each epoch during the continual learning process of the quantum classifier. Tasks \({{\mathcal{T}}}_{1}\), \({{\mathcal{T}}}_{2}\), and \({{\mathcal{T}}}_{3}\) are marked in green, blue, and orange, respectively. The right (left) panel shows the case with (without) EWC. c Distribution of the experimentally measured expectation values \(\langle {\widehat{\sigma }}_{9}^{z}\rangle\), which determine the prediction label of input data, for all test samples after training. For each task, the solid line and dotted line correspond to the two classes of data samples, respectively. A greater separation between the two distributions indicates better classification performance. d Distribution of Fisher information (FI) over all parameters after learning each task. e Average parameter change, relative to the parameters obtained for previous tasks, during the learning stage for the new task. The top (bottom) panel corresponds to the learning of \({{\mathcal{T}}}_{2}\) (\({{\mathcal{T}}}_{3}\)).

In the second stage, the quantum classifier is retrained on the training data for \({{\mathcal{T}}}_{2}\). After 28 epochs, it attains a classification accuracy of 99% on \({{\mathcal{T}}}_{2}\). However, after this training stage, the performance on \({{\mathcal{T}}}_{1}\) drops dramatically to 54%. In the third stage, the quantum classifier is further trained to recognize quantum phases. After 18 epochs, it achieves an accuracy of 100%. However, the accuracies for \({{\mathcal{T}}}_{2}\) and \({{\mathcal{T}}}_{1}\) fall dramatically to 64% and 55%, respectively. These experimental results clearly showcase the phenomenon of catastrophic forgetting in quantum learning.

Continual learning with EWC

In this section, we show that the catastrophic forgetting demonstrated above can be effectively overcome with the EWC method. To this end, we sequentially train the quantum classifier with the modified loss function that includes the EWC regularization as defined in Equation (2). Our experimental results are shown in Fig. 3b. We observe that after the second learning stage, the prediction accuracy for \({{\mathcal{T}}}_{2}\) reaches 95% while the accuracy for \({{\mathcal{T}}}_{1}\) remains at 97%. After the third learning stage, the prediction accuracy for \({{\mathcal{T}}}_{3}\) reaches 96%, while the accuracies for \({{\mathcal{T}}}_{2}\) and \({{\mathcal{T}}}_{1}\) remain at 88% and 93%, respectively. This is in sharp contrast to the case without the EWC strategy, where they drop to 64% and 55%, respectively. After training, we plot the distribution of the experimentally measured \(\langle {\widehat{\sigma }}_{9}^{z}\rangle\), whose sign determines the assigned labels, for all test data samples, as shown in Fig. 3c. It is clear that when applying EWC, data samples from \({{\mathcal{T}}}_{1}\) and \({{\mathcal{T}}}_{2}\) with different labels are far more distinguishable than in the case without EWC, which confirms that the learned knowledge for \({{\mathcal{T}}}_{1}\) and \({{\mathcal{T}}}_{2}\) is effectively preserved with EWC.

To further understand how EWC balances the stability-plasticity trade-off for quantum classifiers, we analyze the average parameter changes in the case with EWC. According to Equation (2), for parameters with larger Fisher information, deviations from their optimal values for previous tasks cause a relatively larger increase in the loss function. Therefore, parameters with large Fisher information tend to undergo only small adjustments when learning the new task, so as to minimize the increase in the loss function. To verify this understanding experimentally, we measure F1,j for each parameter after the first learning stage. As shown in Fig. 3d, we find that only 11 parameters have F1,j values larger than 0.01, while the other 205 parameters have F1,j values less than 0.01. Based on this, we divide all parameters into two groups and plot the average parameter change for each group during the second learning stage for \({{\mathcal{T}}}_{2}\). The results are shown in Fig. 3e. From this figure, it is clear that in the case with EWC, the parameters with large Fisher information (>0.01) experience smaller changes on average than the parameters with small Fisher information (<0.01). This is consistent with the goal of EWC, which is to ensure that more important parameters experience smaller changes, thereby better maintaining the performance on \({{\mathcal{T}}}_{1}\). The average parameter change in the third stage for learning \({{\mathcal{T}}}_{3}\) is also plotted in Fig. 3e, showing similar behavior. Compared to the case without EWC, parameters with both large and small Fisher information exhibit smaller changes. This is consistent with the fact that the added regularization terms in general constrain the change of parameters. These experimental results unambiguously demonstrate the effectiveness of EWC in mitigating catastrophic forgetting in quantum continual learning scenarios.
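This grouping analysis is straightforward to reproduce. A short sketch (our own illustration), assuming arrays of the measured Fisher values and of the parameter vectors before and after a learning stage:

```python
import numpy as np

def average_change_by_importance(fisher, theta_before, theta_after, cut=0.01):
    """Mean absolute parameter change, split by a Fisher-information cutoff."""
    change = np.abs(theta_after - theta_before)
    important = fisher > cut
    return change[important].mean(), change[~important].mean()
```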

We remark that, after learning each task, only a small portion of all parameters have relatively large Fisher information. This reflects that memories for the task can be preserved by selectively stabilizing these parameters. The majority of parameters, with relatively small Fisher information, retain a relatively large space to learn new tasks in subsequent stages. This selective stabilization mechanism in EWC mirrors biological learning processes, where old memories are preserved by strengthening previously learned synaptic changes6. We also mention that, although various continual learning strategies other than the EWC method exist1, overcoming the catastrophic forgetting problem has been proven to be NP-hard in general38. As a result, we do not expect the EWC method for quantum continual learning demonstrated above to be universally applicable to arbitrary sequential tasks or to achieve optimal performance on any given task.

Quantum enhancement

We consider two classification tasks with distinct data distributions: an engineered quantum task denoted as \({{\mathcal{T}}}_{1}^{{\prime} }\) and a classical task denoted as \({{\mathcal{T}}}_{2}^{{\prime} }\). As shown in Fig. 4a, \({{\mathcal{T}}}_{1}^{{\prime} }\) involves classifying engineered data samples whose target functions are generated by a quantum model39,40,41, whereas \({{\mathcal{T}}}_{2}^{{\prime} }\) involves identifying medical images. To construct the dataset for \({{\mathcal{T}}}_{1}^{{\prime} }\), we choose clothing images of “T-shirt” and “ankle boot” as the source data and use principal component analysis (PCA) to compress each image to ten dimensions. We generate the ground-truth label of each input sample using a quantum model. To realize this, the ten-dimensional vector of each sample is first encoded as a ten-qubit quantum state. The ground-truth label is then determined by the local observable \(\langle {\widehat{\sigma }}_{1}^{z}\rangle\) measured after evolution under a given quantum circuit with randomly chosen variational parameters (Methods). For \({{\mathcal{T}}}_{2}^{{\prime} }\), we use medical images as the source data. We similarly use PCA to compress each image to a ten-dimensional vector. The ground-truth label of each data sample is its original label, “Hand” or “Breast”.

Fig. 4: Comparison between the continual learning performances for the quantum and classical classifiers.

a Training data for two sequential tasks \({{\mathcal{T}}}_{1}^{{\prime} }\) and \({{\mathcal{T}}}_{2}^{{\prime} }\). For \({{\mathcal{T}}}_{1}^{{\prime} }\), we choose clothing images as the source data and use principal component analysis (PCA) to reduce each image to a ten-dimensional vector. The ground-truth label of each input sample is determined by the local observable \(\langle {\widehat{\sigma }}_{1}^{z}\rangle\) measured after evolution under a quantum circuit with randomly chosen gate parameters. For \({{\mathcal{T}}}_{2}^{{\prime} }\), we choose medical images as the source data. We use PCA to compress each image to a ten-dimensional vector as the input data. The label of the input vector is determined by the category of its original image. b Schematic illustration of a quantum classifier and a classical classifier based on the feedforward neural network (FFNN). c Prediction accuracy for the two sequential tasks as a function of training epochs during the continual learning process. For both the quantum and classical classifiers, EWC is employed with the regularization strength set to 40. d Continual learning performance of the classical classifier as a function of regularization strength. For the classical classifier based on the FFNN, we employ EWC with different regularization strengths. For each regularization strength, we train the classical classifier 50 times and plot the mean prediction accuracies for \({{\mathcal{T}}}_{1}^{{\prime} }\) and \({{\mathcal{T}}}_{2}^{{\prime} }\), and their average. The best achievable overall performance, evaluated as the average of the accuracies on \({{\mathcal{T}}}_{1}^{{\prime} }\) and \({{\mathcal{T}}}_{2}^{{\prime} }\), is 81.3% for the classical classifier and 95.8% for the quantum classifier.

In a continual learning scenario involving these two tasks in sequence, we compare the performance of quantum and classical models. For quantum learning, we experimentally implement a ten-qubit quantum circuit classifier with a total of 90 variational parameters (Fig. 4b, left). The learning process consists of two stages. In each stage, the ten-dimensional vector of each input sample is embedded into a ten-qubit quantum state and then processed by the ten-qubit variational quantum classifier (Methods). In Fig. 4c, we present the experimental results. In the first stage, the quantum classifier is trained on \({{\mathcal{T}}}_{1}^{{\prime} }\), achieving 99.1% prediction accuracy after 20 epochs of parameter updating. In the second stage of learning \({{\mathcal{T}}}_{2}^{{\prime} }\), the EWC method is employed with a regularization strength of λq = 40. After 16 training epochs, the accuracy on \({{\mathcal{T}}}_{2}^{{\prime} }\) reaches 98%, while the accuracy on \({{\mathcal{T}}}_{1}^{{\prime} }\) slightly drops to 93.7%. The overall performance, evaluated as the average accuracy over the two tasks, is 95.8%.

For classical learning, we use a three-layer feedforward neural network with 241 variational parameters as the classical classifier (Fig. 4b, right). In each learning stage, the ten-dimensional vector is directly taken as the input data of the classical classifier. We present the numerical results in Fig. 4d. We find that the classical classifier struggles to achieve good performance on both tasks simultaneously, as \({{\mathcal{T}}}_{1}^{{\prime} }\) and \({{\mathcal{T}}}_{2}^{{\prime} }\) largely interfere with each other. The dominance of each task depends on the regularization strength λc used in EWC. For small values of λc, the classical classifier achieves high accuracy on \({{\mathcal{T}}}_{2}^{{\prime} }\) but performs poorly on \({{\mathcal{T}}}_{1}^{{\prime} }\), indicating catastrophic forgetting. As λc increases, the classical classifier places more weight on preserving old memories for \({{\mathcal{T}}}_{1}^{{\prime} }\). This leads to an improvement in performance on \({{\mathcal{T}}}_{1}^{{\prime} }\) and a drop in performance on \({{\mathcal{T}}}_{2}^{{\prime} }\). When λc is increased to a large value (λc = 100), the classical classifier almost completely loses its learning plasticity for \({{\mathcal{T}}}_{2}^{{\prime} }\) in the second learning stage. The best overall performance that can be achieved by the classical classifier is 81.3%. In addition, we implement a classical convolutional neural network (CNN) with 181 variational parameters. The simulation results (Fig. S10) show that the CNN classifier achieves an overall performance of up to 81.1%.

The comparison between quantum and classical models shows that quantum models can outperform classical models in certain continual learning scenarios, despite containing fewer variational parameters. This agrees with the theoretical predictions that quantum neural networks in general possess larger expressive power42 and effective dimension43 than classical ones with a comparable number of parameters, and would thus better accommodate the distribution differences among multiple tasks and deliver superior overall performance in continual learning scenarios.

Discussion

In classical continual learning, a variety of strategies other than the EWC method, such as orthogonal gradient projection44 and parameter allocation45, have been proposed to overcome catastrophic forgetting. These strategies might also be adapted to quantum continual learning scenarios, and their experimental demonstrations would be interesting and important. Our work focuses on a representative approach, EWC, as a proof-of-concept demonstration of quantum continual learning on near-term quantum hardware. Along this direction, it is worthwhile to mention a subtle distinction between quantum and classical continual learning. In the quantum domain, due to the no-cloning theorem46 and the difficulty of building long-lived quantum memories47, one cannot duplicate unknown quantum data and store them for a long time. As a result, replay-based strategies that rely on recovering the old data distributions48,49 require either (currently unavailable) fault-tolerant quantum random access memories50 or the training of a quantum generative model for each task. The latter would then need to be re-executed on hardware to synthesize past samples, introducing substantial overhead. By contrast, EWC only stores a classical representation of the Fisher information matrix (or its diagonal) for old tasks. This avoids quantum data storage and makes EWC a more viable strategy for realizing quantum continual learning on near-term quantum devices. In addition, this work primarily focuses on classification tasks in the framework of supervised learning. Extending quantum continual learning to unsupervised and reinforcement learning presents further technical difficulties and has yet to be achieved in either theory or experiment. The use of classical learning surrogates51,52 is a promising approach to reduce the training cost of variational quantum circuits and may thus assist the development of quantum continual learning.

We note that our quantum continual learning strategy against catastrophic forgetting shares a conceptual similarity with quantum error mitigation techniques designed to combat environmental noise. Combining them at the current stage is, however, highly non-trivial. Quantum error mitigation techniques such as zero-noise extrapolation53 typically introduce significant overhead, often requiring multiple circuit executions with modified parameters or deeper circuits. When combined with the already-intensive cost of estimating the Fisher information matrix for EWC, the total burden can become impractical on current hardware. We expect that future integration of hardware-level error mitigation techniques with algorithm-level continual learning strategies could further enhance quantum continual learning performance on real quantum devices.

Enabling quantum learning models to accommodate a dynamic stream of tasks demands long-term research. Our work takes a first step in this direction by experimentally demonstrating the issue of catastrophic forgetting and the effectiveness of the EWC method for quantum continual learning. We note that while variational quantum classifiers offer a flexible framework for encoding and processing classical and quantum data, they face scalability limitations. In particular, training deep and high-dimensional variational quantum classifiers beyond classical simulability would be hindered by issues such as barren plateaus54,55,56,57. Beyond this general limitation, the barren plateau issue may pose a unique challenge to EWC itself. If an old task’s loss landscape is a barren plateau, its gradients (and thus the Fisher information, which measures the landscape’s curvature) decay exponentially with the qubit number. The EWC penalty term would consequently vanish, leading to the spurious conclusion that no parameters are important for that task, so EWC would fail to protect the old task’s knowledge. Despite these limitations, our work provides a proof-of-principle experimental demonstration of quantum continual learning on existing quantum hardware, motivating the future development of quantum continual learning strategies not only for variational quantum classifiers but also for more robust and scalable quantum machine learning architectures.

Methods

Variational quantum classifiers

We build the quantum classifiers with multiple blocks of operations, as illustrated in Figs. 2 and 5. Each block contains three layers of single-qubit gates with programmable rotation angles and ends with two layers of entangling gates for leveraging the exponentially large Hilbert space and establishing quantum correlations among the qubits. For classification tasks, the quantum classifier assigns a label to each input sample based on the measured expectation value of the Pauli-Z operator on the m-th qubit, \(\langle {\widehat{\sigma }}_{m}^{z}\rangle\): a label for one class is assigned when \(\langle {\widehat{\sigma }}_{m}^{z}\rangle \ge 0\), while a label for the other class is assigned when \(\langle {\widehat{\sigma }}_{m}^{z}\rangle < 0\). In the experiment for learning \({{\mathcal{T}}}_{1}\), \({{\mathcal{T}}}_{2}\) and \({{\mathcal{T}}}_{3}\), we use 18 qubits with four blocks to construct the quantum classifier with a total of 216 variational parameters, where the entangling gates are selected as CNOT gates and m = 9. In the experiment for learning \({{\mathcal{T}}}_{1}^{{\prime} }\) and \({{\mathcal{T}}}_{2}^{{\prime} }\), we construct a ten-qubit quantum classifier with three blocks containing a total of 90 variational parameters, where the entangling gates are selected as CZ gates and m = 1.
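To illustrate this block structure, a short Qiskit sketch with the stated layout follows (our own illustration, not the experimental control code; the specific rotation axes Rz-Rx-Rz are an assumption, as the experiment only specifies three parameterized rotations per qubit):

```python
from qiskit import QuantumCircuit
from qiskit.circuit import ParameterVector

def build_classifier(n_qubits=18, n_blocks=4, entangler="cx"):
    """Variational classifier: each block = 3 rotation layers + 2 entangling layers."""
    params = ParameterVector("theta", 3 * n_qubits * n_blocks)
    qc = QuantumCircuit(n_qubits)
    p = iter(params)
    for _ in range(n_blocks):
        for q in range(n_qubits):            # three single-qubit rotations per qubit
            qc.rz(next(p), q)
            qc.rx(next(p), q)
            qc.rz(next(p), q)
        for start in (0, 1):                 # two layers of entanglers on adjacent qubits
            for q in range(start, n_qubits - 1, 2):
                if entangler == "cx":
                    qc.cx(q, q + 1)
                else:
                    qc.cz(q, q + 1)
    return qc

print(build_classifier().num_parameters)                                          # 216
print(build_classifier(n_qubits=10, n_blocks=3, entangler="cz").num_parameters)   # 90
```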

Fig. 5: Dataset generation for the engineered quantum task.

Each ten-dimensional input vector is first embedded into a quantum state via the quantum feature encoding (Fig. S2b). A variational circuit with randomly chosen parameters is then applied to the state. The ground-truth label of each input vector is generated based on the local observable \(\langle {\widehat{\sigma }}_{1}^{z}\rangle\).

Dataset generation

The datasets for \({{\mathcal{T}}}_{1}\) and \({{\mathcal{T}}}_{2}\) are composed of images randomly selected from the Fashion-MNIST dataset35 and the MRI dataset36, respectively. The quantum dataset for \({{\mathcal{T}}}_{3}\) is composed of ground states of the cluster-Ising Hamiltonian58 in the ATF and SPT phases. We prepare approximate ground states in our experiments by executing a variational circuit. We first train the variational circuit on a classical computer with the aim of minimizing the energy expectation value for the output states. We then experimentally implement the variational circuit using the parameters obtained in the classical simulation. To characterize our quantum state preparation, we measure the string order parameter for these prepared states. In Supplementary Sec. IIB, we provide a detailed discussion about the quantum state preparation. For each of \({{\mathcal{T}}}_{1}\), \({{\mathcal{T}}}_{2}\), and \({{\mathcal{T}}}_{3}\), we construct a training set with 500 data samples and a test set with 100 data samples.

To construct the dataset for \({{\mathcal{T}}}_{1}^{{\prime} }\), we use input data sourced from the Fashion-MNIST dataset. Specifically, we randomly select 1200 images labeled as “T-shirt” and “ankle boot”. We first perform PCA to compress these images to ten-dimensional vectors. Subsequently, each feature of these ten-dimensional vectors is further normalized to have a mean value of 0 and a standard deviation of 1. As depicted in Fig. 5, we generate the label g(x) for each data sample x using functions generated by a quantum model. To this end, we first use the feature encoding proposed in ref.39 to encode x into a quantum state. The quantum circuit for this feature encoding is shown in Fig. S2b. We then experimentally implement the quantum circuit model with three blocks of operations. The variational parameters θ of the circuit are randomly generated within \([0,2\pi ]^{90}\). The ground-truth label g(x) is determined by the local observable \(\langle {\widehat{\sigma }}_{1}^{z}\rangle\) measured after evolution under the above circuit model: g(x) = 0 if \(\langle {\widehat{\sigma }}_{1}^{z}\rangle > 0.2\) and g(x) = 1 if \(\langle {\widehat{\sigma }}_{1}^{z}\rangle < -0.2\). In our experiment, we obtain a total of 667 data samples with g(x) being 0 or 1. We select 556 of them as the training dataset and the remaining 111 as the test dataset.
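A sketch of this relabeling pipeline (our own illustration; scikit-learn is assumed for PCA, and quantum_expectation is a hypothetical stand-in for the encoding circuit, the random variational circuit, and the measurement of \(\langle {\widehat{\sigma }}_{1}^{z}\rangle\)):

```python
import numpy as np
from sklearn.decomposition import PCA

def make_engineered_dataset(images, quantum_expectation, seed=0, threshold=0.2):
    """Relabel PCA-compressed images with a quantum model, as for task T1'.

    images              : (N, D) array of flattened source images
    quantum_expectation : callable(x, theta) -> <sigma_1^z>; hypothetical stand-in
                          for the encoding circuit, random circuit, and measurement
    """
    x = PCA(n_components=10).fit_transform(images)
    x = (x - x.mean(axis=0)) / x.std(axis=0)          # feature-wise standardization
    theta = np.random.default_rng(seed).uniform(0.0, 2 * np.pi, size=90)
    data, labels = [], []
    for xi in x:
        z = quantum_expectation(xi, theta)
        if z > threshold:
            data.append(xi); labels.append(0)
        elif z < -threshold:
            data.append(xi); labels.append(1)         # samples with |z| <= 0.2 are discarded
    return np.array(data), np.array(labels)
```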

To construct the dataset for \({{\mathcal{T}}}_{2}^{{\prime} }\), we use data from the MRI dataset. We randomly select 600 images labeled as “Hand” and “Breast”. We also employ PCA to compress these images to ten-dimensional vectors. The ground-truth label of each ten-dimensional vector is simply the label of the corresponding original image. We divide the 600 samples into a training dataset of size 500 and a test dataset of size 100.

Data encoding

In our experiments, we utilize different strategies to encode different types of data. We use the interleaved block encoding strategy59 to encode the classical images in the datasets for \({{\mathcal{T}}}_{1}\) and \({{\mathcal{T}}}_{2}\). For each classical image, we first reduce its size to 16 × 16 grayscale pixels and flatten it into a 256-dimensional vector. We then normalize the vector and add up adjacent entries to obtain a 128-dimensional vector x. As shown in Fig. 2, we assign each single-qubit rotation gate an angle of 2xi + θi, where θi is a variational parameter. We choose 128 rotation gates and assign to each the corresponding entry xi of x. For the remaining 88 rotation gates, we set the corresponding xi to zero. We note that other constant values could also be used for padding; however, this choice has no impact on performance in our setting. Since the variational parameters θi are randomly initialized, adding any constant value as padding simply results in an equally random initialization. Consequently, the specific padding value does not affect the model’s expressivity or optimization behavior. For the quantum data in \({{\mathcal{T}}}_{3}\), the quantum classifier can naturally handle these quantum states as input after their preparation on the quantum device.
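The classical preprocessing for this encoding can be summarized as follows (a minimal sketch under our own assumptions; the image is assumed to be supplied already resized to 16 × 16, and the L2 normalization convention is our choice, as the text does not specify it):

```python
import numpy as np

def preprocess_image(img16):
    """Turn a 16x16 grayscale image into the 128-dim vector x used for encoding."""
    v = img16.astype(float).reshape(256)
    v = v / np.linalg.norm(v)                # normalize the 256-dim vector
    return v.reshape(128, 2).sum(axis=1)     # add up adjacent entries -> 128 dims

def encoding_angles(x, theta, n_gates=216):
    """Rotation angles 2*x_i + theta_i, with x zero-padded to all 216 gates."""
    x_pad = np.zeros(n_gates)
    x_pad[:len(x)] = x                       # 128 data entries, 88 zero-padded
    return 2 * x_pad + theta
```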

For \({{\mathcal{T}}}_{1}^{{\prime} }\), we adopt the feature encoding approach proposed in Ref.39, with the circuit structure shown in Fig. S2b. This feature encoding is assumed to yield a kernel that is computationally hard to estimate on classical computers. For \({{\mathcal{T}}}_{2}^{{\prime} }\), we use a conventional rotation encoding approach in which the data vectors are encoded into a single layer of single-qubit rotation gates, with the circuit structure depicted in Fig. S2c.

Gradients and Fisher information

We minimize the loss function in Equation (1) via the gradient descent method. Based on the chain rule, the derivative of L with respect to the j-th parameter θj can be expressed as:

$$\frac{\partial L\left(h\left({{\boldsymbol{x}}}_{k,i};{\boldsymbol{\theta }}\right),{{\bf{a}}}_{k,i}\right)}{\partial {\theta }_{j}}=-\frac{{{\bf{a}}}_{k,i}^{0}}{{{\bf{g}}}_{k,i}^{0}}\frac{\partial {{\bf{g}}}_{k,i}^{0}}{\partial {\theta }_{j}}-\frac{{{\bf{a}}}_{k,i}^{1}}{{{\bf{g}}}_{k,i}^{1}}\frac{\partial {{\bf{g}}}_{k,i}^{1}}{\partial {\theta }_{j}}.$$
(3)

In our experiment, \({{\bf{g}}}_{k,i}^{0}\) and \({{\bf{g}}}_{k,i}^{1}\) are determined by the local observables \(\left|0\right\rangle {\left\langle 0\right|}_{m}\) and \(\left|1\right\rangle {\left\langle 1\right|}_{m}\) on the m-th qubit, respectively.

As all parameterized gates in the quantum classifier take the form \(\exp (-\frac{i}{2}\theta {P}_{n})\) with Pn belonging to the Pauli group, the derivatives of \({{\bf{g}}}_{k,i}^{l}\) can be computed via the “parameter-shift rule”37,60:

$$\frac{\partial {{\bf{g}}}_{k,i}^{l}}{\partial {\theta }_{j}}=\frac{{({{\bf{g}}}_{k,i}^{l})}^{+}-{({{\bf{g}}}_{k,i}^{l})}^{-}}{2},$$
(4)

where l = 0, 1, and \({({{\bf{g}}}_{k,i}^{l})}^{\pm }\) denotes the expectation values of the local observables with parameter θj being \({\theta }_{j}\pm \frac{\pi }{2}\).

We directly measure \({({{\bf{g}}}_{k,i}^{l})}^{\pm }\) in experiments to obtain the quantum gradients, based on which we apply gradient descent assisted by the Nadam optimizer61 to optimize the quantum classifier. The learning rate is set to 0.05 in experiments.
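A sketch of how Equations (3) and (4) combine in practice (our own illustration; the expectation callables are placeholders for hardware measurements of the local observables):

```python
import numpy as np

def parameter_shift_grad(expectation, theta, j):
    """Derivative of a measured expectation value w.r.t. theta_j, Equation (4)."""
    shift = np.zeros_like(theta)
    shift[j] = np.pi / 2
    return (expectation(theta + shift) - expectation(theta - shift)) / 2

def loss_gradient(g0_fn, g1_fn, theta, j, a):
    """Chain rule of Equation (3) for one sample with one-hot label a = (a^0, a^1)."""
    dg0 = parameter_shift_grad(g0_fn, theta, j)
    dg1 = parameter_shift_grad(g1_fn, theta, j)
    return -a[0] / g0_fn(theta) * dg0 - a[1] / g1_fn(theta) * dg1
```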

After learning the k-th task, we need to obtain the Fisher information Fk,j for measuring the importance of each variational parameter θj. Based on the derivatives of the loss function at \({{\boldsymbol{\theta }}}_{k}^{\star }\), we estimate Fk,j as:

$${F}_{k,j}=\frac{1}{{N}_{k}}\mathop{\sum }\limits_{i=1}^{{N}_{k}}{\left({\left.\frac{\partial L\left(h\left({{\boldsymbol{x}}}_{k,i};{\boldsymbol{\theta }}\right),{{\bf{a}}}_{k,i}\right)}{\partial {\theta }_{j}}\right|}_{{\boldsymbol{\theta }}={{\boldsymbol{\theta }}}_{k}^{\star }}\right)}^{2},$$
(5)

where the notation follows that in Equations (1) and (2). The detailed derivation of Fk,j is provided in Supplementary Sec. I.B.
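Numerically, this diagonal Fisher estimate is the sample mean of the squared per-sample loss gradients at the converged parameters; a minimal sketch, assuming the per-sample gradients have already been measured via the parameter-shift rule:

```python
import numpy as np

def fisher_diagonal(per_sample_grads):
    """Diagonal Fisher estimate of Equation (5).

    per_sample_grads : (N, P) array of dL/dtheta evaluated at theta_k^* for each
                       of the N training samples of task k
    """
    return np.mean(per_sample_grads ** 2, axis=0)
```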

Training with EWC

To sequentially learn multiple tasks without catastrophic forgetting, we adopt the EWC method. The learning process consists of multiple stages. Initially, each variational parameter in the quantum classifier is randomly chosen within the range [−π, π]. In the k-th stage, the quantum classifier is trained with the modified loss function LEWC defined in Equation (2). At each training epoch, we calculate the gradients of LEWC on 25 data samples randomly selected from the training dataset, and evaluate the learning performance on all data samples in the test dataset. After the k-th stage, we obtain the Fisher information Fk,j for all variational parameters, which is used in the subsequent learning stages (see Supplementary Sec. I.B for detailed algorithms).
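The per-stage procedure can be summarized schematically as follows (our own sketch; loss_and_grads and estimate_fisher are assumed callables wrapping the hardware measurements, and the plain gradient step stands in for the Nadam updates used in the experiment):

```python
import numpy as np

def train_with_ewc(tasks, theta, loss_and_grads, estimate_fisher,
                   lam=60.0, epochs=20, batch_size=25, lr=0.05, seed=0):
    """Sequential EWC training over a list of tasks (schematic sketch).

    loss_and_grads(theta, task, batch) -> (loss, grads)  # parameter-shift measurements
    estimate_fisher(theta, task)       -> (P,) diagonal Fisher, Equation (5)
    """
    rng = np.random.default_rng(seed)
    history = []                                   # (lambda, fisher, theta_star) per old task
    for task in tasks:
        for _ in range(epochs):
            batch = rng.choice(len(task["x"]), size=batch_size, replace=False)
            _, grads = loss_and_grads(theta, task, batch)
            for l, fisher, theta_star in history:  # gradients of the EWC penalty terms
                grads = grads + l * fisher * (theta - theta_star)
            theta = theta - lr * grads
        history.append((lam, estimate_fisher(theta, task), theta.copy()))
    return theta
```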

In the experiment for learning \({{\mathcal{T}}}_{1}\), \({{\mathcal{T}}}_{2}\) and \({{\mathcal{T}}}_{3}\), we set λ2,1 = 60 in the second stage, and λ3,1 = 0 and λ3,2 = 60 in the third stage. We cancel the regularization term for \({{\mathcal{T}}}_{1}\) (λ3,1 = 0) in the third stage for two reasons. First, we expect the model to have fewer restrictions and thus more flexibility in adjusting parameters to learn \({{\mathcal{T}}}_{3}\). Second, after the second stage, the obtained parameters \({{\boldsymbol{\theta }}}_{2}^{\star }\) already retain knowledge of \({{\mathcal{T}}}_{1}\), since the regularization term for \({{\mathcal{T}}}_{1}\) was included during the second stage. Thus, by adding only the regularization term for \({{\mathcal{T}}}_{2}\), we can still preserve the learned knowledge from \({{\mathcal{T}}}_{1}\), as evidenced by the experimental results. Although the information from \({{\mathcal{T}}}_{1}\) will decay as more tasks are learned sequentially, given that there are only three tasks in total, it is reasonable to set λ3,1 = 0 for simplicity.

In addition, we simulate an idealized scenario in which \({{\mathcal{T}}}_{1}\), \({{\mathcal{T}}}_{2}\), and \({{\mathcal{T}}}_{3}\) are trained simultaneously, providing a best-case baseline for quantum continual learning. As shown in Fig. S11, this idealized setting achieves near-perfect performance.

Statistical analysis

In our experiments, we evaluate the prediction accuracies on 10 independent test datasets after sequential training. Each test set contains 50 data samples, and for each sample, 1200 repeated executions of the quantum circuit are performed to reliably determine the classification outcome.

The mean prediction accuracies, along with corresponding error bars, are presented in Fig. 6, clearly demonstrating the statistical robustness of the observed performance improvements when using the EWC method.

Fig. 6: Statistical analysis of experimental results for quantum continual learning.

a, b Cumulative distributions of prediction accuracy measured on 10 test datasets, each containing 50 randomly selected test samples, after sequentially learning the three tasks \({{\mathcal{T}}}_{1}\), \({{\mathcal{T}}}_{2}\) and \({{\mathcal{T}}}_{3}\). c, d Cumulative distributions of prediction accuracy for sequential learning of tasks \({{\mathcal{T}}}_{1}^{{\prime} }\) and \({{\mathcal{T}}}_{2}^{{\prime} }\).

Classical learning models for comparison

We specify the classical learning models used in numerical simulations for comparison. For quantum learning, we use a quantum circuit classifier from the same variational family as the circuit employed for relabeling the data for \({{\mathcal{T}}}_{1}^{{\prime}}\), but initialized with different parameters randomly generated from \([0,2\pi ]^{90}\). This quantum classifier contains a total of 90 variational parameters. For classical learning, we use a three-layer FFNN with ten neurons in the input layer, 20 neurons in the hidden layer, and one neuron in the output layer. The activation function is the sigmoid function. This FFNN contains a total of 241 variational parameters. The ten neurons in the input layer encode the ten-dimensional input data vectors for the two tasks. The neuron in the output layer determines the prediction outcome for the input data: if the output value is greater than 0.5, the input data is assigned to class 0; if the output value is less than 0.5, it is assigned to class 1. The CNN first applies a one-dimensional convolutional layer with a kernel size of three, mapping the single input channel to 20 output channels. The ReLU activation function is used, and padding (pad = 1) ensures that the output length remains 10. Next, a max pooling layer with a pooling window of size two is applied, reducing the length of the signal from 10 to 5. The resulting feature map, which has 100 entries, is flattened into a vector and fed into a fully connected layer with one neuron. A sigmoid activation function is used in this final layer to produce the output. In total, the CNN model contains 181 trainable parameters.
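For reference, a PyTorch sketch of these two classical baselines, reconstructed from the stated layer sizes (our own illustration; the parameter counts match the 241 and 181 quoted above):

```python
import torch
import torch.nn as nn

class FFNN(nn.Module):
    """Three-layer feedforward classifier: 10 -> 20 -> 1, sigmoid activations."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(10, 20), nn.Sigmoid(),
            nn.Linear(20, 1), nn.Sigmoid(),   # output > 0.5 -> class 0, else class 1
        )
    def forward(self, x):
        return self.net(x)

class CNN(nn.Module):
    """1D CNN baseline: conv (1 -> 20 channels, kernel 3, pad 1), ReLU, max-pool 2, linear."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(1, 20, kernel_size=3, padding=1)
        self.pool = nn.MaxPool1d(2)
        self.fc = nn.Linear(100, 1)
    def forward(self, x):                            # x: (batch, 10)
        h = torch.relu(self.conv(x.unsqueeze(1)))    # (batch, 20, 10)
        h = self.pool(h).flatten(1)                  # (batch, 100)
        return torch.sigmoid(self.fc(h))

for model in (FFNN(), CNN()):
    print(sum(p.numel() for p in model.parameters()))   # 241 and 181
```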