Abstract
Achieving high-fidelity quantum gates is crucial for reliable quantum computing. However, decoherence and control pulse imperfections pose significant challenges in realizing the theoretical fidelity of quantum gates in practical systems. To address these challenges, we propose the meta-reinforcement learning quantum control algorithm (metaQctrl), which leverages a two-layer learning framework to enhance robustness and fidelity. The inner reinforcement learning network focuses on decision making for specific optimization problems, while the outer meta-learning network adapts to varying environments and provides feedback to the inner network. Our comparative analysis regarding the realization of universal quantum gates demonstrates that metaQctrl achieves higher fidelity with fewer control pulses than conventional methods in the presence of uncertainties. These results can contribute to the exploration of the quantum speed limit and facilitate the implementation of quantum circuits with system imperfections involved.
Introduction
Recent theoretical and experimental studies have shown that quantum phenomena, such as entanglement and superposition, can be used to facilitate information science and technologies, including quantum computing, communications, and information storage, outperforming their classical counterparts1,2,3. Quantum control is crucial to the successful realization of these leading technologies and also plays an important role in areas such as atomic physics and physical chemistry4,5.
Quantum control is usually implemented by manipulating the Hamiltonians of quantum systems. By applying a series of unitary transformations driven by the control Hamiltonians, the target system can be steered to any desired state. Obtaining such a control sequence, especially the optimal one, is therefore a core problem of quantum control. A variety of control algorithms have been proposed in the past few decades and have achieved excellent results in specific models. For example, the gradient ascent pulse engineering (GRAPE) algorithm6, which performs gradient-based optimization of a model-based target function, has seen successful applications in quantum optimal control.
However, in reality, there is inevitably noise or inaccuracy in the information flow of quantum control systems, including but not limited to actuator output limitations7, sensor measurement errors8, interactions with the environment, and approximations in the physical models9. These factors can significantly degrade the performance of quantum control systems that must meet stringent precision requirements. For instance, in applications of GRAPE, limited knowledge of the system dynamics and control Hamiltonians may lead to control deviations or even loss of controllability. Although earlier search algorithms such as gradient ascent (GA) exhibit some robustness, they require considerable time to perform the optimization and are prone to becoming trapped in local optima10. Designing an efficient algorithm to tackle the uncertain dynamics of quantum systems therefore remains a challenging and meaningful task.
Since machine learning (ML) can efficiently and accurately complete classification or optimization tasks11, the combination of ML and quantum engineering has been gradually explored12,13. Owing to the advantages of ML, the difficulty of precisely modeling the dynamics of quantum systems can be alleviated, thereby improving control performance in various scenarios. In addition, reinforcement learning (RL) techniques in ML14 have been adopted, including value-based algorithms15 and policy gradient-based algorithms16, to discover optimal quantum control strategies. RL typically consists of two basic components, namely, the environment and the agent. The environment can be established based on specific task requirements, such as games17 and optimization problems18, while the agent is responsible for the decision-making process within the task and is expected to learn the optimal policy by interacting with the environment. In particular, a series of RL algorithms have been specifically designed for the control of quantum gates, through approaches such as the reconstruction of the reward function19 and strategy optimization20.
However, existing results indicate that deep RL has at least two drawbacks compared to human beings. First, training an intelligent agent requires a large amount of data, whereas humans can perform reasonably well in a novel environment with significantly less data. This limitation is particularly prominent in quantum optimal control problems, as the exponential growth in the size of quantum states leads to a rapid decline in sample efficiency. Second, deep RL specializes in a narrow task domain, while humans can effortlessly apply knowledge acquired in other scenarios to the current task. Consequently, improving the generalization of RL algorithms across different tasks has been considered to overcome these shortcomings21,22,23.
Inspired by deep meta-RL24,25,26,27, in this paper we propose a new algorithm, the meta-reinforcement learning quantum control algorithm (metaQctrl), which provides a generic and robust agent that can adapt to different parametric settings in quantum systems. This algorithm significantly reduces the training time required for the agent to handle diverse task environments and facilitates the transfer of the algorithm among different control domains. Since quantum gates serve as fundamental operational elements in quantum computing28, achieving high-fidelity quantum gates is of the utmost importance to ensure accuracy and efficiency in the vast majority of quantum technologies. The metaQctrl algorithm has been applied to design single-qubit and multi-qubit gates, with the purpose of obtaining ultrahigh fidelity (of the order of 99.99%) in the presence of uncertainties or disturbances. For single-qubit quantum gates, compared to GA and GRAPE, our algorithm requires fewer control steps and produces smoother control curves. With disturbances or uncertainties involved, the infidelity of the quantum gate controlled by our algorithm is an order of magnitude lower than that of the traditional algorithms (GA, GRAPE), as demonstrated in the simulation results. Furthermore, in the multi-qubit quantum gate control system with disturbances or uncertainties, our algorithm consistently outperforms the RL algorithm Proximal Policy Optimization (PPO) in terms of fidelity and control step length. Thus, it can be concluded that metaQctrl is capable of providing robust control strategies that adapt to complex environments. In addition, since metaQctrl requires fewer control pulses, our proposed algorithm can contribute to the exploration of the quantum speed limit when system imperfections are involved29,30,31.
Results
To demonstrate the features and strengths of metaQctrl, in this section we discuss robust control of single- and multi-qubit quantum gates in the presence of uncertainty. We numerically compare our proposed algorithm with GRAPE, GA, and PPO within the same system setup in the simulation, with the assistance of the Gymnasium toolkit (the maintained successor of OpenAI Gym) and QuTiP32.
Robust control of a single-qubit quantum gate
Considering robust control of a single-qubit quantum gate, the system Hamiltonian can be described by
where the control variables ux(t) and uy(t) are restricted to the interval \(\left[-5,5\right]\). The control horizon is T = 1.6 and the maximum number of control pulses is N = 40. The uncertainty involved in the system, μ(t), is assumed to obey the Gaussian distribution \({\mathbb{N}}(0,{\eta }^{2})\), with a standard deviation \(\eta \in \left[0,1\right]\). The control objective is to transfer the quantum gate from the initial state U(0) to the final state U(T), with the end time T, such that the fidelity \({\mathcal{F}}(U(T),{U}_{f}) > 1-\varepsilon\) is guaranteed, as prescribed in Eq. (14). Specifically, we take \(U(0)={\rm{I}}=\left(\begin{array}{rc}1&0\\ 0&1\end{array}\right)\) and \({U}_{f}={U}_{{\rm{H}}}=\frac{1}{\sqrt{2}}\left(\begin{array}{rc}1&1\\ 1&-1\end{array}\right)\), where UH is the Hadamard gate widely used in quantum computing, and the convergence criterion is ε = 10−4.
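For concreteness, a minimal numerical sketch of this setup is given below. The noisy Hamiltonian form used here, H = (1 + μ)(ux σx + uy σy), is only an illustrative assumption (the paper's exact Hamiltonian may differ), and `rollout` stands in for applying a learned pulse sequence; the numerical values (T = 1.6, N = 40, control bounds [−5, 5], ε = 10−4) follow the text.

```python
import numpy as np
from scipy.linalg import expm

# Pauli matrices and the Hadamard target
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]], dtype=complex)
U_H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)

T, N, eta, eps = 1.6, 40, 0.3, 1e-4     # horizon, max pulses, noise level, criterion
dt = T / N
u_min, u_max = -5.0, 5.0

def gate_fidelity(U, U_target):
    """Normalized gate fidelity |Tr(U_target^dag U)|^2 / d^2, with values in [0, 1]."""
    d = U.shape[0]
    return np.abs(np.trace(U_target.conj().T @ U)) ** 2 / d ** 2

def rollout(controls, eta, rng):
    """Apply piecewise-constant noisy pulses; controls has shape (N, 2) for (u_x, u_y)."""
    U = np.eye(2, dtype=complex)
    for ux, uy in np.clip(controls, u_min, u_max):
        mu = rng.normal(0.0, eta)                 # Gaussian disturbance sample
        H = (1.0 + mu) * (ux * sx + uy * sy)      # assumed (illustrative) noisy Hamiltonian
        U = expm(-1j * H * dt) @ U
        if gate_fidelity(U, U_H) > 1.0 - eps:     # stop early once the criterion is met
            break
    return U

rng = np.random.default_rng(0)
controls = rng.uniform(u_min, u_max, size=(N, 2))  # stand-in for a learned pulse sequence
print("fidelity:", gate_fidelity(rollout(controls, eta, rng), U_H))
```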
The control trajectories of four algorithms (GRAPE, GA, PPO, metaQctrl) corresponding to η = 0.3 are depicted in Fig. 1. It can be clearly seen that only metaQctrl and PPO successfully complete the control task in the presence of disturbance or uncertainty, while GRAPE and GA fail to meet the fidelity accuracy requirements within the maximum control step limit. Although GRAPE and GA exhaust all allocated control steps, this does not imply that our control step length N is unreasonable, since metaQctrl and PPO achieve their control objectives in fewer than N/2 steps. It can be concluded that metaQctrl and PPO, which combine efficient exploration with accuracy, outperform GRAPE and GA in realizing global optimal control. In fact, metaQctrl and PPO adopt a strategic exploration approach by boldly advancing when far from the target and making incremental adjustments as they approach, thus preventing them from missing the target and falling into local optima.
The system evolution of a GRAPE, b GA, c PPO, d metaQctrl in the realization of a single-qubit Hadamard gate with η = 0.3. On the left side of each graph, the Bloch sphere is utilized to visually show the state trajectory over time, since changes in quantum gates can be observed through changes in quantum states. The initial and target quantum states are marked on the Bloch sphere by the green and orange arrows, respectively. The blue dots denoting the outcome of the control action correspond to earlier steps, while the purple dots indicate steps closer to completion. If a dot reaches the orange arrow before the maximum number of control steps is exhausted, the control task has been completed successfully. On the right side of each graph, numerical results show how the fidelity of the quantum gate changes as the number of control steps increases.
In addition, the maximum fidelity achieved by the four algorithms with varying η is evaluated in Fig. 2. Here, η ranges uniformly from 0 to 1 with a step size of 0.05. The simulation results represent average values obtained from 100 Monte Carlo tests. PPO is trained with the disturbance parameter η = 0.3, while metaQctrl can handle the disturbance distribution \({\mathcal{P}}\) with \(\eta \sim {\mathbb{U}}(0,1)\). In particular, the convergence criterion ε is set to 10−4 in Fig. 2a–c and to 10−6 in Fig. 2d.
a The averaged maximum fidelity of metaQctrl, GRAPE and GA when ε = 10−4. b The averaged maximum fidelity of metaQctrl and PPO when ε = 10−4. c The averaged number of control pulses required for both metaQctrl and PPO to reach the maximum fidelity when ε = 10−4. d The averaged number of control pulses required for both metaQctrl and PPO to reach the maximum fidelity when ε = 10−6.
The fidelity values obtained by metaQctrl compared to GRAPE and GA are plotted in Fig. 2a. It is evident that metaQctrl exhibits significantly enhanced robustness compared to traditional algorithms. The differences between traditional methods and metaQctrl are subtle in uncertainty-free scenarios. However, as the intensity of the disturbance increases, so does the performance gap. Even in the extreme case (η = 1.0), metaQctrl achieves a maximum fidelity 10% higher than GRAPE and 30% higher than GA. The control performances of metaQctrl and PPO in terms of maximum fidelity are compared in Fig. 2b. It can be seen that metaQctrl excels in maintaining robust control performance across a broader range of disturbances, leveraging its unique meta-reinforcement learning framework. Figure 2c compares the control steps of metaQctrl and PPO when ε = 10−4. It can be observed from Fig. 2c that the exploration intensity of PPO increases in order to handle the system uncertainties, resulting in a rapid escalation of its total control steps. In contrast, the control steps for metaQctrl remain relatively stable across varying disturbance distributions and even exhibit a slight decreasing trend as η increases. However, as shown in Fig. 2d, the number of control steps for metaQctrl increases as η increases when ε = 10−6. This is due to the fact that the number of control steps required by metaQctrl is closely related to the choice of the convergence criterion ε and indeed the scale of the target neighborhood relative to the disturbances. When ε = 10−4, the scale of the target neighborhood relative to the disturbances is sufficiently large, and consequently, the trajectory of the system is unlikely to deviate from the target neighborhood. Since metaQctrl aims to avoid the oscillation of the control trajectory in the target neighborhood, it accelerates convergence during the later stages of control, and thus, the number of control steps decreases slightly as η increases in this case. In contrast, when ε = 10−6, the scale of the target neighborhood relative to the disturbances is not large enough. More pronounced oscillations near the target region during later stages of control increase the number of control steps as η increases.
To further compare the robustness of metaQctrl with that of PPO, we take into account multiple uncertainties or disturbances in the Hamiltonian, which gives
Here, μ0(t) and μu(t) obey the Gaussian distributions \({\mathbb{N}}(0,{\eta }_{0}^{2})\) with standard deviation \({\eta }_{0} \sim {\mathbb{U}}(0,1)\) and \({\mathbb{N}}(0,{\eta }_{u}^{2})\) with standard deviation \({\eta }_{u} \sim {\mathbb{U}}(0,1)\), respectively. The control objective is to transfer the quantum gate from U(0) = I to \(U(T)={U}_{{\rm{H}}}=\frac{1}{\sqrt{2}}\left(\begin{array}{rc}1&1\\ 1&-1\end{array}\right)\) (the Hadamard gate) or \(U(T)={U}_{{\rm{\pi /8}}}=\left(\begin{array}{rc}1&0\\ 0&{e}^{i\pi /4}\end{array}\right)\) (the π/8 gate) or \(U(T)={U}_{{\rm{S}}}=\left(\begin{array}{rc}1&0\\ 0&i\end{array}\right)\) (the phase gate), with the end time T = 1.6 and the maximum number of control pulses N = 40.
As shown in Fig. 3, metaQctrl clearly outperforms PPO, exhibiting stronger robustness (quantified by infidelity) in the presence of multiple uncertainties. The infidelity achieved by metaQctrl can reach 10−5 or even lower (e.g., for the Hadamard gate) when the standard deviation η0 (ηu) is not large. Moreover, metaQctrl tolerates a wider range of deviations than PPO, with fidelity values greater than 99.99% obtained even when the system is subject to multiple uncertainties. Table 1, along with Fig. 3, provides the ratios of the areas corresponding to the key infidelity thresholds (10−4 and 10−5), relative to the entire space enclosed by η0 and ηu, for metaQctrl and PPO, respectively.
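The threshold-area ratios of the kind reported in Table 1 can be estimated by sampling the (η0, ηu) plane and counting how often a controller stays below each infidelity threshold. The sketch below illustrates this evaluation; `evaluate_infidelity` is a hypothetical placeholder for running a given controller (metaQctrl or PPO) at fixed (η0, ηu) and returning the resulting infidelity.

```python
import numpy as np

def area_ratios(evaluate_infidelity, n_grid=21, thresholds=(1e-4, 1e-5)):
    """Fraction of the (eta0, eta_u) grid over [0,1]x[0,1] whose infidelity is below each threshold."""
    etas = np.linspace(0.0, 1.0, n_grid)
    infid = np.array([[evaluate_infidelity(e0, eu) for eu in etas] for e0 in etas])
    return {th: float(np.mean(infid < th)) for th in thresholds}

# Example with a dummy infidelity model (placeholder only):
dummy = lambda e0, eu: 1e-6 + 1e-4 * (e0 ** 2 + eu ** 2)
print(area_ratios(dummy))
```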
The control performance of metaQctrl, compared to that of PPO, in the realization of the single-qubit a Hadamard gate, b π/8 gate, and c phase gate, with the uncertainties μ0(t) and μu(t) involved. The black dashed and red dash-dotted boundary lines correspond to infidelity values of 10−5 and 10−4, respectively.
Robust control of a two-qubit quantum gate
We now consider robust control of a two-qubit quantum gate, with the Hamiltonian of the system given by
where the control variables \({u}_{x}^{j}(t),{u}_{y}^{j}(t)\in \left[-5,5\right]\) (\(j\in \left\{1,2\right\}\)), and the control horizon is T = 2.0. The maximum number of control pulses is N = 50, with the quantum gate initialized at U(0) = I. In this scenario, the desired quantum gate is Uf = UCNOT (no local z-phase error correction is involved), with UCNOT the controlled-NOT (CNOT) gate, and the convergence criterion is ε = 10−3. The CNOT gate is of the following form
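For reference, in the computational-basis ordering \(\{\left\vert 00\right\rangle ,\left\vert 01\right\rangle ,\left\vert 10\right\rangle ,\left\vert 11\right\rangle \}\) with the first qubit as control, the standard CNOT matrix reads

$${U}_{{\rm{CNOT}}}=\left(\begin{array}{cccc}1&0&0&0\\ 0&1&0&0\\ 0&0&0&1\\ 0&0&1&0\end{array}\right).$$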
The maximum fidelity achieved by the four algorithms (GRAPE, GA, PPO, and metaQctrl) with varying η is evaluated in Fig. 4. As shown in Fig. 4a, the control performance of metaQctrl is compared with that of GRAPE and GA. The traditional algorithms (GRAPE and GA) become increasingly sensitive to uncertainty as η increases in the two-qubit system. Consequently, the fidelity of the quantum gates they control quickly drops below 0.9, failing to meet the requirements of actual control tasks. In contrast, metaQctrl can provide infidelity of the order of 10−4. Figure 4b compares the maximum fidelity values of metaQctrl and PPO. Both metaQctrl and PPO are affected by the increase in the number of qubits, resulting in slightly lower fidelity for two-qubit systems compared to single-qubit systems. Nevertheless, metaQctrl shows a more gradual downward trend and achieves higher fidelity than PPO for different values of η. This indicates that the control algorithm proposed in this paper is indeed more robust. Through training with the meta-reinforcement learning framework, metaQctrl learns to capture dynamic environmental information at a higher level and can quickly adjust its strategy according to changes in the environment. By comparing the total control steps in Fig. 4c, we observe that while PPO exhausts all available control steps, metaQctrl is able to find effective control strategies with shorter control sequences when η is small.
Discussion
Drawing inspiration from meta-learning, the metaQctrl algorithm proposed in this paper aims to enhance the control performance of quantum gates in the presence of system imperfections. Specifically, in the framework of metaQctrl, we incorporate a meta-learning outer loop on top of the RL inner loop and employ an alternating training approach. This two-layer structure significantly enhances the algorithm's robustness against environmental changes. The meta-learning outer loop enables the extraction of task-specific knowledge in the presence of uncertainties. By acquiring knowledge from various tasks, the inner loop requires only minimal new data to determine optimal control strategies for novel tasks, thereby realizing robust control of quantum gates with disturbances or uncertainties involved. Furthermore, in the simulation examples, we have evaluated the performance of our proposed algorithm with respect to robust control of single-qubit quantum gates and the CNOT gate, which together constitute a universal gate set. The results and analyses show that metaQctrl clearly outperforms GRAPE and GA, and that metaQctrl achieves improved robustness with fewer control pulses compared to PPO. It is thus promising that metaQctrl can help to explore the quantum speed limit. Additionally, beyond quantum computing applications such as the robust realization of quantum circuits, metaQctrl can be straightforwardly applied to quantum metrology or communication tasks that involve quantum state manipulation with system imperfections. Future work may include generalizing our method to more complicated quantum systems and taking fully time-varying disturbances into account.
Methods
Robust control of quantum gates
The evolution of a quantum system can be described by the following Schrödinger equation
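In a standard controlled form (setting ℏ = 1), consistent with the notation used below, this reads

$$i\frac{{\rm{d}}}{{\rm{d}}t}\left\vert \psi (t)\right\rangle =\left[{H}_{s}+\sum _{k}{u}_{k}(t){H}_{k}\right]\left\vert \psi (t)\right\rangle .$$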
Here, ψ(t) represents the current quantum state of the system, Hs denotes the inherent Hamiltonian of the system, and Hk is the Hamiltonian generated by the k-th external radio frequency (RF) field, with the RF fields being independent and mutually orthogonal; uk(t) denotes the amplitude of the corresponding control pulse. Furthermore, we assume that the target quantum system is governed by an internal Hamiltonian generated by spin-spin coupling. RF fields applied along the orthogonal spin x- and y-directions can then be used to manipulate the quantum system via the control pulses. The specific form of the Hamiltonian is then as follows
where the operator \({X}_{j}={{\rm{I}}}^{1}\otimes \cdots \otimes {{\rm{I}}}^{j-1}\otimes {\sigma }_{x}^{j}\otimes {{\rm{I}}}^{j+1}\otimes \cdots \otimes {{\rm{I}}}^{n}\) denotes the Pauli operator σx applied to the j-th qubit in a system consisting of n qubits (I denotes the identity operator). Similarly, Yj and Zj are defined corresponding to the Pauli operators σy and σz of the j-th qubit, respectively, and the control pulses are represented by \(\overrightarrow{u}(t)\). This type of system model is widely applied in a variety of quantum systems, including superconducting qubits33 and trapped ions34. In order to establish a universal control method that accounts for realistic imperfections, including dissipation, decoherence, and other uncertainties, the parametric disturbance μ(t) is introduced into our control model. Note that there may be multiple disturbances or uncertainties in the system (i.e., μ1(t), μ2(t), ⋯); in this section we take a single μ(t) to illustrate the control scheme without loss of generality. To further improve the adaptive capacity of the algorithm, we assume that μ(t) obeys a truncated Gaussian distribution with zero mean and an uncertain standard deviation η. That is, we consider \(\mu (t) \sim {\rm{clip}}[{\mathbb{N}}(0,{\eta }^{2}),-1,1]\).
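The tensor-product operators Xj, Yj, Zj and the clipped-Gaussian disturbance μ(t) can be constructed directly; a minimal NumPy sketch is given below (QuTiP offers equivalent helpers for the operator construction).

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
sigma = {
    "x": np.array([[0, 1], [1, 0]], dtype=complex),
    "y": np.array([[0, -1j], [1j, 0]], dtype=complex),
    "z": np.array([[1, 0], [0, -1]], dtype=complex),
}

def pauli_on(j, n, axis):
    """sigma_axis acting on qubit j (0-indexed) of an n-qubit register: I ⊗ ... ⊗ sigma ⊗ ... ⊗ I."""
    op = np.array([[1.0 + 0j]])
    for k in range(n):
        op = np.kron(op, sigma[axis] if k == j else I2)
    return op

def sample_disturbance(eta, rng):
    """mu(t) ~ clip[N(0, eta^2), -1, 1], as specified in the text."""
    return float(np.clip(rng.normal(0.0, eta), -1.0, 1.0))

rng = np.random.default_rng(0)
X1 = pauli_on(0, 2, "x")      # sigma_x on the first qubit of a two-qubit system
print(X1.shape, sample_disturbance(0.3, rng))
```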
Let T denote the end time and N the number of control pulses. The quantum state of the target system can then be calculated by
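A standard expression consistent with the description that follows is

$$\left\vert \psi (T)\right\rangle ={\mathcal{D}}\exp \left(-i\int_{0}^{T}H(\overrightarrow{u}(t))\,{\rm{d}}t\right)\left\vert \psi (0)\right\rangle ,$$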
where \({\mathcal{D}}\) denotes the Dyson time-ordering operator. In particular, the quantum gate U(T) is given by \(U(T)=\mathop{\prod }\nolimits_{j = 1}^{N}{U}_{j}({{\Delta }}t)\) with \({U}_{j}({{\Delta }}t)={e}^{-iH({\vec{u}}(j{{\Delta }}t)){{\Delta }}t}\) and Δt = T/N. The control goal is to drive the controlled quantum gate towards the desired quantum gate Uf, which satisfies \(\left\vert \psi (t)\right\rangle ={U}_{f}\left\vert \psi (0)\right\rangle\). Our aim is to discover the optimal control scheme that enables the controlled quantum gate to transition from the initial state to the target state with greater precision and in less time.
The precision of a controlled quantum gate can be quantified by the fidelity35,36
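A commonly used normalized form for an n-qubit gate is

$${\mathcal{F}}(U(T),{U}_{f})=\frac{1}{{2}^{2n}}{\left\vert {\rm{Tr}}\left({U}_{f}^{\dagger }U(T)\right)\right\vert }^{2},$$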
with the range of \({\mathcal{F}}\) being \(\left[0,1\right]\). The closer the fidelity is to 1, the higher the precision of the quantum gate. The objective function of the quantum gate control problem can then be defined by
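In a form consistent with the discussion below, this amounts to seeking, for each disturbance realization, the control that maximizes the fidelity, together with the mapping from disturbances to such controls:

$${\overrightarrow{u}}_{\mu (t)}^{* }=\arg \mathop{\max }\limits_{\overrightarrow{u}}{\mathcal{F}}(U(T),{U}_{f}),\qquad {{\mathcal{U}}}^{* }:\mu (t)\mapsto {\overrightarrow{u}}_{\mu (t)}^{* }.$$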
The definition of \({\mathcal{U}}\) should be emphasized, as it deviates significantly from traditional quantum control problems. Previous studies generally focus on identifying the optimal control \({\overrightarrow{u}}_{\mu (t)}^{* }\) in undisturbed environments (μ(t) ≡ 0) or in fixed disturbed environments (μ(t) subject to a specific distribution), so as to make \({\mathcal{F}}(U(T),{U}_{f})\) as close to 1 as possible. In contrast, our goal is to seek a generic mapping \({{\mathcal{U}}}^{* }:\mu (t)\to {\overrightarrow{u}}_{\mu (t)}^{* }\) rather than solving specific cases. This pursuit is more challenging than obtaining solutions for a given μ(t). For example, if an algorithm has been developed to determine \({\overrightarrow{u}}_{\mu (t)}^{* }\), such as GRAPE or GA, one would have to redesign or retrain the algorithm to adapt to the task in a new environment, which is very time-consuming. Our algorithm, explained in detail in the following section, intelligently reduces the transition time between different disturbed environments, and the robustness can therefore be enhanced.
The framework of metaQctrl
The decision-making process in the framework of metaQctrl is derived from RL, based on the Markov decision process (MDP). MDP can be defined by a 5-tuple \(\left(S,A,P,R,\gamma \right)\), where S represents the state space, A denotes the action space, P is the transfer function providing the probability of transferring from a given state to a new one after taking an action, R is the reward function mapping the result of an action taken in a specific state to a value, and γ represents the discount factor.
As shown by the yellow loop in Fig. 5, RL can be viewed as an interactive process between an agent and the corresponding external environment. Based on the current state of the environment st and the associated policy function, the agent (denoted by the blue brain) takes the action at at time t, leading to three consequences: the agent receives a reward rt, the state of the environment changes from st to st+1, and the agent receives the new observation st+1. Through continuous interactions with the environment, the agent has access to a substantial amount of decision-making data, which in turn enables further optimization of the policy function.
The yellow loop represents the general setup of RL. Within this loop, the agent engages with a specific RL environment to acquire the optimal strategy. Multiple RL environments, drawn from specific distributions, can be encompassed here. Classifying these environments into distinct tasks, the agent is expected to autonomously adapt its policy to accommodate changes in the environments.
Throughout this process, the agent aims to find a policy π(a∣s) to optimize the total expected reward defined as follows
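In standard notation, consistent with the definitions that follow, this objective can be written as

$$J(\theta )=\sum _{s}{d}^{\pi }(s)\sum _{a}{\pi }_{\theta }(a| s){Q}^{\pi }(s,a),$$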
where \({Q}^{\pi }(s,a)\) denotes the expected discounted return obtained by taking action a in state s and thereafter following π, and dπ(s) is the stationary distribution of the Markov chain induced by πθ (i.e., the on-policy state distribution under π), mathematically defined by \({d}^{\pi }(s)=\mathop{\lim }\nolimits_{t\to \infty }P({s}_{t}=s| {s}_{0},\pi )\). A policy-based RL algorithm is applied to optimize the policy π, approximated by a neural network πθ (θ represents the parameters of the neural network), so as to maximize the objective J(θ). A well-known result, the policy gradient theorem, gives the gradient of this objective as follows
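In its standard form, written in the notation above, the theorem states

$${\nabla }_{\theta }J(\theta )={{\mathbb{E}}}_{s \sim {d}^{\pi },a \sim {\pi }_{\theta }}\left[{Q}^{\pi }(s,a){\nabla }_{\theta }\log {\pi }_{\theta }(a| s)\right].$$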
However, RL algorithms generally face challenges in deriving the optimal solution for the objective discussed in section “Robust control of quantum gates” due to their specificity to particular environments. Once the environment changes, the originally learned optimal strategy becomes obsolete and must be retrained for the new environment, which is highly impractical for quantum systems. Therefore, in this paper we introduce meta-learning into established RL algorithms, since meta-learning helps acquire knowledge quickly and adapt to new task scenarios based on existing knowledge. This process can also be viewed as maximizing the sensitivity of the loss function of a new task to the model parameters. When this sensitivity is high, even small alterations in the model parameters can lead to significant improvements in the loss. The intelligent agent can be constructed by training the model on various tasks, as illustrated in Fig. 5. In particular, for the control objective \({{\mathcal{U}}}^{* }\), metaQctrl can contribute to increased robustness in the presence of system disturbances or uncertainties.
The associated model parameters can be defined as follows. The state st is given by \({s}_{t}=\left\{\Re ({[U]}_{00}),\ldots ,\Re ({[U]}_{({2}^{n}-1)({2}^{n}-1)}),\,\Im ({[U]}_{00}),\ldots ,\Im ({[U]}_{({2}^{n}-1)({2}^{n}-1)})\right\}\), where [U]jk represents the element in the j-th row and k-th column of the unitary operator U(t) that describes the controlled quantum gate at time t, and ℜ( ⋅ ) and ℑ( ⋅ ) denote the real and imaginary parts of an operator, respectively. The action is \({a}_{t}=\left\{{u}_{x}^{1},\ldots ,{u}_{x}^{n},{u}_{y}^{1},\ldots ,{u}_{y}^{n}\right\}\), with each \({u}_{k}^{j}\) bounded within the range \([{u}_{\min },{u}_{\max }]\). The reward rt is set as a piecewise function related to \({\mathcal{F}}(U(T),{U}_{f})\), namely
where ε defines the convergence criterion. By mapping the fidelity of quantum gates to rewards, our framework effectively transforms the control of quantum gates into a solvable task using RL algorithms. When fidelity is not high enough, an increase in fidelity leads to an increase in the corresponding reward, thus motivating the agent to quickly approach the target state. When fidelity is already high, segmented rewards are used to encourage exploratory behavior aimed at discovering strategies with even higher fidelity. In addition to promoting exploration, we also incorporate a penalty coefficient for step length to expedite convergence and prevent excessive exploration.
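A minimal sketch of this state encoding and of a fidelity-based piecewise reward is given below. The particular thresholds, reward magnitudes, and step-penalty coefficient are illustrative placeholders only, not the exact values used by metaQctrl.

```python
import numpy as np

def encode_state(U):
    """State s_t: real and imaginary parts of all entries of the current gate U(t)."""
    return np.concatenate([U.real.ravel(), U.imag.ravel()])

def reward(fidelity, step, eps=1e-4, step_penalty=0.01):
    """Illustrative piecewise reward: grows with fidelity, jumps once the convergence
    criterion is met, and pays a small per-step penalty to discourage long pulse sequences."""
    if fidelity > 1.0 - eps:        # convergence criterion met: large terminal bonus
        return 100.0 - step_penalty * step
    if fidelity > 0.99:             # high-fidelity region: encourage fine exploration
        return 10.0 * fidelity - step_penalty * step
    return fidelity - step_penalty * step
```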
The metaQctrl algorithm, as illustrated in Fig. 6, comprises an inner loop and an outer loop. The outer loop represents a meta-learning process that maintains a global policy parameter \({\theta }_{j}^{* }\). It acquires a set of tasks \({\mathcal{T}}\) by sampling the quantum environment \({\mathcal{P}}\). Each task, denoted \({{\mathcal{T}}}_{k}\), can be viewed as an individual RL environment. Upon receiving each task \({{\mathcal{T}}}_{k}\) from the outer loop, the inner loop performs an independent RL procedure with its current policy parameter \({\theta }_{j}^{* }\) determined by the outer loop. It then executes the MDP and collects state-reward pairs to update its own policy parameters accordingly. To accomplish these procedures, the inner loop employs an actor-critic network-based policy improvement method to update the network parameters as follows,
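As also written in step 9 of Algorithm 1, this policy-gradient update takes the form

$${\theta }_{j}^{k}={\theta }_{j}^{* }+{\nabla }_{{\theta }_{j}^{* }}\log {\pi }_{{\theta }_{j}^{* }}(s,a){{\mathcal{A}}}_{\pi }(s,a).$$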
Here, \({{\mathcal{A}}}_{\pi }(s,a)\) denotes the advantage function whose value can be estimated by the value network. Once the inner loop is updated to obtain a new parameter \({\theta }_{j}^{k}\), the MDP is immediately reexecuted with the updated policy parameter. The newly obtained state-reward pairs can then be used to perform the following update process21
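A MAML-style form consistent with the quantities defined next is (the loss-weighted variant described below modifies how the task gradients are combined)

$${\theta }_{j+1}^{* }={\theta }_{j}^{* }+\beta \sum _{{{\mathcal{T}}}_{k} \sim {\mathcal{P}}}{\nabla }_{{\theta }_{j}^{* }}{{\mathcal{L}}}_{{{\mathcal{T}}}_{k}}({f}_{{\theta }_{j}^{k}}).$$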
The parameter β defines the update rate of meta-learning, \({f}_{{\theta }_{j}^{k}}\) denotes the neural network of the inner loop with the updated policy parameter \({\theta }_{j}^{k}\), and \({\nabla }_{{\theta }_{j}^{* }}{{\mathcal{L}}}_{{{\mathcal{T}}}_{k}}(\cdot )\) denotes the gradient of the returns associated with the updated policy. The core idea of our framework is to use the magnitudes of the task loss values as weights, with the update direction of each subtask serving as the basic direction for updating the global initialization parameters. It is worth noting that using the loss value of the new strategy as a weight, instead of directly summing gradients, mitigates the risk of convergence towards local optima. Through multiple iterations, the outer loop acquires prior knowledge about the dynamic environments of quantum systems from various tasks, thereby achieving adaptability. Furthermore, the outer loop samples disturbance distributions and performs meta-updates, which can be seen as a process of estimating the disturbance. Consequently, the inner loop can adaptively adjust control signals in varying noisy environments, thus enhancing robust performance against disturbances.
The training process of the inner loop resembles general RL for a specific task, while meta-learning is embodied in the outer loop. The outer loop partitions the results obtained from sampling the quantum environment into different tasks and delivers them to the inner loop for processing. The data acquired from each inner loop are then collected and used to update the parameters of the agent.
Inspired by policy approximation methods and the Kullback–Leibler divergence37, we optimize the update method for the meta-policy parameters to better measure the distance between policies and to effectively control the policy update step size. Namely,
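In the standard clipped-surrogate form, this objective (Eq. (18)) can be written as

$${{\mathcal{L}}}_{{\theta }_{j}^{k}}^{{\rm{Clip}}}(\theta )={{\mathbb{E}}}_{t}\left[\min \left({c}_{t}(\theta ){{\hat{\mathcal{A}}}}_{t}^{{\pi }_{k}},\,{\rm{clip}}({c}_{t}(\theta ),1-\varepsilon ,1+\varepsilon ){{\hat{\mathcal{A}}}}_{t}^{{\pi }_{k}}\right)\right],$$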
where \({{\hat{\mathcal{A}}}}_{t}^{{\pi }_{k}}={Q}^{{\pi }_{k}}(s,a)-{V}^{{\pi }_{k}}(s)\) is the advantage estimate, and the ratio between the new and the old policies is given by \({c}_{t}(\theta )=\frac{{\pi }_{\theta }({a}_{t}| {s}_{t})}{{\pi }_{{\theta }_{{\rm{old}}}}({a}_{t}| {s}_{t})}\); when the new and old policies are identical, ct(θ) = 1. Both \(\min (\cdot )\) and clip( ⋅ ) in Eq. (18) are used to keep the effective policy update ratio within the predefined range [1 − ε, 1 + ε], thereby limiting the step size of the meta-updating process.
As illustrated in Algorithm 1, whenever a new control task is started, a new batch of data is collected online and used to update the network parameters. Using the prior knowledge obtained during pretraining, the algorithm can efficiently identify the optimal policy parameters tailored to the current noisy conditions. Upon completion of the training stage, the system outputs a control sequence according to the control task. Although the algorithm relies on a sampling-based training process, it benefits from well-pretrained model parameters, which keeps the online training process fast. Remarkably, its average runtime is significantly less than that of traditional algorithms such as GRAPE and GA, which will be explained in detail in the following section.
Algorithm 1
metaQctrl
Input: Target quantum gate Uf, distribution of quantum environment \({\mathcal{P}}\), convergence criterion ε, length of training steps \({\mathcal{J}}\)
Output: Optimal policy θ*
1: Initialize quantum gate U(0), meta-hyperparameter \({\theta }_{0}^{* }\), training steps j, trial buffer B
2: Build an actor-critic network with weights \({\theta }_{0}^{* }\)
3: while \(j \,<\, {\mathcal{J}}\) do
4: sample batch of tasks \({{\mathcal{T}}}_{k}\) from \({\mathcal{P}}\)
5: for each task \({{\mathcal{T}}}_{k}\) in the batch do
6: initialize the inner actor-critic network with \({\theta }_{j}^{* }\)
7: run the inner actor-critic network to get n episodes
8: compute advantage estimates for each step
9: compute \({\theta }_{j}^{k}={\theta }_{j}^{* }+{\nabla }_{{\theta }_{j}^{* }}\log {\pi }_{{\theta }_{j}^{* }}(s,a){{\mathcal{A}}}_{\pi }(s,a)\)
10: run the actor-critic network with \({\theta }_{j}^{k}\) again and store the episodes into B
11: end for
12: update \({\theta }_{j+1}^{* }\leftarrow \arg \mathop{\max }\limits_{\theta }{\sum }_{{{\mathcal{T}}}_{k} \sim {\mathcal{P}}}{{\mathcal{L}}}_{{\theta }_{j}^{k}}^{{\rm{Clip}}}(\theta )\) with episodes in B
13: j = j + 1
14: end while
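The control flow of Algorithm 1 can be summarized in the Python-style skeleton below. The helper functions (`sample_tasks`, `collect_episodes`, `compute_advantages`, `inner_update`, `clipped_meta_update`) are hypothetical placeholders for the actor-critic rollouts, advantage estimation, inner policy-gradient step, and clipped meta-objective of Eq. (18); the sketch illustrates only the data flow between the inner and outer loops, not a working RL implementation.

```python
def meta_qctrl_train(env_distribution, theta_star, n_meta_steps, n_tasks,
                     sample_tasks, collect_episodes, compute_advantages,
                     inner_update, clipped_meta_update):
    """Structural skeleton of Algorithm 1 (metaQctrl): outer meta-loop over sampled
    task environments, inner policy-gradient adaptation per task."""
    for j in range(n_meta_steps):                              # outer (meta-learning) loop
        buffer = []                                            # trial buffer B
        for task in sample_tasks(env_distribution, n_tasks):   # batch of tasks from P
            episodes = collect_episodes(task, theta_star)           # run inner actor-critic
            advantages = compute_advantages(episodes)               # per-step advantage estimates
            theta_k = inner_update(theta_star, episodes, advantages)  # step 9: adapted parameters
            buffer.append(collect_episodes(task, theta_k))           # step 10: re-run with theta_k
        theta_star = clipped_meta_update(theta_star, buffer)    # step 12: clipped meta-objective
    return theta_star
```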
Data availability
All relevant research data are available from the authors upon request.
Code availability
All relevant source codes are available from the authors upon request.
References
Nielsen, M. A. & Chuang, I. L. Quantum Computation and Quantum Information (Cambridge Univ. Press, 2011).
Xia, F. et al. Random walks: a review of algorithms and applications. IEEE Trans. Emerg. Top. Comput. Intell. 4, 95–107 (2020).
Heshami, K. et al. Quantum memories: emerging applications and recent advances. J. Mod. Opt. 63, 2005–2028 (2016).
Chu, S. Cold atoms and quantum control. Nature 416, 206–210 (2002).
Dong, D. & Petersen, I. R. Quantum control theory and applications: a survey. IET Control Theory Appl. 4, 2651–2671 (2010).
Khaneja, N., Reiss, T., Kehlet, C., Schulte-Herbrüggen, T. & Glaser, S. J. Optimal control of coupled spin dynamics: design of NMR pulse sequences by gradient ascent algorithms. J. Magn. Reson. 172, 296–305 (2005).
Ding, H.-J. & Wu, R.-B. Robust quantum control against clock noises in multiqubit systems. Phys. Rev. A 100, 022302–022308 (2019).
Hincks, I. N., Granade, C. E., Borneman, T. W. & Cory, D. G. Controlling quantum devices with nonlinear hardware. Phys. Rev. Appl. 4, 024012–024020 (2015).
Wang, H. et al. QuantumNAS: noise-adaptive search for robust quantum circuits. In 2022 IEEE International Symposium on High-Performance Computer Architecture 692–708 (IEEE, 2022).
Tsubouchi, M. & Momose, T. Rovibrational wave-packet manipulation using shaped midinfrared femtosecond pulses toward quantum computation: optimization of pulse shape by a genetic algorithm. Phys. Rev. A 77, 052326–052340 (2008).
Shrestha, A. & Mahmood, A. Review of deep learning algorithms and architectures. IEEE Access 7, 53040–53065 (2019).
Wang, Z. & Guet, C. Deep learning in physics: a study of dielectric quasi-cubic particles in a uniform electric field. IEEE Trans. Emerg. Top. Comput. Intell. 6, 429–438 (2022).
Chen, S. Y.-C. et al. Variational quantum circuits for deep reinforcement learning. IEEE Access 8, 141007–141024 (2020).
Zhang, X.-M., Wei, Z., Asad, R., Yang, X.-C. & Wang, X. When does reinforcement learning stand out in quantum control? A comparative study on state preparation. npj Quantum Inform. 5, 85–92 (2019).
Bukov, M. et al. Reinforcement learning in different phases of quantum control. Phys. Rev. X 8, 031086–031101 (2018).
Moro, L., Paris, M. G. A., Restelli, M. & Prati, E. Quantum compiling by deep reinforcement learning. Commun. Phys. 4, 178 (2021).
Shao, K., Zhu, Y. & Zhao, D. Starcraft micromanagement with reinforcement learning and curriculum transfer learning. IEEE Trans. Emerg. Top. Comput. Intell. 3, 73–84 (2019).
Pan, Z., Wang, L., Wang, J. & Lu, J. Deep reinforcement learning based optimization algorithm for permutation flow-shop scheduling. IEEE Trans. Emerg. Top. Comput. Intell. 7, 983–994 (2023).
Hu, S., Chen, C. & Dong, D. Deep reinforcement learning for control design of quantum gates. In 2022 13th Asian Control Conference 2367–2372 (IEEE, 2022).
Ma, H., Dong, D., Ding, S. X. & Chen, C. Curriculum-based deep reinforcement learning for quantum control. IEEE Trans. Neural Netw. Learn. Syst. 34, 8852–8865 (2023).
Finn, C., Abbeel, P. & Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proc 34th International Conference on Machine Learning 1126–1135 (JMLR.org, 2017).
Taylor, A., Dusparic, I., Guériau, M. & Clarke, S. Parallel transfer learning in multi-agent systems: what, when and how to transfer? In 2019 International Joint Conference on Neural Networks 1–8 (2019).
Khalid, I., Weidner, C. A., Jonckheere, E. A., Schirmer, S. G. & Langbein, F. C. Sample-efficient model-based reinforcement learning for quantum control. Phys. Rev. Res. 5, 043002 (2023).
Nagabandi, A. et al. Learning to adapt in dynamic, real-world environments through meta-reinforcement learning. Preprint at arXiv:1803.11347v6 (2019).
Finn, C., Rajeswaran, A., Kakade, S. & Levine, S. Online meta-learning. Proc. Mach. Learn. Res. 97, 1920–1930 (2019).
Hospedales, T., Antoniou, A., Micaelli, P. & Storkey, A. Meta-learning in neural networks: a survey. IEEE Trans. Pattern Anal. Mach. Intell. 44, 5149–5169 (2022).
Vettoruzzo, A., Bouguelia, M.-R., Vanschoren, J., Rognvaldsson, T. & Santosh, K. C. Advances and challenges in meta-learning: a technical review. IEEE Trans. Pattern Anal. Mach. Intell. 46, 4763–4779 (2024).
Kitaev, A. Y. Quantum computations: algorithms and error correction. Russ. Math. Surv. 52, 1191 (1997).
Deffner, S. & Campbell, S. Quantum speed limits: from Heisenberg’s uncertainty principle to optimal quantum control. J. Phys. A Math. Theor. 50, 453001 (2017).
Taddei, M. M., Escher, B. M., Davidovich, L. & de Matos Filho, R. L. Quantum speed limit for physical processes. Phys. Rev. Lett. 110, 050402 (2013).
Zhang, M., Yu, H.-M. & Liu, J. Quantum speed limit for complex dynamics. npj Quantum Inform. 9, 97 (2023).
Johansson, J., Nation, P. & Nori, F. QuTiP 2: a Python framework for the dynamics of open quantum systems. Comput. Phys. Commun. 184, 1234–1240 (2013).
Barends, R. et al. Digitized adiabatic quantum computing with a superconducting circuit. Nature 534, 222–226 (2016).
Saki, A. A., Topaloglu, R. O. & Ghosh, S. Muzzle the shuttle: efficient compilation for multi-trap trapped-ion quantum computers. In 2022 Design, Automation & Test in Europe Conference & Exhibition, 322–327 (IEEE, 2022).
Yu, H. & Zhao, X. Event-based deep reinforcement learning for quantum control. IEEE Trans. Emerg. Top. Comput. Intell. 1–15 (2023).
Zahedinejad, E., Ghosh, J. & Sanders, B. C. High-fidelity single-shot Toffoli gate via quantum control. Phys. Rev. Lett. 114, 200502 (2015).
Schulman, J., Wolski, F., Dhariwal, P., Radford, A. & Klimov, O. Proximal policy optimization algorithms. Preprint at arXiv:1707.06347v2 (2017).
Acknowledgements
The authors acknowledge support from the National Natural Science Foundation of China (Grant Nos. 62173296, 62003113, and 62273016) and the Basic and Applied Basic Research Foundation of Guangdong Province (Grant No. 2025A1515010186).
Author information
Authors and Affiliations
Contributions
Y.C. and Z.M. conceived of the project. S.Z., Y.C., and Z.M. designed and implemented the algorithm. The simulation results obtained by S.Z., Y.C., and Z.M. were analyzed by all authors. All authors contributed to the writing and editing of the manuscript. Z.M. supervised the project.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Zhang, S., Miao, Z., Pan, Y. et al. Meta-learning assisted robust control of universal quantum gates with uncertainties. npj Quantum Inf 11, 81 (2025). https://doi.org/10.1038/s41534-025-01034-9