Introduction

High-precision quantum control is crucial for quantum information processing, which could, in principle, be achieved by following the adiabatic theorem1. However, to preserve quantum properties, quantum systems must be operated fast enough to suppress the detrimental influence of dissipation and decoherence, so that nonequilibrium dynamics dominates quantum processes2. In fact, nonequilibrium processes in thermodynamics have drawn much attention over the past decades, where the most influential works, such as the Jarzynski equality3,4 and the fluctuation theorem5,6, have provided new insights into stochastic dynamics out of equilibrium. In particular, owing to rapid progress in quantum technology, understanding nonequilibrium thermodynamic processes at the quantum level has recently become a burgeoning field. When extended to the quantum regime, most thermodynamic quantities have to be re-examined, and thus most restrictions of the original thermodynamics have to be reformulated. For example, irreversibility, which is tied to the conventional second law of thermodynamics, has been reconsidered from the viewpoint of information theory. Quantified by the relative entropy, irreversibility in either closed or open quantum systems is restricted by bounds stricter than the conventional second law of thermodynamics7,8,9. However, in contrast to the fact that all real macroscopic processes are definitely irreversible, irreversibility in microscopic processes appears much more complicated, as discovered in a recent experiment10.

We have noticed a recent proposal11 exploring the reduction of irreversibility in finite-time nonequilibrium thermodynamic transformations of a closed quantum system. The results indicate that the relative entropy can be greatly reduced in nonequilibrium thermodynamics if strategies of reinforcement learning (RL)12 are employed. RL is an important paradigm of machine learning, which has been widely employed in the study of quantum physics over the past years13,14,15,16,17,18,19,20,21,22,23,24,25,26,27. From the simulation of open quantum many-body systems to the design of quantum protocols, researchers have benefited greatly from the unique characteristics of intelligent algorithms and the automation of machine learning. Applied to quantum control, RL works through the interplay between decision-making entities (called agents) and an environment, updating the agent's behavior based on the obtained feedback, see Fig. 1a. RL approaches have successfully accomplished optimized quantum tasks, from fast and robust control of a single qubit28 to efficient solutions of quantum many-body problems18,29. Since they require little knowledge of the dynamical details of the system, RL approaches have outstanding advantages in practical quantum control over conventional optimized counterparts such as shortcuts to adiabaticity30,31, non-adiabatic schemes32, and non-cycling geometric ideas33,34,35,36. Recently, RL incorporating quantum technologies into the agent design has demonstrated speed-ups in the accomplishment of quantum operations37,38,39.

Fig. 1: Schematic for experimental method and steps.

a Schematic for reinforcement learning (RL) method, where the agent (i.e., the network) interacts with the environment (i.e., the qubit) by repeated actions and rewards. b Level scheme of the 40Ca+ ion, where the double-sided arrows and the wavy arrow represent the laser irradiation and dissipation, respectively. An effective two-level system (TLS) with controllable driving and decay is approximated from the three-level configuration. The dissipative qubit is achieved by switching on the 854-nm laser, which describes the qubit coupled to a classical bath. c The Jaynes-Cummings (JC) interaction between the qubit and the vibrational degree of freedom of the ion, describing the qubit coupled to a quantum bath. d RL-controlled experimental steps for (I) TLS and (II) JC model, respectively, starting from the end of the sideband cooling to the final detection.

Here, we report our experimental investigation of optimized nonequilibrium quantum thermodynamics in an ultracold 40Ca+ ion, using a typical RL approach, the deep deterministic policy gradient (DDPG) algorithm40, which is an advanced actor-critic algorithm. In our case, the network (also called the agent) provides choices of actions (i.e., the control of the Rabi oscillation strength and/or phase) for the environment (i.e., the qubit) so as to maximize the reward, that is, to achieve the nonequilibrium thermodynamic transformation with higher fidelity. Specifically, the total training duration τ is divided into n steps. At each step t, for a given environment state ρt, the agent generates an action at by the main network, and obtains the next state ρt+1 as well as an immediate reward rt from the environment. The goal of RL is then to find the optimized action set {at} (i.e., the control sequence) that reaches the maximum total reward \(R=\mathop{\sum }\nolimits_{t = 1}^{n}{r}_{t}\). This procedure is repeated until the networks converge, and the DDPG algorithm ensures that the optimized evolution route corresponds to the largest accumulated reward. We focus on RL-controlled nonequilibrium thermodynamic transformations relevant to two key operations in the ion trap, i.e., the carrier transition and the red-sideband transition, demonstrating the advantages of high-precision control and reduced irreversibility. Our experimental observations provide evidence for the advantages of the RL strategy in optimizing quantum state control.

Results and discussion

The experiment platform

In our experiment, the ion is confined in a linear Paul trap with axial frequency ωz/2π = 0.94 MHz and radial frequency ωr/2π = 1.2 MHz, and the quantization axis is defined at an angle of 45° with respect to the axial direction by a magnetic field of approximately 6.23 Gauss at the center of the trap. For our purpose, the ion is cooled, prior to the experiment, down to near the ground state of the vibrational modes, and the qubit is encoded in the electronic states \(\vert {4}^{2}{S}_{1/2},{m}_{J}=-1/2\rangle\) (labeled as \(\vert g\rangle\)) and \(\vert {3}^{2}{D}_{5/2},{m}_{J}=-5/2\rangle\) (labeled as \(\vert e\rangle\)), with mJ the magnetic quantum number41. As plotted in Fig. 1b, the qubit is manipulated by an ultra-stable narrow-linewidth 729-nm laser with a Lamb-Dicke parameter η ~ 0.1. In what follows, we consider the optimized nonequilibrium quantum thermodynamics subject to a classical bath and a quantum bath, respectively, which goes beyond the model in Ref. 11. Specifically, we manipulate the single qubit with the spontaneous emission switched on (off) to demonstrate the RL-engineered nonequilibrium quantum thermodynamics with (without) dissipation, and we implement a Jaynes-Cummings model to explore the single qubit coupled to a quantum bath provided by the vibrational degree of freedom of the ion, as shown in Fig. 1c. In the latter case, for simplicity, the classical bath is excluded, implying that no spontaneous emission occurs in the qubit.

Operations with a single qubit

To construct the dissipative channel to the classical bath, we introduce an extra energy level \(\vert {4}^{2}{P}_{3/2},{m}_{J}=-3/2\rangle\), which couples to \(\left\vert e\right\rangle\) via an 854-nm laser (with Rabi frequency Ω1) and dissipates to \(\vert g\rangle\) with a decay rate Γ/2π = 23.1 MHz. Under appropriate laser irradiation, as tested in refs. 10,42, an effective two-level model with engineered drive \(\tilde{\Omega}\)(t) and decay \({\gamma }_{\mathrm{eff}}={\Omega}_{1}^{2}/\Gamma\) can be achieved, as shown in Fig. 1b, which is described by the Lindblad master equation,

$$\dot{\rho}=-i[H_{s},\rho]+\frac{\gamma_{\mathrm{eff}}}{2}\left(2\sigma_{-}\rho\sigma_{+}-\sigma_{+}\sigma_{-}\rho-\rho\sigma_{+}\sigma_{-}\right),$$
(1)

where ρ denotes the density operator, and Hs is the single-qubit Hamiltonian. Experimentally, the single-qubit rotations are accomplished by carrier transitions, following the time-dependent Hamiltonian within the time interval [0, τ],

$$H_{s}(t)=\frac{\Omega_{0}}{2}\left[\sigma_{x}\cos(\pi t/2\tau)+\sigma_{y}\sin(\pi t/2\tau)\right],$$
(2)

which represents a time-dependent single-qubit rotation with the maximal Rabi frequency Ω0 and the usual Pauli operators σx,y acting on \(\left\vert e\right\rangle\) and \(\left\vert g\right\rangle\).
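For a numerical illustration of this model, the following minimal QuTiP sketch (our own, not the authors' simulation code) integrates the master equation (1) under the carrier Hamiltonian (2); the values Ω0/2π = 20 kHz and γeff = 0.0216Ω0 are taken from the figure captions below, while the initial state |g⟩ and the basis ordering are assumptions made for the example.

```python
import numpy as np
from qutip import sigmax, sigmay, sigmaz, sigmam, fock_dm, mesolve

Omega0 = 2 * np.pi * 20e3          # maximal Rabi frequency (rad/s), Omega_0/2pi = 20 kHz
gamma_eff = 0.0216 * Omega0        # engineered decay rate of the effective TLS
tau = 26e-6                        # total evolution time (s)

# Time-dependent carrier Hamiltonian of Eq. (2)
H = [[0.5 * Omega0 * sigmax(), lambda t, args: np.cos(np.pi * t / (2 * tau))],
     [0.5 * Omega0 * sigmay(), lambda t, args: np.sin(np.pi * t / (2 * tau))]]

# Lindblad dissipator of Eq. (1): collapse operator sqrt(gamma_eff) * sigma_-
c_ops = [np.sqrt(gamma_eff) * sigmam()]

# Free evolution from |g><g| (illustration only), recording the Stokes parameters
rho0 = fock_dm(2, 1)               # assume basis(2,0) = |e>, basis(2,1) = |g>
tlist = np.linspace(0.0, tau, 101)
result = mesolve(H, rho0, tlist, c_ops, e_ops=[sigmax(), sigmay(), sigmaz()])
Sx, Sy, Sz = result.expect         # S_i = Tr[sigma_i rho(t)], cf. Fig. 2c
```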

The RL-assisted evolutions are executed as plotted in Fig. 1d and then compared with the free evolution governed by the same Hamiltonian, starting from the same initial state \({\rho }^{eq}(0)=\exp (-\beta {H}_{0})/{Z}_{0}\) with the partition function \({Z}_{0}={\rm{tr}}[\exp (-\beta {H}_{0})]\) and H0 ≡ Hs(0) = Ω0σx/2. In our experiment, this initial Gibbs state ρeq(0) of the qubit is prepared by RL. To this end, we first cool the trapped ion down to near the ground vibrational state and then prepare the state \(\left\vert g\right\rangle\) by optical pumping. The initial Gibbs state is then prepared under the following Hamiltonian,

$$H_{0}(t)=\frac{\Omega_{0}}{2}\sigma_{x}-f_{\mathrm{opt}}(t)\sigma_{y},$$
(3)

where fopt(t)σy is the control term imposed by the RL approach. Experimentally, this Hamiltonian is realized using a single beam of the 729-nm laser with time-dependent Rabi frequency \(\tilde{\Omega}(t)=\sqrt{\Omega_{0}^{2}+4f_{\mathrm{opt}}^{2}}\) and phase \(\phi (t)=\arctan (2f_{\mathrm{opt}}/\Omega_{0})\). Meanwhile, the 854-nm laser is switched on to construct the dissipative channel, so the whole system is governed by the Lindblad master equation given in Eq. (1). Under the RL-designed optimal pulses (see Fig. 2a, b), after six steps (12 μs) the system is thermalized to the state ρeq(0). As shown in Fig. 2c, we measure the time evolution of the Stokes parameters \({S}_{i\in (x,y,z)}\equiv {\rm{Tr}}[{\sigma }_{i}\rho ]\) and compare it with the theoretical simulation (see Supplementary Note 3 for more details). Figure 2d shows how the value of the dissipative parameter γeff, which is required for the state preparation, is acquired experimentally by switching on the 854-nm laser.
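As a minimal sketch (our own, with βΩ0 = 3 taken from the caption of Fig. 3), the target Gibbs state ρeq(0) and the mapping from the RL control amplitude fopt onto the laser parameters (\(\tilde{\Omega}\), ϕ) of Eq. (3) can be written as follows; the function name is ours.

```python
import numpy as np
from qutip import sigmax

Omega0 = 2 * np.pi * 20e3
beta = 3.0 / Omega0                 # inverse temperature, beta*Omega_0 = 3

# Target Gibbs state rho_eq(0) = exp(-beta H_0)/Z_0 with H_0 = Omega_0 sigma_x / 2
H0 = 0.5 * Omega0 * sigmax()
rho_eq0 = (-beta * H0).expm()
rho_eq0 = rho_eq0 / rho_eq0.tr()

def laser_parameters(f_opt, Omega0=Omega0):
    """Map the control amplitude f_opt of Eq. (3) onto the effective Rabi
    frequency and phase of the single 729-nm beam."""
    Omega_eff = np.sqrt(Omega0**2 + 4.0 * f_opt**2)
    phi = np.arctan(2.0 * f_opt / Omega0)
    return Omega_eff, phi
```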

Fig. 2: Initial state preparation by reinforcement learning (RL) method.

a, b Designed pulses of the effective Rabi frequency \(\widetilde{\Omega}\) and phase ϕ, respectively, for initial state preparation using the RL method. c Experimental measurement of the Stokes parameters Sx, Sy, and Sz, in comparison with the theoretical simulation, where the Stokes parameters are acquired by measuring the populations along the x, y and z directions (see Supplementary Note 3). The error bars indicating the statistical standard deviation of the experimental data are obtained from 10,000 measurements for each data point. After the last step, the fidelity is F = 0.9799 ± 0.0103. d Time evolution of the population Pe due to decay, from which we acquire the decay rate γeff = 11.99 kHz. Other parameters: Rabi frequency Ω0/2π = 20 kHz and time per step δτ = 0.25/Ω0.

To precisely execute the RL operations, we employ an arbitrary waveform generator as the frequency source of the acousto-optic modulator, which provides the phase and frequency control of the 729-nm laser during the experimental implementation. Owing to the fast gate operations (e.g., <60 μs) in our experiment, the qubit dephasing (≈0.81(11) kHz)43, originating from magnetic and electric field fluctuations, is negligible throughout this paper.

After preparing the initial state ρeq(0), we aim to demonstrate that the RL-assisted evolutions achieve higher fidelity while consuming less entropy production in both closed and open quantum systems. In the following operations, for comparison, we execute Eq. (2) for the free evolution of the qubit, while carrying out the optimal nonequilibrium thermodynamic transformation under RL control, which imposes a control term on Eq. (2) as H = Hs(t) − fopt(t)σy, with fopt(t) ∈ [−1.5Ω0, 1.5Ω0], fopt(0) = fopt(τ) = 0 and t ∈ [0, τ]. As a result, the RL-controlled Hamiltonian is given by

$$H(t)=\frac{\tilde{\Omega}(t)}{2}\left({e}^{i\phi (t)}\sigma_{+}+{e}^{-i\phi (t)}\sigma_{-}\right),$$
(4)

which is a modified form of Eq. (2) with Rabi frequency \(\tilde{\Omega}(t)=\sqrt{\Omega_{0}^{2}+4f_{\mathrm{opt}}[f_{\mathrm{opt}}-\Omega_{0}\sin (t\pi /2\tau )]}\) and phase \(\phi (t)=\arctan \{[2f_{\mathrm{opt}}-\Omega_{0}\sin (t\pi /2\tau )]/[\Omega_{0}\cos (t\pi /2\tau )]\}\). Our optimal nonequilibrium thermodynamic transformation is thus carried out by controlling the power and phase of the 729-nm laser irradiation, governed by Eq. (4). Moreover, we execute our operations in an open (a closed) system simply by switching on (off) the 854-nm laser, i.e., γeff ≠ 0 (γeff = 0). In either case, the system is initialized in the thermal state ρeq(0). The target state is set to be the equilibrium thermal state of Hs(τ), i.e., \({\rho }_{1}^{eq}(\tau )=\exp (-\beta {H}_{s}(\tau ))/{Z}_{s}\) with the partition function \({Z}_{s}={\rm{tr}}[\exp (-\beta {H}_{s}(\tau ))]\).

Our goal with the DDPG algorithm is to search for optimal controls of the Rabi frequency \(\tilde{\Omega}(t)\) and phase ϕ(t) that maximize the fidelity F, defined as \(F={\left[{\rm{tr}}\sqrt{{\rho }^{1/2}(t){\rho }^{eq}(\tau ){\rho }^{1/2}(t)}\right]}^{2}\), where ρ(t) is the actual density operator of the system at time t. For this purpose, we engineer the Hamiltonian H(t), in terms of the RL-designed pulses, so that the system approaches the target state in a nonequilibrium way. The nonequilibrium entropy production associated with such a transformation is quantified by the quantum relative entropy Σ(t) = S(ρ(t)∥ρeq(τ)). The fidelity evaluates how good the controlled evolution is, and the relative entropy assesses how much irreversibility is involved in the process. As numerically verified in Supplementary Note 2, in our case of nonequilibrium thermodynamics the fidelity varies inversely with the relative entropy, so that near-unity fidelity corresponds to nearly zero relative entropy. As such, maximizing the RL reward given to the agent also minimizes the relative entropy. For convenience of comparison, we define ΔΣ as the reduction of the entropy production with respect to the free-evolution case, i.e., ΔΣ(t) = 1 − Σopt(t)/Σfree(t), with Σopt(t) (Σfree(t)) the relative entropy of the state under RL control (free evolution). Since Σfree is fixed for a given evolution, a larger value of ΔΣ implies less consumption of Σopt. In addition, we assess the reduction of the work done on the system during this process by defining dW = 1 − (ΔUopt + Ein)/ΔUfree, with \({E}_{\mathrm{in}}=| \int\nolimits_{0}^{1}{\rm{tr}}[\rho (t){f}_{\mathrm{opt}}(t){\sigma }_{y}]dt|\) the energetic cost of the optimization and ΔU the change of the internal energy between the initial and final states11. Moreover, we also assess the variation of the coherence, i.e., the norm of the off-diagonal terms of the density matrix, \(\sqrt{{S}_{x}^{2}+{S}_{y}^{2}}\). The coherence, as a complement to the fidelity or the relative entropy, reflects the quantum characteristics of the system.
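A compact sketch of these figures of merit (the function name and arguments are ours), using standard QuTiP routines, reads:

```python
from qutip import fidelity, entropy_relative

def merit_figures(rho_opt, rho_free, rho_target):
    """rho_opt / rho_free: states under RL control / free evolution at time t;
    rho_target: the equilibrium thermal state rho_eq(tau)."""
    # qutip.fidelity returns tr sqrt(rho^{1/2} sigma rho^{1/2}); the F used here is its square
    F_opt = fidelity(rho_opt, rho_target) ** 2
    Sigma_opt = entropy_relative(rho_opt, rho_target)     # S(rho_opt || rho_target)
    Sigma_free = entropy_relative(rho_free, rho_target)
    delta_Sigma = 1.0 - Sigma_opt / Sigma_free            # entropy-production reduction
    return F_opt, delta_Sigma
```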

Before studying the optimized nonequilibrium quantum thermodynamics in the open system, we first consider the closed case with γeff = 0. Using the DDPG algorithm, we divide the whole dynamics into 13 steps under the designed optimal pulses, see Fig. 3a. After each step, a reward is given so as to maximize the fidelity F. Driving the system from the initial mixed state ρeq(0) with these pulses, we experimentally observe the state evolution approaching the target state ρeq(τ) with near-unity fidelity, as demonstrated in Fig. 3c. Meanwhile, we also find that the engineered state evolution involves much less entropy production and much less work done on the system than its free-evolution counterpart.

Fig. 3: Single-qubit nonequilibrium thermodynamic transformations in the closed system.

a Designed pulses of the effective Rabi frequency \(\widetilde{\Omega}\) and phase ϕ, segmented into 13 steps, due to reinforcement learning (RL) control. b Dynamics of the coherence with and without RL control. c Time evolutions of the fidelity F and the entropy-production reduction ΔΣ, where dots are experimental results and lines represent theoretical simulations. The blue and red lines denote the RL-engineered evolution and the free evolution, respectively, and the black line shows the variation of ΔΣ. After the last step, Fopt = 0.9899 ± 0.0051, ΔΣ = 0.9506 ± 0.0234 and the reduction of the work dW = 0.9042 ± 0.1396. The inset presents the Bloch-sphere illustration of the state evolution, corresponding to the free evolution (red dots) and the RL-controlled evolution (blue dots). d, e Comparison of the robustness against systematic imperfections for the RL-controlled fidelity Fopt, the reduced entropy production ΔΣ and the reduction of work dW at the final time τ = 26 μs, where d is for the Rabi-frequency deviation δΩ/Ω0 of the laser irradiation and e for the resonance-frequency deviation Δ/Ω0 of the qubit. Fopt remains larger than 95% even when the deviation reaches 20%. The error bars indicating the statistical standard deviation of the experimental data are obtained from 10,000 measurements for each data point. Other parameters: Rabi frequency Ω0/2π = 20 kHz, time per step δτ = 0.25/Ω0, and inverse temperature βΩ0 = 3.

Examining the evolution tracks in the Bloch sphere, we see that, in comparison with the simple route of the free evolution, the RL-engineered evolution is more complicated, with some zigzag segments, corresponding to the dynamics of the quantum coherence shown in Fig. 3b. In fact, the advantage of this complicated evolution track is also reflected in the robustness against systematic imperfections, i.e., an imperfect Rabi frequency and an inaccurate resonance frequency. Our observations in Fig. 3d, e reveal a higher fidelity assisted by the RL pulses than in the free evolution over a large range of deviations, implying more robustness against the imperfections. In contrast, ΔΣ and dW are less robust, although they still behave well as long as the deviation does not exceed 10%. These observations indicate that the RL approach has a significant advantage in high-precision, optimized control of the state evolution. Moreover, we see a larger uncertainty for the measured dW than for Fopt and ΔΣ in Fig. 3d, e, which mainly results from the integral of Ein accumulating the uncertainties of the different parameters involved. For this reason, Ein is sensitive to the state variation and thus yields a larger uncertainty for a larger deviation. In this sense, the fidelity and the relative entropy, rather than the reduction of the work cost, are more suitable for assessing the nonequilibrium thermodynamics.

Now, we carry out the above experimental steps again in an open system by switching on the 854-nm laser beam, which turns the isolated two-level system into the dissipative one plotted in Fig. 1b. Compared to the closed case, we have an additional parameter γeff = 0.0216Ω0, set by the power of the 854-nm laser. Here we still segment the whole dynamics into 13 steps under the DDPG algorithm, see Fig. 4a, and the system is initially prepared in the same thermal state as in the closed case above. Figure 4b shows the coherence dynamics with and without RL control. In Fig. 4c we again observe that the RL-engineered evolution outperforms the free evolution.

Fig. 4: Single-qubit nonequilibrium thermodynamic transformations in the open system.

a Designed pulses of the effective Rabi frequency \(\widetilde{\Omega}\) and phase ϕ, segmented into 13 steps, due to reinforcement learning (RL) control. b Dynamics of the coherence with and without RL control. c Time evolutions of the fidelity F and the entropy-production reduction ΔΣ, where dots are experimental results and lines represent theoretical simulations. The blue and red lines denote the RL-engineered evolution and the free evolution, respectively, and the black line shows the variation of ΔΣ. After the last step, Fopt = 0.9886 ± 0.0067, ΔΣ = 0.9420 ± 0.0341 and the reduction of the work dW = 0.8360 ± 0.1321. The inset presents the time evolution of the population Pe due to decay. d, e Comparison of the robustness against the deviations of the Rabi frequency δΩ/Ω0 and the resonance frequency Δ/Ω0 for Fopt, ΔΣ and dW in the presence of decay, where Fopt remains larger than 95% even when the deviation reaches 20%. The error bars indicating the statistical standard deviation of the experimental data are obtained from 10,000 measurements for each data point. Other parameters: Rabi frequency Ω0/2π = 20 kHz, decay γeff = 0.0216Ω0, time per step δτ = 0.25/Ω0, and inverse temperature βΩ0 = 3.

However, due to dissipation, the population in \(\left\vert e\right\rangle\) declines over time. Nevertheless, at finite time, e.g., τ = 26 μs, we witness a robustness of the RL-engineered evolution to the imperfections similar to that of the closed case, as shown in Fig. 4d, e. The situation with larger decay rates can be found in Supplementary Note 1.

Operations with Jaynes–Cummings model

To further characterize the performance of the RL method, we have also carried out an experiment for a JC interaction between the qubit and the vibrational degree of freedom of the ion, that is,

$$H_{s2}=\frac{\Omega_{0}}{2}\eta\,\left(\sigma_{+}a+\sigma_{-}{a}^{\dagger}\right),$$
(5)

with \(a^{\dagger}\) (a) the creation (annihilation) operator of the phonon. This describes a qubit coupled to a quantum bath provided by the vibrational mode of the ion. For simplicity, we exclude the dissipation of the qubit to the classical bath, and thus switch off the 854-nm laser throughout this experiment. To investigate the advantage of RL, we again compare the control of the above Hamiltonian by the RL approach with the free evolution. Following the RL control Hsr(t) = Hs2 − fopt(t)σx, where \({H}_{\mathrm{opt}}(t)=-{f}_{\mathrm{opt}}(t){\sigma }_{x}=-\tilde{\Omega}(t){e}^{i\theta (t)}{\sigma }_{x}\) is the control term with \({f}_{\mathrm{opt}}(0)={f}_{\mathrm{opt}}(\tau )=0\), \(\tilde{\Omega}\in [-\Omega_{0},\Omega_{0}]\), θ ∈ [0, π], and t ∈ [0, τ], we prepare the system initially in \({\rho }_{s2}(0)={\rho }_{2}(0)\otimes {\left\vert 0\right\rangle }_{r}\left\langle 0\right\vert\), where \({\rho }_{2}(0)=({e}^{\beta \Omega_{0}/2}\left\vert g\right\rangle \left\langle g\right\vert +{e}^{-\beta \Omega_{0}/2}\left\vert e\right\rangle \left\langle e\right\vert )/({e}^{\beta \Omega_{0}/2}+{e}^{-\beta \Omega_{0}/2})\) and the vibrational degree of freedom is in the ground state. In this case, however, we cannot prepare the required initial state using RL engineering, because no separate Hamiltonian exists for the qubit alone. We therefore prepare ρ2(0) in another way: it is obtained from the corresponding superposition state after waiting for a time longer than the dephasing time of the qubit10 (see also Supplementary Note 3 for more details).

The target state is set as the equilibrium thermal state with respect to Hs2, i.e., \({\rho }_{s2}^{eq}(\tau )=\exp (-\beta {H}_{s2})/{Z}_{s2}\) with the partition function \({Z}_{s2}={\rm{tr}}[\exp (-\beta {H}_{s2})]\). Similar to the single-qubit case, we first design the optimized pulses by the DDPG algorithm, finding that the target state can be reached in 9 steps. In this case, however, we need two beams of the 729-nm laser: one realizing Eq. (5), which is parallel to the z-axis as in the single-qubit case, and the other implementing the RL control, which propagates at an angle of 60° to the z-axis. After the 9 pulses are applied, the system evolves to the thermal nonequilibrium state \({\rho }_{2}^{eq}(\tau )={\rm{tr}}_{r}[{\rho }_{s2}^{eq}(\tau )]\) of the qubit.
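A minimal QuTiP sketch of this configuration (our own illustration; the Fock-space truncation N and η = 0.1 are assumptions, while βΩ0 = 2 and Ω0/2π = 50 kHz follow the caption of Fig. 5) is:

```python
import numpy as np
from qutip import tensor, qeye, destroy, sigmap, sigmam, fock_dm, Qobj

N = 10                               # phonon Fock-space truncation (assumption)
eta = 0.1                            # Lamb-Dicke parameter
Omega0 = 2 * np.pi * 50e3
beta = 2.0 / Omega0                  # beta*Omega_0 = 2

a = tensor(qeye(2), destroy(N))
sp = tensor(sigmap(), qeye(N))
sm = tensor(sigmam(), qeye(N))
H_s2 = 0.5 * Omega0 * eta * (sp * a + sm * a.dag())      # Eq. (5)

# Initial state: thermal qubit rho_2(0) times the phonon ground state |0><0|
p_g, p_e = np.exp(beta * Omega0 / 2), np.exp(-beta * Omega0 / 2)
rho_qubit0 = Qobj(np.diag([p_e, p_g])) / (p_g + p_e)     # {|e>, |g>} ordering assumed
rho_s2_0 = tensor(rho_qubit0, fock_dm(N, 0))

# Target: equilibrium state of H_s2; the qubit target is its partial trace
rho_s2_eq = (-beta * H_s2).expm()
rho_s2_eq = rho_s2_eq / rho_s2_eq.tr()
rho_2_eq = rho_s2_eq.ptrace(0)                           # trace over the phonon
```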

Figure 5a illustrates our experimental observation of the high-fidelity evolution under RL control, indicating that the RL-pulse-induced nonequilibrium evolution reaches a state very close to the target state \({\rho }_{2}^{eq}(\tau )\), with a fidelity much higher than that of the free evolution. Figure 5b shows the coherence dynamics with and without RL control. Meanwhile, our observation also reveals that the relative entropy under the RL pulses is much reduced with respect to the free evolution, as shown in Fig. 5c. Similar to the observations above, we have also witnessed at finite time, e.g., τ = 57 μs, the robustness of the RL-engineered evolution to the imperfections, as shown in Fig. 5d, e. Of particular interest in this case is the possibility of monitoring the variation of the quantum bath during the nonequilibrium thermodynamic evolution. Since the spin state and the vibrational state of the ion are correlated by the JC model without dissipation to the outside, the state of the quantum bath is also targeted even though our target state is only defined for the qubit. We have experimentally measured the mean phonon number \(\bar{n}\) at each step, which demonstrates the dissipation of the qubit to the quantum bath. The inset of Fig. 5c shows good agreement between the experimental observation and the expected values in the case of RL engineering.

Fig. 5: Nonequilibrium thermodynamic transformations of the Jaynes-Cummings interaction case.

a Designed pulses of the Rabi frequency and the laser phase due to reinforcement learning (RL) control. b Dynamics of the coherence with and without RL control. c Time evolutions of the fidelity F and the entropy-production reduction ΔΣ, where dots are experimental results and lines represent theoretical simulations. The inset presents the time evolution of the mean phonon number \(\bar{n}=\langle {a}^{\dagger }a\rangle\) with and without RL control pulses. d, e Comparison of the robustness against the deviations of the Rabi frequency and the resonance frequency for Fopt and ΔΣ. The error bars indicating the statistical standard deviation of the experimental data are obtained from 10,000 measurements for each data point. Other parameters: Rabi frequency Ω0/2π = 50 kHz, time per step δτ = 0.25/Ω0, and inverse temperature βΩ0 = 2.

Conclusions

In summary, with high-precision operations on a single ultracold ion, our experimental observations have provided credible evidence for the outstanding advantages of the RL strategy in optimizing quantum state control, which would be useful for developing microscopic thermal machines and quantum information processing based on far-from-equilibrium quantum processes.

In particular, we have explored the engineering of quantum systems subject to dissipation from classical or quantum baths. We have further demonstrated the robustness of the characteristic parameters under RL control, even in the presence of operational imperfections. These observations underscore the practical applicability of RL control in executing quantum tasks with higher fidelity and reduced consumption of entropy production and work. Expanding on our current research, we will delve deeper into the design of efficient single-atom quantum heat engines42 using RL.

Additionally, we have taken note of recent experimental endeavors that leverage the advantages of RL strategies in addressing many-body problems. For instance, one notable achievement is the realization of improved number squeezing in the balanced three-mode Dicke states of \(1{0}^{4}\) \({}^{87}\)Rb atoms29. Considering the theoretical results demonstrating minimized relative entropy in an RL-controlled nonequilibrium thermodynamic process for a two-qubit system11, it is highly anticipated that future experiments will showcase RL-engineered optimization in multi-qubit nonequilibrium thermodynamics. This may involve exploring larger Hilbert spaces, investigating multi-qubit systems, and integrating RL with other control techniques.

Methods

Here we describe the RL method used in our experiment. A standard RL system contains two major entities, the agent and the environment, connected by the state space \({\mathcal{S}}\), the action space \({\mathcal{A}}\), and the reward \({\mathcal{R}}\). The agent and the environment interact via a finite Markovian decision process, which divides the total training time τ into n steps with fixed interval δτ = τ/n. At each time step t, the agent receives a state \({s}_{t}\in {\mathcal{S}}\), then takes an action \({a}_{t}\in {\mathcal{A}}\), which results in a new state \({s}_{t+1}\in {\mathcal{S}}\) and finally yields a reward \({r}_{t}\in {\mathcal{R}}\). This Markovian process is therefore described by the sequence (s0, a0, r0, s1, . . . , st, at, rt, st+1, . . . , sn−1, an−1, rn−1, sn), with n the number of steps in an episode.

DDPG is a model-free, off-policy actor-critic algorithm that can learn policies in high-dimensional spaces with continuous actions. Before going into the details of the DDPG algorithm, we first introduce the definitions of the state space \({\mathcal{S}}\), the action space \({\mathcal{A}}\), and the reward function \({\mathcal{R}}\) for the case considered here.

\({\mathcal{S}}:\) We employ the density operator ρt of the system as the input state of the agent; it contains the complete information of the quantum thermodynamic evolution process and makes the training process quicker.

\({\mathcal{A}}:\) The action space is a continuously controllable variable, at = fopt(t), applied during the interval [t, t + 1) with fopt(t) ∈ [−1.5Ω0, 1.5Ω0]. Based on the optimization Hamiltonian Hopt(t) = −fopt(t)M, we define the unitary operator \({U}_{t}=\exp [-i({H}_{s}+{H}_{\mathrm{opt}}(t))\delta \tau ]\) with Hs the free-evolution Hamiltonian. After the action at, we obtain the new quantum state ρt+1 = U(t, 0)ρ0U†(t, 0) with U(t, 0) = Ut⋯U1U0. Here, the operator M is chosen as M = σy for the single-qubit rotation and M = σx for the Jaynes-Cummings interaction. Experimentally, fopt(t) is implemented by controlling the amplitude and phase of the driving laser.
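A minimal sketch of this one-step propagation (the function name env_step and the example parameters are ours) is:

```python
import numpy as np
from qutip import sigmax, sigmay, fock_dm

def env_step(rho_t, f_opt, H_s, M, dtau):
    """Return rho_{t+1} = U_t rho_t U_t^dag with U_t = exp[-i (H_s - f_opt M) dtau]."""
    U_t = (-1j * (H_s - f_opt * M) * dtau).expm()
    return U_t * rho_t * U_t.dag()

# Example: one step of the single-qubit case (M = sigma_y), Omega_0/2pi = 20 kHz
Omega0 = 2 * np.pi * 20e3
H_s = 0.5 * Omega0 * sigmax()
rho_next = env_step(fock_dm(2, 1), f_opt=0.3 * Omega0, H_s=H_s, M=sigmay(),
                    dtau=0.25 / Omega0)
```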

\({\mathcal{R}}:\) In our study, we select the fidelity F as the objective function, with F(τ) = 1 indicating complete evolution of the system to the target state at the final time. To address the issue of sparse rewards, we write the objective function in a summation form, R = ∑jrj. At each time step tj, the agent receives a reward rj = F(tj), the instantaneous fidelity. This dense reward scheme accelerates the training process and enhances stability. In practice, in our numerical treatment, we may modify the reward as \({r}_{j}\to | {\log }_{10}(1-F({t}_{j}))|\) to improve the learning efficiency when the fidelity approaches 1. Here, 1 − F(tj) represents the fidelity discrepancy between the quantum state at time tj and the target state, and the arrow indicates that the logarithmic transformation increases the weight of values approaching F = 1. The modified reward serves the same purpose as the reward rj = F(tj) defined above.
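As a short illustration (the function name and the small cutoff eps are ours), both reward choices can be expressed as:

```python
import numpy as np
from qutip import fidelity

def step_reward(rho_t, rho_target, log_shaping=False, eps=1e-12):
    """Dense per-step reward r_j = F(t_j); optionally the logarithmic variant
    r_j = |log10(1 - F(t_j))|, which amplifies rewards as F approaches 1."""
    F = fidelity(rho_t, rho_target) ** 2          # squared Uhlmann fidelity
    if log_shaping:
        return abs(np.log10(max(1.0 - F, eps)))   # eps guards the F -> 1 limit (our choice)
    return F
```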

To implement the DDPG algorithm40, we use the TensorFlow framework to build the neural networks (i.e., the agent) for the deep learning. For the single trapped-ion system under consideration, we apply a simple neural network to parameterize all four sub-networks: μθ and Qω (the main network), and \({\mu }_{{\theta }^{{\prime} }}^{{\prime} }\) and \({Q}_{{\omega }^{{\prime} }}^{{\prime} }\) (the target network). As sketched in Fig. 6, the networks in DDPG are trained as follows. First, the action \({a}_{t}={\mu }_{\theta }({\rho }_{t})+{{\mathcal{N}}}_{t}\), with Gaussian exploration noise \({{\mathcal{N}}}_{t}\), is generated via the main actor network, and the experience (ρt, at, rt, ρt+1) is saved in the replay buffer, from which a random minibatch of N transitions (ρi, ai, ri, ρi+1) is sampled to update the main network. During the training, the actor network μθ is updated by the policy gradient \({\bigtriangledown }_{\theta }{\mu }_{\theta }{| }_{{\rho }_{i}}\) to maximize the value Qω(ρi, a = μθ(ρi)) predicted by the critic network Qω. In addition, the target critic network \({Q}_{{\omega }^{{\prime} }}^{{\prime} }\) predicts the value \({Q}_{{\omega }^{{\prime} }}^{{\prime} }({\rho }_{i+1},{a}^{{\prime} }={\mu }_{{\theta }^{{\prime} }}^{{\prime} }({\rho }_{i+1}))\), which is used to minimize the loss function \(L({Q}_{\omega },{Q}_{{\omega }^{{\prime} }}^{{\prime} })\) and thereby update the critic of the main network. Meanwhile, the target network is updated much more slowly than the main network, since the former only absorbs a small weight from the latter. We set the learning rates to αa = 0.001 (actor) and αc = 0.002 (critic) in the main network. The quantum dynamics in our training environment is numerically simulated with the QuTiP quantum toolbox, version 4.6.2 (https://qutip.org).
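For concreteness, a compact TensorFlow 2 sketch of these updates is given below. It is an illustration under stated assumptions, not the authors' code: the network sizes, the flattened-state dimension, the discount factor gamma, and the soft-update rate tau_soft are illustrative choices; only the learning rates αa = 0.001 and αc = 0.002 follow the text.

```python
import tensorflow as tf

def make_actor(state_dim, action_dim):
    # Deterministic policy mu_theta(rho); tanh bounds the action amplitude
    return tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(state_dim,)),
        tf.keras.layers.Dense(action_dim, activation="tanh"),
    ])

def make_critic(state_dim, action_dim):
    # Q_omega(rho, a): state-action value network
    s = tf.keras.Input(shape=(state_dim,))
    a = tf.keras.Input(shape=(action_dim,))
    x = tf.keras.layers.Dense(64, activation="relu")(tf.keras.layers.Concatenate()([s, a]))
    q = tf.keras.layers.Dense(1)(x)
    return tf.keras.Model([s, a], q)

STATE_DIM, ACTION_DIM = 8, 1        # e.g., a 2x2 density matrix flattened to 8 reals (assumption)
actor, critic = make_actor(STATE_DIM, ACTION_DIM), make_critic(STATE_DIM, ACTION_DIM)
target_actor, target_critic = make_actor(STATE_DIM, ACTION_DIM), make_critic(STATE_DIM, ACTION_DIM)
target_actor.set_weights(actor.get_weights())
target_critic.set_weights(critic.get_weights())
actor_opt = tf.keras.optimizers.Adam(1e-3)    # alpha_a = 0.001
critic_opt = tf.keras.optimizers.Adam(2e-3)   # alpha_c = 0.002

def train_step(states, actions, rewards, next_states, gamma=0.99, tau_soft=0.005):
    # Critic update: minimize the TD loss toward the target networks' prediction
    target_q = rewards + gamma * target_critic([next_states, target_actor(next_states)])
    with tf.GradientTape() as tape:
        q = critic([states, actions])
        critic_loss = tf.reduce_mean(tf.square(target_q - q))
    grads = tape.gradient(critic_loss, critic.trainable_variables)
    critic_opt.apply_gradients(zip(grads, critic.trainable_variables))

    # Actor update: policy gradient on Q(s, mu(s))
    with tf.GradientTape() as tape:
        actor_loss = -tf.reduce_mean(critic([states, actor(states)]))
    grads = tape.gradient(actor_loss, actor.trainable_variables)
    actor_opt.apply_gradients(zip(grads, actor.trainable_variables))

    # Soft update: the target networks slowly absorb the main-network weights
    for net, target in ((actor, target_actor), (critic, target_critic)):
        for w, wt in zip(net.weights, target.weights):
            wt.assign(tau_soft * w + (1.0 - tau_soft) * wt)
```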

Fig. 6: Schematic for reinforcement learning (RL) with deep deterministic policy gradient (DDPG) algorithm.

This algorithm includes a replay buffer and a target network, with actor \({\mu }_{{\theta }^{{\prime} }}^{{\prime} }\) and critic \({Q}_{{\omega }^{{\prime} }}^{{\prime} }\), in addition to the main network involving actor μθ and critic Qω. The learning agent (i.e., the network) acts on the environment (i.e., the trapped-ion qubit and/or the vibrational degree of freedom of the ion) and updates its actions based on the feedback obtained from the environment.