Introduction

The very first proposed application of quantum computers can be traced back to Feynman’s idea of simulating quantum physics on a quantum device. Together with factoring1, the simulation of quantum many-body systems stands as the clearest example of the dramatic advantages of quantum computers2. Machine learning is a much newer area where quantum computers are believed to bring advantages for certain learning problems, and indeed there are provable speed-ups achieved by quantum algorithms3,4,5,6,7 for specific machine learning problems. Relating back to Feynman’s original idea of simulating quantum physics, machine learning problems in quantum many-body physics seem a natural scenario where learning advantages could arise. However, perhaps surprisingly, it was recently shown that access to data exemplifying what a hard-to-compute function does can drastically change the hardness of the computational task, questioning the role of quantum computation in machine-learning scenarios8,9,10. If every quantum computation could be replicated classically given access to data, such results would confine the practical application of quantum computers solely to the data acquisition stage. This is, however, not the case, as already shown by the examples considered in refs. 3,4. In those cases, however, the unknown function was cryptographic in nature and not related to genuine quantum simulation problems. Ref. 11 subsequently introduced the first methods for establishing learning separations beyond cryptographic tasks. In particular, it demonstrated that learning problems with provable speed-ups can be constructed from any \({\text{BQP}}\)-complete function, thereby enabling stronger links to physical scenarios. However, the work in ref. 11 had two notable shortcomings.
First, the classical non-learnability results were based on relatively strong assumptions about distributional problems, which, while plausible, are less well-understood than their decision problem counterparts. Second, the study did not introduce any significant quantum learning algorithms for general settings. Instead, quantum learnability was only shown in somewhat artificial scenarios where the concept class was small (polynomially bounded), allowing for the application of straightforward brute-force learning methods.

The results presented in this work effectively address both issues, enabling the establishment of clear boundaries between classical and quantum learning algorithms when dealing with data generated by quantum processes. Specifically, we consider learning problems where one is interested in predicting expectation values of an unknown observable from measurements on input quantum states, which can either be ground states of local Hamiltonians or time-evolved states. As motivation for the learning task, we have in mind experimentally plausible settings where learning an unknown observable may arise, such as in phase classification with an unknown order parameter or when dealing with real devices where the implemented measurement may be influenced by noise or external factors. In this scenario, our result is a proof of learning advantages for different types of quantum observables. The main contributions of this work are as follows:

  • We prove an exponential quantum advantage in learning observables that are formed as a linear combination of local Pauli strings acting on time-evolved or ground states of local Hamiltonians. Importantly, by considering more general learning settings and leveraging results from classical learning theory, we base all of our learning advantage results on the more widely studied and generally accepted assumption that \({\text{BQP}}\)-hard processes cannot be simulated by a polynomial-size classical circuit (i.e., \({\text{BQP}}\, \nsubseteq \,{\rm{P}}/{\text{poly}}\)), rather than on the less well-understood assumptions related to distributional problems.

  • Assuming \({\text{BQP}}\, \nsubseteq \,{\rm{P}}/{\text{poly}}\), we show how to construct learning problems that provide a quantum learning advantage in the general case of learning unitarily-parametrized observables, whenever an efficient procedure for learning an unknown class of unitaries through query access exists. We then provide a concrete example by connecting to recent results on learning shallow unitaries.

We also examine the Hamiltonian learning problem where the identification of the target function is demonstrated to be classically easy. We summarize the learning settings and the achieved separations in Table 1, and provide practical examples of where these learning settings may arise in Section Discussion.

Table 1 Learning problems investigated in this paper

Results

To make our work more self-contained and accessible, we include the necessary preliminaries from learning theory and complexity theory.

PAC learning

The definition of learnability in this work aligns directly with the widely adopted probably approximately correct (PAC) learning framework12,13. In the case of supervised learning, a learning problem in the PAC framework is defined by a concept class \({\mathcal{F}}\) which, for each input size \(n\in {\mathbb{N}}\), represents a set of functions (or concepts) from some input space \({{\mathcal{X}}}_{n}\) (in this paper we take \({{\mathcal{X}}}_{n}\) to be \({\{0,1\}}^{n}\)) to some label set \({{\mathcal{Y}}}_{n}\) (in this paper we take \({{\mathcal{Y}}}_{n}\) to be the interval [−1, 1]). The learning algorithm receives information on the unknown target concept f through training samples \(T={\{({x}_{\ell },f({x}_{\ell }))\}}_{\ell }\), where each \({x}_{\ell }\in {{\mathcal{X}}}_{n}\) is drawn according to a target distribution \({{\mathcal{D}}}_{n}\) over the inputs in \({{\mathcal{X}}}_{n}\). The goal of the learning algorithm is to output the description of a function (or hypothesis) h, again a function from \({{\mathcal{X}}}_{n}\) to \({{\mathcal{Y}}}_{n}\), that is in some sense “close” to the target concept f that labels the data in T.

Definition 1

(Efficient PAC learnability). A concept class \({\mathcal{F}}\) is efficiently PAC learnable if there exists a poly(1/ϵ, 1/δ, n)-time algorithm A such that for all ϵ > 0, all 0 < δ < 1, any target concept f in \({\mathcal{F}}\), and any target distribution \({{\mathcal{D}}}_{n}\) on \({{\mathcal{X}}}_{n}\), if A receives as input a training set \({\mathcal{T}}={\{({x}_{\ell },f({x}_{\ell }))\}}_{\ell }\) of poly(1/ϵ, 1/δ, n) size, then with probability at least 1 − δ over the random samples in \({\mathcal{T}}\) and over the internal randomization of A, the learning algorithm A outputs a specification of some hypothesis \(h(\cdot )=A({\mathcal{T}},\epsilon ,\delta ,\cdot )\) that satisfies

$$\mathop{\Pr }\limits_{x \sim {{\mathcal{D}}}_{n}}\left[\,| h(x)-f(x)| \le \epsilon \,\right]\ge 1-\delta .$$
(1)

In the above definition, learnability is required under any input distribution. While this might seem a strong requirement, it is necessary to ensure the applicability of the theory in real scenarios, where input distributions are often unknown or, in any case, not fixed.
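As a toy illustration of the success criterion in Def. 1, the following Python sketch (with a hypothetical concept f and hypothesis h, both invented for illustration) estimates the probability that |h(x) − f(x)| ≤ ϵ by sampling inputs from a product distribution standing in for \({{\mathcal{D}}}_{n}\):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8  # input size

# Hypothetical target concept and learned hypothesis, {0,1}^n -> [-1,1].
def f(x):  # toy concept: parity-signed mean of the bits
    return (-1) ** x.sum() * x.mean()

def h(x):  # toy hypothesis: f perturbed by at most 0.01 everywhere
    return f(x) + 0.01 * (2 * x[0] - 1)

# Estimate Pr_{x ~ D_n}[|h(x) - f(x)| <= eps] for a uniform product distribution.
def pac_accuracy(f, h, eps, n_samples=10_000):
    xs = rng.integers(0, 2, size=(n_samples, n))
    errs = np.array([abs(h(x) - f(x)) for x in xs])
    return np.mean(errs <= eps)

acc = pac_accuracy(f, h, eps=0.05)
```

Here the hypothesis deviates from the concept by at most 0.01 everywhere, so the estimated probability is 1 for ϵ = 0.05; a PAC learner must achieve this kind of guarantee for every target distribution, not just the uniform one used in this sketch.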

Complexity theory

Our hardness results for the learning tasks discussed in this work are derived from the complexity-theoretic assumption that \({\text{BQP}}\, \nsubseteq \,{\rm{P}}/{\text{poly}}\). We remark that, in contrast to previous work11, we do not consider the heuristic versions of these classes; rather, we establish our learning separation using their simpler exact decision-problem versions. In this section, we give the precise definitions of these two complexity classes.

Definition 2

(BQP). A language \({\mathcal{L}}\) is in BQP if and only if there exists a polynomial-time uniform family of quantum circuits \(\{{U}_{n}:n\in {\mathbb{N}}\}\), such that

  • For all \(n\in {\rm{{\mathbb{N}}}}\), Un takes as input an n-qubit computational basis state, and outputs one bit obtained by measuring the first qubit in the computational basis.

  • For all \(x\,\in \,{\mathcal{L}}\), the probability that the output of \({U}_{| x| }\) applied to the input x is 1 is greater than or equal to 2/3.

  • For all \(x\,\notin \,{\mathcal{L}}\), the probability that the output of \({U}_{| x| }\) applied to the input x is 0 is greater than or equal to 2/3.

The class \({\rm{P}}/{\text{poly}}\), instead, captures the class of decision problems solvable by a polynomial-time deterministic algorithm equipped with an “advice” bitstring. Importantly, the advice depends only on the input size, but it must be the same for every input x of a given size.

Definition 3

(Polynomial advice14). A problem \({\mathcal{L}}:{\{0,1\}}^{* }\to \{0,1\}\) is in \({\rm{P}}/{\text{poly}}\) if there exists a polynomial-time classical algorithm \({\mathcal{A}}\) with the following property: for every n, there exists an advice bitstring \({\alpha }_{n}\in {\{0,1\}}^{{\rm{poly}}(n)}\) such that for all \(x\in {\{0,1\}}^{n}\):

$${\mathcal{A}}(x,{\alpha }_{n})={\mathcal{L}}(x).$$
(2)

Equivalently, the class \({\rm{P}}/{\text{poly}}\) can be thought of as the class of decision problems solvable by a non-uniform family of polynomial-size Boolean circuits: for each input size there exists an efficient circuit that correctly decides all inputs of that size, but the circuits may differ completely between input sizes n. Finally, let us comment on the assumption \({\text{BQP}}\, \nsubseteq \,{\rm{P}}/{\text{poly}}\). Although the class \({\rm{P}}/{\text{poly}}\) is regarded as very powerful (for instance, it contains undecidable unary languages), it is generally believed that \({\text{BQP}}\) is not entirely contained within it. For example, if \({\text{BQP}}\subseteq {\rm{P}}/{\text{poly}}\), then factoring and the discrete logarithm problem1 would be solvable by non-uniform polynomial-size classical circuits, which would compromise much of modern cryptography under standard security assumptions11,15. Another argument against \({\text{BQP}}\subseteq {\rm{P}}/{\text{poly}}\) arises from the corresponding sampling problem: if quantum sampling could be performed in \({\text{SampBPP}}/{\text{poly}}\), then the polynomial hierarchy would collapse at the fourth level, as discussed in refs. 11,16,17. Finally, while Adleman’s trick shows that the randomness of classical algorithms can be replaced by advice strings (so that \({\text{BPP}}\subseteq {\rm{P}}/{\text{poly}}\)), no analogous method is known for quantum algorithms. In particular, there is no known way to “extract the quantumness”17 from a quantum algorithm, leaving the nature of polynomial advice for simulating quantum computers entirely unclear.

The learning problems

We now consider the first class of learning problems addressed in this paper, where the unknown observable is a linear combination of local Pauli strings. Within this framework, we define the following three abstract learning scenarios. In the first case, the input states come from an arbitrary distribution and are time-evolved under a fixed local Hamiltonian before being measured with the (partially) unknown target observable. In the second scenario, the inputs are known local Hamiltonians that time-evolve a fixed initial state before the measurement. In the third case, we again consider different Hamiltonians as inputs, but here the learning problem involves learning measurements on their ground states rather than on time-evolved states. We anticipate that these abstract scenarios can represent a variety of realistic settings where noise or external factors constrain control over the input Hamiltonians and the implemented measurements; we discuss this further in Section Discussion.

For all three scenarios, we rigorously model the learning tasks using concept classes within the PAC learning framework introduced in Section Results. We now analyze the first scenario, where the input states are time-evolved under a fixed Hamiltonian for a constant time, but the input state preparation is not fully controlled. We model this by allowing the input states to be sampled randomly from an unknown distribution of quantum states; they then undergo a time evolution before being measured with an unknown observable. After collecting the corresponding measurement outcomes, the goal is to predict the expectation values of the unknown observable on new input states. Let H be a Hamiltonian and fix a constant time τ; we model the time-evolution learning problem by the following concept class

$${{\mathcal{F}}}_{{\rm{evolved}}}^{H,O}=\{{f}^{\alpha }(x)\in {\mathbb{R}}\,\,| \,\,\alpha \in {[-1,1]}^{m}\}$$
(3)
$$\begin{array}{rcl} & \mathrm{with}: & {f}^{\alpha }(x):\,\,\,x\in {\{0,1\}}^{n}\to {f}^{\alpha }(x)=\,{\text{Tr}}\,[{\rho }_{H}(x)O(\alpha )]\\ & & O(\alpha )=\mathop{\sum }\limits_{i=1}^{m}{\alpha }_{i}{P}_{i}.\end{array}$$

In the above, x specifies the initial state \(| x\rangle\), \({\rho }_{H}(x)\) is the constant-time-evolved state \({\rho }_{H}(x)=U| x\rangle \langle x| {U}^{\dagger }\) with \(U={e}^{-iH\tau }\), and each Pi is a k-local Pauli string, where m scales polynomially with n. Notice that although we consider binary inputs \(x\in {\{0,1\}}^{n}\), the output of each concept is real-valued. The goal of the learning algorithm is to learn a model h(x) which approximates the unknown concept fα(x) = Tr[ρH(x)O(α)], using as training samples data of the form \({{\mathcal{T}}}_{{\epsilon }_{2}}^{\alpha }={\{({x}_{\ell },{y}_{\ell })\}}_{\ell =1}^{N}\), where \({y}_{\ell }{\approx }_{{\epsilon }_{2}}{f}^{\alpha }({x}_{\ell })\) is an additive-error ϵ2-approximation of the true expectation value Tr[ρH(xℓ)O(α)]. Considering datasets with approximate values brings the learning problem closer to real-world scenarios. In an idealized setting, the dataset would consist of pairs (x, y) where y is the exact expectation value; in real experiments, however, these real-valued quantities are estimated from a finite number of state copies, so only approximations of the expectation values are available due to sampling errors. Formally, we assume a maximum (sampling) error ϵ2 on the training labels y of our dataset, i.e., \({\epsilon }_{2}=\mathop{max}\limits_{\ell }| \,\mathrm{Tr}\,[{\rho }_{H}({x}_{\ell })O(\alpha )]-{y}_{\ell }|\). We can now formally state our learning condition. Assuming the x’s in the training data come from an unknown distribution \({\mathcal{D}}\), we require that the ML model learns the concept class \({{\mathcal{F}}}_{{\rm{e}}{\rm{v}}{\rm{o}}{\rm{l}}{\rm{v}}{\rm{e}}{\rm{d}}}^{H,O}\) in the following sense (note that, in contrast to the learning condition in ref. 11, here we require learnability under any distribution).

Definition 4

(Efficient learning condition). A concept class \({\mathcal{F}}\) is efficiently learnable if there exists a poly(1/ϵ, 1/δ, 1/ϵ2, n)-time algorithm A such that for all ϵ, ϵ2 > 0, all 0 < δ < 1, any fα in \({\mathcal{F}}\), and any input target distribution \({\mathcal{D}}\), if A receives as input a training set \({{\mathcal{T}}}_{{\epsilon }_{2}}^{\alpha }\) of poly(1/ϵ, 1/δ, 1/ϵ2, n) size, then \(h(\cdot )=A({{\mathcal{T}}}_{{\epsilon }_{2}}^{\alpha },\epsilon ,\delta ,\cdot )\) satisfies, with probability at least 1 − δ:

$${{\mathbb{E}}}_{x \sim {\mathcal{D}}}[| {f}^{\alpha }(x)-h(x){| }^{2}]\le \epsilon$$
(4)

Notice that our learning condition directly follows from the definition of PAC learnability in Def. 1, with the only difference being that it is modified to account for errors in the training data, as discussed above.
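To make the data model concrete, the following Python sketch generates a single training pair (x, y) whose label carries an ϵ2-bounded error, exactly as in the definition of \({{\mathcal{T}}}_{{\epsilon }_{2}}^{\alpha }\). The instance is a toy n = 3 system with invented 1-local Pauli strings and a toy Hamiltonian, not the hard instances used in our proofs:

```python
import numpy as np

rng = np.random.default_rng(1)

# Single-qubit Pauli matrices.
I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def kron_all(ops):
    out = np.array([[1.0 + 0j]])
    for op in ops:
        out = np.kron(out, op)
    return out

n = 3  # toy system size
# A small set of 1-local Pauli strings P_i (here m = 2n).
paulis = [kron_all([Z if j == i else I2 for j in range(n)]) for i in range(n)]
paulis += [kron_all([X if j == i else I2 for j in range(n)]) for i in range(n)]

# A toy local Hamiltonian H and a constant evolution time tau.
H = sum(0.5 * P for P in paulis)
tau = 1.0
evals, evecs = np.linalg.eigh(H)  # H is Hermitian
U = evecs @ np.diag(np.exp(-1j * tau * evals)) @ evecs.conj().T  # U = e^{-iH tau}

alpha = rng.uniform(-1, 1, size=len(paulis))   # unknown concept parameters
O = sum(a * P for a, P in zip(alpha, paulis))  # O(alpha) = sum_i alpha_i P_i

def f_alpha(x_bits):
    """f^alpha(x) = Tr[rho_H(x) O(alpha)], rho_H(x) = U |x><x| U^dagger."""
    idx = int("".join(str(int(b)) for b in x_bits), 2)
    psi = np.zeros(2 ** n, dtype=complex)
    psi[idx] = 1.0
    phi = U @ psi
    return float(np.real(phi.conj() @ (O @ phi)))

# A training pair (x, y) whose label carries a bounded sampling error eps2.
eps2 = 0.01
x = rng.integers(0, 2, size=n)
y = f_alpha(x) + rng.uniform(-eps2, eps2)
```

At this small scale the exact simulation above is classically trivial; the learning separation only kicks in for the BQP-hard Hamiltonians of Theorem 1, where the quantum device is needed to produce the labels.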

In our learning task, the learning algorithm knows the observable only through its functional form, mapping α to O(α). Any additional information about O(α) can be derived exclusively from the training samples in \({{\mathcal{T}}}_{{\epsilon }_{2}}^{\alpha }\). In particular, the vector α, which defines the specific concept in the concept class, is unknown to the learning algorithm. While the classical hardness of the learning problem relies on considering a specific α for which we prove the evaluation of fα(x) to be hard, the quantum algorithm will use a LASSO regression to infer a parameter \(w\in {[-1,1]}^{m}\) close to the target α so that condition (4) is satisfied.
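The quantum learner’s strategy can be sketched as follows: a quantum device estimates the features \({\phi }_{i}({x}_{\ell })\approx \,\mathrm{Tr}\,[{\rho }_{H}({x}_{\ell }){P}_{i}]\), after which recovering w close to α reduces to an ℓ1-regularized linear regression. In the Python sketch below the feature matrix is synthetic (since producing it is precisely the classically hard step), and a simple iterative soft-thresholding (ISTA) loop acts as a stand-in for any LASSO solver:

```python
import numpy as np

rng = np.random.default_rng(2)
N, m = 200, 10  # training-set size and number of Pauli terms (toy values)

# Phi[l, i] stands in for a quantum-estimated Tr[rho_H(x_l) P_i];
# here it is synthetic data, generated classically only for illustration.
Phi = rng.uniform(-1, 1, size=(N, m))
alpha = np.zeros(m)
alpha[:3] = [0.8, -0.5, 0.3]                        # sparse unknown target
y = Phi @ alpha + rng.uniform(-0.01, 0.01, size=N)  # eps2-noisy labels

def lasso_ista(Phi, y, lam=1e-3, steps=5000):
    """Minimize (1/2N)||Phi w - y||^2 + lam ||w||_1 by iterative
    soft-thresholding (ISTA) -- a stand-in for any LASSO routine."""
    N = len(y)
    L = np.linalg.norm(Phi, 2) ** 2 / N  # Lipschitz constant of the gradient
    w = np.zeros(Phi.shape[1])
    for _ in range(steps):
        grad = Phi.T @ (Phi @ w - y) / N
        z = w - grad / L
        w = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return w

w = lasso_ista(Phi, y)
```

The learned model then labels a new input x as h(x) = ⟨w, φ(x)⟩, where the feature vector φ(x) is again estimated on the quantum device.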

We emphasize that our learning definition demands that the trained model be able to label new points. This stands in contrast to other settings, e.g., Hamiltonian learning problems, where the task is to identify the vector α. In the former case, a learning advantage is more easily established, as explained in ref. 11, where the hardness of evaluating versus identifying a concept was discussed. In Section Relationship to Hamiltonian learning, we explore this difference further and present an example of an identification problem, closely related to the concept class of Eq. (3), which can indeed be solved by a classical algorithm.

We can now state the first result of this paper, namely the existence of a concept class for the time-evolution learning problem learnable by a quantum algorithm but for which no classical algorithm can meet the learning condition of Eq. (4).

Theorem 1

(Learning advantage for the time-evolution problem). For any \({\text{BQP}}\)-complete language, there exists a Hamiltonian Hhard and a set of observables {O(α)}α such that no classical algorithm can efficiently solve the time-evolution learning problem, formalized by the concept class \({{\mathcal{F}}}_{evolved}^{{H}_{hard},O}\), in the sense of Def. 4, unless \({\text{BQP}}\subseteq {\rm{P}}/{\text{poly}}\). However, there exists a quantum algorithm which learns \({{\mathcal{F}}}_{evolved}^{{H}_{hard},O}\) under any input distribution \({\mathcal{D}}\).

We present the proof of Theorem 1 in Methods, while the full proof can be found in Supplementary Information.

Generalizations: quantum advantages for fixed inputs and for ground state problems

We showed that the concept class defined in Eq. (3) leads to a learning advantage. In this case, the physical problem modeled by \({{\mathcal{F}}}_{evolved}^{H,O}\) assumes the initial input states are drawn from an underlying distribution, while there is precise control over the Hamiltonian governing their evolution. This raises the question of whether learning advantages can be extended to other scenarios, modeled by cases with limited control over the Hamiltonian. We argue that the following scenarios can represent practical examples in the Section Discussion.

A first example that naturally arises from the case discussed above is when a fixed initial state is prepared and time-evolved under a Hamiltonian from a fixed family of Hamiltonians, over which there is no full control. Specifically, the Hamiltonians in such a family are labeled by some input bitstring x, which, for example, could parameterize the strength of the coupling interactions. Each initial state is then evolved by a different Hamiltonian in the family according to the input x, which comes from an unknown underlying distribution. It is easy to see that the mathematical description of such a learning problem is again given by the concept class in Eq. (3). The only change is in the definition of ρH(x), which now reads \({\rho }_{H}(x)=U(x)| 0\rangle \langle 0| U{(x)}^{\dagger }\), with H(x) the Hamiltonian associated to the quantum circuit U(x) by the Feynman construction18. A learning advantage exists for such a concept class as well, as the general definition of a language \({\mathcal{L}}\) in BQP implies the existence of a family of circuits {U(x)}x which correctly decides membership in \({\mathcal{L}}\) for every x.

As the next example, we consider predicting ground state properties of local Hamiltonians. Here, the states to be measured are the ground states of local Hamiltonians, rather than time-evolved quantum states. Specifically, ground states of input k-local Hamiltonians, which belong to a family of Hamiltonians, are prepared (Section Discussion discusses when this is feasible on a quantum computer). However, there is no full control over the coupling parameters, which are instead random values drawn from an underlying distribution. This situation is close, for example, to the case of the random Ising19 or random Heisenberg model20. The mathematical formalization of such a learning problem is the following. Consider a family of parametrized local Hamiltonians \({\mathcal{H}}=\{H(x)\,\,| \,\,x\in {\{0,1\}}^{n}\}\). Let us define the concept class for the ground state learning problem similarly to the time-evolution case of Eq. (3), where now the unknown observable O(α) is measured on the states ρH(x), which correspond to the ground states of the Hamiltonians \(H(x)\in {\mathcal{H}}\), then we define:

$${{\mathcal{F}}}_{{\rm{g}}.{\rm{s}}.}^{{\mathcal{H}},O}=\{{f}^{\alpha }(x)\in {\rm{{\mathbb{R}}}}\,\,| \,\alpha \in {[-1,1]}^{m}\}$$
(5)
$$\begin{array}{rcl} & \mathrm{with}: & {f}^{\alpha }(x):\,\,\,x\in {\{0,1\}}^{n}\to {f}^{\alpha }(x)=\,{\text{Tr}}\,[{\rho }_{H}(x)O(\alpha )]\\ & & O(\alpha )=\mathop{\sum }\limits_{i=1}^{m}{\alpha }_{i}{P}_{i}.\end{array}$$

Considering the training data \({{\mathcal{T}}}_{{\epsilon }_{2}}^{\alpha }={\{({x}_{\ell },{y}_{\ell }{\approx }_{{\epsilon }_{2}}{f}^{\alpha }({x}_{\ell }))\}}_{\ell =1}^{N}\), the learning condition remains the same as in Definition 4. From the hardness result of Lemma 4, we obtain the following theorem.

Theorem 2

(Learning advantage for the ground state learning problem). For any \({\text{BQP}}\)-complete language there exists a family of Hamiltonians \({{\mathcal{H}}}_{{\rm{h}}{\rm{a}}{\rm{r}}{\rm{d}}}\) such that no classical algorithm can learn the ground state problem, formalized by learning the concept class \({{\mathcal{F}}}_{{\rm{g}}.{\rm{s}}.}^{{{\mathcal{H}}}_{{\rm{h}}{\rm{a}}{\rm{r}}{\rm{d}}},O}\) in the sense of Def. 4, unless \({\text{BQP}}\subseteq {\rm{P}}/{\text{poly}}\). However, there exists a quantum algorithm that learns \({{\mathcal{F}}}_{{\rm{g}}.{\rm{s}}.}^{{{\mathcal{H}}}_{{\rm{h}}{\rm{a}}{\rm{r}}{\rm{d}}},O}\) under any input distribution \({\mathcal{D}}\).

The proof of Theorem 2 is in Methods.

It is important to note that although our separation results, both for the time-evolution and the ground state versions of the problem, are proven using Kitaev’s Hamiltonians, analogous results hold for a broader and more physical class of Hamiltonians. Specifically, since we rely on the assumption that \({\text{BQP}}\, \nsubseteq \,{\rm{P}}/{\text{poly}}\), any quantum process that cannot be simulated by polynomial-sized circuits gives rise to a learning problem of the kind we introduced, with a provable quantum learning advantage. In the Section Discussion, we list examples of physical Hamiltonians to which our results apply, covering both the time-evolution and ground state versions of the problem.

We note that the ground state learning problem defined by the class in Eq. (5) closely resembles the machine learning problem studied in refs. 9 and 10, where the authors demonstrated that a classical algorithm could solve the task, provided the Hamiltonian has a constant gap. Since the Hamiltonians considered in our results have a polynomially decaying gap, this imposes a constraint on the family of local Hamiltonians \({\mathcal{H}}\) that one would need to consider in order to prove a quantum advantage in learning.

As a final remark, it is interesting to observe that if we consider the case of a fixed input state, which could be a fixed initial state \({\rho }_{0}(x)=| x\rangle \langle x|\) in the time-evolution scenario modeled by the concept class in Eq. (3) or a single ground state ρH(x) of a fixed Hamiltonian H(x) in the ground state problem, then the learning problem becomes trivially classically easy. Such a learning scenario is formally equivalent to considering a “flipped” version of the concepts defined above. Consider, for example, the task of learning an unknown quantum process from many measurements: the learning problem is still modeled by the concept class \({{\mathcal{F}}}_{{\rm{e}}{\rm{v}}{\rm{o}}{\rm{l}}{\rm{v}}{\rm{e}}{\rm{d}}}^{H,O}\) of Eq. (3), with the difference that the roles of x and α are now swapped. Namely, the concepts are defined as fx(α), so that their expressions remain the same as fα(x); the difference lies in the labeling, where x now denotes the concept while α represents the input vectors. As x is constant for an instance of the learning problem, specified by the concept that generates the data, the training samples are measurements of the same quantum state ρH(x) with different observables corresponding to different α. Since O(α) = ∑jαjPj, we can use the data to solve a linear system and obtain the expectation values of each local Pauli string \(\,\mathrm{Tr}\,[{\rho }_{{H}_{\mathrm{hard}}}(x){P}_{i}]\). It then becomes easy to extrapolate the value of \(\,\mathrm{Tr}\,[{\rho }_{{H}_{\mathrm{hard}}}(x)O(\alpha )]\) for every new α.
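The extrapolation argument in this fixed-state (“flipped”) case amounts to ordinary linear algebra, as the following Python sketch illustrates. All numbers are synthetic: the hidden vector t plays the role of the Pauli expectations \(\,\mathrm{Tr}\,[{\rho }_{H}(x){P}_{i}]\) of the single fixed state:

```python
import numpy as np

rng = np.random.default_rng(3)
m = 6  # number of local Pauli strings (toy value)

# Hidden quantities t_i = Tr[rho_H(x) P_i] for the single, fixed state rho_H(x).
t = rng.uniform(-1, 1, size=m)

# Training data: the same state measured with different observables O(alpha_j),
# giving labels y_j = sum_i alpha_{j,i} t_i.
A = rng.uniform(-1, 1, size=(m + 4, m))  # the alpha vectors, stacked as rows
y = A @ t

# Solve the linear system for the Pauli expectation values ...
t_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

# ... and extrapolate Tr[rho_H(x) O(alpha)] to any new alpha.
alpha_new = rng.uniform(-1, 1, size=m)
prediction = alpha_new @ t_hat
```

With noiseless labels, polynomially many linearly independent α vectors suffice to recover every \(\,\mathrm{Tr}\,[{\rho }_{H}(x){P}_{i}]\) exactly, which is why this variant of the problem carries no quantum advantage.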

Generalization of the quantum learning mechanism to quantum kernels

In ref. 3, considerable effort was invested in constructing a task with a provable learning speed-up where the quantum learning model is somewhat generic (related to natural QML models studied in the literature) while still being capable of learning a classically unlearnable task. In our construction, it is possible to view the entire learning process as a quantum kernel setting, similarly to the approach in ref. 21. However, in standard kernel approaches, especially those stemming from support vector machines, the optimization is solved in the dual formulation, where the hypothesis function is expressed as a linear combination of kernel functions. In contrast, the LASSO optimization employed here solves the problem in the primal form with a constraint on the \({\ell }_{1}\) norm; consequently, our learner is not, strictly speaking, a quantum kernel method. While the LASSO formulation does not straightforwardly convert into a kernel method, one could attempt to address the regression problem outlined in Equation (9) using an alternative optimization approach that admits a kernel solution, such as kernel ridge regression. In this case, the optimization is performed over vectors with bounded \({\ell }_{2}\) norm; bounds on the generalization performance of such procedures exist as well13.
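As a sketch of the kernel-based alternative mentioned above, the following Python snippet runs kernel ridge regression in the dual form with the linear kernel \(k(x,{x}^{{\prime} })=\langle \phi (x),\phi ({x}^{{\prime} })\rangle\). As before, the feature vectors are synthetic stand-ins for quantum-estimated Pauli expectations:

```python
import numpy as np

rng = np.random.default_rng(4)
N, m = 100, 8  # training-set size and feature dimension (toy values)

# Phi[l] stands in for the quantum-estimated feature vector
# (Tr[rho_H(x_l) P_1], ..., Tr[rho_H(x_l) P_m]); synthetic here.
Phi = rng.uniform(-1, 1, size=(N, m))
alpha = rng.uniform(-1, 1, size=m)
y = Phi @ alpha  # noiseless labels for simplicity

# Kernel ridge regression in the dual: k(x, x') = phi(x) . phi(x').
lam = 1e-6
K = Phi @ Phi.T                              # Gram matrix of the training set
c = np.linalg.solve(K + lam * np.eye(N), y)  # dual coefficients

def predict(phi_new):
    """h(x') = sum_l c_l k(x_l, x'): the hypothesis is a linear combination
    of kernel functions, as in standard dual kernel methods."""
    return c @ (Phi @ phi_new)

phi_test = rng.uniform(-1, 1, size=m)
```

The design choice is exactly the one discussed in the text: the hypothesis lives in the span of kernel evaluations at the training points (dual form), in contrast to the primal, \({\ell }_{1}\)-constrained LASSO formulation.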

Generalization to observables parametrized by unitaries

We have demonstrated the existence of a quantum advantage for the learning problem of predicting k-local observables on time-evolved states and on ground states. Critically, we considered observables of the type O(α) = ∑iαiPi and exploited the linear structure of O(α) to ensure quantum learnability through LASSO regression. It is, however, natural to ask whether quantum learnability can be achieved for other types of observables while maintaining the classical hardness. In this section we consider the far more general case where the unknown observable is parameterized through a unitary matrix, i.e., \(O(\alpha )=W(\alpha )O{W}^{\dagger }(\alpha )\), where O is a Hermitian matrix. Our findings in this scenario are of two kinds. First, we show as a general result that for every method that learns a unitary W given query access on a known distribution of input quantum states, there exists a learning problem that exhibits a quantum-classical learning advantage. We then make this general result concrete by presenting an explicit example of a learning problem, defined by a class of unitary-parameterized observables, that showcases a provable speed-up. Before stating our findings, let us properly introduce the learning problem under consideration. Imagine a scenario where observables are measured on evolved quantum states, but there is no control over the entire evolution, a portion of which remains unknown. Specifically, measurements are taken on input-dependent states \(| \psi (x)\rangle =W(\alpha )U(x)| 0\rangle\) for an unknown fixed α. The goal is to predict expectation values on states \(| \psi ({x}^{{\prime} })\rangle\) for new inputs \({x}^{{\prime} }\). Concretely, such a scenario corresponds to the following concept class:

$${{\mathcal{M}}}_{U,W,O}=\{{f}^{\alpha }(x)\in {\rm{{\mathbb{R}}}}\,\,| \,\,\alpha \in {[-1,1]}^{m}\}$$
(6)
$$\begin{array}{rcl} & \mathrm{with}: & {f}^{\alpha }(x):\,x\in {\mathcal{X}}\subseteq {\{0,1\}}^{n}\to \,{\text{Tr}}\,[{\rho }_{U}(x)O(\alpha )]\in {\rm{{\mathbb{R}}}}\\ & & O(\alpha )=W(\alpha )O{W}^{\dagger }(\alpha ).\end{array}$$

where \({\rho }_{U}(x)=U(x)| 0\rangle \langle 0| {U}^{\dagger }(x)\) and O is a Hermitian matrix. Given as training data \({{\mathcal{T}}}^{\alpha }={\{({x}_{\ell },{y}_{\ell })\}}_{\ell =1}^{N}\) with \({\mathbb{E}}[{y}_{\ell }]=\,\mathrm{Tr}\,[{\rho }_{U}({x}_{\ell })O(\alpha )]\), the goal of the learning algorithm is again to satisfy the learning condition of Def. 4. We are now ready to state the main general result of this section:

Theorem 3

(Informal version). Every (non-adaptive) learning algorithm \({{\mathcal{A}}}_{W}\) for learning a unitary \(W(\alpha )\in {\{W(\alpha )\}}_{\alpha }\), where the probe states \({\{| {\psi }_{\ell }\rangle \}}_{\ell }\) and observables \({\{{Q}_{m}\}}_{m}\) come from discrete sets \(S={\{| {\psi }_{\ell }\rangle \}}_{\ell }\) and \(Q={\{{Q}_{m}\}}_{m}\) (or can be discretized with controllable error), induces a classical-input, classical-output learning problem with a quantum-classical learning advantage.

We note that in the theorem, “non-adaptive” means that the algorithm probes the unitary with input states and measured observables that are chosen in advance, i.e., independently of any previous outcomes.

In other words, Theorem 3 guarantees that whenever there is an efficient method to learn a unitary from query access on a set of arbitrary input states and observables, it is also possible to construct a learning problem, described by the concept class in Eq. (6), which exhibits a learning speed-up. We formalize the statement of Theorem 3 and provide a rigorous proof in the Supplementary Information. The nontrivial part of our result lies in the fact that in the majority of works in the literature that provide efficient algorithms for learning a unitary, the set of required probe states \({\{| {\psi }_{\ell }\rangle \}}_{\ell }\) is restricted to specific quantum states, often classically describable ones (e.g., stabilizer states). On such a set of states, our hardness results from the previous section cannot be applied directly, as a classical learning algorithm could simply prepare those states just as the quantum algorithm would.

The detailed proof of Theorem 3 can be found in the Supplementary Information, while the main ideas are outlined in Methods.
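For intuition, the following Python sketch evaluates a single concept of the class in Eq. (6) on a toy 2-qubit instance. Haar-random unitaries serve as hypothetical stand-ins for U(x) and W(α); in the actual construction these are structured circuits, not random matrices:

```python
import numpy as np

rng = np.random.default_rng(5)

def haar_unitary(d, rng):
    """Sample a Haar-random unitary via QR of a complex Gaussian matrix."""
    G = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    Q, R = np.linalg.qr(G)
    return Q * (np.diag(R) / np.abs(np.diag(R)))  # fix the column phases

n = 2  # toy number of qubits
d = 2 ** n
U_x = haar_unitary(d, rng)  # hypothetical stand-in for the known circuit U(x)
W_a = haar_unitary(d, rng)  # hypothetical stand-in for the unknown W(alpha)
O = np.diag([1.0, -1.0, 1.0, -1.0]).astype(complex)  # a Hermitian observable

# f^alpha(x) = Tr[rho_U(x) O(alpha)], with rho_U(x) = U(x)|0><0|U(x)^dagger
# and O(alpha) = W(alpha) O W(alpha)^dagger, as in Eq. (6).
psi0 = np.zeros(d, dtype=complex)
psi0[0] = 1.0
phi = W_a.conj().T @ (U_x @ psi0)  # move W(alpha)^dagger onto the state
value = float(np.real(phi.conj() @ (O @ phi)))
```

Since the observable O above has eigenvalues ±1, every concept value lies in [−1, 1], matching the label set assumed throughout.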

Learning advantages for shallow unitaries

Theorem 3 presents a method for constructing learning problems that demonstrate a provable quantum advantage in learning unitarily parametrized observables of the form \(O(\alpha )=W(\alpha )O{W}^{\dagger }(\alpha )\). This result crucially relies on the existence of a learning scheme capable of recovering the unitary W(α) from output measurements performed on an arbitrary set of input states. As a concrete example where such a learning scheme is known to exist, and thus our result applies, we consider the case where the family of unitaries {W(α)}α consists of shallow circuits. In this case, recent results in the literature22 provide a procedure to learn few-body observables from measurements on single-qubit stabilizer states. As shallow circuits preserve the locality of the measured observable, we can directly apply the result of ref. 22 to construct a quantum learning algorithm for the concept class in Eq. (6), where the concepts correspond to observables \(O(\alpha )=W(\alpha )O{W}^{\dagger }(\alpha )\) with W(α) a shallow unitary. In particular, we prove the following corollary of Theorem 3.

Corollary 1

(Learning advantage for shallow unitaries). There exist a family of parametrized unitaries {U(x)}x and parametrized shallow circuits {W(α)}α, a measurement O, and a set of distributions \({\{{{\mathcal{D}}}_{i}\}}_{i}\) over \(x\in {\{0,1\}}^{n}\) such that the concept class \({{\mathcal{M}}}_{U,W,O}\) is not classically learnable with respect to the set of input distributions \({\{{{\mathcal{D}}}_{i}\}}_{i}\), unless \({\text{BQP}}\subseteq {\rm{P}}/{\text{poly}}\). However, there exists a quantum algorithm which learns \({{\mathcal{M}}}_{U,W,O}\) on the input distributions \({\{{{\mathcal{D}}}_{i}\}}_{i}\).

Relationship to Hamiltonian learning

To further delineate the boundary between settings with and without classical/quantum learning advantages, we can establish a connection between the task of learning Hamiltonians from time-evolved quantum states and the task of learning unitarily parametrized observables. The setting of the Hamiltonian learning problem is the following. Given an unknown local Hamiltonian of the form H(λ) = ∑_i λ_i P_i, the objective of the Hamiltonian learning procedure is to recover λ. In the version of the problem we consider, we are given a black box which implements the time-evolution under the unknown Hamiltonian H(λ) on any arbitrary quantum state. The black-box action on an arbitrary input ρ is given by

$$\rho \to U\rho {U}^{\dagger }$$
(7)

with U = e^{itH(λ)}, where the evolution time t is known to us. In ref. 23, the authors provide a classical algorithm that learns the unknown λ from the expectation values of the Pauli strings P_i on polynomially many copies of the evolved state UρU†, for particular choices of initial states ρ. We can rephrase the Hamiltonian learning task in terms of a learning problem with many similarities to the ones we considered before. Concretely, we define the corresponding concept class
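As a minimal numerical sketch of this black-box setting (a toy 2-qubit instance with assumed Pauli terms, not the algorithm of ref. 23), one can implement ρ → UρU† with U = e^{itH(λ)} and read out the Pauli expectations that a Hamiltonian learner would consume:

```python
import numpy as np

# Toy 2-qubit instance (the Pauli terms and parameters are our assumptions,
# purely for illustration): build H(lambda) = sum_i lambda_i P_i, apply the
# black box rho -> U rho U^dag with U = e^{itH}, and collect <P_i>.

I2 = np.eye(2)
X = np.array([[0., 1.], [1., 0.]])
Z = np.diag([1., -1.])

paulis = [np.kron(Z, I2), np.kron(I2, Z), np.kron(X, X)]  # assumed local P_i

def black_box(lam, t, rho):
    """rho -> U rho U^dag with U = e^{itH(lam)}, H(lam) = sum_i lam_i P_i."""
    H = sum(l * P for l, P in zip(lam, paulis))
    evals, V = np.linalg.eigh(H)                   # H is Hermitian
    U = (V * np.exp(1j * t * evals)) @ V.conj().T  # e^{itH} via spectral decomposition
    return U @ rho @ U.conj().T

def pauli_expectations(rho):
    """Ideal (infinite-shot) expectation values Tr[rho P_i]."""
    return [float(np.real(np.trace(rho @ P))) for P in paulis]

# Example: evolve |00><00| and collect the data a learner would receive.
psi0 = np.zeros(4); psi0[0] = 1.0
rho0 = np.outer(psi0, psi0)
data = pauli_expectations(black_box([0.3, -0.5, 0.7], 0.1, rho0))
```

In practice these expectation values would be estimated from finitely many measured copies; the exact traces above stand in for those estimates.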

$${{\mathcal{F}}}_{\lambda }=\{{f}^{\lambda }(x)\in {\rm{{\mathbb{R}}}}\,\,| \,\,\lambda \in {[-1,1]}^{m}\}$$
(8)
$$\begin{array}{rcl} & \mathrm{with}: & {f}^{\lambda }(x):\,x\in {\mathcal{X}}\subseteq {\{0,1\}}^{n}\to \,{\text{Tr}}\,[\rho (x)O(\lambda )]\\ & & O(\lambda )=\mathop{\sum }\limits_{i=1}^{m}U{(\lambda )}^{\dagger }{P}_{i}U(\lambda ).\end{array}$$

where x describes the input state ρ(x), and U = e^{itH(λ)}. We note that the specific choice of the observable \(O(\lambda )={\sum }_{i=1}^{m}U{(\lambda )}^{\dagger }{P}_{i}U(\lambda )\) is made to align with the work in ref. 23, where the learning algorithm solves the task by directly measuring the Pauli operators \({\{{P}_{i}\}}_{i}\) after evolving the input states with U = e^{itH(λ)}. In this sense, by considering \({{\mathcal{T}}}^{\lambda }={({x}_{\ell },{\text{Tr}}[\rho ({x}_{\ell })O(\lambda )])}_{\ell }\) as training data, the connection to the Hamiltonian learning task of ref. 23 becomes evident if we demand that the learning algorithm needs only to identify, rather than evaluate, the correct concept that generated the data. It is worth noting that the concept class \({{\mathcal{F}}}_{{\boldsymbol{\lambda }}}\) closely resembles the concept class for the time-evolution problem described earlier, for which a learning advantage was demonstrated. The main difference is that in \({{\mathcal{F}}}_{{\boldsymbol{\lambda }}}\) the unknown observable is no longer a linear combination of Pauli strings; instead, each Pauli string in O(λ) is parametrized by a unitary. As we are required only to identify the correct concept, and considering that the authors in ref. 23 provide an algorithm capable of solving the Hamiltonian learning problem in polynomial time (for sufficiently small t), one might suppose that the classical difficulty associated with the learning advantage presented in this work arises solely from the challenge of evaluating the correct concept rather than from the task of identifying it. This would answer another question left open in ref. 11, at least in the case of concepts labeled by observables with an unknown unitary parametrization. However, it must be noted that for the Hamiltonian learning algorithm in ref. 23 to work, the input states ρ(x) in Eq. (8) are of a very particular and simple form and do not come from a distribution \({\mathcal{D}}\) of \({\text{BQP}}\)-complete quantum states \({\rho }_{{H}_{\mathrm{hard}}}(x)\) as we assumed in our classical hardness result. For general input distributions, the hardness of identification thus remains elusive. In this regard, it is important to note that the task of identification can take various forms. For instance, here we are focusing on identifying the concept within the concept class that generated the data, and in this case it is generally unclear when this setting allows a classically efficient solution. In a more general scenario, one could ask whether the learning algorithm is able to identify a model from a hypothesis class that differs from the concept class, but that is still capable of achieving the learning condition. Colloquially, here the question becomes: can traditional machine learning determine which quantum circuit could accurately label the data, even though classical computers lack the capability to evaluate such quantum circuits? In ref. 11, it was proven that in this broader context, provable speed-ups in identification are no longer possible. This is because the classical algorithm can successfully identify a sophisticated quantum circuit that carries out the entire quantum machine learning process as its subroutine, thereby delegating the learning aspect to the labeling function itself, prior to evaluating a data point. Thus, the task of identification is indeed only interesting when the hypothesis class is somehow restricted, which is often the case in practically relevant learning scenarios.

Discussion

In this paper, we propose a machine learning task that demonstrates a quantum learning advantage under the assumption that quantum processes cannot be simulated by polynomial-size classical circuits (i.e., \({\text{BQP}}\, \nsubseteq \,{\rm{P}}/{\text{poly}}\)). The learning problem involves reproducing a partially unknown observable from classical, measured-out data. While classical learning theory guarantees the computational hardness of the task for input states originating from certain quantum processes, we have developed a quantum learning algorithm capable of solving it for any input distribution. In this discussion, we consider whether this abstract learning problem can reasonably model physically relevant scenarios. We identify three key components for our framework: firstly, the preparation of a “hard” quantum state from a large family, indeed ensuring that U specifies a classically-hard but quantum realizable operation (see below for details); secondly, the heralded yet uncontrolled nature of the selection of U(x), reflecting the fact that data is generated under an unknown distribution over x in the PAC learning framework; and finally, the task of learning unknown observables itself. We next discuss the extent to which these three criteria correspond to natural learning tasks.

Firstly, while in our proof we deal with the contrived Kitaev Hamiltonian, this is only for technical simplicity. In general, all that is required is a scenario where a family of parametrized “hard” states (\(| \psi (x)\rangle =U(x)| 0\rangle\)) is measured, and where there is strong evidence to suggest that computing the map \(x\mapsto \langle \psi (x)| O| \psi (x)\rangle\) (where x is the description of the state in the family) is beyond the capabilities of classical polynomial-size circuits (i.e., not in \({\text{P}/\text{poly}}\) for the decision variant of the problem). Time-evolution scenarios (where U(x) represents the time-evolution of a time-dependent or independent Hamiltonian) and ground-state scenarios (where \(U(x)| 0\rangle\) is the desired ground state) are natural candidates for such state preparation. To provide the strongest arguments that the resulting map is also classically hard, it is useful to work with Hamiltonians known to be computationally hard (i.e., \(\text{BQP}\)- or \({\text{QMA}}\)-hard in the context of time-evolution or ground-state problems, respectively). With these conditions in mind, we can still identify numerous previously recognized families of physical Hamiltonians that satisfy these criteria. For time-evolution, we list a few known cases of \({\text{BQP}}\)-complete Hamiltonians and otherwise specified evolutions which give rise to \({\text{BQP}}\)-complete problems: various variants of the Bose-Hubbard model24, stoquastic Hamiltonians25, ferromagnetic Heisenberg model24, the XY model26, estimating the scattering probabilities in massive field theories27, (more precisely, estimating the vacuum-to-vacuum transition amplitude, in the presence of spacetime-dependent classical sources, for a massive scalar field theory in (1 + 1) dimensions), simulation of topological quantum field theories28.

For the ground state version, \({\text{QMA}}\) hardness results are known for the following Hamiltonians (the first two examples are actually in \(\text{MA}\), a subset of \({\text{QMA}}\)): stoquastic local Hamiltonians29, Bose-Hubbard with negative hoppings29, Bose-Hubbard with positive hoppings30, antiferromagnetic Heisenberg model26, the XY model26, electronic structure problem31 for quantum chemistry, certain supersymmetric quantum mechanical theories32.

We note that the list above is, to our knowledge, not exhaustive. Focusing on the ground state setting, the computational hardness involved in preparing these states is reflected in our quantum learning algorithm, however, we do not imply that either nature (in the realization of the dataset) or our quantum algorithm is solving \({\text{QMA}}\)-hard problems. In practical scenarios, any \({\text{QMA}}\) problem can include instances that are solvable within \({\text{BQP}}\) (as indeed \({\text{BQP}}\) is contained in \({\text{QMA}}\), and so all \({\text{BQP}}\) problems correspond to some instances of \({\text{QMA}}\)-hard problems). For those distributions over the inputs, which are arguably the ones realized in nature, there can be efficient quantum algorithms. Specifically, while the Hamiltonian families listed above define a broader class of physically relevant systems, the instances encountered in real-world scenarios typically fall within easier subclasses that are likely to involve \({\text{BQP}}\)-hard problems. This aligns well with our framework. An example of a setting where the quantum learning algorithm is easy to specify occurs in systems where classical approximations of ground states (such as mean-field solutions, matrix product approximations, etc.) are not exponentially far from the true ground states. These systems, while belonging to \({\text{QMA}}\)-like families, could result in \({\text{BQP}}\)-complete problems (by virtue of the guided Hamiltonian problem construction33). Other examples of processes efficiently simulable on quantum computers involve Hamiltonians connected by an inverse-polynomially gapped adiabatic path, which can occur in critical Hamiltonians or near phase transition boundaries, with the appropriate scaling of system size. In all of these cases, our approach not only provides a natural learning scenario but also an efficient quantum learning algorithm that effectively leverages a guiding state.
Finally, we note that implementing \({\text{BQP}}\)-complete processes would require fault-tolerant quantum computers, so they are unlikely to be feasible in the near term. However, as we explain later, any computation achievable on near-term devices that is not in \({\rm{P}}/{\text{poly}}\) (and, for example, for sampling problems, this is widely believed to be attainable) can serve as the core of our constructions and still provide provable guarantees. In fact, as we transition to practical applications, it is important to note that the emphasis on \({\text{BQP}}\)-complete problems is primarily relevant when aiming for formal proofs of asymptotic advantages. In particular, our separation results hold for any quantum process which maps x → ρ(x) and for which there exists an observable O such that the function \(x\to tr[\rho (x)O]\) is outside \({\rm{P}}/{\text{poly}}\), thus contained in the class \({({\rm{P}}/{\text{poly}})}^{{\rm{c}}}\cap {\text{BQP}}\) (\({({\rm{P}}/{\text{poly}})}^{{\rm{c}}}\) indicates the complement of the class \({\rm{P}}/{\text{poly}}\)). Many systems of interest may not be \({\text{BQP}}\)-complete but are likely beyond the reach of classical poly-sized circuits. Potential examples may include problems in glassy systems, Hamiltonians related to exotic highly correlated states of matter (e.g., spin liquids), highly correlated molecules (such as those involving heavier metals), and certain areas of high-energy physics, like quantum chromodynamics. While many of these problems may not be strictly \({\text{BQP}}\)-complete, they may remain intractable for classical computers and could potentially be addressed by quantum ones, making them strong candidates for demonstrating quantum advantages in our learning task. We acknowledge, however, that the hardness of these last examples remains purely conjectural and has yet to be more precisely assessed.

As a second point, we discuss the assumption of the learning problems regarding the data distribution. In the formulation of our learning problems, we assume that the input Hamiltonians, or more generally the resulting states, are drawn from a distribution. While it may not be unusual to have no information about which Hamiltonian is acting on the system, our framework assumes that this information is also revealed in the data point x. Specifically, for our approach to work, we require both that the Hamiltonian is chosen randomly and that we are later informed about which Hamiltonian was selected, essentially heralding without control. We note that if the Hamiltonian were fully controlled, we would no longer be in a PAC learning scenario, but something closer to Angluin's query model34. While this setup would also be interesting, the learning objectives would differ. The scenarios considered in this paper can occur in situations where control is limited, but some type of partial (process) tomography allows the reconstruction of the implemented Hamiltonian afterward. For example, in fields like materials science and condensed matter physics, we frequently do not have full control over the conditions affecting a sample, but we can analyze it afterward. This is particularly true in cases like material doping or the introduction of nitrogen-vacancy (NV) centers. In condensed matter systems, doping introduces additional electrons or holes into a material. However, the precise concentration and spatial distribution of dopants are challenging to control at the atomic scale, leading to variability in the resulting electronic structure35. Similarly, lattice imperfections, such as strain, dislocations, or defects, perturb key Hamiltonian parameters like hopping amplitudes and interaction strengths, often in unpredictable ways36. Impurities, including unintentional substitutions or vacancies, modify on-site energies and introduce disorder37.
Collectively, these factors result in effective Hamiltonians that deviate from the intended design, necessitating a statistical approach to account for variability. Such challenges are not unique to condensed matter systems. In quantum optics, heralded operations used to generate entangled photon pairs rely on probabilistic detection events, meaning the effective realized quantum operation depends on external factors such as photon loss or background noise. Similarly, in high-energy physics, the governing Hamiltonian of particle collisions is subject to environmental factors like background radiation or fluctuating particle flux, making precise control impractical. In both cases, experimentalists work within a probabilistic framework, relying on statistical inference rather than attempting direct control of the Hamiltonian. Our model reflects these realities by treating input Hamiltonians as drawn from an underlying distribution. This approach is well-suited to systems with intrinsic or practical variability, where repeated experiments often produce diverse realizations of effective Hamiltonians. By adopting a distributional perspective, we capture the statistical nature of Hamiltonian variability and address learning problems under realistic experimental constraints. Furthermore, since our work requires learnability for any input distribution, rather than focusing on specific distributions as in ref. 11, we more accurately address the case where the underlying distributions are unknown. This makes our framework further aligned with real-world settings than previous approaches.

Finally, we motivate the task of learning an unknown observable. There are many natural scenarios where the actual measurement performed is not fully characterized or is entirely unknown. One clear example is learning order parameters (as observables) based on, e.g., data labeled using indirect indicators, such as entanglement entropy, which are not simple linear functions of the state. A similar scenario occurs when the order parameter is known, but the corresponding observable cannot be directly implemented on the given device. In these cases, a learning model can infer a proxy for the true order parameter as a linear combination of observables that are feasible to implement. Additionally, since time evolution under a parametrized Hamiltonian can be considered part of the measurement in the case of unitarily parametrized observables, the scenario of partially unknown observables also includes situations where certain aspects of the dynamics are unknown, while the measurement itself remains fixed. Another relevant setting concerns experimental contexts where, due to incomplete device characterization, the measurements realized may not align with the intended ones, reflecting a partially unknown scenario. It is important to note that limiting the degree of ignorance in the setting corresponds to restricting or biasing the concept class structure. In this case, learnability still holds (regardless of the prior) along with classical hardness, as long as there remains at least one “hard” measurement within the concept class. The characterization and learning of observables have been extensively studied, precisely because they represent a plausible scenario, especially within the framework of quantum measurement tomography, which was first introduced in ref. 38. Specifically, in ref. 38 a setting where unknown observables emerge was identified, when a system interacts with external degrees of freedom before being measured by a fixed observable. 
This is particularly relevant in cases involving reservoirs or other mechanisms that induce losses and decoherence effects. Subsequent works, such as ref. 39, used the task of learning unknown measurements to characterize experimental detectors. This task is especially pertinent in quantum computing, where measurement devices are often imperfect due to noise, making it crucial to understand and mitigate the effects of these imperfections on quantum measurements.

Combining meaningful settings from these three classes of assumptions results in more realistic learning tasks that can be captured by our abstract learning framework.

Methods

The proofs of all theorems are available in the Supplementary Information, but we provide high-level ideas in this section.

Theorem 1 - classical hardness

Our classical hardness results are based on the assumption \({\text{BQP}}\, \nsubseteq \,{\rm{P}}/{\text{poly}}\). As previously stated, this differs from previous works11, where classical intractability was established by considering the distributional version of complexity classes. Specifically, the results in ref. 11 explicitly assume the existence of an input distribution \({{\mathcal{D}}}_{{\rm{h}}{\rm{a}}{\rm{r}}{\rm{d}}}\) for which no classical algorithm can even approximate, in the sense of Eq. (1), a target function that computes \({\text{BQP}}\)-complete languages. In contrast, this work considers the weaker assumption that \({\text{BQP}}\)-hard languages cannot be correctly decided on every input by polynomial-sized classical circuits. While the assumption considered in ref. 11 implies the assumption considered in this paper, the reverse is not generally true. To achieve classical hardness through the more natural and better-studied assumption \({\text{BQP}}\, \nsubseteq \,{\rm{P}}/{\text{poly}}\), we require a stronger but arguably more natural learning condition in Def. 4 than the one present in ref. 11. Namely, we work in the distribution-free PAC framework and require the learning algorithm to succeed for any input distribution. In this way, we can use well-known results from classical learning theory that link the hardness of learning to the hardness of computing a target function. The full proof of classical hardness, as well as of quantum learnability, for the learning task can be found in the Supplementary Information. To prove classical hardness, it is enough to show that there exists a local Hamiltonian Hhard for which the concepts considered in our learning problem can decide \({\text{BQP}}\) languages. In fact, by the results in ref. 40 from classical learning theory, if a target concept is learnable in the sense of Def. 1, then there exists a polynomial-size classical circuit which efficiently evaluates it on every input point.
In particular, this implies that the concept can be computed in \({\rm{P}}/{\text{poly}}\). Therefore, if the concept class \({{\mathcal{F}}}_{evolved}^{{H}_{hard},O}\) is classically learnable, then \({\text{BQP}}\subseteq {\rm{P}}/{\text{poly}}\). As a final note, we observe that although our learning problem is defined over real-valued functions, in our hardness result, the specific Hamiltonian Hhard and the associated observables ensure that the target concept has a binary output.

Theorem 4

(Classical hardness of the time-evolution learning problem). For any \({\text{BQP}}\)-complete language, there exists a Hamiltonian Hhard such that no randomized polynomial-time classical algorithm Ac satisfies the learning condition of Def. 4 for the concept class \({{\mathcal{F}}}_{evolved}^{{H}_{hard},O}\), unless \({\text{BQP}}\subseteq {\rm{P}}/{\text{poly}}\).

Proof sketch

The detailed proof of Theorem 4 can be found in the Supplementary Information; here we give the overall proof strategy. Let \({\mathcal{L}}\) be an arbitrary \({\text{BQP}}\) language. Since \({\mathcal{L}}\in BQP\), there exists a quantum circuit UBQP which correctly decides every input bitstring x ∈ {0, 1}^n. As shown more rigorously in the Supplementary Information, measuring the Z operator on the first qubit of the state \({U}_{BQP}| x\rangle\) will output a positive or negative value depending on whether \(x\in {\mathcal{L}}\) or not. Therefore, for the observable \({O}^{{\prime} }=Z\otimes I\otimes \ldots \otimes I\), the quantum model \({f}^{{O}^{{\prime} }}(x)=\,\mathrm{Tr}\,[{O}^{{\prime} }{\rho }_{{U}_{BQP}}(x)]\), with \({\rho }_{{U}_{BQP}}(x)={U}_{BQP}| x\rangle \langle x| {U}_{BQP}^{\dagger }\), correctly decides every input x. Finally, using Feynman's idea18,41 it is possible to construct a local Hamiltonian that time-evolves the initial state \(| x\rangle\) into \({U}_{BQP}| x\rangle\) in constant time42. Denote this constructed Hamiltonian as Hhard. Then the concept \({f}^{{\alpha }^{{\prime} }}\), associated with the observable \(O({\alpha }^{{\prime} })={O}^{{\prime} }\), implements a \(\text{BQP}\) computation. The final step is a result in ref. 40, which guarantees that if a function is learnable under any distribution in the sense of Def. 1, then there exists a polynomial-size circuit which correctly evaluates it on every input x. We provide a more detailed explanation of this crucial result of ref. 40 and its consequences in the Supplementary Information. This concludes the proof: if \({{\mathcal{F}}}_{evolved}^{{H}_{hard},O}\) were learnable, then there would be a polynomial-size circuit which evaluates the concept \({f}^{{\alpha }^{{\prime} }}\), and thus any \({\text{BQP}}\) language could be decided in \({\text{P}/\text{poly}}\).
Since the existence of an efficient algorithm is formally stated as the existence of a uniform family of poly-sized circuits, the assumption \({\text{BQP}}\, \nsubseteq \,{\rm{P}}/{\text{poly}}\) implies that no classical machine-learning algorithm running in polynomial time can solve the learning task for the Hamiltonian Hhard. By contrast, in the next section, we introduce a quantum learning algorithm that solves the task and runs in polynomial time, therefore yielding an exponential advantage (in time complexity) over classical approaches.
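The sign-based read-out used in the proof sketch can be illustrated with a toy stand-in, where "U_BQP" is just a 2-qubit CNOT (emphatically not a BQP-complete computation; every object below is an assumption for illustration) and membership is decided by the sign of f(x) = Tr[O′ρ_U(x)]:

```python
import numpy as np

# Illustrative-only stand-in: a toy 2-qubit "U_BQP" (a plain CNOT, NOT a
# BQP-complete circuit); the point is only the read-out mechanism, where
# membership in the language is decided by the sign of
# f(x) = <x| U^dag O' U |x> with O' = Z (x) I on the first qubit.

I2 = np.eye(2)
Z = np.diag([1., -1.])
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=float)

U_toy = CNOT              # assumed placeholder for U_BQP
O_prime = np.kron(Z, I2)  # O' = Z (x) I, measured on the first qubit

def f(x_bits):
    """The expectation value read out after the circuit."""
    idx = int("".join(map(str, x_bits)), 2)
    psi = np.zeros(4); psi[idx] = 1.0
    phi = U_toy @ psi
    return float(np.real(phi.conj() @ (O_prime @ phi)))

def decide(x_bits):
    """Accept x iff the first-qubit Z expectation is positive."""
    return f(x_bits) > 0
```

For this toy circuit, acceptance simply tracks the first input bit; the proof replaces U_toy by a circuit family deciding a BQP-complete language.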

Theorem 1 - Quantum learnability

To establish the quantum learnability of the classically hard concept class constructed above for the time-evolution problem, we present a quantum algorithm directly. This algorithm accurately predicts any concept in the associated learning problem with high probability when provided with a sufficient number of training data samples. Crucially, we prove that the algorithm meets the learning condition of Def. 4 using only polynomially many training samples and running in polynomial time. The central idea involves leveraging the capability of the quantum algorithm to efficiently prepare the time-evolved states ρH(x) for input local Hamiltonians H and, in particular, for the hard instances Hhard considered in our hardness result, Theorem 4.

Algorithm 1

Quantum Algorithm

1. For every training point in \({T}_{{\epsilon }_{2}}^{\alpha }={\{({x}_{\ell },{y}_{\ell })\}}_{\ell =1}^{N}\) the quantum algorithm prepares poly(n) copies of the state ρH(xℓ) and computes the estimates of the expectation values \({\langle {P}_{j}\rangle }_{\ell }=\,{\text{Tr}}\,[{\rho }_{H}({x}_{\ell }){P}_{j}]\,\,\forall j=1,\ldots ,m\) up to a certain precision ϵ1. Note that m scales at most polynomially in n as \({\{{P}_{j}\}}_{j=1}^{m}\) are local observables.

2. Define the model h(x) = w · ϕ(x), where ϕ(x) is the vector of the Pauli string expectation values \(\phi (x)=[\,{\text{Tr}}\,[{\rho }_{H}(x){P}_{1}],\ldots ,\,{\text{Tr}}\,[{\rho }_{H}(x){P}_{m}]]\) computed at Step 1. Then, given a hyperparameter B ≥ 0, the LASSO ML model finds an optimal w* from the following optimization process:

$$\mathop{min}\limits_{\begin{array}{l}w\in {{{\mathbb{R}}}}^{m}\\ | | w| {| }_{1}\le B\end{array}}\,\,\frac{1}{N}\mathop{\sum }\limits_{l=1}^{N}| w\cdot \phi ({x}_{l})-{y}_{l}{| }^{2}$$
(9)

with \({\{({x}_{l},{y}_{l}=\,{\text{Tr}}\,[{\rho }_{H}({x}_{l})O(\alpha )])\}}_{l=1}^{N}\) being the training data.

Importantly, to meet the learning condition, the optimization does not need to be solved exactly, i.e., it is not necessary to find w* = α. As we make clear in the Supplementary Information, it is sufficient to obtain a w* whose training error is at most ϵ2 larger than the optimal one.
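The classical post-processing in Step 2 can be sketched as follows (the projected-gradient solver, the synthetic data, and all hyperparameters below are our assumptions; the algorithm only prescribes an approximate solution of the ℓ1-constrained least squares in Eq. (9)):

```python
import numpy as np

# Sketch of Step 2 (assumed numerical details, not prescribed by the paper):
# minimize (1/N) sum_l |w . phi(x_l) - y_l|^2 subject to ||w||_1 <= B,
# via projected gradient descent onto the l1-ball.

def project_l1(v, B):
    """Euclidean projection of v onto the l1-ball of radius B (Duchi et al.)."""
    if np.abs(v).sum() <= B:
        return v
    u = np.sort(np.abs(v))[::-1]
    cssv = np.cumsum(u)
    k = np.nonzero(u * np.arange(1, len(u) + 1) > (cssv - B))[0][-1]
    theta = (cssv[k] - B) / (k + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def lasso_constrained(Phi, y, B, iters=2000):
    """Projected gradient descent for the l1-constrained least squares."""
    N, m = Phi.shape
    lr = N / (2.0 * np.linalg.norm(Phi, 2) ** 2)  # 1/L for this quadratic
    w = np.zeros(m)
    for _ in range(iters):
        grad = (2.0 / N) * Phi.T @ (Phi @ w - y)
        w = project_l1(w - lr * grad, B)
    return w

# Synthetic check: rows of Phi play the role of the Pauli expectation vectors
# phi(x_l), and alpha the unknown sparse parameter of O(alpha).
rng = np.random.default_rng(0)
Phi = rng.uniform(-1, 1, size=(200, 10))
alpha = np.zeros(10); alpha[:3] = [0.5, -0.3, 0.2]
y = Phi @ alpha
w_star = lasso_constrained(Phi, y, B=1.0)
```

Any solver for Eq. (9) would do here; projected gradient descent is chosen only because the ℓ1 constraint (rather than an ℓ1 penalty) appears explicitly in the optimization.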

Theorem 5

(Quantum learnability of the time-evolution learning problem). There exists an efficient quantum algorithm Aq such that for any concept \({f}^{\alpha }\in {{\mathcal{F}}}_{\mathrm{evolved}}^{{H}_{\mathrm{hard}},O}\) considered in Theorem 4, Aq satisfies the following. Given n, ϵ, δ ≥ 0, and any training dataset \({T}_{{\epsilon }_{2}}^{\alpha }\), with ϵ2 ≤ ϵ, of size

$$N={\mathcal{O}}(\frac{\log ({\text{poly}}(n)/\delta ){\text{poly}}(n)}{{\epsilon }^{2}})$$
(10)

with probability 1 − δ the quantum algorithm Aq outputs a model \({h}^{* }(.)=A({T}_{{\epsilon }_{2}}^{\alpha },\epsilon ,\delta ,.)\) which satisfies the learning condition of Def. 4:

$${{\mathbb{E}}}_{x \sim {\mathcal{D}}}[| {f}^{\alpha }(x)-{h}^{* }(x){| }^{2}]\le \epsilon$$
(11)

where \({\mathcal{D}}\) is any arbitrary distribution from which the training data is sampled.

The idea behind the quantum algorithm is the following. For every point x, we construct a vector ϕ(x) ∈ [−1, 1]^m of expectation values of the individual Pauli strings present in O on the time-evolved quantum states. The model h(x) = w · ϕ(x) is then trained on the data samples with a LASSO regression to find a w* such that the trained model agrees with the training samples, i.e., h*(xℓ) = w* · ϕ(xℓ) ≈ yℓ for any \(({x}_{\ell },{y}_{\ell }) \sim {{\mathcal{T}}}_{{\epsilon }_{2}}^{\alpha }\). As the 1-norm of the optimal wopt = α scales polynomially in n, imposing the constraint \(| | w| {| }_{{\ell }_{1}}\le B\) with \(B={\mathcal{O}}({\text{poly}}(n))\) in the LASSO regression allows us to achieve an error ϵ in the generalization performance using a training set of at most polynomial size. This is because the generalization error of the LASSO regression is bounded linearly in B13. The description of the quantum algorithm can be found in Algorithm 1, while we leave the precise analysis of its sample and time complexity to the Supplementary Information.

Ground state problem

Proof

Proof of Theorem 2. The existence of a class of Hamiltonians \({{\mathcal{H}}}_{hard}\) for which the ground state problem is classically hard to learn is guaranteed by the argument above regarding the hardness of the time-evolution case. The structure of the proof is exactly the same as the proof of Theorem 1, rigorously written in the Supplementary Information. The only missing step for the ground state version is that now the states ρH(x) we are considering in Eq. (5) are ground states of Hamiltonians H(x) and not time-evolved states. However, using Kitaev's construction43,44, it is possible to create for any x ∈ {0, 1}^n and U a local Hamiltonian H(x) such that its ground state will have a large overlap with \(U| x\rangle\), with U an arbitrary quantum circuit of polynomial depth. This completes our proof of classical hardness, as we take the family of Hamiltonians \({{\mathcal{H}}}_{{\rm{h}}{\rm{a}}{\rm{r}}{\rm{d}}}\) to be exactly the set of such {H(x)}x with U implementing a \({\text{BQP}}\)-complete computation. More rigorously, this means that for every n, we consider \(U={U}_{BQP}^{n}\) from the family \({\{{U}_{BQP}^{n}\}}_{n}\) of quantum circuits which correctly decide a \(\text{BQP}\)-complete language \({\mathcal{L}}\). As before, it is still the case that there is at least one concept that cannot be evaluated by polynomially sized classical circuits. Finally, the quantum algorithm with learning guarantees for the concept class \({{\mathcal{F}}}_{{\rm{g}}.{\rm{s}}.}^{{{\mathcal{H}}}_{{\rm{h}}{\rm{a}}{\rm{r}}{\rm{d}}},O}\) closely follows Algorithm 1. The only remaining point to prove is that the ground states \({\rho }_{{{\mathcal{H}}}_{\mathrm{hard}}}(x)\) are easily preparable on a quantum computer.
Recall that the class of hard Hamiltonians \({{\mathcal{H}}}_{{\rm{h}}{\rm{a}}{\rm{r}}{\rm{d}}}\) considered in the hardness result for the ground state problem is the one derived from Kitaev's circuit-to-Hamiltonian construction applied to a \(\text{BQP}\)-complete circuit. It is well known that those Hamiltonians exhibit an Ω(1/poly(n)) gap (in contrast to the classically learnable Hamiltonians in ref. 9), and it is possible to construct the ground state \({\rho }_{{{\mathcal{H}}}_{{\rm{h}}{\rm{a}}{\rm{r}}{\rm{d}}}}\) from the description of the corresponding Kitaev Hamiltonian, known to the learner through the description of the concept class and the input x. This concludes our proof of the quantum learnability of \({{\mathcal{F}}}_{g.s.}^{{{\mathcal{H}}}_{hard},O}\), thus completing the entire proof of Theorem 2.

Advantages for unitarily parametrized observables

Proof sketch: Proof sketch of Theorem 3

The main idea of the proof is to construct a concept class of the kind of Eq. (6) which can be learned by a quantum algorithm using the learning algorithm \({{\mathcal{A}}}_{W}\), while maintaining classical hardness. Consider the circuit in Fig. 1. Let U(x) be a unitary that, depending on the first bit x1 of the input x ∈ {0, 1}^n, prepares the state \(| {\psi }_{U}^{{n}_{S}}(x)\rangle\) on the nS-qubit register in two different ways. If x1 = 0, then \(| {\psi }_{U}^{{n}_{S}}(x)\rangle\) is exactly one of the states \(| {\psi }_{\ell }\rangle \in S\), labeled by the final nS bits \({x}_{S}\in {\{0,1\}}^{{n}_{S}}\) of the input bitstring x, i.e., \({x}_{S}={x}_{n-{n}_{S}}\ldots {x}_{n}\). If x1 = 1, then U(x) prepares the state \(| {\psi }_{U}^{{n}_{S}}(x)\rangle\) as the result of a \(\text{BQP}\)-complete computation. Regarding the observable, we define a controlled unitary VA, controlled by the first 1 + nQ qubits, such that when x1 = 0, VA acts on the target nS-qubit register by rotating the nS-qubit measurement operator O into one of the Qm in the set Q required by the learning algorithm. The description of which Qm the unitary VA implements is contained in the nQ-bit string \({x}_{Q}\in {\{0,1\}}^{{n}_{Q}}\) with \({x}_{Q}={x}_{2}{x}_{3}\ldots {x}_{{n}_{Q}+1}\). When x1 = 1, VA acts as the identity on the nS-qubit register. Suppose now the input bits are sampled from an arbitrary distribution \({{\mathcal{D}}}_{i}\in {\{{{\mathcal{D}}}_{i}\}}_{i}\) of the following kind:

  • The first bit x1 of x is selected uniformly at random from {0, 1}.

  • If x1 = 0, then the other n − 1 bits x2x3…xn are sampled from the distribution required by \({{\mathcal{A}}}_{W}\) to learn the unitary W(α).

  • If x1 = 1, then the nS bits in xS follow an arbitrary distribution over \({x}_{S}\in {\{0,1\}}^{{n}_{S}}\). For each i, this distribution over \({x}_{S}\in {\{0,1\}}^{{n}_{S}}\) defines the specific overall distribution \({{\mathcal{D}}}_{i}\).
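The sampling procedure above can be sketched classically as follows; the helper functions standing in for \({{\mathcal{A}}}_{W}\)'s required distribution and for the per-i distribution over xS are hypothetical placeholders (uniform here purely for illustration), not part of the construction:

```python
import random

n = 12              # total input length (illustrative)
n_Q = 3             # bits x_Q selecting the measurement Q_m
n_S = n - 1 - n_Q   # bits x_S labelling the state / BQP instance

def sample_AW_bits():
    # Stand-in for the distribution over x2..xn required by the
    # learning algorithm A_W; uniform only for illustration.
    return [random.randint(0, 1) for _ in range(n - 1)]

def sample_xS():
    # Stand-in for the arbitrary distribution over x_S; this
    # choice is what indexes the member D_i of the family.
    return [random.randint(0, 1) for _ in range(n_S)]

def sample_input():
    x1 = random.randint(0, 1)        # first bit: fair coin
    if x1 == 0:
        rest = sample_AW_bits()      # bits from A_W's distribution
    else:
        x_Q = [0] * n_Q              # x_Q plays no role when x1 = 1
        rest = x_Q + sample_xS()
    return [x1] + rest

x = sample_input()
```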

    Fig. 1: Quantum model which exhibits a learning speed-up for the concept class \({{\mathcal{M}}}_{U,W,O}\).

    The unitary U(x) prepares the state \(| {\psi }_{U}(x)\rangle =| {x}_{1}\rangle \otimes | {x}_{Q}\rangle \otimes | {\psi }_{U}^{{n}_{S}}({x}_{S})\rangle\), where the form of \(| {\psi }_{U}^{{n}_{S}}({x}_{S})\rangle\) depends on x1, the first bit of each input \(x\in {\{0,1\}}^{n}\). If x1 = 0, then \(| {\psi }_{U}^{{n}_{S}}({x}_{S})\rangle =| {\psi }_{{x}_{S}}\rangle \in S\) is the nS-qubit state from the set S of polynomially describable quantum states needed to learn W(α), labeled by the final nS bits \({x}_{S}\) of the input. If x1 = 1, then \(| {\psi }_{U}^{{n}_{S}}({x}_{S})\rangle\) is the quantum state which encodes the \({\text{BQP}}\)-complete computation deciding the instance xS of a \({\text{BQP}}\)-complete language \({\mathcal{L}}\) over \({x}_{S}\in {\{0,1\}}^{{n}_{S}}\). W(α) is an unknown parametrized unitary; in order to prove classical hardness, it is sufficient to consider \(W(\alpha =0)={I}^{\otimes {n}_{S}}\) and the measurement operator \(O=Z\otimes {I}^{\otimes ({n}_{S}-1)}\) on the nS-qubit register. The unitary VA rotates the measurement operator O so that, when x1 = 0, the final measurement is an operator \({Q}_{{x}_{Q}}\) described by the bitstring xQ. This provides the right training samples to learn W(α), as explained in the proof of Theorem 3.

Note that in the second bullet point, the distribution of probe states and measurements required by \({{\mathcal{A}}}_{W}\) to learn the unitary W(α) may be very complicated, possibly involving a joint distribution over probe states and measurements. Nevertheless, any target distribution can be obtained by post-processing samples from the uniform distribution, which can be done coherently by taking uniformly random input bitstrings.
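The classical analogue of this post-processing claim is inverse-transform sampling: a uniform draw (the role played here by the uniformly random input bitstring) can be mapped to a sample from any discrete target distribution. A minimal sketch, with an illustrative made-up distribution over probe-state labels:

```python
import random

def sample_from(pmf):
    """Inverse-transform sampling: map a uniform draw in [0, 1)
    to a sample from a discrete distribution given as
    {outcome: probability}."""
    u = random.random()          # proxy for a uniform random bitstring
    cumulative = 0.0
    for outcome, p in pmf.items():
        cumulative += p
        if u < cumulative:
            return outcome
    return outcome               # guard against float rounding

# Illustrative biased distribution over three probe-state labels.
pmf = {"psi_0": 0.5, "psi_1": 0.3, "psi_2": 0.2}
counts = {k: 0 for k in pmf}
for _ in range(10_000):
    counts[sample_from(pmf)] += 1
```

Empirical frequencies in `counts` then approach the target probabilities, which is all the reduction from uniform inputs requires.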

Taking \(W({\boldsymbol{0}})={I}^{\otimes {n}_{S}}\) for α = 0 and \(O=I\otimes {I}^{\otimes {n}_{Q}}\otimes Z\otimes {I}^{\otimes ({n}_{S}-1)}\), it is clear that the concept \({f}^{0}=\,\mathrm{Tr}\,[{\rho }_{{U}}(x){V}_{{A}}O{V}_{A}^{\dagger }]\) cannot be learned by any classical algorithm. In particular, by the result of our Theorem 4, no classical algorithm can meet the learning condition of Def. 4 on more than half of the inputs x drawn from any distribution \({{\mathcal{D}}}_{i}\); namely, it is guaranteed to fail on the inputs for which x1 = 1. To prove quantum learnability, we note that for every α the dataset \({{\mathcal{T}}}^{\alpha }\) associated with the concept class \({{\mathcal{M}}}_{U,W,O}\) contains exactly the pairs of states and measurement outcomes needed by the learning algorithm \({{\mathcal{A}}}_{W}\) to learn the unitary W(α). Namely, the half of the training samples whose input x has x1 = 0 allows the algorithm to recover the unknown W(α). Therefore, the quantum algorithm is able to learn the unknown observable O(α) and evaluate it on the quantum state associated with every input \(x\in {\{0,1\}}^{n}\).
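A toy numerical check of evaluating a concept of this trace form: a single-qubit stand-in where the Hadamard gate plays the role of VA, rotating O = Z into X. All operators and the state are illustrative, not the registers of the actual construction:

```python
import numpy as np

# Toy instance of f(x) = Tr[rho_U(x) V_A O V_A^dagger] on one qubit.
I = np.eye(2)
Z = np.diag([1.0, -1.0])
H = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)  # stands in for V_A

plus = np.array([1.0, 1.0]) / np.sqrt(2)   # |psi> = |+>, stands in for the probe state
rho = np.outer(plus, plus.conj())          # rho_U(x)

# H Z H^dagger = X, so the concept value is <+|X|+> = 1.
f = np.trace(rho @ H @ Z @ H.conj().T).real
```

Here the rotated measurement H Z H† equals X, so measuring the |+⟩ probe yields expectation 1, mirroring how VA converts a fixed O into the operator \({Q}_{{x}_{Q}}\) the learner needs.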

Proof sketch of Corollary 1

The proof follows the same approach as Theorem 3: we construct a family of input distributions \({\{{{\mathcal{D}}}_{i}\}}_{i}\) that allows a quantum algorithm to recover the target unitary W(α), while ensuring that the learning problem remains intractable for any classical algorithm. Specifically, this is achieved by designing a model that prepares stabilizer states for half of the inputs and performs a \(\text{BQP}\)-hard computation for the other half. The quantum learning algorithm can then use the training data containing stabilizer states to learn the unitary W(α) via the procedure of ref. 22. Importantly, it is then also able to evaluate the learned O(α) on the other half of the input data.