Introduction

Since first-generation quantum computers were made publicly available eight years ago by IBM, the technological frontier is expanding at an ever increasing pace1,2,3,4,5,6,7,8,9,10,11. Nevertheless, decoherence and hardware errors still limit the applicability of these early-stage quantum devices, and practical quantum advantage yet remains to be demonstrated. To go beyond what is possible now, it is crucial to operate both classical and quantum computers in an orchestrated manner that exploits their respective strengths.

For example, in the variational quantum eigensolver (VQE) algorithm12,13,14, a classical computer optimizes the parameters of a trial quantum state vector \(\left\vert \psi \right\rangle\) to accurately approximate the minimal eigenvalue of an observable O, e.g., the Hamiltonian of a molecule. The tasks performed by the quantum processor are preparing \(\left\vert \psi \right\rangle\) and gathering measurement data from which the expectation value \(\langle O\rangle =\left\langle \psi \right\vert O\left\vert \psi \right\rangle\) can be estimated. In practice, the observable O cannot be measured directly as this would require a quantum circuit that diagonalizes it, i.e., a circuit that rotates the unknown eigenbasis of O to the computational basis. A common approach to circumvent this problem is to express O as a linear combination of n-qubit Pauli operators Pi {I, X, Y, Z}n with real coefficients \({c}_{i}\in {\mathbb{R}}\), as in

$$O=\mathop{\sum }\limits_{i=1}^{M}{c}_{i}{P}_{i}.$$
(1)

Since any Pauli operator Pi can be diagonalized using single-qubit Clifford gates, it is straightforward to measure its expectation value 〈Pi〉. Once all 〈Pi〉 have been obtained, 〈O〉 is calculated from Eq. (1). Although the number M of Pauli operators for molecular Hamiltonians has a nominal scaling of up to \({\mathcal{O}}({n}^{4})\), measuring all Pauli expectation values individually would require a large number of quantum circuit executions (“shots”)14. In addition to the mere evaluation of 〈O〉, usually also its gradient is estimated from the measured data15,16. To keep resource requirements at an affordable level, one can make use of simultaneous measurements of commuting Pauli operators, a technique that is commonly applied. For example, with a diagonalization circuit that only contains single-qubit gates, one can measure a tensor product basis (TPB), i.e., a set of Pauli operators that are qubitwise commuting (QWC)1. For typical problems it is possible to group an average number of three Pauli operators into a common TPB17. To further reduce the number of required diagonalization circuits from \({\mathcal{O}}({n}^{4})\) to \({\mathcal{O}}({n}^{3})\), grouping the Pauli operators into general commuting (GC) sets has been suggested18,19,20. Under ideal circumstances, such GC groupings would substantially decrease the number of shots required to estimate 〈O〉 to a desired accuracy21. Unfortunately, the corresponding diagonalization circuits consist of up to \(\frac{n(n-1)}{2}\) two-qubit gates and, if the connectivity of the device is limited, a large number of additional SWAP gates. As a first step to interpolate between these extremes, diagonalization circuits featuring a single layer of two-qubit gates have been introduced22,23; however, it is an open challenge to fully exploit the trade-off between QWC and GC. As of today, the best method to experimentally estimate an expectation value 〈O〉 is unclear. Besides Pauli grouping1,17,18,19,20,21,22,23,24,25, there is active research in addressing this problem with classical shadows26,27,28,29,30,31, unitary partitioning32,33, low-rank factorization34, adaptive estimators35,36,37, and decision diagrams38.

In this article, we introduce a theoretical framework for constructing diagonalization circuits whose two-qubit gates are tailored to meet the connectivity restrictions imposed by most current quantum computing architectures, e.g., super- and semiconducting qubits1,2,3,4,5,6,7,8,9. Our flexible approach can be applied to any hardware connectivity. We demonstrate the viability of our techniques for a large class of paradigmatic Hamiltonians in the context of the Pauli grouping problem.

Results

Framework for HT diagonalization circuits

The purpose of our theoretical framework is the construction of hardware-tailored (HT) Clifford circuits that diagonalize a given set of commuting n-qubit Pauli operators P1, …, Pm. After possibly replacing some of the Pj by −Pj, we can assume that the group they generate, which is denoted by 〈P1, …, Pm〉, does not contain − In. From now on, we will always assume that this is the case since it allows us to extend 〈P1, …, Pm〉 to the stabilizer group \({\mathcal{S}}\) of some stabilizer state vector \(\left\vert {\psi }_{{\mathcal{S}}}\right\rangle\), where \(\left\vert {\psi }_{{\mathcal{S}}}\right\rangle\) is defined as the common +1-eigenvector of all operators \(S\in {\mathcal{S}}\), see the Supplementary Material (SM) Sec. III. Then, uncomputing the state vector \(\left\vert {\psi }_{{\mathcal{S}}}\right\rangle\), i.e., applying some Clifford circuit \({U}_{{\mathcal{S}}}^{\dagger }\) for which \(\left\vert {\psi }_{{\mathcal{S}}}\right\rangle ={U}_{{\mathcal{S}}}{\left\vert 0\right\rangle }^{\otimes n}\), will simultaneously diagonalize P1, …, Pm39.

An important class of stabilizer states is that of graph states40. A graph with n vertices is defined in terms of its adjacency matrix \(\Gamma =({\gamma }_{i,j})\in {{\mathbb{F}}}_{2}^{n\times n}\), where \({{\mathbb{F}}}_{2}\) is the binary field; a pair (i, j) of vertices is connected via an edge if and only if γi,j = 1. In this article, we do not distinguish between a graph and its adjacency matrix, and we follow the convention γi,j = γj,i and γi,i = 0 for all i, j {1, …, n}. Every graph Γ defines a graph state vector \(\left\vert \Gamma \right\rangle ={U}_{\Gamma }{\left\vert 0\right\rangle }^{\otimes n}\) whose preparation circuit

$${U}_{\Gamma }=\left(\prod _{i < j}{{\rm{CZ}}}_{i,j}^{{\gamma }_{i,j}}\right){H}^{\otimes n}$$
(2)

consists of a layer of Hadamard gates \(H=\frac{1}{\sqrt{2}}(X+Z)\), followed by a two-qubit gate CZ = diag(1, 1, 1, −1) for every pair of connected vertices. The stabilizer group of \(\left\vert \Gamma \right\rangle\) is \({{\mathcal{S}}}_{\Gamma }=\{{X}^{{\bf{k}}}{Z}^{\Gamma {\bf{k}}}{(-1)}^{{\sum }_{i < j}{k}_{i}{\gamma }_{i,j}{k}_{j}}\,| \,{\bf{k}}\in {{\mathbb{F}}}_{2}^{n}\}\), where we define \({X}^{{\bf{k}}}={X}^{{k}_{1}}\otimes \ldots \otimes {X}^{{k}_{n}}\) and similarly ZΓk for the matrix-vector product Γk. Hence, \({U}_{\Gamma }^{\dagger }\) would diagonalize our operators P1, …, Pm if only they were of the form ± XkZΓk.

Every stabilizer state is local-Clifford (LC) equivalent to a graph state41. Thus, there exist single-qubit Clifford gates U1, …, Un and a graph Γ such that \(({U}_{1}\otimes \ldots \otimes {U}_{n})\left\vert {\psi }_{{\mathcal{S}}}\right\rangle =\left\vert \Gamma \right\rangle\). We conclude that every set of commuting Pauli operators can be simultaneously diagonalized by some layer of single-qubit Clifford gates followed by some circuit of the form \({U}_{\Gamma }^{\dagger }\). We refer to this procedure as a graph-based diagonalization circuit, see Fig. 1.

Fig. 1: Graph-based diagonalization circuit.
figure 1

A set of commuting Pauli operators P1, …, Pm (yellow) is diagonalized in two steps: first, a layer of single-qubit Clifford gates U = U1 Un (red) rotates them into a set of the form \({\mathcal{S}}=\{\pm {X}^{{\bf{k}}}{Z}^{\Gamma {\bf{k}}}\,| \,{\bf{k}}\in {{\mathbb{F}}}_{2}^{n}\}\), where \(\Gamma \in {{\mathbb{F}}}_{2}^{n\times n}\) is an adjacency matrix. Afterward, \({\mathcal{S}}\) is rotated to the computational basis (blue) by uncomputing the graph state vector \(\left\vert \Gamma \right\rangle\) (green). The existence of U and Γ is guaranteed because every stabilizer state is LC-equivalent to a graph state41. We call a graph-based diagonalization circuit hardware-tailored (HT) if Γ is a subgraph of the connectivity graph Γcon of the considered quantum device.

The connectivity graph of a quantum computer is the graph Γcon whose vertices and edges, respectively, are given by qubits and pairs of qubits for which a CZ gate can be physically implemented. For quantum devices with a limited connectivity, general graph-based diagonalization circuits require up to \({\mathcal{O}}({n}^{2})\) SWAP gates42. This overhead renders unconstrained graph-based diagonalization circuits infeasible for certain near-term applications, see SM Sec. II. However, if Γ Γcon is a subgraph of the connectivity graph, SWAP gates are avoided completely. We refer to graph-based diagonalization circuits that are designed to meet this condition as hardware-tailored (HT). By construction, HT circuits have a very low CZ depth that is upper bounded by the maximum degree of Γcon, e.g., by 3 and 4 for heavy-hex and square-lattice connectivity, respectively43.

We now derive a technical condition for the existence of HT diagonalization circuits. Every n-qubit Clifford gate U defines a symplectic matrix

$$A=\left[\begin{array}{cc}{A}^{xx}&{A}^{xz}\\ {A}^{zx}&{A}^{zz}\end{array}\right]\in \,{\text{GL}}\,\left({{\mathbb{F}}}_{2}^{2n}\right)$$
(3)

with the property that

$$U{X}^{{\bf{r}}}{Z}^{{\bf{s}}}{U}^{\dagger }={\text{i}}^{\alpha ({\bf{r}},{\bf{s}})}\,{X}^{{A}^{xx}{\bf{r}}+{A}^{xz}{\bf{s}}}\,{Z}^{{A}^{zx}{\bf{r}}+{A}^{zz}{\bf{s}}}$$
(4)

holds for all vectors \({\bf{r}},{\bf{s}}\in {{\mathbb{F}}}_{2}^{n}\), see Table 1 for the single-qubit case39. Hereby, a matrix A is called symplectic if \({A}^{{\rm{T}}}\left[\begin{array}{cc}0&{\mathbb{1}}\\ {\mathbb{1}}&0\end{array}\right]A=\left[\begin{array}{cc}0&{\mathbb{1}}\\ {\mathbb{1}}&0\end{array}\right]\), with \(\,{\text{GL}}\,\left({{\mathbb{F}}}_{2}^{2n}\right)\) denoting the general linear group of \({{\mathbb{F}}}_{2}^{2n}\). For the time being, we can neglect the phases that are given by \(\alpha ({\bf{r}},{\bf{s}})\in {\mathbb{Z}}/4{\mathbb{Z}}=\{0,1,2,3\}\) (see Eq. (46) in SM Sec. XIII for how to recover them) and focus on the irreducible representation

$${{\mathcal{C}}}_{n}\longrightarrow \,{\text{GL}}\,\left({{\mathbb{F}}}_{2}^{2{\text{n}}}\right), \ \ U\longmapsto A$$
(5)

of the n-qubit Clifford group \({{\mathcal{C}}}_{n}\). If U = U1 Un is a single-qubit Clifford layer, the blocks in Eq. (3) are diagonal, i.e., \({A}^{xx}=\,\text{diag}\,({a}_{1}^{xx},\ldots ,{a}_{n}^{xx})\), and similarly for Axz, Azx, and Azz. Hereby, \({a}_{1}^{xx}\) is given by the xx-entry of the binary representation of U1, etc. To construct a single-qubit Clifford layer that rotates P1, …, Pm into the set { ± XkZΓk} for a graph Γ, we write \({P}_{j}={\text{i}}^{{q}_{j}}{X}^{{{\bf{r}}}_{j}}{Z}^{{{\bf{s}}}_{j}}\). By Eq. (4), the application of U, which is represented by A, transforms the operator Pj into \({P}_{j}^{{\prime} }={X}^{{{\bf{k}}}_{j}}{Z}^{{A}^{zx}{{\bf{r}}}_{j}+{A}^{zz}{{\bf{s}}}_{j}}\) (up to a global phase), where we have introduced the notation kj = Axxrj + Axzsj. Thus, \({P}_{j}^{{\prime} }\in \{\pm {X}^{{\bf{k}}}{Z}^{\Gamma {\bf{k}}}\}\) is equivalent to Γkj = Azxrj + Azzsj. These equivalent conditions can be phrased for all j {1, …, m} simultaneously as a binary matrix equation

$$\Gamma {A}^{xx}R+\Gamma {A}^{xz}S={A}^{zx}R+{A}^{zz}S,$$
(6)

where \(R=\left({{\bf{r}}}_{1}\,\cdots \,{{\bf{r}}}_{m}\right)\) and \(S=\left({{\bf{s}}}_{1}\,\cdots \,{{\bf{s}}}_{m}\right)\) are the two matrices that store the exponent vectors of P1, …, Pm as their columns.

Table 1 Binary representation of the single-qubit Clifford group \({{\mathcal{C}}}_{1}\)

Equation (6) is our first main result. First, by devising algorithms to solve it, we can tackle the challenge of constructing HT diagonalization circuits. This finally allows us to explore the trade-off between single-qubit Clifford layers and unrestricted Clifford circuits. On a related note, merely checking whether a guess (A, Γ) solves Eq. (6) is of course much more efficient than solving Eq. (6) from scratch. Therefore, our formulation of Eq. (6) supports the design of HT diagonalization circuits via educated guesses or dedicated case-based research for, in principle, arbitrarily large system sizes. We illustrate this idea in Table 2 for molecular Hamiltonians with up to n = 120 qubits.

Table 2 Examples of HT diagonalization circuits and their performance in the context of quantum chemistry

For a quantum chip whose connectivity graph Γcon has n vertices and e edges, there are 6n and 2e potential choices for A and Γ, respectively. Hence, a brute-force solver for Eq. (6) would loop through all 6n2e choices, which quickly becomes infeasible due to the exponential size of the search space. Restricting to a polynomially-large random subset gives rise to an efficient, probabilistic, restricted brute-force solver for Eq. (6), however, we empirically find that this approach has a vanishingly low success probability. After closely investigating the mathematics behind Eq. (6), we can overcome this problem and formulate an efficient probabilistic solver that performs well in practice. Having available a huge search space will then turn into a powerful feature: the expressivity of Eq. (6) enables us to construct ultra-short diagonalization circuits.

Mathematical results

We have reduced the task of constructing HT diagonalization circuits to the problem of solving Eq. (6) for a subgraph Γ Γcon and a symplectic, invertible matrix A whose blocks in Eq. (3) are diagonal. In our case, it is sufficient that A is invertible because every invertible matrix \({A}_{i}\in {{\mathbb{F}}}_{2}^{2\times 2}\), which represents Ui in U = U1 Un, is necessarily also symplectic, see Table 1. Because of \(\det (A)=\mathop{\prod }\nolimits_{i=1}^{n}\det ({A}_{i})\) and \({{\mathbb{F}}}_{2}=\{0,1\}\), the invertibility of A is equivalent to \(\det ({A}_{1})=\ldots =\det ({A}_{n})=1\). As we will show soon in Eq. (13), for a fixed graph Γ, it is possible to rewrite these determinants as quadratic forms

$$\det ({A}_{i})={{\boldsymbol{\lambda }}}^{{\rm{T}}}{Q}_{i}{\boldsymbol{\lambda }}.$$
(7)

The equation λTQiλ = 1 defines what is known in algebraic geometry as a quadric hypersurface \({{\mathcal{L}}}_{i}\subset {{\mathbb{F}}}_{2}^{d}\), i.e., a generalization of a conic section, see Fig. 2. As a consequence, \(\det ({A}_{1})=\ldots =\det ({A}_{n})=1\) defines the intersection \({\mathcal{L}}=\mathop{\bigcap }\nolimits_{i = 1}^{n}{{\mathcal{L}}}_{i}\). Note that the points in \({\mathcal{L}}\) are in one-to-one correspondence with single-qubit Clifford layers that transform P1, …, Pm into stabilizers of \(\left\vert \Gamma \right\rangle\). In particular, \({\mathcal{L}}\) is non-empty if and only if it is possible to diagonalize P1, …, Pm with a circuit based on Γ as in Fig. 1.

Fig. 2: Illustration of the solution space \({\mathcal{L}}\) (yellow dots) of Eq. (6) for a fixed graph Γ in a hypothetical scenario with n = 3 qubits.
figure 2

The ambient space is the d-dimensional null space of the binary matrix M defined in Eq. (9). In this example, \({\mathcal{L}}\) is the intersection of three quadric hypersurfaces \({{\mathcal{L}}}_{1}\) (red), \({{\mathcal{L}}}_{2}\) (green), and \({{\mathcal{L}}}_{3}\) (blue). Every hypersurface \({{\mathcal{L}}}_{i}\) is the union of four non-intersecting affine subspaces \({{\mathcal{A}}}_{i}\), \({{\mathcal{B}}}_{i}\), \({{\mathcal{C}}}_{i}\), and \({{\mathcal{D}}}_{i}\) (parallel lines) defined in Eqs. (15)–(18).

In our setting, every quadric hypersurface \({{\mathcal{L}}}_{i}\) can be written as the union of at most four affine spaces. Therefore, their intersection \({\mathcal{L}}=\mathop{\bigcap }\nolimits_{i = 1}^{n}{{\mathcal{L}}}_{i}\) is the union of at most 4n intersections of affine subspaces, each of which is itself an affine subspace. Using Gaussian elimination over \({{\mathbb{F}}}_{2}\), we can efficiently probe whether such an intersection is non-empty and, once a point \(\lambda \in {\mathcal{L}}\) is found, we have successfully constructed a HT circuit.

What is gained? At first glance, not much: we still have 2e choices for Γ Γcon and up to 4n potential intersections in which a solution might lie. Hence, also in this reformulation, a brute force approach will be inefficient in the worst case. It turns out, however, that the efficient but probabilistic version, where only a polynomial number of subgraphs Γ Γcon and intersections is probed, has a sufficiently high empirical success probability to facilitate the construction of HT circuits for various practical problems as we will demonstrate later in Fig. 5.

For pedagogical reasons, we have already revealed the geometry behind Eq. (6). Let us now delve into the corresponding algebra and show that the picture in Fig. 2 is correct. To solve Eq. (6) in practice, we have to bring \(\det (A)=1\) into a form a classical computer can deal with. First, we exploit that the blocks of A in Eq. (3) are diagonal. Thus, we can replace A by the vector

$${\bf{a}}=\left[\begin{array}{c}{{\bf{a}}}^{xx}\\ {{\bf{a}}}^{xz}\\ {{\bf{a}}}^{zx}\\ {{\bf{a}}}^{zz}\end{array}\right]\in {{\mathbb{F}}}_{2}^{4n}$$
(8)

where the vector \({{\bf{a}}}^{xx}=({a}_{1}^{xx},\ldots ,{a}_{n}^{xx})\in {{\mathbb{F}}}_{2}^{n}\) defines the xx-block via Axx = diag(axx), and likewise for axz, azx, and azz. In this notation, Eq. (6) reads Ma = 0 for the (mn × 4n)-matrix

$$M=\left[\begin{array}{cccc}\Gamma \,\text{diag}\,({{\bf{r}}}_{1})&\Gamma \,\text{diag}\,({{\bf{s}}}_{1})&\,\text{diag}\,({{\bf{r}}}_{1})&\,\text{diag}\,({{\bf{s}}}_{1})\\ \vdots &\vdots &\vdots &\vdots \\ \Gamma \,\text{diag}\,({{\bf{r}}}_{m})&\Gamma \,\text{diag}\,({{\bf{s}}}_{m})&\,\text{diag}\,({{\bf{r}}}_{m})&\,\text{diag}\,({{\bf{s}}}_{m})\end{array}\right].$$
(9)

Via Gaussian elimination, we can efficiently compute a basis \({{\bf{v}}}_{1},\ldots ,{{\bf{v}}}_{d}\in {{\mathbb{F}}}_{2}^{4n}\) of the null space of M. Thus, every vector \({\bf{a}}\in {{\mathbb{F}}}_{2}^{4n}\) with Ma = 0 is of the form

$${\bf{a}}=\mathop{\sum }\limits_{j=1}^{d}{\lambda }_{j}{{\bf{v}}}_{j}$$
(10)

for some vector \({\boldsymbol{\lambda }}\in {{\mathbb{F}}}_{2}^{d}\). In order to correspond to a physical solution, the vector a also needs to fulfill

$$\det ({A}_{i})={a}_{i}^{xx}{a}_{i}^{zz}+{a}_{i}^{zx}{a}_{i}^{xz}=1$$
(11)

for every i {1, …, n} as this will allow us to invert the irreducible representation \({{\mathcal{C}}}_{n}\to \,{\text{GL}}\,({{\mathbb{F}}}_{2}^{2n}),U\mapsto A\). Based on Ansatz (10), we find

$${a}_{i}^{xx}{a}_{i}^{zz}=\mathop{\sum }\limits_{j,{j}^{{\prime} }=1}^{d}{\lambda }_{j}{v}_{i,j}^{xx}{v}_{i,{j}^{{\prime} }}^{zz}{\lambda }_{{j}^{{\prime} }}={{\boldsymbol{\lambda }}}^{{\rm{T}}}\left({{\bf{x}}}_{i}{{\bf{z}}}_{i}^{{\rm{T}}}\right){\boldsymbol{\lambda }}$$
(12)

and similarly \({a}_{i}^{xz}{a}_{i}^{zx}={{\boldsymbol{\lambda }}}^{{\rm{T}}}({{\bf{w}}}_{i}{{\bf{y}}}_{i}^{{\rm{T}}}){\boldsymbol{\lambda }}\), where we have introduced \({{\bf{x}}}_{i}=({v}_{i,1}^{xx},\ldots ,{v}_{i,d}^{xx})\), \({{\bf{z}}}_{i}=({v}_{i,1}^{zz},\ldots ,{v}_{i,d}^{zz})\), \({{\bf{w}}}_{i}=({v}_{i,1}^{xz},\ldots ,{v}_{i,d}^{xz})\), and \({{\bf{y}}}_{i}=({v}_{i,1}^{zx},\ldots ,{v}_{i,d}^{zx})\). Further inserting these expressions into Eq. (11), we obtain \(\det ({A}_{i})={{\boldsymbol{\lambda }}}^{{\rm{T}}}({{\bf{x}}}_{i}{{\bf{z}}}_{i}^{{\rm{T}}}+{{\bf{w}}}_{i}{{\bf{y}}}_{i}^{{\rm{T}}}){\boldsymbol{\lambda }}\) and, by defining the matrix

$${Q}_{i}={{\bf{x}}}_{i}{{\bf{z}}}_{i}^{{\rm{T}}}+{{\bf{w}}}_{i}{{\bf{y}}}_{i}^{{\rm{T}}}\in {{\mathbb{F}}}_{2}^{d\times d},$$
(13)

we finally arrive at Eq. (7). To proceed, we point out that \(\det ({A}_{i})={{\boldsymbol{\lambda }}}^{{\rm{T}}}{Q}_{i}{\boldsymbol{\lambda }}=1\) is equivalent to

$$\left[\begin{array}{c}{{\boldsymbol{\lambda }}}^{{\rm{T}}}{{\bf{x}}}_{i}\\ {{\boldsymbol{\lambda }}}^{{\rm{T}}}{{\bf{z}}}_{i}\\ {{\boldsymbol{\lambda }}}^{{\rm{T}}}{{\bf{w}}}_{i}\\ {{\boldsymbol{\lambda }}}^{{\rm{T}}}{{\bf{y}}}_{i}\end{array}\right]\in \left\{\left[\begin{array}{c}0\\ 0\\ 1\\ 1\end{array}\right],\left[\begin{array}{c}0\\ 1\\ 1\\ 1\end{array}\right],\left[\begin{array}{c}1\\ 1\\ 0\\ 0\end{array}\right],\left[\begin{array}{c}1\\ 1\\ 0\\ 1\end{array}\right],\left[\begin{array}{c}1\\ 0\\ 1\\ 1\end{array}\right],\left[\begin{array}{c}1\\ 1\\ 1\\ 0\end{array}\right]\right\}.$$
(14)

For example, if λTxi = 0, the value of λTzi is irrelevant and we need λTwi = λTyi = 1 for \(\det ({A}_{i})=1\). These are the first two cases in Eq. (14). Defining affine hyperplanes \({{\mathcal{X}}}_{i}^{(c)}=\{{\boldsymbol{\lambda }}\in {{\mathbb{F}}}_{2}^{d}\,| \,{{\boldsymbol{\lambda }}}^{{\rm{T}}}{{\bf{x}}}_{i}=c\}\) for \(c\in {{\mathbb{F}}}_{2}\), and similarly \({{\mathcal{Z}}}_{i}^{(c)}\), \({{\mathcal{W}}}_{i}^{(c)}\), and \({{\mathcal{Y}}}_{i}^{(c)}\), we can rephrase these two cases as λ being contained in the affine subspace

$${{\mathcal{A}}}_{i}={{\mathcal{X}}}_{i}^{(0)}\cap {{\mathcal{W}}}_{i}^{(1)}\cap {{\mathcal{Y}}}_{i}^{(1)}.$$
(15)

For the middle two cases in Eq. (14), the roles of (x, z) and (w, y) are interchanged: these cases are equivalent to λ being contained in the affine subspace

$${{\mathcal{B}}}_{i}={{\mathcal{W}}}_{i}^{(0)}\cap {{\mathcal{X}}}_{i}^{(1)}\cap {{\mathcal{Z}}}_{i}^{(1)}.$$
(16)

Similarly, the remaining two cases in Eq. (14) lead to

$${{\mathcal{C}}}_{i}={{\mathcal{X}}}_{i}^{(1)}\cap {{\mathcal{Z}}}_{i}^{(0)}\cap {{\mathcal{W}}}_{i}^{(1)}\cap {{\mathcal{Y}}}_{i}^{(1)} \,\text{and}\,$$
(17)
$${{\mathcal{D}}}_{i}={{\mathcal{X}}}_{i}^{(1)}\cap {{\mathcal{Z}}}_{i}^{(1)}\cap {{\mathcal{W}}}_{i}^{(1)}\cap {{\mathcal{Y}}}_{i}^{(0)},$$
(18)

respectively. Finally, by introducing

$${{\mathcal{L}}}_{i}={{\mathcal{A}}}_{i}\cup {{\mathcal{B}}}_{i}\cup {{\mathcal{C}}}_{i}\cup {{\mathcal{D}}}_{i},$$
(19)

we obtain that \(\det ({A}_{i})=1\) is equivalent to \({\boldsymbol{\lambda }}\in {{\mathcal{L}}}_{i}\). We defer further technical details about how to best implement the efficient probabilistic solver based on Gaussian elimination as well as additional mathematical insights for simplifying the problem to the methods section.

Reformulation as an optimization problem

It is possible to restate Eq. (6) as the feasibility problem of an integer quadratically constrained program (IQP), a special case of a mixed integer quadratically constrained program (MIQCP) which can be tackled with powerful numerical solvers such as Gurobi44. The problem instance is encoded into the binary matrices \(R,S\in {{\mathbb{F}}}_{2}^{n\times m}\), which host the exponent vectors of the Pauli operators P1, …, Pm to be diagonalized, and into the connectivity graph Γcon with e edges. A straightforward albeit computationally expensive procedure is the following, which we refer to as naive numerical solver:

  • Introduce 4n free binary (Boolean) variables \({\bf{a}}=({a}_{1}^{xx},\ldots ,{a}_{n}^{zz})\) as vectorizations of matrices Axx, …, Azz as stated in Eq. (8).

  • Introduce e binary (Boolean) variables γi,j, one for each edge in the connectivity graph Γcon. The final values of γi,j determine whether or not the corresponding hardware-supported CZ gates are used.

  • Introduce a matrix \(N=({\nu }_{j,k})\in {{\mathbb{Z}}}^{n\times m}\) of n × m integer slack variables and define nm quadratic constraints, one for each entry in the matrix equation ΓAxxR + ΓAxzS = AzxR + AzzS + 2N. Note that the term 2N exploits Eq. (6) being defined over \({{\mathbb{F}}}_{2}^{n\times m}\), i.e., equality only needs to hold modulo 2.

  • Define quadratic constraints, \({a}_{i}^{xx}{a}_{i}^{zz}+{a}_{i}^{zx}{a}_{i}^{xz}=1\) for each i {1, …, n}, to ensure invertibility as a determinant constraint. Here, there is no need to introduce additional integer slack variables that account for the fact that \(\det ({A}_{i})=1\) is supposed to hold modulo 2 because 1 is the only odd value that the expression \({a}_{i}^{xx}{a}_{i}^{zz}+{a}_{i}^{zx}{a}_{i}^{xz}\) can possibly take.

  • Specify a target function to be minimized, e.g., the number of two-qubit gates in the final circuit, cost(a, Γ) = ∑i<jγi,j.

In general, IQP is computationally hard in worst case complexity: it is in NP, while the decision version is NP-complete45. We can reduce the complexity of the naive IQP above by leveraging our mathematical insights from the previous subsection. Treating only one fixed subgraph Γ Γcon at a time, we can compute Q1, …, Qn via Eq. (13). Here, however, we regard the entries of Qi as real numbers rather than equivalence classes of integers modulo 2. As a direct alternative to intersecting affine subspaces, we can attempt to find a solution \({\boldsymbol{\lambda }}\in {\mathcal{L}}\) by executing the following IQP which we call informed numerical solver:

  • Introduce d binary variables λ = (λ1, …, λd) as well as n integer slack variables μ1, …, μn.

  • Define n quadratic constraints, λTQiλ = 1 + 2μi.

  • If necessary, specify a trivial target function.

Note that, by considering only one subgraph Γ at a time, the quadratic constraints in Eq. (6) become linear constraints over Boolean variables. This allows us to efficiently identify the ambient space in Fig. 2, however, the constraints ensuring invertibility of A still define quadratic constraints, which can be cast into the form of an IQP. The structured problems encountered here feature only n quadratic constraints and can be solved in practice for larger system sizes. Empirically, we will demonstrate later in Fig. 5 that highly-optimized, commercially available MIQCP solvers constitute a viable alternative to our provably-efficient algebraic solver.

Solver overview

For an overview of all solvers, see Table 3. In principle, each solver (except for naive num.) can be executed in an exhaustive mode, where all s(n) = 2e subgraphs of Γcon instead of only s(n) = poly(n) are probed. In the exhaustive mode, every solver has an exponential runtime which limits their range of applicability. For small system sizes (e.g., n 12 in Fig. 5), we recommend using the exhaustive, informed numerical solver as it performs best in practice. For larger system sizes, we recommend either the restricted, informed numerical solver, which can be very fast, or the restricted algebraic solver, whose runtime can be easily controlled, see methods section “Runtime Analysis”. Hereby, the hyperparameter s(n) and, if applicable, also c(n), should be selected appropriately, see Fig. 5, methods, and SM Sec. VII for further details and examples.

Table 3 Overview of our solvers for Equation (6)

Experimental demonstration

Next, we demonstrate that our theoretical framework can enhance the efficiency of today’s state-of-the-art quantum computers in practice. A central task is to estimate the expectation value 〈O〉 = Tr[ρO] for an observable O of interest and a state ρ prepared on the quantum computer. For a fixed shot budget (total number of available circuit executions), the goal is to estimate 〈O〉 as accurately as possible. As a concrete example, we consider an eight-qubit molecular Hamiltonian \(O=\mathop{\sum }\nolimits_{i = 1}^{M}{c}_{i}{P}_{i}\) with M = 184 Pauli operators that represents a four-atomic linear hydrogen chain, see SM Sec. XIII for details. To enable a fair comparison of different estimation procedures, we restrict ourselves to the paradigm of non-overlapping Pauli groupings with reliably-executable readout circuits24. To the best of our knowledge, the previous state-of-the-art method within this paradigm is the collection of single-qubit readout circuits that arise from applying the Sorted Insertion algorithm21 that groups {P1, …, P184} into, in this case, \({N}_{{\rm{TPB}}}^{{\rm{circs}}}=35\) QWC subsets. As an alternative, we propose to use \({N}_{{\rm{HT}}}^{{\rm{circs}}}=10\) HT readout circuits which allow us to measure the same M = 184 Pauli operators and which contain no more than four linearly-connected CZ gates. This reduction in the number of readout circuits is possible because being jointly-HT diagonalizable is a less stringent requirement than QWC. What matters in the end, however, is the error \(\epsilon =| {E}_{\exp }-{E}_{{\rm{ideal}}}|\) by which the experimentally measured energy \({E}_{\exp }\) differs from the ideal one. Hereby, one should minimize ϵ by optimally distributing the available shots among the \({N}_{{\rm{TPB}}}^{{\rm{circs}}}=35\) or \({N}_{{\rm{HT}}}^{{\rm{circs}}}=10\) readout circuits21, see SM Sec. IV. To enable a pronounced comparison between the two readout methods, we select a product target state vector \(\left\vert \Psi \right\rangle =\left\vert {\psi }_{1}\right\rangle \otimes \ldots \otimes \left\vert {\psi }_{8}\right\rangle\) that can be prepared with a high fidelity. This state has a large energy \({E}_{{\rm{ideal}}}=\left\langle \Psi \right\vert O\left\vert \Psi \right\rangle =-0.49264\,\,\text{Ha}\,\) compared to the ground state energy, \(\mathop{\min }\limits_{\Phi }\left\langle \Phi \right\vert O\left\vert \Phi \right\rangle =-2.26752\,\,\text{Ha}\,\), where all reported energies include Coulomb repulsion between the nuclei. We perform experiments for both readout methods and present the results in Fig. 3. The error ϵ is plotted in Fig. 3a as a function of the shot budget Nshots for both TPB (blue circles) and HT (green diamonds). We compare the experimental results (dark) to classical, noise-free simulations (bright). In Fig. 3b we display the shot reduction ratio \({N}_{{\rm{TPB}}}^{{\rm{shots}}}/{N}_{{\rm{HT}}}^{{\rm{shots}}}\) for achieving a target error ϵ that our HT circuits offer over conventional TPBs. For very low budgets of a few hundred shots in total, we see that the experimental data perfectly agree with the simulation. This is because sampling errors dominate here. While the simulated errors generally decrease as \(1/\sqrt{{N}^{{\rm{shots}}}}\), the experimental errors eventually saturate at a finite bias b stemming mainly from noise in the diagonalization circuits and the readout-error-mitigated46Z-measurements. Unsurprisingly, bHT = 0.0299(2) Ha is larger than bTPB = 0.0284(2)Ha. For shot budgets below 104, however, we observe that ϵ is smaller for HT than for TPB. Hence, our HT readout circuits outperform TPBs in the low-shot regime, which is of practical importance, see SM Sec. I.

Fig. 3: Experimental shot reduction of HT readout circuits over TPBs.
figure 3

The experiment is conducted with eight superconducting qubits on ibm_washington and compared to an ideal Qiskit simulation53. a Error ϵ (see main text) as a function of the total number of optimally-allocated shots. Error bars show the error on the mean of ϵ averaged over 5 × 107/Nshots independent repetitions of the experiment. b Ratio \({N}_{{\rm{TPB}}}^{{\rm{shots}}}/{N}_{{\rm{HT}}}^{{\rm{shots}}}\) of shots needed to estimate 〈O〉 up to a total error of ϵ. In the absence of implementation errors (Theo.), this ratio is theoretically independent of ϵ and given by RHT/RTPB ≈ 1.87. See SM Sec. XIII for further details.

Discussion

In the remainder of this article, we discuss the reusability potential of HT Pauli groupings, investigate the performance of our solvers, and highlight further potential use cases for our theoretical framework.

Every grouping of a fixed set {P1, …, PM} into jointly-diagonalizable subsets has vast reusability potential: for a fixed observable \(O=\mathop{\sum }\nolimits_{i = 1}^{M}{c}_{i}{P}_{i}\), one can estimate Tr[ρO] for countless experimental states ρ, e.g., different eigenstates of O as well as the myriad of VQE trial states that are encountered until such eigenstates are found. Similarly, in simulations of chemical reactions, Tr[ρ(t)O] is mapped out for a time series of quantum states ρ(t).

Moreover, for any linear combination \({O}^{{\prime} }=\mathop{\sum }\nolimits_{i = 1}^{M}{c}_{i}^{{\prime} }{P}_{i}\) of the same Pauli operators, it is possible to estimate \(\langle {O}^{{\prime} }\rangle\) with the same readout circuits as for 〈O〉. Therefore, a single Pauli grouping can suffice to map out low-energy Born-Oppenheimer surfaces via, e.g., VQE. At the same time, we are guaranteed that the quality of the reused grouping does not deteriorate abruptly as the estimated shot reduction \(\hat{R}\) depends continuously on the nuclear coordinates21.

We illustrate this fact in Fig. 4 for the example of an eight-qubit Hamiltonian \(O(d)=\mathop{\sum }\nolimits_{i = 1}^{184}{c}_{i}(d){P}_{i}\) that represents a four-atomic linear hydrogen chain for which the interatomic spacing d is varied. For different Pauli groupings of O(d), we plot \(\hat{R}\), which should be regarded as a state-independent estimate of the shot reduction ratio \({N}_{{\rm{GPM}}}^{{\rm{shots}}}/{N}_{{\rm{IPM}}}^{{\rm{shots}}}\) (cf. Fig. 3b), where \({N}_{{\rm{GPM}}}^{{\rm{shots}}}\) and \({N}_{{\rm{IPM}}}^{{\rm{shots}}}\), respectively, denote the number of shots required to measure 〈O(d)〉 to a fixed precision ϵ if grouped Pauli measurements (GPM) and individual Pauli measurements (IPM) are performed. For example, O(d = 1.0 Å) coincides with the Hamiltonian from Fig. 3, and the estimated shot reduction ratio of HT over TPB, \({\hat{R}}_{{\rm{HT}}}/{\hat{R}}_{{\rm{TPB}}}\approx 1.76\), is close to the state-dependent value, RHT/RTPB ≈ 1.87. As explained above, we can see in Fig. 4 that \(\hat{R}\) depends continuously on d if a fixed Pauli grouping is reused (dark dashed lines). Also shown is the piecewise continuous dependence of \({\hat{R}}_{{\rm{GC}}}\) and \({\hat{R}}_{{\rm{TPB}}}\) if a new Pauli grouping is recomputed for every value of d (bright solid lines), where the jumps unveil the values of d at which the output of the Sorted Insertion algorithm changes. Since Sorted Insertion is a greedy algorithm, recomputing the Pauli grouping can even diminish the value of \(\hat{R}\). Finally, note that Pauli groupings can be reused for numerous other applications in the context of parameter-shift rules15,16,47,48.

Fig. 4: Reusability opportunity.
figure 4

Estimated shot reduction \(\hat{R}\) defined in ref. 21 for measuring the energy 〈O〉 of a hydrogen chain Hamiltonian as a function of the interatomic distance d between the hydrogen atoms. Due to continuity, there is no need to regroup the Pauli operators in O when d is updated. This enables enormous savings in preprocessing costs.

It is straightforward to adapt Sorted Insertion to HT Pauli groupings, see SM Sec. V. Instead of checking if a set of Pauli operators (qubitwise) commutes, one has to construct a HT readout circuit using one of the solvers from Table 3. In Fig. 5, we investigate the performance of our various solvers by applying a modified Sorted Insertion algorithm to three classes of paradigmatic Hamiltonians. In all cases, we observe in Fig. 5a–c that the estimated shot reduction ratio, \({\hat{R}}_{{\rm{HT}}}/{\hat{R}}_{{\rm{TPB}}}\), takes values in between 1.3 and 3.5. Therefore, our HT readout circuits consistently outperform conventional TPBs.

Fig. 5: Performance of a greedy Pauli grouping algorithm with various solvers from Table 3.
figure 5

ac Estimated shot reduction \({\hat{R}}_{{\rm{HT}}}/{\hat{R}}_{{\rm{TPB}}}\) for three classes of n-qubit Hamiltonians O. The HT readout circuits assume a linear hardware connectivity. Different curves correspond to different solvers from Table 3 and, in the case of random Hamiltonians (b, e, h), to different numbers M of Pauli operators in O. For the single data point with the red star symbol () at n = 14 in (c and f), we use the restr. alg. solver with a fine-tuned choice of hyperparameters. df Runtime of our HT Pauli grouping algorithm. gi Estimated shot reduction \({\hat{R}}_{{\rm{GC}}}/{\hat{R}}_{{\rm{TPB}}}\) that is in principle achievable with unrestricted Clifford circuits if two-qubit errors are neglected, which is an unrealistic assumption for near-term quantum devices. Therefore, HT readout circuits present the best viable option in this comparison. See “Methods” for further details.

While Sorted Insertion has a runtime of \({\mathcal{O}}({M}^{2}n)\), the runtime of our adaptation is given by \({\mathcal{O}}({M}^{2}f(n))\), where M still denotes the number of Pauli operators in the observable O, and f(n) is the complexity of the selected solver, e.g., \(f(n)={\mathcal{O}}({2}^{e}{6}^{n}{n}^{3})\) for the brute-force algebraic solver and \(f(n)={\mathcal{O}}(\,{\text{poly}}\,(n))\) for the restricted algebraic solver, see methods section “Runtime Analysis”. For molecular hydrogen chain Hamiltonians for example, two polynomial fits (green dashed lines) in Fig. 5d reveal total runtimes t(1) n9.8 and t(2) n9.4. Since the number of Pauli operators is here given by M ≈ 0.17 × n3.68, we can conclude that the restricted algebraic solver has an empirical runtime f(n) lying somewhere in between \({\mathcal{O}}({n}^{2.0})\) and \({\mathcal{O}}({n}^{2.4})\), which is in agreement with the expected runtime of \(f(n)={\mathcal{O}}({n}^{3})\), see methods.

For the example of random Hamiltonians, we compare in Fig. 5b the algebraic solver with the informed numerical solver from Table 3, both in combination with an exhaustive (exh.) search over all 2e subgraphs Γ Γcon. We find that both solvers produce HT Pauli groupings of the same quality, which we believe to be near optimal for the assumed linear connectivity constraint. By construction, both exhaustive solvers are inefficient, however, we observe in Fig. 5e that the numerical solver is much faster in practice. This is unsurprising as this option leverages a highly-optimized MIQCP solver.

Since the exhaustive numerical solver is our best option for constructing near-optimal HT Pauli groupings, we can now turn to the question about how much the quality (as measured by \({\hat{R}}_{{\rm{HT}}}\)) of the HT Pauli grouping decreases when we replace the brute-force solver by our provably-efficient restricted algebraic solver. For the example of the Hubbard model we observe in Fig. 5c that both the restricted numerical solver and the restricted algebraic solver produce HT Pauli groupings with the same qualitative behavior of \(\hat{R}\) as the (near-optimal) exhaustive numerical solver. For all solvers, \(\hat{R}\) first increases with the number of qubits n, before it drops again. We attribute this saturation effect to the linear hardware connectivity constraint because \({\hat{R}}_{{\rm{GC}}}/{\hat{R}}_{{\rm{TPB}}}\) continues to grow as shown in Fig. 5i (also see Fig. 1 in SM Sec. II). Whether a similar growth is retained with 2-dimensional HT readout circuits deserves further investigation. For n = 14, the exhaustive numerical solver constructs a grouping with \({\hat{R}}_{{\rm{HT}}}/{\hat{R}}_{{\rm{TPB}}}\approx 3.34\) after 5 h. By carefully balancing the hyperparameters of the restricted algebraic solver (see “Methods”), we find an even better grouping with \({\hat{R}}_{{\rm{HT}}}/{\hat{R}}_{{\rm{TPB}}}\approx 3.42\) in only 11 min (red star).

The discussion above shows that the classical bottleneck of our HT Pauli grouper can be overcome. To corroborate this claim, we propose and test a rudimentary HT Pauli grouper (Algorithm 2 in SM Sec. VI) for which the enabled runtime savings outweigh the classical preprocessing cost for a 52-qubit example. This demonstrates that the advantage of HT readout circuits can be scaled up to meaningful system sizes. Finally, let us point out that our approach also works for more sophisticated molecular basis sets, see SM Sec. XIV.

In this article, we have introduced a theoretical framework for the construction of diagonalization circuits that can be tailored to any given hardware connectivity. Our starting point was the observation of the fact that every set of commuting Pauli operators can be cast into the stabilizer group of a stabilizer state which is local-Clifford equivalent to a graph state. The Pauli operators can thus be diagonalized with a quantum circuit that completely avoids SWAP gates whenever this graph state matches the connectivity of the quantum computer. We derived an algebraic criterion for the existence of such hardware-tailored (HT) diagonalization circuits and introduced solvers for their construction. An important empirical observation is that in many cases it is not necessary to apply unconstrained diagonalization circuits because also HT circuits can be constructed. For super- and semiconducting chips, for which the connectivity graph has a bounded degree, HT circuits have a constant depth. In comparison to state preparation circuits such as UCCSD, which have a gate count of \({\mathcal{O}}({n}^{4})\) and a gate depth of \({\mathcal{O}}({n}^{3})\), the circuit complexity of HT readout circuits is therefore negligible14. The construction of HT circuits can be computationally demanding but is worthwhile given their reusability potential. Finally, we have demonstrated the advantage of our approach over previous ones both in theory and in experiment.

There are multiple ways to further improve the efficiency of our solvers for the construction of HT diagonalization circuits, see methods section “Ideas for Improving our Solvers”. Furthermore, it would be worthwhile to combine the paradigm of HT readout circuits with complementary state-of-the-art methods for lowering shot requirements such as iterative measurement allocation and iterative coefficient splitting25.

Lowering shot requirements, e.g., in variational algorithms as discussed above, is just one of many potential applications for HT diagonalization circuits. Other, equally important use cases arise in the contexts of classical shadows28, Hamiltonian time simulation49,50,51, and quantum error correction52. For instance, an encoding circuit for a stabilizer quantum error-correcting code is the same as a time-reversed diagonalization circuit for its stabilizer group. For an outline how Hamiltonian exponentiation could potentially benefit from our HT diagonalization circuits, see SM Sec. IX.

Methods

Choices of Hamiltonians

All molecular hydrogen chain Hamiltonians were computed in the STO-3G minimal basis using Qiskit nature53, either in combination with pyquante54 or PySCF55. More sophisticated basis sets are discussed in SM Sec. XIV. For Table 2, Figs. 3, and 4, we use PyQuante and the Bravyi-Kitaev fermion-to-qubit mapper to obtain n-qubit Hamiltonians representing a hydrogen chain with \(\frac{n}{2}\) nuclei at an equidistant spacing of d, where we fix d = 1.0 Å in Table 2 and Fig. 3. The exact numbers of Pauli operators M(n) of the n-qubit Hamiltonians in Table 2 are given by M(20) = 7150, M(40) = 116594, M(60) = 594954, M(80) = 1886542, M(100) = 4611050, and M(120) = 9559318. For the hydrogen chain Hamiltonians in the left column of Fig. 5, on the other hand, we use PySCF and the parity-encoding fermion-to-qubit mapper, resulting in Hamiltonians with only n = 2nat − 2 qubits, where nat is the number of hydrogen atoms; also here the interatomic spacing is chosen as d = 1.0 Å.

In Fig. 5b, we average over twenty random Hamiltonians of the form \(O=\mathop{\sum }\nolimits_{i = 1}^{M}{c}_{i}{P}_{i}\) for each choice of qubit number n and number of Pauli operators M. For every O, the Pauli operators Pi {I, X, Y, Z}n\{In} and coefficients ci [−1, 1] are drawn uniformly at random. Error bars show one standard deviation. For an in-depth investigation of the results, see SM Sec. VIII.

Figure 5 c features the Hubbard model of L fermionic modes \({\hat{c}}_{{k}_{j},\sigma }^{\dagger }\) with momentum kj = 2πj/L and spin σ on a 1-dimensional lattice with L sites and periodic boundary conditions. Its Hamiltonian is given by

$$\begin{array}{rcl}O&=&\frac{U}{L}\mathop{\sum }\limits_{i,j,l=1}^{L}{\hat{c}}_{{k}_{i}-{k}_{l},\uparrow }^{\dagger }{\hat{c}}_{{k}_{j}+{k}_{l},\downarrow }^{\dagger }{\hat{c}}_{{k}_{j},\downarrow }{\hat{c}}_{{k}_{i},\uparrow }+\mathop{\sum }\limits_{j=1}^{L}\sum _{\sigma \in \{\uparrow ,\downarrow \}}{\epsilon }_{{k}_{j}}{\hat{c}}_{{k}_{j},\sigma }^{\dagger }{\hat{c}}_{{k}_{j},\sigma }\end{array}$$
(20)

where U ≥ 0 is the Coulomb energy, \({\epsilon }_{{k}_{j}}=-2t\cos ({k}_{j})\) is the dispersion relation for non-interacting (U = 0) fermions, and t ≥ 0 is the hopping strength56. We use the block-spin Jordan-Wigner encoding to convert this Hamiltonian into a linear combination of Pauli operators. In Fig. 5, we assume U = 1 and t = 1 to obtain concrete values for \(\hat{R}\).

Choices of hyperparameters

Here we report the choices of hyperparameters of our solvers that were used to create Fig. 5. In all cases, we tailor the readout circuits to a linear connectivity by applying Algorithm 1 from SM Sec. V. The total runtime and the quality of the resulting Pauli grouping depends on the choice of the solver from Table 3 and its hyperparameters. Unless specified otherwise, all computations were carried out on an 18 core Intel Xeon CPU E5-2697 v4 @2.30GHz device (Intel and Intel Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.).

The exhaustive informed numerical solver (dark blue triangles in Fig. 5d,f and dark markers with dotted lines in Fig. 5e) searches over all 2n−1 circuit templates Γ Γcon and constructs readout circuits (one for each subgraph Γ) by solving the informed IQP problem (if possible). Then, the best circuit is chosen to assign a collection of Pauli operators to a joint measurement, see SM Sec. V for further details. The exhaustive algebraic solver (bright markers with solid lines in Fig. 5e) similarly searches over all 2n−1 circuit templates but applies the algebraic brute-force solver instead of the informed numerical solver. Both exhaustive solvers have an exponential runtime because they take all subgraphs Γ Γcon into account.

The restricted numerical solver has one hyperparameter: the number s(n) of subgraphs Γ Γcon that are probed. In our current implementation, the specific choice of subgraphs is taken uniformly at random. For the Hubbard model (Fig. 5f), we work with \(s(n)=\min \{1.5{n}^{2}-0.5n,{2}^{n-1}\}\) random subgraphs.

The restricted algebraic solver has two hyperparameters: the number s(n) of random subgraphs and the cutoff c(n) defined before Eq. (27). For the Hubbard model (Fig. 5f), we work with \(s(n)=\min \{{n}^{2},{2}^{n-1}\}\) and \(c(n)=\lfloor {\log }_{2}(n)\rfloor\), except for the fine-tuned parameter choice (red star) where the number of subgraphs is increased from s(n) = n2 = 196 to s = 1000. For hydrogen chains (Fig. 5d), on the other hand, we use two different choices of constant hyperparameters:

  1. (1)

    s(n) = 2000 and c(n) = 5.

  2. (2)

    s(n) = 1000 and c(n) = 3.

Besides HT Pauli groupings, we also compute GC Pauli groupings in Figs. 4 and 5 and groupings of O into TPBs in Figs. 35. These groupings were obtained by applying the Sorted Insertion algorithm21 with the insertion conditions “commute” and “qubitwise commute”, respectively.

For additional numerical investigations about how the choice of hyperparameters influences the performance of the HT Pauli grouping algorithm, see SM Secs. VII and XIV.

Technical details on the algebraic solver

Here we continue our technical discussion about how to algebraically construct HT diagonalization circuits. For every qubit i {1, …, n}, there are three cases how the hypersurface \({{\mathcal{L}}}_{i}=\{{\boldsymbol{\lambda }}\in {{\mathbb{F}}}_{2}^{d}\,| \,{{\boldsymbol{\lambda }}}^{{\rm{T}}}{Q}_{i}{\boldsymbol{\lambda }}=1\}\) could look like: since the image of the \({{\mathbb{F}}}_{2}\)-linear map defined by the matrix Qi is contained in the span of xi and wi, the rank of Qi can only take one of the three values: 0, 1, and 2. If \({\text{rank}}_{{{\mathbb{F}}}_{2}}({Q}_{i})=0\), i.e., Qi = 0, then \({{\mathcal{L}}}_{i}=\varnothing\) is empty, which implies \({\mathcal{L}}=\mathop{\bigcap }\nolimits_{i = 1}^{n}{{\mathcal{L}}}_{i}=\varnothing\) and proves the non-existence of a HT diagonalization circuit for the fixed choice of P1, …, Pm and Γ. On the other hand, if \({\text{rank}}_{{{\mathbb{F}}}_{2}}({Q}_{i})=1\), e.g., \({{\bf{x}}}_{i}{{\bf{z}}}_{i}^{{\rm{T}}}=0\) but \({{\bf{w}}}_{i}{{\bf{y}}}_{i}^{{\rm{T}}}\ne 0\), then we need \({{\boldsymbol{\lambda }}}^{{\rm{T}}}{{\bf{w}}}_{i}{{\bf{y}}}_{i}^{{\rm{T}}}{\boldsymbol{\lambda }}=1\), which is equivalent to λTwi = λTyi = 1. Thus, this case is degenerate and the hypersurface collapses to an affine subspace

$${{\mathcal{L}}}_{i}={{\mathcal{W}}}_{i}^{(1)}\cap {{\mathcal{Y}}}_{i}^{(1)}.$$
(21)

For a detailed explanation about how to compute intersections of affine subspaces numerically, see SM Sec. X. Finally, in the most general case of \({\text{rank}}_{{{\mathbb{F}}}_{2}}({Q}_{i})=2\), the hypersurface \({{\mathcal{L}}}_{i}\) defies a description simpler than the union of four affine subspaces as in Eq. (19). As we show next, the occurrence of rank-1 hypersurfaces can greatly simplify the problem.

Denote the number of rank-2 hypersurfaces by k. In our search for a point \({\boldsymbol{\lambda }}\in {\mathcal{L}}\), we first compute the intersection of all rank-1 hypersurfaces \({{\mathcal{L}}}_{i}\). After potentially relabeling some of the qubits, we can assume \({Q}_{i}={{\bf{x}}}_{i}{{\bf{z}}}_{i}^{{\rm{T}}}\) and \({Q}_{j}={{\bf{w}}}_{j}{{\bf{y}}}_{j}^{{\rm{T}}}\) for all i {1, …, l} and j {l + 1, …, n − k}. Then, the intersection of all rank-1 quadric hypersurfaces is given by

$$\begin{array}{ll}\mathop{\bigcap }\limits_{i=1}^{n-k}{{\mathcal{L}}}_{i}\,=\,\left\{{\boldsymbol{\lambda }}\in {{\mathbb{F}}}_{2}^{d}\,\left| \begin{array}{l}{{\boldsymbol{\lambda }}}^{{\rm{T}}}{{\bf{x}}}_{i}={{\boldsymbol{\lambda }}}^{{\rm{T}}}{{\bf{z}}}_{i}={{\boldsymbol{\lambda }}}^{{\rm{T}}}{{\bf{y}}}_{j}={{\boldsymbol{\lambda }}}^{{\rm{T}}}{{\bf{w}}}_{j}=1\\ \,{\text{for}}\, {\text{all}}\,1\le i\le l\, < \,j\le n-k\end{array}\right.\right\}\\\qquad\quad\,=\,\left\{{\boldsymbol{\lambda }}\in {{\mathbb{F}}}_{2}^{d}\left| \,C{\boldsymbol{\lambda }}={\bf{1}}\right.\right\},\end{array}$$
(22)

where 1 = [1, …, 1]T and for the (2(n − k) × d)-matrix \(C={[{{\bf{x}}}_{1},{{\bf{z}}}_{1},\ldots ,{{\bf{x}}}_{l},{{\bf{z}}}_{l},{{\bf{y}}}_{l+1},{{\bf{w}}}_{l+1},\ldots ,{{\bf{y}}}_{n-k},{{\bf{w}}}_{n-k}]}^{{\rm{T}}}\) whose rows are given by the row vectors \({{\bf{x}}}_{1}^{{\rm{T}}}\) etc. Using Gaussian elimination over \({{\mathbb{F}}}_{2}\), we can compute the reduced row-echelon form (RREF) of the extended matrix \([C,{\bf{1}}]\in {{\mathbb{F}}}_{2}^{2(n-k)\times (l+1)}\). If the last column of the RREF is a pivot column, we are in the solutionless case \({\mathcal{L}}=\varnothing\). Otherwise, this column is an offset vector for \({{\mathcal{L}}}_{1}\cap \ldots \cap {{\mathcal{L}}}_{n-k}\), while every non-pivot column of the RREF of C yields a basis vector in the standard way of linear algebra, see SM Sec. X for technical details.

Finally, we turn to the computationally most demanding part that deals with the rank-2 hypersurfaces \({{\mathcal{L}}}_{n-k+1},\ldots ,{{\mathcal{L}}}_{n}\). Similar to the matrix C in Eq. (22), we introduce a (4k × d)-matrix \({C}^{{\prime} }={[{{\bf{x}}}_{n-k+1},{{\bf{z}}}_{n-k+1},{{\bf{w}}}_{n-k+1},{{\bf{y}}}_{n-k+1},\ldots ,{{\bf{x}}}_{n},{{\bf{z}}}_{n},{{\bf{w}}}_{n},{{\bf{y}}}_{n}]}^{{\rm{T}}}\), i.e., for every j {1, …, k}, the four rows of \({C}^{{\prime} }\) from row 4j − 3 to row 4j are given by \({{\bf{x}}}_{n-k+j}^{{\rm{T}}}\), \({{\bf{z}}}_{n-k+j}^{{\rm{T}}}\), \({{\bf{w}}}_{n-k+j}^{{\rm{T}}}\), and \({{\bf{y}}}_{n-k+j}^{{\rm{T}}}\). Then, a vector \({\boldsymbol{\lambda }}\in {{\mathbb{F}}}_{2}^{d}\) is contained in \({{\mathcal{L}}}_{n-k+1}\cap \ldots \cap {{\mathcal{L}}}_{n}\) if and only if \({C}^{{\prime} }{\boldsymbol{\lambda }}={{\bf{b}}}_{{i}_{1}}\oplus \ldots \oplus {{\bf{b}}}_{{i}_{k}}\) for some i1, …, ik {1, …, 6}, where the vectors b1, …, b6 in the direct sum

$${{\bf{b}}}_{{i}_{1}}\oplus \ldots \oplus {{\bf{b}}}_{{i}_{k}}=\left[\begin{array}{c}{{\bf{b}}}_{{i}_{1}}\\ \vdots \\ {{\bf{b}}}_{{i}_{k}}\\ \end{array}\right]$$
(23)

are the six vectors from Eq. (14), i.e., b1 = [0, 0, 1, 1]T, etc. If it is our goal to unambiguously ascertain whether or not \({\mathcal{L}}\) is empty, we have to check an exponential number of cases. This is only feasible for small qubit numbers n or for graphs with small components, as we explain in SM Sec. XII. To save computing time, we treat all combinations of the vectors bi at once by introducing (4 × 6j)-matrices

$$B_j = [\mathop{\underbrace{{\mathbf{b}}_1,\ldots, {\mathbf{b}}_1}}\limits_{6^{j-1} {\text{times}}}, \ldots, \mathop{\underbrace{{\mathbf{b}}_6,\ldots, {\mathbf{b}}_6}}\limits_{6^{j-1} {\text{times}}}]$$
(24)

for j {1, …, k} and using them as blocks for enlarging \({C}^{{\prime} }\) to a matrix \(B\in {{\mathbb{F}}}_{2}^{4k\times (d+{6}^{k})}\). Hereby, the four rows from row (4j − 3) to row 4j of B are given by

$$\left[{\begin{array}{l} {\mathbf{x}}_{n-k+j}^{\mathrm{T}}, \\ {\mathbf{z}}_{n-k+j}^{\mathrm{T}}, \\ {\mathbf{w}}_{n-k+j}^{\mathrm{T}}, \\ {\mathbf{y}}_{n-k+j}^{\mathrm{T}}, \end{array}} \,\,\displaystyle\underbrace{\displaystyle{B_{j}},\ldots, {\displaystyle{B}_{j}}}_{6^{k-j}\,{\rm{times}}}\,\,\right]\in {\mathbb{F}}_{2}^{4\times (d+6^{k})}.$$
(25)

In other words, B is the matrix that arises from \({C}^{{\prime} }\) by appending all vectors of the form \({{\bf{b}}}_{{i}_{1}}\oplus \ldots \oplus {{\bf{b}}}_{{i}_{k}}\) as additional columns. Next, we use Gaussian elimination to bring B to a row-echelon form; here, time can be saved as it is not necessary to compute the RREF of B. This reveals the non-pivot columns of B. Every non-pivot column of the form \({{\bf{b}}}_{{i}_{1}}\oplus \ldots \oplus {{\bf{b}}}_{{i}_{k}}\) indicates the existence of at least one vector \({\boldsymbol{\lambda }}\in {{\mathcal{L}}}_{n-k-1}\cap \ldots \cap {{\mathcal{L}}}_{n}\). However, we are looking for a λ that also lies in the affine space \({{\mathcal{L}}}_{1}\cap \ldots \cap {{\mathcal{L}}}_{n-k}\). To accomplish this, we start by computing a basis and an offset vector of the entire affine space

$$\left\{{\boldsymbol{\lambda }}\in {{\mathbb{F}}}_{2}^{d}\,| \,{C}^{{\prime} }{\boldsymbol{\lambda }}={{\bf{b}}}_{{i}_{1}}\oplus \ldots \oplus {{\bf{b}}}_{{i}_{k}}\right\}.$$
(26)

Then, we use the procedure explained in SM Sec. X to compute the intersection of the two affine subspaces in Eqs. (22) and (26). Since this results in a subset of \({\mathcal{L}}\), we can finish if we find a non-pivot column \({{\bf{b}}}_{{i}_{1}}\oplus \ldots \oplus {{\bf{b}}}_{{i}_{k}}\) in the right part of B for which this intersection is not empty. Otherwise, if this approach fails for all non-pivot columns, we can finally infer \({\mathcal{L}}=\varnothing\). In any case, we obtain a conclusive answer whether or not a layer of single-qubit Clifford gates exists such that the corresponding graph-based circuit (see main text, Fig. 1) diagonalizes the given set of commuting Pauli operators.

Restricting the algebraic solver

In the exhaustive approach of the previous subsection to algebraically solve Eq. (6) from the main text, we iterate through a number of affine subspaces that grow exponentially in the number k ≤ n of qubits i for which \({{\mathcal{L}}}_{i}\) is a rank-2 quadric hypersurface. For large problem sizes, this is infeasible and we can instead restrict the search to a smaller number c(n) ≤ k of rank-2 quadric hypersurfaces. For example, we can work with a constant cutoff

$$c(n)=\,\text{const.}\,$$
(27)

or with a logarithmically-growing cutoff

$$c(n)=\,\text{const.}\,\times \lfloor \log (n)\rfloor .$$
(28)

This restriction turns our algebraic solver (for attempting to find a solution \({\boldsymbol{\lambda }}\in {\mathcal{L}}\)) into an efficient but probabilistic algorithm because only a polynomial number of intersections of affine subspaces will be probed, see methods section “Runtime Analysis” below. If no solution is found this way, we treat this case as if \({\mathcal{L}}\) was empty, i.e., we skip the current subgraph in our Pauli grouping algorithm from SM Sec. V.

We can incorporate the cutoff c = c(n) by replacing the matrix \(B\in {{\mathbb{F}}}_{2}^{4k\times (d+{6}^{k})}\) in Eq. (25) by a smaller matrix \({B}^{{\prime} }\in {{\mathbb{F}}}_{2}^{(4c+3(k-c))\times (d+{6}^{c})}\). The first 4c rows of \({B}^{{\prime} }\) are again given by Eq. (25), but with 6cj instead of 6kj blocks of the form Bj. For the remaining part, we set the three rows from row (4c + 3j − 2) to row (4c + 3j) to

$$\left[\begin{array}{cccc}{{\bf{x}}}_{n-k+c+j}^{{\rm{T}}}&0&\cdots \,&0\\ {{\bf{w}}}_{n-k+c+j}^{{\rm{T}}}&1&\cdots \,&1\\ {{\bf{y}}}_{n-k+c+j}^{{\rm{T}}}&1&\cdots \,&1\\ \end{array}\right]\in {{\mathbb{F}}}_{2}^{3\times (d+{6}^{c})}$$
(29)

for all j {1, …, kc}, i.e., we dispose of the z-row. In this way, we have effectively combined the two cases that correspond to b1 and b2 from Eq. (14). The remainder of our approach stays unchanged. By restricting from B to \({B}^{{\prime} }\), we will only be able to find solutions \({\boldsymbol{\lambda }}\in {{\mathcal{L}}}_{n-k+1}\cap \ldots \cap {{\mathcal{L}}}_{n}\) which are contained in the subspace

$$\left(\bigcap\limits_{i=n-k+1}^{n-k+c} \mathop{\underbrace{{\mathcal{A}}_i \cup{\mathcal{B}}_i \cup{\mathcal{C}}_i \cup{\mathcal{D}}_i}}\limits_{={{\mathcal{L}}}_i}\right) \cap \bigcap\limits_{i=n-k+c+1}^n \mathop{\underbrace{{\mathcal{A}}_i}}\limits_{\subset {{\mathcal{L}}}_i}.$$
(30)

Let us illustrate the working principle of the cutoff c. In the hypothetical example of Fig. 2 in the main text, we have highlighted the affine subspaces

$${{\mathcal{A}}}_{i}\subset {{\mathcal{L}}}_{i}={{\mathcal{A}}}_{i}\cup {{\mathcal{B}}}_{i}\cup {{\mathcal{C}}}_{i}\cup {{\mathcal{D}}}_{i}$$
(31)

with an increased line width to distinguish them from \({{\mathcal{B}}}_{i}\), \({{\mathcal{C}}}_{i}\), and \({{\mathcal{D}}}_{i}\). In this example, the number of rank-2 hypersurfaces is k = 3. Hence, there are four possible choices for the cutoff c {0, …, k}. For c = k = 3, we recover the original approach and are able to probe all six depicted intersection spaces (yellow dots). For c = 2, the restricted search space \({{\mathcal{L}}}_{1}\cap {{\mathcal{L}}}_{2}\cap {{\mathcal{A}}}_{3}\) only contains two non-empty intersection spaces, namely \({{\mathcal{B}}}_{1}\cap {{\mathcal{A}}}_{2}\cap {{\mathcal{A}}}_{3}\) and \({{\mathcal{D}}}_{1}\cap {{\mathcal{C}}}_{2}\cap {{\mathcal{A}}}_{3}\). For c = 1, only \(\varnothing \,\ne\, {{\mathcal{B}}}_{1}\cap {{\mathcal{A}}}_{2}\cap {{\mathcal{A}}}_{3}={{\mathcal{L}}}_{1}\cap {{\mathcal{A}}}_{2}\cap {{\mathcal{A}}}_{3}\) remains. Note that neither \({{\mathcal{A}}}_{1}\cap {{\mathcal{A}}}_{2}\) nor \({{\mathcal{A}}}_{1}\cap {{\mathcal{A}}}_{3}\) nor \({{\mathcal{A}}}_{2}\cap {{\mathcal{A}}}_{3}\) is empty, but \({{\mathcal{A}}}_{1}\cap {{\mathcal{A}}}_{2}\cap {{\mathcal{A}}}_{3}\) is. Therefore, we would not be able to find any solution for c = 0 in the example of Fig. 2 in the main text. For a non-hypothetical example, see SM Sec. XI.

Runtime analysis

The restricted algebraic solver consists of an outer loop over s(n) ≤ 2e subgraphs and an inner loop over 4c(n) subspace intersections, where s(n) and c(n) are two hyperparameters that may explicitly depend on n. For each subgraph and each intersection, the restricted algebraic solver applies Gaussian elimination to a binary matrix \({B}^{{\prime} }\) of size (4c(n) + 3(kc(n))) × (d + 6c(n)), where k ≤ n and d ≤ 4n. If we choose a constant cutoff c(n) = const. as in Eq. (27), then the size of \({B}^{{\prime} }\) is proportional to n × n. Hence, Gaussian elimination has a runtime of \({\mathcal{O}}({n}^{3})\), i.e., the restricted algebraic solver has a runtime of \({\mathcal{O}}(s(n)\times {n}^{3})\). This is efficient if we choose \(s(n)={\mathcal{O}}(\,{\text{poly}}\,(n))\). In particular, for the hydrogen chain Hamiltonians in Fig. 5 where we work with constant values of s(n), the solver should have a runtime of \(f(n)={\mathcal{O}}({n}^{3})\). This is consistent with our empirical estimate of a runtime that lies somewhere in between \({\mathcal{O}}({n}^{2.0})\) and \({\mathcal{O}}({n}^{2.4})\).

On the other hand, if we choose a logarithmically-growing cutoff \(c(n)={c}_{0}\times {\log }_{6}(n)\), the inner loop of the restricted algebraic solver iterates over \({6}^{c(n)}={n}^{{c}_{0}}\) many intersections. Furthermore, we apply Gaussian elimination of a matrix \({B}^{{\prime} }\) of size proportional to \(n\times {n}^{{c}_{0}}\), which has a runtime of \({\mathcal{O}}({n}^{2+{c}_{0}})\) assuming c0 ≥ 1. Hence, the overall runtime of the restricted algebraic solver follows as \(f(n)={\mathcal{O}}(s(n)\times {n}^{2{c}_{0}+2})\), which is efficient for \(s(n)={\mathcal{O}}(\,{\text{poly}}\,(n))\).

Ideas for improving our solvers

There is much room to further improve the runtime of both the solver for Eq. (6) and the Pauli grouping algorithm into which it is embedded:

  1. 1.

    Since our solvers are highly parallelizable, a more distributed software implementation would allow us to trade runtime for computational resources.

  2. 2.

    Currently, our Pauli grouper calls the solver without making use of previously computed solutions. Warm starting methods could improve upon this. For example, the RREF of M from Eq. (9) could perhaps be reused.

  3. 3.

    Methods for automatically tuning the hyperparameters of our restricted algebraic solver deserve further investigation.

  4. 4.

    Instead of selecting the polynomially-large subset of subgraphs at random, one could implement a more sophisticated approach such as simulated annealing.

  5. 5.

    The geometry of the solution space \({\mathcal{L}}\) could be further explored. If we had access to a lexicographical Gröbner basis for the ideal generated by λTQ1λ + 1, …, \({{\boldsymbol{\lambda }}}^{{\rm{T}}}{Q}_{n}{\boldsymbol{\lambda }}+1\in {{\mathbb{F}}}_{2}[{\lambda }_{1},\ldots ,{\lambda }_{d}]\) (which defines the Zariski-closed set \({\mathcal{L}}\subset {{\mathbb{F}}}_{2}^{d}\)), we could construct a solution \({\boldsymbol{\lambda }}\in {\mathcal{L}}\) via elimination theory57.