Introduction

Neural quantum states (NQS) have been successfully used within Variational Monte Carlo (VMC) to provide highly accurate and flexible parametrizations of the ground-state wavefunction of a variety of many-body physical systems1,2,3,4,5,6,7. Parallel developments have expanded NQS capabilities to capture excited states8,9, while improvements of the stochastic reconfiguration method10,11 have enhanced both the scalability and accuracy of these variational ansätze. Recently, hybrid approaches which integrate NQS with experimental or computational projective measurements in a pre-training stage12,13,14,15, or quantum-classical ansätze16,17 have also shown substantial VMC performance improvements.

Neural autoregressive quantum states (NAQS), which are based on the idea of efficiently parameterizing joint distributions as a product of conditional probabilities, have attracted substantial attention due to their general expressiveness and ability to perform efficient and exact sampling5,18. Recurrent Neural Networks19,20 and Transformers21,22 constitute prominent examples of autoregressive architectures commonly used as variational ansätze23,24,25,26,27. Transformer quantum states (TQS), in particular, have proven effective in providing highly accurate representations of ground states in frustrated magnetism27,28, quantum chemistry29,30, and Rydberg atoms31, while also holding promise for interpretability within the context of the self-attention mechanism32,33,34,35.

Despite their versatility, NQS effectiveness may still depend on the basis in which the Hamiltonian is represented. For instance, Robledo-Moreno et al.36 demonstrated that variationally optimized single-particle orbital rotations can significantly improve the accuracy of calculated observables. Furthermore, NQS wave function representations may lack direct physical interpretability, e.g., with respect to the relative frequency of sampled states from the Hilbert space. This contrasts with post-Hartree-Fock (HF) methods in quantum chemistry, such as coupled cluster theory37, where corrections are naturally interpreted as single or double excitations to the HF state.

We present a modified VMC approach that simultaneously addresses these aspects. Although the method is architecture-agnostic, we demonstrate its effectiveness using a Transformer-based26,27 framework. As a first step, an effective theory—a simplified solvable model, \({\hat{H}}_{0}\), that aims at capturing the essential physics in specific parameter regimes of the full Hamiltonian \(\hat{H}\)—is introduced and its spectrum defines the computational basis (see also Fig. 1a); for concreteness, we here use two examples—the basis that diagonalizes the Hamiltonian in the mean-field approximation and a natural basis in the limit of strong interactions of our model. Both of these bases contain a “reference state” (RS) which is a candidate for an approximate description of the ground state of the system. In the case of the mean-field approximation, the RS just corresponds to the Hartree-Fock (HF) ground state. Meanwhile, for the second basis, the RS is the exact ground state at strong coupling. We explicitly parametrize the weight of the RS using a single parameter \(\alpha \in {\mathbb{R}}\) while the Transformer network focuses on describing the corrections to it. Apart from enhancing convergence, α is convenient as it directly quantifies how close the many-body state is to the interpretable RS. We emphasize that this approach (as opposed to, e.g., coupled cluster methods) is not biased toward favoring states close to the RS or, equivalently, α near 1. In fact, we demonstrate explicitly that the technique leads to a vanishingly small weight of the RS should the latter not be a good approximation to the true ground state. In addition, in the HF basis, for example, the remaining basis states have a natural interpretation as being associated with a certain number of particle-hole excitations in the HF bands. This produces a natural energetic hierarchy that we also recover both in the relative weights of these states and in the hidden representation of the Transformer’s parametrization of the many-body ground state.

Fig. 1: General methodology.

a First, we choose an effective theory \({\hat{H}}_{0}\) approximating the target Hamiltonian \(\hat{H}\), e.g., via a mean-field approximation or by taking the strong-coupling limit. We use the ground state \(\left\vert {{{\rm{RS}}}}\right\rangle\) and excited states \(\left\vert {{{\boldsymbol{s}}}}\right\rangle\) of \({\hat{H}}_{0}\) to define a physics-informed, interpretable basis for the Transformer (b) in Equation (4); as long as the dominant weight of the ground state of \(\hat{H}\) is in the low-energy part of the spectrum E0(s) of \({\hat{H}}_{0}\), this further improves sampling efficiency and the expressivity of the ansatz. c We sample the states s using the batch-autoregressive sampler57,58,63. It is controlled by the batch size Ns and the number of partial unique strings nU, and directly produces the relative frequencies r(s) associated with each state in a tree structure format. Returning to (b), the states s are then mapped to a high-dimensional representation of size demb and passed through Ndec decoder layers26, containing Nh attention heads, which produce corresponding representations \({{{\boldsymbol{h}}}}\left({{{\boldsymbol{s}}}}\right)\in {{\mathbb{R}}}^{{d}_{{{{\rm{emb}}}}}}\) in latent space. In Supplementary Note B6 we explain how these parameters are chosen. As discussed in the main text, the wavefunctions \({\psi }_{{{{\boldsymbol{\theta }}}}}({{{\boldsymbol{s}}}})=\sqrt{{q}_{{{{\boldsymbol{\theta }}}}}({{{\boldsymbol{s}}}})}{e}^{i{\phi }_{{{{\boldsymbol{\theta }}}}}({{{\boldsymbol{s}}}})}\) can be directly obtained from these vectors. A new set of states \({{{\mathcal{C}}}}\) is then obtained, according to the updated qθ(s), and the process is repeated until the convergence of {θ, α} according to Equation (5).
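
To make the sampling step in panel (c) concrete, the following minimal sketch distributes a batch of Ns samples over a tree of unique partial strings and returns relative frequencies r(s). The binomial splitting, the truncation rule keeping the nU most populated branches, and the constant conditional probability standing in for the Transformer's autoregressive conditionals are illustrative simplifications, not the exact algorithm of refs. 57,58,63.

```python
import numpy as np

def batch_autoregressive_sample(cond_prob, N_k, N_s, n_U, rng):
    """Schematic batch-autoregressive sampler producing relative frequencies r(s).

    cond_prob : callable(partial_string) -> probability of occupation 1 at the next site
    N_k       : number of sites / momenta
    N_s       : batch size (total number of samples distributed over the tree)
    n_U       : maximum number of unique partial strings kept at any step
    """
    # the "tree" is a dict: partial string (tuple) -> number of samples on that branch
    branches = {(): N_s}
    for _ in range(N_k):
        new = {}
        for s, n in branches.items():
            p1 = cond_prob(s)
            n1 = rng.binomial(n, p1)          # split the n samples between s+(1,) and s+(0,)
            for bit, m in ((1, n1), (0, n - n1)):
                if m > 0:
                    new[s + (bit,)] = new.get(s + (bit,), 0) + m
        # keep only the n_U most populated branches (illustrative truncation controlled by n_U)
        branches = dict(sorted(new.items(), key=lambda kv: -kv[1])[:n_U])
    total = sum(branches.values())
    return {s: n / total for s, n in branches.items()}   # relative frequencies r(s)

rng = np.random.default_rng(0)
r = batch_autoregressive_sample(lambda s: 0.8, N_k=6, N_s=10_000, n_U=50, rng=rng)
print(sorted(r.items(), key=lambda kv: -kv[1])[:3])
```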

To exemplify this methodology, we use a one-dimensional interacting fermionic many-body model in momentum space. This model features an exactly solvable strong-coupling limit, which is used to define the strong-coupling basis mentioned above. Moreover, it exhibits a finite regime where integrability is no longer apparent, showing clear differences between exact diagonalization (ED) and HF, where corrections to mean-field treatments become significant.

Our results demonstrate that when the true ground state is close to a product state (the strong-coupling limit), the HF basis (strong-coupling basis) guides the TQS to converge to a variational representation with two key characteristics: (i) an accurate ground-state representation only requires a small fraction of the total Hilbert space, which is learned and efficiently sampled by the Transformer; (ii) the states self-organize hierarchically by their statistical weights, with a clear physical structure in latent space, naturally representing excitations on top of the RS. Finally, we show how these features contrast sharply with a generic basis, which typically requires an exponentially large number of states, hindering scalability and the identification of dominant corrections to mean-field treatments.

Results

General formalism

Our central goal is to determine the ground state of a general interacting fermionic Hamiltonian \(\hat{H}\) given by

$${\hat{H}}= {\sum}_{a,b,{{{\boldsymbol{k}}}}}{d}_{{{{\boldsymbol{k}}}},a}^{{\dagger}}{h}_{a,b}({{{\boldsymbol{k}}}}){d}_{{{{\boldsymbol{k}}}},b}\\ +{\sum}_{{a}_{1},{a}_{2},{b}_{1},{b}_{2} \atop {{{{\boldsymbol{k}}}}}_{1},{{{{\boldsymbol{k}}}}}_{2},{{{{\boldsymbol{k}}}}}_{3},{{{{\boldsymbol{k}}}}}_{4}}{d}_{{{{{\boldsymbol{k}}}}}_{1},{a}_{1}}^{{\dagger}}{d}_{{{{{\boldsymbol{k}}}}}_{2},{a}_{2}}^{{\dagger}}{d}_{{{{{\boldsymbol{k}}}}}_{3},{b}_{2}} {d}_{{{{{\boldsymbol{k}}}}}_{4},{b}_{1}} {V}_{{a}_{1},{a}_{2},{b}_{2},{b}_{1}}^{{{{{\boldsymbol{k}}}}}_{1},{{{{\boldsymbol{k}}}}}_{2},{{{{\boldsymbol{k}}}}}_{3},{{{{\boldsymbol{k}}}}}_{4}},$$
(1)

where \({d}_{{{{\boldsymbol{k}}}},a}^{{{\dagger}} }\) and dk,a are fermionic, second-quantized creation and annihilation operators with momentum k, and indices a, b, … indicate additional internal degrees of freedom of the system, such as spin and/or bands. The one- and two-body terms are determined by ha,b(k) and \({V}_{{a}_{1},{a}_{2},{b}_{2},{b}_{1}}^{{{{{\boldsymbol{k}}}}}_{1},{{{{\boldsymbol{k}}}}}_{2},{{{{\boldsymbol{k}}}}}_{3},{{{{\boldsymbol{k}}}}}_{4}}\), respectively; although not a prerequisite for our method, we assume translational invariance for notational simplicity.

A first approximation to the ground state of Equation (1) can be provided by HF38,39; restricting ourselves to translation-invariant Slater determinants, HF can be stated as finding the momentum-dependent unitary transformations Uk of the second-quantized operators,

$${\bar{d}}_{{{{\boldsymbol{k}}}},p}={\sum}_{a}{\left({U}_{{{{\boldsymbol{k}}}}}\right)}_{p,a}{d}_{{{{\boldsymbol{k}}}},a},$$
(2)

such that the HF self-consistency equations are obeyed (see Supplementary Note A4) and the Hamiltonian assumes a diagonal quadratic form within the mean-field approximation, i.e.,

$$\hat{H}={\sum}_{{{{\boldsymbol{k}}}},p}{\epsilon }_{{{{\boldsymbol{k}}}},p}{\bar{d}}_{{{{\boldsymbol{k}}}},p}^{{{\dagger}}}{\bar{d}}_{{{{\boldsymbol{k}}}},p}+\ldots,$$
(3)

where the ellipsis indicates terms beyond mean-field. The transformations in Equation (2) are obtained iteratively until a specified tolerance is reached.
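
As an illustration of this iterative scheme, the sketch below runs a generic self-consistent-field loop in which a mean-field potential built from the one-body density matrix is rediagonalized until convergence. The toy two-level Hamiltonian and the simple density-dependent potential are placeholders for illustration only; they do not reproduce the model-specific HF equations of Supplementary Note A4.

```python
import numpy as np

def hf_iteration(h, mf_potential, n_occ, tol=1e-10, max_iter=500):
    """Generic self-consistent-field loop (schematic).

    h            : (n, n) Hermitian single-particle Hamiltonian
    mf_potential : callable P -> (n, n) mean-field potential built from the
                   one-body density matrix P (stand-in for the Hartree/Fock terms)
    n_occ        : number of occupied orbitals
    Returns orbital energies e and the unitary U whose first n_occ columns
    span the converged HF ground state.
    """
    n = h.shape[0]
    P = np.zeros((n, n))                      # initial density matrix
    for _ in range(max_iter):
        e, U = np.linalg.eigh(h + mf_potential(P))
        P_new = U[:, :n_occ] @ U[:, :n_occ].conj().T
        if np.linalg.norm(P_new - P) < tol:   # self-consistency reached
            return e, U
        P = P_new
    raise RuntimeError("HF loop did not converge")

# toy two-level example with a density-dependent diagonal mean-field term
h0 = np.array([[0.0, 0.3], [0.3, 1.0]])
e, U = hf_iteration(h0, lambda P: 0.5 * np.diag(np.diag(P)), n_occ=1)
print(e, U[:, 0])   # HF orbital energies and the occupied orbital
```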

The ground state within HF is given by filling the lowest fermionic states in Equation (3), which we will use as our RS, denoted by \(\left| {{{\rm{RS}}}}\right\rangle\) in the following. Importantly, though, HF also defines an entire basis via Equation (2), which is approximately related to the spectrum of the full Hamiltonian and parametrized by ϵk,p. We leverage both the spectrum ϵk,p and its associated basis to improve sampling efficiency and physical interpretability within the VMC framework. As summarized graphically in Fig. 1(a, b), we express the many-body state in the HF basis (2) and denote the associated computational basis by \(\left| {{{\boldsymbol{s}}}}\right\rangle\), where \({{{\boldsymbol{s}}}}=\left({s}_{1},\ldots,{s}_{{N}_{k}}\right)\) labels the occupations of the fermionic modes created by \({\bar{d}}_{{{{\boldsymbol{k}}}},p}^{{{\dagger}} }\) in the Nk different electronic momenta k. Our variational many-body ansatz then reads as

$$\left| {\Psi }_{\{{{{\boldsymbol{\theta }}}},\alpha \}}\right\rangle=\alpha \left| {{{\rm{RS}}}}\right\rangle+\sqrt{1-{\alpha }^{2}}{\sum}_{{{{\boldsymbol{s}}}}\ne {{{\rm{RS}}}}}{\psi }_{{{{\boldsymbol{\theta }}}}}({{{\boldsymbol{s}}}})\left| {{{\boldsymbol{s}}}}\right\rangle,$$
(4)

where \({\psi }_{{{{\boldsymbol{\theta }}}}}({{{\boldsymbol{s}}}})\in {\mathbb{C}}\) is a neural network representation1 of the amplitudes for the states s that are not the RS, and α is an additional variational parameter describing the weight associated with the RS. Note that a global phase choice allows us to take \(\alpha \in {\mathbb{R}}\) without loss of generality.
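
For concreteness, a minimal numerical sketch of how an amplitude of the ansatz (4) can be assembled is given below. The bitstring convention, the dictionary of network amplitudes (assumed normalized over the non-RS states), and the toy numbers are purely illustrative and not part of the actual implementation.

```python
import numpy as np

def ansatz_amplitude(s, alpha, psi_theta, rs):
    """Amplitude <s|Psi_{theta,alpha}> of the ansatz in Equation (4).

    s, rs     : tuples of occupations (s_1, ..., s_Nk)
    alpha     : real weight of the reference state |RS>
    psi_theta : dict mapping non-RS configurations to complex network amplitudes,
                assumed normalized over all s != RS
    """
    if s == rs:
        return alpha
    # the remaining weight, sqrt(1 - alpha^2), is distributed by the network
    return np.sqrt(1.0 - alpha**2) * psi_theta.get(s, 0.0)

# toy example for N_k = 3: RS = (1, 1, 1) plus two corrections
rs = (1, 1, 1)
psi = {(1, 1, 0): 0.8 + 0.2j, (1, 0, 1): -0.3j}
print(ansatz_amplitude(rs, 0.95, psi, rs))         # alpha
print(ansatz_amplitude((1, 1, 0), 0.95, psi, rs))  # sqrt(1 - alpha^2) * psi_theta(s)
```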

The motivation for the variational parameter α is two-fold. First, it explicitly quantifies deviations of the ground state from the RS, which for HF is the optimal product state. A ground state being close to the RS is then reflected by α approaching unity, while small α indicates strong deviations from a product state. As such, our approach combines the interpretability of HF with the lack of being constrained to (the vicinity of) a Slater determinant. We emphasize that different HF calculations, e.g., restricted to certain symmetry channels, can be used and compared. Second, through Equation (4), the NQS can focus solely on the corrections δE to the RS energy ERS. Since the RS is never sampled by the NQS by construction, this separation is beneficial when HF captures the dominant ground-state contributions, in which case targeting the corrections with a standard ansatz would be hindered by low acceptance probabilities in Metropolis-Hastings sampling40—a phenomenon analogous to mode collapse in generative adversarial networks41,42. If HF is not a good approximation, there is, in general, no reason why splitting off the contribution of the RS would be detrimental to the network’s performance.

It remains to discuss how the other states, s ≠ RS, are described through ψθ(s) which depends on a set of parameters \({{{\boldsymbol{\theta }}}}\in {{\mathbb{R}}}^{n}\). These parameters are jointly optimized with α according to

$$\arg {\min }_{\{{{{\boldsymbol{\theta }}}},\alpha \}}E({{{\boldsymbol{\theta }}}},\alpha )=\arg {\min }_{\{{{{\boldsymbol{\theta }}}},\alpha \}}\frac{\left\langle {\Psi }_{\{{{{\boldsymbol{\theta }}}},\alpha \}}\left| \hat{H}\right| {\Psi }_{\{{{{\boldsymbol{\theta }}}},\alpha \}}\right\rangle }{\left\langle {\Psi }_{\{{{{\boldsymbol{\theta }}}},\alpha \}}\Big| {\Psi }_{\{{{{\boldsymbol{\theta }}}},\alpha \}}\right\rangle },$$
(5)

i.e., via a minimization of the energy functional E(θ, α) (see Methods section). We emphasize that this approach is distinct from neural network backflow43,44, though not mutually exclusive with it, as we use the HF basis to express the many-body state rather than dressing its single-particle orbitals with many-body correlations. While other approaches are feasible, too, we here employ a Transformer21,26 to represent the Born distribution \({q}_{{{{\boldsymbol{\theta }}}}}({{{\bf{s}}}})={\left| {\psi }_{{{{\boldsymbol{\theta }}}}}({{{\boldsymbol{s}}}})\right\vert }^{2}/{\sum }_{{{{{\boldsymbol{s}}}}}^{{\prime} }}{\left| {\psi }_{{{{\boldsymbol{\theta }}}}}({{{{\boldsymbol{s}}}}}^{{\prime} })\right\vert }^{2}\) autoregressively, i.e.,

$${q}_{{{{\boldsymbol{\theta }}}}}({{{\boldsymbol{s}}}})={\prod }_{i=1}^{{N}_{{{{\boldsymbol{k}}}}}}q\left({s}_{i}| {s}_{i-1},\ldots,{s}_{1}\right).$$
(6)

From this distribution, both the amplitudes and phases are obtained for the associated wave functions, \({\psi }_{{{{\boldsymbol{\theta }}}}}({{{\boldsymbol{s}}}})=\sqrt{{q}_{{{{\boldsymbol{\theta }}}}}({{{\boldsymbol{s}}}})}{e}^{i{\phi }_{{{{\boldsymbol{\theta }}}}}({{{\boldsymbol{s}}}})}\), from the Transformer’s latent space (see Fig. 1b). Both components are calculated from the same output of the final Addition and Normalization layer of the Transformer. The amplitude is obtained through an affine linear transformation followed by a softmax activation function, while the phase uses a scaled softsign activation function to ensure ϕθ(s) ∈ [ − π, π]23,26. This approach guarantees that the Transformer output yields normalized conditional probabilities in Equation (6)18.
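
The sketch below illustrates this construction: a softmax head acting on the latent vectors produces the conditional probabilities of Equation (6), and a scaled-softsign head produces the phase. The specific pooling of the latent vectors for the phase, the head shapes, and the random toy inputs are our assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def wavefunction_from_latent(h, s, W_amp, b_amp, w_phase, b_phase):
    """Map latent vectors to psi_theta(s) = sqrt(q_theta(s)) * exp(i*phi_theta(s)).

    h       : (N_k, d_emb) decoder outputs; h[i] is assumed to depend only on s_1..s_{i-1}
              (causal masking handled inside the Transformer)
    s       : (N_k,) occupations s_i in {0, 1}
    W_amp   : (d_emb, 2), b_amp: (2,)  -> conditional probabilities via softmax
    w_phase : (d_emb,),   b_phase: () -> phase via softsign scaled to [-pi, pi]
    """
    cond = softmax(h @ W_amp + b_amp, axis=-1)      # q(s_i | s_<i) for s_i = 0, 1
    q = np.prod(cond[np.arange(len(s)), s])         # autoregressive product, Equation (6)
    z = h.sum(axis=0) @ w_phase + b_phase           # illustrative pooling for the phase
    phi = np.pi * z / (1.0 + np.abs(z))             # scaled softsign, phi in [-pi, pi]
    return np.sqrt(q) * np.exp(1j * phi)

rng = np.random.default_rng(0)
N_k, d = 6, 8
h = rng.normal(size=(N_k, d))
s = np.array([1, 1, 0, 1, 1, 1])
psi = wavefunction_from_latent(h, s, rng.normal(size=(d, 2)), np.zeros(2),
                               rng.normal(size=d), 0.0)
print(abs(psi)**2, np.angle(psi))
```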

Model Hamiltonian

To test and explicitly demonstrate our methodology, we construct a concrete minimal model of the form given in Equation (1). The model has exact strong and weak coupling limits that can be used as effective theories \({\hat{H}}_{0}\)—together with HF—for intermediate coupling regimes.

It describes spinless, one-dimensional electrons which can occupy two different bands, a = ± , as described by the creation and annihilation operators \({d}_{k,a}^{{{\dagger}} }\) and dk,a, respectively. They interact through a repulsive Coulomb potential \(V(q)={(2{N}_{k}\left(1+{q}^{2}\right))}^{-1}\). More explicitly, the Hamiltonian reads as

$$\hat{H}=t{\sum}_{k\in {{{\rm{BZ}}}}}\cos (k)\,{d}_{k}^{{{\dagger}} }{\sigma }_{z}{d}_{k}+U{\sum}_{q\in {\mathbb{R}}}V(q){\rho }_{q}{\rho }_{-q},$$
(7)

where the momenta k are defined on the first Brillouin zone (BZ) [ − π, …, π − 2π/Nk] of a finite system with Nk sites and σj (j = 0, x, y, z) are the Pauli matrices in band space. The density operator is given by

$${\rho }_{q}={\sum}_{k\in {{{\rm{BZ}}}}}\left({d}_{k+q}^{{{\dagger}} }{{{\mathcal{F}}}}(k,q){d}_{k}-{\sum}_{G\in {{{\rm{RL}}}}}{\delta }_{q,G} \, {f}_{1}(k,G)\right),$$
(8)

where \({{{\rm{RL}}}}=2\pi {\mathbb{Z}}\) is the reciprocal lattice and the “form factors” read as \({{{\mathcal{F}}}}(k,q)={f}_{1}(k,q)+i{\sigma }_{y}{f}_{2}(k,q)\); for concreteness, we choose \({f}_{1}\left(k,q\right)=1\) and \({f}_{2}\left(k,q\right)=0.9\sin \left(k\right)\left(\sin \left(q\right)+\sin \left(k+q\right)\right)\) in our computations below.
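
For reference, the ingredients entering Equations (7) and (8) can be written down directly. The sketch below collects the Brillouin-zone momenta, the interaction V(q), the form factors f1, f2, and the kinetic term; the values of Nk and t/U are illustrative choices, not those used in any particular figure.

```python
import numpy as np

N_k = 10                                         # number of momenta (illustrative)
ks = -np.pi + 2 * np.pi * np.arange(N_k) / N_k   # first BZ: [-pi, ..., pi - 2*pi/N_k]
t, U = 1.0, 1.0 / 0.16                           # illustrative couplings, t/U = 0.16

def V(q):
    """Repulsive interaction V(q) = 1 / (2*N_k*(1 + q^2))."""
    return 1.0 / (2 * N_k * (1.0 + q**2))

def f1(k, q):
    """Form factor f1(k, q) entering F(k, q) = f1 + i*sigma_y*f2."""
    return 1.0

def f2(k, q):
    """Form factor f2(k, q) = 0.9*sin(k)*(sin(q) + sin(k + q))."""
    return 0.9 * np.sin(k) * (np.sin(q) + np.sin(k + q))

sigma_z = np.diag([1.0, -1.0])
kinetic = [t * np.cos(k) * sigma_z for k in ks]  # first term of Equation (7), one 2x2 block per k
print(U * V(0.0), f2(ks[1], ks[2]), kinetic[0])
```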

Note that this model is non-sparse since all momenta are coupled and, as such, is generally expected to be challenging to solve. It is inspired by models of correlated moiré superlattices, most notably of graphene, which exhibit multiple low-energy bands that are topologically obstructed45,46; they can, hence, not be written as symmetric local theories in real space and are, thus, typically studied in momentum space47,48,49.

Furthermore, the strong coupling limit, t/U → 0, of Equation (7) can be readily solved: to this end, we introduce a new basis defined by \({U}_{k}=\left(\begin{array}{cc}1&-i\\ 1&i\end{array}\right)/\sqrt{2}\) in Equation (2) which diagonalizes the form factors \({{{\mathcal{F}}}}(k,q)\) at all momenta. It follows (see Supplementary Note A1) that, at half-filling (the number of electrons Ne = Nk), any of the states \(\left| \pm \right\rangle={\prod }_{k}{\bar{d}}_{k,\pm }^{{{\dagger}} }\left| 0\right\rangle\) are exact ground states in the limit t/U → 0; which of the two ground states is picked is determined by spontaneous symmetry breaking: the Hamiltonian is invariant under the anti-unitary operator PT with action \(PT{\bar{d}}_{k,\pm }{(PT)}^{{{\dagger}} }={\bar{d}}_{k,\mp }\), which is broken by both of these states. The resulting symmetry-broken phase can be shown to exhibit a finite gap. Importantly, this strong coupling limit defines another natural computational basis and associated \(\left| {{{\rm{RS}}}}\right\rangle=\left|+\right\rangle\) or \(\left| -\right\rangle\), which we will use and compare with the HF basis defined in the previous section; in analogy to twisted bilayer graphene48, we will refer to this strong-coupling basis as “chiral basis”.
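
A quick numerical check (with arbitrary illustrative values for f1 and f2) confirms that the rotation Uk diagonalizes the form factors, which is what renders the strong-coupling limit solvable in this basis:

```python
import numpy as np

# chiral-basis rotation used in Equation (2)
U = np.array([[1, -1j], [1, 1j]]) / np.sqrt(2)

sigma_y = np.array([[0, -1j], [1j, 0]])
f1, f2 = 1.0, 0.4                            # arbitrary illustrative form-factor values
F = f1 * np.eye(2) + 1j * sigma_y * f2       # F(k, q) = f1 + i*sigma_y*f2

# in the rotated basis, F is diagonal, diag(f1 + i*f2, f1 - i*f2), for any f1, f2
print(np.round(U @ F @ U.conj().T, 10))
```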

In contrast, at large t/U, the non-interacting term in Equation (7) dominates and we obtain a symmetry-unbroken metallic phase. As such, there is an interaction-driven metal-insulator transition at half-filling at some intermediate value of t/U (≈ 0.14 according to HF). To be able to compare both chiral and HF bases, and since half-filling has the largest Hilbert space, we will focus on Ne = Nk in the following. Furthermore, we will neglect double occupancy of each of the Ne momenta for simplicity, such that the basis states \(\left| {{{\boldsymbol{s}}}}\right\rangle\), with sk ∈ {0, 1}, in Equation (4) can be compactly written as \(\left| {{{\boldsymbol{s}}}}\right\rangle={\prod }_{k=1}^{{N}_{k}}{\bar{d}}_{k,{(-1)}^{{s}_{k}}}^{{{\dagger}} }\left| 0\right\rangle \).

Hartree-Fock as an effective theory

We first discuss the results using HF as \({\hat{H}}_{0}\). The solid gray lines in Fig. 2a show the deviations of the HF ground state energy from that obtained by ED for system sizes Ne where the latter is feasible. As expected, the corrections exhibit a higher magnitude near the metal-insulator transition (gray region). In the metallic regime (t/U > 0.10), the corrections decay more gradually, forming an extended tail. In contrast, in the insulating regime (t/U < 0.05), the corrections decrease rapidly as Ne increases. To simultaneously display the performance of the Transformer-corrected ansatz (4) using the HF basis—which we refer to as HF-TQS from now on—the colored markers in Fig. 2a show the deviations of HF from the HF-TQS ground-state energy. The fact that they are very close to the deviation of HF from ED for all parameters demonstrates the expressivity and convergence of our approach; this can also be seen more explicitly in the inset, which directly shows the difference in ground-state energy between ED and HF-TQS.

Fig. 2: Performance of HF-TQS for different system sizes and couplings t/U.

a Difference between the HF-TQS ground state energy per electron and HF as a function of t/U at various system sizes Ne. The solid lines show the difference between ED and the HF ground state energy. The inset shows the absolute value of the relative error \(\delta {E}_{{{{\rm{ED}}}}}=\left\vert {E}_{{{{\rm{HF}}}}-{{{\rm{TQS}}}}}-{E}_{{{{\rm{ED}}}}}\right\vert\). The corresponding converged α values [according to Equation (14)] are shown in panel d. The gray regions indicate the vicinity of the metal-insulator transition. We fix nU = 4 × 10³ in this region to highlight how this parameter controls the accessible corrections. Therefore, the break in trend for the corrections at Ne ≥ 14 highlights that a larger nU is necessary to correctly capture them. To illustrate this point, the white circles in panels (a, d) for Ne = 14 were computed using nU = 17,000. We refer the reader to the main text for more details. b Convergence of the ground state energy per electron and of α (panel c) during training for t/U = 0.16 and Ne = 30. The total number of unique states \({n}_{U}^{f}\) indicates how many states are retained by the Transformer from the initial value nU determined in Fig. 1c. Training was performed on one NVIDIA H100 GPU with the displayed network hyperparameters as defined in Fig. 1b (see also Supplementary Note B6). The total number of network parameters is denoted by #θ.

Naturally, the HF-TQS ansatz can also be applied to larger system sizes not accessible in ED. For instance, in Fig. 2b–c, we show the variational energy and α during training for Ne = 30 electrons at t/U = 0.16. Here, and similarly on the low-t/U side of the phase transition, we obtain fast convergence and systematic corrections to the HF energy consistent with the trend at smaller Ne in Fig. 2a, although we use only the moderate number of nU = 4 × 10³ unique samples (cf. Fig. 1c). The data in Fig. 2b–c reveals that the Transformer properly captures the non-product corrections to the HF state. In fact, we can see that the converged Transformer only ends up having to sample \({n}_{U}^{f}=1812\) distinct states (out of the 10⁹ states of the full Hilbert space). This efficiency extends to even larger systems, as we demonstrate in Supplementary Note B7 with results up to Ne = 60.

The situation is different in the critical region, where the method’s performance is primarily constrained by our current choice of a comparatively small total number of uniquely sampled states nU. This leads to the drop (increase) of the energy correction (α) in Fig. 2a, d for large system sizes in the gray region. Here, a larger number of states is required for an accurate representation. When nU is sufficiently large to represent a substantial portion—or even the entirety—of the Hilbert space \({{{\mathcal{H}}}}\), the HF-TQS converges with good accuracy in this region. As we will see later, all effective theories exhibit the same behavior in the gray region, confirming that this is a challenging regime for the Transformer-NQS to solve, regardless of the effective theory considered in this work. The trend change observed for the corrections in this region in Fig. 2a then comes naturally from the fact that we have fixed nU = 4 × 10³ for all system sizes Ne, which appears insufficient for Ne ≥ 14. To demonstrate this, we increased nU, leading to the white circles in Fig. 2a, d (see Supplementary Note B7 for more details).

The relevance of the RS can also be conveniently seen from the parameter α in Fig. 2d. Away from the critical region, α approaches 1, signaling that HF becomes an increasingly accurate approximation, while, within the critical region, we find \(\alpha \simeq \sqrt{1-{\alpha }^{2}}\simeq 0.7\simeq 1/\sqrt{2}\), indicating that only about half of the ground-state weight, and correspondingly of its energy [cf. Equation (14)], is carried by the HF state.

Finally, we point out that the learning rate \({\lambda }_{{\alpha }_{0}}\) for optimizing α0 was set to a fixed value, such that the training dynamics is dominated by that of θ (Fig. 2d). While alternative learning rate scheduling strategies could be employed, they should be chosen with care. Specifically, we observed that low values of \({\lambda }_{{\alpha }_{0}}\) can cause the optimization of θ, according to Equation (15), to become trapped in local minima, particularly near the phase transition.

Other effective theories

We next compare the performance when using the HF basis with that of the chiral basis, whose associated RS is expected to provide a good approximation to the ground state for small t/U. To this end, we show in Fig. 3a the deviation of the variational ground state energy from HF (main panel) and ED (inset) for these basis choices. We see that the chiral and HF bases both provide accurate representations of the ground state across the entire phase diagram, demonstrating again that the method is not intrinsically biased toward being close to the RS, which, for the chiral basis, is not a good approximation for the ground state away from t/U → 0; this is also confirmed by the behavior of the respective α shown in Fig. 3b: for the HF basis, it only dips significantly below 1 in the critical region, where non-product-state corrections are crucial, while dropping to zero for increasing t/U in the chiral basis.

Fig. 3: Performance comparison of the TQS with distinct effective theories.

a Difference between the TQS (with distinct effective theories labeled by the markers) ground state energy per electron and HF as a function of t/U for Ne = 10. The solid line shows the difference between ED and the HF ground state energy. The inset shows the absolute value of the relative error \(\delta {E}_{{{{\rm{ED}}}}}=\left\vert {E}_{{{{\rm{TQS}}}}}-{E}_{{{{\rm{ED}}}}}\right\vert\) on a log scale. b Converged α for the TQS (markers) in panel a as a function of t/U, with dashed lines as a guide to the eye. The gray region indicates the vicinity of the metal-insulator transition. c Histograms showing the total relative frequencies Rj, according to Equation (10), for the excitation classes \({{{\mathcal{E}}}}({{{\boldsymbol{s}}}})\) from Equation (9). These quantities represent the importance of corrections for each particle-hole excitation class. From left to right, the columns correspond to t/U values in the insulating, critical, and metallic regimes, respectively.

To analyze the performance of our ansatz further, in Fig. 3a, b, we also show results using the band basis [\({U}_{k}={\mathbb{1}}\) in Equation (2)], and choose a fully filled band (e.g., a = − ) as RS, which, importantly, is not close to the ground state for any t/U—not even in the non-interacting limit [as can be seen in Equation (7), the band occupation has to change with momentum for U = 0]. In line with these expectations, we find α ≪ 1 in the entire phase diagram, see yellow pentagon markers in Fig. 3b. As expected from Equation (4), the formalism then reduces to standard Transformer-NQS approaches in this regime. Nonetheless, the expressivity of the Transformer in the ansatz (4) allows us to approximate the ground-state energy better than HF; it is not quite as good as in the HF or chiral basis, which seems natural since the RS does not have any simple relation to the ground state in any part of the phase diagram. Thus, representing the ground state and sampling from it in this basis is generically expected to be more challenging than in physics-informed bases. We checked that, for larger t/U, the Transformer converges to the exact ground state energy also in the band basis, as the asymptotic ground state is just one of the basis states (see Supplementary Fig. 5).

Additional important details about the wavefunction and sampling efficiency in the different bases can be revealed by studying the contributions of the various basis states. To group them, we recall that each \(\left| {{{\boldsymbol{s}}}}\right\rangle\) in Equation (4) is labeled by \({{{\boldsymbol{s}}}}=({s}_{1},\ldots,{s}_{{N}_{k}})\), with sk ∈ {0, 1}, and with the convention \(\left| {{{\rm{RS}}}}\right\rangle=\left| (1,1,\ldots,1)\right\rangle\) it makes sense to use the number of “excitations” or “flips”

$${{{\mathcal{E}}}}({{{\boldsymbol{s}}}}):={\sum }_{k=1}^{{N}_{k}}\left(1-{s}_{k}\right)$$
(9)

relative to the RS; in the case of the HF basis, these are in one-to-one correspondence to the particle-hole pairs described by the mean-field Hamiltonian (3). For Ne = 6, for example, states like \(\left| 111110\right\rangle\) and \(\left| 111101\right\rangle\) belong to the class with \({{{\mathcal{E}}}}=1\), i.e., with a single excitation above the RS. We also define

$${R}^{j}:=\mathop{\sum}_{{{{\boldsymbol{s}}}}| {{{\mathcal{E}}}}({{{\boldsymbol{s}}}})=j}r({{{\boldsymbol{s}}}})\quad {{{\rm{for}}}}\quad j=1,\ldots {N}_{e},$$
(10)

where r(s) are the relative frequencies defined in Equation (12). This quantity represents the relative weight of the ground state wavefunction in the sector with j excitations, normalized such that \({\sum }_{j=1}^{{N}_{e}}{R}^{j}=1\) (excluding the reference state with j = 0). More formally, if we define the projector \({\hat{P}}_{j}={\sum }_{{{{\boldsymbol{s}}}}}{\delta }_{{{{\mathcal{E}}}}({{{\boldsymbol{s}}}}),j}\left| {{{\boldsymbol{s}}}}\right\rangle \left\langle {{{\boldsymbol{s}}}}\right\vert\) onto the subspace with \({{{\mathcal{E}}}}({{{\boldsymbol{s}}}})=j\), then \({R}^{j}=\Vert {\hat{P}}_{j}\left| \Psi \right\rangle {\Vert }^{2}/(1-{\alpha }^{2})\), where \(\left| \Psi \right\rangle\) is the ground state. These quantities provide a quantitative measure of which particle-hole excitation sectors contribute most to the corrections needed to reach the true ground state from \({\hat{H}}_{0}\).
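
A compact sketch of how the excitation classes of Equation (9) and the weights Rj of Equation (10) can be accumulated from the sampled relative frequencies is shown below; the toy configurations and frequencies are illustrative.

```python
import numpy as np

def excitation_class(s):
    """Number of flips relative to RS = (1, 1, ..., 1), Equation (9)."""
    return int(np.sum(1 - np.asarray(s)))

def excitation_histogram(samples, N_e):
    """Total relative frequencies R^j of Equation (10).

    samples : dict mapping configurations s (tuples) to relative frequencies r(s),
              with the reference state excluded
    """
    R = np.zeros(N_e + 1)
    for s, r in samples.items():
        R[excitation_class(s)] += r
    return R / R.sum()          # normalized so that sum_j R^j = 1 (j >= 1)

# toy example for N_e = 6: weight in the j = 1 and j = 2 sectors
samples = {(1, 1, 1, 1, 1, 0): 0.5, (1, 1, 1, 1, 0, 1): 0.3, (1, 1, 0, 0, 1, 1): 0.2}
print(excitation_histogram(samples, N_e=6))
```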

The histograms in Fig. 3c show these quantities for the three respective values of t/U indicated in Fig. 3a. The chiral and band bases are fundamentally limited by the curse of dimensionality: the ground state physics cannot be captured by just a few dominant basis states, as demonstrated by the significant contributions of all Rj away from the insulating regime. This broad distribution would consequently limit their applicability for larger system sizes Ne. In contrast, the ground state representation in the HF basis for both the insulating and metallic parameter ranges is dominated by low-order excitations relative to the HF state, as expected from Equation (3). This behavior enables accurate calculations for Ne ≥ 16 in Fig. 2, both in the metallic and insulating regimes, in spite of the non-sparse nature of the Hamiltonian.

Unlike coupled cluster methods in quantum chemistry, for example, the Transformer independently selects the most important excitation classes. This can be seen particularly from the HF-basis histogram close to the phase transition (t/U = 0.09), as an increasing number of higher order excitations starts contributing to the ground-state energy. The number of accessible classes is then limited by only two factors: the total number of unique partial strings nU allowed in the batch-autoregressive sampler (Fig. 1c), and the Transformer’s expressiveness, which is primarily controlled by demb, Nh and Ndec50 (see Fig. 1b and Supplementary Fig. 2). Interestingly, though, we see in Fig. 3c that even close to the phase transition, the HF basis clearly benefits more from importance sampling than the other bases.

Observables

Apart from the ground-state energy of Equation (7), we can naturally estimate other observables, such as the momentum-resolved fermionic bilinears,

$${{{{\mathcal{N}}}}}_{{{{\boldsymbol{k}}}}}^{j}={\bar{d}}_{{{{\boldsymbol{k}}}}}^{{{\dagger}} }{\sigma }_{j}{\bar{d}}_{{{{\boldsymbol{k}}}}},\quad j=x,y,z,$$
(11)

where \({\bar{d}}_{{{{\boldsymbol{k}}}}}\) are the fermionic operators in the chiral basis. In Fig. 4, we show their expectation values within HF and HF-TQS in the critical region (t/U = 0.09). As the dispersion [first term in Equation (7)] involves σx in the chiral basis, it is natural to recover the \(\cos\)-like shape in \(\langle {{{{\mathcal{N}}}}}_{{{{\boldsymbol{k}}}}}^{x}\rangle\). Most importantly, \(\langle {{{{\mathcal{N}}}}}_{{{{\boldsymbol{k}}}}}^{z}\rangle\), which describes the symmetry breaking in the insulating regime, is sizeable in HF, showing that the system is already in the symmetry-broken, insulating regime. However, the additional quantum corrections from our HF-TQS approach lead to a much smaller, almost vanishing \(\langle {{{{\mathcal{N}}}}}_{{{{\boldsymbol{k}}}}}^{z}\rangle \simeq 0\). This is in line with general expectations that HF overestimates the tendency to order. Moreover, corrections to \({{{{\mathcal{N}}}}}_{{{{\boldsymbol{k}}}}}^{y}\) are more pronounced near the points where the kinetic term in Equation (7) changes sign (vertical gray lines in the plot). In combination with the fact that the deviations between HF and HF-TQS are much less pronounced away from the critical region (see Supplementary Fig. 4), these results demonstrate that the value of the parameter α (≈ 0.76 at t/U = 0.09) also serves as an indicator of expected deviations from HF predictions for other physical observables.

Fig. 4: Momentum-resolved fermionic bilinears \({{{{\mathcal{N}}}}}_{{{{\boldsymbol{k}}}}}^{j}\).

HF-TQS results (markers) as a function of momentum k for the observable defined in Equation (11), in comparison to those obtained solely from HF (dashed lines) for Ne = 12 at t/U = 0.09.

Hidden representation

Finally, we investigate the influence of the three different bases on the Transformer’s latent space by projecting the high-dimensional parametrization of qθ(s) onto low-dimensional spaces using principal component analysis (PCA)51,52,53. We apply this method to the set of vectors \(\{{{{\boldsymbol{H}}}}({{{\boldsymbol{s}}}})={\sum }_{j=1}^{{N}_{e}}{{{{\boldsymbol{h}}}}}_{j}({{{\boldsymbol{s}}}})| \forall {{{\boldsymbol{s}}}}\ne {{{\rm{RS}}}}\}\)28, which are obtained at the output of the Transformer’s Ndec layers (see Fig. 1b). For visual clarity, we focus on Ne = 10 electrons. Figure 5 shows the first and second principal components for all bases at t/U = 0.04 (insulator), t/U = 0.12 (close to the critical region) and t/U = 0.16 (metal). Comparing first the two natural, energetically motivated bases—the HF and chiral bases—we see that the states are indeed approximately ordered according to the classes defined via Equation (9) in the regimes where these bases are expected to be natural choices, i.e., for all t/U for the HF basis and for small t/U for the chiral basis. This illustrates that the physical motivation for choosing these respective bases is not only visible in the histograms in Fig. 3c and in the sampling efficiency but is also “learned” by the Transformer’s hidden representation. While some clear structure also emerges for the band basis, we emphasize that the labels \({{{\mathcal{E}}}}({{{\boldsymbol{s}}}})\) do not directly translate into the energetics of the states: as discussed above, the RS is not close to the ground state in any regime, so the number of excitations \({{{\mathcal{E}}}}\) above it carries no clear energetic relevance either. Only for large t/U does a related quantity become relevant, namely the number of excitations away from the product ground state that can be defined in this basis. The Transformer appears unable to uncover any additional emergent structure, which is likely related to the poor performance of the band basis shown in Fig. 3a. Hence, interpretability of these structures is not automatically ensured, as this example illustrates.
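
A minimal sketch of this analysis, assuming the latent vectors are available as a single array and using scikit-learn's PCA, could look as follows; the array shapes and random inputs are illustrative stand-ins for the actual decoder outputs.

```python
import numpy as np
from sklearn.decomposition import PCA

def latent_pca(hidden, configs):
    """Project summed decoder outputs H(s) = sum_j h_j(s) onto two principal components.

    hidden  : (n_states, N_k, d_emb) latent vectors h_j(s) for each sampled state s
    configs : (n_states, N_k) occupations, used only for the class labels E(s)
    """
    H = hidden.sum(axis=1)                         # H(s): one d_emb-dimensional vector per state
    pcs = PCA(n_components=2).fit_transform(H)     # first two principal components
    labels = (1 - configs).sum(axis=1)             # excitation classes E(s), Equation (9)
    return pcs, labels

rng = np.random.default_rng(1)
hidden = rng.normal(size=(200, 10, 32))            # illustrative: 200 states, N_k = 10, d_emb = 32
configs = rng.integers(0, 2, size=(200, 10))
pcs, labels = latent_pca(hidden, configs)
print(pcs.shape, labels[:5])
```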

Fig. 5: Visualization of the Transformer’s latent space.

Results are shown for Ne = 10 electrons at different values of t/U for the band, chiral and HF bases. Each point represents a basis state s, which is colored according to the class label \({{{\mathcal{E}}}}({{{\boldsymbol{s}}}})\) [cf. Equation (9)], and has been obtained by projecting the respective latent space features H(s) onto the first two principal components (PCs) using PCA. All simulations use embedding dimension demb = 300 with a single attention head and a single decoder layer (Nh = Ndec = 1).

Discussion

We have introduced and demonstrated a modified Transformer-based variational description of the ground state of a many-body Hamiltonian, which is based on first choosing an energetically motivated basis \(\{\left| {{{\rm{RS}}}}\right\rangle,\left| {{{\boldsymbol{s}}}}\right\rangle \}\), according to Equation (4) and Fig. 1a. We showed that HF provides a very natural and general route towards finding such a basis since the associated mean-field Hamiltonian (3) encodes an approximate energetic hierarchy of the states. As a second example, we used a basis defined in the strong-coupling limit. Overall, our approach has the following advantages: (i) there is a single parameter, α, which quantifies how close the (variational representation of the) ground state is to \(\left| {{{\rm{RS}}}}\right\rangle\); for instance, for the HF basis, this would be the mean-field-theory prediction, i.e., the Slater determinant closest to the true ground state; (ii) except right at the critical point, the HF basis is found to be particularly useful for improving the sampling efficiency since only a small subset of the exponentially many basis states contributes. This is expected based on general energetic reasoning and is most directly visible in the histograms in Fig. 3c. Finally, (iii) the physical nature of these bases also allows for a clear interpretation of the different contributions, e.g., as excitations on top of the RS, which we also recover in the Transformer’s hidden representation (see Fig. 5).

Several directions can be addressed in future work. First, applying this methodology to different Hamiltonians and deep learning architectures is a natural next step to determine more generally under which conditions only a small subset of basis states is required for the ground state in metallic and insulating regimes. In particular, since the mean-field approximation becomes more accurate in higher dimensions, one would expect HF to provide even greater advantages as an effective theory in higher dimensions. Additionally, inspired by Ref. 36, where orbital rotations applied to determinant-based wavefunctions43,44,54 were shown to improve variational energies and to modify orbitals in certain scenarios, it would be interesting to investigate whether effective theories could provide similar sampling benefits in the context of such ansätze.

From a methodological perspective, efficiency improvements could be achieved through the usage of modified stochastic reconfiguration techniques for the optimization of the network parameters10,11, and with the incorporation of symmetries in the HF-based ansatz55,56. For systems with non-sparse Hamiltonians like our current model, the implementation of the recently proposed GPU-optimized batch auto-regressive sampling without replacement57 should also be beneficial. Furthermore, our approach could be used to test the validity and accuracy of different effective theories by using them as \({\hat{H}}_{0}\) in Fig. 1a to define the computational basis.

Methods

Local energy estimators

The energy expectation values for the corrections δE are calculated as a weighted average over a set \({{{\mathcal{S}}}}\) of nU unique states, obtained from a batch of Ns states s sampled from qθ(s) with the batch-autoregressive sampler57,58,59 (see Fig. 1c), as

$$\langle \delta E\rangle={{\mathbb{E}}}_{{{{\boldsymbol{s}}}} \sim {q}_{{{{\boldsymbol{\theta }}}}}}\left[{H}_{{{{\rm{loc}}}}}({{{\bf{s}}}})\right]\simeq \mathop{\sum}_{{{{\bf{s}}}}\in {{{\mathcal{S}}}}\ne {{{\rm{RS}}}}}{H}_{{{{\rm{loc}}}}}({{{\bf{s}}}})r({{{\bf{s}}}}).$$
(12)

Here, r(s) = n(s)/Ns represents the relative frequency of each state s and

$${H}_{{{{\rm{loc}}}}}({{{\bf{s}}}})=\mathop{\sum}_{{{{{\bf{s}}}}}^{{\prime} }\ne {{{\rm{RS}}}}}\frac{\langle {{{\bf{s}}}}| \hat{H}| {{{{\bf{s}}}}}^{{\prime} }\rangle {\psi }_{{{{\boldsymbol{\theta }}}}}({{{{\bf{s}}}}}^{{\prime} })}{{\psi }_{{{{\boldsymbol{\theta }}}}}({{{\bf{s}}}})}$$
(13)

are the typical local estimators. According to Equation (4), the energy functional is divided into sectors

$$E({{{\boldsymbol{\theta }}}},\alpha )= {\alpha }^{2}{E}_{{{{\rm{RS}}}}}+\left(1-{\alpha }^{2}\right){{\mathbb{E}}}_{{{{\boldsymbol{s}}}} \sim {q}_{{{{\boldsymbol{\theta }}}}}}\left[{H}_{{{{\rm{loc}}}}}({{{\bf{s}}}})\right]\\ +2\alpha \sqrt{1-{\alpha }^{2}}{{{\rm{Re}}}}\left({{\mathbb{E}}}_{{{{\boldsymbol{s}}}} \sim {q}_{{{{\boldsymbol{\theta }}}}}}\left[{H}_{{{{\rm{loc}}}}}^{{{{\rm{RS}}}}}\left({{{\boldsymbol{s}}}}\right)\right]\right),$$
(14)

with the modified local estimator \({H}_{{{{\rm{loc}}}}}^{{{{\rm{RS}}}}}\left({{{\boldsymbol{s}}}}\right)=\langle {{{\boldsymbol{s}}}}| \hat{H}| {{{\rm{RS}}}}\rangle /{\psi }_{{{{\boldsymbol{\theta }}}}}({{{\boldsymbol{s}}}})\). The network parameters θ are optimized as usual with the gradients of the expression (14) given by

$${\nabla }_{{{{\boldsymbol{\theta }}}}}E({{{\boldsymbol{\theta }}}},\alpha )=2{{{\rm{Re}}}}\left({{\mathbb{E}}}_{{{{\bf{s}}}} \sim {q}_{{{{\boldsymbol{\theta }}}}}}\left[{{{{\mathcal{H}}}}}_{{{{\rm{loc}}}}}\left({{{\boldsymbol{s}}}},\alpha \right)\cdot {\nabla }_{{{{\boldsymbol{\theta }}}}}\log {\psi }_{{{{\boldsymbol{\theta }}}}}^{*}\left({{{\boldsymbol{s}}}}\right)\right]\right),$$
(15)

with

$${{{{\mathcal{H}}}}}_{loc}\left({{{\boldsymbol{s}}}},\alpha \right)=\left(\left(1-{\alpha }^{2}\right){H}_{{{{\rm{loc}}}}}({{{\boldsymbol{s}}}})+\alpha \sqrt{1-{\alpha }^{2}}{H}_{{{{\rm{loc}}}}}^{{{{\rm{RS}}}}}({{{\boldsymbol{s}}}})\right).$$
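
For completeness, a minimal numerical sketch of how the energy functional of Equation (14) can be estimated from the sampled quantities is given below; the toy estimator values and relative frequencies are purely illustrative.

```python
import numpy as np

def energy_functional(alpha, E_RS, H_loc, H_loc_RS, r):
    """Energy functional E(theta, alpha) of Equation (14), estimated from samples.

    alpha    : weight of the reference state
    E_RS     : <RS|H|RS>
    H_loc    : (n_U,) local estimators H_loc(s) of Equation (13) for the unique samples
    H_loc_RS : (n_U,) modified estimators <s|H|RS> / psi_theta(s)
    r        : (n_U,) relative frequencies r(s) = n(s)/N_s of the unique samples
    """
    E_ss = np.real(np.sum(H_loc * r))              # E_{s s'} appearing in Equation (16)
    E_sRS = 2.0 * np.real(np.sum(H_loc_RS * r))    # E_{s RS} appearing in Equation (16)
    return (alpha**2 * E_RS
            + (1.0 - alpha**2) * E_ss
            + alpha * np.sqrt(1.0 - alpha**2) * E_sRS)

# toy numbers, purely illustrative
print(energy_functional(0.9, -1.0,
                        np.array([-0.2 + 0.0j, 0.1 + 0.05j]),
                        np.array([0.05 - 0.02j, -0.03 + 0.01j]),
                        np.array([0.7, 0.3])))
```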

To prevent numerical instabilities during the optimization of Equation (4), it is necessary to constrain α to the interval [ − 1, 1]; we use the parametrization \(\alpha=(1+\tanh {\alpha }_{0})/2\), which restricts α to (0, 1). After updating the network parameters θ at each iteration, the reweighting parameter α0 is dynamically adjusted according to the gradient of E(θ, α) in Equation (14) with respect to α0, i.e.,

$${\nabla }_{{\alpha }_{0}}E({{{\boldsymbol{\theta }}}},\alpha )=2\alpha {\nabla }_{{\alpha }_{0}}\alpha \left[{E}_{{{{\rm{RS}}}}}-{E}_{{{{\boldsymbol{s}}}}{{{{\boldsymbol{s}}}}}^{{\prime} }}+\frac{{E}_{{{{\boldsymbol{s}}}}{{{\rm{RS}}}}}\left(1-2{\alpha }^{2}\right)}{2\alpha \sqrt{1-{\alpha }^{2}}}\right],$$
(16)

where \({E}_{{{{\boldsymbol{s}}}}{{{{\boldsymbol{s}}}}}^{{\prime} }}={{\mathbb{E}}}_{{{{\boldsymbol{s}}}} \sim {q}_{{{{\boldsymbol{\theta }}}}}}\left[{H}_{{{{\rm{loc}}}}}({{{\bf{s}}}})\right]\) and \({E}_{{{{\boldsymbol{s}}}}{{{\rm{RS}}}}}=2{{{\rm{Re}}}}\left({{\mathbb{E}}}_{{{{\boldsymbol{s}}}} \sim {q}_{{{{\boldsymbol{\theta }}}}}}\left[{H}_{{{{\rm{loc}}}}}^{{{{\rm{RS}}}}}\left({{{\boldsymbol{s}}}}\right)\right]\right)\). For the optimizer, we use stochastic gradient descent for Equation (16) and preconditioned gradient methods60,61 for Equation (15) with adaptable learning rate schedulers (see Supplementary Note B6 for more details).
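
A short sketch of the corresponding update of α0, combining the parametrization α = (1 + tanh α0)/2 with the gradient of Equation (16) and a plain SGD step, is given below; the learning rate and the toy values of ERS, Ess′ and EsRS are illustrative.

```python
import numpy as np

def alpha_from_alpha0(alpha0):
    """Reweighting parametrization alpha = (1 + tanh(alpha0)) / 2."""
    return 0.5 * (1.0 + np.tanh(alpha0))

def alpha0_gradient(alpha0, E_RS, E_ss, E_sRS):
    """Gradient of E(theta, alpha) with respect to alpha0, Equation (16)."""
    alpha = alpha_from_alpha0(alpha0)
    dalpha_dalpha0 = 0.5 / np.cosh(alpha0)**2          # derivative of the parametrization
    return 2.0 * alpha * dalpha_dalpha0 * (
        E_RS - E_ss
        + E_sRS * (1.0 - 2.0 * alpha**2) / (2.0 * alpha * np.sqrt(1.0 - alpha**2))
    )

# one plain SGD step on alpha0 with an illustrative learning rate
alpha0, lr = 0.5, 1e-2
grad = alpha0_gradient(alpha0, E_RS=-1.0, E_ss=-0.2, E_sRS=0.05)
alpha0 -= lr * grad
print(alpha_from_alpha0(alpha0))
```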