Introduction

Full configuration interaction (FCI) offers the most complete and accurate description of a molecule’s electronic structure within a given basis set, providing the exact spectral solution to the non-relativistic electronic Schrödinger equation1,2,3,4,5,6. Due to its variational nature, FCI is particularly well-suited for treating relativistic effects, such as spin–orbit and spin–spin couplings beyond perturbation theory7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23, which are fundamentally rooted in the electronic Dirac equation.

At its core, solving the FCI problem reduces to diagonalizing a large many-electron Hamiltonian matrix. This matrix is Hermitian, sparse, and typically diagonally dominant, making it well-suited to iterative diagonalization techniques that can efficiently converge on a few eigenstates without requiring full storage or construction of the entire matrix. However, as the number of determinants grows factorially with system size, reflecting the combinatorial nature of Slater determinant enumeration within the full Hilbert space, even iterative methods become intractable beyond a certain threshold.

The CI wavefunction is often expressed as a linear combination of Slater determinants, typically generated by excitations from a mean-field self-consistent field (SCF) ground-state reference. In the relativistic regime, this framework must be reformulated using complex-valued 2- or 4-spinor wavefunctions, as required by the Dirac formalism24,25.

Figure 1 illustrates the historical progression of CI implementations, highlighting major breakthroughs in the achievable scale of determinants. Prior to this work, over a span of 35 years (1990–2024), the field advanced from handling billions to trillions of determinants, driven largely by advances in computer hardware technologies. Although CI is amenable to large-scale parallel processing schemes26,27,28,29,30,31, the explosive growth in memory requirements has historically restricted its applicability to only the smallest chemical systems. Relativistic CI is even more limited, due to the intrinsically larger spinor configuration space associated with complex-valued 2- or 4-component wavefunctions. Simply put, enabling CI for practical quantum chemistry applications demands alternative theoretical frameworks and data representations that can circumvent the brute-force enumeration of the CI space.

Fig. 1: The evolution of the state-of-the-art for CI calculations over time26,27,28,29,30,31.

Each historical point is colored according to the nature of the key development of the respective work, which we classify as either intrinsic algorithmic developments (in purple) or optimizations on the HPC hardware of the time (in blue). * Only one CI iteration was performed. Source data are provided as a Source data file.

Many CI-based wavefunction methods aim to approximate the FCI solution. These methods use different types of approximations, which affect the accuracy of the resulting wavefunction. The two main approaches are complete active space CI (CASCI) and selected CI (SCI). While both CASCI and SCI methods effectively truncate the Hilbert space of the system to a subspace of significant determinants, this significance is determined differently and at different stages of the computation.

In the CASCI method9,10,29,30,32,33,34,35,36,37,38, it is assumed that only a subspace of the full Hilbert space of the system contains meaningful correlation, and the FCI wavefunction is approximated as the CI wavefunction in the truncated space (the so-called active space). The truncation often leads to an underestimation of dynamic correlation. Applying more computationally demanding methods such as multiconfigurational self-consistent field (MCSCF)6,7,8,12,13,16,39,40,41,42,43,44,45,46,47,48, multireference configuration interaction (MRCI)6,19,48,49,50,51,52,53,54, and many-body perturbation theory (MRPT2, CASPT2, NEVPT2, MC-PDFT)13,14,22,55,56,57,58,59,60,61 is typically required to achieve qualitative and quantitative agreement with experiment.

SCI-based methods estimate the importance of each configuration in the total wavefunction based on a predefined significance criterion, which depends on the chemical problem of interest62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78. Once the most significant determinants are identified, the Hamiltonian is constructed and diagonalized within this reduced space to approximate the FCI wavefunction. As in CASCI, SCI approaches are often combined with perturbation theory to recover contributions from neglected configurations and improve quantitative accuracy64,65,67,68,69.

Even with advances in dimensionality reduction, CI remains fundamentally constrained by memory limitations. State-of-the-art implementations still require explicit storage of either Hamiltonian matrix elements or excitation lists to support on-the-fly matrix-vector operations, commonly referred to as the σ-build32,79,80,81. For large CI spaces, storing the full Hamiltonian matrix and performing direct diagonalization is clearly impractical. In on-the-fly CI algorithms, the one-electron excitation list, which encodes the allowed excitations between determinants for efficient Hamiltonian construction, scales as \(n_{{\rm{e}}}\times (n_{{\rm{h}}}+1)\times N\), where N is the number of determinants, and \(n_{{\rm{e}}}\) and \(n_{{\rm{h}}}\) denote the numbers of electrons and holes (unoccupied orbitals), respectively. Since N increases factorially with system size, the associated memory requirements grow rapidly, making conventional CI calculations infeasible for anything beyond the smallest systems.

For a CI problem involving N determinants, the size of each CI expansion vector scales linearly with N. For instance, a relativistic CI problem with one quadrillion determinants (\(10^{15}\); 100 orbitals, 88 electrons) would require ~16 petabytes (PB) of memory just to store a single CI vector composed of complex-valued double-precision coefficients. While this memory footprint alone makes such problems challenging to tackle, the memory required to store the excitation list can easily scale to the exabyte (EB) regime. This poses a fundamental barrier to scalability, even before considering computational cost.
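The arithmetic behind these estimates can be reproduced with a short sketch. It assumes that the determinant count is the plain binomial coefficient (no symmetry restriction) and that each coefficient occupies 16 bytes (two 8-byte doubles). The function names are illustrative only, and because the size of one excitation-list entry is not stated here, the sketch counts entries rather than bytes.

```python
import math

BYTES_PER_COMPLEX_DOUBLE = 16  # assumption: real + imaginary part, 8 bytes each


def n_determinants(n_orbitals: int, n_electrons: int) -> int:
    """Ways to place n_electrons in n_orbitals spinor orbitals (no symmetry)."""
    return math.comb(n_orbitals, n_electrons)


def ci_vector_bytes(n_dets: int) -> int:
    """Storage for one dense, complex double-precision CI vector."""
    return n_dets * BYTES_PER_COMPLEX_DOUBLE


def excitation_list_entries(n_electrons: int, n_holes: int, n_dets: int) -> int:
    """Entry count of the one-electron excitation list, n_e * (n_h + 1) * N."""
    return n_electrons * (n_holes + 1) * n_dets


n_dets = n_determinants(100, 88)            # 100 orbitals, 88 electrons
vector_pb = ci_vector_bytes(n_dets) / 1e15  # decimal petabytes
entries = excitation_list_entries(88, 100 - 88, n_dets)

print(f"determinants:            {n_dets:.3e}")
print(f"one CI vector:           {vector_pb:.1f} PB")
print(f"excitation-list entries: {entries:.3e}")
```

With roughly \(10^{18}\) entries, even a few bytes per entry places the conventional excitation list in the exabyte range, consistent with the estimate above.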

A recently introduced CI matrix-vector product algorithm31 leverages the exact factorization of the active space into small tensor products of distributed active spaces, an approach known as the small-tensor product distributed active space (STP-DAS) framework, illustrated in Fig. 2A. The STP-DAS algorithm reformulates the large CI matrix-vector product into a sequence of small tensor products, each embedded within a distributed active space, computed on-the-fly using string-based methods. This advance exploits the mathematical condition governing the phase relationship between the global address and the local DAS address of any CI matrix element, enabling the use of only small local determinant address strings in the CI matrix-vector build and overcoming the memory bottleneck associated with storing the full excitation list. This formulation enables extensive reuse of Hamiltonian excitation lists, leading to a dramatic reduction in memory demands. For a CI problem involving one quadrillion (\(10^{15}\)) determinants, the STP-DAS framework reduces the excitation list memory requirement from 12 exabytes (EB) to 25 gigabytes (GB), an 8-orders-of-magnitude reduction. In addition, by evenly distributing the computation of small tensor products, the STP-DAS algorithm achieves excellent load balance with minimal node-to-node communication overhead, ensuring strong scalability across both single-node and large-scale parallel architectures.

Fig. 2: Categorical compression within the small-tensor-product distributed active space (STP-DAS) framework.

A The STP-DAS framework decomposition of a complete active space configuration interaction (CASCI) calculation into a direct sum of categorical excitations. The large excitation lists can be factored into much smaller categorical excitation lists. Purple sections within active spaces represent electron-occupied orbitals. B The exact two-component full configuration interaction (X2C-FCI) ground state energy of the Mg2+ ion within the cc-pVNZ-DK142,143 (N = 2, 3, 4) basis sets, along with the extrapolated complete basis set limit. Source data are provided as a Source data file. C Average execution time (in seconds) of the compression-compatible STP-DAS algorithm per σ-build of a thallium hydride (TlH) test case versus the node count (5 iterations, 1 message passing interface (MPI) process per node, 40 symmetric multiprocessing (SMP) threads per MPI process). Here, H is the Hamiltonian, C is the CI vector, and σ is their product. The dashed lines illustrate the ideal strong scaling behavior of each CASCI calculation. Source data are provided as a Source data file. D The representation of the subspace expansion vector in a traditional configuration interaction (CI) picture, the decomposed subspace vector in the STP-DAS framework, and the numerically exact compression of the subspace expansion vector in the categorically compression-compatible STP-DAS representation. The color of CI coefficients indicates their configuration category, while their brightness symbolizes their magnitude. White indicates a magnitude of zero. E A schematic representation of the lossless, compression-compatible, STP-DAS σ-build algorithm. The Hamiltonian matrix is represented as a heatmap, where brighter elements have larger magnitudes. The color of the vector elements indicates the configuration category of the corresponding CI coefficients. Note that the σ-build preserves categorical compression. 
F An illustration comparing the traditional Davidson preconditioner with the compression-compatible preconditioner to generate successive subspace expansion vectors. The compression-compatible preconditioner appends the subspace with the same effective search direction as the traditional Davidson preconditioner without compromising its compression.

Since the excitation list typically dominates the storage requirements in CI calculations, the STP-DAS framework overcomes a longstanding memory bottleneck and enables CI computations that were previously deemed intractable. By effectively eliminating the memory footprint of the excitation lists, the storage of CI vector coefficients now emerges as the bottleneck in many large-scale CI problems.

Revisiting the earlier CI example of \(10^{15}\) determinants: even after eliminating the memory bottleneck associated with storing excitation lists, the same calculation still demands 16 PB of memory to store the numerically exact coefficients of a single CI expansion vector in an iterative solver. Given that practical CI calculations typically require multiple subspace vectors for convergence, it becomes clear that storing these expansion vectors now represents the dominant memory bottleneck in large CI calculations. To address this challenge, we leverage the locally compressed nature of the STP-DAS framework to efficiently compute ultra-large-scale CI problems involving up to a quadrillion (\(10^{15}\)) determinants. This approach yields deterministic, numerically exact solutions and effectively shifts CI calculations from being memory-bound to compute-bound.

Results

Before presenting methodological details and performance benchmarks, we highlight the largest CASCI calculation to date for the ground state of HBrTe, enabled by the compression-compatible STP-DAS framework to be introduced herein. HBrTe is a substituted hydrogen chalcogenide in which one of the hydrogens is replaced by a bromine atom to lower the symmetry of the molecule. We performed a relativistic exact two-component21,22,82,83,84,85,86,87,88,89,90,91,92,93 CASCI (X2C-CASCI) calculation (100 2-spinor orbitals, 88 electrons, \(1.05\times 10^{15}\) complex-valued 2-spinor determinants) of the ground state of HBrTe using the compression-compatible STP-DAS framework. We also performed a calculation with the same number of determinants for the ground state of a magnesium atom (see Section S3 of the Supplementary Information). The calculation ran on the National Energy Research Scientific Computing Center’s Perlmutter high-performance supercomputer with a total of 1000 nodes (AMD EPYC 7763 Milan, 128,000 compute cores, 512 GB of RAM per node, 200 GB/s NIC, 2 message passing interface (MPI) processes per node, and 64 symmetric multiprocessing (SMP) threads per MPI process).

The excitation list is generated without assuming any symmetry of the target state. Consequently, the calculation is formally performed in a quadrillion-determinant space, with all determinants explicitly included. Because the CI coefficients are complex, the memory footprint is twice that of an analogous non-relativistic calculation. Moreover, the complex arithmetic makes the computational cost (in FLOP count) equivalent to that of a non-relativistic calculation with more than twice as many determinants.

Table 1 summarizes the results of the HBrTe calculation. The ground-state energy converged in 9 iterations to microhartree precision (\(< 10^{-6}\) a.u.) with a total runtime of 34.5 h. In each iteration, an additional CI expansion vector was introduced to accelerate convergence, while the compression algorithm dynamically adapted to the expanding vector space, leading to a gradual increase in computational cost. A total of 9 expansion vectors were involved in the σ-build. Despite the enormous configuration space of over one quadrillion (\(1.05\times 10^{15}\)) complex-valued 2-spinor determinants, the average σ-build time per vector remained just 3.8 h.

Table 1 Details of the convergence of the X2C-CASCI calculation (100 2-spinor orbitals, 88 electrons, \(10^{15}\) 2-spinor determinants) of the ground state of HBrTe (x2c-TZVPall96)

The ground-state energy of the HBrTe molecule from this CI calculation is −9395.028004 a.u. Leveraging the gap theorem94,95, we determine that our \(\left(\begin{array}{c}100\\ 88\end{array}\right)\) X2C-CASCI result lies within \(10\times 10^{-6}\) a.u. of the true x2c-TZVPall96 ground state energy within that active space, well below any chemically meaningful threshold. A detailed analysis is provided in “Methods”.

This work represents a 3-orders-of-magnitude increase in CI space and a 6-orders-of-magnitude increase in FLOP count, which is estimated using \({{\mathscr{O}}}({N}^{2})\), compared to the previous state-of-the-art in CI calculations30,31. Compared to previous state-of-the-art CI calculations, this work also achieves a 6-orders-of-magnitude speedup in time-to-completion, as measured in core seconds per exaFLOP (\(\frac{\,{\mbox{core}}\cdot {\mbox{second}}}{{\mbox{exaFLOP}}\,}\), see Section S2 in the Supplementary Information for analysis). This ultra-large-scale CI calculation is enabled by the STP-DAS-based numerically exact categorical compression scheme, which reduces the memory required to store the 9 CI expansion vectors from 134 PB to less than 500 TB while maintaining a good load balance31, making the computation feasible on most existing supercomputing infrastructures.

We now describe the algorithmic developments that enable such CI calculations to be performed on existing supercomputing resources. Detailed algorithms, parallel implementation strategies, and error bound analyses are provided in the Supplementary Information. The central concept of the STP-DAS framework is the systematic partitioning of the full CI orbital space into a collection of distributed active spaces. Within each active space, configurations are further classified into categorical subspaces, rigorously defined by distinct electron occupation patterns, as illustrated in Fig. 2A31. The STP-DAS framework reformulates the CI σ-build as a sum of small-tensor products, each uniquely addressed via a global tensor looping structure. The major memory bottleneck associated with storing the excitation list is eliminated by allowing categorical subspaces to share compact, local excitation lists.

In the largest CASCI calculation (\(10^{15}\) determinants) presented here, employing 13 distributed active spaces, the STP-DAS approach reduces the excitation list memory requirement from \(12\times 10^{9}\) GB to just 25 GB. However, storing all nine CI expansion vectors for a system with \(10^{15}\) determinants would require ~134 PB of memory. On a high-performance computing system such as Perlmutter, this translates to more than 275,000 nodes, each equipped with 512 GB of memory. A straightforward element-wise sparsity treatment, however, does not meet the STP-DAS condition. To overcome this limitation, the following section introduces a categorical compression scheme that achieves the necessary memory reduction, making it possible to carry out large relativistic CI calculations on a medium-sized computing cluster.

With the STP-DAS framework, the memory bottleneck associated with storing CI expansion vectors is effectively eliminated by applying numerically exact categorical compression. The STP-DAS CI expansion vectors take the form

$${{\bf{C}}}={\bigoplus}_{{{\mathcal{B}}}}{{{\bf{C}}}}^{{{\mathcal{B}}}},$$
(1)

where \({{\mathcal{B}}}\) is a category, defined by a unique electron occupation pattern within the distributed active spaces31.

The compression scheme, categorical compression (see Fig. 2D), stores the \({{{\bf{C}}}}^{{{\mathcal{B}}}}\) vectors as compressed sparse column (CSC) vectors. In this format, each categorical expansion vector \({{{\bf{C}}}}^{{{\mathcal{B}}}}\) is represented by two equally sized arrays: one, \({V}_{{{\rm{CSC}}}}^{{{{\bf{C}}}}^{{{\mathcal{B}}}}}\equiv \{{C}_{{K}^{{{\mathcal{B}}}}}:{C}_{{K}^{{{\mathcal{B}}}}}\ne 0\}\), stores the values of the nonzero coefficients, while the other, \({A}_{{{\rm{CSC}}}}^{{{{\bf{C}}}}^{{{\mathcal{B}}}}}\equiv \{{K}^{{{\mathcal{B}}}}:{C}_{{K}^{{{\mathcal{B}}}}}\ne 0\}\), stores their corresponding local addresses. The tensor-loop structure in the STP-DAS σ-build algorithm is reformulated in terms of categorically compressed local addresses together with their corresponding global phase factors (see Fig. 2E).
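The storage layout described above can be illustrated with a minimal, pure-Python sketch. It is not the authors' implementation: the category labels, the `CompressedCategory` container, and the helper names are hypothetical, and the real scheme operates on distributed tensors rather than Python lists. The sketch only shows the two equally sized arrays (local addresses and nonzero values) per category, and that an all-zero category is not stored at all.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class CompressedCategory:
    addresses: list[int]    # local addresses K^B with C_{K^B} != 0
    values: list[complex]   # the matching nonzero coefficients


def compress(categories: dict[str, list[complex]]) -> dict[str, CompressedCategory]:
    """Keep only nonzero coefficients; drop categories that are entirely zero."""
    compressed = {}
    for label, coeffs in categories.items():
        addresses = [k for k, c in enumerate(coeffs) if c != 0]
        if addresses:  # an all-zero category is skipped as a whole
            values = [coeffs[k] for k in addresses]
            compressed[label] = CompressedCategory(addresses, values)
    return compressed


def decompress(compressed: dict[str, CompressedCategory],
               sizes: dict[str, int]) -> dict[str, list[complex]]:
    """Rebuild the dense per-category vectors; the round trip is lossless."""
    dense = {}
    for label, size in sizes.items():
        coeffs = [0j] * size
        block = compressed.get(label)
        if block is not None:
            for k, c in zip(block.addresses, block.values):
                coeffs[k] = c
        dense[label] = coeffs
    return dense


# Hypothetical three-category expansion vector.
dense = {
    "B0": [0.9 + 0.1j, 0j, 0j, 0.2j],
    "B1": [0j, 0j, 0j],            # empty category: never stored
    "B2": [0j, -0.3 + 0j],
}
packed = compress(dense)
restored = decompress(packed, {label: len(c) for label, c in dense.items()})
```

Because only exact zeros are omitted, `restored` equals `dense`, which is the sense in which the compression is numerically exact.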

In contrast to element-wise compression, categorical compression can eliminate all configurations within a category, i.e., skip an entire category at once. This approach preserves the vectorized structure better than element-wise compression does. The major difficulty of any sparse matrix-vector product algorithm is the lack of a priori knowledge of the location of the nonzero elements. This difficulty leads to a major bottleneck rooted in nonuniform memory accesses. Categorical compression localizes nonuniform memory accesses within a category, which is always orders of magnitude smaller than the full CI space. This localized memory access pattern is the central strength of the categorical compression scheme.

Within each category, element-wise compression is still applicable to maximize sparsity. Most importantly, a category-based compression scheme naturally supports distributed small tensor products, taking advantage of the reduced memory footprint and improved parallel load balance of the STP-DAS framework.

In summary, the numerically exact categorical compression introduced here allows the STP-DAS σ-build algorithm to bypass entire categories of determinants while preserving both the vectorized structure and the local addressing scheme of STP-DAS. Because the compression is fully lossless, the omitted determinants have no impact on the resulting Ritz eigenvalue–eigenvector pair.

The final hurdle in reducing CI memory demands lies in the iterative solver. In CI, the Hamiltonian operator is diagonalized iteratively within the full Hilbert space of the system. As a result, the error in the computed Ritz value directly reflects the missing correlation energy in the associated approximate wavefunction. This enables the Ritz residual, defined as \({{\bf{r}}}={{\bf{H}}}{{\bf{C}}}-E{{\bf{C}}}\), to be computed and appended as an additional CI expansion vector. The norm of the residual provides rigorous bounds on the missing correlation energy relative to the true eigenpair (eigenvector and eigenvalue) of the Hamiltonian in the chosen basis94,95,97,98,99,100,101,102,103. A widely used approach that leverages this principle is the Davidson iterative solver104. Other related methods that use the residual norm as the convergence criterion include the locally optimal block preconditioned conjugate gradient (LOBPCG) method105, the Jacobi-Davidson106 method, and the generalized preconditioned locally harmonic residual (GPLHR) method107,108.
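To make the role of the Ritz residual concrete, the following is a minimal dense NumPy sketch of a single-root Davidson iteration. It is a textbook-style toy, not the STP-DAS solver: the function name `davidson_lowest`, the random test matrix, and the tolerances are assumptions chosen for illustration, and the σ-build is an ordinary dense matrix product.

```python
import numpy as np


def davidson_lowest(H, max_iter=50, tol=1e-8):
    """Lowest eigenpair of a Hermitian, diagonally dominant matrix (toy version)."""
    n = H.shape[0]
    diag = np.real(np.diag(H))
    v0 = np.zeros(n, dtype=complex)
    v0[np.argmin(diag)] = 1.0          # start on the lowest diagonal element
    V = v0[:, None]                    # subspace basis, one column per vector
    for _ in range(max_iter):
        HV = H @ V                     # sigma-build for all subspace vectors
        theta, y = np.linalg.eigh(V.conj().T @ HV)
        E = theta[0]                   # Ritz value
        C = V @ y[:, 0]                # Ritz vector
        r = HV @ y[:, 0] - E * C       # Ritz residual r = HC - EC
        res_norm = np.linalg.norm(r)
        if res_norm < tol:
            break
        # Diagonal (Davidson) preconditioner; near-singular entries are zeroed.
        denom = E - diag
        small = np.abs(denom) < 1e-12
        t = np.where(small, 0.0, r / np.where(small, 1.0, denom))
        for _pass in range(2):         # two Gram-Schmidt passes for stability
            t = t - V @ (V.conj().T @ t)
        t_norm = np.linalg.norm(t)
        if t_norm < 1e-14:             # no new direction left to add
            break
        V = np.hstack([V, (t / t_norm)[:, None]])
    return E, C, res_norm


# Hypothetical test problem: complex Hermitian, diagonally dominant.
rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
H = np.diag(np.arange(1.0, n + 1)) + 0.005 * (A + A.conj().T)

E, C, res_norm = davidson_lowest(H)
E_exact = np.linalg.eigvalsh(H)[0]
print(f"Davidson: {E:.10f}  exact: {E_exact:.10f}  |r| = {res_norm:.1e}")
```

Each pass adds one expansion vector to the subspace, which is why the number of stored vectors, and hence memory, grows with the iteration count.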

By applying numerical or convergence thresholds at various stages of the Davidson method104, one can exploit the sparsity of newly generated CI expansion vectors. With sufficiently tight thresholds, these vectors can span the part of the Hilbert space required to accurately represent the desired wavefunction and drive the iterative diagonalization to any desired level of precision. The Davidson method104 uses the Davidson preconditioner, which generates the ith component of the next trial expansion vector according to

$${t}_{i}\leftarrow \left\{\begin{array}{ll}0,\quad &\,{{\mbox{if}}}\,\,| \lambda -{H}_{ii}| \,\,{{\rm{is}}}\; {{\rm{small}}}\,\\ \frac{{r}_{i}}{\lambda -{H}_{ii}},\quad &\,{{\mbox{else}}}\,\end{array}\right.$$
(2)

Here, λ is the Ritz value of the current iteration, \(H_{ii}\) is the ith diagonal element of the Hamiltonian H, and \(r_i\) is the ith component of the current residual. Among the various preconditioners employed in the Davidson method97,98,103,109,110,111, compression-compatible preconditioners, which discard terms \(t_i\) below a numerical threshold ε, have been shown to achieve convergence to the exact same results as the traditional Davidson preconditioner97,98,99,101,112.

In this work, we apply the compression-compatible categorical preconditioner

$$\begin{array}{rcl}{s}_{i}&\leftarrow &\left\{\begin{array}{ll}\frac{{r}_{i}}{\lambda -{H}_{ii}},\quad &\,{{\mbox{if}}}\,| \lambda -{H}_{ii}| \ge 1{0}^{-12}\\ 0,\quad &\,{{\mbox{else}}}\,\end{array}\right.\\ {t}_{i}&\leftarrow&\left\{\begin{array}{ll}{s}_{i},\quad &\,{{\mbox{if}}}\,| {s}_{i}| \ge \varepsilon \parallel {{\bf{s}}}\parallel \\ 0,\quad &\,{{\mbox{else}}}\end{array}\right.\end{array}$$
(3)

using the nonzero residual elements \(r_i\). We do this on-the-fly to avoid explicitly storing the prohibitively large diagonal of H (a dense vector of size \(N_{{\rm{dets}}}\)). Figure 2F illustrates the expansion of the CI vector space enabled by the compression-compatible categorical preconditioner used in the Davidson method implemented here. Eq. (3) closely resembles the preconditioner proposed in ref. 112, with the key distinction being the inclusion of the Davidson-preconditioned residual norm \(\parallel {{\bf{s}}}\parallel\) in the dropping criterion. This facilitates dynamic threshold adjustment: as the iterations progress and \(\parallel {{\bf{s}}}\parallel\) decreases, the criterion becomes more stringent. More importantly, the factor of \(\parallel {{\bf{s}}}\parallel\) ensures that numerical thresholding is applied to the generated expansion vectors relative to the total norm of their exact (traditional Davidson) counterparts, rather than some fixed cutoff on the absolute values of their entries. This results in a very accurate CI space expansion scheme at the cost of computing and contracting a significant number of Hamiltonian matrix elements (to evaluate \({{\bf{s}}}\) exactly), which would be intractable without the STP-DAS framework.
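The two-step rule of Eq. (3) can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the production code: the residual is held as a dictionary of nonzero elements, the diagonal element \(H_{ii}\) is supplied by a hypothetical callable `h_diag` (standing in for the on-the-fly evaluation), and the function name is invented for this example.

```python
import math


def categorical_preconditioner(residual, ritz_value, h_diag, eps, denom_tol=1e-12):
    """Apply Eq. (3) to the nonzero residual elements.

    residual   : dict mapping address i -> nonzero r_i
    ritz_value : current Ritz value (lambda)
    h_diag     : callable returning H_ii for address i (hypothetical stand-in)
    eps        : relative dropping threshold (epsilon)
    """
    # Step 1: Davidson-preconditioned residual s_i, zeroed when the
    # denominator |lambda - H_ii| falls below denom_tol.
    s = {}
    for i, r_i in residual.items():
        denom = ritz_value - h_diag(i)
        if abs(denom) >= denom_tol:
            s[i] = r_i / denom

    # Step 2: keep s_i only if |s_i| >= eps * ||s||, so the cutoff is
    # relative to the norm of the unthresholded vector.
    s_norm = math.sqrt(sum(abs(x) ** 2 for x in s.values()))
    cutoff = eps * s_norm
    return {i: s_i for i, s_i in s.items() if abs(s_i) >= cutoff}


# Hypothetical data: three nonzero residual elements, toy diagonal H_ii = i.
residual = {0: 1e-3 + 0j, 5: 0.2j, 9: 1e-9 + 0j}
t = categorical_preconditioner(residual, ritz_value=-0.5, h_diag=float, eps=1e-3)
print(sorted(t))  # addresses that survive the relative threshold
```

Because the cutoff scales with \(\parallel {{\bf{s}}}\parallel\), the same ε discards fewer elements in absolute terms as the iteration converges and the preconditioned residual shrinks.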

Algorithms and pseudocodes of the compression-compatible STP-DAS method are presented in “Methods”, along with discussions on load balancing and parallel implementation. The convergence behavior of the STP-DAS framework equipped with the compression-compatible preconditioner defined in Eq. (3) was evaluated across three systems with varying degrees of electron correlation: the magnesium atom, diatomic nitrogen, and a model carbon nanotube. The results are provided in Section S1 of the Supplementary Information.

The analysis reveals that overly aggressive thresholding can cause the Davidson procedure to stagnate, thereby preventing convergence to the correct electronic wavefunction. When the threshold ε is too large, newly generated trial vectors quickly become linearly dependent on the existing subspace vectors, signaling that the span of the modified subspace has saturated before achieving convergence. However, when the preconditioning threshold satisfies

$$\varepsilon \,\lessapprox\, \frac{10}{\sqrt{{N}_{{{\rm{dets}}}}}},$$
(4)

the resulting energies agree with their exact values to better than \(10^{-7}\,E_{{\rm{h}}}\), and the residual norms become correspondingly small, demonstrating successful and reliable convergence.
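As a rough numerical illustration, the right-hand side of Eq. (4) can be evaluated for two determinant counts quoted in this work (about \(6.3\times 10^{10}\) and \(1.05\times 10^{15}\)). These numbers only evaluate the bound; they are not the thresholds used in the corresponding runs, which are not stated in this section.

$$\frac{10}{\sqrt{6.3\times 10^{10}}}\approx 4.0\times 10^{-5},\qquad \frac{10}{\sqrt{1.05\times 10^{15}}}\approx 3.1\times 10^{-7}.$$

For reference, a normalized vector spread uniformly over \(N_{{\rm{dets}}}\) determinants has entries of magnitude \(1/\sqrt{N_{{\rm{dets}}}}\), so the bound places the relative cutoff at about ten times that uniform magnitude.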

The CI wavefunction of highly correlated systems is composed of a large number of determinants with small CI coefficients. Because the Hilbert space of the problem is never truncated, no determinants are discarded from the wavefunction itself. As a result, the compression-compatible preconditioner readily facilitates convergence to the exact wavefunction, provided that the preconditioning threshold ε is sufficiently small for the true eigenvector to be accurately represented in the subspace spanned by the modified expansion vectors.

With the capability to perform large CI calculations, energetic extrapolation to the correlation limit becomes feasible for many-electron systems. Figure 2B illustrates the correlation-consistent extrapolation for the Mg\(^{2+}\) ion using double-zeta (DZ, 36 orbitals), triple-zeta (TZ, 68 orbitals), and quadruple-zeta (QZ, 118 orbitals) basis sets, involving \(2.54\times 10^{8}\), \(2.91\times 10^{11}\), and \(9.75\times 10^{13}\) 2-spinor determinants, respectively. The complete basis set limit of −199.16704295 a.u. was obtained using a mixed Gaussian extrapolation scheme tailored for correlation-consistent basis sets113.

To demonstrate the strong scaling behavior of the compression-compatible STP-DAS σ-build algorithm, we performed relativistic X2C-CASCI calculations on the thallium hydride (TlH) molecule using active spaces of 40 and 41 2-spinor orbitals and 24 electrons, \(\left(\begin{array}{c}40\\ 24\end{array}\right)\) and \(\left(\begin{array}{c}41\\ 24\end{array}\right)\), corresponding to 63 and 152 billion 2-spinor determinants, respectively. Five distributed active spaces (DASs) were employed. These calculations were executed on the University of Washington’s Hyak HPC system, a small-sized cluster where each node is equipped with two Intel Xeon Gold 6230 CPUs and a single 100 GB/s network interface card. As shown in Fig. 2C, even with just 5 compute nodes, relativistic X2C-CASCI calculations involving tens to hundreds of billions of determinants require only 4–6 min per σ-build on average. Increasing to 30 nodes further reduces the cost to just over 1 min. Past 30 nodes, the communication time dominates the runtime and the calculations no longer scale. This benchmark demonstrates that billion- and even trillion-determinant CI calculations are now feasible on a small-scale computing cluster.
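The determinant counts quoted for the active spaces in this work follow from binomial coefficients, since no symmetry restriction is assumed. The short check below recomputes them. The dictionary labels are informal, and the electron count of 10 for Mg\(^{2+}\) is an assumption inferred from the ion's charge rather than a value stated in the text.

```python
import math

# (orbitals, electrons) for each active space; the Mg2+ electron count is assumed.
active_spaces = {
    "Mg2+ DZ": (36, 10),
    "Mg2+ TZ": (68, 10),
    "Mg2+ QZ": (118, 10),
    "TlH (40,24)": (40, 24),
    "TlH (41,24)": (41, 24),
    "Rb4": (50, 28),
    "Xe2": (60, 12),
    "HBrTe": (100, 88),
}

# Determinant counts as quoted in the text.
quoted = {
    "Mg2+ DZ": 2.54e8,
    "Mg2+ TZ": 2.91e11,
    "Mg2+ QZ": 9.75e13,
    "TlH (40,24)": 63e9,
    "TlH (41,24)": 152e9,
    "Rb4": 8.9e13,
    "Xe2": 1.4e12,
    "HBrTe": 1.05e15,
}

counts = {name: math.comb(m, n) for name, (m, n) in active_spaces.items()}

for name, count in counts.items():
    deviation = abs(count / quoted[name] - 1.0)
    print(f"{name:12s} {count:.3e}  (quoted {quoted[name]:.2e}, "
          f"relative deviation {deviation:.1%})")
```

The same helper makes it easy to gauge how quickly the CI space grows when one more orbital is added, as in the step from 40 to 41 orbitals for TlH.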

Additionally, we performed X2C-CASCI calculations for the ground states of two highly correlated systems, showcasing the applicability of the compression-compatible STP-DAS framework to such systems. These systems were chosen from opposite ends of the correlation spectrum, one strongly statically correlated and the other strongly dynamically correlated. Square Rb4, a relativistic analog of H4 (refs. 114–126), displays strong static correlation, and Xe2 is a dynamically correlated noble gas dimer127,128,129,130,131,132,133,134,135.

Tables 2 and 3 summarize the results of the Rb4 and Xe2 calculations. The Rb4 calculation (50 2-spinor orbitals, 28 electrons, \(8.9\times 10^{13}\) 2-spinor determinants) ran on the National Energy Research Scientific Computing Center’s Perlmutter high-performance supercomputer with a total of 100 nodes (AMD EPYC 7763 Milan, 12,800 compute cores, 512 GB of RAM per node, 200 GB/s NIC, 1 MPI process per node, and 128 SMP threads per MPI process) and took 6 iterations and 11.8 h to converge. The Xe2 calculation (60 2-spinor orbitals, 12 electrons, \(1.4\times 10^{12}\) 2-spinor determinants) ran on the same platform with 256 nodes and took 7 iterations and 36.1 h to converge.

Table 2 Details of the convergence of the X2C-CASCI calculation (50 2-spinor orbitals, 28 electrons, \(8.9\times 10^{13}\) 2-spinor determinants) of the ground state of Rb4 (cc-pvtz-x2c141)
Table 3 Details of the convergence of the X2C-CASCI calculation (60 2-spinor orbitals, 12 electrons, \(1.4\times 10^{12}\) 2-spinor determinants) of the ground state of Xe2 (x2c-TZVPall-2c96)

The ground-state energies of the Rb4 and Xe2 molecules from these calculations are −11916.152725 a.u. and −14889.646696 a.u., respectively. The gap theorem94,95 guarantees that these X2C-CASCI results lie within \(0.53\times 10^{-6}\) a.u. and \(17.56\times 10^{-6}\) a.u. of the true ground state energies within the corresponding active spaces and basis sets (see “Methods”).

In summary, by combining compression-compatible preconditioners with compression-compatible categorical CI vectors, the STP-DAS framework drastically reduces the memory footprint of both the excitation lists and the CI expansion vectors. In the largest relativistic CASCI calculation presented here, spanning \(10^{15}\) determinants across 13 distributed active spaces, the STP-DAS approach reduces the memory required for the excitation list from \(12\times 10^{9}\) GB to just 25 GB, and for 9 CI expansion vectors from 134 PB to less than 500 TB. These reductions make quadrillion-determinant calculations tractable on current supercomputing architectures. While most of the community may not have access to the hundreds of compute nodes required for such runs, this work also demonstrates the practical feasibility of trillion-determinant calculations on just a few nodes and even on a laptop.

Discussion

In this work, we conducted a relativistic configuration interaction (CI) calculation for the ground state of HBrTe in a quadrillion-determinantal space. This calculation was enabled by numerically exact categorical compression within the STP-DAS framework, which effectively eliminates the memory bottlenecks associated with storing both excitation lists and CI expansion vectors. Compared to previous state-of-the-art CI calculations, this work represents a 3-orders-of-magnitude increase in CI space, a 6-orders-of-magnitude increase in FLOP count, and a 6-orders-of-magnitude improvement in computational speed.

We introduced a categorically compressed representation of the CI expansion vectors and reformulated the STP-DAS σ-build algorithm to take advantage of this structure. By expressing the global expansion vector as a direct sum of compressed local components, the algorithm efficiently skips all coefficients that do not contribute to the categorical σ-vector. This approach is further enabled by a compression-compatible preconditioner, which generates compressed expansion directions within the Davidson procedure.

The resulting categorically compressed STP-DAS σ-build algorithm demonstrates excellent strong scaling behavior and yields dramatic reductions in both runtime and memory footprint. These benefits extend seamlessly to both relativistic (two- and four-component) and non-relativistic CI calculations. To highlight this capability, we computed the \(\left(\begin{array}{c}100\\ 88\end{array}\right)\) X2C-CASCI ground-state energy of HBrTe using over one quadrillion (\(10^{15}\)) complex-valued 2-spinor determinants. The categorically compressed STP-DAS approach spans \(10^{15}\) determinants across 13 distributed active spaces, reducing the memory required for the excitation list from \(12\times 10^{9}\) GB to only 25 GB, and for nine CI expansion vectors from 134 PB to under 500 TB. It converges the ground-state wavefunction of HBrTe in just nine iterations over a 34.5-h runtime. This achievement represents the largest CI calculation reported to date. Additionally, we achieved σ-build times of just 5 minutes for systems with ~150 billion complex-valued 2-spinor determinants using only a few compute nodes. The capability to perform large CI calculations makes basis set extrapolations to the complete basis set limit and computations on highly correlated molecular systems readily achievable with CI.

The integration of categorical compression with STP-DAS marks a paradigm shift in tackling large-scale CI problems. As quantum chemistry continues to push the limits of system complexity, the ability to carry out quadrillion-determinant calculations within tractable resource bounds establishes a powerful foundation for studying highly correlated, multireference, relativistic systems. While access to hundreds of compute nodes for quadrillion-determinant calculations may remain out of reach for most of the community, this work demonstrates the practical feasibility of trillion-determinant calculations on a small cluster.

For transition-metal, rare-earth, and heavy-element complexes, such large-scale CI calculations enable predictive simulations of electronic structure properties (bond order, covalency, polarization, etc.), spectroscopic observables (UV/Vis, X-ray, etc.), and reaction pathways, with the full orbital space consisting of both metal and ligand orbitals, treated on an equal footing.

The ability to simulate a full CI space of 100 orbitals on a classical computer not only challenges current notions of quantum supremacy, but also establishes a robust platform for developing and benchmarking quantum algorithms aimed at achieving chemical accuracy.

Methods

Lossless σ-build using the categorical compression of small tensor products

The categorical σ-build algorithm within STP-DAS31 can be reformulated to exploit the categorical compression of the expansion vectors. The compact nature of the categorical representation enables fast and memory-efficient computation of σ-vectors. The categorical σ-build algorithm implements the evaluation of

$$\sigma_{L^{{{\mathcal{A}}}}}={\scriptstyle{{1{{\rm{e}}}}}\atop} \! \sigma_{L^{{\mathcal{A}}}}+{\scriptstyle{{2{{\rm{e}}}}}\atop} \! \sigma_{L^{{\mathcal{A}}}},$$
(5)
$${\scriptstyle{{1{{\rm{e}}}}}\atop} \! \sigma_{{L}^{{\mathcal{A}}}}= {\sum}_{{\mathcal{B}}} {\sum}_{{\mathbb{K}}^{{\mathcal{B}}}_\mu \oplus {\mathbb{K}}^{{\mathcal{B}}}_\nu} {\sum}_{pq} P_{\mu \nu} \delta_{{\bar{\mathbb{X}}}_{\mu \nu}^{{\mathcal{A}}}{\bar{\mathbb{X}}}_{\mu \nu}^{{\mathcal{B}}}} \\ h_{pq}^{\prime} \langle{{\mathbb{L}}^{{\mathcal{A}}}_\mu \oplus {\mathbb{L}}^{{\mathcal{A}}}_\nu}| {\hat{E}}_{pq} |{{\mathbb{K}}^{{\mathcal{B}}}_\mu \oplus {\mathbb{K}}^{{\mathcal{B}}}_\nu}\rangle C_{{K}^{{\mathcal{B}}}},$$
(6)
$${\scriptstyle{{2{{\rm{e}}}}}\atop} \! \sigma_{{L}^{{\mathcal{A}}}}= \frac{1}{2} {\sum}_{{{\mathcal{C}}} {{\mathcal{B}}}}{\sum}_{{\mathbb{J}}^{{\mathcal{C}}}_\mu\oplus {\mathbb{J}}^{{\mathcal{C}}}_\nu}{\sum}_{{\mathbb{J}}^{{\mathcal{C}}}_{\kappa}\oplus {\mathbb{J}}^{{\mathcal{C}}}_\lambda} {\sum}_{{\mathbb{K}}^{{\mathcal{B}}}_{\kappa}\oplus {\mathbb{K}}^{{\mathcal{B}}}_\lambda} {\sum}_{pqrs} P_{\mu\nu}P_{\kappa\lambda} \\ \delta_{{\bar{\mathbb{X}}}_{\mu\nu}^{{\mathcal{A}}}{\bar{\mathbb{X}}}_{\mu\nu}^{{\mathcal{C}}}}\delta_{{\bar{\mathbb{X}}}_{\kappa\lambda}^{{\mathcal{C}}}{\bar{\mathbb{X}}}_{\kappa\lambda}^{{\mathcal{B}}}} g_{pqrs} \langle{{\mathbb{L}}^{{\mathcal{A}}}_\mu\oplus {\mathbb{L}}^{{\mathcal{A}}}_\nu} |{\hat{E}}_{pq} |{{\mathbb{J}}^{{\mathcal{C}}}_\mu\oplus {\mathbb{J}}^{{\mathcal{C}}}_\nu}\rangle \\ \langle{{\mathbb{J}}^{{\mathcal{C}}}_{\kappa}\oplus {\mathbb{J}}^{{\mathcal{C}}}_\lambda}| {\hat{E}}_{rs} |{{\mathbb{K}}^{{\mathcal{B}}}_{\kappa}\oplus {\mathbb{K}}^{{\mathcal{B}}}_\lambda}\rangle C_{{K}^{{\mathcal{B}}}},$$
(7)

where \(p\in {{\mathbb{X}}}_{\mu }^{{{\mathcal{A}}}},\,q\in {{\mathbb{X}}}_{\nu }^{{{\mathcal{C}}}},\,r\in {{\mathbb{X}}}_{\kappa }^{{{\mathcal{C}}}},\,s\in {{\mathbb{X}}}_{\lambda }^{{{\mathcal{B}}}}\), \(\langle {{\mathbb{X}}}_{\mu }^{{{\mathcal{A}}}}\oplus {{\mathbb{X}}}_{\nu }^{{{\mathcal{A}}}}| {\hat{E}}_{pq}| {{\mathbb{X}}}_{\mu }^{{{\mathcal{B}}}}\oplus {{\mathbb{X}}}_{\nu }^{{{\mathcal{B}}}}\rangle\) are categorical one-electron excitation lists, \(P_{\mu \nu}\) are global phase factors, and \({h}_{pq}^{{\prime} }\) (\(g_{pqrs}\)) are one- (two-) body Hamiltonian elements. See ref. 31 for algorithmic details.

Equations (5) to (7) define the categorical σ-vector in terms of local STP-DAS one-electron excitation lists and the categorical CI expansion vector. Notably, when the expansion coefficients \({C}_{{K}^{{{\mathcal{B}}}}}\) are categorically compressed, the resulting σ-vector coefficients \({\sigma }_{{L}^{{{\mathcal{A}}}}}\) are also categorically compressed. In such cases, the categorical σ-build reduces to a contraction between a categorically compressed expansion vector and categorically compressed STP-DAS one-electron excitation lists. Thus, Eqs. (5) to (7) yield a categorically compressed σ-vector, in a manner directly analogous to the compression-preserving behavior of sparse matrix-sparse vector products (SpMSpV). Importantly, this compression preservation is general and independent of the specific storage format used for the categorically compressed representations.
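The SpMSpV analogy can be made concrete with a toy product in which both the matrix and the vector are stored as dictionaries of nonzeros. This is only an illustration of the compression-preserving behavior, not the STP-DAS contraction itself; the function name and data layout are hypothetical.

```python
def spmspv(columns, x):
    """Multiply a sparse matrix by a sparse vector.

    columns: dict mapping column index -> {row index: value}
    x:       dict mapping index -> nonzero value
    Only columns that match a nonzero entry of x are visited, and the
    result is again a dictionary of (potentially) nonzero entries.
    """
    y = {}
    for j, x_j in x.items():
        for i, a_ij in columns.get(j, {}).items():
            y[i] = y.get(i, 0) + a_ij * x_j
    return y


# A 5 x 5 matrix with four stored entries and a vector with one nonzero.
columns = {0: {0: 1.0, 3: 0.5}, 1: {1: 2.0}, 4: {2: -1.0}}
x = {0: 2.0}

y = spmspv(columns, x)
print(y)
```

The work done scales with the number of stored entries that are actually touched rather than with the full dimension, which is the property the categorical σ-build relies on.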

The categorically compressed representation can be implemented in various forms, with the choice of storage format guided primarily by computational efficiency. Since the categorical σ-build algorithm often involves reading numerous expansion coefficients with increasing local addresses during contraction, it is natural to adopt a compressed sparse column (CSC) format for storing the categorical expansion coefficients.
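A minimal sketch of such a CSC layout is shown below. It assumes that each column holds the nonzero coefficients of one category and that row indices are local addresses kept in increasing order; the class and method names are hypothetical and do not come from the STP-DAS implementation.

```python
from bisect import bisect_left


class CSCCoefficients:
    """Toy CSC container: column j stores the nonzeros of category j."""

    def __init__(self, categories):
        # categories: list of dicts {local_address: coefficient}
        self.indptr = [0]   # column j occupies indices[indptr[j]:indptr[j + 1]]
        self.indices = []   # local addresses, sorted within each column
        self.data = []      # matching coefficients
        for coeffs in categories:
            for address in sorted(coeffs):
                if coeffs[address] != 0:
                    self.indices.append(address)
                    self.data.append(coeffs[address])
            self.indptr.append(len(self.indices))

    def is_empty(self, j):
        """True if category j has no nonzero coefficient."""
        return self.indptr[j] == self.indptr[j + 1]

    def column(self, j):
        """Nonzeros of category j as (address, coefficient), addresses ascending."""
        start, stop = self.indptr[j], self.indptr[j + 1]
        return list(zip(self.indices[start:stop], self.data[start:stop]))

    def get(self, j, address):
        """Coefficient at a local address, found by binary search; 0 if absent."""
        start, stop = self.indptr[j], self.indptr[j + 1]
        k = bisect_left(self.indices, address, start, stop)
        if k < stop and self.indices[k] == address:
            return self.data[k]
        return 0


vectors = CSCCoefficients([
    {7: 0.5 + 0.1j, 2: -0.3j},  # category 0: two nonzeros
    {},                         # category 1: vanishes entirely
    {4: 1.0 + 0.0j},            # category 2: one nonzero
])
print(vectors.column(0), vectors.is_empty(1))
```

Because addresses are sorted within a column, a contraction that walks through increasing local addresses reads the stored coefficients sequentially, and an empty category is detected from two adjacent `indptr` entries without touching any data.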

Eigenvalue bound analysis

We wish to apply the gap theorem94,95,136 to bound the error in the computed X2C-CASCI ground state energy:

$$| \delta E| \le \frac{\parallel {{\bf{r}}}{\parallel }^{2}}{{\gamma }_{0}},$$
(8)

where \({\bf{r}}\) is the Ritz residual of the computed ground state and the gap \({\gamma }_{0}\equiv {E}_{1}-{\tilde{E}}_{0}\) is the difference between \({E}_{1}\), the (unknown) exact energy of the first excited state, and \({\tilde{E}}_{0}\), the computed Ritz value of the ground state. Because \({\gamma }_{0}\) is unknown, one can estimate its order of magnitude using approximate methods or use experimental values to compute a surrogate for the true gap. One can also obtain an exact lower bound on the gap by including the posterior error bound of the first excited state136,137,138 in the Davidson calculation100:

$${\gamma }_{0}={E}_{1}-{\tilde{E}}_{0}\ge \left({\tilde{E}}_{1}-\parallel {{\bf{r}}}_{1}\parallel \right)-{\tilde{E}}_{0}\equiv {\gamma }_{0}^{-},$$
(9)

where \({{\bf{r}}}_{1}\) is the residual associated with the Ritz value \({\tilde{E}}_{1}\).

Applying the gap theorem94,95,136 to the computed X2C-CASCI ground-state energy requires an estimate of the energy gap between the ground and first excited states of the X2C-CASCI Hamiltonian. To obtain this, we performed an X2C-CISD calculation for the two lowest-lying states and determined a gap of ~0.095 a.u. for HBrTe. Based on this estimate, the gap theorem bounds the error in our X2C-CASCI ground-state energy for HBrTe to within 10 microhartree, which is well below any chemically meaningful threshold.
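Equations (8) and (9) are simple enough to evaluate directly. The helper functions below are a sketch with hypothetical names. The residual norm of the converged HBrTe calculation is not quoted in this section, so the example instead asks the inverse question: with a gap of 0.095 a.u., how large may the residual norm be while keeping the bound of Eq. (8) at 10 microhartree?

```python
import math


def energy_error_bound(residual_norm, gap):
    """Eq. (8): |dE| <= ||r||^2 / gamma_0."""
    if gap <= 0:
        raise ValueError("the gap must be positive")
    return residual_norm**2 / gap


def gap_lower_bound(e0_ritz, e1_ritz, r1_norm):
    """Eq. (9): gamma_0 >= (E1_ritz - ||r_1||) - E0_ritz."""
    return (e1_ritz - r1_norm) - e0_ritz


def max_residual_norm(target_error, gap):
    """Largest ||r|| for which Eq. (8) still guarantees the target error."""
    return math.sqrt(target_error * gap)


gap = 0.095      # a.u., the X2C-CISD estimate quoted in the text
target = 10e-6   # 10 microhartree
r_max = max_residual_norm(target, gap)
print(f"residual norm must not exceed {r_max:.2e} a.u.")
```

Under these inputs the residual norm must be below roughly \(10^{-3}\) a.u. Because the bound is quadratic in the residual, tightening the residual by one order of magnitude tightens the energy bound by two.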

Compression-compatible STP-DAS algorithm

Algorithm 1 shows the categorically compressed STP-DAS algorithm, in which only nonzero elements contribute to the categorical σ-vector. Its advantage over the traditional STP σ-build algorithm31 is twofold: in the outermost loop, where we skip entire categories whose expansion vector vanishes (see lines 2–3), and in the inner loop of line 7, where we only process excitations \(\langle {{\mathbb{J}}}_{\kappa }^{{{\mathcal{C}}}}\oplus {{\mathbb{J}}}_{\lambda }^{{{\mathcal{C}}}}| {\hat{E}}_{rs}| {{\mathbb{K}}}_{\kappa }^{{{\mathcal{B}}}}\oplus {{\mathbb{K}}}_{\lambda }^{{{\mathcal{B}}}}\rangle\) for which \({C}_{{K}^{{{\mathcal{B}}}}}\ne 0\).

Algorithm 1: Two-electron σ-build using categorical compression. Bold text represents algorithmic logic and typewritten text represents comments.
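The two skipping rules described above can be illustrated with a heavily simplified toy contraction. This is not Algorithm 1: the nested loops over intermediate categories and determinants are flattened into a single loop over category pairs, and all names and data layouts are hypothetical.

```python
def compressed_sigma_build(excitations, coeffs):
    """Toy contraction that touches only nonzero expansion coefficients.

    excitations: {(cat_a, cat_b): [(l_address, k_address, value), ...]}
    coeffs:      {cat_b: {k_address: coefficient}}  (nonzeros only)
    Returns the compressed sigma vector and the number of multiplications.
    """
    sigma = {}
    n_mults = 0
    for (cat_a, cat_b), exc_list in excitations.items():
        c_b = coeffs.get(cat_b)
        if not c_b:
            # Rule 1: the whole category vanishes, skip all its excitations.
            continue
        for l_address, k_address, value in exc_list:
            c = c_b.get(k_address)
            if c is None:
                # Rule 2: this excitation connects to a zero coefficient.
                continue
            out = sigma.setdefault(cat_a, {})
            out[l_address] = out.get(l_address, 0) + value * c
            n_mults += 1
    return sigma, n_mults


excitations = {
    ("A", "B"): [(0, 0, 1.0), (1, 2, 0.5)],
    ("A", "C"): [(0, 1, 2.0)],
    ("B", "B"): [(3, 0, -1.0)],
}
coeffs = {"B": {0: 2.0}, "C": {}}

sigma, n_mults = compressed_sigma_build(excitations, coeffs)
print(sigma, n_mults)
```

Of the four stored excitations, only two are multiplied: one is skipped because its source category is empty and one because its source coefficient is zero. The output is itself stored as a dictionary of nonzeros per category.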

There is a potential workload imbalance associated with Algorithm 1: because the collection of categorical expansion vectors is distributed among computing nodes, the contraction workload of a given node is proportional to the number of nonzero categorical expansion coefficients it has. We alleviated some of the resulting computational delay by implementing passive one-sided MPI communication of categorical expansion vectors using remote memory access (RMA). This allows idle nodes to contract more categorical expansion vectors with their excitation lists without waiting for the corresponding busy nodes to broadcast them. Ultimately, overcoming this load-balancing issue requires dynamically redistributing categorical expansion vectors according to their sparsity, which changes during the iterative diagonalization.
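The text does not specify a redistribution scheme. As one possible illustration, the sketch below uses a standard greedy heuristic (largest item first, always to the least-loaded node) with the nonzero count of each categorical expansion vector as the cost. The function name, the category labels, and the counts are all hypothetical.

```python
import heapq


def redistribute(nnz_per_category, n_nodes):
    """Greedily assign categories to nodes so that nonzero counts are balanced."""
    # Heap of (current load, node id); the least-loaded node is popped first.
    heap = [(0, node) for node in range(n_nodes)]
    heapq.heapify(heap)
    assignment = {node: [] for node in range(n_nodes)}
    by_size = sorted(nnz_per_category.items(), key=lambda kv: -kv[1])
    for category, nnz in by_size:
        load, node = heapq.heappop(heap)
        assignment[node].append(category)
        heapq.heappush(heap, (load + nnz, node))
    loads = {
        node: sum(nnz_per_category[c] for c in cats)
        for node, cats in assignment.items()
    }
    return assignment, loads


# Hypothetical nonzero counts for six categorical expansion vectors.
nnz = {"c0": 90, "c1": 10, "c2": 40, "c3": 30, "c4": 20, "c5": 10}
assignment, loads = redistribute(nnz, n_nodes=2)
print(assignment, loads)
```

Since the sparsity pattern changes between Davidson iterations, such an assignment would have to be recomputed from updated nonzero counts, and the cost of moving vectors between nodes would need to be weighed against the gain in balance.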

As illustrated in Algorithm 1, the computational cost, both in memory and runtime, of the compression-compatible STP-DAS σ-build procedure increases with the density of the expansion vectors. Therefore, it is essential to maintain maximal compression in these vectors. To achieve this, we replace the traditional Davidson preconditioner with a compression-compatible alternative for generating new trial expansion vectors. This modification alters only the subspace expansion strategy in the Davidson algorithm, while preserving exact treatment of the full determinantal space. As a result, the computed matrix-vector product HC and the corresponding residual r = HC − λC remain exact, unlike in selected CI and other truncated approaches, where both the Hamiltonian and CI vectors are explicitly approximated.
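The distinction between approximating the expansion direction and approximating the problem can be seen in a small dense Davidson loop. The form of the compression-compatible preconditioner is not given in this section, so the sketch below stands in for it with an assumption: the usual diagonal Davidson correction followed by dropping small elements. The matrix is real symmetric for simplicity, and the QR re-orthogonalization used here does not preserve sparsity of the stored basis; only the logic of the exact residual is illustrated.

```python
import numpy as np


def davidson_ground_state(H, drop_tol=1e-2, conv_tol=1e-8, max_iter=50):
    """Davidson iteration with a sparsified diagonal correction vector."""
    n = H.shape[0]
    diag = np.diag(H).copy()
    start = np.zeros(n)
    start[np.argmin(diag)] = 1.0          # unit vector on the lowest diagonal
    V = start[:, None]
    nnz_history = []
    for _ in range(max_iter):
        HV = H @ V                         # exact products, no truncation of H
        evals, evecs = np.linalg.eigh(V.T @ HV)
        lam, y = evals[0], evecs[:, 0]
        c = V @ y
        r = HV @ y - lam * c               # exact residual in the full space
        rnorm = np.linalg.norm(r)
        if rnorm < conv_tol or V.shape[1] == n:
            break
        denom = lam - diag
        denom = np.where(np.abs(denom) < 1e-8, 1e-8, denom)
        t = r / denom                      # standard diagonal correction
        # Assumed stand-in for compression: drop relatively small elements.
        t[np.abs(t) < drop_tol * np.abs(t).max()] = 0.0
        nnz_history.append(int(np.count_nonzero(t)))
        V, _ = np.linalg.qr(np.hstack([V, t[:, None]]))
    return lam, c, rnorm, nnz_history


rng = np.random.default_rng(0)
n = 200
A = 0.01 * rng.standard_normal((n, n))
H = np.diag(np.arange(n, dtype=float)) + (A + A.T) / 2

lam, c, rnorm, nnz_history = davidson_ground_state(H)
exact_residual = np.linalg.norm(H @ c - lam * c)
print(lam, rnorm, nnz_history)
```

Whatever is dropped from the correction vector affects only which directions enter the subspace. The reported residual norm is still that of the full, untruncated eigenproblem, so it remains a valid input to an error bound of the kind given in Eq. (8).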

The compression-compatible STP-DAS σ-build algorithm significantly reduces the overall workload associated with the σ-build. The reduction in workload can be nonuniform: the contraction workload associated with a determinant \({{\mathbb{J}}}_{\mu \nu \kappa \lambda }^{{{\mathcal{C}}}}\) is proportional to the number of local addresses containing both nonzero categorical expansion-vector elements and Hamiltonian matrix elements. To improve the load balance, we implemented dynamic SMP thread-level parallelism in the outermost loop (line 1 in Algorithm 1) instead of in the loop over determinants \(\{{{\mathbb{J}}}_{\mu \nu \kappa \lambda }^{{{\mathcal{C}}}}\}\) (line 4 in Algorithm 1). Under dynamic parallelism, some threads execute many light contractions, while others execute fewer heavy contractions, resulting in a more uniform distribution of contraction workload. Such dynamic parallelism is ineffective in the loop over determinants \(\{{{\mathbb{J}}}_{\mu \nu \kappa \lambda }^{{{\mathcal{C}}}}\}\) due to the small tensor product nature of the STP-DAS framework.
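The effect of dynamic scheduling on a nonuniform workload can be seen in a small simulation. The sketch below compares a static split into contiguous blocks with a dynamic scheme in which each thread takes the next task as soon as it becomes free. The task costs are hypothetical, and the model ignores scheduling overhead.

```python
import heapq


def static_makespan(costs, n_threads):
    """Finish time when tasks are split into contiguous, equally sized blocks."""
    chunk = -(-len(costs) // n_threads)   # ceiling division
    return max(sum(costs[i:i + chunk]) for i in range(0, len(costs), chunk))


def dynamic_makespan(costs, n_threads):
    """Finish time when each free thread takes the next task in the queue."""
    free_at = [0] * n_threads             # time at which each thread is free
    heapq.heapify(free_at)
    for cost in costs:
        t = heapq.heappop(free_at)        # earliest available thread
        heapq.heappush(free_at, t + cost)
    return max(free_at)


# One heavy contraction followed by eight light ones, on three threads.
costs = [8, 1, 1, 1, 1, 1, 1, 1, 1]
print(static_makespan(costs, 3), dynamic_makespan(costs, 3))
```

In this example the static split finishes at time 10, while the dynamic schedule finishes at time 8: one thread handles the heavy task while the other two share the light ones. This mirrors the pattern described above, with some threads executing many light contractions and others fewer heavy ones.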