Background & Summary

Materials serve as the cornerstone of critical economic sectors, including transportation, health, information technology and energy. Hence, the discovery of materials with improved properties is crucial. Materials databases (e.g., Materials Project1, OQMD2, AFLOW3, JARVIS4, Materials Cloud5) constructed with high-throughput calculations based on density functional approximations (DFAs) play a key role in accelerating materials discovery. For instance, these databases enable the construction of artificial intelligence (AI) models for material properties6,7. However, the usefulness of such databases, and consequently the applicability of AI models trained on them, are constrained by the accuracy of the underlying theoretical assumptions used to generate the data. As a result, the AI models might fail when deployed to explore materials and properties that are not well described by the chosen DFA. For example, electron exchange correlation (XC) functionals under the generalized gradient approximation (GGA) such as the Perdew-Burke-Ernzerhof (PBE)8, face accuracy limitations for estimating the electronic properties, especially for systems with localized electronic states such as transition-metal oxides9,10,11. Recently, high-throughput studies using the meta-GGA functional SCAN12 or its regularized variants13 and non-local, range-separated hybrid functional HSE0614,15 have been used for constructing materials databases which, to some extent, address the accuracy limitations of GGA. However, many of these studies have focused on specific material properties (e.g., band gaps) or system types (e.g., binary solids). Additionally, most of them employed plane-wave basis sets with pseudopotential approximation for core electrons, which may face accuracy and transferability challenges across materials with diverse crystal structures16,17. All-electron calculations using beyond-GGA DFAs can provide more reliable data, and databases constructed from such calculations could play a crucial role in developing AI models with enhanced generalizability.

Here, we present a database of 7,024 materials constructed from all-electron DFT calculations with hybrid functional for XC with a focus on oxides relevant for catalysis and energy related applications. The construction of this database has been enabled by the scalability-improved implementation in FHI-aims which allows to perform hybrid functional calculations at feasible costs for a large number of materials18. We also illustrate how the database facilitates the development of AI models for material properties using the symbolic-regression based Sure-Independence Screening and Sparsifying Operator (SISSO)19 approach, which identifies the key parameters correlated with material properties, making the model interpretable.

Methods

The database consists of 7,024 materials. We considered M-O-H (M-N-O-H) as the reference chemical system for selecting binary (ternary) materials as we are interested in evaluating their thermodynamic and electrochemical stability. This involves oxides (M-O/M-N-O), intermetallics (M-N), hydrides (M-H/N-H/M-N-H), hydroxides (M-O-H/N-O-H/M-N-O-H) and elemental solids (M/N). The initial crystal structures for these materials are queried from the Inorganic Crystral Structure Database (ICSD, v2020). Since ICSD consists of a large number of materials with the same composition (duplicate entries or polymorphs), filtering of entries for which the ICSD-id is associated with at least one Materials Project ID (MP-id, v2023.11.1) is done based on the lowest energy/atom criteria according to MP (GGA/GGA+U) data. In case there is no MP-id found for a formula, the ICSD entry with the lowest number of atoms in the unit cell is considered. Notably, we imposed no restrictions on unit cell sizes or crystal prototypes, resulting in structures with up to 616 atoms per unit cell. A schematic representation of the computational workflow used for constructing the database is provided in Fig. 1.

Fig. 1
figure 1

Computational workflow employed for curating the database. d1 and d2 correspond to the descriptors derived from SISSO model.

For the selected structures, geometry optimizations are performed with the PBEsol functional20 as it provides an accurate estimation of lattice constants of solids21. Using the structures optimized with PBEsol, we performed HSE06 energy evaluation and electronic structure calculations, as HSE06 gives more accurate electronic properties22,23. Our previous studies confirmed that HSE06 provides only slight improvements in lattice constants with respect to GGA functionals9,10. For both PBEsol and HSE06, properties such as the electronic band structure, density of states, Hirshfeld charges are also computed. All the calculations are performed using the all-electron code FHI-aims using numerically atom-centered orbtial (NAO) basis sets24. The “light” settings are chosen for NAO basis sets, as our previous study10 has demonstrated that they offer a reasonable trade-off between accuracy and computational efficiency for formation energies compared to “tight” or “really tight” settings. A computational workflow based on the Taskblaster framework25 is used to automate the multiple tasks involved in curating the data. A convergence criterion of 10−3 eV/Å is considered for forces. For all the potentially magnetic structures (ie., either labelled as magnetic in MP or containing elements such as Fe, Ni, Co, etc.), spin-polarized calculations are performed. Our DFT-HSE06 calculations failed to converge for 198 materials, with the majority (167) containing 3d- or 4f-elements. This is not unexpected, as HSE06 electronic-structure calculations are known to exhibit challenging convergence behaviour due to their higher sensitivity to localized states, often requiring denser k-point sampling or more sophisticated computational settings. In spin-polarized calculations, HSE06 may also favor different spin configurations than GGA, leading to adjustments in lattice geometry for different magnetic orders. Furthermore, HSE06 calculations can encounter multiple energy minima with comparable energies, necessitating case-specific parameter tuning. Implementing such careful, system-dependent adjustments is non-trivial within the high-throughput framework of this study, and we therefore excluded these cases.

Data Records

The electronic structure data from FHI-aims calculations can be accessed at the NOMAD archive26. The SQLite3 ASE databases and computed properties in tabular format can be downloaded from the figshare repository27. In Table 1, we provide a description of the properties computed using PBEsol and HSE06.

Table 1 Data fields in tabulated form.

Figure 2a reflects the chemical diversity of the materials in the database. The materials of the database present elements covering the majority of the periodic table except actinides, noble gases and elements with the atomic number ≥ 80. Figure 2b shows the distribution of space groups (determined using spglib28 with tolerance of 10−5 Å) across the database. The most frequent spacegroups observed are P21/c, Pnma, Fm-3m, C2/m, and P63mc. As shown in Fig. 2c, we observe that a significant proportion of materials contain at least one transition metal. Figure 2d illustrates the distribution of the number of atoms per unit cell (limited to 200 for ease of visualization). While the majority of materials have 25 or fewer atoms in their unit cell, 14% of materials have unit cells with more than 50 atoms.

Fig. 2
figure 2

(a) Periodic table style heatmap showing the frequency of elements constituting the materials present in the database, distribution of (b) space group indexes, (c) material types further classified by containing (red) and lacking (blue) transition metals, (d) number of atoms in the unit cell across the materials in the database.

Technical Validation

We first compared the formation energies and band gaps computed using PBEsol and HSE06. The formation energies are calculated by considering bulk phases for the elements except oxygen for which gaseous O2 is considered as the reference phase. In general, HSE06 provides lower formation energies compared to PBEsol as shown in Fig. 3a. A mean absolute deviation (MAD) of 0.15 eV/atom is found between the formation energies calculated by the two methods. Compared to formation energies, a higher disparity (MAD = 0.77 eV) is observed in the band gaps estimated by PBEsol and HSE06, as evident from Fig. 3b with HSE06 showing a shift in band gap values toward higher ranges. This is expected owing to the well-known underestimation of band gaps by GGA methods which is partially corrected by HSE06. For 342 materials, PBEsol estimates a metallic character whereas HSE06 provides a band gap value  >= 0.5 eV. We also performed a benchmarking against experimental band gap data for binary systems curated by Borlido et al.29. For the 121 materials common to both our dataset and theirs (determined by MP-id), PBEsol yields a mean absolute error (MAE) of 1.35 eV, which improves by over 50% with HSE06 (0.62 eV). For a few materials, PBEsol predicts a larger band gap, while HSE06 unexpectedly yields a metallic behavior with zero band gap. This discrepancy arises because the calculations are spin-unconstrained, allowing PBEsol and HSE06 to converge to different spin configurations (e.g., MgV2O5, Rb2V3O8). Note that the limited availability and significant uncertainties of existing experimental data preclude a comprehensive benchmarking of formation energies and band gaps against experimental measurements. This limitation is further compounded by the challenges high-throughput simulations face in accurately capturing magnetic ordering and defect-related effects under experimental conditions30,31.

Fig. 3
figure 3

Comparison of (a) formation energies and (b) band gaps between PBEsol and HSE06 for the materials in the database. The color gradient represents the absolute difference in the property values between the methods.

To understand whether the accuracy differences in formation energies translate to thermodynamic stability, we constructed the convex hull phase diagrams (CPDs) of the materials in the database. We choose Li-Al and Co-Pt-O as representative chemical systems for binary and ternary compositions and their CPDs are shown in Fig. 4. Distinct CPDs are obtained by PBEsol and HSE06 for these systems. For example, Li2Al is stable with PBEsol but slightly unstable by 4 meV with HSE06 and similarly, Co(PtO3)2 is unstable with PBEsol by 11 meV but stable with HSE06. The critical decomposition reactions (decomposition in the CPD with the highest positive reaction energy) are determined for all the phases present in the CPDs along with the associated decomposition energy (ΔHd) as a quantitative metric of phase stability. For Li-Al, all the decomposition reactions are identical for both PBEsol and HSE06. However, different critical decomposition reactions are identified for Pt3O4, Co3O4 and CoPtO2 from the Co-Pt-O CPD. Some phases such as PtO6 are found to have the same critical decomposition reaction but with significantly differing ΔHd values for PBEsol (0.17 eV/atom) and HSE06 (0.5 eV/atom).

Fig. 4
figure 4

Convex hull phase diagram of Li-Al (a,b) and Co-Pt-O (c,d) chemical systems calculated with PBEsol (left panel) and HSE06 (right panel). The stable and unstable phases (energy above hull  > 0) are denoted by green circles and red squares, respectively.

To analyze the phase stability prediction trends between PBEsol and HSE06, we plot the distribution of ΔHd for all the materials in the database, obtained from both methods in Fig. 5a. 78% and 75% of the considered materials are found having ΔHd ≤ 50 meV/atom for PBEsol and HSE06, respectively. The higher ΔHd values for the remaining materials despite them being synthesized (ie., associated with an ICSD-id) could be attributed to synthetic conditions which are not accounted for by the zero-Kelvin DFT stability prediction. The general tendency of HSE06 estimating lower formation energies does not lead to more materials being stable in the corresponding CPD, as many of these materials decompose into binary or ternary compounds. It is important to note that this analysis is limited by the absence of experimental data for certain phases and also by the exclusion of polymorphs, both of which are crucial for constructing more accurate CPDs.

Fig. 5
figure 5

Comparison of (a) convex hull decomposition energies and (b) Pourbaix decomposition energies (at pH=0 and applied potential U=1.23 V) calculated with PBEsol and HSE06 for the materials in the database.

In addition to phase stability, we have also determined the electrochemical stability of the materials as this is relevant for their potential applicability to areas such as electrocatalysis, corrosion resistance, batteries etc. This is carried out by constructing Pourbaix diagrams32,33 which provide information regarding the material’s stability across different redox conditions within a potential-pH coordinate space, electrochemical decomposition pathways, passivation and corrosion regions. Representative PBEsol and HSE06 calculated Pourbaix diagrams for Ru and Ag-Rh chemical systems by considering RuO2 and AgRhO2 as the target materials of interest are given in Fig. 6. These oxides have been identified to be stable as part of our previous study employing the DFT-HSE06 method10. For both systems, we observe qualitative differences in the Pourbaix diagrams obtained across the methods. In the case of Ru, the stability domain of both RuO2 and \({\text{Ru}}_{3}^{+}\) are enlarged in the HSE06 Pourbaix diagram compared to PBEsol. An additional phase \({\text{RuO}}_{4}^{2-}\) is also present in the HSE06 diagram. In this study, we limit our focus to analyzing the acid-stability (hence used equivalent to electrochemical stability from hereafter) of material under conditions (pH=0 and applied potential U=1.23 V) relevant to oxygen evolution reaction (OER), an important reaction involved in electrocatalytic water splitting. As a quantitative metric of a material’s acid-stability, we computed the Pourbaix decomposition free energy (\(\Delta {G}_{pbx}^{\text{OER}}\)) at these conditions. RuO2 has a \(\Delta {G}_{pbx}^{\text{OER}}\) of -0.06 eV/atom and -0.26 eV/atom with PBEsol and HSE06, respectively, indicating both the methods predict RuO2 as acid-stable though to different extents. The Ag-Rh Pourbaix diagrams generated using PBEsol and HSE06 exhibit differences in both the stable phases present and the pH-U stability range of the common phases between the two methods. A notable difference is that PBEsol predicts AgRhO2 to be acid-unstable with \(\Delta {G}_{pbx}^{\text{OER}}\) = 0.21 eV/atom, whereas HSE06 predicts it to be stable with \(\Delta {G}_{pbx}^{\text{OER}}\) = -0.01 eV/atom. A comparison of \(\Delta {G}_{pbx}^{\text{OER}}\) calculated using PBEsol and HSE06 is provided in Fig. 5b for all the materials in the database for which Pourbaix diagrams are constructed. A MAD of 0.27 eV/atom is observed between the two methods. This indicates that the disparity in predicting the electrochemical stability is much higher compared to the phase stability which is expected due to the increased number of redox reactions under aqueous conditions. Under a stability criterion of \(\Delta {G}_{pbx}^{\text{OER}}\) <= 0.1 eV/atom, 222 materials are acid-stable with PBEsol which is 255 for HSE06.

Fig. 6
figure 6

Pourbaix diagrams of Ru (a,b) and Ag-Rh (c,d) chemical systems calculated using PBEsol (left panel) and HSE06 (right panel). An aqueous ion concentration of 10−6 M and temperature of 298.15 K are considered for the construction of Pourbaix diagrams. The chemical potentials of ions are retrieved from ASE database. The green and red regions indicate stable and unstable domains of the pH-U coordinate space with respect to the reference solid. The stability domain of H2O is shown with white dashed lines.

We also developed AI models based on the database. Specifically, we focused on predicting the HSE06 band gaps of materials using SISSO, an interpretable AI method that identifies descriptors correlated with material properties. For this, a set of primary features, ie., basic material parameters which could be potentially correlated with the target property of interest are collected. These include elemental properties (composition-weighted), as well as structural and electronic properties determined using PBEsol. Since a strong focus on the database is on energy related applications, only materials with non-zero HSE06 band gaps were considered, as these are more suitable for applications such as photovoltaics. A dataset of 3091 materials is considered for the model training. For accessing the model performance, we performed a nested cross validation (NCV) where the model validation and selection are done in the inner loop and model performance estimation is done on the outer loop. The SISSO model thus obtained is:

$$\begin{array}{c}{E}_{gap,\text{HSE06}}^{SISSO}={c}_{0}+{a}_{0}[\frac{\langle {N}_{VAC}\rangle -\langle {N}_{VAL}\rangle }{\langle {N}_{VAL}\rangle \langle {R}_{VAL}\rangle }]\,\\ \,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,+\,{a}_{1}[\frac{\sqrt{\langle AN\rangle }}{\langle EN\rangle \langle {R}_{s}\rangle }]\,+{a}_{2}[\frac{(\langle EA\rangle )}{\langle AN\rangle }\,-\,{E}_{gap,\text{PBEsol}}]\end{array}$$
(1)

where Egap,PBEsol is PBEsol calculated band gap. NVAL is the number of valence orbitals, Rs is the s-orbital radii, RVAL is the valence orbital radii, EA is the electron affinity, EN is the electronegativity, AN is the atomic number, all computed as composition based weight averaged (indicated by ) properties. The coefficients of the SISSO model are c0 = 2.45 eV, a0 = −0.07ÅeV, a1 = −0.08 Å eV2 and a2 = −1.05. Figure 7 illustrates the distribution of test errors from NCV scheme, showing that the model achieves reasonable accuracy, with 90% of errors falling within the 7.5% of the ratio between mean test errors and the standard deviation of the band gaps in the dataset. This indicates that AI models trained on this database can achieve reliable HSE06 band gap predictions.

Fig. 7
figure 7

Distribution of nested cross-validation test errors (primary y-axis) of the SISSO model for the Egap,HSE06 prediction. In the secondary y-axis, the ratio of mean absolute error and the standard deviation of Egap,HSE06 is shown. The area between the upper and lower edges of the black box represents the interquartile range of errors, and the whiskers from bottom to top correspond to 10 percentile, median and 90 percentile errors. The mean and median of absolute errors are indicated by the cross mark and horizontal line inside the interquartile range, respectively.

The database presented in this work includes properties related to stability and electronic structure, evaluated at the HSE06 level of theory while assuming that PBEsol provides reasonable geometries for the material classes considered. However, for studying certain properties such as defects or achieving more accurate band edge alignments, consideration of HSE06-relaxed geometries may be necessary. Additionally, considerations of van der Waals interactions using advanced methods such as the nonlocal many-body dispersion34 and/or using HSE06 for relaxation9 could provide improved geometries. Also, our formation energy calculations do not account for temperature and pressure effects, which could be important for a more realistic phase diagrams under experimental conditions. Similarly, for more accurate band gap estimations, advanced hybrid functionals such as the dielectric-dependent hybrid functionals could be employed35,36. The database is expected to serve as a reference for further investigations, including these refinements. Future improvements in the quality of materials data could also be enabled by employing higher-accuracy electronic structure methods beyond hybrid functionals, such as GW approaches.