Background & Summary

Metal-organic frameworks (MOFs) are porous materials with applications in gas capture and storage1,2, catalysis3,4, biomedicine5,6, electrical conductivity7,8, transport and diffusion9,10 and chemical sensing11,12. They consist of structural building blocks formed by metal clusters and organic linkers, and form ordered, nanoscale pores13. Their properties are determined by the combination of their building units with a certain topology, which is unique to each structure.

Given the wide variety of MOFs that have been hypothesized14,15 and synthesized16,17, high-throughput computational screening plays an important role in identifying MOF candidates that are suitable for a specific application. Even though screening MOF datasets by means of ab initio simulations is feasible (see ref. 18), it is computationally impractical in many cases. Data-driven techniques can aid in processing large amounts of MOF data by identifying correlations between structure and property in machine-learning (ML) based workflows19. Likewise, computational discovery by means of inverse design with pre-defined target properties20, for example the creation of MOFs with high CO2 affinity21,22,23, requires the application of advanced ML methods and high-quality datasets.

To effectively screen and design MOFs for applications, both structural and electronic data are needed. While structural data files are publicly available in large quantities16,17,24,25,26,27, electronic structure data are sparse. For MOFs, high-quality electronic structure data could be created by deploying a suitable physical model that accounts for electronic correlations in the presence of metal clusters.

Computationally, PAOFLOW28,29 is a key tool for projecting the electronic structure into a tight-binding (TB) Hamiltonian using localized valence atomic orbitals as a basis. Thus, nearest-neighbor interactions are captured by the matrix coefficients, which could potentially be used in ML workflows to predict new structures with optimized features.

For the purpose of this contribution, we consider a lattice Hamiltonian with correctional terms to properly account for the electronic correlations and hybridization occurring within and between the building units of representative MOFs. In the Extended Hubbard (EH) model30, strong electronic correlations are captured by the intra-site and inter-site Hubbard parameters U and V, respectively. Typically, U corrects for transition metal contributions, due to their localized atomic orbitals, while V represents the interaction between metal ions and their nearest neighbors. In the case of MOFs, we expect these parameters to improve upon the standard TB description by accounting for the electronic interaction within metal clusters and the hybridization with organic constituents.
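For orientation, a commonly used single-orbital textbook form of an extended Hubbard lattice Hamiltonian is sketched below. It is an illustrative simplification and not necessarily the exact multi-orbital expression used in ref. 30. The operators \(c^{\dagger}_{i\sigma}\), \(c_{j\sigma}\) and the occupations \(n_{i\sigma}\), \(n_i = n_{i\uparrow} + n_{i\downarrow}\) are notation introduced here, while \(t_{ij}\), U and V denote the TB, intra-site and inter-site parameters discussed above.

```latex
\[
H = \sum_{i,j,\sigma} t_{ij}\, c^{\dagger}_{i\sigma} c_{j\sigma}
  + U \sum_{i} n_{i\uparrow} n_{i\downarrow}
  + V \sum_{\langle i,j \rangle} n_{i} n_{j}
\]
```

Here \(\langle i,j \rangle\) runs over nearest-neighbor pairs. In a realistic MOF, U and V would carry site and orbital indices, e.g. U on the metal d manifold and V on metal-ligand pairs.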

An important application of the EH model in electronic structure calculations based on density-functional theory (DFT) is introducing corrections to d- and f-electron energies. Known as DFT+U or DFT+U+V, these methods improve the accuracy of band gap predictions where standard DFT typically fails31. The U parameter can be estimated by matching experimental data or by invoking computational methods involving hybrid functionals, e.g., in the prediction of gas adsorption energies in MOFs32. However, such parameters can also be derived from first principles33,34,35. As such predictions are computationally costly, there is a growing need for creating datasets of Hubbard parameters36,37. In the case of MOFs, a dataset augmentation with computed U and V values would be a valuable contribution to the theoretical investigation of MOFs. In particular, it would enable the application of data-driven strategies in materials screening and design workflows.

In this work, we provide two augmented datasets based on QMOF38,39 that are visualized in Fig. 1. In the TB dataset, we projected the electronic density onto a tight-binding Hamiltonian with PAOFLOW28,29 and provided TB matrices as well as Smooth Overlap of Atomic Positions (SOAP)40 descriptors for 10,435 materials. In the EH dataset, besides the TB parameters, we computed intra-site U and inter-site V Hubbard parameters for a select group of 242 MOFs using Quantum Espresso41,42,43, enabling the creation of an Extended Hubbard Hamiltonian representing each material. The datasets allow for exploring potential correlations between structural and electronic properties in computational workflows for MOF screening and design.

Fig. 1

(a) Illustration of the TBHubbard dataset. The QMOF38,39 dataset is indicated in pink, providing over 20,000 MOF structures. From this data collection, the TBHubbard dataset comprises two subsets of materials: the Tight-binding (in green) and Extended Hubbard (in blue) subsets with ≈ 10,000 and ≈ 200 materials, respectively; (b) t-SNE projection of tight-binding matrices, where points are colored according to the different datasets analyzed in this study; (c) t-SNE projection of SOAP-3 Å descriptors for metal atoms across the dataset. A preliminary PCA step reduced the descriptor dimensionality to 8 components, retaining 97 % of the total variance. The color scheme for the t-SNE plots is as follows: pink for the QMOF dataset, blue for the EH subset, and green for the TB subset.

Methods

We provide two complementary subsets tailored for inverse design: the TB and the EH subsets. In the case of the TB subset, the first-principles DFT calculations were performed for 10,435 MOFs along with their TB projection and Smooth Overlap of Atomic Positions (SOAP) descriptors, which were used as fingerprints of the local environment to describe the topology of each material. While the DFT-based calculations provide a quantum-mechanical foundation for electronic structure analysis, the SOAP descriptors offer a data-driven representation of atomic environments, enabling material discovery. In the case of the EH subset, besides the TB projection, additional EH parameters, i.e. intra-site U and inter-site V interactions, were computed for a smaller set of 242 MOFs. In the following, we outline the methodologies used for generating the data. This includes selecting the structures, performing the ground-state calculations and tight-binding projections, as well as computing the SOAP descriptors, TB embeddings and Hubbard parameters.

Structure Selection

Our starting point is the QMOF dataset38,39, containing 20,375 metal-organic framework structures. The selection process for creating the TB subset began by down-sampling from 20,375 to 10,435 MOFs, prioritizing the diversity of metal ions in the clusters and focusing on structures without spin polarization. In the next step, we filtered the dataset down to 242 MOFs to form the EH subset. This step involved selecting materials with a pore-limiting diameter (PLD) larger than 3.3 Å to maximize the application potential in, for example, CO2 capture and ionic transport. For this subset, we focused on materials with at least one transition metal in the cluster. Note that we avoided the computation of U for metals such as Zn, Cd, Hg, Cn and La, for which the calculation led to unphysical values of U > 20 eV.

Ground State

To perform electronic structure calculations based on DFT, we used Quantum Espresso (QE)41,42,43 within the generalized gradient approximation (GGA)44. The inputs of our calculations, such as atomic positions, spin polarization and k-point mesh, resembled those available in the QMOF dataset. Following previous work related to ab initio modeling of MOFs18, we included Grimme’s D3 van der Waals corrections45 with zero damping (dftd3 = 4). The values used as convergence threshold and mixing factor for self-consistency were 10⁻⁸ and 0.1, respectively, along with Davidson diagonalization. We fixed the atomic positions throughout the self-consistent cycle, considering that the structures’ geometries were previously optimized. For each material, we used the kinetic energy cutoff of the hardest pseudopotential among the species involved, except when unavailable. In that case, we used 50 Ry and 400 Ry for wavefunction and density, respectively.

For the TB subset, to ensure the Γ point is included for all materials, we ignored shifted k-point meshes. Nevertheless, we validated for a few representative materials that the use of shifted or unshifted meshes did not alter the total energy by more than ~10 μeV. The van der Waals (vdW) correction was excluded from the SCF calculations, since it did not affect the wavefunction analysis and projection of the density of states. As a result, the total energy did not include any vdW contributions. Given the broad range of elements involved, we chose ultrasoft pseudopotentials from the PSlibrary46 to ensure compatibility with a diverse set of materials. The electronic density was exported in cube format for computing Bader charges, which are commonly used as descriptors of atomic structure and features.

For the EH subset, we used pseudopotentials taken from the SSSP PBE Efficiency v1.3.0 set47, taking into account that the prediction of Hubbard parameters is computationally expensive. Note that the ground-state computations were carried out both with negligible initial guesses of the Hubbard parameters, of the order of 10⁻⁸ eV, which are necessary for the application of density-functional perturbation theory, and with the self-consistently computed U and V values. At this point, we computed the band gaps as the energy difference between the lowest unoccupied and highest occupied levels. The band-gap values computed with DFT+U+V for d-p and d-s perturbations differ, as the computations were performed with different sets of U and V values.

Tight-binding Projection

In order to obtain a tight-binding representation of the selected materials, we used PAOFLOW28,29, a software tool which embeds outputs from electronic structure, plane-wave pseudopotential calculations into pseudo-atomic orbitals. The projection provides the Hamiltonian, both in real and reciprocal space, within the localized orthogonal atomic basis according to the valence orbitals present in the pseudopotential used for each element. The Hamiltonian coefficients are known as TB parameters. PAOFLOW is fully compatible with QE, which facilitates the data conversion, as only the output of the self-consistent part is required for the projection. In the present subset, we computed the tight-binding coefficients for each material based on the standard DFT output and the information of the corresponding atomic orbital basis. In Fig. 2, we show examples of the TB matrix visualization for two representative materials.

Fig. 2

False-color image representing the normalized \(|t_{ij}|^2\) tight-binding matrix coefficients of the localized orbital basis set for MOFs (a) FePtC8H4N6 (or qmof-3dfbcbd) and (b) CdNiC8H12N6 (or qmof-4d9a98c), with their respective structural representations shown above. For visualizing the matrix, the maximum intensity is set to 0.5 and the matrix diagonal to 0. The MOF structure images were created using VESTA64.

Machine-learning Descriptors

To characterize the structure of MOFs, we computed two complementary descriptor sets. The first is based on SOAP descriptors, which encode local atomic environments. The second is based on an original contribution of this work: a DFT-informed fingerprint derived from TB projections and structured into fixed-size embeddings that capture essential features of the electronic structure.

Analysis of the local environment of the metal clusters in MOFs is crucial, and we anonymized the identity of metal atoms in our SOAP descriptor calculations. We adopted a reduced species set that merges metals within a single category, while maintaining detailed classifications for the key elements of most organic linkers, such as hydrogen, oxygen, carbon, and nitrogen. This simplification is especially useful when dealing with large datasets. For the purpose of this contribution, we used two sets of SOAP descriptors. The SOAP-3 Å descriptor captures information about the first nearest-neighbor shell while the SOAP-5 Å descriptor extends beyond the second shell, thus providing a broader range of local structural information. While the former contains 684 values, the latter has 1500 values.

For creating the TB embeddings, we standardized the TB matrix entries for each atom pair and cast them into a fixed-size format to enable consistent application in machine-learning models. Each \(t_{ij}\) between atomic orbitals is embedded into a 13 × 13 matrix block, regardless of the specific elements involved.

The 13 × 13 matrix stems from setting up the orbital configuration per element as 2 s orbitals, e.g., semicore and valence, 6 p orbitals, i.e., 3 semicore and 3 valence, as well as 5 d orbitals. This leads to 13 × 13 blocks that capture all pairwise orbital interactions between two atoms commonly found in MOFs. The matrix is filled with zeros in case orbitals are absent from a specific element.
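The zero-filled block layout can be illustrated with a minimal sketch. The slot names, their ordering inside the block, and the helper `pad_block` are assumptions made for illustration; the scripts distributed with the dataset may order the orbitals differently.

```python
import numpy as np

# Hypothetical slot layout of the 13-orbital standardized basis:
# 2 s (semicore, valence), 6 p (3 semicore, 3 valence), 5 d.
SLOTS = (
    ["s_sc", "s_val"]
    + [f"p_sc_{m}" for m in range(3)]
    + [f"p_val_{m}" for m in range(3)]
    + [f"d_{m}" for m in range(5)]
)
SLOT_INDEX = {name: i for i, name in enumerate(SLOTS)}


def pad_block(block, labels_i, labels_j):
    """Embed an (n_i x n_j) orbital block into a zero-filled 13 x 13 block."""
    block = np.asarray(block)
    rows = [SLOT_INDEX[name] for name in labels_i]
    cols = [SLOT_INDEX[name] for name in labels_j]
    padded = np.zeros((len(SLOTS), len(SLOTS)), dtype=block.dtype)
    # Slots without a matching orbital stay zero.
    padded[np.ix_(rows, cols)] = block
    return padded


# Example: an atom with one valence s orbital paired with an atom
# that has valence s and p orbitals. The numbers are made up.
h_labels = ["s_val"]
c_labels = ["s_val", "p_val_0", "p_val_1", "p_val_2"]
t_hc = np.array([[-1.0, 0.2, 0.0, -0.3]])
padded = pad_block(t_hc, h_labels, c_labels)
```

Because every atom pair is mapped to the same 13 × 13 shape, blocks from different elements can be stacked directly into arrays for machine-learning models.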

We excluded f-orbitals from this basis due to their relative scarcity in the systems studied and due to the significant increase in matrix dimensionality they would entail, up to 20 × 20. Their inclusion would have doubled the computational burden and potentially diluted the machine-learning correlation signal of the prevalent orbital types. In cases where a pseudopotential provides fewer orbitals, e.g., only a single s or p orbital, we mapped the available orbitals to the outermost valence orbitals within the standardized block structure. This ensured that the most chemically relevant contributions, i.e., those related to bonding and reactivity, were retained and aligned consistently across different elements. In spin-polarized systems, the matrix size could be doubled to account for spin-up and spin-down channels. While not within the scope of the present work, this would not affect the generalization of the approach, unlike the inclusion of f-orbitals which would significantly increase the representation size. For each atom pair (ij), a slice of the full TB matrix was extracted, corresponding to their orbital interaction block. The 13 × 13 matrices were extracted from both diagonal and off-diagonal entries, ranked by the maximum absolute value of the strength of interaction, and the top-k blocks were then selected. The resulting 13 × 13 × k tensor was flattened to form a fixed-length vector that represents the TB embedding for the atom. These embeddings were aligned with corresponding SOAP descriptors to serve as structured inputs for machine learning workflows.
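The ranking and flattening step can be sketched as follows. The function name `tb_embedding`, the zero-padding used when an atom has fewer than k blocks, and the flattening order are assumptions for illustration, not a description of the released scripts.

```python
import numpy as np


def tb_embedding(blocks, k=6):
    """Rank 13 x 13 blocks by max |t_ij| and flatten the top-k into one vector."""
    blocks = np.asarray(blocks, dtype=float)  # shape (n_blocks, 13, 13)
    # Interaction strength of a block: its largest absolute entry.
    strength = np.abs(blocks).max(axis=(1, 2))
    order = np.argsort(-strength, kind="stable")[:k]
    top = blocks[order]
    if top.shape[0] < k:
        # Assumption: missing blocks are padded with zeros to keep a fixed length.
        pad = np.zeros((k - top.shape[0],) + blocks.shape[1:])
        top = np.concatenate([top, pad])
    return top.reshape(-1)  # length k * 13 * 13


# Example with random stand-in blocks (not real TB data).
rng = np.random.default_rng(0)
demo_blocks = rng.normal(size=(10, 13, 13))
embedding = tb_embedding(demo_blocks, k=6)
```

With k = 6 the vector has 6 × 169 = 1014 entries, and the strongest block always occupies the first 169 positions.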

In this context, the TB embeddings provide the source data, while the SOAP descriptors act as the target data. The embeddings provide a structured representation of a material’s electronic environment and can be used to predict atomic pair species as well as SOAP descriptors, facilitating material characterization and structure reconstruction.

Hubbard Parameters

To compute the intra-site U and inter-site V Hubbard parameters, we used the QE software based on the implementation of density-functional perturbation theory in the DFT+U+V framework30,34. The U correction applies to transition metals and V represents the coupling between the transition metal and its nearest neighbors. The Hubbard parameter calculation is orbital-dependent and the desired manifold must be defined a priori in the self-consistent step. We chose the d and d-p orbitals as manifolds for U and V, respectively, where the localized d orbital is located at the transition metal and the p orbital is located at the metal’s nearest neighbor. Note that computing Hubbard parameters for more than one manifold at a time is not feasible, due to the high computational cost. For a subset of 186 materials, we additionally computed U and V using the d and d-s orbitals as manifolds. To distinguish the resulting sets of U and V, we adopted the nomenclature d-p and d-s perturbations, following the chosen manifolds. For constructing the Hubbard projectors, we used Löwdin-orthogonalized atomic orbitals, and the q-grid was set to half of the k-point mesh used in the self-consistent step. For performing the perturbation, we identified nonequivalent atoms by symmetry. Atoms of the same type that are not equivalent by symmetry are differentiated (find_atpert = 3), and the convergence threshold for the response function χ was set to 10⁻⁷.
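As a rough illustration of these settings, the sketch below assembles an `&inputhp` namelist for the hp.x code. The helper `hp_input`, the example prefix and directory, and the rounding rule for odd k-point meshes are assumptions made here; only the q-grid halving, `find_atpert = 3` and the 10⁻⁷ threshold come from the text.

```python
import math


def hp_input(prefix, outdir, kmesh, conv_thr_chi="1.0d-7", find_atpert=3):
    """Return the text of an &inputhp namelist with the q-grid at half the k-mesh."""
    # Assumption: odd k-mesh values are rounded up and the q-grid is at least 1.
    nq = [max(1, math.ceil(k / 2)) for k in kmesh]
    lines = [
        "&inputhp",
        f"  prefix = '{prefix}'",
        f"  outdir = '{outdir}'",
        f"  nq1 = {nq[0]}, nq2 = {nq[1]}, nq3 = {nq[2]}",
        f"  conv_thr_chi = {conv_thr_chi}",
        f"  find_atpert = {find_atpert}",
        "/",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical example: a 4 x 4 x 2 k-point mesh in the self-consistent step.
hp_text = hp_input("qmof-example", "./tmp", kmesh=(4, 4, 2))
```

The Hubbard manifolds and the orthogonalized projectors are defined in the preceding self-consistent input, so they do not appear in this namelist.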

Computational resources

The calculations were performed on a high-performance computing cluster equipped with x86 and PowerPC compute nodes. Computation time depended on simulation type, and the calculation of Hubbard parameters stands out as the most computationally intensive. Specifically, the 10,435 ground-state calculations took about 12 CPU-years while the 412 Hubbard parameter calculations took roughly 10 times that, about 127 CPU-years.

Data Records

The TBHubbard dataset is available in the Harvard Dataverse at https://dataverse.harvard.edu/dataverse/tbhubbard (ref. 48). In total, the TBHubbard repository provides 10,863 ground-state calculations, including the QE self-consistent input and output files. The PAOFLOW projection output is provided along with a serialized object file (pickle) containing additional information. We provide the list of \(n_{\rm orb}\) orbitals used as basis and the tight-binding Hamiltonian, in real and reciprocal space. The Hamiltonian is stored as a tensor of dimensions \([n_{\rm orb}\, n_{\rm orb}\, k_1 k_2 k_3,\ 1]\), where the k-point grid (\(k_1 \times k_2 \times k_3\)) is the Monkhorst-Pack grid used in the QE input.
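A flattened Hamiltonian of this shape can be restored to a five-index tensor as sketched below. The function name `unflatten_hamiltonian` and the row-major axis order (orbital, orbital, k1, k2, k3) are assumptions; the ordering should be checked against the scripts shipped with the dataset before use.

```python
import numpy as np


def unflatten_hamiltonian(h_flat, n_orb, kgrid):
    """Reshape a [n_orb*n_orb*k1*k2*k3, 1] array to (n_orb, n_orb, k1, k2, k3)."""
    k1, k2, k3 = kgrid
    h_flat = np.asarray(h_flat)
    expected = n_orb * n_orb * k1 * k2 * k3
    if h_flat.size != expected:
        raise ValueError(f"expected {expected} entries, got {h_flat.size}")
    # Assumption: entries are stored in row-major (C) order.
    return h_flat.reshape(n_orb, n_orb, k1, k2, k3)


# Stand-in data: 3 orbitals on a 2 x 2 x 1 grid.
n_orb, kgrid = 3, (2, 2, 1)
h_demo = np.arange(3 * 3 * 2 * 2 * 1, dtype=complex).reshape(-1, 1)
H = unflatten_hamiltonian(h_demo, n_orb, kgrid)
```

Under the same assumption, and for an unshifted grid whose first point is Γ, the Γ-point matrix would be `H[:, :, 0, 0, 0]`.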

In the TB subset, each file contains the specific QE inputs and outputs, SOAP descriptors, and Bader/QE electronic charge distributions49. The SOAP descriptors, which capture the local atomic environments within MOFs, are provided in two variations: SOAP-3 Å and SOAP-5 Å. The descriptors represent the atomic topology at different length scales. The TB embeddings for each atomic species in each MOF were computed at the Γ point. For visualizing the data and input generation, we include a collection of auxiliary Python scripts. Specifically, we provide scripts for generating self-consistent QE calculations which are based on Crystallographic Information Files (CIF) and metadata within the QMOF dataset. In addition, we make available the scripts for extracting the TB projection, for visualizing the projection matrix at the Γ point and for constructing the TB embeddings. The code contributions are designed to ensure reproducibility of the computational workflow.

In the EH subset, we provide 428 files with the computation of the Hubbard parameters, using the hp.x executable from QE, through the construction of the susceptibility matrix, the corresponding set of U and V values, as well as the QE input/output for DFT+U+V calculations. From these files, 186 refer to the d-s perturbations and the remaining 242 are for d-p perturbations. For enabling accessibility, we provide JSON files containing the inputs and outputs of the QE calculations. In addition, Python scripts are included for reproducing the graphics, for generating JSON files as well as the QE input files used in the ground-state calculations.

Technical Validation

Diversity of structures

Before using first-principles simulation data for training machine learning models, we investigate the chemical diversity of the data sets. We created two distinct selections of materials data, each aiming at a different outcome. While the TB dataset is approximately half the size of the original QMOF dataset, the EH dataset contains, due to the higher computational cost, a much smaller number of materials. In Fig. 3(b)–(e), we present density plots of the materials distributions with regard to representative structural and electronic properties, for the QMOF dataset as well as the two data sets provided in this work. For all properties analyzed, the TB data exhibits a diverse distribution similar to the original dataset while the EH data differs, mainly due to the much smaller number of materials included.

Fig. 3

(a) Density histogram comparing the transition metal distribution in the QMOF dataset with the Tight-binding (TB) and Extended Hubbard (EH) data sets. The inset shows the PCA projection of TB embeddings for symmetry-independent metal atoms. In (b)–(e), probability density functions, comparing the QMOF dataset with the TB and EH data sets, are shown with regard to the number of atoms in the unit cell; the pore-limiting diameter, PLD, (in Å); the mass density (in g/cm³) and the standard DFT band gap (in eV), respectively.

Another important feature is the distribution of transition metals, since their presence in metal-organic frameworks is essential to the application of the EH model. The density histogram for the two data sets and the QMOF dataset is plotted in Fig. 3(a) for the structures that contain transition metal atoms. By design, the EH data set exhibits a concentration of MOF structures containing Zr and Hf atoms. Note that Zr-based MOFs are of particular interest from an application perspective50,51,52,53,54 due to the discovery of UiO-66’s high hydrothermal stability55. Similar structures have shown application potential as electrochemical sensors and biosensors56, as well as catalysts57,58. While the inter-site V parameters can provide insights into the metal-organic hybridization, the tight-binding parameters can be explored for investigating topological properties.

In the inset of Fig. 3(a), we present a two-dimensional Principal Component Analysis (PCA)59 projection of the tight-binding (TB) embeddings. The explained variance ratios for the first two and first four principal components are 0.90 and 0.93, respectively, indicating that a large fraction of the variance is captured in low-dimensional space. Interestingly, clusters corresponding to different TB embeddings of the same metal emerge, suggesting that the embeddings capture metal-specific electronic structure characteristics. In addition, to provide a broader view of the chemical diversity, we have included a t-distributed Stochastic Neighbor Embedding (t-SNE) plot60, constructed using the following features: number of atoms per unit cell, pore-limiting diameter (PLD), largest cavity diameter, mass density, volume, band gap, and atomic number of the transition metal. In Fig. 1(b), we observe a concentration of the EH data set, while the TB data set is spread out across the QMOF dataset, indicating a good representation of the source dataset. To visualize the diversity in the vicinity of metal atoms across the dataset, we computed a t-SNE projection of SOAP-3Å descriptors, using PCA for the initial dimensionality reduction, see Fig. 1(c). The QMOF dataset as well as the EH and TB data sets form distinct clusters in the reduced space, suggesting that differences in structural motifs and generation protocols are well captured by the SOAP representation.
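A PCA step followed by t-SNE, as used for the SOAP-3Å projection, can be sketched with scikit-learn. The eight PCA components follow the caption of Fig. 1; the perplexity, the random seed and the random stand-in data are assumptions chosen only to make the example run.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE


def project_descriptors(X, n_pca=8, perplexity=30.0, seed=0):
    """Reduce descriptors with PCA, then embed them in 2D with t-SNE."""
    pca = PCA(n_components=n_pca, random_state=seed)
    X_red = pca.fit_transform(X)
    # Fraction of the total variance kept by the PCA step.
    retained = float(pca.explained_variance_ratio_.sum())
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca",
                random_state=seed)
    return tsne.fit_transform(X_red), retained


# Stand-in for SOAP-3 Å vectors (684 values per metal atom); not real data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 684))
coords, retained = project_descriptors(X_demo, perplexity=15.0)
```

On real descriptors, `retained` indicates how much variance survives the PCA step, and `coords` can be colored by subset membership to reproduce a plot of this kind.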

The diversity analysis is useful for differentiating the two data sets with regard to their application potential. While the TB projection provides topological data for exploring a broader range of metal-organic frameworks, the EH data provides Hubbard parameters for the focused investigation of Zr- and Hf-based MOFs.

Tight-binding matrix

By using the PAOFLOW software, we are able to represent metal-organic framework structures with tight-binding Hamiltonians and project the electronic density onto localized atomic orbitals. We performed the projection for each k-point in the grid used in the ground-state calculation, where the Hamiltonian tensor can be obtained both in real and reciprocal space. For simplicity, we have chosen the Γ point to visualize the matrix shown in Fig. 2 for two representative MOFs: FePtC8H4N6 (or qmof-3dfbcbd) and CdNiC8H12N6 (or qmof-4d9a98c). The coefficients are always real numbers at the Γ point; however, we might obtain complex numbers at other k-points. In the graphics, we plot the modulus-squared TB parameters \(|t_{ij}|^2\). The localized atomic orbitals are the valence orbitals within the pseudopotential for every atom in the unit cell, which typically are s, p and d orbitals. Thus, the TB matrix can be divided into blocks, where diagonal blocks represent interactions among orbitals of the same chemical element, and off-diagonal blocks represent the interactions between different valence orbitals of different elements.

The false-color image visualizes the strength of the TB parameters, which is an indicator of nearest-neighbor interactions. In Fig. 2(a), the TB parameters correctly indicate the hopping between Fe-N, N-C, C-H and Pt-C, as verified with the corresponding MOF structure. In Fig. 2(b), the relevant hopping terms are Ni-C, C-H, C-N, Cd-N and N-H. Note that we have computed both U and V parameters for the Ni atoms and the Ni-N bonds in the MOF structure shown. However, we have not performed any Hubbard calculations for the Cd atoms.

Since interactions among intra-site orbitals might be strong, the maximum value of \(|t_{ij}|^2\) in the color bar has been set to 0.5 to facilitate the data visualization. The absolute maximum values are 121.8 eV² and 42.4 eV² for qmof-3dfbcbd (Fig. 2(a)) and qmof-4d9a98c (Fig. 2(b)), respectively. Based on the results obtained, we conclude that the representation of MOFs in TB lattice Hamiltonians can provide useful information on MOF topology and hybridization.
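The preparation of such a false-color matrix can be sketched as follows. Normalizing by the largest \(|t_{ij}|^2\) entry is an assumption made here, since the text does not spell out the normalization; the zeroed diagonal and the cap at 0.5 follow the caption of Fig. 2. The helper name `tb_image` is hypothetical.

```python
import numpy as np


def tb_image(h_gamma, vmax=0.5):
    """Return normalized |t_ij|^2 with zero diagonal, capped at vmax."""
    t2 = np.abs(np.asarray(h_gamma)) ** 2
    peak = t2.max()
    if peak > 0:
        # Assumption: normalize by the largest modulus-squared coefficient.
        t2 = t2 / peak
    np.fill_diagonal(t2, 0.0)  # hide the diagonal entries
    return np.clip(t2, 0.0, vmax)  # cap the color scale


# Small made-up matrix standing in for a Γ-point Hamiltonian.
h_demo_gamma = np.array([[2.0, 2.0, 0.0],
                         [2.0, 1.0, 1.0],
                         [0.0, 1.0, 1.0]])
img = tb_image(h_demo_gamma)
```

The resulting array can be passed to any image-plotting routine; capping the scale keeps weak inter-atomic couplings visible next to a few dominant entries.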

Hubbard parameters

Modeling MOFs using the EH Hamiltonian requires not only TB parameters, but also intra-site U and inter-site V Hubbard parameters. U is associated with a transition metal d-orbital and V refers to the interaction between the transition metal d-orbital and one of its nearest-neighbor orbitals. We provide two sets of values for U and V, here defined as d-p and d-s perturbations referring to the manifolds perturbed. In Fig. 4(a), the distribution of U values is plotted for each material of the EH data set. While a few structures may contain more than one transition metal species, we only plot one metal per MOF for simplicity. Interestingly, we observe that U increases with the atomic number for elements within the same row of the periodic table. Also, depending on the metal species, the intra-site parameter can show a large dispersion, as is the case for Ag and Cu, or it can take very similar values in different environments, as in the case of Zr, Hf and Y. While U generally refers to the same d orbital, performing the inter-site perturbation on p or s orbitals for computing V can alter the U outcome, yielding systematically smaller values for d-s perturbations.

Fig. 4

Scatter plot of (a) intra-site U and (b) inter-site V Hubbard parameters for the Extended Hubbard (EH) data set, ordered by occurrence of transition metals in each material, considering the calculations with different types of manifolds, i.e., performing perturbations on d-p or d-s orbitals, see Methods section for details. Scatter plot of band gap energies computed using the standard DFT \(({{\rm{E}}}_{{\rm{g}}}^{{\rm{DFT}}})\) and the DFT+U+V \(({{\rm{E}}}_{{\rm{g}}}^{{\rm{DFT+U+V}}})\) framework for the EH subset considering (c) d-p and (d) d-s perturbations. The color map represents the atomic number of the transition metal associated with each material.

The V distribution per metal-organic interaction is shown in Fig. 4(b). For clarity, we plot one V value per structure, which is equivalent to the average nearest-neighbor V. The data exhibits a large dispersion, where we observe positive values for d-p perturbations, and relatively small, negative values for d-s perturbations.

In view of applications, the sets of U and V values provided in this work are aimed at supporting tight-binding modeling of metal-organic frameworks. They can be utilized in the training of machine-learning models and for exploratory data analysis. In addition, the EH data might support applications of quantum computing. In one example, the data are used as input for computing the band gap of representative semiconductors in a quantum-centric materials simulation workflow61.

Band gap predictions

Standard DFT can be combined with higher-level hybrid functionals to correct the self-interaction error62 in systems with strong electronic correlations31. The DFT+U+V methodology is a computationally efficient alternative to improve band gap predictions, which typically fail under GGA. Stronger electronic correlations could occur in MOFs containing transition metals with localized orbitals, which can be explored in our data contribution.

In Fig. 4(c,d), we show a comparison of the band gap energies computed using DFT and DFT+U+V for d-p and d-s perturbations, respectively. In both cases, we observe a large concentration of materials along the diagonal, indicating that band gap energies of most structures remain unaffected by DFT+U+V corrections. While this result is surprising, we note that most materials in the data set contain Zr or Hf. For other materials the band gap systematically increases, as expected. For additional information, we refer the reader to the Supporting Information.

Downstream applications in machine-learning

The combination of TB embeddings and SOAP descriptors offers a powerful approach for generative materials discovery. In this framework, TB embeddings serve as a compact representation of the electronic structure, where a generative model would explore the parameter space proposing novel structures. SOAP descriptors could assist in reconstructing structural information, acting as an auxiliary tool for decoding atomic arrangements that may not be captured by the TB embeddings alone. Overall, this allows for predicting interactions and elemental species based on the TB embeddings while simultaneously leveraging the SOAP descriptors for refining structural details.

For investigating this scenario, we have analyzed the metal atoms present in the TB data set. We have grouped atoms that are equivalent by symmetry, including those that appear distinct based on their CIF files, if they occupied nearly equivalent sites. From each group, we have selected one representative atom to ensure that the dataset remains balanced, avoiding over-representation of redundant entries.

We have constructed the TB embeddings by selecting the blocks that contain the six strongest interactions for each metal atom, see Methods section for details. Each embedding consists of six 13 × 13 blocks, resulting in a total vector size of 1014. The dataset contains 21,186 entries. In Fig. 5(a), we show a visualization of the TB embedding for a specific atom.

Fig. 5

(a) TB embeddings for a Zn atom in qmof-fffeb7b, showing the top-7 strongest interaction blocks extracted from the Hamiltonian parameters. Each block is represented by a 13 × 13 matrix and ranked by the absolute maximum value within the block, capturing the most significant electronic interactions regardless of sign. The y-label indicates the atom for which the embedding is computed, while the x-label denotes the rank in the i-j block. To aid visualization, the values are plotted within the range [−2.5, 2.5], with any values outside this range clipped by the color palette. (b) Distribution of pairwise Euclidean distances between SOAP feature vectors computed with SOAP-3Å (red) and SOAP-5Å (blue). Vertical dotted lines indicate the mean error of the Euclidean distance in test set predictions, located at 33.346 for SOAP-3Å (red) and 246.464 for SOAP-5Å (blue). (c) Mean Euclidean distance error between true and predicted values (computed over all test samples) as a function of the number of tight-binding embedding blocks used. The error is reported for both short-range (SOAP-3Å) and long-range (SOAP-5Å) descriptors.

We have trained a RandomForestRegressor63 for predicting SOAP descriptors based on TB embeddings in a four-dimensional PCA-reduced space. To that end, we have used the reduced TB embeddings as input features and the full SOAP vectors as targets. For testing, the same PCA transformation computed on the training set is applied to the unseen test samples before making predictions. This approach ensures that no information from the test set is leaked into the training process. The same settings are applied to the prediction of SOAP-3Å and of SOAP-5Å descriptors. We have used an 85:15 train-test split and set the number of estimators to 100. Further details are provided in the Methods section.
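The training protocol described above can be sketched with scikit-learn. The PCA dimension, the 85:15 split and the 100 estimators follow the text; the function name `fit_soap_regressor`, the random seed, the reduced target size and the random stand-in data are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def fit_soap_regressor(X, Y, n_pca=4, seed=0):
    """Fit PCA + random forest on the training split; return model and test error."""
    X_tr, X_te, Y_tr, Y_te = train_test_split(
        X, Y, test_size=0.15, random_state=seed
    )
    # The pipeline fits PCA on the training split only and reuses that
    # transformation for the test split, so no test information leaks.
    model = make_pipeline(
        PCA(n_components=n_pca),
        RandomForestRegressor(n_estimators=100, random_state=seed),
    )
    model.fit(X_tr, Y_tr)
    Y_pred = model.predict(X_te)
    # Mean Euclidean distance between predicted and true target vectors.
    err = float(np.linalg.norm(Y_pred - Y_te, axis=1).mean())
    return model, err


# Stand-in data: 200 embeddings of length 1014 and 30-dimensional targets.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 1014))
Y_demo = rng.normal(size=(200, 30))
model, err = fit_soap_regressor(X_demo, Y_demo)
```

Wrapping both steps in one pipeline is a design choice that makes the leak-free handling of the PCA transformation automatic when the model is applied to unseen samples.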

In Fig. 5(b), we show the distribution of pairwise Euclidean distances obtained for SOAP feature vectors using cutoff radii of 3 and 5 Å, indicated in red and blue, respectively. The vertical dotted lines mark the mean Euclidean distance error of the test set predictions using the six strongest blocks. The error for SOAP-3Å falls within the lower range of its overall distance distribution, indicating that the predicted descriptors remain relatively close to their true values. For SOAP-5Å, the error is larger, as expected for the larger environment being predicted. However, it still falls within the lower part of its distribution and provides an acceptable level of predictive accuracy.
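To make the comparison between prediction error and distance distribution concrete, the sketch below computes all pairwise Euclidean distances within a set of vectors and reports the fraction of those distances that lie below the mean prediction error. The helper names `pairwise_distances` and `error_position` are hypothetical, and the three two-dimensional vectors are toy data rather than SOAP descriptors.

```python
import numpy as np

def pairwise_distances(vectors):
    """Euclidean distances for all unordered pairs (i < j) of row vectors."""
    sq = (vectors ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * vectors @ vectors.T
    upper = np.triu_indices(len(vectors), k=1)
    # Clip tiny negative values caused by floating-point round-off.
    return np.sqrt(np.clip(d2[upper], 0.0, None))

def error_position(true, pred, reference):
    """Mean prediction error and the percentage of pairwise distances below it."""
    error = np.linalg.norm(pred - true, axis=1).mean()
    distances = pairwise_distances(reference)
    return error, 100.0 * (distances < error).mean()

true = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
pred = true + np.array([0.0, 6.0])  # every prediction is off by a distance of 6

distances = pairwise_distances(true)
error, percent_below = error_position(true, pred, true)
print(distances, error, percent_below)
```

A small percentage means that predicted vectors are closer to their targets than most pairs of distinct structures are to each other, which is the sense in which an error can be said to fall in the lower part of the distribution.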

To assess the predictive power of the TB embeddings, we have progressively increased the number of included blocks from 1 to 10 and applied PCA to each variation. This approach allows us to evaluate how many blocks are necessary to achieve accurate SOAP predictions. Fig. 5(c) shows the mean Euclidean distance error between the predicted and actual SOAP vectors as a function of the number of included blocks. We observe that the error decreases significantly when increasing from 3 to 6 blocks, but remains fairly constant when the number of blocks is increased further. This suggests that six blocks capture sufficient information for predicting SOAP descriptors.

From an application perspective, the TB embeddings can be used to identify species involved in interactions. When combined with SOAP descriptors, TB embeddings enable the resolution of the entire material composition. In the context of MOFs, the SOAP representation can be employed for searching similar structures within the existing MOF building blocks, i.e., metal clusters and organic linkers. This would enable the reconstruction and validation of MOF structures through simulations, facilitating material property optimization as well as structural analysis. In the Supporting Information, we present an example illustrating this process in detail.

Prediction of Hubbard U and V Parameters from TB Embeddings

In the following, we assess the utility of tight-binding (TB) embeddings for predicting the Hubbard intra-site U and inter-site V parameters. In our setup, each 13 × 13 TB matrix block serves as input for a regression model predicting Hubbard parameter values.

The intra-site Hubbard Ui parameter corresponds to diagonal blocks representing electron interactions on atom i. The inter-site Vij parameters are derived from off-diagonal blocks representing interactions between atoms i and j within the same unit cell. For simplicity, we treat the intra-site Hubbard parameter Ui ≡ Vii as a special (diagonal) case of the inter-site parameter. For each MOF, we have selected the 10 strongest TB blocks – ranked by the magnitude of their largest matrix element – for building the training data set.

To avoid ambiguity in periodic systems, we restrict our analysis to Hamiltonians evaluated at the Γ point and apply the minimum image convention: for any given atom i, an interaction is considered only if the associated atom j lies closer to i than any periodic image of i. This cutoff ensures that only physically relevant, short-range interactions are included.
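One possible reading of this criterion is sketched below; it is an interpretation made for illustration and not necessarily the authors' exact procedure. The sketch assumes fractional coordinates, a 3 × 3 cell matrix with lattice vectors as rows, and that searching the 27 neighbouring cells is sufficient, which holds for cells that are not strongly skewed. All function names are hypothetical.

```python
import itertools
import numpy as np

# Integer lattice shifts to the 27 neighbouring cells (including the home cell).
SHIFTS = np.array(list(itertools.product((-1, 0, 1), repeat=3)))

def min_image_distance(frac_i, frac_j, cell):
    """Shortest distance from atom i to any periodic image of atom j."""
    delta = (np.asarray(frac_j) - np.asarray(frac_i)) + SHIFTS
    return np.linalg.norm(delta @ cell, axis=1).min()

def self_image_distance(cell):
    """Distance from an atom to its own nearest periodic image."""
    nonzero = SHIFTS[np.any(SHIFTS != 0, axis=1)]
    return np.linalg.norm(nonzero @ cell, axis=1).min()

def is_included(frac_i, frac_j, cell):
    """Keep the i-j interaction only if j is closer to i than i's own images."""
    return min_image_distance(frac_i, frac_j, cell) < self_image_distance(cell)

# Toy anisotropic cell: 10 Å along a, 30 Å along b and c.
cell = np.diag([10.0, 30.0, 30.0])
origin = [0.0, 0.0, 0.0]

near = is_included(origin, [0.1, 0.0, 0.0], cell)  # 1 Å away
far = is_included(origin, [0.0, 0.5, 0.5], cell)   # about 21 Å away
wrapped = min_image_distance(origin, [0.95, 0.0, 0.0], cell)
print(near, far, wrapped)
```

In this toy cell the self-image distance is 10 Å, so the first neighbour is retained and the second is discarded, while the third example shows an atom near the cell boundary being mapped to its nearest image.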

The training data are taken from the extended_hubbard_model/dp_perturbations subfolder and comprise 240 unique MOFs, after exclusion of two MOFs with anomalously high V values. For each MOF in the data set, we selected the U and V parameters corresponding to the first metal atom appearing in the structure. Of the original 9,754 entries, we selected the 2,386 with the highest absolute V values per MOF. We then performed the data splitting at the MOF level to ensure that the test sets contained only unseen materials, thus preventing data leakage.

We trained a single RandomForestRegressor model for predicting both U and V values, without distinguishing between them. To validate the model’s predictive capabilities, we performed a 10-fold cross-validation using the top 10 strongest TB embedding blocks per MOF. We trained the model with 100 estimators and default hyperparameters, leveraging the embeddings as input features and the Hubbard V-values as targets. We implemented the cross-validation in Python using scikit-learn’s RandomForestRegressor and KFold utilities. Further implementation specifics, such as the selection of top-k blocks and data pre-processing, are described in the Methods section.
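The cross-validation with MOF-level splitting can be sketched as follows. The data are synthetic placeholders (40 hypothetical MOFs with 10 blocks each and an artificial target), and splitting `KFold` over the unique MOF identifiers is one way to keep all blocks of a given MOF in the same fold; whether the original workflow does it in exactly this way is an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n_mofs, blocks_per_mof = 40, 10                         # placeholder sizes
mof_ids = np.repeat(np.arange(n_mofs), blocks_per_mof)  # MOF label of each block
X = rng.normal(size=(n_mofs * blocks_per_mof, 169))     # flattened 13 x 13 blocks
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=len(X))        # synthetic "Hubbard" target

unique_mofs = np.unique(mof_ids)
kfold = KFold(n_splits=10, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in kfold.split(unique_mofs):
    # Split on MOF identifiers so that no MOF appears in both train and test.
    train = np.isin(mof_ids, unique_mofs[train_idx])
    test = np.isin(mof_ids, unique_mofs[test_idx])
    assert not set(mof_ids[train]) & set(mof_ids[test])

    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X[train], y[train])
    pred = model.predict(X[test])
    scores.append((r2_score(y[test], pred), mean_absolute_error(y[test], pred)))

r2, mae = np.array(scores).T
print(f"R2: mean {r2.mean():.3f}, min {r2.min():.3f}, max {r2.max():.3f}")
print(f"MAE: mean {mae.mean():.3f}")
```

scikit-learn's `GroupKFold` offers an equivalent grouped split when the MOF identifiers are passed as `groups`; the explicit masking above simply makes the absence of overlap visible.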

Even though the data are limited, we observe that the model exhibits reasonable predictive performance. The average coefficient of determination R² is 0.914, with minimum and maximum values of 0.716 and 0.989, respectively. We obtain an average test mean absolute error (MAE) of 0.134 and mean squared error (MSE) of 0.179 across folds.

The modeling results are shown in Fig. 6. Overall, they demonstrate that the TB blocks carry the critical information regarding electronic interactions in MOFs, and that both U and V values can be predicted robustly from TB embeddings across a chemically diverse set of MOF structures.

Fig. 6

Target and predicted Hubbard parameter values, for train (blue) and test (orange) sets, for one of the ten cross-validation splits. (a) Intra-site Hubbard parameters Ui ≡ Vii. (b) Inter-site Hubbard parameters Vij, for i ≠ j.