TBHubbard: tight-binding and extended Hubbard model dataset for metal-organic frameworks

Costa Carvalho, Pamela; Zipoli, Federico; Duriez, Alan C.; Barroca, Marco Antonio; Neumann Barros Ferreira, Rodrigo; Jones, Barbara; Wunsch, Benjamin; Steiner, Mathias

doi:10.1038/s41597-025-06054-w

Data Descriptor
Open access
Published: 12 November 2025

Pamela Costa Carvalho¹^na1,
Federico Zipoli ORCID: orcid.org/0000-0001-8345-9965^2,3^na1,
Alan C. Duriez⁴,
Marco Antonio Barroca⁴,
Rodrigo Neumann Barros Ferreira ORCID: orcid.org/0000-0003-4435-0507⁴,
Barbara Jones⁵,
Benjamin Wunsch⁶ &
…
Mathias Steiner⁴

Scientific Data volume 12, Article number: 1776 (2025)

Abstract

Metal-organic frameworks (MOFs) are porous materials composed of metal ions and organic linkers. Due to their chemical diversity, MOFs can support a broad range of applications in chemical separations. However, the vast number of structural compositions encoded in crystallographic information files complicates application-oriented computational screening and design. The existing crystallographic data, therefore, requires augmentation by simulated data so that suitable descriptors for machine-learning tasks become available. Here, we provide extensive simulation data augmentation for MOFs within the QMOF dataset. We have applied a tight-binding lattice Hamiltonian and density functional theory to MOFs for performing electronic structure calculations. Specifically, we provide a tight-binding representation of 10,435 MOFs, and an Extended Hubbard model representation for a subset of 242 MOFs containing transition metals, where intra-site U and inter-site V parameters are computed self-consistently. In addition to computational workflows for identifying structure-property correlations, the data supports quantum computing tasks that rely on tight-binding Hamiltonians and self-consistently computed Hubbard parameters. For validation and reuse, we have made the data publicly available.

Background & Summary

Metal-organic frameworks (MOFs) are porous materials with applications in gas capture and storage^1,2, catalysis^3,4, biomedicine^5,6, electrical conductivity^7,8, transport and diffusion^9,10 and chemical sensing^11,12. They consist of structural building blocks formed by metal clusters and organic linkers, and form ordered, nanoscale pores¹³. Their properties are determined by the combination of their building units with a certain topology, which is unique to each structure.

Given the wide variety of MOFs that have been hypothesized^14,15 and synthesized^16,17, high-throughput computational screening plays an important role in identifying MOF candidates that are suitable for a specific application. Even though screening MOF datasets by means of ab-initio simulations is feasible, see¹⁸, it is computationally impractical in many cases. Data-driven techniques can aid in processing large amounts of MOF data by identifying correlations between structure and property in machine-learning (ML) based workflows¹⁹. Likewise, computational discovery by means of inverse design with pre-defined target properties²⁰, such as the creation of MOFs with high CO₂ affinity^21,22,23, requires the application of advanced ML methods and high-quality datasets.

To effectively screen and design MOFs for applications, both structural and electronic data are needed. While structural data files are publicly available in large quantities^16,17,24,25,26,27, electronic structure data are sparse. For MOFs, high-quality electronic structure data could be created by deploying a suitable physical model that accounts for electronic correlations in the presence of metal clusters.

Computationally, PAOFLOW^28,29 is a key tool for projecting the electronic structure onto a tight-binding (TB) Hamiltonian using localized valence atomic orbitals as a basis. Thus, nearest-neighbor interactions are captured by the matrix coefficients, which could potentially be used in ML workflows to predict new structures with optimized features.

For the purpose of this contribution, we consider a lattice Hamiltonian with correctional terms to properly account for the electronic correlations and hybridization occurring within and between the building units of representative MOFs. In the Extended Hubbard (EH) model³⁰, strong electronic correlations are captured by the intra-site and inter-site Hubbard parameters U and V, respectively. Typically, U corrects for transition metal contributions, due to their localized atomic orbitals, while V represents the interaction between metal ions and their nearest neighbors. In the case of MOFs, we expect these parameters to improve upon standard tight-binding (TB) description by accounting for the electronic interaction within metal clusters and the hybridization with organic constituents.

An important application of the EH model in electronic structure calculations based on density-functional theory (DFT) is introducing corrections to d and f-electron energies. Known as DFT+U or DFT+U+V, these methods improve the accuracy of band gap predictions where standard DFT typically fails³¹. The U parameter can be estimated by matching experimental data or by invoking computational methods involving hybrid functionals, e.g., in the prediction of gas adsorption energies in MOFs³². However, such parameters can also be derived from first-principles^33,34,35. As such predictions are computationally costly, there is a growing need for creating datasets of Hubbard parameters^36,37. In the case of MOFs, a dataset augmentation with computed U and V values would be a valuable contribution to the theoretical investigation of MOFs. In particular, it would enable the application of data-driven strategies in materials screening and design workflows.

In this work, we provide two augmented datasets based on QMOF^38,39 that are visualized in Fig. 1. In the TB dataset, we projected the electronic density onto a tight-binding Hamiltonian with PAOFLOW^28,29 and provided TB matrices as well as Smooth Overlap of Atomic Positions (SOAP)⁴⁰ descriptors for 10,435 materials. In the EH dataset, besides the TB parameters, we computed intra-site U and inter-site V Hubbard parameters for a select group of 242 MOFs using Quantum Espresso^41,42,43, enabling the creation of an Extended Hubbard Hamiltonian representing each material. The datasets allow for exploring potential correlations between structural and electronic properties in computational workflows for MOF screening and design.

Methods

We provide two complementary subsets tailored for inverse design: the TB and the EH subsets. In the case of the TB subset, the first-principles DFT calculations were performed for 10,435 MOFs along with their TB projection and Smooth Overlap of Atomic Positions (SOAP) descriptors, which were used as fingerprints of the local environment to describe the topology of each material. While the DFT-based calculations provide a quantum-mechanical foundation for electronic structure analysis, the SOAP descriptors offer a data-driven representation of atomic environments, enabling material discovery. In the case of the EH subset, besides the TB projection, additional EH parameters, i.e. intra-site U and inter-site V interactions, were computed for a smaller set of 242 MOFs. In the following, we outline the methodologies used for generating the data. This includes selecting the structures, performing the ground-state calculations and tight-binding projections, as well as computing the SOAP descriptors, TB embeddings and Hubbard parameters.

Structure Selection

Our starting point is the QMOF dataset^38,39, containing 20,375 metal-organic framework structures. The selection process for creating the TB subset began by down-sampling from 20,375 to 10,435 MOFs, prioritizing the diversity of metal ions in the clusters and focusing on structures without spin polarization. In the next step, we filtered the data set down to 242 MOFs to form the EH subset. This step involved selecting materials with a pore-limiting diameter (PLD) larger than 3.3 Å for maximizing the application potential in, for example, CO₂ capture and ionic transport. For this subset, we focused on materials with at least one transition metal in the cluster. Note that we avoided the computation of U for metals such as Zn, Cd, Hg, Cn and La, for which it led to unphysical values of U > 20 eV.

Ground State

To perform electronic structure calculations based on DFT, we used Quantum Espresso (QE)^41,42,43 within the generalized gradient approximation (GGA)⁴⁴. The inputs of our calculations, such as atomic positions, spin-polarization and k-point mesh, resembled those available in the QMOF dataset. Following previous work related to ab initio modeling of MOFs¹⁸, we included Grimme’s D3 van der Waals corrections⁴⁵ with Becke-Johnson damping (dftd3 = 4). The values used as convergence threshold and mixing factor for self-consistency were 10⁻⁸ and 0.1, respectively, along with Davidson diagonalization. We fixed the atomic positions throughout the self-consistent cycle, considering that the structures’ geometries were previously optimized. For each material, we used the kinetic energy cutoff of the hardest pseudopotential among the species involved, except when unavailable. In that case, we used 50 Ry and 400 Ry for wavefunction and density, respectively.
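The cutoff rule and self-consistency settings described above can be summarized in a short Python sketch. The helper names (`pick_cutoffs`, `scf_settings`) and the example cutoff table are hypothetical, and reading "unavailable" as "at least one species has no recommended cutoff" is an assumption. The keyword names follow pw.x input conventions, with the text's `dftd3 = 4` mapped to `dftd3_version`.

```python
# Fallback cutoffs quoted in the text (Ry), used when no recommendation exists.
FALLBACK_ECUTWFC = 50.0
FALLBACK_ECUTRHO = 400.0


def pick_cutoffs(species, recommended):
    """Return (ecutwfc, ecutrho) in Ry for the elements in `species`.

    `recommended` maps element symbol -> (ecutwfc, ecutrho). Assumption: if any
    element has no entry, the fallback pair is used for the whole material.
    """
    pairs = [recommended.get(element) for element in species]
    if not pairs or any(pair is None for pair in pairs):
        return FALLBACK_ECUTWFC, FALLBACK_ECUTRHO
    # The hardest pseudopotential is the one demanding the largest cutoff.
    return max(p[0] for p in pairs), max(p[1] for p in pairs)


def scf_settings(species, recommended):
    """Collect the settings named in the text as pw.x-style namelist entries."""
    ecutwfc, ecutrho = pick_cutoffs(species, recommended)
    return {
        "SYSTEM": {"ecutwfc": ecutwfc, "ecutrho": ecutrho,
                   "vdw_corr": "dft-d3", "dftd3_version": 4},
        "ELECTRONS": {"conv_thr": 1e-8, "mixing_beta": 0.1,
                      "diagonalization": "david"},
    }


# Hypothetical cutoff table; real values come from the pseudopotential files.
example_table = {"Zr": (30.0, 240.0), "O": (50.0, 400.0), "C": (45.0, 360.0)}
settings = scf_settings(["Zr", "C"], example_table)
```

Keeping the cutoff choice in its own function makes the fallback behaviour easy to audit when a pseudopotential file carries no suggested cutoff.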

For the TB subset, to ensure the Γ point is included for all materials, we ignored shifted k-point meshes. Nevertheless, we validated for a few representative materials that the use of shifted/unshifted meshes did not alter the total energy by more than ~10 μeV. The van der Waals (vdW) correction was excluded from the SCF calculations, since it did not affect the wavefunction analysis and projection of the density of states. As a result, the total energy did not include any vdW contributions. Given the broad range of elements involved, we chose ultrasoft pseudopotentials from the PSlibrary⁴⁶ to ensure compatibility with a diverse set of materials. The electronic density was exported in cube format for computing Bader charges, which are commonly used as descriptors of atomic structure and features.

For the EH subset, we used pseudopotentials taken from the SSSP PBE Efficiency v1.3.0 set⁴⁷, taking into account that the prediction of Hubbard parameters is computationally expensive. Note that the ground-state computations were carried out both with initial, negligible guesses of the Hubbard parameters, of the order of 10⁻⁸ eV, which are necessary for the application of density-functional perturbation theory, and with the self-consistently computed U- and V-values. At this point, we computed the band gaps as the energy difference between the lowest unoccupied and highest occupied levels. The band-gap values computed with DFT+U+V for d-p and d-s perturbations differ, as the computations were performed with different sets of U- and V-values.

Tight-binding Projection

In order to obtain a tight-binding representation of the selected materials, we used PAOFLOW^28,29, a software tool which embeds outputs from electronic structure, plane-wave pseudopotential calculations into pseudo-atomic orbitals. The projection provides the Hamiltonian, both in real and reciprocal space, within the localized orthogonal atomic basis according to the valence orbitals present in the pseudopotential used for each element. The Hamiltonian coefficients are known as TB parameters. PAOFLOW is fully compatible with QE, which facilitates the data conversion, as only the output of the self-consistent part is required for the projection. In the present subset, we computed the tight-binding coefficients for each material based on the standard DFT output and the information of the corresponding atomic orbital basis. In Fig. 2, we show examples of the TB matrix visualization for two representative materials.

Machine-learning Descriptors

To characterize the structure of MOFs, we computed two complementary descriptor sets. The first is based on SOAP descriptors, which encode local atomic environments. The second is based on an original contribution of this work: a DFT-informed fingerprint derived from TB projections and structured into fixed-size embeddings that capture essential features of the electronic structure.

Analysis of the local environment of the metal clusters in MOFs is crucial. In our SOAP descriptor calculations, we anonymized the identity of metal atoms by adopting a reduced species set that merges metals within a single category, while maintaining detailed classifications for the key elements of most organic linkers, such as hydrogen, oxygen, carbon, and nitrogen. This simplification is especially useful when dealing with large datasets. For the purpose of this contribution, we used two sets of SOAP descriptors. The SOAP-3 Å descriptor captures information about the first nearest-neighbor shell while the SOAP-5 Å descriptor extends beyond the second shell, thus providing a broader range of local structural information. While the former contains 684 values, the latter has 1500 values.

For creating the TB embeddings, we standardized the TB matrix entries for each atom pair and cast them into a fixed-size format to enable consistent application in machine-learning models. Each t_ij between atomic orbitals is embedded into a 13 × 13 matrix block, regardless of the specific elements involved.

The 13 × 13 matrix stems from setting up the orbital configuration per element as 2 s orbitals, e.g., semicore and valence, 6 p orbitals, i.e., 3 semicore and 3 valence, as well as 5 d orbitals. This leads to 13 × 13 blocks that capture all pairwise orbital interactions between two atoms commonly found in MOFs. The matrix is filled with zeros in case orbitals are absent from a specific element.

We excluded f-orbitals from this basis due to their relative scarcity in the systems studied and due to the significant increase in matrix dimensionality they would entail, up to 20 × 20. Their inclusion would have doubled the computational burden and potentially diluted the machine-learning correlation signal of the prevalent orbital types. In cases where a pseudopotential provides fewer orbitals, e.g., only a single s or p orbital, we mapped the available orbitals to the outermost valence orbitals within the standardized block structure. This ensured that the most chemically relevant contributions, i.e., those related to bonding and reactivity, were retained and aligned consistently across different elements. In spin-polarized systems, the matrix size could be doubled to account for spin-up and spin-down channels. While not within the scope of the present work, this would not affect the generalization of the approach, unlike the inclusion of f-orbitals which would significantly increase the representation size. For each atom pair (i, j), a slice of the full TB matrix was extracted, corresponding to their orbital interaction block. The 13 × 13 matrices were extracted from both diagonal and off-diagonal entries, ranked by the maximum absolute value of the strength of interaction, and the top-k blocks were then selected. The resulting 13 × 13 × k tensor was flattened to form a fixed-length vector that represents the TB embedding for the atom. These embeddings were aligned with corresponding SOAP descriptors to serve as structured inputs for machine learning workflows.
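The block construction described above can be illustrated with a small NumPy sketch. It is not the authors' script. The slot layout inside the 13 × 13 block, the function names and the zero-padding used when an atom has fewer than k blocks are assumptions made for illustration, and the Hamiltonian is a random toy matrix.

```python
import numpy as np

BLOCK = 13  # 2 s + 6 p + 5 d slots per atom
# Hypothetical slot layout: 0 = semicore s, 1 = valence s,
# 2-4 = semicore p, 5-7 = valence p, 8-12 = d.


def pad_block(sub, row_slots, col_slots):
    """Place the orbital sub-block of an atom pair into a zero-filled 13 x 13 block."""
    out = np.zeros((BLOCK, BLOCK))
    out[np.ix_(row_slots, col_slots)] = sub
    return out


def tb_embedding(h_gamma, orbitals, slots, i, k=6):
    """Flatten the k strongest blocks (diagonal and off-diagonal) of atom i."""
    blocks = []
    for j in range(len(orbitals)):
        sub = h_gamma[np.ix_(orbitals[i], orbitals[j])]
        blocks.append(pad_block(sub, slots[i], slots[j]))
    # Rank blocks by their largest absolute matrix element, strongest first.
    blocks.sort(key=lambda b: float(np.abs(b).max()), reverse=True)
    top = blocks[:k]
    # Assumption: pad with empty blocks when fewer than k pairs exist.
    top += [np.zeros((BLOCK, BLOCK))] * (k - len(top))
    return np.concatenate([b.ravel() for b in top])


# Toy system: atom 0 has one s and one p shell, atom 1 has a single s orbital.
# Single shells are mapped to the valence slots, as described in the text.
rng = np.random.default_rng(0)
h = rng.normal(size=(5, 5))
h_gamma = (h + h.T) / 2  # symmetric stand-in for a Gamma-point TB matrix
orbitals = [[0, 1, 2, 3], [4]]  # matrix indices belonging to each atom
slots = [[1, 5, 6, 7], [1]]     # target slots inside the 13 x 13 block
embedding = tb_embedding(h_gamma, orbitals, slots, i=0)
```

With k = 6 the flattened vector has 6 × 13 × 13 = 1014 entries regardless of which elements are involved, which is what makes the representation fixed-size.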

In this context, the TB embeddings provide the source data, while the SOAP descriptors act as the target data. The embeddings provide a structured representation of a material’s electronic environment and can be used to predict atomic pair species as well as SOAP descriptors, facilitating material characterization and structure reconstruction.

Hubbard Parameters

To compute the intra-site U and inter-site V Hubbard parameters, we used the QE software based on the implementation of density functional perturbation theory in the DFT+U+V framework^30,34. The U correction applies to transition metals and V represents the coupling between the transition metal and its nearest neighbors. The Hubbard parameter calculation is orbital-dependent and the desired manifold should be defined a priori in the self-consistent step. We chose the d and d-p orbitals as manifolds for U and V, respectively, where the localized d orbital is located at the transition metal and the p orbital is located at the metal’s nearest neighbor. Note that it is not possible to compute Hubbard parameters for more than one manifold at a time, due to the high computational cost. For a subset of 186 materials, we also computed U and V using the d and d-s orbitals as manifolds. To distinguish the different results for U and V, we adopted d-p and d-s perturbations as nomenclature following the chosen manifolds. For constructing the Hubbard projector, we used Löwdin orthogonalized atomic orbitals, and the q-grid was set to half of the k-point mesh used in the self-consistent step. For performing the perturbation, we identified nonequivalent atoms by symmetry. Atoms of the same type, but not equivalent by symmetry, are differentiated (find_atpert = 3) and the convergence threshold for the response function χ was set to 10⁻⁷.
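As a rough illustration of the perturbation settings above, the following Python sketch renders an hp.x-style input namelist. The function name, the `prefix` and `outdir` values, and the rounding rule for odd k-point meshes (integer division, with a minimum of 1) are assumptions; only `find_atpert = 3`, the 10⁻⁷ threshold and the halved q-grid come from the text.

```python
def hp_input(prefix, kmesh, outdir="./tmp"):
    """Render an &inputhp namelist with a q-grid at half the k-point mesh."""
    # Assumption: odd mesh values are halved by integer division, never below 1.
    nq1, nq2, nq3 = (max(1, k // 2) for k in kmesh)
    lines = [
        "&inputhp",
        f"  prefix = '{prefix}'",
        f"  outdir = '{outdir}'",
        f"  nq1 = {nq1}, nq2 = {nq2}, nq3 = {nq3}",
        "  conv_thr_chi = 1.0d-7",  # convergence threshold for the response function
        "  find_atpert = 3",        # perturb atoms not equivalent by symmetry
        "/",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical material label and k-point mesh.
example_input = hp_input("mof-example", (4, 4, 2))
```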

Computational resources

The calculations were performed in a high-performance computing cluster equipped with x86 and PowerPC compute nodes. Computation time depended on simulation type, and the calculation of Hubbard parameters stands out as the most computationally intensive. Specifically, the 10,435 ground-state calculations took about 12 CPU-years while the 428 Hubbard parameter calculations took roughly 10 times that, about 127 CPU-years.

Data Records

The TBHubbard dataset is available in the Harvard Dataverse https://dataverse.harvard.edu/dataverse/tbhubbard⁴⁸. In total, the TBHubbard repository provides 10,863 ground state calculations, including the QE self-consistent input and output files. The PAOFLOW projection output is provided along with a serialized object file (pickle) containing additional information. We provide the list of n_orb orbitals used as basis and the tight-binding Hamiltonian, in real and reciprocal space. The Hamiltonian is stored as a tensor of dimensions [n_orb, n_orb, k₁, k₂, k₃, 1], where the k-point grid (k₁ × k₂ × k₃) is the Monkhorst-Pack grid used in the QE input.
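To make the tensor layout concrete, the sketch below builds a synthetic real-space tensor with the stated dimensions and recovers the Γ-point matrix from it. The array is random stand-in data, not a record from the repository, and the variable names are hypothetical. The sketch relies only on the general identity that the Γ-point Hamiltonian equals the sum of the real-space Hamiltonian over all lattice vectors. Whether the stored reciprocal-space tensor places Γ at grid index (0, 0, 0) is an assumption that should be checked against the dataset's own scripts.

```python
import numpy as np

rng = np.random.default_rng(0)
n_orb, grid = 4, (2, 2, 2)

# Synthetic stand-in for the real-space tensor [n_orb, n_orb, k1, k2, k3, 1].
h_r = rng.normal(size=(n_orb, n_orb, *grid, 1))

# A discrete Fourier transform over the three grid axes gives H on the k-grid;
# its zero-frequency component is the Gamma point.
h_k = np.fft.fftn(h_r, axes=(2, 3, 4))
h_gamma = h_k[:, :, 0, 0, 0, 0]

# Equivalent definition: sum the real-space blocks over all lattice vectors.
h_gamma_sum = h_r.sum(axis=(2, 3, 4))[:, :, 0]

# Modulus-squared TB parameters, the quantity plotted in matrix visualizations.
t_squared = np.abs(h_gamma) ** 2
```

Because the synthetic real-space entries are real, the Γ-point matrix has a vanishing imaginary part, in line with the real coefficients expected at Γ.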

In the TB subset, each file contains the specific QE inputs and outputs, SOAP descriptors, and Bader/QE electronic charge distributions⁴⁹. The SOAP descriptors, which capture the local atomic environments within MOFs, are provided in two variations: SOAP-3 Å and SOAP-5 Å. The descriptors represent the atomic topology at different length scales. The TB embeddings for each atomic species in each MOF were computed at the Γ point. For visualizing the data and input generation, we include a collection of auxiliary Python scripts. Specifically, we provide scripts for generating self-consistent QE calculations which are based on Crystallographic Information Files (CIF) and metadata within the QMOF dataset. In addition, we make available the scripts for extracting the TB projection, for visualizing the projection matrix at the Γ point and for constructing the TB embeddings. The code contributions are designed to ensure reproducibility of the computational workflow.

In the EH subset, we provide 428 files with the computation of the Hubbard parameters, using the hp.x executable from QE, through the construction of the susceptibility matrix, the corresponding set of U and V values, as well as the QE input/output for DFT+U+V calculations. From these files, 186 refer to the d-s perturbations and the remaining 242 are for d-p perturbations. For enabling accessibility, we provide JSON files containing the inputs and outputs of the QE calculations. In addition, Python scripts are included for reproducing the graphics, for generating JSON files as well as the QE input files used in the ground-state calculations.

Technical Validation

Diversity of structures

Before using first-principles simulation data for training machine learning models, we investigate the chemical diversity of the data sets. We created two distinct selections of materials data, each aiming at a different outcome. While the TB dataset is approximately half the size of the original QMOF dataset, the EH dataset contains, due to the higher computational cost, a much smaller number of materials. In Fig. 3(b)–(e), we present density plots of the materials distributions with regards to representative structural and electronic properties, for the QMOF dataset as well as the two data sets provided in this work. For all properties analyzed, the TB data exhibits a diverse distribution similar to the original dataset while the EH data differs, mainly due to the much smaller number of materials included.

Another important feature is the distribution of transition metals, since their presence in metal-organic frameworks is essential to the application of the EH model. The density histogram for the two data sets and the QMOF dataset is plotted in Fig. 3(a) for the structures that contain transition metal atoms. By design, the EH data set exhibits a concentration of MOF structures containing Zr and Hf atoms. Note that Zr-based MOFs are of particular interest from an application perspective^50,51,52,53,54 due to the discovery of UiO-66’s high hydrothermal stability⁵⁵. Similar structures have shown application potential as electrochemical sensors and biosensors⁵⁶, as well as catalysts^57,58. While the inter-site V parameters can provide insights into the metal-organic hybridization, the tight-binding parameters can be explored for investigating topological properties.

In the inset of Fig. 3(a), we present a two-dimensional Principal Component Analysis (PCA)⁵⁹ projection of the tight-binding (TB) embeddings. The explained variance ratios for the first two and first four principal components are 0.90 and 0.93, respectively, indicating that a large fraction of the variance is captured in low-dimensional space. Interestingly, clusters corresponding to different TB embeddings of the same metal emerge, suggesting that the embeddings capture metal-specific electronic structure characteristics. In addition, for providing a broader view of the chemical diversity, we have included a t-distributed Stochastic Neighbor Embedding (t-SNE) plot⁶⁰, constructed using the following features: number of atoms per unit cell, pore limiting diameter (PLD), largest cavity diameter, mass density, volume, band gap, and atomic number of the transition metal. In Fig. 1(b), we observe a concentration of the EH data set, while the TB data set is spread out across the QMOF dataset, indicating a good representation of the source dataset. To visualize the diversity in the vicinity of metal atoms across the dataset, we computed a t-SNE projection of SOAP-3 Å descriptors, using PCA for the initial dimensionality reduction, see Fig. 1(c). The QMOF dataset as well as the EH and TB data sets form distinct clusters in the reduced space, suggesting that differences in structural motifs and generation protocols are well captured by the SOAP representation.

The diversity analysis is useful for differentiating the two data sets with regards to their application potential. While the TB projection provides topological data for exploring a broader range of metal-organic frameworks, the EH data provides Hubbard parameters for the focused investigation of Zr- and Hf-based MOFs.

Tight-binding matrix

By using the PAOFLOW software, we are able to represent metal-organic framework structures with tight-binding Hamiltonians and project the electronic density onto localized atomic orbitals. We performed the projection for each k-point in the grid used in the ground state calculation, where the Hamiltonian tensor can be obtained both in real and reciprocal space. For simplicity, we have chosen the Γ point to visualize the matrix shown in Fig. 2 for two representative MOFs: Fe Pt C₈ H₄ N₆ (or qmof-3dfbcbd) and Cd Ni C₈ H₁₂ N₆ (or qmof-4d9a98c). The coefficients are always real numbers at the Γ point, however, we might obtain complex numbers at other k-points. In the graphics, we plot the modulus-squared TB parameters ∣t_ij∣². The localized atomic orbitals are the valence orbitals within the pseudopotential for every atom in the unit cell, which typically are s, p and d orbitals. Thus, the TB matrix can be divided into blocks, where diagonal blocks represent interactions among orbitals of the same chemical element, and off-diagonal blocks represent the interactions between different valence orbitals of different elements.

The false-color image visualizes the strength of the TB parameters, which is an indicator of nearest-neighbor interactions. In Fig. 2(a), the TB parameters correctly indicate the hopping between Fe-N, N-C, C-H and Pt-C, as verified with the corresponding MOF structures. In Fig. 2(b), the relevant hopping terms are Ni-C, C-H, C-N, Cd-N and N-H. Note that we have computed both U and V parameters for the Ni atoms and the Ni-N bonds in the MOF structure shown. However, we have not performed any Hubbard calculations for the Cd atoms.

Since interactions among intra-site orbitals might be strong, the maximum value of ∣t_ij∣² in the color bar has been set to 0.5 for facilitating the data visualization. The absolute maximum values are 121.8 eV² and 42.4 eV² for qmof-3dfbcbd (Fig. 2(a)) and qmof-4d9a98c (Fig. 2(b)), respectively. Based on the results obtained, we conclude that the representation of MOFs in TB lattice Hamiltonians can provide useful information on MOF topology and hybridization.

Hubbard parameters

Modeling MOFs using the EH Hamiltonian requires not only TB parameters, but also intra-site U and inter-site V Hubbard parameters. U is associated with a transition metal d-orbital and V refers to the interaction between the transition metal d-orbital and one of its nearest-neighbor orbitals. We provide two sets of values for U and V, here defined as d-p and d-s perturbations referring to the manifolds perturbed. In Fig. 4(a), the distribution of U values is plotted for each material of the EH data set. While a few structures may contain more than one transition metal species, we only plot one metal per MOF for simplicity. Interestingly, we observe that U increases with the atomic number for elements within the same row of the periodic table. Also, depending on the metal species, the intra-site parameter can have a large dispersion, as is the case for Ag and Cu, or very similar values in different environments, as is the case for Zr, Hf and Y. While U generally refers to the same d orbital, performing the inter-site perturbation on p or s orbitals for computing V can alter the U outcome, yielding systematically smaller values for d-s perturbations.

The V distribution per metal-organic interaction is shown in Fig. 4(b). For clarity, we plot one V value per structure, which is equivalent to the average nearest-neighbors ⟨V⟩. The data exhibits a large dispersion, where we observe positive values for d-p perturbations, and relatively small, negative values for d-s perturbations.

In view of applications, the sets of U and V values provided in this work are aimed at supporting tight-binding modeling of metal-organic frameworks. They can be utilized in the training of machine-learning models and for exploratory data analysis. In addition, the EH data might support applications of quantum computing. In one example, the data are used as input for computing the band gap of representative semiconductors in a quantum-centric materials simulation workflow⁶¹.

Band gap predictions

Standard DFT can be combined with higher-level hybrid functionals to correct the self-interaction error⁶² in systems with strong electronic correlations³¹. The DFT+U+V methodology is a computationally efficient alternative to improve band gap predictions, which typically fail under GGA. Stronger electronic correlations could occur in MOFs containing transition metals with localized orbitals, which can be explored in our data contribution.

In Fig. 4(c,d), we show a comparison of the band gap energies computed using DFT and DFT+U+V for d-p and d-s perturbations, respectively. In both cases, we observe a large concentration of materials along the diagonal, indicating that band gap energies of most structures remain unaffected by DFT+U+V corrections. While this result is surprising, we note that most materials in the data set contain Zr or Hf. For other materials, the band gap systematically increases, as expected. For additional information, we refer the reader to the Supporting Information.

Downstream applications in machine-learning

The combination of TB embeddings and SOAP descriptors offers a powerful approach for generative materials discovery. In this framework, TB embeddings serve as a compact representation of the electronic structure, where a generative model would explore the parameter space proposing novel structures. SOAP descriptors could assist in reconstructing structural information, acting as an auxiliary tool for decoding atomic arrangements that may not be captured by the TB embeddings alone. Overall, this allows for predicting interactions and elemental species based on the TB embeddings while simultaneously leveraging the SOAP descriptors for refining structural details.

For investigating this scenario, we have analyzed the metal atoms present in the TB data set. We have grouped atoms that are equivalent by symmetry, including those that appear distinct based on their CIF files, if they occupied nearly equivalent sites. From each group, we have selected one representative atom to ensure that the dataset remains balanced, avoiding over-representation of redundant entries.

We have constructed the TB embeddings by selecting the blocks that contain the six strongest interactions for each metal atom, see Methods section for details. Each embedding consists of six 13 × 13 blocks, resulting in a total vector size of 1014. The dataset contains 21,186 entries. In Fig. 5(a), we show a visualization of the TB embedding for a specific atom.

We have trained a RandomForestRegressor⁶³ for predicting SOAP descriptors based on TB embeddings in a four-dimensional PCA-reduced space. To that end, we have used the reduced TB embeddings as input features and the full SOAP vectors as targets. For testing, the same PCA transformation computed on the training set is applied to the unseen test samples before making predictions. This approach ensures that no information from the test set is leaked into the training process. The same settings are applied to the prediction of SOAP-3 Å and of SOAP-5 Å descriptors. We have used an 85:15 train-test split and set the number of estimators to 100. Further details are provided in the Methods section.
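The training protocol above can be sketched with scikit-learn as follows. The arrays are random stand-ins with the dimensions quoted in the text (1014 input values, 684 SOAP-3 Å targets); the sample count, the random seeds and the use of a `Pipeline` to enforce the train-only PCA fit are assumptions, not the authors' code.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n_samples = 120                         # small synthetic sample count
X = rng.normal(size=(n_samples, 1014))  # stand-in TB embeddings
Y = rng.normal(size=(n_samples, 684))   # stand-in SOAP-3 A descriptors

# 85:15 train-test split, as stated in the text.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.15, random_state=0
)

# The pipeline fits PCA on the training inputs only and reuses that same
# transformation at prediction time, so no test information leaks into training.
model = make_pipeline(
    PCA(n_components=4),
    RandomForestRegressor(n_estimators=100, random_state=0),
)
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)

# Mean Euclidean distance between predicted and true SOAP vectors.
mean_distance = np.linalg.norm(Y_pred - Y_test, axis=1).mean()
```

On random data the resulting error carries no meaning; the sketch only shows the order of operations and the error metric.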

In Fig. 5(b), we show the distribution of pairwise Euclidean distances obtained for SOAP feature vectors using cutoff radii of 3 and 5 Å, respectively, indicated in red and blue. The vertical dotted lines mark the mean error of the Euclidean distance in the test set predictions using the six strongest blocks. The error made for SOAP-3 Å falls within the lower range of its overall distance distribution, indicating that the predicted embeddings remain relatively close to their true values. For SOAP-5 Å, the error is slightly higher, as expected due to the larger environment being predicted. However, it still falls within the lower part of its distribution and provides an acceptable level of predictive accuracy.

To assess the predictive power of the TB embeddings, we have progressively increased the number of included blocks from 1 to 10 and applied PCA to each variation. This approach allows us to evaluate how many blocks are necessary to achieve accurate SOAP predictions. Fig. 5(c) shows the mean Euclidean distance error between the predicted and actual SOAP vectors as a function of the number of included blocks. We observe that the error decreases significantly when the number of blocks increases from 3 to 6, but remains fairly constant when further blocks are added. This suggests that six blocks capture sufficient information for predicting SOAP descriptors.

From an application perspective, the TB embeddings can be used to identify species involved in interactions. When combined with SOAP descriptors, TB embeddings enable the resolution of the entire material composition. In the context of MOFs, the SOAP representation can be employed for searching similar structures within the existing MOF building blocks, i.e., metal clusters and organic linkers. This would enable the reconstruction and validation of MOF structures through simulations, facilitating material property optimization as well as structural analysis. In the Supporting Information, we present an example illustrating this process in detail.

Prediction of Hubbard U and V Parameters from TB Embeddings

In the following, we assess the utility of tight-binding (TB) embeddings for predicting the Hubbard intra-site U and inter-site V parameters. In our setup, each 13 × 13 TB matrix block serves as input for a regression model predicting Hubbard parameter values.

The intra-site Hubbard U_i parameter corresponds to diagonal blocks representing electron interactions on atom i. The inter-site V_ij parameters are derived from off-diagonal blocks representing interactions between atoms i and j within the same unit cell. For simplicity, we treat the intra-site Hubbard parameter U_i ≡ V_ii as a special (diagonal) case of the inter-site parameter. For each MOF, we have selected the 10 strongest TB blocks – ranked by the magnitude of their largest matrix element – for building the training data set.
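The block selection described above can be sketched as below, ranking blocks by the magnitude of their largest matrix element and treating the diagonal blocks (U) as a special case of the off-diagonal ones (V). The function name, the record layout, the uniform 13-orbital block per atom, and the restriction to pairs with i ≤ j (using Hermitian symmetry) are assumptions for the example.

```python
import numpy as np

BLOCK = 13  # block size quoted in the text


def strongest_blocks(hamiltonian, top_k=10, block=BLOCK):
    """Return the top_k atom-pair blocks ranked by largest |matrix element|."""
    n_atoms = hamiltonian.shape[0] // block
    records = []
    for i in range(n_atoms):
        # Assumption: block (j, i) mirrors block (i, j), so only i <= j is kept.
        for j in range(i, n_atoms):
            sub = hamiltonian[i * block:(i + 1) * block,
                              j * block:(j + 1) * block]
            records.append({
                "i": i,
                "j": j,
                "kind": "U" if i == j else "V",   # U_i is treated as V_ii
                "strength": float(np.abs(sub).max()),
                "features": sub.ravel(),          # 169 input features
            })
    records.sort(key=lambda r: r["strength"], reverse=True)
    return records[:top_k]


# Toy symmetric matrix standing in for a Gamma-point TB Hamiltonian.
rng = np.random.default_rng(2)
n_atoms = 6
h = rng.normal(size=(n_atoms * BLOCK, n_atoms * BLOCK))
h = 0.5 * (h + h.T)

top = strongest_blocks(h)
for record in top[:3]:
    print(record["kind"], record["i"], record["j"], round(record["strength"], 3))
```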

To avoid ambiguity in periodic systems, we restrict our analysis to Hamiltonians evaluated at the Γ point and apply the minimum image convention. This means that, for any given atom i, an interaction is considered only if the associated atom j lies closer to i than any periodic image of i. This cutoff ensures that only physically relevant, short-range interactions are included.
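One way to read this criterion is sketched below: a pair (i, j) is kept only if the minimum-image distance between i and j is shorter than the distance from i to its own nearest periodic image. The function names are hypothetical, and the search over the 27 neighboring cells assumes a cell that is not strongly skewed.

```python
import itertools

import numpy as np

# Integer translations to the 27 neighboring cells, including the home cell.
# Assumption: this range is sufficient, which holds for cells that are not
# strongly skewed.
SHIFTS = np.array(list(itertools.product((-1, 0, 1), repeat=3)))


def nearest_self_image(cell):
    """Shortest distance from an atom to any of its own periodic images."""
    nonzero = SHIFTS[np.any(SHIFTS != 0, axis=1)]
    return np.linalg.norm(nonzero @ cell, axis=1).min()


def minimum_image_distance(frac_i, frac_j, cell):
    """Shortest distance between atom i and any periodic image of atom j."""
    delta = (np.asarray(frac_j, dtype=float) - np.asarray(frac_i, dtype=float)) % 1.0
    candidates = (delta + SHIFTS) @ cell   # rows of `cell` are lattice vectors
    return np.linalg.norm(candidates, axis=1).min()


def keep_pair(frac_i, frac_j, cell):
    """True if j is closer to i than any periodic image of i."""
    return minimum_image_distance(frac_i, frac_j, cell) < nearest_self_image(cell)


# Orthorhombic toy cell (lengths in angstrom) with one short axis.
cell = np.diag([5.0, 20.0, 20.0])
origin = [0.0, 0.0, 0.0]
print(keep_pair(origin, [0.4, 0.0, 0.0], cell))   # pair separated by 2 angstrom
print(keep_pair(origin, [0.0, 0.4, 0.0], cell))   # pair separated by 8 angstrom
```

In this toy cell the nearest self-image lies 5 Å away along the short axis, so the first pair passes the test and the second does not.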

The training data is taken from the extended_hubbard_model/dp_perturbations subfolder. It contains 240 unique MOFs after the exclusion of two MOFs with anomalously high V values. For each MOF in the data set, we selected the U and V parameters corresponding to the first metal atom appearing in the structure. Of the original 9,754 entries, we selected the 2,386 with the highest absolute V values per MOF. We then split the data at the MOF level to ensure that the test sets contained only unseen materials, thus preventing data leakage.

We trained a single RandomForestRegressor model to predict both U and V values, without distinguishing between them. To validate the model’s predictive capabilities, we performed a 10-fold cross-validation using the top 10 strongest TB embedding blocks per MOF. We trained the model with 100 estimators and default hyperparameters, using the embeddings as input features and the Hubbard V values (with U_i ≡ V_ii) as targets. We implemented the cross-validation in Python using scikit-learn’s RandomForestRegressor and KFold utilities. Further implementation specifics, such as the selection of top-k blocks and data pre-processing, are described in the Methods section.
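A minimal sketch of such a MOF-level 10-fold cross-validation is given below. The data are synthetic, and applying KFold to the list of unique MOF identifiers (rather than to individual entries) is one possible way to keep all blocks of a MOF in the same fold; it is not necessarily how the authors' scripts are organized. The variable names and the toy target are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n_mofs, blocks_per_mof, n_features = 30, 10, 13 * 13

# Synthetic stand-ins: one row per TB block, labelled with its parent MOF.
mof_ids = np.repeat(np.arange(n_mofs), blocks_per_mof)
X = rng.normal(size=(n_mofs * blocks_per_mof, n_features))
y = np.abs(X[:, :5]).sum(axis=1) + 0.05 * rng.normal(size=len(X))  # toy target

# Split the list of MOFs, not the list of entries, so that every block of a
# given MOF ends up on the same side of the split.
unique_mofs = np.unique(mof_ids)
kfold = KFold(n_splits=10, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in kfold.split(unique_mofs):
    train_mask = np.isin(mof_ids, unique_mofs[train_idx])
    test_mask = np.isin(mof_ids, unique_mofs[test_idx])

    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X[train_mask], y[train_mask])
    pred = model.predict(X[test_mask])

    scores.append((
        r2_score(y[test_mask], pred),
        mean_absolute_error(y[test_mask], pred),
        mean_squared_error(y[test_mask], pred),
    ))

r2, mae, mse = np.array(scores).T
print(f"R2: mean {r2.mean():.3f}, min {r2.min():.3f}, max {r2.max():.3f}")
print(f"MAE: {mae.mean():.3f}, MSE: {mse.mean():.3f}")
```

scikit-learn's GroupKFold offers the same guarantee with the MOF identifiers passed as groups, at the cost of not shuffling the group assignment in older versions.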

Even though the data are limited, we observe that the model exhibits reasonable predictive performance. The average coefficient of determination R² is 0.914, with minimum and maximum values of 0.716 and 0.989, respectively. We obtain an average test mean absolute error (MAE) of 0.134 and mean squared error (MSE) of 0.179 across folds.

The modeling results are shown in Fig. 6. Overall, they demonstrate that the TB blocks carry the critical information with regard to electronic interactions in MOFs, and that both U and V values can be predicted robustly based on TB embeddings across a chemically diverse set of MOF structures.

Data availability

The dataset is available at https://dataverse.harvard.edu/dataverse/tbhubbard⁴⁸.

Code availability

Auxiliary scripts for creating the datasets, for generating the QE input files, and for plotting the figures are available in the dataset repository https://dataverse.harvard.edu/dataverse/tbhubbard⁴⁸.

References

Li, H. et al. Recent advances in gas storage and separation using metal-organic frameworks. Materials Today 21, 108–121 (2018).
Zhao, H., Dong, J., Chen, S., Wang, H. & Zhao, G. Metal-organic frameworks and their composites for carbon dioxide capture: Recent advances and challenges. Fuel 378, 132973 (2024).
Iliescu, A., Oppenheim, J. J., Sun, C. & Dincǎ, M. Conceptual and practical aspects of metal-organic frameworks for solid-gas reactions. Chemical Reviews 123, 6197–6232 (2023).
Bavykina, A. et al. Metal-organic frameworks in heterogeneous catalysis: Recent progress, new trends, and future perspectives. Chemical Reviews 120, 8468–8535 (2020).
Sezgin, P. et al. Biomedical applications of metal-organic frameworks revisited. Industrial & Engineering Chemistry Research 64, 1907–1932 (2025).
Abánades Lázaro, I. et al. Metal-organic frameworks for biological applications. Nature Reviews Methods Primers 4, 42 (2024).
Check, B., Bairley, K., Santarelli, J., Pham, H. T. B. & Park, J. Applications of electrically conductive metal-organic frameworks: From design to fabrication. ACS Materials Letters 7, 465–488 (2025).
Xie, L. S., Skorupskii, G. & Dincă, M. Electrically conductive metal-organic frameworks. Chemical Reviews 120, 8536–8580 (2020).
Fujie, K., Ikeda, R., Otsubo, K., Yamada, T. & Kitagawa, H. Lithium ion diffusion in a metal-organic framework mediated by an ionic liquid. Chemistry of Materials 27, 7355–7361 (2015).
Wu, Z. et al. A metal-organic framework based quasi-solid-state electrolyte enabling continuous ion transport for high-safety and high-energy-density lithium metal batteries. ACS Applied Materials & Interfaces 15, 22065–22074 (2023).
Kreno, L. E. et al. Metal-organic framework materials as chemical sensors. Chemical Reviews 112, 1105–1125 (2012).
Chang, K. et al. Advances in metal-organic framework-plasmonic metal composites based sers platforms: Engineering strategies in chemical sensing, practical applications and future perspectives in food safety. Chemical Engineering Journal 459, 141539 (2023).
Yusuf, V. F., Malek, N. I. & Kailasa, S. K. Review on metal-organic framework classification, synthetic approaches, and influencing factors: Applications in energy, drug delivery, and wastewater treatment. ACS Omega 7, 44507–44531 (2022).
Lee, S. et al. Computational screening of trillions of metal–organic frameworks for high-performance methane storage. ACS Applied Materials & Interfaces 13, 23647–23654 (2021).
Lee, S. et al. Computational screening of trillions of metal-organic frameworks for high-performance methane storage (1753). https://acs.figshare.com/collections/Computational_Screening_of_Trillions_of_Metal_Organic_Frameworks_for_High-Performance_Methane_Storage/5424948/1.
Moghadam, P. Z. et al. Development of a cambridge structural database subset: a collection of metal–organic frameworks for past, present, and future. Chemistry of materials 29, 2618–2625 (2017).
Moghadam, P. Z. et al. Development of a Cambridge Structural Database Subset: A Collection of Metal-Organic Frameworks for Past, Present, and Future. https://acs.figshare.com/articles/dataset/Development_of_a_Cambridge_Structural_Database_Subset_A_Collection_of_Metal_Organic_Frameworks_for_Past_Present_and_Future/4794040 (2017).
Mancuso, J. L., Mroz, A. M., Le, K. N. & Hendon, C. H. Electronic structure modeling of metal-organic frameworks. Chemical Reviews 120, 8641–8715 (2020).
Liu, Y., Dong, Y. & Wu, H. Comprehensive overview of machine learning applications in mofs: from modeling processes to latest applications and design classifications. J. Mater. Chem. A 13, 2403–2440 (2025).
Han, X.-Q. et al. Ai-driven inverse design of materials: Past, present and future. Chinese Physics Letters (2025).
Park, H., Majumdar, S., Zhang, X., Kim, J. & Smit, B. Inverse design of metal-organic frameworks for direct air capture of co₂ via deep reinforcement learning. Digital Discovery 3, 728–741 (2024).
Boyd, P. G. et al. Data-driven design of metal-organic frameworks for wet flue gas co₂ capture. Nature 576, 253–256 (2019).
Jablonka, K. M., Ongari, D., Moosavi, S. M. & Smit, B. Big-data science in porous materials: Materials genomics and machine learning. Chemical Reviews 120, 8066–8129 (2020).
Zhao, G. et al. Computation-ready experimental metal-organic framework (core mof) 2024 dataset (2025).
Zhao, G. et al. Core mof db: A curated experimental metal-organic framework database with machine-learned properties for integrated material-process screening. Matter 8, 102140 (2025).
Gibaldi, M. et al. Mosaec-db: a comprehensive database of experimental metal-organic frameworks with verified chemical accuracy suitable for molecular simulations. Chem. Sci. 16, 4085–4100 (2025).
Gibaldi, M. et al. Mosaec mof database (mosaec-db). https://doi.org/10.5281/zenodo.14025238 (2024).
Buongiorno Nardelli, M. et al. Paoflow: A utility to construct and operate on ab initio hamiltonians from the projections of electronic wavefunctions on atomic orbital bases, including characterization of topological materials. Computational Materials Science 143, 462–472 (2018).
Cerasoli, F. T. et al. Advanced modeling of materials with paoflow 2.0: New features and software design. Computational Materials Science 200, 110828 (2021).
Leiria Campo, V. & Cococcioni, M. Extended dft + u + v method with on-site and inter-site electronic interactions. Journal of Physics: Condensed Matter 22, 055602 (2010).
Pavarini, E. Solving the strong-correlation problem in materials. La Rivista del Nuovo Cimento 44, 597–640 (2021).
Cho, Y. & Kulik, H. J. Improving gas adsorption modeling for mofs by local calibration of hubbard u parameters. The Journal of Chemical Physics 160, 154101 (2024).
Mann, G. W., Lee, K., Cococcioni, M., Smit, B. & Neaton, J. B. First-principles hubbard u approach for small molecule binding in metal-organic frameworks. The Journal of Chemical Physics 144, 174104 (2016).
Timrov, I., Marzari, N. & Cococcioni, M. Hp - a code for the calculation of hubbard parameters using density-functional perturbation theory. Computer Physics Communications 279, 108455 (2022).
Bastonero, L. et al. First-principles hubbard parameters with automated and reproducible workflows. npj Computational Materials 11, 183 (2025).
Yu, M., Yang, S., Wu, C. & Marom, N. Machine learning the hubbard u parameter in dft+u using bayesian optimization. npj Computational Materials 6, 180 (2020).
Uhrin, M., Zadoks, A., Binci, L., Marzari, N. & Timrov, I. Machine learning hubbard parameters with equivariant neural networks. npj Computational Materials 11, 19 (2025).
Rosen, A. S. et al. Machine learning the quantum-chemical properties of metal-organic frameworks for accelerated materials discovery. Matter 4, 1578–1597 (2021).
Rosen, A. S. et al. High-throughput predictions of metal-organic framework electronic properties: theoretical challenges, graph neural networks, and data exploration. npj Computational Materials 8, 112 (2022).
Bartók, A. P., Kondor, R. & Csányi, G. On representing chemical environments. Physical Review B 87, 184115 (2013).
Giannozzi, P. et al. Quantum espresso: a modular and open-source software project for quantum simulations of materials. Journal of Physics: Condensed Matter 21, 395502 (19pp) (2009).
Giannozzi, P. et al. Advanced capabilities for materials modelling with quantum espresso. Journal of Physics: Condensed Matter 29, 465901 (2017).
Giannozzi, P. et al. Quantum espresso toward the exascale. The Journal of Chemical Physics 152, 154105 (2020).
Perdew, J. P., Burke, K. & Ernzerhof, M. Generalized gradient approximation made simple. Physical Review Letters 77, 3865–3868 (1996).
Grimme, S., Antony, J., Ehrlich, S. & Krieg, H. A consistent and accurate ab initio parametrization of density functional dispersion correction (dft-d) for the 94 elements h-pu. The Journal of Chemical Physics 132, 154104 (2010).
Dal Corso, A. Pseudopotentials periodic table: From h to pu. Computational Materials Science 95, 337–350 (2014).
Prandini, G., Marrazzo, A., Castelli, I. E., Mounet, N. & Marzari, N. Precision and efficiency in solid-state pseudopotential calculations. npj Computational Materials 4, 72 (2018).
Carvalho, P. C. & Zipoli, F. TBHubbard Dataset (2025).
Henkelman, G., Arnaldsson, A. & Jónsson, H. A fast and robust algorithm for bader decomposition of charge density. Computational Materials Science 36, 354–360 (2006).
Ahmad, K. et al. Engineering of zirconium based metal-organic frameworks (zr-mofs) as efficient adsorbents. Materials Science and Engineering: B 262, 114766 (2020).
El-Sayed, E.-S. M., Yuan, Y. D., Zhao, D. & Yuan, D. Zirconium metal-organic cages: Synthesis and applications. Accounts of Chemical Research 55, 1546–1560 (2022).
Bai, Y. et al. Zr-based metal-organic frameworks: design, synthesis, structure, and applications. Chem. Soc. Rev. 45, 2327–2367 (2016).
Daliran, S. et al. Defect-enabling zirconium-based metal-organic frameworks for energy and environmental remediation applications. Chem. Soc. Rev. 53, 6244–6294 (2024).
Gomez-Gualdron, D. A. et al. Computational design of metal-organic frameworks based on stable zirconium building units for storage and delivery of methane. Chemistry of Materials 26, 5632–5639 (2014).
Cavka, J. H. et al. A new zirconium inorganic building brick forming metal organic frameworks with exceptional stability. Journal of the American Chemical Society 130, 13850–13851 (2008).
Khosropour, H., Keramat, M., Tasca, F. & Laiwattanapaisal, W. A comprehensive review of the application of zr-based metal-organic frameworks for electrochemical sensors and biosensors. Microchimica Acta 191, 449 (2024).
Zhang, Q. et al. Zr-based metal-organic frameworks for green biodiesel synthesis: A minireview. Bioengineering 9 (2022).
AbouSeada, N., Elmahgary, M. G., Abdellatif, S. O. & Kirah, K. Synergistic integration of zirconium-based metal-organic frameworks and graphitic carbon nitride for sustainable energy solutions: A comprehensive review. Journal of Alloys and Compounds 1002, 175325 (2024).
Jolliffe, I. T. Principal Component Analysis. Springer Series in Statistics (Springer, New York, 2002), 2nd edn.
van der Maaten, L. & Hinton, G. Visualizing data using t-sne. Journal of Machine Learning Research 9, 2579–2605 (2008).
Duriez, A. et al. Computing band gaps of periodic materials via sample-based quantum diagonalization. https://arxiv.org/abs/2503.10901 (2025).
Janesko, B. G., Henderson, T. M. & Scuseria, G. E. Screened hybrid density functionals for solid-state chemistry and physics. Phys. Chem. Chem. Phys. 11, 443–454 (2009).
Breiman, L. Random forests. Machine Learning 45, 5–32 (2001).
Momma, K. & Izumi, F. Vesta: a three-dimensional visualization system for electronic and structural analysis. Journal of Applied Crystallography 41, 653–658 (2008).

Acknowledgements

We acknowledge funding (grant numbers 180544 and 225147) by NCCR Catalysis, a National Centre of Competence in Research funded by the Swiss National Science Foundation. We thank Gavin Jones (IBM) for project support and Aleksandros Sobczyk (IBM) for the compilation and optimization of the Quantum Espresso code on several HPC architectures. We also thank Ramon Cardias (CBPF) for introducing us to the PAOFLOW software and Marcio Costa (UFF) for fruitful discussions and guidance on the use of PAOFLOW.

Author information

These authors contributed equally: Pamela Costa Carvalho, Federico Zipoli.

Authors and Affiliations

IBM Research, São Paulo, 04007-900, SP, Brazil
Pamela Costa Carvalho
IBM Research Europe, Säumerstrasse 4, Rüschlikon, 8803, Zurich, Switzerland
Federico Zipoli
National Center for Competence in Research-Catalysis (NCCR-Catalysis), Zurich, Switzerland
Federico Zipoli
IBM Research, Rio de Janeiro, 20031-170, RJ, Brazil
Alan C. Duriez, Marco Antonio Barroca, Rodrigo Neumann Barros Ferreira & Mathias Steiner
IBM Quantum, IBM Research Almaden, San Jose, 95120, CA, USA
Barbara Jones
IBM Research, IBM T.J. Watson Research Center, Yorktown Heights, New York, 10598, NY, USA
Benjamin Wunsch

Contributions

P.C.C. developed the simulation workflow, created the dataset and wrote the manuscript. F.Z. developed the simulation workflow, TB embeddings, and predictive models, created the dataset and wrote the manuscript. A.C.D. and M.A.B. defined the dataset output requirements. R.N.B.F. developed the simulation workflow, analyzed the data and wrote the manuscript. B.J. (in memoriam) and B.W. proposed the creation of the dataset. M.S. proposed the creation of the dataset, analyzed the data and wrote the manuscript.

Corresponding author

Correspondence to Mathias Steiner.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supporting Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Costa Carvalho, P., Zipoli, F., Duriez, A.C. et al. TBHubbard: tight-binding and extended Hubbard model dataset for metal-organic frameworks. Sci Data 12, 1776 (2025). https://doi.org/10.1038/s41597-025-06054-w

Received: 30 June 2025
Accepted: 24 September 2025
Published: 12 November 2025
Version of record: 12 November 2025
DOI: https://doi.org/10.1038/s41597-025-06054-w