Background & Summary

The combination of first-principles calculation methods and high-throughput workflows has demonstrated potential in systematically predicting complex material properties and constructing large-scale material databases such as the Materials Project1, JARVIS2, AFLOW3, and C2DB4,5. These databases significantly accelerate the discovery and screening of functional materials with desirable properties for catalysis6,7, optoelectronics8,9, and quantum defects10,11.

While these existing databases provide well-organized material information, they mainly focus on fundamental properties such as the formation energy, band gap, and thermodynamic stability. Data on other functional properties that are commonly used to assess the performance of materials in device applications remain limited. For example, the optical conductivity (the dielectric function) quantifies the linear relation between the induced electric current (polarization) and incident light of an arbitrary frequency12, which helps to determine the absorption and reflectance of materials and to reveal exotic electronic structures, such as topological nodal points and the chirality of materials, in experiments13,14. In addition, the shift current response, which is a second-order optical effect, describes the photocurrent generated by the shift of wave functions in non-centrosymmetric materials15,16. As a major component of the bulk photovoltaic effect, it provides an alternative approach to overcome the Shockley-Queisser limit in conventional photovoltaic devices such as the p-n junction17,18. Furthermore, transport properties, such as the electrical conductivity, the thermal conductivity, and the Seebeck coefficient, characterize the ability of materials to conduct electric and heat currents19; their combination produces the thermoelectric figure of merit (zT), which describes the maximum efficiency with which a material converts heat into electricity20,21,22. Therefore, a dataset with the above-mentioned optical and transport properties can greatly help to identify promising materials, including those that have not been well characterized in experiments, for advanced device applications.

Furthermore, since 2018, machine learning models such as graph neural networks have been used to predict various properties of solid-state materials23,24,25,26,27. However, those models have primarily focused on scalar properties such as the formation energy and the band gap23,24,25, and only a few have been applied to predict spectral properties such as the optical conductivity (as a function of photon frequency)28 or the thermal conductivity (as a function of temperature), even though established model architectures for sequential target prediction already exist in the machine learning field29,30. This limitation arises from the lack of sufficient high-fidelity data on the spectral properties of materials, which are often required for training deep machine learning models.

While previous attempts were made to construct material datasets with the above-mentioned optical and transport properties, those datasets contain a limited amount of data and include entries only for first-order optical properties such as the optical conductivity and the dielectric function1,31,32,33,34,35,36,37,38. Besides, these datasets mostly report the properties in scalar form, such as the trace of the optical response tensor. These scalar properties are mostly useful for isotropic materials. For anisotropic materials, the full response tensors of the optical and transport properties are needed to assess their performance in experimental setups accurately, because the tensor components are closely related to the space group symmetry and reflect the directional dependence of the physical responses to external fields39,40,41,42. A notable exception is the work of Ricci et al.43, which provides tensorial transport properties for a large number of materials at different doping levels. However, due to the computational cost, the calculations were performed using a relatively coarse k-point grid, which may affect the numerical accuracy for a small fraction of materials. The scarcity of high-quality tensorial property datasets poses a major challenge for the development of machine learning models. Although recent advances in equivariant graph neural networks have shown potential, in principle, for predicting tensorial properties44, they have rarely been studied in practice due to the lack of high-quality tensorial property data, including second-rank tensors (such as the optical conductivity, electrical conductivity, and thermal conductivity tensors) and third-rank tensors (such as the shift current response tensor). Therefore, the development of reliable datasets for those tensorial spectral properties is essential to the design and application of advanced machine learning models in materials science.

Based on these motivations, in this work we present a database of the tensorial optical and transport properties of 7301 materials calculated from the tight-binding Hamiltonian in the basis of maximally localized Wannier functions. Our high-throughput workflow automatically chooses the projection functions and the optimal energy windows to construct the Wannier functions by analyzing the chemical orbital character of the electronic states near the Fermi level. After excluding the entries whose Wannier functions are not well localized, we calculated the optical properties as a function of photon energy up to 2.5 eV, including the optical conductivity and shift current response tensor, and the transport properties as a function of temperature ranging from 100 to 1000 K, including the tensorial electrical conductivity, Seebeck coefficient, thermal conductivity (with only electron contributions), and figure of merit (zT). We applied this workflow to all nonmagnetic elemental and binary materials with fewer than 20 atomic sites and with band gap less than 1 eV from the Materials Project1, which results in 10142 material entries in our initial set. After excluding those with unconverged relaxation calculations, 9235 entries remain. By further comparing the band structures from first-principles calculations and from the Wannier tight-binding Hamiltonian, and excluding those with a mean absolute error (MAE) of the band energy difference greater than 0.05 eV, we finally arrived at 7301 material entries. The average MAE of materials in the final dataset is 0.01 eV, suggesting the high quality of the constructed Wannier functions and the reliability of the calculated optical and transport properties. The presented workflow and dataset serve as a foundation for future data-driven discovery of functional materials and the development of advanced machine learning models for predicting complex material properties.

Methods

We performed our calculations using the Vienna Ab initio Simulation Package (VASP, version 6.4.2)45,46 and the Wannier90 package (version 3.1.0)47. The overall workflow is shown in Fig. 1.

Fig. 1

A schematic plot of the high-throughput computational workflow to calculate tight-binding Hamiltonian and corresponding optical and transport properties. The pink, blue, and green blocks represent the first-principles calculation, automatic Wannierization, and optical and transport property calculation stages.

First-Principles Calculations

To construct our dataset, we first selected all elemental and binary nonmagnetic material entries from the Materials Project1. We excluded the materials with more than 20 atomic sites or with elements heavier than bismuth, and we restricted the selection to materials with band gap less than 1 eV. For each material entry, we first symmetrized the structure according to the procedure in Ref. 48 and then performed several chained first-principles calculations to obtain the ground-state charge density and wave functions, as well as the electronic band structures along the high-symmetry lines for subsequent comparison with the band structures calculated from the Wannier tight-binding Hamiltonian. At this stage, material entries with unconverged structural relaxations were excluded.

The first-principles calculations were performed using the Vienna Ab initio Simulation Package45,46 with projector augmented wave pseudopotentials49,50. We used the Perdew-Burke-Ernzerhof functional in the generalized gradient approximation for all calculations51. The kinetic energy cutoff for the plane-wave basis sets is 550 eV. The force convergence threshold is 0.01 eV/Å, and the energy convergence threshold is \(10^{-7}\) eV. For structural relaxation and self-consistent static calculations, we used a k-point grid with a density of 8 k-points per Å along each Cartesian direction, while for the non-self-consistent calculation we chose a density of 15 k-points per Å along each Cartesian direction. All calculations were spin-unpolarized, since we only selected nonmagnetic materials, and were performed without the spin-orbit coupling effect. To account for the strong correlation effect, we employed the DFT+U method52,53, where the U values are adapted from the high-throughput workflow of the Materials Project1,54 and summarized in Table 1.

Table 1 The U values used for the DFT+U method.

Automatic Wannierization

To construct the tight-binding Hamiltonian, we chose the basis of maximally localized Wannier functions, which provide an efficient and physically intuitive representation of the electronic states55,56,57. Constructing high-quality Wannier functions requires identifying the leading chemical orbitals that contribute to the bands near the Fermi level, which serve as projection functions, and choosing proper disentanglement and frozen energy windows. We followed the procedure in Ref. 58. First, we analyzed the total density of states from the static calculations and selected an energy range \([E_{\min}, E_{\max}]\) which covers the energy range \([E_F - 2.5\ \mathrm{eV}, E_F + 2.5\ \mathrm{eV}]\), where \(E_F\) is the Fermi energy. The value of 2.5 eV corresponds to the maximal photon energy in our optical property calculations (see next section), ensuring that the tight-binding Hamiltonian contains all states within this energy range and thus the accuracy of our calculated properties. We also required that the bands within the chosen energy range \([E_{\min}, E_{\max}]\) are separated from other band manifolds, especially the core states and the unphysical states considerably above the Fermi level. Second, we analyzed the projected density of states (PDOS) within \([E_{\min}, E_{\max}]\) and selected the chemical orbitals with the largest relative contribution to the total density of states as the projection functions to construct the Wannier functions. Finally, to choose the optimal disentanglement and frozen energy windows, we scanned the energy values within \([E_{\min}, E_{\max}]\) and selected those which produce the minimal spread of the Wannier functions.
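The orbital-selection step can be illustrated with a minimal sketch. It assumes the PDOS is available as a dictionary mapping a hypothetical orbital label (such as "Ga-p") to values sampled on a common energy grid, and it introduces a cutoff `min_fraction` that is purely illustrative; the actual selection rule and data structures of the workflow are not specified here.

```python
def integrated_weight(energies, values, e_min, e_max):
    """Trapezoidal integral of one PDOS channel over grid points in [e_min, e_max]."""
    pts = [(e, v) for e, v in zip(energies, values) if e_min <= e <= e_max]
    return sum(0.5 * (v0 + v1) * (e1 - e0) for (e0, v0), (e1, v1) in zip(pts, pts[1:]))


def rank_orbitals(energies, pdos, e_min, e_max):
    """Return (label, fraction) pairs sorted by relative weight in the window."""
    weights = {k: integrated_weight(energies, v, e_min, e_max) for k, v in pdos.items()}
    total = sum(weights.values())
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return [(label, w / total) for label, w in ranked]


def select_projections(energies, pdos, e_min, e_max, min_fraction=0.1):
    """Keep orbitals whose relative weight exceeds min_fraction (hypothetical cutoff)."""
    ranked = rank_orbitals(energies, pdos, e_min, e_max)
    return [label for label, frac in ranked if frac >= min_fraction]
```

With energies measured from the Fermi level, calling `select_projections(energies, pdos, -2.5, 2.5)` would return the dominant orbital labels in the window that matches the optical energy range.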

The quality of the constructed Wannier functions is mainly characterized by the maximum spread of the Wannier functions, which indicates their extent of localization57. In this work, we required that the maximum spread is less than \(1.5 \times \max_i\{a_i\}\), where \(a_i\) is the lattice constant along the Cartesian direction i = x, y, z. We also assessed the quality of the Wannier functions by comparing the band structures from first-principles calculations with those obtained from the Wannier function method, and we required that the MAE of the band energy difference is less than 0.05 eV. Only materials that satisfy both criteria are selected for subsequent optical and transport property calculations.
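The two acceptance criteria can be written as a short check. The sketch below assumes that the first-principles and Wannier bands have already been matched one-to-one on the same k-point path, and that the spreads are expressed in a unit directly comparable to the lattice constants, as in the criterion above; the function and argument names are illustrative only.

```python
def band_mae(dft_bands, wannier_bands):
    """Mean absolute band-energy difference (eV) over matched bands and k-points."""
    diffs = [
        abs(e_dft - e_wan)
        for band_dft, band_wan in zip(dft_bands, wannier_bands)
        for e_dft, e_wan in zip(band_dft, band_wan)
    ]
    return sum(diffs) / len(diffs)


def passes_quality_checks(spreads, lattice_constants, dft_bands, wannier_bands,
                          spread_factor=1.5, mae_tol=0.05):
    """Apply the spread criterion and the band-MAE criterion described in the text."""
    spread_ok = max(spreads) < spread_factor * max(lattice_constants)
    mae_ok = band_mae(dft_bands, wannier_bands) < mae_tol
    return spread_ok and mae_ok
```

An entry is kept only when both conditions hold, so a well-reproduced band structure with one poorly localized Wannier function is still rejected.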

Optical and Transport Property Calculations

For the material entries which satisfy the above spread and MAE criteria, we calculated their tensorial optical and transport properties. The optical properties, including the optical conductivity and the shift current, are functions of the photon energy ranging from 0 eV to 2.5 eV; this energy range is generally adequate to detect and validate exotic electronic structures using experimental techniques such as infrared spectroscopy or spectroscopic ellipsometry13. The optical conductivity tensor \(\sigma^{\alpha\beta}(\omega)\)59 and the shift current tensor \(\sigma^{\alpha\beta\gamma}(0; \omega, -\omega)\)60 are calculated as

$${\sigma }^{\alpha \beta }(\hslash \omega )=\frac{i{e}^{2}\hslash }{{(2\pi )}^{3}}\sum _{nm}\int \frac{{f}_{m{\bf{k}}}-{f}_{n{\bf{k}}}}{{E}_{m{\bf{k}}}-{E}_{n{\bf{k}}}}\frac{{v}_{nm}^{\alpha }({\bf{k}}){v}_{mn}^{\beta }({\bf{k}})}{{E}_{m{\bf{k}}}-{E}_{n{\bf{k}}}-(\hslash \omega +i\eta )}d{\bf{k}}$$
(1)
$$\begin{array}{rcl}{\sigma }^{\alpha \beta \gamma }(0;\hslash \omega ,-\hslash \omega ) & = & -\frac{i\pi {e}^{3}}{{(2\pi )}^{3}4{\hslash }^{2}}\sum _{nm}\int ({f}_{n{\bf{k}}}-{f}_{m{\bf{k}}})({I}_{mn}^{\alpha \beta \gamma }+{I}_{mn}^{\alpha \gamma \beta })\\ & & \times [\delta ({E}_{m{\bf{k}}}-{E}_{n{\bf{k}}}-\hslash \omega )+\delta ({E}_{n{\bf{k}}}-{E}_{m{\bf{k}}}-\hslash \omega )]d{\bf{k}}\end{array}$$
(2)

Here α, β, γ are Cartesian directions, m, n are band indices, \({f}_{m{\bf{k}}}\) and \({E}_{m{\bf{k}}}\) are the Fermi-Dirac occupation and the band energy, \({v}_{nm}^{\alpha }\) is the velocity matrix element, and \({I}_{mn}^{\alpha \beta \gamma }={r}_{mn}^{\beta }{r}_{nm}^{\gamma ;\alpha }\), where \({r}_{mn}^{\alpha }\) is the position matrix element, and \({r}_{nm}^{\gamma ;\alpha }\) is its generalized derivative with respect to \({k}_{\alpha }\). Finally, η is the smearing parameter, which is set to 0.05 eV; this value corresponds to the typical relaxation time of 10 fs in semiconductors and metals61,62.

The transport properties, including the electrical conductivity, the thermal conductivity, the Seebeck coefficient, and the thermoelectric figure of merit (zT), are calculated as functions of temperature ranging from 100 K to 1000 K19,63. Above this temperature range, those transport properties are strongly influenced by disorder effects, such as phase transitions, lattice vibrations, and configurational disorder, and may not be relevant for experimental validation. Besides, when calculating the transport properties, we chose the Fermi level as the intrinsic Fermi level from self-consistent calculations; therefore, the reported transport properties correspond to the condition of no chemical doping. We acknowledge that transport properties are strongly dependent on the chemical potential and doping effects; however, this dependence lies beyond the scope of the current work and will be addressed in future research. The transport function is defined as

$${\Sigma }^{\alpha \beta }(E)=\frac{1}{{(2\pi )}^{3}}\sum _{n}\int {v}_{n}^{\alpha }({\bf{k}}){v}_{n}^{\beta }({\bf{k}})\delta (E-{E}_{n,{\bf{k}}})\tau (n,{\bf{k}})d{\bf{k}}$$
(3)

where \(\tau (n,{\bf{k}})\) is the electron relaxation time. In this work, we adopted the constant relaxation time approximation and set τ = 10 fs, which is the typical relaxation time in metals and semiconductors61,62. More accurate descriptions of the relaxation time in materials involve solving for the electron self-energy through the electron-phonon coupling theory64,65,66,67. However, owing to the significant computational expense associated with the electron-phonon coupling methods in high-throughput calculations, we chose the constant relaxation time approximation in this study as a practical compromise, while acknowledging its limitations in capturing the electron scattering effects.

From the transport function, the kinetic coefficient tensors at different orders \({K}_{n}^{\alpha \beta }(T)\) (n = 0, 1, 2) are defined as

$${K}_{n}^{\alpha \beta }(T)=\int (-\frac{\partial f(E,T)}{\partial E}){\Sigma }^{\alpha \beta }(E){(E-{E}_{F})}^{n}dE$$
(4)

The electrical conductivity, Seebeck coefficient, and the thermal conductivity (with only electron contributions) tensors are calculated respectively as

$${\boldsymbol{\sigma }}(T)={e}^{2}{{\bf{K}}}_{0}(T)$$
(5)
$${\bf{S}}(T)=\frac{1}{eT}{{\bf{K}}}_{0}^{-1}(T){{\bf{K}}}_{1}(T)$$
(6)
$${\boldsymbol{\kappa }}(T)=\frac{1}{T}[{{\bf{K}}}_{2}(T)-{{\bf{K}}}_{0}^{-1}(T){{\bf{K}}}_{1}^{2}(T)]$$
(7)

Finally, we calculated the thermoelectric figure of merit along the three Cartesian directions α = x, y, z only,

$${({\rm{zT}})}_{\alpha }=\frac{{S}_{\alpha \alpha }^{2}{\sigma }_{\alpha \alpha }T}{{\kappa }_{\alpha \alpha }}$$
(8)
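Equations (4) to (8) can be illustrated for a single (scalar) tensor component with the following sketch. It is not the Wannier90 implementation: it assumes a transport function sampled on an energy grid in eV, uses a simple trapezoidal rule, and works in units where e = 1, so that the Seebeck coefficient comes out in V/K. In this scalar case Eq. (8) reduces to \({\rm{zT}}={K}_{1}^{2}/({K}_{0}{K}_{2}-{K}_{1}^{2})\), from which both e and a constant τ cancel.

```python
import math

K_B = 8.617333262e-5  # Boltzmann constant in eV/K


def minus_df_dE(energy, mu, temperature):
    """Return -df/dE of the Fermi-Dirac distribution, in 1/eV."""
    x = (energy - mu) / (K_B * temperature)
    if abs(x) > 60.0:  # the thermal window has decayed to ~0 here
        return 0.0
    return 1.0 / (4.0 * K_B * temperature * math.cosh(0.5 * x) ** 2)


def kinetic_coefficient(order, energies, sigma_e, mu, temperature):
    """Trapezoidal estimate of K_n(T) in Eq. (4) for one tensor component."""
    vals = [
        minus_df_dE(e, mu, temperature) * s * (e - mu) ** order
        for e, s in zip(energies, sigma_e)
    ]
    return sum(
        0.5 * (vals[i] + vals[i + 1]) * (energies[i + 1] - energies[i])
        for i in range(len(energies) - 1)
    )


def scalar_transport(energies, sigma_e, mu, temperature):
    """Seebeck coefficient, kappa/(sigma*T), and zT for a scalar transport function."""
    k0, k1, k2 = (
        kinetic_coefficient(n, energies, sigma_e, mu, temperature) for n in range(3)
    )
    seebeck = k1 / (temperature * k0)                       # Eq. (6), V/K with e = 1
    lorenz = (k2 - k1 ** 2 / k0) / (k0 * temperature ** 2)  # Eq. (7) / [Eq. (5) * T]
    zt = k1 ** 2 / (k0 * k2 - k1 ** 2)                      # Eq. (8), scalar case
    return {"seebeck": seebeck, "lorenz": lorenz, "zT": zt}
```

For an energy-independent transport function the sketch gives a vanishing Seebeck coefficient and recovers the Sommerfeld value \(({\pi }^{2}/3){k}_{B}^{2}\) for κ/(σT), which is a convenient sanity check before applying it to real data.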

The density of the k-point grid for optical and transport properties is chosen to be 100 k-points per Å along each Cartesian direction for materials with one atom in the unit cell and 50 k-points per Å along each Cartesian direction for others. The benchmark results over the k-point density can be found in the Technical Validation section.

Data Records

The dataset is available at the Figshare repository68. In the main directory, the “entries.csv” file contains the mp-id of the material in the Materials Project1, the chemical formula, the MAE of the band energy difference (eV), the calculated band gap (eV), and the projections used to generate the tight-binding Hamiltonian. The “INCAR_sample” file is a sample INCAR file with which users can generate the input files for the Wannier90 package47 by modifying the projection functions.

Each subdirectory, named by the mp-id of the material, contains the crystal structure (“mp-id.dat”), the band structures from first-principles methods (“VASP_bands.dat”) and from Wannier function methods (“wannier_bands.dat”), the real part (“optical_conductivity_real.dat”) and the imaginary part (“optical_conductivity_imaginary.dat”) of the optical conductivity, the shift current (“shift_current.dat”), the electrical conductivity (“electrical_conductivity.dat”), the Seebeck coefficient (“Seebeck_coefficients.dat”), the thermal conductivity (“thermal_conductivity.dat”), and the figure of merit zT (“ZT.dat”), all in tensorial form.

For the band structures, the high symmetry lines are generated according to Ref. 48, with 20 k-points per line segment. Each block within the band structure files represents a certain band, with the first column being the x-grid for the k-point path, and the second column being the band energies (with respect to the Fermi level). Note that the band structures from first-principles methods contain core electronic states. For optical conductivity and shift current, the first column represents the photon energy, and the remaining columns correspond to the components of the response tensor, as indicated by the file header. For electrical conductivity, Seebeck coefficient, thermal conductivity, and figure of merit zT, the first column represents the temperature, and the remaining columns correspond to the components of the response tensor.
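As an illustration of the column layout described above, the following sketch parses the text of a property file into its first column and the remaining tensor-component columns. It assumes whitespace-separated numeric columns and treats any line that does not parse as numbers as a header line; the exact header syntax of the files is an assumption here and should be checked against the dataset. The block-structured band structure files would need a different reader.

```python
def parse_property_table(text):
    """Split a property file into header lines, the x column, and component columns.

    Assumes whitespace-separated numeric rows; non-numeric lines are kept as headers.
    """
    header, rows = [], []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        try:
            rows.append([float(t) for t in tokens])
        except ValueError:
            header.append(line.strip())
    x = [row[0] for row in rows]  # photon energy (eV) or temperature (K)
    components = [list(col) for col in zip(*(row[1:] for row in rows))]
    return header, x, components
```

A file could then be read with `parse_property_table(open(path).read())`, where `path` points to, for example, an “electrical_conductivity.dat” file inside an mp-id subdirectory.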

In the dataset, the unit of photon energy is eV, that of temperature is K, that of the optical conductivity is 1/(Ω m), that of the shift current is μA/V², that of the electrical conductivity is 1/(Ω m), that of the Seebeck coefficient is V/K, and that of the thermal conductivity is W/(m K); the figure of merit zT is dimensionless.

In Fig. 2, we show the calculated optical conductivity, electrical conductivity, and shift current of four material entries (GaAs, InN, AsPd3, BaAs2) in our dataset, where each line represents a different tensor component. Note that in this work, we used the primitive cell, instead of the conventional cell, of each material to calculate the optical and transport properties. This could lead to different conventions for the definition of tensor components. For example, the xyz-component of the shift current of GaAs is negative between 0 and 2.5 eV in our work, whereas it is positive when the conventional cell is used60.

Fig. 2

Examples of the calculated optical conductivity, electrical conductivity, and shift current of GaAs, InN, AsPd3, and BaAs2.

Technical Validation

The initial set of materials from the Materials Project, chosen according to the criteria in the Methods section, contains 10142 entries. After excluding those whose relaxation calculations did not converge, this set reduces to 9235 entries. Furthermore, by excluding materials whose Wannier functions do not satisfy the spread and MAE criteria, we obtained a final dataset of 7301 entries.

As described in the Methods section, the maximal spread of the Wannier functions and the MAE of the band energy difference between the first-principles calculations and the Wannier function method are two important criteria to validate the quality of the Wannier functions. In Fig. 3(a), we show the ratio of the maximal spread of the Wannier functions to the maximum lattice constant for each material entry whose relaxation calculation converged; 82.1% of these entries satisfy the criterion for the Wannier function spread (below the red dashed line). The MAE of the band energy differences is shown in Fig. 3(b), and 80.5% of the entries satisfy the criterion for the band energy difference (left of the red dashed line). After combining both criteria, we arrived at 7301 entries, which constitute 79.1% of the relaxed entries, for our final dataset.

Fig. 3

(a) The ratio of the maximal spread to the maximum lattice constants for each material entry, and (b) the MAE of the band energy differences between the first-principles methods and the Wannier function method. The red lines represent our criteria for selecting materials for subsequent optical and transport property calculations.

Furthermore, we randomly selected 18 material entries with MAE smaller than 0.05 eV, and show the calculated band structures from first-principles calculations (black lines) and the Wannier function method (red lines) in Figs. 4 and 5. In Fig. 4, we show material entries where the MAEs are smaller than 0.02 eV; such entries constitute 66.8% of the 9235 materials whose relaxation calculations converged. The strong agreement between the two methods suggests the effectiveness and the universality of our high-throughput automatic Wannierization workflow. In Fig. 5, we show the band structures of material entries where the MAEs are between 0.02 and 0.05 eV. For these materials with larger MAE, although some inconsistencies are present, they usually occur at energies far above the Fermi level. Since transport properties are governed by electronic states very close to the Fermi level, we expect that those deviations do not affect the calculated transport properties. Besides, when inconsistencies occur in a strongly localized region of the Brillouin zone, their effect on the calculated optical and transport properties is negligible, since those functional properties involve an integration over the Brillouin zone. These results support the validity of our criterion based on the MAE of band energy differences.

Fig. 4

The comparison of the band structures between the first-principles methods and the Wannier function method, where the MAEs of the band energy differences are smaller than 0.02 eV. The mp-id, chemical formula, and MAE for each entry are listed, and the black and red lines are band structures calculated from the first-principles and the Wannier function method, respectively.

Fig. 5

The comparison of the band structures between the first-principles methods and the Wannier function method, where the MAEs of the band energy differences are between 0.02 and 0.05 eV. The mp-id, chemical formula, and MAE for each entry are listed, and the black and red lines are band structures calculated from the first-principles and the Wannier function method, respectively.

Having obtained the high-quality Wannier functions, we proceeded to calculate the optical and transport properties. In Fig. 6, we show the convergence test with respect to the k-point grid. We chose four materials with different numbers of atoms in the unit cell: Al, GaAs, FeN, and BaAs2, with 1, 2, 2, and 18 atoms in the unit cell, respectively. For optical conductivity and electrical conductivity, we show the calculated xx-component. For shift current, due to the space group symmetry, we show the nonzero xyz-component for GaAs and the nonzero xxx-component for FeN and BaAs2. Note that the shift current response vanishes for the centrosymmetric crystal Al. Although we chose our k-point grid for calculating the optical and transport properties by the k-point density, we observe that for materials with small unit cells, the finer density of 100 k-points per Å is still needed to obtain converged optical and transport properties, while for materials with larger unit cells, convergence is achieved with a relatively coarse k-point grid. Therefore, to construct our dataset, we chose the density of 100 k-points per Å (along each Cartesian direction) for material entries with one atom in the unit cell, and 50 k-points per Å (along each Cartesian direction) for the others.

Fig. 6

The benchmark calculations over the k-point grid density to calculate the optical and transport properties of Al, GaAs, FeN, and BaAs2, where the black, brown, red, blue lines represent the k-point density of 100, 70, 50, 30 k-points per Å along each Cartesian direction.

Furthermore, we examine the MAE of the calculated spectra at various k-point densities \({\rho }_{{\bf{k}}}\) with respect to the finest density of 100 k-points per Å. Taking the optical conductivity \({\sigma }^{\alpha \beta }(\hslash \omega ,{\rho }_{{\bf{k}}})\) as an example, we define \(\,{\rm{MAE}}\,=\frac{1}{9{N}_{\omega }}{\sum }_{\alpha \beta }{\sum }_{{\omega }_{i}}| {\sigma }^{\alpha \beta }(\hslash {\omega }_{i},{\rho }_{{\bf{k}}})-{\sigma }^{\alpha \beta }(\hslash {\omega }_{i},100)| \), where \({N}_{\omega }\) is the number of grid points in the photon frequency, and the factor of 9 is the number of components in the optical conductivity tensor. The MAEs of the calculated optical conductivity, electrical conductivity, and shift current of eight selected materials are shown in Fig. 7. The selected materials span both metals and non-metals, various crystal families, and different unit cell sizes, providing a broad representation of the materials in our dataset. As shown in Fig. 7, materials with smaller unit cell sizes, such as Al, require a finer k-point grid to obtain converged results, while for materials with more atoms in the unit cell, such as GeSe, AgO, and BaAs2, convergence can be reached with a coarser k-point grid, justifying our choice of k-point density above.
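The MAE defined above can be evaluated with a few lines of code. The sketch below assumes that each spectrum is stored as a nested list indexed as `sigma[i][a][b]`, with `i` the frequency index and `a`, `b` the Cartesian indices; this storage layout is an assumption for illustration, not the layout of the dataset files.

```python
def spectrum_mae(sigma, sigma_ref):
    """MAE between two rank-2 tensor spectra sampled on the same frequency grid.

    sigma[i][a][b] is the (a, b) component at frequency point i; values may be
    real or complex, since abs() handles both.
    """
    n_omega = len(sigma_ref)
    total = sum(
        abs(sigma[i][a][b] - sigma_ref[i][a][b])
        for i in range(n_omega)
        for a in range(3)
        for b in range(3)
    )
    return total / (9 * n_omega)
```

The same function applies to the temperature-dependent electrical conductivity by reading `i` as a temperature index; a third-rank quantity such as the shift current would need a third Cartesian loop and a normalization of 27 instead of 9.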

Fig. 7

The benchmark results for the MAE of the calculated optical conductivity, electrical conductivity, and shift current at various k-point density.

Finally, we compare our calculated transport properties with a previous dataset43. In Ref. 43, the transport properties were calculated with the BoltzTraP package69, which uses the same formalism to calculate the transport properties as the Wannier90 package63. In Fig. 8, we show the comparison of the calculated Seebeck coefficients at 300 K, and we observe overall agreement of the Seebeck coefficients between the two methods. However, Ref. 43 includes doping effects (n-doping for panel (a) and p-doping for panel (b)), whereas our dataset does not include this effect, which explains the deviations between the calculated data.

Fig. 8

The comparison between the calculated Seebeck coefficient in this work68 and in Ref. 43, where n-doping (panel (a)) and p-doping (panel (b)) effects were included in Ref. 43, with the doping level of \(10^{-18}\) carriers/cm\(^{3}\).

Usage Notes

The dataset in this work is the largest collection to date of systematically calculated optical and transport properties of materials using the Wannier function method. We are expanding the material space to multi-component and magnetic materials, and are also adopting more advanced computational methods, such as hybrid functionals, to describe the electronic properties of materials more accurately. Users can utilize the dataset to screen for functional materials with desirable optical and transport properties for experimental verification, and to develop machine learning models.

In this work, we adopted the constant relaxation time approximation, so the relaxation time τ = 10 fs enters the electrical and thermal conductivity as a constant scaling factor; users can rescale these properties to other values of the relaxation time, obtained from experiments or from electron-phonon coupling theory61,62. Note that other works may report the electrical and thermal conductivity divided by the relaxation time43 (the Seebeck coefficient and the figure of merit do not depend on τ).
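The rescaling can be sketched as follows, assuming the properties of one entry at one temperature are held in a dictionary with the hypothetical keys `sigma`, `kappa`, `seebeck`, and `zT`. Within the constant relaxation time approximation, only the electrical conductivity and the electronic thermal conductivity scale linearly with τ.

```python
TAU_REF_FS = 10.0  # relaxation time used for the dataset, in fs


def rescale_transport(props, tau_fs, tau_ref_fs=TAU_REF_FS):
    """Rescale transport properties to a different constant relaxation time.

    Electrical and (electronic) thermal conductivity scale linearly with tau;
    the Seebeck coefficient and the electronic-only zT are left unchanged.
    """
    factor = tau_fs / tau_ref_fs
    scaled = dict(props)
    for key in ("sigma", "kappa"):
        if key in scaled:
            scaled[key] = scaled[key] * factor
    return scaled
```

Dividing by τ instead, with τ expressed in seconds, would give the τ-normalized conductivities that some other datasets report.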

Besides, when calculating the figure of merit zT, we did not consider the lattice contribution to the thermal conductivity, so the reported figure of merit zT is in general overestimated. In particular, for non-metals, the calculated electrical conductivity and thermal conductivity (only electron contributions) are in general small, which could lead to errors in the calculated figure of merit. We expect that after including the lattice contribution to the thermal conductivity, the figure of merit of non-metals will decrease significantly.