Background & Summary

Superconductivity, the phenomenon of zero electrical resistance in a material, is of profound scientific and practical significance. This unique state allows electric current to flow through a material without any energy dissipation, making it an essential field of study with numerous potential applications. However, the practical use of superconductors is often constrained by the requirement for extremely low temperatures or high pressures. Since its initial discovery in 1911, the quest for superconductors that function at higher temperatures has been a major focus, as such advancements would enable a wider range of technological applications.

Superconductivity is a well-documented phenomenon, with over 10,000 superconductors identified to date1,2. Prominent examples include cuprate3, iron-based4 and nickel-based superconductors5, which highlight the typical progression in the field: experimental physicists first synthesize new superconductors, followed by theoretical physicists who seek to unravel the fundamental mechanisms of superconductivity through a variety of models and theoretical frameworks. Despite the existence of numerous theories in condensed matter physics that attempt to explain superconductivity, predicting new high-temperature superconductors remains one of the greatest challenges in the field.

In condensed matter theory, energy-band theory serves as a cornerstone for understanding the electronic properties of materials. First-principles calculations based on density functional theory (DFT) play a crucial role in this regard, offering detailed insights into a material’s electronic band structure and density of states (DOS). These elements are instrumental in determining the electrical properties of a material6. Since superconductivity is inherently an electrical property, it follows that the energy-band theory derived from DFT should be applicable for explaining and predicting superconducting behavior7.

Theoretically, the electronic band structure obtained from DFT calculations provides essential parameters for understanding superconducting behavior. These parameters are critical for elucidating both conventional superconductors, such as those explained by BCS theory (e.g., the superconducting gap and electron-phonon coupling constants8), and unconventional superconductors, where strong correlations9 and spin fluctuations10 play a pivotal role.

For instance, the electron-phonon coupling constant in ambient-pressure BCS superconductors like MgB2, which boasts a relatively high transition temperature11, can be extracted through DFT calculations. Likewise, DFT has been instrumental in identifying the key parameters in high-pressure hydrogen-rich superconductors12,13,14. Furthermore, two-dimensional carbon-based materials15,16, nonlinear phonon properties in YBa2Cu3O6.517, and magnetic interactions in iron-based superconductors10 are examples where DFT has significantly contributed to understanding unconventional superconductivity.

Moreover, DFT has provided insights into the tight-binding model parameters18, electronic Coulomb correlation terms in iron-based superconductors9, spin-orbit coupling in heavy fermion systems19, and interlayer interactions in twisted bilayer graphene20. It also helps illuminate the σ-bonds in high-pressure nickel-based superconductors21, the superconducting pairing symmetry in bilayer silicene22, and the unconventional pairing mechanisms in two-dimensional carbon materials23,24. These examples demonstrate the broad applicability of DFT in aiding our understanding of both conventional and unconventional superconductivity.

In contrast to simpler data, such as chemical formulas and lattice structures, electronic band structure data provides a more fundamental and intuitive perspective on superconducting phenomena. This deeper insight is particularly relevant in the context of recent advancements in big data processing techniques, including machine learning (ML) approaches25. The potential of ML to analyze complex electronic properties highlights the need for a comprehensive database of electronic band structures. Such a database would enable large-scale analyses, fostering the discovery of new superconductors and enhancing the understanding of their underlying mechanisms. The development of this resource is essential for advancing both theoretical and experimental research in the field of superconductivity.

In this paper, we introduce SuperBand, a comprehensive electronic band and Fermi surface structure database for superconductors, as depicted in Fig. 1. We generate lattice structure files optimized for DFT calculations and, through these calculations, obtain crucial electronic band data for experimentally realized superconductors. This dataset includes the electronic band structure, DOS, and Fermi surface information. Additionally, we outline methods for the efficient acquisition of structural data, provide high-throughput DFT calculation protocols, and offer programs designed to extract the aforementioned data from large-scale DFT computations. In SuperBand, we have compiled a dataset of 1,362 superconductors, including their experimentally determined superconducting transition temperatures (Tc), and 1,112 experimentally verified non-superconducting materials, which is ideal for ML applications.

Fig. 1

SuperBand: an electronic-band and Fermi surface structure database of superconductors.

Methods

The chemical formulas and Tc data for superconductors presented in this paper are mainly sourced from the 2022 edition of the SuperCon database1. This extensive database contains information on 33,458 materials, including 7,190 non-superconducting compounds and 26,268 superconductors with experimentally measured Tc values. To ensure the most up-to-date dataset, we supplemented these materials with superconductors newly identified after 2022 by reviewing publicly available literature.

As depicted in Fig. 2, the crystal structure data utilized in this study are primarily obtained from the Materials Project (MP)26, with additional contributions from the Open Quantum Materials Database (OQMD)27. Since a significant proportion of superconductors are derived through doping parent compounds with various elements, we adopt the 3DSC methodology2 to deal with lattice doping. To handle the complexities of doped structures, supercell processing is applied, replacing doped atoms and generating ordered crystallographic information files (CIFs) compatible with density functional theory (DFT) calculations.

Fig. 2

Computational and data cleaning details of SuperBand. Chemical formulas and Tc data are mainly sourced from the SuperCon database1, while structural CIF data are obtained from the Materials Project (MP)26 and the Open Quantum Materials Database (OQMD)27. High-throughput calculations for these materials are conducted using the Atomate open-source package31,32. The FireWorks package33 is utilized for managing the DFT workflow, with Pymatgen26,28 and IFermi34 employed to extract energy band data.

Data Cleaning

The SuperCon database1 contains numerous duplicate data entries, necessitating a rigorous data cleaning process. A key distinction of this work, compared to previous studies, lies in the determination of ordered crystal lattices for superconductors suitable for DFT calculations. The initial phase involved retrieving CIFs for lattice structures from relevant databases, including the MP and the OQMD. It should be noted that CIFs obtained from these public sources often contain disordered structures. In cases where CIFs were unavailable, we constructed some disordered lattice structure files manually.

To address this, we employ an order transformation method that retains only ordered structures with the lowest Ewald energy28. This method efficiently standardizes lattice structures with co-occupying atoms to generate ordered configurations. However, the method encounters difficulties when applied to materials with multiple-element co-occupations or a large number of co-occupied atomic sites. Consequently, 14 materials remained for which disorder could not be resolved, including K2RbC60 (ID 15960), TiVNbTa (ID 16063), and Cu0.65La1.83Ni0.35Sr0.17O4 (ID 17788). These materials were excluded from further DFT calculations due to unresolved structural complexities.

Subsequently, we applied the 3DSC methodology2 to handle chemical formulas, including the definitions for exact matching, similarities, doping, and unmatched cases. This methodology is applied to the SuperCon database1 to determine whether the chemical formulas could be matched with the ordered structures collected. For materials in the SuperCon database accompanied by space group information, a space group matching analysis is also performed on the relevant CIFs to identify the most closely corresponding material structure.

When a fully matching or similar CIF cannot be identified, we search for materials with chemically doped formulas. If the doping concentration exceeds 0.75, the doped atoms are replaced. For doping concentrations exceeding 0.45, 0.29, 0.19, and 0.1, supercell expansions of 1 × 1 × 2, 1 × 1 × 3, 2 × 2 × 1, and 2 × 2 × 2, respectively, are performed to accommodate the doped atoms. The doped atoms are then replaced while preserving the lattice symmetry as much as possible. This process is repeated until the expanded and substituted supercell achieves chemical similarity with the given chemical formula.
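The threshold rule in the preceding paragraph can be sketched as a small lookup. This is a minimal illustration, not the authors' code: the function name `choose_supercell` and the table `SUPERCELL_RULES` are hypothetical, and returning `None` at or below 0.1 is an assumption (such cases are handled by the doping versus similarity thresholds discussed in the text).

```python
from typing import Optional, Tuple

# (lower bound on doping concentration, supercell expansion), checked top down.
# Thresholds and expansions are the ones quoted in the text; the names are hypothetical.
SUPERCELL_RULES = [
    (0.75, (1, 1, 1)),  # dopant replaces the host atom, no expansion
    (0.45, (1, 1, 2)),
    (0.29, (1, 1, 3)),
    (0.19, (2, 2, 1)),
    (0.10, (2, 2, 2)),
]


def choose_supercell(concentration: float) -> Optional[Tuple[int, int, int]]:
    """Return the supercell expansion for a doping concentration.

    Returns None when the concentration does not exceed 0.1 (assumed here to
    mean that no doped supercell is built).
    """
    for lower_bound, expansion in SUPERCELL_RULES:
        if concentration > lower_bound:
            return expansion
    return None


for x in (0.8, 0.5, 0.3, 0.2, 0.15, 0.05):
    print(x, choose_supercell(x))
```

Ordering the table from the largest threshold down means the first matching row is always the smallest expansion that the rule allows for a given concentration.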

It is important to note that the introduction of doping does not necessarily alter the Tc of a material. In some cases, the incorporation of dopants has little to no discernible effect on the superconducting properties. For such doped superconductors, it is sufficient to disregard minor dopants that do not significantly impact superconductivity, as seen in SiV3-based superconductors. A threshold of 0.2 is thus established to differentiate between doping and similarity for these materials.

However, for certain other systems, such as iron-based superconductors, even a small amount of elemental doping can substantially shift the Fermi level or modify DOS near the Fermi surface. These changes can markedly enhance or suppress superconductivity, often accompanied by a significant shift in Tc. For such materials, a more stringent threshold of 0.1 is applied to distinguish between doping and similarity, given the pronounced sensitivity of their superconducting properties to minor doping modifications.

Under these circumstances, we generated a limited number of CIFs representing the parent compounds for each doping series with doping below 0.1. Taking the YBCO system as an example29, we utilized the YBa2Cu3O7 CIF to represent 344 distinct doped variant superconductors, assigning the maximum observed Tc of 95 K as the training label, thereby indicating the highest Tc achievable through doping modifications of the YBa2Cu3O7 parent compound. Indeed, the identification of novel parent materials for superconductivity through neural network algorithms presents both significant challenges and remarkable potential.
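The labelling step described above (one parent CIF standing in for many doped variants, with the maximum observed Tc as the label) amounts to a group-and-reduce operation. The sketch below assumes records arrive as `(parent_cif, formula, tc)` tuples; that layout, the function name `label_by_parent`, and the toy values are all hypothetical placeholders rather than entries from the dataset.

```python
from collections import defaultdict


def label_by_parent(records):
    """Group doped variants by parent CIF and label each group with its maximum Tc.

    records: iterable of (parent_cif, formula, tc_in_kelvin) tuples (assumed layout).
    """
    groups = defaultdict(list)
    for parent_cif, _formula, tc in records:
        groups[parent_cif].append(tc)
    return {
        parent: {"n_variants": len(tcs), "tc_label": max(tcs)}
        for parent, tcs in groups.items()
    }


# Toy placeholder values, not data from SuperBand.
toy_records = [
    ("parent_A", "variant_1", 12.0),
    ("parent_A", "variant_2", 30.5),
    ("parent_B", "variant_3", 4.2),
]
labels = label_by_parent(toy_records)
print(labels)
```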

Following the matching process with CIFs, we obtained data for 8,590 materials with non-duplicate chemical formulas, including 6,780 superconductors. Notably, compared to the reports on superconductors, there is a significant scarcity of reports on non-superconducting materials. Although the number of non-superconducting materials likely far exceeds that of superconductors, research on superconductivity often omits such materials from published studies.

In the realm of ML and big data research, this lack of data on non-superconducting materials hinders the reliability of predictions related to superconductivity. Non-superconducting materials are just as critical to the study of superconductivity, as they offer valuable insight into the boundaries of superconducting behavior. Therefore, we also provide data for 1,780 materials that have been experimentally verified to lack superconducting properties.

Notably, a significant portion of the 6,780 superconductors are represented by the same CIF. We identified a total of 1,763 unique CIFs. It is inappropriate to classify a material as a distinct superconductor based on a minor doping of 0.01 of another element. As a result, the CIF itself is used as the definitive criterion for identifying unique superconductors in this study. Therefore, the subsequent sections of this paper focus exclusively on the 1,763 superconductors corresponding to these unique CIFs, as depicted in Fig. 3.

Fig. 3

The elemental distribution of superconductors in SuperBand. For any superconductor containing element A, the count for element A is incremented by 1. By following this process, we obtain the elemental distribution across all superconductors.
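The counting rule in the caption above can be written in a few lines. This is a sketch under the assumption that formulas are plain strings without brackets or charge annotations; the helper name `element_distribution` is hypothetical, and the four formulas are ones quoted elsewhere in this paper.

```python
import re
from collections import Counter

# One uppercase letter optionally followed by one lowercase letter.
ELEMENT = re.compile(r"[A-Z][a-z]?")


def element_distribution(formulas):
    """Count, for each element, the number of formulas that contain it."""
    counts = Counter()
    for formula in formulas:
        # A set ensures each element is counted once per material,
        # regardless of its stoichiometry.
        counts.update(set(ELEMENT.findall(formula)))
    return counts


counts = element_distribution(
    ["MgB2", "YBa2Cu3O7", "KFe2Se2", "Cu0.65La1.83Ni0.35Sr0.17O4"]
)
print(counts.most_common())
```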

DFT calculation

The projector-augmented wave (PAW) method, implemented in the Vienna Ab initio Simulation Package (VASP), is employed to carry out our DFT calculations30. The generalized gradient approximation (GGA) with the Perdew-Burke-Ernzerhof (PBE) functional is used to treat the electron exchange-correlation potential. High-throughput DFT calculations are facilitated by the Atomate open-source package31, with parameter settings derived from the MIT High-Throughput Project32. For workflow automation, we employ the FireWorks package33, which efficiently manages the task flow for structure optimization, static calculations, and non-self-consistent field calculations.

The plane-wave cut-off energy is set at 520 eV. In structure optimization, a Monkhorst-Pack k-point mesh with a spacing of 2π × 0.04 Å⁻¹ is employed and the self-consistent convergence threshold is set to 5 × 10⁻⁵ eV. In static calculations, we employ a Monkhorst-Pack k-point mesh with a spacing of 2π × 0.02 Å⁻¹ and set the self-consistent convergence threshold to 1 × 10⁻⁵ eV.
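To make the k-point spacing concrete, the sketch below converts a spacing into mesh subdivisions for a given lattice. It assumes the common rule N_i = max(1, ceil(|b_i| / spacing)), with reciprocal vectors b_i that include the factor 2π; the rounding actually applied by the Atomate input sets may differ, and the function names and the toy lattices are hypothetical.

```python
import math


def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def reciprocal_lengths(lattice):
    """Lengths |b_i| in 1/Angstrom (2*pi included) for rows a1, a2, a3 in Angstrom."""
    a1, a2, a3 = lattice
    volume = abs(dot(a1, cross(a2, a3)))
    normals = (cross(a2, a3), cross(a3, a1), cross(a1, a2))
    return [2 * math.pi * math.sqrt(dot(n, n)) / volume for n in normals]


def kmesh(lattice, spacing_over_2pi):
    """Subdivisions per reciprocal axis for a spacing of 2*pi*spacing_over_2pi (assumed rule)."""
    spacing = 2 * math.pi * spacing_over_2pi
    return tuple(max(1, math.ceil(b / spacing)) for b in reciprocal_lengths(lattice))


# Toy lattices (not taken from the dataset): a 4 A cube and a 4 x 4 x 12 A tetragonal cell.
cubic = [(4.0, 0.0, 0.0), (0.0, 4.0, 0.0), (0.0, 0.0, 4.0)]
tetragonal = [(4.0, 0.0, 0.0), (0.0, 4.0, 0.0), (0.0, 0.0, 12.0)]
print(kmesh(cubic, 0.04), kmesh(cubic, 0.02), kmesh(tetragonal, 0.04))
```

Because the mesh follows the reciprocal lattice, long real-space axes receive fewer k-points, which is one reason the raw band data differ in sampling from material to material.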

Collinear magnetism is consistently incorporated in all calculations. Transition metal elements are automatically assigned magnetic moments; for example, Mn atoms are generally set to 5 μB, while Mn3+ and Mn4+ ions are assigned 4 μB and 3 μB, respectively, and Fe atoms are configured with 5 μB, with standard settings applied to the other magnetic atoms. The GGA+U approach is systematically implemented across all calculations, with Hubbard U parameters assigned to specific transition metals (for example, U = 1.5 eV for Ag and U = 3.4 eV for Co), following established computational protocols. Advanced treatments such as spin-orbit coupling (SOC), dynamical mean-field theory (DMFT), HSE06, or GW calculations are more accurate for capturing strong correlation effects in systems like cuprates or nickelates. However, because such methods are computationally intractable for high-throughput workflows, we deliberately omit them in all simulations to optimize computational resource utilization. Our approach prioritizes scalability and consistency, acknowledging that DFT+U and collinear magnetism serve as a pragmatic first step for large-scale electronic structure analysis.

The band structure and DOS data from the non-self-consistent field calculations are extracted for analysis. We utilize the Pymatgen package26,27 to plot the band structure and DOS. For Fermi surface generation, analysis, and visualization, the IFermi package34 is used, enabling detailed examination of electronic properties crucial for understanding superconducting mechanisms.

Data standardization

The availability and standardization of data are critical prerequisites for the development of ML models aimed at predicting material properties. In our DFT calculations, the electronic bands of different materials show significant variations due to the MIT-initialized DFT parameter settings32. Initially, lattice symmetry is considered to reduce computational costs, but the equivalent k-point values differ across space groups. Moreover, the k-space mesh density must be adjusted based on the number of atoms and lattice dimensions in each unit cell to enhance the accuracy of the calculations.

To normalize the k-space band data, we employ the IFermi package34 to standardize the k-space by considering only symmetry-equivalent k-points. Following this, interpolation techniques are applied to map the band data onto a uniform 32 × 32 × 32 k grid, ensuring consistency across various materials for ML applications.
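As an illustration of resampling onto a uniform grid, the sketch below maps one band from an arbitrary periodic k mesh onto an n × n × n grid. The text does not specify the interpolation scheme, so plain trilinear interpolation with periodic wrap-around is substituted here purely for illustration; the function name `resample_periodic` and the nested-list layout `band[i][j][k]` are assumptions.

```python
from itertools import product


def resample_periodic(band, n_out=32):
    """Trilinearly resample a periodic grid band[i][j][k] onto n_out^3 points.

    Both grids are assumed to span one reciprocal cell in fractional
    coordinates, starting at the origin, with periodic boundary conditions.
    """
    shape = (len(band), len(band[0]), len(band[0][0]))
    out = [[[0.0] * n_out for _ in range(n_out)] for _ in range(n_out)]
    for target in product(range(n_out), repeat=3):
        # Position of the target point in units of the input grid spacing.
        pos = [t * n / n_out for t, n in zip(target, shape)]
        base = [int(p) for p in pos]
        frac = [p - b for p, b in zip(pos, base)]
        value = 0.0
        for corner in product((0, 1), repeat=3):
            weight = 1.0
            for f, c in zip(frac, corner):
                weight *= f if c else 1.0 - f
            # Wrap indices so the last cell interpolates back to the first point.
            i, j, k = ((b + c) % n for b, c, n in zip(base, corner, shape))
            value += weight * band[i][j][k]
        i, j, k = target
        out[i][j][k] = value
    return out


# Toy 2 x 2 x 2 band whose value equals the first index, upsampled to 4 x 4 x 4.
coarse = [[[float(i) for _ in range(2)] for _ in range(2)] for i in range(2)]
fine = resample_periodic(coarse, n_out=4)
print([fine[i][0][0] for i in range(4)])
```

Calling the function with the default `n_out=32` once per band would give the 32 × 32 × 32 layout used in the dataset; a Fourier-based interpolation would be smoother but needs more machinery than fits a short sketch.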

After completing the standardization process, the number of electronic bands varies among different materials. In constructing a standardized dataset for ML, one could theoretically pad the training set tensors with zero tensors to maintain uniformity. However, this approach wastes computational resources and diminishes the efficiency of the calculations. Studies on both conventional and unconventional superconductors have demonstrated that the DOS near the Fermi surface has a substantial impact on superconductivity, while bands far from the Fermi surface contribute minimally. Therefore, focusing on the electronic bands in close proximity to the Fermi surface is more computationally efficient and enhances the relevance of the dataset for predicting superconducting properties.

Therefore, we limit our analysis to the 18 electronic bands around the Fermi level. Each band is mapped onto a 32 × 32 × 32 grid, yielding band data with dimensions of 18 × 32 × 32 × 32. This targeted approach ensures that our dataset efficiently captures the most relevant features for predicting superconducting properties. These data can be systematically augmented through various techniques: lattice orientation variations can be achieved through simple dimensional permutations, while lattice geometry modifications can be implemented via transformation matrices. For instance, primitive and conventional cell representations in face-centered cubic systems can be interconverted. Additionally, repeated selective sampling of bands near the Fermi level enables effective simulation of band structure folding in supercells. Because storage is constrained, we balance storage efficiency against computational precision by prioritizing the preservation of critical band structure information near the Fermi surface. Importantly, the incorporation of data augmentation techniques is essential for enhancing predictive accuracy in AI training, as demonstrated in our companion paper35.
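One simple way to pick a fixed number of bands near the Fermi level is sketched below. The selection criterion is an assumption: bands are ranked by the smallest distance between any of their energies and the Fermi energy, since the text states only that 18 bands around the Fermi level are kept. The function name `select_bands` and the toy energies are hypothetical.

```python
def select_bands(bands, e_fermi, n_keep=18):
    """Return the sorted indices of the n_keep bands that approach e_fermi most closely.

    bands: list of per-band energy lists in eV (one flat list of k-point
    energies per band). If fewer than n_keep bands exist, all are returned.
    """
    def distance(index):
        return min(abs(energy - e_fermi) for energy in bands[index])

    # Ties are broken by band index so the result is deterministic.
    ranked = sorted(range(len(bands)), key=lambda index: (distance(index), index))
    return sorted(ranked[:n_keep])


# Toy energies (eV), not taken from the dataset.
toy_bands = [[-10.0, -9.0], [-1.0, 0.5], [2.0, 3.0], [8.0, 9.0]]
kept = select_bands(toy_bands, e_fermi=0.0, n_keep=2)
print(kept)
```

Returning the indices in ascending order keeps the energetic ordering of the bands, so the first axis of the resulting 18 × 32 × 32 × 32 tensor still runs from lower to higher bands.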

Data Records

The DFT calculations yield band structure data for 1,362 distinct superconductors as well as 1,112 experimentally verified non-superconducting materials.

Data Organization

These band structure data, hosted on Science Data Bank36 and combined with experimentally reported Tc, form the basis of an ML training set. The dataset is stored in HDF5 format, providing a platform-independent, efficient means of accessing scientific and engineering data. In addition to the normalized band structure data, we also include several critical features for ML: orbital-resolved DOS data, chemical formulas, space group symmetries, lattice constants, atomic species, and atomic positions.

Within the HDF5 architecture, all data pertaining to a specific superconductor are organized into a Group (analogous to a directory). Each Group encapsulates critical metadata within the Group’s Attribute, including experimentally reported Tc, chemical formula, space group system, space group number, and cell volume. The Group further comprises multiple datasets categorized as follows:

  1. Crystallographic Parameters:

    • Atomic species data

    • Unit cell vectors (in Ångström units)

    • Atomic coordinates (in Ångström units)

  2. Electronic Structure Data:

    • Fermi surface data

    • DOS data partitioned by orbital contributions (s, p, d, f), normalized to a uniform length of 2001 data points

    • Reciprocal space coordinates

  3. Normalized Energy Band Data: Standardized energy band datasets are structured as four-dimensional tensors (18 × 32 × 32 × 32), encoding band indices and momentum-space sampling.
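The layout listed above can be mirrored by a plain-Python stand-in for one Group, which is convenient for sanity-checking shapes after loading. Every key name below (`attrs`, `bands`, `dos`, `tc`, and so on) is a hypothetical placeholder and not the actual HDF5 key; the reading code in the repository remains the reference for the real names.

```python
EXPECTED_BAND_SHAPE = (18, 32, 32, 32)  # band index x three k-grid axes
DOS_POINTS = 2001                       # uniform DOS length stated in the text
ORBITALS = ("s", "p", "d", "f")
# Hypothetical attribute names for the metadata listed in the text.
REQUIRED_ATTRS = ("tc", "formula", "crystal_system", "space_group_number", "cell_volume")


def shape_of(nested):
    """Shape of a regular nested list, read along the first element of each level."""
    shape = []
    while isinstance(nested, (list, tuple)):
        shape.append(len(nested))
        nested = nested[0] if nested else None
    return tuple(shape)


def empty_record():
    """Zero-filled placeholder record with the expected shapes (no real data)."""
    bands = [[[[0.0] * 32 for _ in range(32)] for _ in range(32)] for _ in range(18)]
    dos = {orbital: [0.0] * DOS_POINTS for orbital in ORBITALS}
    attrs = {name: None for name in REQUIRED_ATTRS}
    return {"attrs": attrs, "bands": bands, "dos": dos}


def check_record(record):
    """Return a list of human-readable problems; an empty list means the shapes match."""
    problems = [f"missing attribute: {name}"
                for name in REQUIRED_ATTRS if name not in record["attrs"]]
    if shape_of(record["bands"]) != EXPECTED_BAND_SHAPE:
        problems.append(f"unexpected band shape: {shape_of(record['bands'])}")
    for orbital in ORBITALS:
        if len(record["dos"].get(orbital, ())) != DOS_POINTS:
            problems.append(f"DOS channel '{orbital}' does not have {DOS_POINTS} points")
    return problems


print(check_record(empty_record()))
```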

Data Summary

A comprehensive summary of the literature documenting the initial experimental synthesis of these superconductors is provided. For 159 materials, no corresponding references were found. However, for the remaining 1,604 superconductors, relevant publications are identified. As illustrated in the left panel of Fig. 4, the proportion of superconductors with Tc below 30 K remains consistent across various periods, suggesting that the discovery of new superconductors is largely stochastic. Additionally, the distribution of superconductors relative to their Tc follows an inverse relationship, except for those with Tc < 2 K. The 1980s saw the advent of superconductors with Tc > 30 K, most notably with the discovery of cuprate superconductors in 1986, which triggered a surge in high-temperature superconductor research.

Fig. 4

Year distribution (left) and crystal system distribution (right) of superconductors in SuperBand.

The use of CIFs enables precise characterization of material properties via Pymatgen’s structure tool, as shown in the right panel of Fig. 4. Among superconductors, the most prevalent crystalline structure is tetragonal, which appears in 453 distinct cases. This is followed closely by cubic symmetry in 439 cases, with the fewest occurrences noted for monoclinic (112 instances) and triclinic (27 instances). The tendency of superconductors to favor high-symmetry structures aligns with Matthias’ hypothesis regarding the correlation between symmetry and superconductivity. However, for materials with Tc > 10 K, a significant decline in the proportion of cubic superconductors is observed, coinciding with a marked increase in orthorhombic superconductors, which exhibit lower symmetry.

For superconductors with Tc values greater than 40 K, the majority of unconventional superconductors that surpass the McMillan limit tend to have either tetragonal or orthorhombic symmetry. This shift suggests that structures with lower symmetry may play a key role in high-temperature superconductivity, especially in systems where conventional electron-phonon interactions are insufficient to explain the observed Tc.
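The crystal-system statistics above follow from the space group number alone, using the standard ranges of the International Tables. The sketch below is a plain-Python version of that mapping; the function name `crystal_system` is hypothetical, and Pymatgen's `SpacegroupAnalyzer.get_crystal_system` gives the same classification when starting from a structure.

```python
import bisect

# Largest space group number belonging to each crystal system.
UPPER_BOUNDS = [2, 15, 74, 142, 167, 194, 230]
SYSTEMS = ["triclinic", "monoclinic", "orthorhombic", "tetragonal",
           "trigonal", "hexagonal", "cubic"]


def crystal_system(space_group_number: int) -> str:
    """Map an international space group number (1 to 230) to its crystal system."""
    if not 1 <= space_group_number <= 230:
        raise ValueError("space group number must lie between 1 and 230")
    return SYSTEMS[bisect.bisect_left(UPPER_BOUNDS, space_group_number)]


print(crystal_system(191))  # P6/mmm
print(crystal_system(47))   # Pmmm
print(crystal_system(139))  # I4/mmm
```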

Technical Validation

During the collection of superconductor crystal structures, we made every effort to establish one-to-one correspondence between CIF and Tc. For superconductors whose original research papers provided information such as space group, lattice constants, or specific crystal structures, we ensured that the collected CIF strictly matched those specified in the publications. However, for doped superconductors, we could only employ the supercell expansion method mentioned previously to maintain maximum consistency in their chemical formulas. Our dataset was enhanced by incorporating superconductors reported after 2022 beyond the SuperCon database, thereby ensuring comprehensive coverage of materials. This completeness is demonstrated in Fig. 4, which presents statistics regarding the discovery timeline of superconductors.

Figure 5 presents the energy band data for three representative superconductors in SuperBand: the BCS superconductor MgB2 (mp-763) with a hexagonal system11, the cuprate superconductor YBa2Cu3O7 (mp-22215) with an orthorhombic system29, and the iron-based superconductor KFe2Se2 (mp-1070735) with a tetragonal system37. The electronic band structures of these three materials exhibit excellent consistency with data from the MP database, demonstrating the accuracy of our DFT calculations.

Fig. 5

Typical band data of three superconductors in SuperBand as examples: the BCS superconductor MgB2 (mp-763) with a hexagonal system11, the cuprate superconductor YBa2Cu3O7 (mp-22215) with an orthorhombic system29, and the iron-based superconductor KFe2Se2 (mp-1070735) with a tetragonal system37. For each material in SuperBand, we provide figures for crystal structure, electronic band structure, and Fermi surface.

For the technical validation and initial training of this dataset, we employed the 3D-Vision Transformer model35,38 and compared the predicted Tc with the experimental values. We use a set of optimal hyperparameters P × Q × F × D = 18 × 8 × 8 × 8, Ld = 534, De = 0.127, Hd = 64, Dm = 0.197, Md = 1038, and Lt = 3 in the 3D-Vision Transformer model. We employ the mean squared error (MSE) between the predicted outputs and the rescaled log(Tc + 1) values of the training set as the loss function. In training, we utilize stochastic gradient descent (SGD) with a learning rate of 0.001, momentum of 0.9, weight decay of 10⁻⁵, and batch size of 32.
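The loss described above can be sketched without any deep learning framework. The example assumes that log denotes the natural logarithm (the base is not stated in the text); the helper names `to_target`, `from_target`, and `mse` are hypothetical.

```python
import math


def to_target(tc_kelvin: float) -> float:
    """Rescale Tc to log(Tc + 1); natural logarithm assumed."""
    return math.log1p(tc_kelvin)


def from_target(y: float) -> float:
    """Invert the rescaling to recover Tc in kelvin."""
    return math.expm1(y)


def mse(predicted, target):
    """Mean squared error between two equally long sequences."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


# The loss is evaluated in the rescaled space; predictions are mapped back
# to kelvin only when they are compared with experiment.
targets = [to_target(tc) for tc in (0.0, 10.0, 95.0)]
print(targets, [from_target(y) for y in targets])
```

The rescaling compresses the wide range of Tc values and maps non-superconducting materials (Tc = 0) to a target of exactly zero.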

The goodness of fit between the predicted and experimental Tc values is quantified using the coefficient of determination, R2, defined by:

$${R}^{2}=1-\frac{{S}_{{\rm{Res}}}}{{S}_{{\rm{Tot}}}}=1-\frac{{\sum }_{i}{({T}_{i}-{\widehat{T}}_{i})}^{2}}{{\sum }_{i}{({T}_{i}-\bar{T})}^{2}},$$
(1)

where Ti represents the experimental Tc values, \({\widehat{T}}_{i}\) denotes the corresponding predicted Tc values, and \(\bar{T}\) is the average experimental Tc. The deep learning model’s predictions, illustrated in Fig. 6, agree well with the experimental values, giving R2 = 0.976. Our training code is provided in our GitHub repository (https://github.com/ljcj007/SuperBand)39. As a preliminary demonstration, our dataset exhibits promising potential for application in neural network algorithms. Beyond the band structures we primarily utilized, the dataset encompasses diverse material properties, such as DOS and Fermi surface data, that can serve as comprehensive training features for machine learning models.
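Eq. (1) translates directly into code when the residual sum compares each experimental value with its prediction and the total sum compares each experimental value with the experimental mean. The sketch below follows that usual convention; the function name `r_squared` and the toy numbers are illustrative only.

```python
def r_squared(experimental, predicted):
    """Coefficient of determination: 1 - S_res / S_tot."""
    mean_experimental = sum(experimental) / len(experimental)
    s_res = sum((t - p) ** 2 for t, p in zip(experimental, predicted))
    s_tot = sum((t - mean_experimental) ** 2 for t in experimental)
    return 1.0 - s_res / s_tot


# Toy values, not results from the paper.
print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # perfect prediction
print(r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))  # always predicting the mean
```

A model that always predicts the experimental mean scores zero, so R2 measures the improvement over that baseline.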

Fig. 6

Comparison between the Tc predicted by the deep learning model and the experimentally measured Tc for the training set. The present in-sample model serves as a proof-of-concept demonstration, establishing the viability of our comprehensive dataset for neural network implementations. While this work focuses on describing the dataset, the detailed methodology regarding data augmentation protocols and band structure-based Tc prediction is documented in our companion paper35.

Usage Notes

We publicly provide the full SuperBand dataset on Science Data Bank36. The code used to generate figures, a tool for ingesting new data into this database, code for accessing and reading the HDF5 file, and a neural network model that can be trained on this dataset are provided in our GitHub repository39. For ease of use, a CSV file is included in our GitHub repository, which contains superconducting-related data, along with corresponding CIFs for the crystal structures.

High-pressure hydride superconductors (e.g., LaH10, ID 15969; YH9, ID 18619) are excluded due to their reliance on extreme pressure conditions and the absence of ambient-pressure structural data. Their chemical formulas are listed in the CSV file in our GitHub repository39 for reference.

While pairing mechanisms differ between conventional and unconventional superconductors, our dataset aims to provide a broad foundation for ML exploration. Subclass-specific models may yield higher accuracy. However, the inclusion of diverse materials enables cross-class feature discovery, which is critical for identifying universal trends. We encourage users to subset the data by material class for targeted analyses.

The current dataset does not account for pressure-dependent properties, which limits its applicability to high-pressure systems like hydrides. Future extensions will address this gap through targeted collaborations and advanced computational protocols.