Introduction

Materials innovation and design present a multifaceted approach to tackling humanity’s challenges of sustainable development and climate change1,2,3,4. Recently, halide perovskites have been widely recognized as remarkable semiconductor materials for optoelectronics. Within a decade, halide perovskites have demonstrated great potential to disrupt the photovoltaics and light-emitting diode (LED) industries5,6,7. Power conversion efficiencies exceeding 26% have been achieved by single-junction perovskite solar cells, and an even higher efficiency of 33% has been reported for perovskite/silicon tandem solar cells8,9, while external quantum efficiencies exceeding 23% have been demonstrated for perovskite LEDs7. Of late, two-dimensional (2D) perovskites, with their huge structural diversity, are emerging candidates for the next generation of perovskite materials (a.k.a. Perovskite 2.0)5,6,10,11,12,13,14,15. These layered materials afford even higher stability and greater tunability of their optoelectronic properties than their three-dimensional (3D) counterparts. However, the discovery and design of new materials is a complicated and challenging process. In fact, materials composed of even very simple elements can easily end up in a formidably huge chemical space, making traditional trial-and-error-based materials discovery not only time-consuming but also extremely expensive16.

Recently, data-driven materials informatics has manifested its huge potential in materials design and discovery17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34. Materials informatics17,18,19,20, the integration of computational materials science with machine learning (ML) technologies, represents the fourth paradigm of materials science35. A major milestone of artificial intelligence (AI)-based materials informatics is the Materials Genome Initiative (MGI)3, launched in 2011 in the United States with the goals of establishing large-scale collaboration between materials scientists and computer scientists and applying machine learning models to predict, screen, and optimize materials at an unparalleled scale and rate. The MGI has promoted the creation of various databanks, such as the Materials Project36, JARVIS37, NOMAD38, Aflowlib39, and OQMD40. Together with other well-established materials databanks, they provide a solid foundation for data-driven AI models for material property prediction and new material design and discovery. Currently, AI models have been widely used in the analysis of various perovskite data, including single perovskites21,41,42,43,44,45, double perovskites46,47,48,49, Pb-free perovskites50,51,52,53, and perovskite solar cell devices54,55,56,57,58.

Mathematically, data-driven material AI models can be classified into two categories, i.e., featurization-based machine learning models and end-to-end deep learning models. Featurization, or feature engineering, seeks to develop a set of structural, physical, chemical, or biological features that characterize the intrinsic information of a material28,59,60,61. These features are known as molecular descriptors or fingerprints. Traditional molecular descriptors for materials include atomic radius, ionic radius, orbital radius, tolerance factor, octahedral factor, packing factor, crystal structure measurements, surface/volume features, and physical properties such as ionization potential, ionic polarizability, electron affinity, Pauling electronegativity, valence orbital radii, and HOMO and LUMO. A molecular fingerprint is a long vector composed of a series of systematically generated features, usually derived from molecular structures. Widely used material fingerprints include the Coulomb matrix29, Ewald sum matrices62, the many-body tensor representation (MBTR)63, the smooth overlap of atomic positions (SOAP)64, and atom-centered symmetry functions (ACSF)65. These molecular descriptors or fingerprints are used as input features for machine learning models and have been applied in various material property predictions, in particular 2D perovskite bandgap prediction66,67.
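As a concrete illustration of such a fingerprint, the Coulomb matrix can be computed in a few lines. The atomic numbers and coordinates below are hypothetical toy values, and the eigenvalue-spectrum step at the end is one common (but not the only) way to turn the matrix into a fixed-length, permutation-robust feature vector:

```python
import numpy as np

def coulomb_matrix(Z, R):
    """Coulomb matrix: M_ii = 0.5 * Z_i^2.4, M_ij = Z_i * Z_j / |R_i - R_j|."""
    n = len(Z)
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i, j] = 0.5 * Z[i] ** 2.4
            else:
                M[i, j] = Z[i] * Z[j] / np.linalg.norm(R[i] - R[j])
    return M

# Toy two-atom system (hypothetical coordinates in angstroms)
Z = np.array([82, 53])                 # Pb, I
R = np.array([[0.0, 0.0, 0.0],
              [3.2, 0.0, 0.0]])
M = coulomb_matrix(Z, R)
# Sorted eigenvalue spectrum: a fixed-length, atom-order-independent feature
fingerprint = np.sort(np.linalg.eigvalsh(M))[::-1]
```

Because the eigenvalues are invariant under permutations of the atom ordering, two descriptions of the same structure yield the same vector.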

Deep learning models, in particular geometric deep learning models, are free from hand-crafted features and provide an end-to-end way to directly learn materials’ properties. Geometric deep learning has been proposed for the analysis of non-Euclidean data, such as graphs, networks, and manifolds, by incorporating geometric and topological information of the data into deep learning architectures68,69,70,71. Among geometric deep learning models are graph neural networks (GNNs), which have been widely used in node classification, link prediction, graph classification, and property prediction72,73,74,75. Recently, various GNN models have been developed for material property analysis, such as SchNet45, crystal graph convolutional neural networks (CGCNN)21, materials graph network (MEGNet)22, improved crystal graph convolutional neural networks (iCGCNN)76, global attention-based graph neural network (GATGNN)77, crystal graph attention network (CYATT)24, compositionally restricted attention-based network (CrabNet)25, neural equivariant interatomic potentials (NequIP)26, atomistic line graph neural network (ALIGNN)78, Matformer79, MatDeepLearn27, and SIGNNA80. In general, materials-based GNN models can be classified into two categories: potential-energy-based GNNs and chemical-composition-based GNNs. The potential-energy-based GNNs, which include SchNet45, MEGNet22, and NequIP26, use specially designed GNN architectures to approximate potential energy and use it to predict material properties. The chemical-composition-based GNNs, including CYATT24 and CrabNet25, construct deep learning architectures directly on chemical formulas or chemical structures.

Despite the rapid progress in material AI models, the efficient representation and characterization of material structures, in particular of material periodic information, still remains a major challenge32,33,34. Recently, topological data analysis (TDA) has been used in the characterization of molecular structures81,82. Different from traditional models, TDA models use advanced intrinsic mathematical invariants from algebraic topology as molecular fingerprints81,82. TDA-based learning models have proven their advantages in various steps of drug design81,82,83,84,85,86,87,88, as well as in materials design and discovery89,90,91,92,93. Motivated by the success of TDA, geometric data analysis (GDA) has been developed to study the intrinsic geometric properties of data, including discrete curvatures94,95,96,97,98, flows99, and periodicity79,100,101. Persistent Ricci curvature-based learning models have demonstrated a great advantage over traditional descriptors for perovskite data analysis102. More recently, a density fingerprint (DF) model has been proposed to explicitly characterize crystal periodicity information100. The key idea of DF is to characterize the geometric distribution of atoms within a unit cell by the intersection patterns of neighboring atoms during a filtration process. The DF approach preserves the essential intrinsic structural information and is invariant under isometries while satisfying continuity and completeness.

Hitherto, to the best of our knowledge, no GDA-based machine learning model has been reported for 2D perovskite design. To fully characterize 2D perovskite structural information, we consider an element-specific representation, which decomposes the perovskite unit cell into a series of element-specific sets. Each set is composed of one or several types of atoms depending on site, inorganic/organic character, and halide anion type. The geometric and periodicity information of these element-specific structures is characterized by a series of density functions, i.e., a density fingerprint. The GDA-based material fingerprint is obtained by concatenating all the corresponding density fingerprints and is then combined with machine learning models, in particular the gradient boosting tree (GBT). Our GDA-GBT model is extensively validated and compared with state-of-the-art models on the 2D perovskite dataset sourced from the New Materials for Solar Energetics (NMSE) databank, where it outperforms all existing models.

Results and discussion

GDA-based material fingerprint

A density fingerprint, which is a series of 1D density functions of different dimensions, is a powerful tool that captures the periodic geometric information of crystals100. These functions characterize the pattern of neighboring atom spheres overlapping with one another during a filtration process100. More specifically, in the filtration process each atom is represented as a sphere, the family of spheres is denoted S(t), and the radius t increases continuously. The region covered by at least k spheres is denoted \({\bigcup }^{k}S(t)\), with \({\bigcup }^{0}S(t)\) denoting the entire region and \({\bigcup }^{1}S(t)\) denoting the region occupied by at least one sphere. Note that if a certain region is covered by k + 1 spheres, then it is also covered by k spheres, i.e., \({\bigcup }^{k+1}S(t)\subseteq {\bigcup }^{k}S(t)\). The region covered by exactly k spheres is defined as \({\bigcup }^{k}S(t)-{\bigcup }^{k+1}S(t)\). In general, the k-th density function is the ratio between the volume of the exact k-sphere overlapping region (within the unit cell) and the volume of the unit cell over the filtration process. Figure 1 shows the overlapping regions (within the unit cell) (a), the exact overlapping regions (within the unit cell) (b), and the 1D density functions of dimensions 0 to 2 (c). Here the unit cell is a cube composed of 8 atoms of the same type. Details of the construction of the DF on a 2D perovskite are presented in the Method section.

Fig. 1: Illustration of the generation of density fingerprint.
figure 1

In this example, the unit cell U is the set \({\left[0,1\right)}^{3}\) with the atoms centered at the vertices, and S(t) are atom spheres with radius t. Panel a: The overlapping region of k spheres (within the unit cell) is denoted as U ∩ kS(t) (k = 0, 1, 2, 3); Panel b: The overlapping region of exactly k spheres (within the unit cell) is \(U\cap ({\bigcup }^{k}S(t)-{\bigcup }^{k+1}S(t))\) (k = 0, 1, 2); Panel c: The corresponding density functions ψ0, ψ1, ψ2 on the interval [0, 1].

Element-specific DFs can provide a more detailed characterization of the intrinsic structural information of a material. Mathematically, the essential idea of element-specific representation is to systematically select sets of atoms or site combinations to characterize the heterogeneous interactions within the molecular structure81,82. For instance, it has been found that selecting carbon atoms characterizes the hydrophobic interaction network, whilst selecting all nitrogen and/or oxygen atoms characterizes the hydrophilic interaction network81. In our GDA model, we concatenate the DFs calculated from a series of atom sets systematically obtained from sites and their combinations (ACB, ACX, ACBX, BX, B, X), A-site atom types (C, H, O, N), B-site atom types (Bi, Cd, Ge, Pb, \({{{{{{{\rm{Sn}}}}}}}}\)), and X-site atom types, i.e., halide anions (Cl, Br, I). Note that the B-site and X-site encompass all inorganic and halide atoms of each material, and we use carbon (C) atoms as representatives of the A-site (the site of organic cations) to reduce computational complexity.
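The element-specific decomposition described above can be sketched as follows; the atom list, site table, and helper `select` are illustrative stand-ins, not the paper's implementation:

```python
# Group the atoms of a unit cell into element-specific sets (toy data).
atoms = [("Pb", (0.0, 0.0, 0.0)), ("I", (3.2, 0.0, 0.0)),
         ("I", (0.0, 3.2, 0.0)), ("C", (1.5, 1.5, 4.0))]

# Site assignment for the element types named in the text
SITE_OF = {"Pb": "B", "Sn": "B", "Bi": "B", "Cd": "B", "Ge": "B",
           "Cl": "X", "Br": "X", "I": "X",
           "C": "A", "H": "A", "O": "A", "N": "A"}

def select(atoms, sites):
    """Return coordinates of atoms whose site label is in `sites`."""
    return [xyz for element, xyz in atoms if SITE_OF[element] in sites]

bx_set = select(atoms, {"B", "X"})    # inorganic framework (B- and X-sites)
acb_set = select(atoms, {"A", "B"})   # A-site (C only) plus B-site
```

Each such subset is then fed separately into the density-fingerprint computation.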

GDA-based ML for 2D perovskite design

Database

Unlike for three-dimensional (3D) perovskites, there is a scarcity of databases that provide information on 2D perovskites. In 2021, the Laboratory of New Materials for Solar Energetics (NMSE) launched a crowd-sourcing project to collect 2D perovskites from a variety of experiments and literature sources66. This open-access database contains crystal structures with atomic coordinates and offers density functional theory (DFT)-based and experiment (Exp)-based bandgaps (in eV) as well as ML-based atomic partial charges. It is currently the most comprehensive database available for studying 2D perovskite crystals and is continuously updated by the NMSE. The current database accommodates 849 compounds, all possessing 3D structures. Within this collection, 753 compounds have DFT-based bandgap values (eV), while 238 compounds have experimental bandgap values; 235 materials have both types of bandgaps. The experiments in this study use various subsets drawn from these pools of 753 and 238 2D perovskites.

GDA-based ML for 2D perovskite bandgap prediction

Figure 2 shows a flowchart of our DF-based ML for 2D perovskite material design. Our approach involves several key steps. First, different types of atoms and their combinations within the unit cell, such as organic atoms (e.g., C), inorganic atoms (e.g., \({{{{{{{\rm{Sn}}}}}}}}\)), halide atoms (e.g., Cl), and site combinations (e.g., ACB and ACBX), are systematically extracted from the dataset. Then, we create a multiscale topological representation of the selected atomic system through a filtration process (i.e., a continuous increase of the atom radius). Based on the topological representation, we employ the DF algorithm to generate a set of 1D density functions, namely ψ0, ψ1, …, ψn, which are referred to as element-specific density fingerprints (see Method for more details). Finally, leveraging the extracted density fingerprints, we employ the gradient boosting tree (GBT) model to predict the material bandgaps. One benefit of the GBT model is its high accuracy and robustness103,104,105. Additionally, GBT provides feature importance scores, enabling a more interpretable analysis of the relationship between the input features and the prediction target.
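A minimal sketch of the final prediction step, assuming scikit-learn's `GradientBoostingRegressor` as a stand-in for whichever GBT implementation is used, with random vectors in place of real density fingerprints:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Stand-in data: random vectors in place of density fingerprints,
# random targets in place of bandgaps (eV).
rng = np.random.default_rng(0)
X = rng.random((120, 300))
y = rng.random(120) * 4.0

gbt = GradientBoostingRegressor(n_estimators=50, max_depth=3,
                                learning_rate=0.1, random_state=0)
gbt.fit(X, y)
pred = gbt.predict(X)
importances = gbt.feature_importances_  # one importance score per feature
```

The `feature_importances_` attribute is what enables the importance-based analysis described below; the hyperparameters shown are illustrative, not the tuned values of the paper.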

Fig. 2: Flowchart of the density fingerprint-based machine learning model.
figure 2

In this work, we compute the density fingerprint of a given material structure for each atom set. The second panel of the flowchart illustrates balls of different radii centered at the iodine atoms of the material. The 2D perovskite structures are represented as vectors of density fingerprint curves (ψn, n = 0, 1, 2, …), which serve as features for the machine learning models. In this work, a gradient boosting tree (GBT) model is applied to the DF-based features to predict material bandgaps. Perovskite structures were visualized using the VESTA software120.

We utilize standard statistical scores to evaluate our GDA-GBT models. Four indices are used for error estimation: the coefficient of determination (COD), the Pearson correlation coefficient (PCC), the mean absolute error (MAE), and the root-mean-square error (RMSE), of which COD and MAE are the main targets considered in the present literature66,67,80. To ensure a reliable comparison, we apply 5-fold cross-validation to each experiment and record the average scores. Results and comparisons are shown in Table 1. As shown in Table 1, our GDA-GBT model exhibits superior performance compared to most other models, highlighted by a COD of 0.9101 and an MAE of 0.0695 (eV). Furthermore, our GBT model utilizing importance-based DF features (GDA(IF)-GBT) achieves even better results, as evidenced by a COD of 0.9157 and an MAE of 0.0681 (eV). Further comparison and analysis are given in the following section and Supplementary Notes 1–3.
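The four evaluation indices can be computed directly from predictions; this helper is a generic sketch, not the paper's evaluation code:

```python
import numpy as np

def scores(y_true, y_pred):
    """COD (R^2), PCC, MAE, and RMSE for a regression model."""
    y_true = np.asarray(y_true, float)
    y_pred = np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    cod = 1.0 - ss_res / ss_tot                      # coefficient of determination
    pcc = np.corrcoef(y_true, y_pred)[0, 1]          # Pearson correlation
    mae = np.mean(np.abs(y_true - y_pred))           # mean absolute error
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # root-mean-square error
    return cod, pcc, mae, rmse

# Example: perfect predictions give COD = PCC = 1 and MAE = RMSE = 0
cod, pcc, mae, rmse = scores([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```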

Table 1 Model performances

Extensive comparison with state-of-the-art models

We carried out an extensive comparison of our GDA-GBT model with the state-of-the-art (SOTA) models for 2D perovskite bandgap prediction, including smooth overlap of atomic position (SOAP) kernel-based ML models (SOAP-based KRR67 and MLM166) and GNN models, such as GCN75, ECCN76, CGCNN21, TFGNN106, SIGNNA80, and SIGNNAc80. In particular, SIGNNAc is an enhanced version of SIGNNA that incorporates additional chemical information and machine-learning feature vectors within the SIGNNA architecture80. In comparison to SIGNNA, SIGNNAc achieves improved results with a COD of 0.9270 and an MAE of 0.0830 (eV). Table 1 provides a comprehensive performance comparison. It is important to highlight that all the models were evaluated using 5-fold cross-validation. Notably, since the current NMSE database comprises more than 624 materials with DFT-based bandgap values, we drew five random subsamples of 624 materials each and applied the same 5-fold cross-validation procedure to each. The reported results are the average scores over these runs.
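The repeated-subsample cross-validation protocol can be sketched as follows; `repeated_subsample_cv` is a hypothetical helper, and a mean predictor stands in for the GBT model to keep the sketch short:

```python
import numpy as np
from sklearn.model_selection import KFold

def repeated_subsample_cv(X, y, subsample_size, n_repeats=5, n_splits=5, seed=0):
    """Average test MAE over random subsamples, each evaluated by k-fold CV."""
    rng = np.random.default_rng(seed)
    maes = []
    for rep in range(n_repeats):
        # Draw a random subsample without replacement
        idx = rng.choice(len(y), size=subsample_size, replace=False)
        Xs, ys = X[idx], y[idx]
        kf = KFold(n_splits=n_splits, shuffle=True, random_state=rep)
        for train, test in kf.split(Xs):
            # Placeholder model: predict the training-set mean
            pred = np.full(len(test), ys[train].mean())
            maes.append(np.mean(np.abs(ys[test] - pred)))
    return float(np.mean(maes))

rng = np.random.default_rng(1)
X, y = rng.random((100, 8)), rng.random(100)
avg_mae = repeated_subsample_cv(X, y, subsample_size=50)
```

In the actual experiments the placeholder would be replaced by the fitted GBT model inside each fold.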

Our GDA-GBT model can outperform molecular feature-based ML models. Recently, the SOAP descriptor-based GBT model, known as SOAP-MLM166, has been used for 2D perovskite bandgap prediction; it has a COD ≈ 0.90 and an MAE ≈ 0.103 (eV), as shown in Table 1. Further, the SOAP descriptor has been combined with an autoencoder network107 and kernel ridge regression (KRR)108 for feature reduction and bandgap prediction67, in particular for 3D perovskite datasets67,109,110. However, the SOAP-KRR model performs worse than MLM1 on the NMSE database.

Our GDA-GBT model can also outperform various GNN models. Recently, GNN-based models have demonstrated good performance on 3D perovskite datasets22,23,67,102. We have systematically compared our model with GNN-based methods on the NMSE database, including CGCNN67, GCN75, ECCN76, TFGNN106, SIGNNA80, and SIGNNAc80. From Table 1, it can be seen that our GDA-GBT generally performs better. In general, fingerprint-based models outperform these GNN models, which may be due to the limited training data size and the more complicated structures (of the unit cell).

Our GDA-GBT model remains the best model under different training datasets. Since the NMSE database is continually updated, the models presented in Table 1 are trained with different sizes of perovskite data. Note that the size of the dataset used in each model is presented in Table 1. In general, three different sizes, 445, 515, and 624, are used. For a fair comparison, we have also trained our GDA-GBT models on these three datasets. The results are presented in Supplementary Table 1. In general, our GDA-GBT model still outperforms existing models on all these datasets. A detailed discussion can be found in Supplementary Note 1.

Influence of selected features on GDA-GBT models

A careful analysis of the feature importance scores can provide a better understanding of the learning models. Here we consider two different models built from selected features of our GDA fingerprints. We first consider the halide atom-based density fingerprint (HF) combined with the GBT model, termed the GDA(HF)-GBT model. We find that the GDA(HF)-GBT model yields a COD of approximately 0.86 and an MAE of around 0.094 (eV), which is better than the performance of all the GNN-based models except SIGNNA. A detailed discussion of the GDA(HF)-GBT model can be found in Method and Supplementary Note 2.

Secondly, GDA importance-based density fingerprints, i.e., GDA(IF), are also considered. The central idea is to retain only the GDA features identified as highly significant by GBT-based importance analysis, as illustrated in Supplementary Fig. 1. Specifically, we extracted the top 1000 most important features from the GDA set to create a streamlined input feature vector for the GBT model. This reduces feature dimensionality and mitigates the challenges associated with the curse of dimensionality in machine learning models111. Examining the results in Table 1, it is evident that the GDA(IF)-GBT model, incorporating these crucial features, performs slightly better than the GDA-GBT model.
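The importance-based feature selection can be sketched with scikit-learn; `top_k_features` is a hypothetical helper, and the GBT settings are illustrative rather than the paper's tuned values:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def top_k_features(X, y, k, seed=0):
    """Fit a GBT once, rank features by importance, keep the top-k columns."""
    gbt = GradientBoostingRegressor(n_estimators=50, random_state=seed)
    gbt.fit(X, y)
    keep = np.argsort(gbt.feature_importances_)[::-1][:k]  # most important first
    return X[:, keep], keep

# Toy data where the target depends mostly on feature 0
rng = np.random.default_rng(0)
X = rng.random((80, 40))
y = 2.0 * X[:, 0] + 0.1 * rng.random(80)
X_reduced, keep = top_k_features(X, y, k=10)
```

A second GBT model is then trained on the reduced matrix, mirroring the GDA(IF)-GBT setup.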

An illustration of the predicted values and error distributions can be found in Supplementary Fig. 2. We have also systematically studied various element-specific models with different site and atom combinations; the results can be found in Supplementary Table 2 and Supplementary Fig. 2. A detailed discussion can be found in Supplementary Note 3.

Influence of Exp/DFT-based bandgap labels on GDA-GBT models

As a further comparison to the findings presented in Table 1, which are primarily rooted in DFT bandgap data, we also conducted a comprehensive evaluation of our GDA-GBT and GDA(IF)-GBT models against both DFT-based and Exp-based bandgap labels for 2D perovskites. To be specific, our analysis involved 716 compounds with DFT-based bandgap values and 235 compounds with Exp-based bandgap values, as supported by the unit cell representation from the pymatgen package112. This assessment consisted of two distinct experiments. Firstly, we performed five repetitions of 5-fold cross-validation on the subset of 235 compounds with experimental bandgap data to calculate the normalized mean squared error (NMSE). Secondly, we trained the GDA-GBT model on the DFT-based dataset and tested it on the Exp-based dataset. The outcomes of these experiments are presented in Table 2.

Table 2 GDA-GBT performances on Exp/DFT-based datasets

It is noteworthy that while the GDA(IF) features exhibit a slight performance edge over the full GDA features on DFT-based labels, the full GDA features in turn perform slightly better than the importance-based features on experimental labels. This observation suggests that the entire set of 4500 features may offer additional insights for predicting experimental bandgaps. This, in turn, opens up a promising avenue for future research on the relationship between GDA descriptors and the experimental bandgaps of materials.

It can be seen that both results are inferior to those obtained using only DFT datasets. For the 5-fold cross-validation on the Exp subset, the inferior performance may be due to the relatively small data size, which is less than one-third of the DFT-based subset. Training on DFT and testing on Exp performs worst, with MAE and RMSE roughly twice as large as those of the DFT-only model. This may indicate a discrepancy between DFT calculations and experimental results, and it underscores the importance of incorporating Exp labels to boost model performance.

Prediction of bandgaps for new 2D materials

To investigate the predictive ability of our models, we studied the 2D perovskites in the NMSE database without bandgap labels, i.e., for which only experimental 3D structures are available. More specifically, materials from the NMSE databank with material indices 110, 309, 315, 363, and 452 are used in this study. To evaluate the performance, we systematically computed their DFT bandgaps and compared them with the predictions of our proposed GDA models. The five structures are illustrated in Supplementary Fig. 3 and the detailed DFT settings are discussed in Supplementary Note 4.

Figure 3 showcases the calculated and predicted bandgap values for five distinct compounds, with the order (110, 315, 452, 309, 363) corresponding to the ascending order of the calculated DFT bandgaps. Specifically, treating the calculated DFT bandgaps as test labels, we computed the MAE values for GDA-GBT and GDA(IF)-GBT, obtaining 0.4179 eV and 0.4282 eV, respectively. While these MAE values are notably higher than those in Tables 1 and 2, this disparity may have two major causes. First, the current dataset is still very small, with only 849 data points, very few of which fall within the range of 0 to 2 eV. Second, DFT models with different parameter settings can yield hugely different 2D bandgap values, likely because of the highly complicated 2D perovskite structures. Furthermore, with only minor deviations, as demonstrated in Fig. 3 and elaborated in Supplementary Table 3, the proposed models consistently preserve the ascending order of the DFT labels. In summary, the GDA-GBT framework exhibits potential for accurately representing crystalline structures and delivering stable bandgap predictions. Additional material details, calculated DFT bandgap values, and prediction data can be found in Supplementary Table 3 in Supplementary Note 4.

Fig. 3: Prediction of bandgaps of new 2D materials using GDA-GBT and GDA(IF)-GBT models.
figure 3

Materials are from NMSE but without bandgap labels. Using the DFT model, we evaluated their bandgap values (marked in green). The predicted bandgaps (eV) are marked in blue. The x-axes represent the material indices (110, 315, 452, 309, 363). The y-axes denote bandgap ranges in eV. The MAE values for GDA-GBT and GDA(IF)-GBT predictions are 0.4179 (eV) and 0.4282 (eV), respectively.

Methods

Material mathematical representation

As a cornerstone of physical and machine learning models, the mathematical representation of molecular data is now a major research target in computational materials science. In particular, graphs, networks, and simplicial complexes encode pairwise or higher-order interactions between or among atoms in molecules and have become one of the most important ML frameworks for predicting molecular properties81,82. This section introduces the mathematical representation of material data, in particular the density fingerprint.

Motif and unit cell

The unit cell is a key concept in describing the crystal structure of a material. It refers to a repeating local atomic system that is mathematically defined as a parallelepiped spanned by a basis \({{{{{{{\mathcal{B}}}}}}}}=\{{{{{{{{{\bf{v}}}}}}}}}_{1},{{{{{{{{\bf{v}}}}}}}}}_{2},{{{{{{{{\bf{v}}}}}}}}}_{3}\}\) in \({{\mathbb{R}}}^{3}\). The induced lattice \({{{{{{{\mathcal{L}}}}}}}}\) is the set of all integral linear combinations of \({{{{{{{\mathcal{B}}}}}}}}\). In other words, the unit cell U is defined as \(U=\{{c}_{1}{{{{{{{{\bf{v}}}}}}}}}_{1}+{c}_{2}{{{{{{{{\bf{v}}}}}}}}}_{2}+{c}_{3}{{{{{{{{\bf{v}}}}}}}}}_{3}\,| \,{c}_{1},{c}_{2},{c}_{3}\in \left[0,1\right)\}\) and the lattice \({{{{{{{\mathcal{L}}}}}}}}\) is defined as \({{{{{{{\mathcal{L}}}}}}}}=\{{c}_{1}{{{{{{{{\bf{v}}}}}}}}}_{1}+{c}_{2}{{{{{{{{\bf{v}}}}}}}}}_{2}+{c}_{3}{{{{{{{{\bf{v}}}}}}}}}_{3}\,| \,{c}_{1},{c}_{2},{c}_{3}\in {\mathbb{Z}}\}\). A motif M is a non-empty subset of U, and the periodic point set S is defined as the Minkowski sum of \({{{{{{{\mathcal{L}}}}}}}}\) and M, i.e., \(S=M+{{{{{{{\mathcal{L}}}}}}}}=\{{{{{{{{\bf{u}}}}}}}}+{{{{{{{\bf{v}}}}}}}}\,| \,{{{{{{{\bf{u}}}}}}}}\in M,{{{{{{{\bf{v}}}}}}}}\in {{{{{{{\mathcal{L}}}}}}}}\}\). The motif can be viewed as a local atomic system with respect to a coordinate system or a unit cell in \({{\mathbb{R}}}^{3}\), and the associated periodic point set of the motif forms the entire material structure.
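The definitions above can be made concrete in a few lines; `periodic_points` is a hypothetical helper that enumerates a finite patch of the periodic point set \(S=M+{{{{{{{\mathcal{L}}}}}}}}\):

```python
import itertools
import numpy as np

def periodic_points(basis, motif, n=1):
    """Points of S = M + L for lattice coefficients c_i in {-n, ..., n}."""
    basis = np.asarray(basis, float)   # rows are the basis vectors v1, v2, v3
    motif = np.asarray(motif, float)
    pts = []
    for c in itertools.product(range(-n, n + 1), repeat=3):
        shift = np.asarray(c, float) @ basis   # c1*v1 + c2*v2 + c3*v3
        for m in motif:
            pts.append(m + shift)
    return np.array(pts)

# Cubic lattice with a single-atom motif at the origin (the Figure 1 setting)
S_patch = periodic_points(np.eye(3), [[0.0, 0.0, 0.0]], n=1)
```

The full periodic set is infinite; in practice only the translates whose atoms can influence the unit cell are needed.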

Material descriptors and fingerprints

Traditional material descriptors and fingerprints

Perovskite descriptors provide general insights into the structural and physical properties of perovskite crystals and have been widely used in QSAR/QSPR models. Traditional molecular descriptors for materials include atomic radius, ionic radius, orbital radius, tolerance factor, octahedral factor, packing factor, crystal structure measurements, surface/volume features, and physical properties such as ionization potential, ionic polarizability, electron affinity, Pauling electronegativity, valence orbital radii, and HOMO and LUMO113,114,115. Recently, a series of material fingerprints have been proposed, including the Coulomb matrix62, the sine Coulomb matrix116, Ewald sum matrices116, the many-body tensor representation (MBTR)63, atom-centered symmetry functions (ACSF)65, and the smooth overlap of atomic positions (SOAP)64.

TDA-based material fingerprints

Compared to traditional perovskite descriptors and fingerprints, TDA-based perovskite models have two great advantages, i.e., intrinsic representation and multiscale modeling81,82. More specifically, TDA uses topological invariants, i.e., Betti numbers, to characterize the intrinsic and fundamental structural and interactional properties. With their high abstraction and intrinsic information, TDA-based features can have better transferability and interpretability for learning models. Furthermore, molecular systems exhibit a plethora of interactions at different scales, encompassing covalent bonds, hydrogen bonds, van der Waals interactions, electrostatic interactions, etc. Through a filtration process, TDA can comprehensively characterize all these interactions on an equal footing. This provides a multiscale model for the characterization of perovskite systems81,82. Recently, TDA-based models have demonstrated great performance in material data analysis32,89,91,102,117.

GDA-based material fingerprint

Different from other molecular systems, materials carry unique periodic information at the atomic level, which plays a significant role in material functions and properties. However, traditional learning models usually ignore this information or consider it only implicitly through periodic graphs or supercells. To explicitly address the periodic information, we consider the newly proposed multiscale feature representation model, called the density fingerprint, which was originally introduced for the study of periodic crystalline structures in Euclidean spaces100. The key advantage of DF lies in its ability to provide an efficient multiscale geometric representation of both cell structures and periodic information. DF offers intuitive and explainable features for periodic crystalline data due to its geometric invariance and stability (or continuity). It does not require cell extension and can be computed on any unit cell, independent of the choice of motif and unit cell.

Mathematical background for density fingerprint

Given a periodic set \(S\subseteq {{\mathbb{R}}}^{3}\) and t ≥ 0, let S(t) = {B(x, t) ∣ x ∈ S} denote the family of all closed balls centered at points in S with radius t. The set \({\bigcup }^{k}S(t)=\{{{{{{{{\bf{x}}}}}}}}\in {{\mathbb{R}}}^{3}| {{{{{{{\bf{x}}}}}}}}\,\,{{\mbox{belongs to at least}}}\,\,k\,\,{{\mbox{balls in}}}\,\,S(t)\}\) is defined as the k-fold cover of S(t). This concept captures the interaction behavior of neighborhoods over the set S. The fractional volume of the k-fold cover of S is defined as the probability

$${\varphi }_{k}^{S}(t) ={{{{{{{\rm{Prob}}}}}}}}({{{{{{{\bf{x}}}}}}}}\in {{\mathbb{R}}}^{3}\,\,{{\mbox{belongs to at least}}}\,\,k\,\,{{\mbox{members in}}}\,\,S(t)) \\ =\frac{{{{{{{{\rm{vol}}}}}}}}(U\cap {\bigcup }^{k}S(t))}{{{{{{{{\rm{vol}}}}}}}}(U)}.$$
(1)

The k-th density function (cf.100,101) of a periodic set S is a function \({\psi }_{k}^{S}:[0,\infty )\to [0,1]\) defined as follows:

$${\psi }_{k}^{S}(t) ={{{{{{{\rm{Prob}}}}}}}}({{{{{{{\bf{x}}}}}}}}\in {{\mathbb{R}}}^{3}\,\,{{\mbox{belongs to exactly}}}\,\,k\,\,{{\mbox{members in}}}\,\,S(t))\\ =\frac{{{{{{{{\rm{vol}}}}}}}}\left(U\cap \left({\bigcup }^{k}S(t)-{\bigcup }^{k+1}S(t)\right)\right)}{{{{{{{{\rm{vol}}}}}}}}(U)}={\varphi }_{k}^{S}(t)-{\varphi }_{k+1}^{S}(t).$$
(2)

The density fingerprint of S is defined as the sequence \(\Psi (S)={({\psi }_{k}^{S})}_{k = 0}^{\infty }\) of density functions associated with S. By selecting a finite range t ∈ [0, T] and a finite K, one can consider each curve \({\psi }_{k}^{S}:[0,T]\to [0,1]\) (k = 0, 1, …, K) in the sequence as a feature vector of the periodic set S. These feature vectors can then be used to train machine learning models.

Figure 1 shows an example of the density fingerprints \({\psi }_{0}^{S},{\psi }_{1}^{S},{\psi }_{2}^{S}\) defined on the interval [0, 1]. In this case, the set S is the Minkowski sum \(M+{{{{{{{\mathcal{L}}}}}}}}\), where M = {(0, 0, 0)} and \({{{{{{{\mathcal{L}}}}}}}}\) is the lattice spanned by the standard basis e1 = (1, 0, 0), e2 = (0, 1, 0), and e3 = (0, 0, 1). In particular, the unit cell U and the motif M are the sets \({\left[0,1\right)}^{3}\) and {(0, 0, 0)}, respectively, and all the vertices of the closed cube \(\overline{U}={[0,1]}^{3}\) belong to the periodic set S.
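The density functions of this cubic example can be checked numerically by Monte Carlo sampling; this sketch estimates the volumes stochastically, unlike the exact geometric computation of the DF algorithm, and `density_functions` is a hypothetical helper valid only for t ≤ 1:

```python
import itertools
import numpy as np

def density_functions(t, n_samples=100_000, k_max=3, seed=0):
    """Monte Carlo estimate of psi_k(t) for S = Z^3 with unit cell U = [0,1)^3.

    For t <= 1, only lattice points with coordinates in {-1, 0, 1, 2} can
    reach a ball of radius t into U, so those 64 centers suffice."""
    rng = np.random.default_rng(seed)
    x = rng.random((n_samples, 3))            # uniform sample points in U
    counts = np.zeros(n_samples, dtype=int)   # balls covering each sample
    for c in itertools.product((-1, 0, 1, 2), repeat=3):
        counts += ((x - np.asarray(c, float)) ** 2).sum(axis=1) <= t * t
    # psi_k(t) = fraction of U covered by exactly k balls of radius t
    return [float(np.mean(counts == k)) for k in range(k_max + 1)]

psi = density_functions(0.5)
# At t = 0.5 the corner spheres just touch, so psi_1(0.5) equals the volume
# of one sphere, pi/6, and psi_2(0.5) is zero.
```

This matches the ψ1 curve of Figure 1 reaching π/6 ≈ 0.524 at t = 0.5, the radius where neighboring spheres first meet.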

Through the explicit construction of the neighborhoods S(t) in \(\mathbb{R}^{3}\), the DF offers a geometric perspective that seamlessly integrates various unit-cell properties, including angles, diameters, shapes, distortions, and overall periodicity. Mathematically, the DF enjoys invariance and stability properties that guarantee the uniqueness of its representation of unit-cell structures: the same material, even under different unit-cell models, has exactly the same DF, while a slight perturbation of the material structure (i.e., of the atom positions) results in only a correspondingly small change of the DF, as demonstrated in Supplementary Figs. 4–5. A detailed discussion of the invariance and stability properties of the DF can be found in Supplementary Note 5.

Element-specific density fingerprints

Element-specific density fingerprints capture the geometry and topology of particular atomic subsystems or their combinations. This approach has shown high performance for biomolecules84,118 and 2D perovskites67,102. Here we consider DFs generated not only from the entire structure but also from different site combinations and atom types. More specifically, the site combinations include ACB, ACX, ACBX, BX, B, and X. A-site atom types include C, H, O, and N. B-site atom types include Bi, Cd, Ge, Pb, and Sn, which are the main inorganic and metal atoms of the materials with DFT-based bandgaps in NMSE. X-site atom types include Cl, Br, and I. We call a DF generated from a certain atom type or site combination an element-specific DF. Our GDA-based perovskite fingerprint contains a series of these element-specific DFs.
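Operationally, each element-specific DF is computed on a sub-structure obtained by keeping only the atoms of the chosen site combination or atom type. A minimal sketch of this filtering step, with an illustrative (not real) motif and coordinates of our own choosing:

```python
import numpy as np

def filter_substructure(symbols, frac_coords, keep):
    """Return the sub-structure used for one element-specific DF.

    `keep` is a set of chemical symbols, e.g. {"Pb", "I"} for the BX
    combination of a lead-iodide perovskite, or {"I"} for an X-only DF.
    """
    mask = np.array([s in keep for s in symbols])
    return [s for s, m in zip(symbols, mask) if m], frac_coords[mask]

# Toy MAPbI3-like motif (symbols and fractional coordinates are illustrative).
symbols = ["C", "N", "Pb", "I", "I", "I"]
coords = np.array([[0.4, 0.5, 0.5], [0.6, 0.5, 0.5],
                   [0.0, 0.0, 0.0], [0.5, 0.0, 0.0],
                   [0.0, 0.5, 0.0], [0.0, 0.0, 0.5]])

# BX combination: keep only the metal and halide atoms.
bx_symbols, bx_coords = filter_substructure(symbols, coords, {"Pb", "I"})
```

The retained sub-structure (together with the original lattice) is then fed to the DF computation in place of the full periodic set.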

Furthermore, different combinations of element-specific DFs are considered in our GDA(HF) and GDA(IF) models. Note that GDA(HF) considers only the halide-atom-based density fingerprint (HF), while GDA(IF) uses an importance-based density fingerprint (IF). In particular, our GDA(IF) model selects element-specific DFs based on the importance-analysis results shown in Supplementary Fig. 1. Computationally, we calculate each DF over the interval [0, 12] in angstroms (Å), uniformly divided into 50 grid points. In our experiment, each DF contains five 1D density functions, ψ0, ψ1, ψ2, ψ3, and ψ4, so its size is 5 × 50 = 250, and with 18 element-specific DFs the total dimension of the input vector of GDA-GBT is 18 × 250 = 4500.
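The dimension bookkeeping above (50 grid points × 5 density functions per DF, 18 element-specific DFs in total) can be sketched as a simple feature-assembly step; the function names below are our own:

```python
import numpy as np

def df_feature_vector(psi_curves, n_grid=50):
    """Flatten one element-specific DF into a 250-dimensional vector.

    psi_curves: array of shape (5, n_grid) holding psi_0, ..., psi_4
    sampled on 50 uniform grid points over [0, 12] angstroms.
    """
    psi_curves = np.asarray(psi_curves)
    assert psi_curves.shape == (5, n_grid)
    return psi_curves.reshape(-1)  # 5 * 50 = 250 entries

def perovskite_fingerprint(all_dfs):
    """Concatenate the element-specific DFs into one GDA input vector."""
    return np.concatenate([df_feature_vector(c) for c in all_dfs])

# 18 element-specific DFs (6 site combinations + 4 A-site + 5 B-site
# + 3 X-site atom types) give an 18 * 250 = 4500-dimensional input.
fingerprint = perovskite_fingerprint(np.zeros((18, 5, 50)))
```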

GDA-GBT and GNN models for 2D perovskites

In our GBT model, we utilized 10,000 estimators with a maximal depth of 7 to analyze the input DF features and used the squared error as the loss metric. The minimum number of samples required to split a node is 2, the learning rate is 0.001, and the subsampling rate in the training process is 0.7. Note that we have not systematically optimized these parameters; instead, we adopted the basic setting, with minor revisions, from the algebraic graph-assisted bidirectional transformer model119. For the input feature vector, we employ the function \(f(x)=10^{x}\) to rescale the curves ψ0, ψ1, …, ψ4 from the range [0, 1] to [1, 10]. This rescaling accentuates the distinctions between the curves, thereby facilitating their classification.
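The stated hyperparameters match the interface of scikit-learn's `GradientBoostingRegressor`, which we use here as a stand-in (the text does not specify the library); the random training data is purely illustrative:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def rescale(features):
    """Map each DF value from [0, 1] to [1, 10] via f(x) = 10**x."""
    return np.power(10.0, features)

# Hyperparameters quoted in the text; not systematically optimized.
gbt = GradientBoostingRegressor(
    n_estimators=10000,
    max_depth=7,
    loss="squared_error",
    min_samples_split=2,
    learning_rate=0.001,
    subsample=0.7,
)

# Random stand-in for the 4500-dimensional DF features of 32 structures.
rng = np.random.default_rng(0)
X = rescale(rng.random((32, 4500)))
y = rng.random(32)
# gbt.fit(X, y)  # fitting 10,000 estimators is slow; shown for shape only
```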

Among all the GNN models for 2D perovskites, the SIGNNA model80 has the best performance. In this model, perovskite structures are divided into sub-structures according to physical properties (for example, the organic and inorganic atomic sub-systems). A different GNN is employed for each sub-structure, and the learned latent vectors are combined as the input to a multi-layer neural network. These sub-structure networks extract features with heterogeneous properties and thereby achieve better performance in predicting the physical and chemical properties of perovskite structures, which is consistent with the results of our element-specific models.
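The fusion step described above (per-sub-structure encoders whose latent vectors are concatenated and passed to a multi-layer network) can be sketched schematically. The placeholder `encode` below is not a GNN; it stands in for the message-passing encoder, and all weights are random, so this illustrates only the data flow, not SIGNNA itself:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(substructure_features, dim=16):
    """Placeholder for a sub-structure GNN encoder: a random linear map
    followed by mean pooling over atoms (illustration of shapes only)."""
    w = rng.standard_normal((substructure_features.shape[1], dim))
    return np.tanh(substructure_features @ w).mean(axis=0)

def combine_and_predict(latents, hidden=32):
    """Concatenate sub-structure latent vectors and pass them through a
    small multi-layer network, mirroring the fusion step."""
    z = np.concatenate(latents)
    w1 = rng.standard_normal((z.size, hidden)) / np.sqrt(z.size)
    w2 = rng.standard_normal((hidden, 1)) / np.sqrt(hidden)
    return float((np.maximum(z @ w1, 0.0) @ w2)[0])  # ReLU MLP -> scalar

# Organic and inorganic sub-systems with different atom counts.
organic = rng.random((8, 4))    # e.g., 8 atoms, 4 features each
inorganic = rng.random((5, 4))
prediction = combine_and_predict([encode(organic), encode(inorganic)])
```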

The size of the training database influences the performance of AI models. Because the NMSE database is continuously updated, the models in Table 1 use different data sizes. More precisely, the MLM1 model utilizes 515 compounds66, the SOAP-based KRR uses 445 compounds67, and all the GNN models (GCN75, ECCN76, CGCNN21, TFGNN106, and SIGNNA80) use 624 data points. To ensure a fair comparison with all existing SOTA models21,66,67, we also trained our GDA-GBT models on the same datasets as the previous models. Supplementary Table 1 shows the performance of GDA-GBT with different data sizes; our GDA-GBT still outperforms all previous models on their respective datasets.