Introduction

Materials innovation and design present a multifaceted approach to tackling humanity’s challenges of sustainable development and climate change1,2,3,4. Recently, halide perovskites have been widely recognized as remarkable semiconductor materials for optoelectronics. Within a decade, halide perovskites have demonstrated great potential to disrupt the photovoltaics and light-emitting diode (LED) industries5,6,7. Power conversion efficiencies exceeding 26% have been achieved by single-junction perovskite solar cells, and an even higher efficiency of 33% has been reported for perovskite/silicon tandem solar cells8,9, while external quantum efficiencies exceeding 23% have been demonstrated for perovskite LEDs7. Of late, two-dimensional (2D) perovskites, with their huge structural diversity, are emerging candidates for the next generation of perovskite materials (a.k.a. Perovskite 2.0)5,6,10,11,12,13,14,15. These layered materials afford even higher stability and greater tunability of their optoelectronic properties than their three-dimensional (3D) counterparts. However, the discovery and design of new materials is a complicated and challenging process. In fact, materials composed of even very simple elements can easily end up in a formidably huge chemical space, making traditional trial-and-error-based materials discovery not only time-consuming but also extremely expensive16.

Recently, data-driven materials informatics has manifested its huge potential in materials design and discovery17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34. Materials informatics17,18,19,20, the integration of computational materials science with machine learning (ML) technologies, represents the fourth paradigm of materials science35. A major milestone of artificial intelligence (AI)-based materials informatics is the Materials Genome Initiative (MGI)3, launched in 2011 in the United States with the goals of establishing large-scale collaboration between materials scientists and computer scientists and applying machine learning models to predict, screen, and optimize materials at an unparalleled scale and rate. The MGI has promoted the creation of various databanks, such as the Materials Project36, JARVIS37, NOMAD38, Aflowlib39, and OQMD40. Together with other well-established materials databanks, they provide a solid foundation for data-driven AI models for material property prediction and new material design and discovery. Currently, AI models have been widely used in the analysis of various perovskite data, including single perovskites21,41,42,43,44,45, double perovskites46,47,48,49, Pb-free perovskites50,51,52,53, and perovskite solar cell devices54,55,56,57,58.

Mathematically, data-driven material AI models can be classified into two categories, i.e., featurization-based machine learning models and end-to-end deep learning models. Featurization, or feature engineering, seeks to develop a set of structural, physical, chemical, or biological features that characterize the intrinsic information of a material28,59,60,61. These features are known as molecular descriptors or fingerprints. Traditional molecular descriptors for materials include atomic radius, ionic radius, orbital radius, tolerance factor, octahedral factor, packing factor, crystal structure measurements, surface/volume features, and physical properties such as ionization potential, ionic polarizability, electron affinity, Pauling electronegativity, valence orbital radii, and HOMO and LUMO. A molecular fingerprint is a long vector composed of a series of systematically generated features, usually derived from molecular structures. Widely used material fingerprints include the Coulomb matrix29, Ewald sum matrices62, the many-body tensor representation (MBTR)63, the smooth overlap of atomic positions (SOAP)64, and atom-centered symmetry functions (ACSF)65. These molecular descriptors or fingerprints are used as input features for machine learning models and have been applied in various material property predictions, in particular 2D perovskite bandgap prediction66,67.
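As a concrete illustration of such a fingerprint, the Coulomb matrix can be computed in a few lines. The atomic numbers and coordinates below are hypothetical toy values, and the eigenvalue-spectrum step at the end is one common (but not the only) way to turn the matrix into a fixed-length, permutation-robust feature vector:

```python
import numpy as np

def coulomb_matrix(Z, R):
    """Coulomb matrix: M_ii = 0.5 * Z_i^2.4, M_ij = Z_i * Z_j / |R_i - R_j|."""
    n = len(Z)
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i, j] = 0.5 * Z[i] ** 2.4
            else:
                M[i, j] = Z[i] * Z[j] / np.linalg.norm(R[i] - R[j])
    return M

# Toy two-atom system (hypothetical coordinates in angstroms)
Z = np.array([82, 53])                 # Pb, I
R = np.array([[0.0, 0.0, 0.0],
              [3.2, 0.0, 0.0]])
M = coulomb_matrix(Z, R)
# Sorted eigenvalue spectrum: a fixed-length, atom-order-independent feature
fingerprint = np.sort(np.linalg.eigvalsh(M))[::-1]
```

Because the eigenvalues are invariant under permutations of the atom ordering, two descriptions of the same structure yield the same vector.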

Deep learning models, in particular geometric deep learning models, are free from hand-crafted features and provide an end-to-end way to directly learn materials’ properties. Geometric deep learning has been proposed for the analysis of non-Euclidean data, such as graphs, networks, and manifolds, by incorporating geometric and topological information of the data into deep learning architectures68,69,70,71. Among geometric deep learning models are graph neural networks (GNNs), which have been widely used in node classification, link prediction, graph classification, and property prediction72,73,74,75. Recently, various GNN models have been developed for material property analysis, such as SchNet45, crystal graph convolutional neural networks (CGCNN)21, materials graph network (MEGNet)22, improved crystal graph convolutional neural networks (iCGCNN)76, global attention-based graph neural network (GATGNN)77, crystal graph attention network (CYATT)24, compositionally restricted attention-based network (CrabNet)25, neural equivariant interatomic potentials (NequIP)26, atomistic line graph neural network (ALIGNN)78, Matformer79, MatDeepLearn27, and SIGNNA80. In general, materials-based GNN models can be classified into two categories: potential-energy-based GNNs and chemical-composition-based GNNs. The potential-energy-based GNNs, which include SchNet45, MEGNet22, and NequIP26, use specially designed GNN architectures to approximate potential energy and use it to predict material properties. The chemical-composition-based GNNs, including CYATT24 and CrabNet25, construct deep learning architectures directly on chemical formulas or chemical structures.

Despite the rapid progress in material AI models, the efficient representation and characterization of material structures, in particular of material periodic information, still remains a major challenge32,33,34. Recently, topological data analysis (TDA) has been used in the characterization of molecular structures81,82. Different from traditional models, TDA models use advanced intrinsic mathematical invariants from algebraic topology as molecular fingerprints81,82. TDA-based learning models have proven their advantages in various steps of drug design81,82,83,84,85,86,87,88, as well as in materials design and discovery89,90,91,92,93. Motivated by the success of TDA, geometric data analysis (GDA) has been developed to study the intrinsic geometric properties of data, including discrete curvatures94,95,96,97,98, flows99, and periodicity79,100,101. Persistent Ricci curvature-based learning models have demonstrated a great advantage over traditional descriptors for perovskite data analysis102. More recently, a density fingerprint (DF) model has been proposed to explicitly characterize crystal periodicity information100. The key idea of DF is to characterize the geometric distribution of atoms within a unit cell by the intersection patterns of neighboring atoms during a filtration process. The DF approach preserves the essential intrinsic structural information and is invariant under isometries while satisfying continuity and completeness.

Hitherto, to the best of our knowledge, no GDA-based machine learning model has been reported for 2D perovskite design. To fully characterize 2D perovskite structural information, we consider an element-specific representation, which decomposes the perovskite unit cell into a series of element-specific sets. Each set is composed of one or several types of atoms depending on site, inorganic/organic character, and halide anion type. The geometric and periodicity information of these element-specific structures is characterized by a series of density functions, i.e., a density fingerprint. The GDA-based material fingerprint is obtained by concatenating all the corresponding density fingerprints and is then combined with machine learning models, in particular the gradient boosting tree (GBT). Our GDA-GBT model is extensively validated and compared with state-of-the-art models on the 2D perovskite dataset sourced from the New Materials for Solar Energetics (NMSE) databank, where it outperforms all existing models.

Results and discussion

GDA-based material fingerprint

A density fingerprint, which is a series of 1D density functions of different dimensions, is a powerful tool that captures the periodic geometric information of crystals100. These functions characterize the pattern of neighboring atom spheres overlapping with one another during a filtration process100. More specifically, in the filtration process each atom is represented as a sphere, the family of spheres is denoted S(t), and the radius t increases continuously. The region covered by at least k spheres is denoted \({\bigcup }^{k}S(t)\), with \({\bigcup }^{0}S(t)\) denoting the entire region and \({\bigcup }^{1}S(t)\) denoting the region occupied by at least one sphere. Note that if a certain region is covered by k + 1 spheres, then it is also covered by k spheres, i.e., \({\bigcup }^{k+1}S(t)\subseteq {\bigcup }^{k}S(t)\). The region covered by exactly k spheres is defined as \({\bigcup }^{k}S(t)-{\bigcup }^{k+1}S(t)\). In general, the k-th density function is the ratio between the volume of the exact k-sphere overlapping region (within the unit cell) and the volume of the unit cell over the filtration process. Figure 1 shows the overlapping regions (within the unit cell) (a), the exact overlapping regions (within the unit cell) (b), and the 1D density functions of dimensions 0 to 2 (c). Here the unit cell is a cube composed of 8 atoms of the same type. Details of the construction of the DF on a 2D perovskite are presented in the Method section.

Fig. 1: Illustration of the generation of density fingerprint.
figure 1

In this example, the unit cell U is the set \({\left[0,1\right)}^{3}\) with the atoms centered at the vertices, and S(t) are atom spheres with radius t. Panel a: The overlapping region of k spheres (within the unit cell) is denoted as U ∩ kS(t) (k = 0, 1, 2, 3); Panel b: The overlapping region of exactly k spheres (within the unit cell) is \(U\cap ({\bigcup }^{k}S(t)-{\bigcup }^{k+1}S(t))\) (k = 0, 1, 2); Panel c: The corresponding density functions ψ0, ψ1, ψ2 on the interval [0, 1].

Element-specific DFs can provide a more detailed characterization of the intrinsic structural information of a material. Mathematically, the essential idea of element-specific representation is to systematically select sets of atoms or site combinations to characterize the heterogeneous interactions within the molecular structure81,82. For instance, it has been found that selecting carbon atoms characterizes the hydrophobic interaction network, whilst selecting all nitrogen and/or oxygen atoms characterizes the hydrophilic interaction network81. In our GDA model, we concatenate the DFs calculated from a series of atom sets systematically obtained from sites and their combinations (ACB, ACX, ACBX, BX, B, X), A-site atom types (C, H, O, N), B-site atom types (Bi, Cd, Ge, Pb, \({{{{{{{\rm{Sn}}}}}}}}\)), and X-site atom types, i.e., halide anions (Cl, Br, I). Note that the B-site and X-site encompass all inorganic and halide atoms of each material, and we use carbon (C) atoms as representatives of the A-site (the site of organic cations) to reduce computational complexity.
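The element-specific decomposition described above can be sketched as follows; the atom list, site table, and helper `select` are illustrative stand-ins, not the paper's implementation:

```python
# Group the atoms of a unit cell into element-specific sets (toy data).
atoms = [("Pb", (0.0, 0.0, 0.0)), ("I", (3.2, 0.0, 0.0)),
         ("I", (0.0, 3.2, 0.0)), ("C", (1.5, 1.5, 4.0))]

# Site assignment for the element types named in the text
SITE_OF = {"Pb": "B", "Sn": "B", "Bi": "B", "Cd": "B", "Ge": "B",
           "Cl": "X", "Br": "X", "I": "X",
           "C": "A", "H": "A", "O": "A", "N": "A"}

def select(atoms, sites):
    """Return coordinates of atoms whose site label is in `sites`."""
    return [xyz for element, xyz in atoms if SITE_OF[element] in sites]

bx_set = select(atoms, {"B", "X"})    # inorganic framework (B- and X-sites)
acb_set = select(atoms, {"A", "B"})   # A-site (C only) plus B-site
```

Each such subset is then fed separately into the density-fingerprint computation.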

GDA-based ML for 2D perovskite design

Database

Unlike for three-dimensional (3D) perovskites, there is a scarcity of databases that provide information on 2D perovskites. In 2021, the Laboratory of New Materials for Solar Energetics (NMSE) launched a crowd-sourcing project to collect 2D perovskites from a variety of experiments and literature sources66. This open-access database contains crystal structures with atomic coordinates and offers density functional theory (DFT)-based and experiment (Exp)-based bandgaps (in eV) as well as ML-based atomic partial charges. It is currently the most comprehensive database available for studying 2D perovskite crystals and is continuously updated by the NMSE. The current database accommodates 849 compounds, all possessing 3D structures. Within this collection, 753 compounds have DFT-based bandgap values (eV), while 238 compounds have experimental bandgap values; 235 materials have both types of bandgaps. The experiments in this study use various subsets drawn from these pools of 753 and 238 2D perovskites.

GDA-based ML for 2D perovskite bandgap prediction

Figure 2 shows a flowchart of our DF-based ML for 2D perovskite material design. Our approach involves several key steps. First, different types of atoms and their combinations within the unit cell, such as organic atoms (e.g., C), inorganic atoms (e.g., \({{{{{{{\rm{Sn}}}}}}}}\)), halide atoms (e.g., Cl), and site combinations (e.g., ACB and ACBX), are systematically extracted from the dataset. Then, we create a multiscale topological representation of the selected atomic system through a filtration process (i.e., a continuous increase of the atom radius). Based on the topological representation, we employ the DF algorithm to generate a set of 1D density functions, namely ψ0, ψ1, …, ψn, which are referred to as element-specific density fingerprints (see Method for more details). Finally, leveraging the extracted density fingerprints, we employ the gradient boosting tree (GBT) model to predict the material bandgaps. One benefit of the GBT model is its high accuracy and robustness103,104,105. Additionally, GBT provides feature importance scores, enabling a more interpretable analysis of the relationship between the input features and the prediction target.
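A minimal sketch of the final prediction step, assuming scikit-learn's `GradientBoostingRegressor` as a stand-in for whichever GBT implementation is used, with random vectors in place of real density fingerprints:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Stand-in data: random vectors in place of density fingerprints,
# random targets in place of bandgaps (eV).
rng = np.random.default_rng(0)
X = rng.random((120, 300))
y = rng.random(120) * 4.0

gbt = GradientBoostingRegressor(n_estimators=50, max_depth=3,
                                learning_rate=0.1, random_state=0)
gbt.fit(X, y)
pred = gbt.predict(X)
importances = gbt.feature_importances_  # one importance score per feature
```

The `feature_importances_` attribute is what enables the importance-based analysis described below; the hyperparameters shown are illustrative, not the tuned values of the paper.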

Fig. 2: Flowchart of the density fingerprint-based machine learning model.
figure 2

In this work, we compute the density fingerprint of a given material structure for each atom set. The second panel of the flowchart illustrates balls of different radii centered at the iodine atoms of the material. The 2D perovskite structures are represented as vectors of density fingerprint curves (ψn, n = 0, 1, 2, …), which serve as features for the machine learning models. In this work, a gradient boosting tree (GBT) model is applied to the DF-based features to predict material bandgaps. Perovskite structures were visualized using the VESTA software120.

We utilize standard statistical scores to evaluate our GDA-GBT models. Four indices are used for error estimation: the coefficient of determination (COD), the Pearson correlation coefficient (PCC), the mean absolute error (MAE), and the root-mean-square error (RMSE), of which COD and MAE are the main targets considered in the present literature66,67,80. To ensure a reliable comparison, we apply 5-fold cross-validation to each experiment and record the average scores. Results and comparisons are shown in Table 1. As shown in Table 1, our GDA-GBT model exhibits superior performance compared to most other models, highlighted by a COD of 0.9101 and an MAE of 0.0695 (eV). Furthermore, our GBT model utilizing importance-based DF features (GDA(IF)-GBT) achieves even better results, as evidenced by a COD of 0.9157 and an MAE of 0.0681 (eV). Further comparison and analysis are given in the following section and Supplementary Notes 1–3.
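The four evaluation indices can be computed directly from predictions; this helper is a generic sketch, not the paper's evaluation code:

```python
import numpy as np

def scores(y_true, y_pred):
    """COD (R^2), PCC, MAE, and RMSE for a regression model."""
    y_true = np.asarray(y_true, float)
    y_pred = np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    cod = 1.0 - ss_res / ss_tot                      # coefficient of determination
    pcc = np.corrcoef(y_true, y_pred)[0, 1]          # Pearson correlation
    mae = np.mean(np.abs(y_true - y_pred))           # mean absolute error
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # root-mean-square error
    return cod, pcc, mae, rmse

# Example: perfect predictions give COD = PCC = 1 and MAE = RMSE = 0
cod, pcc, mae, rmse = scores([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```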

Table 1 Model performances

Extensive comparison with state-of-the-art models

We carried out an extensive comparison of our GDA-GBT model with the state-of-the-art (SOTA) models for 2D perovskite bandgap prediction, including smooth overlap of atomic position (SOAP) kernel-based ML models (SOAP-based KRR67 and MLM166) and GNN models, such as GCN75, ECCN76, CGCNN21, TFGNN106, SIGNNA80, and SIGNNAc80. In particular, SIGNNAc is an enhanced version of SIGNNA that incorporates additional chemical information and machine-learning feature vectors within the SIGNNA architecture80. In comparison to SIGNNA, SIGNNAc achieves improved results with a COD of 0.9270 and an MAE of 0.0830 (eV). Table 1 provides a comprehensive performance comparison. It is important to highlight that all the models were evaluated using 5-fold cross-validation. Notably, since the current NMSE database comprises more than 624 materials with DFT-based bandgap values, we drew five random subsamples of 624 materials each and applied the same 5-fold cross-validation procedure to each. The reported results are the average scores over these runs.
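The repeated-subsample cross-validation protocol can be sketched as follows; `repeated_subsample_cv` is a hypothetical helper, and a mean predictor stands in for the GBT model to keep the sketch short:

```python
import numpy as np
from sklearn.model_selection import KFold

def repeated_subsample_cv(X, y, subsample_size, n_repeats=5, n_splits=5, seed=0):
    """Average test MAE over random subsamples, each evaluated by k-fold CV."""
    rng = np.random.default_rng(seed)
    maes = []
    for rep in range(n_repeats):
        # Draw a random subsample without replacement
        idx = rng.choice(len(y), size=subsample_size, replace=False)
        Xs, ys = X[idx], y[idx]
        kf = KFold(n_splits=n_splits, shuffle=True, random_state=rep)
        for train, test in kf.split(Xs):
            # Placeholder model: predict the training-set mean
            pred = np.full(len(test), ys[train].mean())
            maes.append(np.mean(np.abs(ys[test] - pred)))
    return float(np.mean(maes))

rng = np.random.default_rng(1)
X, y = rng.random((100, 8)), rng.random(100)
avg_mae = repeated_subsample_cv(X, y, subsample_size=50)
```

In the actual experiments the placeholder would be replaced by the fitted GBT model inside each fold.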

Our GDA-GBT model can outperform molecular feature-based ML models. Recently, the SOAP descriptor-based GBT model, known as SOAP-MLM166, has been used for 2D perovskite bandgap prediction; it has a COD ≈ 0.90 and an MAE ≈ 0.103 (eV), as shown in Table 1. Further, the SOAP descriptor has been combined with an autoencoder network107 and kernel ridge regression (KRR)108 for feature reduction and bandgap prediction67, in particular for 3D perovskite datasets67,109,110. However, the SOAP-KRR model performs worse than MLM1 on the NMSE database.

Our GDA-GBT model can also outperform various GNN models. Recently, GNN-based models have demonstrated good performance on 3D perovskite datasets22,23,67,102. We have systematically compared our model with GNN-based methods on the NMSE database, including CGCNN67, GCN75, ECCN76, TFGNN106, SIGNNA80, and SIGNNAc80. From Table 1, it can be seen that our GDA-GBT generally performs better. In general, fingerprint-based models outperform these GNN models, which may be due to the limited training data size and the more complicated structures (of the unit cell).

Our GDA-GBT model remains the best model under different training datasets. Since the NMSE database is continually updated, the models presented in Table 1 are trained with different sizes of perovskite data. Note that the size of the dataset used in each model is presented in Table 1. In general, three different sizes, 445, 515, and 624, are used. For a fair comparison, we have also trained our GDA-GBT models on these three datasets. The results are presented in Supplementary Table 1. In general, our GDA-GBT model still outperforms existing models on all these datasets. A detailed discussion can be found in Supplementary Note 1.

Influence of selected features on GDA-GBT models

A careful analysis of the feature importance scores can provide a better understanding of the learning models. Here we consider two different models built from selected features of our GDA fingerprints. We first consider the halide atom-based density fingerprint (HF) combined with the GBT model, termed the GDA(HF)-GBT model. We find that the GDA(HF)-GBT model yields a COD of approximately 0.86 and an MAE of around 0.094 (eV), which is better than the performance of all the GNN-based models except SIGNNA. A detailed discussion of the GDA(HF)-GBT model can be found in Method and Supplementary Note 2.

Secondly, GDA importance-based density fingerprints, i.e., GDA(IF), are also considered. The central idea is to retain only the GDA features identified as highly significant by GBT-based importance analysis, as illustrated in Supplementary Fig. 1. Specifically, we extracted the top 1000 most important features from the GDA set to create a streamlined input feature vector for the GBT model. This reduces feature dimensionality and mitigates the challenges associated with the curse of dimensionality in machine learning models111. Examining the results in Table 1, it is evident that the GDA(IF)-GBT model, incorporating these crucial features, performs slightly better than the GDA-GBT model.
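The importance-based feature selection can be sketched with scikit-learn; `top_k_features` is a hypothetical helper, and the GBT settings are illustrative rather than the paper's tuned values:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def top_k_features(X, y, k, seed=0):
    """Fit a GBT once, rank features by importance, keep the top-k columns."""
    gbt = GradientBoostingRegressor(n_estimators=50, random_state=seed)
    gbt.fit(X, y)
    keep = np.argsort(gbt.feature_importances_)[::-1][:k]  # most important first
    return X[:, keep], keep

# Toy data where the target depends mostly on feature 0
rng = np.random.default_rng(0)
X = rng.random((80, 40))
y = 2.0 * X[:, 0] + 0.1 * rng.random(80)
X_reduced, keep = top_k_features(X, y, k=10)
```

A second GBT model is then trained on the reduced matrix, mirroring the GDA(IF)-GBT setup.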

An illustration of the predicted values and error distributions can be found in Supplementary Fig. 2. We have also systematically studied various element-specific models with different site and atom combinations; the results can be found in Supplementary Table 2 and Supplementary Fig. 2. A detailed discussion can be found in Supplementary Note 3.

Influence of Exp/DFT-based bandgap labels on GDA-GBT models

As a further comparison to the findings presented in Table 1, which are primarily rooted in DFT bandgap data, we also conducted a comprehensive evaluation of our GDA-GBT and GDA(IF)-GBT models against both DFT-based and Exp-based bandgap labels for 2D perovskites. To be specific, our analysis involved 716 compounds with DFT-based bandgap values and 235 compounds with Exp-based bandgap values, as supported by the unit cell representation from the pymatgen package112. This assessment consisted of two distinct experiments. Firstly, we performed five repetitions of 5-fold cross-validation on the subset of 235 compounds with experimental bandgap data to calculate the normalized mean squared error (NMSE). Secondly, we trained the GDA-GBT model on the DFT-based dataset and tested it on the Exp-based dataset. The outcomes of these experiments are presented in Table 2.

Table 2 GDA-GBT performances on Exp/DFT-based datasets

It is noteworthy that while the GDA(IF) features exhibit a slight performance edge over the full GDA features on DFT-based labels, the full GDA features in turn perform slightly better than the importance-based features on experimental labels. This observation suggests that the entire set of 4500 features may offer additional insights for predicting experimental bandgaps. This, in turn, opens up a promising avenue for future research on the relationship between GDA descriptors and the experimental bandgaps of materials.

It can be seen that both results are inferior to those obtained using only DFT datasets. For the 5-fold cross-validation on the Exp subset, the inferior performance may be due to the relatively small data size, which is less than one-third of the DFT-based subset. Training on DFT and testing on Exp performs worst, with MAE and RMSE roughly twice as large as those of the DFT-only model. This may indicate a discrepancy between DFT calculations and experimental results, and it underscores the importance of incorporating Exp labels to boost model performance.

Prediction of bandgaps for new 2D materials

To investigate the predictive ability of our models, we studied the 2D perovskites in the NMSE database without bandgap labels, i.e., for which only experimental 3D structures are available. More specifically, materials from the NMSE databank with material indices 110, 309, 315, 363, and 452 are used in this study. To evaluate the performance, we systematically computed their DFT bandgaps and compared them with the predictions of our proposed GDA models. The five structures are illustrated in Supplementary Fig. 3 and the detailed DFT settings are discussed in Supplementary Note 4.

Figure 3 showcases the calculated and predicted bandgap values for five distinct compounds, with the order (110, 315, 452, 309, 363) corresponding to the ascending order of the calculated DFT bandgaps. Specifically, treating the calculated DFT bandgaps as test labels, we computed the MAE values for GDA-GBT and GDA(IF)-GBT, obtaining 0.4179 eV and 0.4282 eV, respectively. While these MAE values are notably higher than those in Tables 1 and 2, this disparity may have two major causes. First, the current dataset is still very small, with only 849 data points, very few of which fall within the range of 0 to 2 eV. Second, DFT models with different parameter settings can yield hugely different 2D bandgap values, likely because of the highly complicated 2D perovskite structures. Furthermore, with only minor deviations, as demonstrated in Fig. 3 and elaborated in Supplementary Table 3, the proposed models consistently preserve the ascending order of the DFT labels. In summary, the GDA-GBT framework exhibits potential for accurately representing crystalline structures and delivering stable bandgap predictions. Additional material details, calculated DFT bandgap values, and prediction data can be found in Supplementary Table 3 in Supplementary Note 4.

Fig. 3: Prediction of bandgaps of new 2D materials using GDA-GBT and GDA(IF)-GBT models.
figure 3

Materials are from NMSE but without bandgap labels. Using the DFT model, we evaluated their bandgap values (marked in green). The predicted bandgaps (eV) are marked in blue. The x-axes represent the material indices (110, 315, 452, 309, 363). The y-axes denote bandgap ranges in eV. The MAE values for GDA-GBT and GDA(IF)-GBT predictions are 0.4179 (eV) and 0.4282 (eV), respectively.

Methods

Material mathematical representation

As a cornerstone of physical and machine learning models, the mathematical representation of molecular data is now a major research target in computational materials science. In particular, graphs, networks, and simplicial complexes encode pairwise or higher-order interactions between or among atoms in molecules and have become one of the most important ML frameworks for predicting molecular properties81,82. This section introduces the mathematical representation of material data, in particular the density fingerprint.

Motif and unit cell

The unit cell is a key concept in describing the crystal structure of a material. It refers to a repeating local atomic system that is mathematically defined as a parallelepiped spanned by a basis \({{{{{{{\mathcal{B}}}}}}}}=\{{{{{{{{{\bf{v}}}}}}}}}_{1},{{{{{{{{\bf{v}}}}}}}}}_{2},{{{{{{{{\bf{v}}}}}}}}}_{3}\}\) in \({{\mathbb{R}}}^{3}\). The induced lattice \({{{{{{{\mathcal{L}}}}}}}}\) is the set of all integral linear combinations of \({{{{{{{\mathcal{B}}}}}}}}\). In other words, the unit cell U is defined as \(U=\{{c}_{1}{{{{{{{{\bf{v}}}}}}}}}_{1}+{c}_{2}{{{{{{{{\bf{v}}}}}}}}}_{2}+{c}_{3}{{{{{{{{\bf{v}}}}}}}}}_{3}\,| \,{c}_{1},{c}_{2},{c}_{3}\in \left[0,1\right)\}\) and the lattice \({{{{{{{\mathcal{L}}}}}}}}\) is defined as \({{{{{{{\mathcal{L}}}}}}}}=\{{c}_{1}{{{{{{{{\bf{v}}}}}}}}}_{1}+{c}_{2}{{{{{{{{\bf{v}}}}}}}}}_{2}+{c}_{3}{{{{{{{{\bf{v}}}}}}}}}_{3}\,| \,{c}_{1},{c}_{2},{c}_{3}\in {\mathbb{Z}}\}\). A motif M is a non-empty subset of U, and the periodic point set S is defined as the Minkowski sum of \({{{{{{{\mathcal{L}}}}}}}}\) and M, i.e., \(S=M+{{{{{{{\mathcal{L}}}}}}}}=\{{{{{{{{\bf{u}}}}}}}}+{{{{{{{\bf{v}}}}}}}}\,| \,{{{{{{{\bf{u}}}}}}}}\in M,{{{{{{{\bf{v}}}}}}}}\in {{{{{{{\mathcal{L}}}}}}}}\}\). The motif can be viewed as a local atomic system with respect to a coordinate system or a unit cell in \({{\mathbb{R}}}^{3}\), and the associated periodic point set of the motif forms the entire material structure.
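The definitions above can be made concrete in a few lines; `periodic_points` is a hypothetical helper that enumerates a finite patch of the periodic point set \(S=M+{{{{{{{\mathcal{L}}}}}}}}\):

```python
import itertools
import numpy as np

def periodic_points(basis, motif, n=1):
    """Points of S = M + L for lattice coefficients c_i in {-n, ..., n}."""
    basis = np.asarray(basis, float)   # rows are the basis vectors v1, v2, v3
    motif = np.asarray(motif, float)
    pts = []
    for c in itertools.product(range(-n, n + 1), repeat=3):
        shift = np.asarray(c, float) @ basis   # c1*v1 + c2*v2 + c3*v3
        for m in motif:
            pts.append(m + shift)
    return np.array(pts)

# Cubic lattice with a single-atom motif at the origin (the Figure 1 setting)
S_patch = periodic_points(np.eye(3), [[0.0, 0.0, 0.0]], n=1)
```

The full periodic set is infinite; in practice only the translates whose atoms can influence the unit cell are needed.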

Material descriptors and fingerprints

Traditional material descriptors and fingerprints

Perovskite descriptors provide general insights into the structural and physical properties of perovskite crystals and have been widely used in QSAR/QSPR models. Traditional molecular descriptors for materials include atomic radius, ionic radius, orbital radius, tolerance factor, octahedral factor, packing factor, crystal structure measurements, surface/volume features, and physical properties such as ionization potential, ionic polarizability, electron affinity, Pauling electronegativity, valence orbital radii, and HOMO and LUMO113,114,115. Recently, a series of material fingerprints have been proposed, including the Coulomb matrix62, the sine Coulomb matrix116, Ewald sum matrices116, the many-body tensor representation (MBTR)63, atom-centered symmetry functions (ACSF)65, and the smooth overlap of atomic positions (SOAP)64.

TDA-based material fingerprints

Compared to traditional perovskite descriptors and fingerprints, TDA-based perovskite models have two great advantages, i.e., intrinsic representation and multiscale modeling81,82. More specifically, TDA uses topological invariants, i.e., Betti numbers, to characterize the intrinsic and fundamental structural and interactional properties. With their high abstraction and intrinsic information, TDA-based features can have better transferability and interpretability for learning models. Furthermore, molecular systems exhibit a plethora of interactions at different scales, encompassing covalent bonds, hydrogen bonds, van der Waals interactions, electrostatic interactions, etc. Through a filtration process, TDA can comprehensively characterize all these interactions on an equal footing. This provides a multiscale model for the characterization of perovskite systems81,82. Recently, TDA-based models have demonstrated great performance in material data analysis32,89,91,102,117.

GDA-based material fingerprint

Different from other molecular systems, materials carry unique periodic information at the atomic level, which plays a significant role in material functions and properties. However, traditional learning models usually ignore this information or consider it only implicitly through periodic graphs or supercells. To explicitly address the periodic information, we consider the newly proposed multiscale feature representation model, called the density fingerprint, which was originally introduced for the study of periodic crystalline structures in Euclidean spaces100. The key advantage of DF lies in its ability to provide an efficient multiscale geometric representation of both cell structures and periodic information. DF offers intuitive and explainable features for periodic crystalline data due to its geometric invariance and stability (or continuity). It does not require cell extension and can be computed on any unit cell, independent of the choice of motif and unit cell.

Mathematical background for density fingerprint

Given a periodic set \(S\subseteq {{\mathbb{R}}}^{3}\) and t ≥ 0, let S(t) = {B(x, t) ∣ x ∈ S} denote the family of all closed balls centered at points in S with radius t. The set \({\bigcup }^{k}S(t)=\{{{{{{{{\bf{x}}}}}}}}\in {{\mathbb{R}}}^{3}| {{{{{{{\bf{x}}}}}}}}\,\,{{\mbox{belongs to at least}}}\,\,k\,\,{{\mbox{balls in}}}\,\,S(t)\}\) is defined as the k-fold cover of S(t). This concept captures the interaction behavior of neighborhoods over the set S. The fractional volume of the k-fold cover of S is defined as the probability

$${\varphi }_{k}^{S}(t) ={{{{{{{\rm{Prob}}}}}}}}({{{{{{{\bf{x}}}}}}}}\in {{\mathbb{R}}}^{3}\,\,{{\mbox{belongs to at least}}}\,\,k\,\,{{\mbox{members in}}}\,\,S(t)) \\ =\frac{{{{{{{{\rm{vol}}}}}}}}(U\cap {\bigcup }^{k}S(t))}{{{{{{{{\rm{vol}}}}}}}}(U)}.$$
(1)

The k-th density function (cf.100,101) of a periodic set S is a function \({\psi }_{k}^{S}:[0,\infty )\to [0,1]\) defined as follows:

$${\psi }_{k}^{S}(t) ={{{{{{{\rm{Prob}}}}}}}}({{{{{{{\bf{x}}}}}}}}\in {{\mathbb{R}}}^{3}\,\,{{\mbox{belongs to exactly}}}\,\,k\,\,{{\mbox{members in}}}\,\,S(t))\\ =\frac{{{{{{{{\rm{vol}}}}}}}}\left(U\cap \left({\bigcup }^{k}S(t)-{\bigcup }^{k+1}S(t)\right)\right)}{{{{{{{{\rm{vol}}}}}}}}(U)}={\varphi }_{k}^{S}(t)-{\varphi }_{k+1}^{S}(t).$$
(2)

The density fingerprint of S is defined as the sequence \(\Psi (S)={({\psi }_{k}^{S})}_{k = 0}^{\infty }\) of density functions associated with S. By selecting a finite range t ∈ [0, T] and a finite K, one can consider each curve \({\psi }_{k}^{S}:[0,T]\to [0,1]\) (k = 0, 1, …, K) in the sequence as a feature vector of the periodic set S. These feature vectors can then be used to train machine learning models.

Figure 1 shows an example of the density fingerprints \({\psi }_{0}^{S},{\psi }_{1}^{S},{\psi }_{2}^{S}\) defined on the interval [0, 1]. In this case, the set S is the Minkowski sum \(M+{{{{{{{\mathcal{L}}}}}}}}\), where M = {(0, 0, 0)} and \({{{{{{{\mathcal{L}}}}}}}}\) is the lattice spanned by the standard basis e1 = (1, 0, 0), e2 = (0, 1, 0), and e3 = (0, 0, 1). In particular, the unit cell U and the motif M are the sets \({\left[0,1\right)}^{3}\) and {(0, 0, 0)}, respectively, and all the vertices of the closed cube \(\overline{U}={[0,1]}^{3}\) belong to the periodic set S.
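The density functions of this cubic example can be checked numerically by Monte Carlo sampling; this sketch estimates the volumes stochastically, unlike the exact geometric computation of the DF algorithm, and `density_functions` is a hypothetical helper valid only for t ≤ 1:

```python
import itertools
import numpy as np

def density_functions(t, n_samples=100_000, k_max=3, seed=0):
    """Monte Carlo estimate of psi_k(t) for S = Z^3 with unit cell U = [0,1)^3.

    For t <= 1, only lattice points with coordinates in {-1, 0, 1, 2} can
    reach a ball of radius t into U, so those 64 centers suffice."""
    rng = np.random.default_rng(seed)
    x = rng.random((n_samples, 3))            # uniform sample points in U
    counts = np.zeros(n_samples, dtype=int)   # balls covering each sample
    for c in itertools.product((-1, 0, 1, 2), repeat=3):
        counts += ((x - np.asarray(c, float)) ** 2).sum(axis=1) <= t * t
    # psi_k(t) = fraction of U covered by exactly k balls of radius t
    return [float(np.mean(counts == k)) for k in range(k_max + 1)]

psi = density_functions(0.5)
# At t = 0.5 the corner spheres just touch, so psi_1(0.5) equals the volume
# of one sphere, pi/6, and psi_2(0.5) is zero.
```

This matches the ψ1 curve of Figure 1 reaching π/6 ≈ 0.524 at t = 0.5, the radius where neighboring spheres first meet.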

Through the explicit construction of the neighborhoods S(t) in \(\mathbb{R}^{3}\), the DF offers a geometric perspective that seamlessly integrates various unit-cell properties, including angles, diameters, shapes, distortions, and overall periodicity. Mathematically, the DF enjoys invariance and stability properties that guarantee the uniqueness of its representation of unit-cell structures: the same material, even under different unit-cell models, has exactly the same DF, while a slight perturbation of the material structure (i.e., of the atom positions) results in only a correspondingly small change of the DF, as demonstrated in Supplementary Figs. 4–5. A detailed discussion of the invariance and stability properties of the DF can be found in Supplementary Note 5.

Element-specific density fingerprints

Element-specific density fingerprints capture the geometry and topology of particular atomic subsystems or their combinations. This approach has shown high performance for biomolecules84,118 and 2D perovskites67,102. Here we consider DFs generated not only from the entire structure but also from different site combinations and atom types. More specifically, the site combinations include ACB, ACX, ACBX, BX, B, and X. A-site atom types include C, H, O, and N. B-site atom types include Bi, Cd, Ge, Pb, and Sn, which are the main inorganic and metal atoms of the materials with DFT-based bandgaps in NMSE. X-site atom types include Cl, Br, and I. We call a DF generated from a certain atom type or site combination an element-specific DF. Our GDA-based perovskite fingerprint contains a series of these element-specific DFs.
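Operationally, each element-specific DF is computed on a sub-structure obtained by keeping only the atoms of the chosen site combination or atom type. A minimal sketch of this filtering step, with an illustrative (not real) motif and coordinates of our own choosing:

```python
import numpy as np

def filter_substructure(symbols, frac_coords, keep):
    """Return the sub-structure used for one element-specific DF.

    `keep` is a set of chemical symbols, e.g. {"Pb", "I"} for the BX
    combination of a lead-iodide perovskite, or {"I"} for an X-only DF.
    """
    mask = np.array([s in keep for s in symbols])
    return [s for s, m in zip(symbols, mask) if m], frac_coords[mask]

# Toy MAPbI3-like motif (symbols and fractional coordinates are illustrative).
symbols = ["C", "N", "Pb", "I", "I", "I"]
coords = np.array([[0.4, 0.5, 0.5], [0.6, 0.5, 0.5],
                   [0.0, 0.0, 0.0], [0.5, 0.0, 0.0],
                   [0.0, 0.5, 0.0], [0.0, 0.0, 0.5]])

# BX combination: keep only the metal and halide atoms.
bx_symbols, bx_coords = filter_substructure(symbols, coords, {"Pb", "I"})
```

The retained sub-structure (together with the original lattice) is then fed to the DF computation in place of the full periodic set.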

Furthermore, different combinations of element-specific DFs are considered in our GDA(HF) and GDA(IF) models. Note that GDA(HF) considers only the halide-atom-based density fingerprint (HF), while GDA(IF) uses an importance-based density fingerprint (IF). In particular, our GDA(IF) model selects element-specific DFs based on the importance-analysis results shown in Supplementary Fig. 1. Computationally, we calculate each DF over the interval [0, 12] in angstroms (Å), uniformly divided into 50 grid points. In our experiment, each DF contains five 1D density functions, ψ0, ψ1, ψ2, ψ3, and ψ4, so its size is 5 × 50 = 250, and with 18 element-specific DFs the total dimension of the input vector of GDA-GBT is 18 × 250 = 4500.
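The dimension bookkeeping above (50 grid points × 5 density functions per DF, 18 element-specific DFs in total) can be sketched as a simple feature-assembly step; the function names below are our own:

```python
import numpy as np

def df_feature_vector(psi_curves, n_grid=50):
    """Flatten one element-specific DF into a 250-dimensional vector.

    psi_curves: array of shape (5, n_grid) holding psi_0, ..., psi_4
    sampled on 50 uniform grid points over [0, 12] angstroms.
    """
    psi_curves = np.asarray(psi_curves)
    assert psi_curves.shape == (5, n_grid)
    return psi_curves.reshape(-1)  # 5 * 50 = 250 entries

def perovskite_fingerprint(all_dfs):
    """Concatenate the element-specific DFs into one GDA input vector."""
    return np.concatenate([df_feature_vector(c) for c in all_dfs])

# 18 element-specific DFs (6 site combinations + 4 A-site + 5 B-site
# + 3 X-site atom types) give an 18 * 250 = 4500-dimensional input.
fingerprint = perovskite_fingerprint(np.zeros((18, 5, 50)))
```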

GDA-GBT and GNN models for 2D perovskites

In our GBT model, we utilized 10,000 estimators with a maximal depth of 7 to analyze the input DF features and used the squared error as the loss metric. The minimum number of samples required to split a node is 2, the learning rate is 0.001, and the subsampling rate in the training process is 0.7. Note that we have not systematically optimized these parameters; instead, we adopted the basic setting, with minor revisions, from the algebraic graph-assisted bidirectional transformer model119. For the input feature vector, we employ the function \(f(x)=10^{x}\) to rescale the curves ψ0, ψ1, …, ψ4 from the range [0, 1] to [1, 10]. This rescaling accentuates the distinctions between the curves, thereby facilitating their classification.
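The stated hyperparameters match the interface of scikit-learn's `GradientBoostingRegressor`, which we use here as a stand-in (the text does not specify the library); the random training data is purely illustrative:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def rescale(features):
    """Map each DF value from [0, 1] to [1, 10] via f(x) = 10**x."""
    return np.power(10.0, features)

# Hyperparameters quoted in the text; not systematically optimized.
gbt = GradientBoostingRegressor(
    n_estimators=10000,
    max_depth=7,
    loss="squared_error",
    min_samples_split=2,
    learning_rate=0.001,
    subsample=0.7,
)

# Random stand-in for the 4500-dimensional DF features of 32 structures.
rng = np.random.default_rng(0)
X = rescale(rng.random((32, 4500)))
y = rng.random(32)
# gbt.fit(X, y)  # fitting 10,000 estimators is slow; shown for shape only
```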

Among all the GNN models for 2D perovskites, the SIGNNA model80 has the best performance. In this model, perovskite structures are divided into sub-structures according to physical properties (for example, the organic and inorganic atomic sub-systems). A different GNN is employed for each sub-structure, and the learned latent vectors are combined as the input to a multi-layer neural network. These sub-structure networks extract features with heterogeneous properties and thereby achieve better performance in predicting the physical and chemical properties of perovskite structures, which is consistent with the results of our element-specific models.
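The fusion step described above (per-sub-structure encoders whose latent vectors are concatenated and passed to a multi-layer network) can be sketched schematically. The placeholder `encode` below is not a GNN; it stands in for the message-passing encoder, and all weights are random, so this illustrates only the data flow, not SIGNNA itself:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(substructure_features, dim=16):
    """Placeholder for a sub-structure GNN encoder: a random linear map
    followed by mean pooling over atoms (illustration of shapes only)."""
    w = rng.standard_normal((substructure_features.shape[1], dim))
    return np.tanh(substructure_features @ w).mean(axis=0)

def combine_and_predict(latents, hidden=32):
    """Concatenate sub-structure latent vectors and pass them through a
    small multi-layer network, mirroring the fusion step."""
    z = np.concatenate(latents)
    w1 = rng.standard_normal((z.size, hidden)) / np.sqrt(z.size)
    w2 = rng.standard_normal((hidden, 1)) / np.sqrt(hidden)
    return float((np.maximum(z @ w1, 0.0) @ w2)[0])  # ReLU MLP -> scalar

# Organic and inorganic sub-systems with different atom counts.
organic = rng.random((8, 4))    # e.g., 8 atoms, 4 features each
inorganic = rng.random((5, 4))
prediction = combine_and_predict([encode(organic), encode(inorganic)])
```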

The size of the training database influences the performance of AI models. Because the NMSE database is continuously updated, the models in Table 1 use different data sizes. More precisely, the MLM1 model utilizes 515 compounds66, the SOAP-based KRR uses 445 compounds67, and all the GNN models (GCN75, ECCN76, CGCNN21, TFGNN106, and SIGNNA80) use 624 data points. To ensure a fair comparison with all existing SOTA models21,66,67, we also trained our GDA-GBT models on the same datasets as the previous models. Supplementary Table 1 shows the performance of GDA-GBT with different data sizes; our GDA-GBT still outperforms all previous models on their respective datasets.