Introduction

The modern discovery of advanced functional materials demands predictive frameworks that reconcile frequently conflicting requirements: quantum-level precision, generalizable physical insight, and human-interpretable design principles. While conventional computational methods struggle to meet this challenge, machine learning (ML) approaches have shown remarkable progress across a wide range of materials property prediction and novel-materials discovery tasks1,2. Two distinct ML paradigms have emerged in solid-state materials research: symbolic regression (SR), which offers equation-based interpretability, and graph neural networks (GNNs), which excel in structure-property mapping accuracy. SR reveals fundamental correlations by generating explicit, transparent mathematical descriptors. These descriptors are expressed as mathematical relationships among material features, providing actionable insights for the rational design of materials with desired properties. However, the explicit feature combination in SR limits its ability to capture complex relationships among features in high-dimensional spaces. In contrast, GNNs circumvent this limitation through automated learning of hidden material representations via message-passing architectures3,4,5,6, but their black-box nature obscures the atomic-scale physics underpinning property variations, a critical barrier for scientific discovery. This inherent trade-off between interpretative clarity (SR) and automated learning capability (GNNs) has constrained computational materials science to situation-specific solutions rather than unified and robust discovery platforms.

Over the past decade, remarkable progress has been achieved in advancing both GNN and SR methodologies within computational materials science. For GNNs, a diverse array of architectures has emerged, ranging from message passing neural networks (MPNN)7 and crystal graph convolutional neural networks (CGCNN)8 to atomistic line graph neural networks (ALIGNN)9 and Graphformer10,11,12, each pushing the boundaries of accuracy in predicting material properties. Concurrently, substantial efforts have been dedicated to refining GNN components, including feature engineering13,14,15,16, algorithms for updating node and edge attributes17,18,19, and readout functions20, all aimed at better aligning graph-based representations with fundamental physical principles. Furthermore, integrating algorithms and strategies from other domains into GNNs can improve performance, as has been validated in specific cases21,22,23,24,25,26,27,28,29,30. On the other hand, SR has found significant success in deriving interpretable mathematical descriptors for specific material systems, such as perovskites, where it has guided the design of materials with optimal band gaps for photovoltaics, enhanced catalytic activity for oxygen evolution reactions, and similar applications31,32,33,34. Despite these advancements, a critical gap remains: the lack of an integrated framework that combines GNNs and SR to simultaneously achieve model generality, high predictive accuracy, and interpretable descriptor generation. Current approaches in material screening and guided synthesis often prioritize one aspect at the expense of others, highlighting the need for a unified methodology that harmonizes the strengths of both paradigms to accelerate materials discovery and design.

In this work, we present a computational framework that synergistically integrates graph neural networks (GNNs) and symbolic regression (SR), advancing the understanding of structure-property relationships in materials science. Our approach introduces a feature encoding algorithm within the GNN architecture, wherein each physical-quantity feature is assigned independent weight parameters and subsequently aggregated in a high-dimensional latent space. This self-adaptable graph attention network combined with SR (SA-GAT-SR) formulation enables the generation of importance coefficients (ICs) at both the atomic and crystallographic levels, supporting rigorous feature screening and offering unprecedented physical interpretability of deep learning models. The re-weighted feature representations, augmented by the predictive outputs of our GNN module, are then processed through a highly optimized SR framework to extract explicit mathematical descriptors. Analysis of the resulting symbolic expressions elucidates the fundamental mathematical relationships governing material property variations. This integrated paradigm effectively bridges the gap between the predictive accuracy of deep learning approaches and the interpretability of traditional scientific modeling, thereby offering a robust platform for accelerated materials discovery and design.

Results

Joint paradigm of graph attention network and symbolic regression

The SA-GAT-SR model is an end-to-end framework that takes crystal structures and associated features as input. In contrast to previous works, which are predominantly based on deep learning (DL) methods and aim to develop universal models capable of predicting material properties across diverse systems35,36,37,38, the proposed SA-GAT-SR methodology provides both a mathematical expression and predictive results that offer physical insight. Additionally, the model outputs an importance ranking of the initial features. Figure 1a illustrates the workflow for training a joint prediction model for a specific material system with the SA-GAT-SR approach, which can be broadly divided into four key steps. The data acquisition stage gathers materials from the system of interest to form a high-quality training dataset. To construct node feature vectors, the dataset also requires a large set of corresponding atomic and crystal characteristics, ideally including those from which the best descriptors can be built. Subsequently, the feature engineering algorithm converts the material structure information into a graph representation, converts the physical characteristics into node and global features, and produces the ICs of each characteristic. This yields a well-trained GNN model and the corresponding GNN-level prediction result, completing the SA-GAT step. In the third step, guided by the ICs and the initial predictions of the GNN model, atomic and global features are filtered according to the specified number of reserved features; the input features of the SR module are the reserved features combined with the GNN module prediction. Finally, in the fourth step, the SR module derives a series of expressions from the recombined features.

Fig. 1: The SA-GAT-SR model for materials prediction.

a The flowchart of the SA-GAT-SR model encompasses four sequential stages: data acquisition (blue), preliminary GNN prediction (yellow), feature screening (pink), and symbolic regression (brown). The combination of the GNN prediction and the reserved features in the feature screening step is used as the input feature set of the SR module. b The architecture of the GNN module. In the self-adaptable encoding (SAE) algorithm, the raw feature vector consists of scalar properties associated with atoms and unit cells of the crystal structure. The SAE assigns a weight to each characteristic and generates the initial feature vector. The blue and orange circles represent atomic and global node features, respectively. The message-passing layers consist of stacked node update modules, as illustrated, allowing iterative updating of the feature vectors through the GNN architecture.

From the perspective of model architecture, the SA-GAT-SR framework consists primarily of two modules: the GNN module enhanced by self-adaptable encoding (SAE), and the SR solver module. The GNN module comprises an SAE layer, multiple message-passing and global feature updating layers, and a readout layer. The SAE layer performs feature engineering, constructing the initial node, edge, and global feature vectors from atomic, bonding, and crystal cell features, respectively, based on ICs. The message-passing layers operate on the crystal graph, iteratively updating each feature vector. After processing through all the updating layers, the readout layer produces the final output. The output of the GNN module serves as an initial approximation of the final prediction and is fed into the SR module, implemented with the sure independence screening and sparsifying operator (SISSO) method39,40.

Self-adaptable encoding and screening of initial features

To convert crystal structures into graph representations compatible with GNNs, we encode atomic properties as node feature vectors, while the connections between atoms are encoded as edge features. In CGCNN, one-hot encoding relies solely on the atomic number, limiting its ability to capture other physical characteristics of atoms. When predicting target properties across different materials, the sensitivity of the output to the input features can vary significantly. Our extensive experiments reveal that feature construction has a far greater impact on predictive performance than the specific algorithms used for feature updating. Therefore, we incorporate a self-attention-based importance-weighting module within the feature engineering process (see Supplementary Note 1).

The initial feature sets, based on physical quantities, are represented as Ra for atomic-level and Rc for crystal-level variables, respectively. Importantly, feature sets can also be constructed for distinct atomic types or other structural units. As shown in Fig. 1b, the raw feature vector of an atom in the material is denoted as Ra = {r1, r2, r3, . . . , rN}, where N represents the size of the initial feature set. To capture more complex nonlinear relationships among these characteristics, we define a projection function σ( ) that contains learnable weight vectors and adaptively embeds each scalar feature into a unique high-dimensional hidden vector. This can be represented as \({h}_{{r}_{i}}=\sigma ({r}_{i}),{h}_{{r}_{i}}\in {{\mathbb{R}}}^{M}\), where M denotes the dimension of \({h}_{{r}_{i}}\) (here, M = 180). The raw hidden feature of Ra can then be represented as \({H}_{a}=\sigma ({R}_{a})=\{{h}_{{r}_{1}},...,{h}_{{r}_{N}}\}\), where each hidden vector in Ha is a unique high-dimensional embedding of its corresponding physical characteristic. Subsequently, to evaluate the importance of each high-dimensional feature, we introduce a learnable weight vector Wa. The ICs are computed as the inner product between Wa and each \({h}_{{r}_{i}}\), followed by softmax normalization, yielding attention coefficients that serve as weights for the hidden feature vectors. To capture the relevance of each feature to the target property, the attention-weighted hidden vectors are summed to generate the initial node feature vectors of the graph representation. The feature engineering process for node features can be described by:

$$\begin{array}{l}{h}^{0}\,=\,\mathop{\sum}\limits_{i=1}^{N}{a}_{i}{h}_{{r}_{i}}\\ {a}_{i}\,=\,{\rm{softmax}}(f({W}_{a},{h}_{{r}_{i}}))\end{array}$$
(1)

where the f( ) function computes the scalar attention score of \({h}_{{r}_{i}}\), and h0 denotes the initial feature vector at layer 0. The node features are calculated by the SAE algorithm, and the outcome depends directly on the selection of the corresponding initial feature set Ra. However, the calculation of the global feature differs slightly from that of the node features. To obtain ICs for features drawn from different initial feature sets, we concatenate Ra and Rc to form the initial global feature set. By doing so, we can compute the ICs for each relevant characteristic together with the global node feature vector, as outlined in Eq. (1). This approach enables comparison of the importance of features derived from different feature sets. For various materials, the global node features integrate both atomic-level characteristics and structural material features, ensuring that the crystal graph is uniquely defined. This integration also helps mitigate the risk of over-smoothing during model training, a common issue arising when node information is excessively aggregated.
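A minimal PyTorch sketch of this SAE step, following Eq. (1), is shown below. All names (SelfAdaptableEncoding, n_features, hidden_dim) and the choice of a per-feature linear projection for σ( ) are our illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class SelfAdaptableEncoding(nn.Module):
    def __init__(self, n_features: int, hidden_dim: int):
        super().__init__()
        # One independent projection sigma(.) per scalar characteristic r_i.
        self.embed = nn.ModuleList(
            [nn.Linear(1, hidden_dim) for _ in range(n_features)]
        )
        # Learnable weight vector W_a used to score each hidden feature.
        self.w_a = nn.Parameter(torch.randn(hidden_dim))

    def forward(self, r: torch.Tensor):
        # r: (batch, n_features) raw scalar features R_a = {r_1, ..., r_N}
        h = torch.stack(
            [proj(r[:, i : i + 1]) for i, proj in enumerate(self.embed)], dim=1
        )  # (batch, n_features, hidden_dim)
        scores = h @ self.w_a                # f(W_a, h_{r_i}): inner product
        ics = torch.softmax(scores, dim=1)   # importance coefficients a_i
        h0 = (ics.unsqueeze(-1) * h).sum(1)  # initial node feature h^0
        return h0, ics

# Usage: embed 7 atomic characteristics into a 180-dimensional node vector.
sae = SelfAdaptableEncoding(n_features=7, hidden_dim=180)
h0, ics = sae(torch.randn(4, 7))  # ics sums to 1 over the 7 features
```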

As illustrated in Fig. 1, we can select highly relevant features from the initial feature set using the weight set and preliminary predictions generated by the trained GNN module. Assuming that there are M initial feature sets, each associated with key atoms and structural units, they are denoted as \(\Phi =\{{R}_{a}^{1},\ldots ,{R}_{a}^{M}\}\), along with a unique initial crystal feature set Rc. The sizes of the initial feature sets need not be uniform, allowing flexible selection of physical quantities when constructing them. The combined initial global feature set is represented as \(\Psi =\{{R}_{a}^{1},\ldots ,{R}_{a}^{M},{R}_{c}\}\). Subsequently, the ICs for all characteristics can be derived through the SAE process applied to the set Ψ. The set of ICs across the different initial feature sets is defined as A = {a1, . . . , aN}, where N is the total number of characteristics; this set serves as the foundation for subsequent feature screening. Based on A, we determine the importance ranking of each feature and select the top-k features from each feature set as input for the subsequent SR module, as sketched below. The number of retained features per set can be independently adjusted according to specific requirements. In practice, reducing the number of selected features significantly accelerates the SR process, highlighting the critical importance of the feature screening performed by the GNN module.
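The screening step itself reduces to ranking the learned ICs and keeping the top-k entries per feature set. A small illustrative sketch, with hypothetical feature names and IC values, is given below.

```python
import numpy as np

def screen_features(feature_names, ics, k):
    """Rank characteristics by IC and keep the top-k for the SR module."""
    order = np.argsort(ics)[::-1]          # descending importance
    return [feature_names[i] for i in order[:k]]

names = ["A_en", "B_en", "A_ir", "B_ir", "G_t", "G_mu", "G_dis"]
ics = np.array([0.09, 0.17, 0.06, 0.11, 0.15, 0.12, 0.30])
print(screen_features(names, ics, k=3))    # ['G_dis', 'B_en', 'G_t']
```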

The input feature set for the SR module is defined as \(\Omega =\{{S}_{a}^{1},...,{S}_{a}^{M},{g}_{s}\}\), where \({S}_{a}^{i}\) and gs represent the selected features from \({R}_{a}^{i}\) and g, respectively. In this work, the SR process is implemented using the SISSO machine-learning method. Unlike other approaches that rely on density functional theory (DFT) results as input features, which offer only limited accuracy yet still demand substantial computational resources, our method uses the GNN predictions as input features to the SR module. These predictions are generated efficiently by the GNN model, relaxing the strict requirement that material datasets contain DFT-based pre-calculated results. In addition to the number of screened features, the dimensionality and complexity of the descriptors significantly influence both the accuracy of the predictions and the computational efficiency of the SR process (see Supplementary Fig. 1). Detailed comparisons, including the effects and trade-offs of higher complexity on model performance and the emergence of overfitting, are provided in Supplementary Note 2. Careful optimization of these parameters is essential to achieving a balance between performance and resource consumption.
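As a usage illustration, the set Ω can be assembled as a plain table in which the screened features are joined with the GNN prediction EP as one extra column. The layout below (sample identifier, property, feature columns) follows the common SISSO train.dat convention; the column names and helper function are ours, and the exact file format should be adapted to the SISSO version in use.

```python
import pandas as pd

def build_sr_input(df: pd.DataFrame, kept: list, gnn_pred, target: str) -> pd.DataFrame:
    """Join the k screened features with the GNN first-level approximation E_P."""
    out = df[["material_id", target] + kept].copy()
    out["E_P"] = gnn_pred            # GNN prediction as an extra SR feature
    return out

# df: one row per material, with raw features and the DFT target label.
# build_sr_input(df, kept=["G_dis", "X_Q"], gnn_pred=preds, target="E_gap") \
#     .to_csv("train.dat", sep=" ", index=False)
```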

Application of the model in single oxide and halogen perovskite materials

In the following computational experiments, we concentrate on predicting multiple properties of single-oxide and single-halogen perovskites with the general formulae ABO3 and ABX3 (X=F, Cl, Br, I) using the JARVIS-DFT41 dataset (version 2021.8.18). This dataset comprises an extensive collection of materials and provides a comprehensive set of solid-state properties ideally suited for model training and evaluation. In total, we curated a subset of 689 perovskite materials for our studies. To handle missing labels in the JARVIS dataset, we filtered the perovskite structures to include only those with available computed values for each target property. The hyperparameter configurations of the GNN and SR modules in the SA-GAT-SR model are summarized in Supplementary Table 1. The SR module takes a total of 15 input features, including the prediction output of the GNN module. To simplify the final expressions and accelerate the SR derivation, we set the unified descriptor complexity to 1 and the expression dimension to 3 for all prediction tasks. The train-validation-test split ratio of our SA-GAT-SR and all baseline models is maintained at 85%:10%:5%; given the limited dataset of 689 perovskite samples, this enhances training stability while retaining sufficient validation and test data. From the extensive set of properties available in this dataset, we select the following key attributes for evaluation: bandgaps corrected by the optimized Becke88 functional with van der Waals interactions (OptB88vdW)42, formation energies, dielectric constants without ionic contributions (ϵx, ϵy, ϵz), Voigt bulk modulus (Kv) and shear modulus (Gv), total energies, and energies above the convex hull (Ehull).

For the initial feature set, we select 7 physical quantities for each element and 9 physical quantities for the crystal, resulting in a combined feature vector comprising 30 features for each crystal43,44. However, since the X site of all ABO3 materials is fixed as oxygen, we focus on screening features from the A-site and B-site atoms as inputs to the SR module, reducing the number of initial features to 23. The symbols used in the SA-GAT-SR model are summarized in Table 1, and their corresponding detailed information is provided in Supplementary Table 2. The octahedral factor Gμ and tolerance factor Gt, which are commonly used for describing perovskite structures, are defined as \({B}_{ir}/{r}_{X}\) and \(\frac{{A}_{ir}+{r}_{X}}{\sqrt{2}({B}_{ir}+{r}_{X})}\), respectively, where rX denotes the ionic radius of the oxygen or halogen anion. Both features reflect, to a certain extent, the stability of the entire crystal structure. Moreover, the ratio of these two features has been shown to exhibit a strong correlation with several functional properties of perovskite materials32; we therefore include this ratio as an initial feature in our analysis. Multiple studies have demonstrated that lattice structural distortions in perovskite materials can significantly influence a range of their properties45. To characterize these distortions, we select a representative BX6 octahedron and compute the mean B-X-B bond angle with its six neighboring octahedra, denoted as ΘBXB. To further enhance the stability and interpretability of the calculated values, we adopt the cosine of ΘBXB as the distortion feature; the cosine maps the bond-angle distortion to the range between −1 and 1, providing a naturally normalized representation of the structural distortion, as illustrated in the sketch below. Further analysis of these distortion features, specifically the evaluation of data redundancy elimination using thresholds of 0.005 and 0.02, is provided in Supplementary Figs. 2 and 3.
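These geometric features translate directly into code. The sketch below computes Gt, Gμ, and the cosine-based distortion feature; the Shannon ionic radii used in the example (a CaTiO3-like composition) are illustrative values, not taken from our dataset.

```python
import math

def octahedral_factor(r_B: float, r_X: float) -> float:
    return r_B / r_X                      # G_mu = B_ir / r_X

def tolerance_factor(r_A: float, r_B: float, r_X: float) -> float:
    return (r_A + r_X) / (math.sqrt(2.0) * (r_B + r_X))   # G_t

def distortion_feature(theta_BXB_deg: float) -> float:
    # cos maps the mean B-X-B angle into [-1, 1]; 180 deg (no tilt) -> -1.
    return math.cos(math.radians(theta_BXB_deg))

# Illustrative Shannon radii (Å): Ca2+, Ti4+, O2-.
r_Ca, r_Ti, r_O = 1.34, 0.605, 1.40
print(tolerance_factor(r_Ca, r_Ti, r_O))  # ~0.97
print(octahedral_factor(r_Ti, r_O))       # ~0.43
```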

Table 1 The SA-GAT-SR model initial features for predicting ABO3 materials containing A site, B site, X site, and global characteristics

The performance of the SA-GAT-SR model on the ABO3 and ABX3 datasets is summarized in Supplementary Tables 3 and 4, respectively. We employ the mean absolute error (MAE) metric to evaluate regression accuracy. Furthermore, to validate the robustness of the model, we combine both datasets, denoted AB(OX)3, and assess the model's performance on the merged data, with the results presented in Table 2. For comparison, we also train and test other state-of-the-art models, including the Graph Attention Network (GAT)46, CGCNN, and ALIGNN, using the same dataset and identical train-validation-test splits on each dataset (see Supplementary Note 3). All models, including SA-GAT-SR and the baselines, were trained and evaluated using the same problem-specific physical features (Table 1), ensuring that performance differences stem from model architectures rather than input variations. The SA-GAT-SR model integrates a GNN module with an SR module, where the final prediction accuracy largely depends on the performance of the GNN component. To further assess the contribution of each module, we conducted an ablation study to evaluate their individual performances separately. For the SR module, the input feature set remains consistent with that of the SA-GAT-SR model, except for the initial prediction results from the preceding GNN stage. To ensure robust evaluation, we performed multiple independent random experiments on the same data split and report the average performance metrics as the final results. In most cases, the SA-GAT-SR model demonstrates superior performance compared to the control-group models. The lowest MAE for bandgap (Egap) with SA-GAT-SR is 0.147 eV, outperforming GAT, CGCNN and ALIGNN by 19.7%, 18.3% and 10.9%, respectively. The full SA-GAT-SR model builds on the GNN module's initial approximation, achieving additional improvements in accuracy; without the predictions from the GNN module, the standalone SR module only reaches an MAE of 0.768 eV for Egap. However, in certain cases, the SR module outperforms the GNN module, as shown in Supplementary Table 4: the MAE for Ehull using only the SR module is 0.075 eV, outperforming the 0.090 eV achieved by the GNN module alone. All models were trained and evaluated on the same dataset for each property, ensuring fair horizontal comparisons across models. However, varying sample sizes across properties may affect comparisons of a single model's performance across properties, though this does not impact model comparisons for individual properties. Given the established importance of out-of-distribution (OOD) performance as a critical metric for materials discovery, we conducted a systematic evaluation of the OOD performance of the SA-GAT-SR method, with results and analyses presented in Supplementary Table 5 and Supplementary Note 4. In addition, to better demonstrate the generalization of our model, we also conducted systematic experiments on a non-perovskite dataset, the MOene dataset47,48,49. MOenes represent an emerging class of MXene-like two-dimensional (2D) materials, characterized by the chemical formula Mn+1OnT2 (n = 1–4), where “O” denotes an oxygen-group element (O, S, Se, or Te), expanding the scope of traditional MXenes to include 2D transition metal oxides. The SA-GAT-SR model demonstrated robust performance in predicting the total energy, work function, bandgap, valence band maximum (VBM), and conduction band minimum (CBM) (see Supplementary Fig. 4, Supplementary Table 6, and Supplementary Note 4), further confirming the generalizability of our framework. To assess space complexity, we analyze the number of trainable parameters in each model. The SA-GAT-SR model contains a total of 6.303M parameters, primarily due to the feature embedding matrices, which scale with the size of the input feature set but do not affect the model's training speed. In contrast, ALIGNN comprises only 4.027M parameters. Notably, the trainable core of SA-GAT-SR includes just 3.351M parameters. Moreover, when the hidden feature dimension is set to 128, the parameter count of the core model can be reduced to as few as 1.011M without a significant loss in prediction accuracy.

Table 2 Regression model performances on the AB(OX)3 dataset for 9 properties using GAT, CGCNN, ALIGNN and SA-GAT-SR models

To evaluate the impact of the GNN module within the SA-GAT-SR framework on the final results, we conducted another ablation study, with the findings summarized in Fig. 2. This experiment primarily investigates the influence of the number of selected features on the SR module's derivation time and the prediction MAE for formation energy. For convenience, a combination of n reserved features for the bandgap task is denoted as BG{n}; the total number of input features to the SR module is then n + 1, including the prediction of the GNN model. Figure 2a demonstrates that the derivation time increases sharply as the total number of features grows (Supplementary Note 5 and Supplementary Table 7). Notably, when the feature set size reaches 22, the SR derivation time exhibits exponential growth, rendering it impractical to compute descriptors for larger feature sets. It should be emphasized that this exponential growth depends solely on the number of features and is independent of the specific types of features selected, making the SAE feature selection method suitable for scenarios involving varying feature sets. In the feature screening process of the SAE module, 16 features are retained as input for the SR module to enhance accuracy, resulting in a derivation time of 32.59 seconds. In contrast, the conventional SR process requires retaining most features to ensure generalization, for example 28 features, which takes 720.95 seconds. This demonstrates that the SAE algorithm improves the efficiency of SR by a factor of 23 compared to the conventional SR process while simultaneously enhancing accuracy.

Fig. 2: The performance of the SA-GAT-SR model across three datasets.

a Dependence of the SR derivation time on the number of input features. While keeping the dimensionality and complexity of the final expressions constant, the derivation time grows exponentially as the number of input features increases. b The bandgap MAE of the SA-GAT model with different hidden feature dimensions across the three datasets. c, d The performance of all methods (SA-GAT-SR, SA-GAT, ALIGNN, CGCNN, SR, and GAT) on bandgap and formation energy prediction. SA-GAT corresponds to the GNN module within SA-GAT-SR, while SR uses the same feature set as SA-GAT-SR but excludes the GNN-predicted output. e–g Bandgap prediction performance, comparing the SA-GAT-SR model with and without the GNN module as the number of reserved features increases, on the ABO3, ABX3 and AB(OX)3 datasets, respectively.

To demonstrate the influence of the hidden feature dimension on the performance of the SA-GAT model, we study bandgap prediction across the three datasets. The results are shown in Fig. 2b. A dimension that is too small can result in under-fitting, whereas a dimension that is too large yields only marginal improvements in performance. When the dimension reaches 128, the MAE stabilizes around 0.190 eV, 0.107 eV, and 0.175 eV on the three datasets, indicating that each node and edge feature vector is sufficiently expressive to capture the characteristics of real atoms and bonds. Additionally, to examine the advantage of the SA-GAT-SR model over the traditional SR method, we conduct another ablation study. As shown in Fig. 2c, d, the GNN initial-approximation strategy of the SA-GAT-SR model yields a much lower MAE and higher stability than the traditional method. Within the SA-GAT-SR framework, the SR process serves as a second-level approximation that builds upon the prediction results of the GNN module and delivers further improvements in overall prediction accuracy. Beyond improving accuracy, the screening capability of the SAE algorithm filters out less relevant features and reduces the size of the SR input feature set. For the prediction of formation energy and bandgap in the AB(OX)3 dataset, the SA-GAT-SR model achieves the lowest MAEs of 0.101 and 0.147 eV, respectively. SA-GAT, the GNN module of the SA-GAT-SR model, also shows performance comparable to CGCNN and ALIGNN, achieving MAEs of 0.109 and 0.194 eV. As shown in Fig. 2e–g, as the number of reserved features increases, the feature search space expands accordingly, and the MAE eventually converges to a fixed value because the input feature set becomes large enough to encompass all features required for the optimal solution. The speed of MAE convergence can thus serve as a valuable metric for evaluating the quality of the selected features. The SR process with the GNN module exhibits stable performance from BG14 onward, while simultaneously maintaining a high level of prediction accuracy. Without the GNN module, the expression result on the AB(OX)3 dataset remains unstable until BG22, while the results on the ABO3 and ABX3 datasets stabilize earlier but continue to exhibit high MAEs. The performance of SA-GAT-SR predictions across all target properties is provided in Supplementary Fig. 4.

Although the SAE mechanism does not directly enhance the accuracy of the final results, it provides valuable insight by reflecting the importance of each initial feature, thus offering physical interpretation. The ICs for each feature across all prediction tasks are shown in Fig. 3a–c, where the values represent relative importance ratios. Although the AB(OX)3 dataset encompasses all materials in ABO3 and ABX3, the feature distributions differ across the three datasets; consequently, the relative importance of the same initial feature varies between datasets. Similarly, the IC distributions for different properties are distinct, reflecting the physical significance of different features. In the ABO3 dataset, AQ shows a relatively high contribution to all properties except the bandgap, with ICs varying from 0.0556 to 0.0689. In contrast, Gdis exhibits a high IC of 0.0739 for the bandgap, indicating that crystal structure distortion significantly influences this property. For the AB(OX)3 dataset, Gdis likewise demonstrates high ICs, ranging from 0.1029 to 0.1683, across all properties except formation energy and total energy.

Fig. 3: The impact of the GNN module in both ICs computation and formation energy prediction.

a–c The ICs derived by the SAE algorithm within the GNN module, where (a–c) correspond to the IC results for ABO3, ABX3, and AB(OX)3, respectively. For each property prediction task, the IC reflects the significance of a specific feature within its feature set. In the figures, EF, Egap, and ET denote formation energy, bandgap, and total energy, respectively, for convenience. d–f Comparison of the GNN model with three different feature embedding algorithms in formation energy prediction on the AB(OX)3 dataset. The blue diamonds and red dots represent the results on the training and testing sets, respectively. The GNN model with the fully connected network (FCN) embedding algorithm serves as a baseline, reflecting standard performance. The model with the CGCNN embedding algorithm uses one-hot encoding followed by an FCN to generate feature vectors.

To further evaluate the role of the SAE algorithm, we replace the SAE module with a fully connected network (FCN) and with the CGCNN one-hot encoding method for feature embedding. Comparative experiments are conducted on the same dataset using identical hyperparameters. The MAE values obtained using the FCN, CGCNN, and SAE algorithms are denoted as \({E}_{{\rm{pred}}\_{\rm{FCN}}}\), Epred_CGCNN, and Epred_SAE, respectively. As shown in Fig. 3d–f, the GNN model based on SAE significantly outperforms those using FCN and CGCNN, with the best MAE of 0.107 eV/atom. Among the three, the FCN encoding shows the poorest performance, yielding an MAE of 0.223 eV/atom; it also exhibits overfitting, as evidenced by the testing-set MAE being 0.022 eV/atom higher than the training-set value. The CGCNN encoding method shows a suboptimal MAE of 0.142 eV/atom owing to the limited information used to distinguish different atoms, as it relies solely on the atomic number. This limitation results in underfitting, as indicated by the testing-set MAE being 0.030 eV/atom lower than the training-set value, and prevents the model from capturing the relative importance of features. Furthermore, the performance of the final descriptors depends on the predictive accuracy of the GNN module. Consequently, the SAE method yields the most accurate results and the best robustness, with a minimal difference of 0.002 eV/atom between the training and testing sets.

Compared to purely DL-based methods, the interpretability of our SA-GAT-SR model lies primarily in the final derived mathematical expressions. The expression result of the SA-GAT-SR model is composed of material-specific combination feature descriptors (e.g., \(\frac{{B}_{en}}{{A}_{en}}\)) and a neural network-derived descriptor (EP). Notably, the relationship between EP and the initial material features is nonlinear: changes in the initial features not only affect the final output of the expression but also alter the value of the coupled neural network descriptor EP. Let the feature set comprising the initially selected physical quantities, excluding EP, be denoted as M0. If we define the mapping between a material's structure and its properties as \({\mathcal{F}}(\cdot )\), the SA-GAT-SR model result can be represented as follows:

$$\hat{Y}={\mathcal{F}}({E}_{P}({M}_{0}),{M}_{0})$$
(2)

where EP( ) represents the function describing the relationship between the GNN module's predictions and the initial features. The above expression can be applied directly to materials screening tasks. However, when using the model to guide new material development, it is crucial to understand how the predicted results are influenced by each feature. The predicted value from the GNN model cannot itself be used as a guiding descriptor because it lacks an explicit mathematical relationship with the feature set M0. In most experimental outcomes, the coefficient of EP is close to 1; consequently, the final expression can be interpreted as a sum of a prediction term and correction terms.

To ensure that the derived expressions are interpretable and easy to analyze, we constrain the complexity of each descriptor to 1. For the bandgap prediction experiment, Table 3 presents a result that achieves both high accuracy and simplicity.

Table 3 The expression results derived by the SR module with descriptor complexity of 1 for all target properties in the AB(OX)3 dataset

The expression incorporates both XQ and Gdis, which have high ICs and are factors related to the stability of the crystal structure. Among the terms, the coefficient of EP is 1.06, which can reasonably be approximated as 1. Gdis, whose value ranges from −1 to 1, appears with a small coefficient of −0.13, so this descriptor contributes only between −0.13 and 0.13 to the predicted value and has a minor impact. In contrast, XQ has a relatively large coefficient of 0.20, modulating the bandgap through ionic effects of the X-site oxidation state; given that the mean value of XQ across all materials is −1.83, the mean contribution of this term is about −0.37. The MAEs of the derived bandgap equation in Table 3 and of the corresponding GNN module are 0.159 eV and 0.194 eV, respectively, indicating that the inclusion of XQ as a key factor significantly improves predictive performance. Both parameters may have a combined influence on the bandgap variation. Notably, the −0.13Gdis term reduces the bandgap as distortion increases: octahedral distortions in perovskites can reduce the bandgap by enhancing orbital overlap or altering d-orbital splitting, thereby raising the valence band maximum (VBM) or lowering the conduction band minimum (CBM) and narrowing the gap. Based on this model, when designing perovskite materials, we can optimize the bandgap by tuning the lattice distortion (Gdis) and the X-site oxidation state (XQ). Additional expression results with a descriptor complexity of 2 are provided in Supplementary Table 8. The work by Wang et al. also derived an expression predicting the bandgap of photovoltaic perovskites based on SISSO31. In contrast to the equation derived by SA-GAT-SR, they use DFT results computed with the Perdew-Burke-Ernzerhof (PBE) functional as an input feature, playing a role analogous to that of EP in our formulation, which can be expressed as

$${\hat{Y}}_{bandgap}^{{\prime} }=1.28{E}_{PBE}+9.05\frac{{l}_{X}}{{Q}_{B}}-(0.77{Q}_{X}+1.28){l}_{B}+2.50$$
(3)

where EPBE denotes the PBE bandgap energy, and lX and lB represent the energy levels of the lowest unoccupied molecular orbitals (LUMO) of the atoms at the X and B sites, respectively. Given that the PBE functional typically underestimates bandgaps by around 50%, the coefficient of EPBE in Eq. (3) is accordingly much greater than 1. This equation leverages a less accurate DFT result to learn a correction term that more closely approximates the true value. In comparison, our formulation uses a more accurate GNN-predicted value as the initial approximation and entirely removes the dependency on per-material first-principles calculations, making it well suited for fast, scalable property predictions over large materials databases.
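For concreteness, a Table-3-style expression can be applied in a screening loop as sketched below. The coefficients (1.06, −0.13, 0.20) are the ones quoted above; the intercept c0 is left as a placeholder because it is not quoted in the text, so the function is a sketch rather than the exact published model.

```python
def bandgap_descriptor(E_P: float, G_dis: float, X_Q: float, c0: float = 0.0) -> float:
    """Second-level (SR) bandgap estimate built on the GNN approximation E_P (eV).

    c0 is a placeholder intercept; the true value comes from the fitted model.
    """
    return 1.06 * E_P - 0.13 * G_dis + 0.20 * X_Q + c0

# Example: GNN predicts 2.10 eV for a mildly distorted oxide (X_Q = -2):
print(bandgap_descriptor(E_P=2.10, G_dis=-0.95, X_Q=-2.0))  # ~1.95 + c0
```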

Many prior DL-based approaches have attempted to develop fully universal models capable of accurately predicting any property using ultra-large-scale datasets. GNN models are often highly sensitive to the training dataset, making it challenging to improve robustness without careful tuning and data selection. In many cases, the primary objective is to identify key descriptors of the target property and leverage them to guide the discovery of new materials. Compared to neural network-derived descriptors extracted from crystal structures, mathematical descriptors offer greater interpretability and more actionable insights. The derived bandgap equation in Table 3 presents a combined descriptor, −0.13Gdis + 0.20XQ, which effectively captures the relationship between structural factors and the target property. The SISSO algorithm generates multiple expression models with comparable predictive performance from a large descriptor space; in our experiments, we set the number of retained expression models to 50. Although the equation in Table 3 achieves an MAE of 0.159 eV, it is not the most accurate among the generated models: the MAEs of the output models range from 0.1563 eV to 0.1723 eV, with the top-ranked models exhibiting similar predictive performance. It is not always necessary to select the expression with the best performance as the final result; a more concise and interpretable expression, such as a descriptor with a complexity of 1, may be preferred. When screening materials across large-scale datasets to meet specific requirements, the expression with the best predictive performance is typically applied, since achieving the highest accuracy is paramount in such scenarios. In the context of new material development, however, simplicity and interpretability of the descriptors become more important32: a simpler expression allows a clearer and more intuitive determination of the optimal composition or structural features of the new material, facilitating the material design process.

Discussion

In this study, we integrated GNN-based deep learning methods with symbolic regression and developed the SA-GAT-SR model, which achieves both high accuracy and interpretability. Compared to previously reported approaches, we propose a novel feature encoding algorithm that enables automated large-scale feature screening, reducing the number of SR input features from 31 to 14. This advancement significantly reduces the computation time of the SR module, achieving at least a 23-fold acceleration compared to the conventional SR method. Our approach simplifies and optimizes the feature requirements for input data in traditional SR-based methods, making the model more versatile and accessible when constructing and working with new datasets. We find that, in predicting various material properties, the SA-GAT-SR model outperforms both state-of-the-art GNN models and standalone SR models. In most property predictions, our model achieves an R2 of 0.93 or higher across the tested systems, indicating strong predictive performance and reliability. In addition to the high accuracy of the SA-GAT-SR model, which is well suited for large-scale material screening, our approach offers strong interpretability by generating physically meaningful descriptors, marking a significant advancement in both performance and insight. By combining GNN and SR, the model's learning objective can be naturally shifted from directly predicting material properties to learning combinations of meaningful descriptors along with their associated coefficients. By embedding interpretability directly into neural architectures, such models enable a more transparent decision-making process and facilitate knowledge extraction from complex datasets. This may foster data fusion, bridging experimental and ab initio results of different fidelity and origin. Looking forward, the development of self-explanatory neural models will further enable automated and interpretable insights that drive more efficient material design and discovery.

Methods

Dataset

In this study, we employ the dataset of ABO3 and ABX3 (X=F, Cl, Br, I) perovskite materials from the JARVIS-DFT (Joint Automated Repository for Various Integrated Simulations)41 database for training and evaluating our model. The JARVIS-DFT database is a widely recognized resource that provides high-quality, density functional theory (DFT) calculated properties for a diverse range of materials. For our analysis, we selected only materials that conform to the perovskite crystal structure. Owing to lattice distortions, the crystal symmetries of these materials include not only cubic but also tetragonal, hexagonal, monoclinic, and orthorhombic phases. In total, we curated a subset of 689 perovskite materials for further study. It is worth noting that some materials in the dataset may lack certain properties due to incomplete data. Given that the JARVIS-DFT database is regularly updated, we chose the specific version of the dataset from 2021.8.18, as it is the most commonly used version in recent studies and ensures consistency with previous research.

Self-adaptable graph attention networks (SA-GAT)

Our SA-GAT-SR model is a variant architecture based on GAT50, adapted to the specifics of materials data, especially crystalline materials. In the following, we define the crystal graph as an undirected graph with N nodes, denoted by \({\mathcal{G}}=(V,E,g)\), where V represents the node set, E represents the edge set, and g denotes the global node vector containing the initial global features, which provides a unique representation of the graph. The node with index i is represented as vi, and the feature vector of vi at layer l is denoted by \({h}_{i}^{l}\). The corresponding edge feature in E, which connects nodes vi and vj, is represented by \({e}_{ij}^{l}\). We consider the 12 nearest nodes within a distance of 12 Å of vi as its neighboring nodes and use a radial basis function (RBF) to compute the edge feature vectors, as sketched below. Additionally, we define a ranking index matrix \({{\mathcal{R}}}_{g}\in {{\mathbb{Z}}}^{N\times 12}\), which stores the distance ranking of each neighboring node relative to vi; each row of \({{\mathcal{R}}}_{g}\) holds the ranking indices of the neighbors of vi.
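The sketch below illustrates this neighbor-list and RBF edge featurization. The Gaussian basis with 64 centers and its width are illustrative choices on our part, not the model's exact hyperparameters.

```python
import numpy as np

def rbf_expand(d, d_min=0.0, d_max=12.0, n_centers=64, gamma=10.0):
    """Expand a scalar distance into a Gaussian radial-basis feature vector."""
    centers = np.linspace(d_min, d_max, n_centers)
    return np.exp(-gamma * (d - centers) ** 2)     # (n_centers,) edge feature

def neighbor_edges(dist_matrix, k=12, cutoff=12.0):
    """Per node: indices, distance rank (for R_g), and RBF features of the
    k nearest neighbors inside the cutoff."""
    edges = []
    for i, row in enumerate(dist_matrix):
        order = [j for j in np.argsort(row) if j != i and row[j] <= cutoff][:k]
        edges.append([(i, j, rank, rbf_expand(row[j]))
                      for rank, j in enumerate(order)])
    return edges
```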

To start, we convert the material structure into the graph representation using the self-adaptable encoding (SAE) method; at the same time, SAE produces the ICs of each feature for the subsequent screening step. The node features of the graph are then fed into the message-passing layers and iteratively updated. The ordinary nodes, representing atoms, and the global node, representing the crystal, are updated by different algorithms. After passing through all updating layers, we obtain the final global node feature gL, where L represents the total number of layers in the GNN module. The global node has learned the characteristics of the entire material and becomes its unique representation. Finally, the readout function extracts deep information from gL with a feed-forward network and produces the prediction result, which is represented as:

$$\hat{y}=FFN({g}^{L})$$
(4)

The specific form of the output \(\hat{y}\) depends on the nature of the prediction task, such as classification or regression. The FFN( ) function includes activation functions and regularization operations, tailored to optimize the model performance according to the task requirements.

Atomic nodes and global node updating method

In the GNN module, we incorporate the distance factors of different node pairs by leveraging the ranking matrix. We adopt the encoder architecture of the Transformer51,52,53,54 to update the graph node features and employ a novel set2set55-based algorithm to update the global node feature. As shown in Fig. 1b, the orange global node acts as the q* vector in the set2set model, representing the state of the node set. The network is composed of multiple serially stacked blocks, represented by green rectangles, each comprising a message-passing layer and a set2set layer, which update the atomic and global nodes, respectively.

At the start of each layer, the ranking matrix \({{\mathcal{R}}}_{g}\) is embedded into a distance coefficient matrix \({{\mathcal{C}}}_{g}\in {{\mathbb{R}}}^{N\times 12}\), represented by \({{\mathcal{C}}}_{g}={{\rm{Emb}}}_{r}({{\mathcal{R}}}_{g})\). The message passing update algorithm is formulated as follows:

$${h}_{i}^{l+1}={h}_{i}^{l}+\sum _{k\in {N}_{i}}{a}_{ik}{W}_{val}^{l}{z}_{ik}^{l}$$
(5)
$${e}_{ij}^{l+1}={e}_{ij}^{l}+{a}_{ij}{W}_{edge}^{l}({h}_{i}^{l}\oplus {h}_{j}^{l}\oplus {e}_{ij}^{l})$$
(6)
$${a}_{ik}=Softmax({W}_{qry}^{l}{h}_{i}^{l}\cdot {W}_{key}^{l}{z}_{ik}^{l}+{c}_{ik})$$
(7)
$${z}_{ik}^{l}={h}_{k}^{l}\oplus {e}_{ik}^{l}$$
(8)

where \({c}_{ij}\in {{\mathcal{C}}}_{g}\) is a scaling factor representing the distance coefficient between vi and vj, and ⊕ denotes the vector concatenation operation. Ni denotes the neighbors of vi in graph \({\mathcal{G}}\), while Wqry, Wkey, Wval, Wedge are learnable embedding matrices. Equations (5) and (6) represent the updates of the node feature hi and of the edge feature eij connecting vi and vj at layer l + 1.
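A single-head PyTorch sketch of Eqs. (5)–(8) is given below. The unified dimension `dim` for node and edge vectors and the treatment of the distance coefficients c_ik as a precomputed tensor are our simplifications of the published update, not released code.

```python
import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.W_qry = nn.Linear(dim, dim, bias=False)
        self.W_key = nn.Linear(2 * dim, dim, bias=False)   # acts on z = h_k (+) e_ik
        self.W_val = nn.Linear(2 * dim, dim, bias=False)
        self.W_edge = nn.Linear(3 * dim, dim, bias=False)  # acts on h_i (+) h_j (+) e_ij

    def forward(self, h, e, nbr, c):
        # h: (N, dim) nodes; e: (N, K, dim) edges to the K neighbors in nbr;
        # nbr: (N, K) neighbor indices; c: (N, K) distance coefficients c_ik.
        z = torch.cat([h[nbr], e], dim=-1)                 # Eq. (8): z_ik
        q = self.W_qry(h).unsqueeze(1)                     # (N, 1, dim)
        logits = (q * self.W_key(z)).sum(-1) + c           # Eq. (7) pre-softmax
        a = torch.softmax(logits, dim=1).unsqueeze(-1)     # attention a_ik
        h_new = h + (a * self.W_val(z)).sum(1)             # Eq. (5)
        hi = h.unsqueeze(1).expand(-1, e.size(1), -1)      # broadcast h_i
        e_new = e + a * self.W_edge(torch.cat([hi, h[nbr], e], dim=-1))  # Eq. (6)
        return h_new, e_new
```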

The core of the GNN module comprises multiple stacked message-passing layers that iteratively update the node and edge features. The readout operation is performed on the global node feature, which is updated at each layer. Inspired by previous work56,57, the global node, sometimes referred to as [VNode] in other models, is an artificially introduced node. Unlike other approaches, our model's global node, which represents the entire graph, is initialized from actual crystal properties via the SAE module. It thus acts as a representation of the graph for the readout operation but does not participate in node updates within the graph.

To enhance information retention across layers, we employ an inter-layer set2set method with a GRU58 updating unit. This configuration allows the readout operation to be conducted on the final output global node from the last layer, computed as follows:

$${g}^{l+1}={g}^{l}+\varphi ({q}^{l+1}\oplus {r}^{l+1})$$
(9)
$${q}^{l+1}=GRU({g}^{l})$$
(10)
$${r}^{l+1}=\sum _{i\in N}{e}_{i}^{l+1}{x}_{i}^{l+1}$$
(11)

where,

$$\begin{array}{r}{e}_{i}^{l+1}=Softmax({q}^{l+1}{x}_{i}^{l+1})\\ {x}_{i}^{l+1}={h}_{i}^{l+1}+\mathop{\sum}\limits _{j\in {N}_{i}^{1}}{a}_{ij}^{q}{W}_{val}^{q}{e}_{ij}^{l+1}\\ {a}_{ij}^{q}=Softmax({W}_{qry}^{q}{h}_{i}^{l}\cdot {W}_{key}^{q}{e}_{ij}^{l+1})\end{array}$$
(12)

In the above equations, \({W}_{qry}^{q},{W}_{key}^{q},{W}_{val}^{q}\) are learnable matrix parameters within the readout module, and \({N}_{i}^{1}\) denotes the 1-hop neighbors of vi in graph \({\mathcal{G}}\). The query feature vector ql+1 in the set2set module is derived from the global node using a GRU function that encapsulates key graph information. The attention weight \({e}_{i}^{l+1}\) is computed from \({x}_{i}^{l+1}\), which is obtained from the node features while accounting for the local environment around each node. The function φ( ) represents a feed-forward network (FFN) that converts the combined result into the output global node feature, as sketched below.
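A compact PyTorch sketch of the global-node update in Eqs. (9)–(11) follows, assuming a GRUCell plays the role of GRU( ) and a two-layer feed-forward network implements φ( ); the environment-aware features x_i^{l+1} from Eq. (12) are taken as precomputed input. These are our reading of the update, not the released implementation.

```python
import torch
import torch.nn as nn

class GlobalNodeUpdate(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gru = nn.GRUCell(dim, dim)   # GRU(.) in Eq. (10)
        self.phi = nn.Sequential(nn.Linear(2 * dim, dim), nn.SiLU(),
                                 nn.Linear(dim, dim))  # phi(.) in Eq. (9)

    def forward(self, g, x):
        # g: (B, dim) global node g^l; x: (B, N, dim) features x_i^{l+1}.
        q = self.gru(g)                                         # Eq. (10)
        e = torch.softmax((x * q.unsqueeze(1)).sum(-1), dim=1)  # attention e_i
        r = (e.unsqueeze(-1) * x).sum(1)                        # Eq. (11)
        return g + self.phi(torch.cat([q, r], dim=-1))          # Eq. (9)
```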