Introduction

In recent years, density functional theory (DFT) and molecular dynamics (MD) simulations have been studied and applied extensively in materials multiscale modeling1. For example, these simulations have enabled the calculation of energies and forces of materials across different scales2,3,4. Currently widely used simulation methods, including Kohn-Sham density functional theory (KSDFT)5,6 and MD simulations with classical interatomic potentials7,8,9,10,11,12,13,14,15,16, have demonstrated high performance in predicting the formation energy and elastic modulus of materials. However, both methods have their own limitations. KSDFT is computationally demanding and typically restricted to systems containing only a few hundred atoms, while MD can be applied to larger systems but is limited in accuracy due to the empirical nature of interatomic potentials.

To address the limitations of KSDFT and MD, machine learning (ML) models17,18 such as the neural network potential (NNP)19,20, Gaussian approximation potential21, spectral neighbor analysis potential22,23, and moment tensor potential24 have been proposed to accurately predict the energies and forces of crystals and molecules. These models use atomic species and nuclear coordinates to build descriptors (also called “fingerprints”) that are invariant under permutations of atoms of the same element and under rigid rotations, and these descriptors serve as features to be fitted by a chosen regression model19,25. However, such descriptors must be designed meticulously to satisfy these invariance constraints, and the complex transformations involved make the models difficult to interpret26,27.

To obtain more general descriptors, graph networks, which represent atoms and bonds as nodes and edges, respectively, combined with convolutional neural networks have received significant attention, since convolutional neural networks can automatically discover the important features, in contrast to descriptor-based models28. Several graph convolutional neural networks, such as the generalized crystal graph convolutional neural network (CGCNN)26, SchNet29, MEGNet30, and the atomistic line graph neural network (ALIGNN)31, have been proposed. They are straightforward to adopt and suitable for both crystals and molecules. However, these models have complex architectures comprising a series of operators and hidden layers, and their fitting process is time-consuming because of the large amount of training data required and the large number of parameters to be fitted in the neural network32.

Compared to graph-network-based potentials, symbolic regression is a faster method to build interatomic potentials; it uses genetic programming to find a function that accurately expresses the interatomic potential from a set of variables and mathematical operators33,34,35,36. However, symbolic regression also has limitations: the expressions in the hypothesis space must be simple and have a significant effect on the potential energy, and the model cannot learn complex terms that involve bond angles.

Moreover, the transferability of general ML potentials, which describes the ability of a model to correctly predict the properties of atomic configurations lying outside the training dataset, is limited. Consequently, physically informed neural networks (PINN) have been proposed to improve transferability to unknown structures37,38,39 by combining a general physics-based interatomic potential with neural-network regression. PINN achieves this by optimizing a set of physically meaningful parameters of a physics-based interatomic potential via a trained neural network and feeding them back to improve the accuracy of the original physics-based potential. However, this method encounters an obstacle similar to that of the graph networks mentioned above: a time-consuming fitting process resulting from the large amount of data and the numerous parameters of the neural network.

As molecular-dynamics simulation databases gradually improve, such as the force-field database of NIST JARVIS, which contains properties like formation energy and elastic constants calculated with different classical potentials, these databases become potential inputs for machine-learning models40,41. In this study, we present a regression-trees-based ensemble learning approach that efficiently predicts the formation energy and elastic constants of carbon allotropes from a small dataset calculated with classical potentials. We use carbon allotropes as an example to evaluate the performance of our model because carbon is one of the fundamental elements on Earth42, and carbon allotropes exhibit a variety of physical properties and are widely applied in cutting and polishing tools43, superlubricity44, solar thermal energy storage45, etc. Therefore, understanding the physical properties of carbon allotropes plays a significant role in both scientific research and engineering applications. We begin by extracting the structures of carbon allotropes from the Materials Project (MP)46 and compute their formation energy and elastic constants using MD simulations with nine different classical interatomic potentials via the Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS)47. We then use these computed properties as features and the corresponding DFT references as targets to train and test four different ensemble learning models48,49,50,51 consisting of regression trees52. In general, the ensemble learning models perform better than the nine classical potentials, and, based on feature importance, ensemble learning can identify the most accurate features and use them to improve the precision of the predictions.

Results

Ensemble learning framework for properties prediction of carbon materials

Figure 1 illustrates the schematic of the ensemble learning framework. First, carbon structures are extracted from the MP database. Then, the formation energy and elastic constants of each structure are calculated by MD with nine classical interatomic potentials, including the analytic bond-order potential (ABOP)53, the adaptive intermolecular reactive empirical bond order potential (AIREBO)9, the standard Lennard-Jones potential (LJ)13, the AIREBO-M potential10, which replaces the LJ term in AIREBO with a Morse potential54, the environment-dependent interaction potential (EDIP)53, the long-range carbon bond order potential (LCBOP)12, the modified embedded atom method (MEAM)14, the reactive force field potential (ReaxFF)15, and the Tersoff potential55. The training dataset is composed of these properties and the corresponding DFT references collected from the MP database, encoded into feature vectors xi and target vectors yi, respectively. For the formation energy, 58 carbon structures and their DFT references are extracted from the MP database; for the elastic constants, DFT references for 20 of the 58 carbon structures are used, owing to the absence of DFT references for the others and the removal of unstable or erroneous calculations56. Next, the regression-trees-based ensemble models are trained on these vectors. We select regression-trees-based ensemble learning models for the following reasons. First, compared to neural networks, regression trees are white-box models, which makes the models and their outputs easy to understand and interpret. Second, as non-linear models, regression trees perform better than classical linear regression and neural network methods when dealing with small datasets and highly non-linear features. Third, to mitigate the locally optimal decisions of regression trees, ensemble learning, which combines the predictions of several regression trees, improves robustness over a single regression tree. Last, for multi-target problems such as the prediction of elastic constants, ensemble learning can learn the correlations between elastic constants and output multiple targets at once. Here, different regression-trees-based ensemble-learning methods implemented in the Scikit-Learn package57, including bootstrap aggregation (bagging)58 and boosting49, are used to build simple, fast, and interpretable models. Details of the architectures and methods of regression trees and ensemble learning are given in the “Methodology” section. For new carbon structures, we calculate the same properties by MD with the nine potentials and feed these calculated properties into the trained model with the smallest mean absolute error (MAE) during testing, defined in Eq. 1, to predict the properties of the new structures.

$${MAE}=\frac{{\sum }_{i=1}^{n}|{y}_{i}^{{pre}}-{y}_{i}|}{n}$$
(1)

where \({y}_{i}^{{pre}}\) is the prediction of the model, \({y}_{i}\) is the reference value, and n is the number of samples.
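As a minimal sketch of this selection step, the snippet below trains several scikit-learn tree ensembles and keeps the one with the smallest test MAE; the arrays, the train/test split, and the model settings are illustrative placeholders, not the paper's actual data or protocol.

```python
import numpy as np
from sklearn.ensemble import (RandomForestRegressor, AdaBoostRegressor,
                              GradientBoostingRegressor)
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(58, 9))                              # placeholder: 58 structures x 9 potentials
y = X @ rng.normal(size=9) + 0.1 * rng.normal(size=58)    # placeholder DFT references

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

candidates = {
    "RF": RandomForestRegressor(random_state=0),
    "AB": AdaBoostRegressor(random_state=0),
    "GB": GradientBoostingRegressor(random_state=0),
}
maes = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    maes[name] = mean_absolute_error(y_te, model.predict(X_te))   # Eq. 1

best_model = candidates[min(maes, key=maes.get)]   # model used to predict new structures
```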

Fig. 1: Ensemble learning framework for properties prediction of carbon materials.
figure 1

The properties of carbon structures are calculated by MD, and these calculated values together with the DFT references are used as input to train the ensemble learning model. To obtain the same properties of new carbon structures, the properties of the new structures are first calculated by MD, and these values are then fed into the best trained model to obtain the final predictions.

For the formation energy of carbon materials, we employ four different ensemble-learning methods, namely RandomForest (RF), AdaBoost (AB), GradientBoosting (GB), and XGBoost (XGB), and evaluate their performance. Grid search combined with 10-fold cross-validation is applied to optimize the hyperparameters. After tuning, we run 10-fold cross-validation twenty times for each method with the optimized hyperparameters and calculate the MAEs relative to the DFT references. Furthermore, the median absolute deviation (MAD) of each method is calculated. MAD is defined as the median of the absolute deviations of the residuals from their median, as shown in Eq. 2 and Eq. 3.

$${MAD}={median}(\left|{r}_{i}-\widetilde{r}\right|)$$
(2)

where \({r}_{i}\) is the residual between the ith prediction and its corresponding target, and \(\widetilde{r}\) is the median of the residuals.

$$\widetilde{r}={median}(r)$$
(3)

MAD characterizes the dispersion of the residuals. It is more robust than MAE because it is insensitive to outliers. The MAE and MAD for each method are depicted in Fig. 2. Here, a voting regressor (VR) that combines the RF, AB, and GB models is utilized to mitigate the overall error by averaging their predictions. Besides, a Gaussian process (GP)59, a generic supervised learning method for regression problems, is also evaluated. Overall, the ensemble-learning models perform better than the classical interatomic potentials and the GP model. Notably, all their MAEs are lower than that of the most accurate classical potential, LCBOP. Since the formation energy values calculated by the different classical potentials have highly non-linear and complex relationships, regression trees perform better than classical regression methods such as GP under these conditions. In the inset of Fig. 2, the formation energies of various structures predicted by RF and LCBOP, together with the DFT references, are illustrated. It can be observed that RF outperforms LCBOP in terms of overall error. However, for the structures with the highest formation energy, RF's predictions are less accurate than those of LCBOP, possibly because of its inherently weak extrapolation, which causes the underestimated formation energy of mp-998866. It is worth noting that RF performs weakly for mp-1008395, mp-570002, mp-624889, and mp-1018088. The reason is the deviation of the features from DFT for each structure. For mp-1018088, all features except the one calculated by the LJ potential are smaller than the reference, leading to an underestimated prediction. The values of mp-624889 have a distribution similar to that of mp-1018088, but the deviation is smaller (−0.33 eV/atom for mp-624889 versus −0.55 eV/atom for mp-1018088), which makes the prediction more accurate. Conversely, since the values of mp-1008395 and mp-570002 calculated by most classical potentials are larger than the reference, both predictions are overestimated. Nevertheless, RF provides more accurate predictions in general.
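The tuning and repeated cross-validation protocol described earlier in this subsection, together with the MAE and MAD metrics, can be sketched as follows; the hyperparameter grid, the RF-only example, and the placeholder arrays are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score, cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(58, 9)); y = X @ rng.normal(size=9)   # placeholders for features/references

# Hyperparameter tuning via grid search with 10-fold cross-validation (grid is illustrative)
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 5, 10]},
    scoring="neg_mean_absolute_error",
    cv=10,
)
search.fit(X, y)

# Twenty repetitions of 10-fold cross-validation with the tuned model
run_maes = []
for seed in range(20):
    cv = KFold(n_splits=10, shuffle=True, random_state=seed)
    scores = cross_val_score(search.best_estimator_, X, y,
                             scoring="neg_mean_absolute_error", cv=cv)
    run_maes.append(-scores.mean())

# MAD of the cross-validated residuals, Eqs. (2)-(3)
res = cross_val_predict(search.best_estimator_, X, y, cv=10) - y
mad = np.median(np.abs(res - np.median(res)))
print(np.mean(run_maes), mad)
```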

Fig. 2: MAEs and MADs of the formation energy relative to DFT reference under different methods.
figure 2

Overall, the ensemble-learning models perform better than the classical interatomic potentials and the GP model, and RF performs best among them. The inset shows the formation energy of carbon structures predicted by LCBOP and RF; the black circles represent the DFT references. RF performs better than LCBOP overall, but for the structures with high formation energy, RF is less accurate than LCBOP because of its inherently weak extrapolation.

For the elastic constants of carbon materials, we also train and test the RF, AB, GB, and XGB models using elastic constants calculated by the same nine classical potentials. Grid search combined with 5-fold cross-validation is applied to tune the hyperparameters of each model, and 10-fold cross-validation is then conducted on the models with optimized hyperparameters to evaluate their performance. The prediction of elastic constants is a multi-target problem, but AB, GB, and XGB do not support multi-target regression. To overcome this limitation, we use a multi-target regressor57 combined with the four ensemble methods to predict the elastic constants. In brief, the multi-target regressor fits one regressor per elastic constant; it is a simple strategy to extend regressors that do not support multi-target problems. Here, we use the Tersoff potential as a benchmark for comparison with the different ensemble methods, since the MAE of the elastic constants of the Tersoff potential is at least an order of magnitude smaller than those of the other classical potentials. Fig. 3a illustrates the MAEs of the total elastic constants of the four ensemble methods obtained from twenty repetitions of 10-fold cross-validation. The MAEs of AB, RF, XGB, and GB are much smaller than that of Tersoff. It is worth noting that the MAE of Tersoff is significantly increased by one structure (mp-1095534) that includes both sp and sp3 hybridization of carbon, which makes it more complex than structures such as diamond or graphite that contain only a single hybridization. This complexity makes it difficult for Tersoff to produce accurate values. If the error associated with mp-1095534 is removed, the MAE of Tersoff drops to 63 GPa, which is still larger than those of RF, AB, GB, and XGB. Notably, different potentials behave differently for different carbon structures. Hence, to obtain the minimal MAE achievable with the classical potentials, we extract, for each structure, the smallest error among the nine classical potentials with respect to the DFT reference and then compute the total MAE of these smallest errors. As shown in Fig. 3a, Min represents this best achievable performance of the classical potentials. We can see that AB has a smaller MAE than Min, and XGB performs similarly to Min. Fig. 3b shows the elastic constants calculated by Tersoff and predicted by AB and RF using 10-fold cross-validation. The black dashed line is the ideal fit (1:1). To fit the plot, fifteen Tersoff points with excessively large errors are removed. Both AB and RF have lower MAEs than Tersoff, as shown in Fig. 3a. AB has a lower MAE than RF, possibly because the elastic constants of similar structures are correlated with each other and a sequential process like AB can reduce the bias. The points in the green circles in Fig. 3b have large errors; all of them come from C11, C22, or C33 of complicated structures not represented in the training sets. These structures have smaller values of C11, C22, or C33 than most of the training data, so the models do not have adequately fitted regressors for them, resulting in inaccurate predictions. To further assess the performance of the ensemble methods compared to Tersoff, the MAEs and MADs of partial elastic constants for AB and Tersoff are listed in Table 1. We exclude the MAEs and MADs of the remaining elastic constants because their errors are negligible. In general, all nine smallest MAEs and MADs are obtained from AB.
For Tersoff with mp-1095534, the MAEs and MADs are higher than the others. Although Tersoff without mp-1095534 yields smaller MAEs and MADs than with mp-1095534, the MAEs and MADs of AB are still smaller, some even more than 50% lower than those of Tersoff without mp-1095534. These results demonstrate that ensemble learning performs better than Tersoff.
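A minimal sketch of the multi-target strategy is shown below, assuming the features and targets have already been flattened into arrays of shape (n_structures, 9 × 21) and (n_structures, 21); the shapes, data, and names are illustrative placeholders.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)
X_ec = rng.normal(size=(20, 9 * 21))   # placeholder: 21 elastic constants from each of 9 potentials
Y_ec = rng.normal(size=(20, 21))       # placeholder: 21 DFT elastic constants per structure

# AB (like GB and XGB) is single-output, so one AdaBoost regressor is fitted per elastic constant
multi_ab = MultiOutputRegressor(AdaBoostRegressor(random_state=0)).fit(X_ec, Y_ec)
C_pred = multi_ab.predict(X_ec[:2])    # (2, 21) array of predicted elastic constants
```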

Fig. 3: Performance of different classical potentials and ML methods in elastic constants prediction.
figure 3

a MAEs of the total elastic constants relative to the DFT references under different methods; Min represents the best performance achievable for all structures using the nine classical potentials. AB and XGB perform better than or similarly to Min, and all ensemble models perform better than Tersoff. b Predicted elastic constants under Tersoff, RF, and AB versus the DFT references; both the RF and AB models have lower MAEs than Tersoff. The points in the green circles have large errors; all of them come from complicated structures that lie outside the training sets of both ensemble models.

Table 1 MAEs and MADs of the partial elastic constants relative to DFT reference under AB and Tersoff

Besides, we combine the formation energy and elastic constant data to train and test the same four ensemble methods. The MAEs for both properties are shown in Fig. 4. All models perform worse than before, with some MAEs even three times larger than when only one property is predicted. Even so, the MAEs of the formation energy from the RF, AB, and GB models are lower than those calculated by over half of the classical potentials, and the MAE of the formation energy from XGB is similar to that of AIREBO-M. Additionally, the MAEs of the elastic constants from all models are lower than those calculated by all classical potentials, including Tersoff, although the MAE of Tersoff is lower if the residuals of mp-1095534 are removed. The main reasons for the increase in errors are that the larger feature dimension and the more complex correlations between features and targets make it harder for the regression trees to learn the relationships correctly, and the limited number of samples also prevents the models from learning the features of complex structures well. In our dataset, most structures are either graphitic or diamond-like. Graphite-like structures typically have anisotropic C11, C22, and C33 elastic constants, two of which are usually close to 900 GPa, and their formation energy, around 0.05 eV/atom, is lower than that of diamond-like structures. On the other hand, C11, C22, and C33 are isotropic in diamond-like structures, all around 1100 GPa, which is higher than those of graphitic structures, and their formation energy is higher, around 0.15 eV/atom. Given these connections between formation energy and elastic constants, we can use only one type of property as a feature to reduce the feature dimension and still predict both properties. For instance, when only the formation-energy dataset is applied as features to train and predict both the formation energy and the elastic constants, we find that the MAEs are similar to those shown in Fig. 4.

Fig. 4: MAEs of the formation energy and total elastic constants relative to corresponding DFT reference under different ensemble methods.
figure 4

Due to the complexity of the relationship between the features and targets, all models perform worse than when only one property is predicted. Even so, the MAEs of the formation energy from the RF, AB, and GB models are lower than those of over half of the classical potentials, and the MAEs of the elastic constants from all models are lower than those calculated by all classical potentials.

Interpretability

To reveal the correlations and other useful information behind these features, principal component analysis (PCA) is used to decompose the high-dimensional dataset into a set of orthogonal components and project the dataset onto the components of maximum variance. Figure 5a shows the projection of the 9-dimensional formation-energy features onto a 2D plane. Along the first principal component, the graphite-like structures are grouped on the left of the plot, the diamond-like structures followed by the fullerene-like structures are clustered to their right, and the remaining structures are generally more scattered and located further to the right. This distribution is consistent with the formation energies of these structures: the graphite-like structures have the smallest formation energy, followed by the diamond-like and fullerene-like structures, while the complex structures have relatively higher formation energies. When the second principal component is also considered, similar structures lie close to each other, and some of them are far from their clusters because of their higher formation energies. All of this indicates that the feature space carries physical meaning consistent with the target property. Fig. 5b shows the PCA of the representations after the first hidden layer of CGCNN for the same structures. Likewise, the graphite-like, diamond-like, and fullerene-like structures are clustered and more compact overall, and the points that are far from similar structures in Fig. 5a are also far from their clusters in Fig. 5b. For the high-energy structures, three similar structures are closer to each other than the remaining one. In particular, one of the middle-energy structures is far from the others in both figures, even though other structures have similar energies, indicating that the features of structures with similar energies may differ because of their different structures. Since each feature vector is composed of the calculations of the classical potentials, the correlation among these features and within each feature for similar structures leads to similar features for similar structures, while the features of dissimilar structures differ depending on the values of each feature, even if the energies of these structures are similar. Therefore, unlike representations that use structural information directly, these energy-based features indirectly reflect the correlations between structures. To further demonstrate that this kind of feature vector can distinguish different structures with similar energies, Fig. 5c shows the feature values of different structures. It can be clearly seen that, among the graphite-like, diamond-like, and fullerene-like structures, the features of structures of the same type are similar, while the features of different types of structures are different. For the outliers within these three types, each outlier has features different from its own cluster, indicating the differences in their structures. Similarly, for the high-energy structures, the features of the first three are similar, while the others are different. Among the middle-energy structures, the overall features of the second-to-last structure differ from the other structures because of its high EDIP and MEAM values and low Tersoff value, which places it far from the other points.
The second structure among the middle-energy structures is similar to the fullerene-like structures in terms of the relative relationships between its features, which places it close to the fullerene-like cluster in Fig. 5a; the same can be seen in Fig. 5b. This shows that these features capture the correlations between structures to a certain extent. However, the correlation between similar structures may be influenced by individual changed feature values; for example, some graphite-like structures are separated from their cluster in Fig. 5a because of the underestimated LJ calculations, which could be improved by removing the unstable features.
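The 2D projection in Fig. 5a can be reproduced in outline with scikit-learn's PCA, as sketched below; whether the features are standardized beforehand is an assumption here, and the feature matrix is a placeholder.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(58, 9))       # placeholder for the 58 x 9 formation-energy features

# Project the nine-dimensional features onto the two directions of maximum variance
X_2d = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
# X_2d[:, 0]: first principal component, along which graphite-, diamond-, and
# fullerene-like structures separate in order of increasing formation energy
```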

Fig. 5: Interpretability of feature importance with different methods.
figure 5

a Visualization of the features. The original 9D vectors are reduced to 2D with PCA. Similar structures are clustered together, while the others are scattered because their structures differ from each other. The distribution of the features along the first principal component is also similar to the distribution of the formation energy of the same structures. b PCA of the CGCNN representations of the same structures has a distribution similar to (a). c The feature values corresponding to each structure; similar structures have similar feature values. d PCC between the features and the reference. As with the feature importance, ReaxFF has the largest PCC with the reference. LCBOP, however, has a larger PCC than AIREBO-M, which suggests that other factors may also play a role in determining feature importance.

Besides, the criterion for node splitting in regression trees is mainly based on a loss function such as the mean squared error (MSE), which describes the distribution of the targets under different features, and the regression tree identifies the feature and threshold with the minimal MSE as the split point. Since the features and targets in this study are correlated, ideally a feature and the corresponding targets have a linear relationship. The accuracy of the features varies for different structures, leading to different levels of linearity. Regression trees evaluate the linear relationship of each feature at each split point and capture the most important feature, i.e., the one with the minimal MSE. If a feature performs relatively weakly, the relationship between its values and the targets is nonlinear; the targets corresponding to two adjacent sorted values of this feature will then lie far apart, which makes the MSE larger than for a more accurate feature. Table 2 shows the average feature importance of the regression tree fitted to the formation energy, where permutation importance averaged over 20 repetitions is employed for feature evaluation48. Permutation feature importance measures the difference in error before and after permuting the values of a feature. In Table 2, ReaxFF has the largest impact on the accuracy of the model, followed by AIREBO-M in three of the models. The LJ and MEAM potentials have the smallest impact because of their large deviations. It should be noted that LCBOP has a smaller MAE than ReaxFF and AIREBO-M in Fig. 2; this is because node splitting depends on the linear relationship between the features and targets rather than on the difference between feature and target. Therefore, the Pearson correlation coefficient (PCC), which measures the linear correlation between two sets of data, is also used to assess feature importance. The equation of PCC is as follows,

$$r=\frac{\sum ({x}_{i}-\bar{x})({y}_{i}-\bar{y})}{\sqrt{\sum {({x}_{i}-\bar{x})}^{2}\sum {({y}_{i}-\bar{y})}^{2}}}$$
(4)

where \({x}_{i}\) and \({y}_{i}\) are the values of the x and y variables, respectively, and \(\bar{x}\) and \(\bar{y}\) are their means. Fig. 5d shows the PCC between the features and the reference. From the last column of the PCC matrix, we can see that ReaxFF has the largest positive linear correlation with the reference; the regression tree captures this linear correlation and uses ReaxFF to split nodes, which indicates that ReaxFF is the most important feature. This strong positive linear correlation also explains why ReaxFF is more important than LCBOP even though LCBOP has a smaller MAE in Fig. 2. However, LCBOP's correlation with the reference is higher than that of AIREBO-M, yet its importance is lower than that of AIREBO-M in Fig. 6b. This suggests that other factors may also play a role in determining feature importance.
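The permutation importance of Table 2 and the per-feature PCC of Fig. 5d can be computed as in the sketch below; the fitted model, data, and feature layout are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(58, 9)); y = X @ rng.normal(size=9)   # placeholder features/references
rf = RandomForestRegressor(random_state=0).fit(X, y)

# Permutation importance: error increase after shuffling each feature, averaged over 20 repeats
imp = permutation_importance(rf, X, y, n_repeats=20, random_state=0).importances_mean

# Pearson correlation coefficient (Eq. 4) between each feature and the reference
pcc = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
```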

Table 2 Average feature importance calculated by permuting features
Fig. 6: Frequency of local optimization occurring in each feature.
figure 6

a Flowchart for calculating the local minimal level of feature j. At the beginning, the local minimal level of each feature is initialized to 0, and each feature vector and the target vector are split into small parts; each part of the target consists of adjacent sorted target values (e.g., A-C represent three similar formation energies in the figure), and each part of the feature contains the MD-calculated values of the structures corresponding to the target vector (e.g., a-c). Each feature part is then compared with each target part to check whether both the minimum and maximum values of the feature part also exist in the target part; if so, a local minimal error exists and 1 is added to this feature's local minimal level. b Local minimal level of the different features. ReaxFF has the largest local minimal level, followed by AIREBO-M, which is consistent with the feature importance.

Apart from the factors mentioned above, the loss function indicates that local minimal errors also influence the choice of splitting features. The regression tree algorithm is inherently greedy and aims to find the feature with the local minimal error as the splitting feature when the sorted target values of the samples are close to each other. To quantify the level of local minimal error of each feature, we propose a way to describe the frequency of its occurrence. Figure 6a illustrates the process of computing the local minimal level of each feature. At the beginning, the local minimal level of each feature is initialized to 0, and each feature vector and the target vector are split into small parts; each part of the target consists of adjacent sorted target values, and each part of the feature consists of the corresponding target values sorted according to the MD-calculated values. For each part of the feature, we find its minimum and maximum values and compare them to each part of the target. If both the minimum and maximum values exist in a target part, this indicates a local minimal error and 1 is added to the feature's local minimal level. Figure 6b shows the local minimal level of each feature; ReaxFF has the largest local minimal level, followed by AIREBO-M. Thus, the local minimal level may explain to a certain degree the feature importance of ReaxFF and AIREBO-M in Table 2. In addition, the PCC, MAE, feature importance, and local minimum frequency of the RF trained on the formation energies of the carbon materials are analyzed to show that, for different structures, ensemble learning tends to use the more accurate potential as the criterion for its output. In Table 3, the highest PCC, lowest MAE, largest importance, and largest frequency are shown in bold. It can be seen that the accuracy of each potential differs across the energy intervals, and the PCC, MAE, feature importance, and local minimum frequency of each feature in the ensemble models are generally positively correlated with each other. The ensemble learning splits nodes based on these indicators and normally uses the properties of the more accurate potentials as the criterion for its output. Although ReaxFF has the largest PCC and importance overall, ensemble learning generally utilizes the more accurate feature as the criterion for prediction under the corresponding structures. For the structures with high formation energy, LCBOP is not the most important feature, possibly because of the lack of training data (only 5 samples), but its importance is still the second largest. This characteristic stems from the local-minimum-based algorithm, which captures relatively accurate features for splitting the tree's nodes.
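A minimal sketch of the counting procedure of Fig. 6a is given below, under the assumption that both the sorted targets and the feature-ordered targets are split into consecutive chunks of a fixed, illustrative size.

```python
import numpy as np

def local_minimum_level(feature, target, part_size=3):
    """Count how often a feature chunk's extreme values fall inside a single target chunk."""
    n_parts = max(1, len(target) // part_size)
    # target values re-ordered by the MD-calculated feature values, then chunked
    feature_parts = np.array_split(target[np.argsort(feature)], n_parts)
    # target values sorted by themselves, then chunked into adjacent groups
    target_parts = np.array_split(np.sort(target), n_parts)
    level = 0
    for fp in feature_parts:
        for tp in target_parts:
            if fp.min() in tp and fp.max() in tp:   # a local minimal error for this feature
                level += 1
                break
    return level
```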

Table 3 The PCC, MAE, feature importance, and local minimum frequency of RF trained by carbon materials under different formation energy intervals

In summary, the MAE, PCC, feature importance, and local minimum frequency of each feature in the ensemble models are generally positively correlated with each other. The feature with the highest importance generally also has the smallest MAE, the highest PCC, and the highest local minimum frequency, indicating that the relationship between feature and target and the algorithm of the decision tree are intertwined and together decide the feature importance of the decision trees. So, based on the linear correlation between features and targets and the locally greedy characteristic of the algorithm, ensemble learning can capture relatively accurate features, calculated by the nine classical potentials for the corresponding structures, for splitting nodes. More accurate features can thus be used to improve the performance of the ensemble model systematically.

Formation energy prediction of new carbon structures

To further evaluate the performance, the trained formation-energy RF model is employed to predict the formation energy of new carbon structures. We extract all silicon carbide and silicon structures from the MP database and compare them with the carbon structures using the similarity method60, which evaluates the dissimilarity of any two structures by calculating the statistical difference in the local coordination environments of all sites in both structures. Out of 76 silicon carbide and silicon structures, 10 are selected as new structures based on the similarity method. We replace the silicon element with carbon in these 10 structures to obtain new carbon structures and calculate their formation energies with the nine classical potentials as features. These features are then input into the pre-trained RF model to predict the formation energy. Fig. 7 illustrates the formation energy of each new structure calculated by RF, CGCNN, ALIGNN, DFT, and the three most accurate classical potentials. The MAEs of CGCNN, ALIGNN, and RF are 0.376 eV/atom, 0.446 eV/atom, and 0.850 eV/atom, respectively, while the minimum MAE among the classical potentials is 1.402 eV/atom for AIREBO. The CGCNN and ALIGNN models use interatomic structural information as input, which gives them a certain transferability to structures outside the training set. In contrast, the RF model is based on the feature values rather than the atomic structures; although RF performs well in interpolation, its extrapolation ability is limited for new structures with energies higher than those in the training set. This can be seen from the fact that the high-energy predictions approach a constant value in Fig. 7. Since only 4 carbon structures in the training dataset have formation energies larger than 2 eV/atom, and the highest energy is 2.7 eV/atom, the maximum prediction is around 2.7 eV/atom, and the model cannot make reasonable predictions for structures above 2.7 eV/atom. In addition, for new structures whose energies fall within the range of the training set, RF's predictions depend on the accuracy of the features. In other words, for the lowest-energy structure in the figure, since all features are higher than the DFT values, RF over-predicts just like the classical potentials. Therefore, the performance of RF depends on the diversity of the training set and the transferability of the features. To further inspect the relationship between the features and the model, the MAE of the features with respect to DFT and the feature importance of RF are given in Table 4. Interestingly, the feature importance changes with the accuracy of the features for the new structures. AIREBO-M has the smallest MAE and the largest importance, whereas ReaxFF has a large MAE and low importance. This may indicate that RF can filter out features with large errors according to the trained feature values, so as to split the trees using features within a reasonable range.

Fig. 7: Performance of different methods for prediction of formation energy of some carbon structures.
figure 7

The x-axis represents the IDs of the silicon carbide and silicon structures in the MP database corresponding to the carbon structures.

Table 4 The MAE of features corresponding to DFT, and feature importance of RF

Discussion

There are some limitations and opportunities for improvement. First, the limited size of the training dataset may restrict the performance of the models. This constraint is apparent in Figs. 3b and 7, and it could be mitigated by including more training samples to extend the learning space or to make the interpolated predictions smoother. Because of the limited number of carbon structures in MP, Si-O binary systems are used to test the performance of ensemble learning on a larger dataset and to evaluate the influence of the training-data size. Here, 335 Si-O structures are extracted from MP, and their formation energies are calculated with three classical potentials: COMB61, Tersoff62, and Vashishta63. Figure 8 shows the performance of the different models under different k-fold cross-validation settings; as k increases, the training set grows and the error decreases. Eventually, the error stabilizes once a certain amount of training data is reached, indicating that under-fitting leads to prediction errors when the training dataset is insufficient. When the training set reaches a certain size, adding very similar training data does not help the model improve, since the model has already learned enough feature information from the previous training set to predict new structures. Although extrapolation to new structures is limited, as in Fig. 7, the overall errors of all ensemble models for 10-fold cross-validation (0.132 eV/atom, 0.143 eV/atom, 0.140 eV/atom, and 0.141 eV/atom for RF, AB, GB, and XGB, respectively) are smaller than those of the three potentials (0.240 eV/atom, 0.156 eV/atom, and 0.147 eV/atom for COMB, Tersoff, and Vashishta, respectively). Figure 9 illustrates the formation energy of each Si-O structure calculated by RF, CGCNN, ALIGNN, DFT, and the three classical potentials on a logarithmic scale; negative values from CGCNN and ALIGNN are plotted as absolute values. For the low-energy structures (the first 250 structures), compared with the DFT calculations, CGCNN, ALIGNN, COMB, and Tersoff generally show larger deviations than Vashishta and RF. COMB and ALIGNN predict higher values than DFT, whereas Tersoff and CGCNN predict lower values. For the high-energy structures (the remaining 85 structures), however, Vashishta's predictions are generally lower than the DFT values, and the other two classical potentials are more accurate than Vashishta. RF and ALIGNN have smaller deviations than CGCNN. The overall MAEs of RF, ALIGNN, CGCNN, Vashishta, Tersoff, and COMB are 0.132 eV/atom, 0.106 eV/atom, 0.146 eV/atom, 0.147 eV/atom, 0.156 eV/atom, and 0.240 eV/atom, respectively. Briefly, the ML-based models have lower MAEs than the classical potentials, and RF has the lowest overall error apart from ALIGNN. Even though ALIGNN shows a larger deviation than RF in the low-energy region on a logarithmic scale, the energy difference between DFT and ALIGNN is small on a linear scale, and RF has some points with relatively large deviations in both the low-energy and high-energy regions; ALIGNN, moreover, has the lowest MAE for the high-energy structures. To interpret the RF model, the PCC, MAE, local minimum frequency, and feature importance of the different potentials in the low-energy and high-energy regions are also calculated.
As shown in Table 5, Vashishta has the smallest MAE, the largest PCC, the largest feature importance, and the largest local minimum frequency in the low-energy region, while Tersoff does in the high-energy region. These results are generally consistent with those in Table 3, again indicating that ensemble learning can find the more accurate potential-energy calculations for the corresponding structures as features for prediction.
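The training-size study of Fig. 8 can be sketched as a simple loop over k in k-fold cross-validation; the Si-O feature and reference arrays below are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_sio = rng.normal(size=(335, 3)); y_sio = X_sio @ rng.normal(size=3)   # placeholders

for k in (2, 3, 5, 10):                       # larger k -> larger training folds
    scores = cross_val_score(RandomForestRegressor(random_state=0), X_sio, y_sio,
                             scoring="neg_mean_absolute_error", cv=k)
    print(k, round(-scores.mean(), 3))        # MAE is expected to decrease and then plateau
```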

Fig. 8
figure 8

Performance of the ensemble learning models for formation energy prediction using different training sizes. The error decreases and stabilizes as k increases, and RF has the smallest overall MAE.

Fig. 9
figure 9

The formation energy of each Si-O structure calculated by RF, CGCNN, ALIGNN, DFT, COMB, Tersoff, and Vashishta. Among the classical potentials, Vashishta and Tersoff are the most accurate for the low-energy and high-energy structures, respectively, and ALIGNN has the best overall performance among all methods.

Table 5 The MAE, PCC, feature importance and local minimum frequency of different features

Second, the performance of the regression trees depends on the accuracy of the features and on the linear correlation between features and targets, so more accurate classical interatomic potentials could be used as features to improve performance. In addition, more features make the ensemble methods more complex and the feature importance harder to interpret, and complex correlations between features and targets may also make the regression trees unstable, which affects the performance of ensemble learning; for example, the overall performance of the two-property prediction model (Fig. 4) is worse than that of the single-property prediction models (Figs. 2 and 3a). Therefore, to obtain better performance and interpretability, single-property prediction models and an appropriate feature size need to be considered. Table 6 shows the performance of RF for predicting the formation energy of carbon materials with different feature sizes. Compared with the MAE of an MD simulation with a single classical potential (Fig. 2), the RF characterized by the single feature calculated with that potential performs better, but not as well as the RF characterized by all potentials. Besides, except for the RF with low-precision features (ABOP, LJ, MEAM, Tersoff, and EDIP), the RF with high-precision features (AIREBO, AIREBO-M, LCBOP, and ReaxFF) and the RF using only the accurate LCBOP and ReaxFF as features perform better than the RF with only a single feature. In particular, the RF performs best when only the highest-precision potentials, LCBOP and ReaxFF, are used as features. These results also show that, when the number of features increases, especially when inaccurate feature values are added, the accuracy of the model decreases because of the more complex feature relationships; conversely, when only the accurate features are used, the correlation between features and targets is more linear, which makes it easier for the regression tree to find the intrinsic correlation between feature and target.

Table 6 The MAE of RF trained with different feature sizes

Beyond the discussions above, for a given feature size, the input features composed of physical properties calculated by different classical interatomic potentials are not convenient to obtain, since the physical properties of each new structure must be calculated with all of these potentials. Inspired by the imputation of missing input values, this dilemma can be alleviated by using imputation methods to infer the missing values from the known part of the data. Here, we use the k-Nearest Neighbors (KNN) approach64 to impute the missing features in the input. KNN uses a Euclidean distance metric to learn the correlations between features and to find, among the samples that have values for the missing features, the nearest neighbors of the sample with missing values; the missing values are then imputed as distance-weighted averages over these nearest neighbors. Figure 10 shows, under 10-fold cross-validation on the formation-energy dataset, the performance of the RF model combined with 2-nearest-neighbor imputation when only one or two features are obtained from MD calculations. It can be clearly seen that when the more accurate features are calculated and the other features are imputed, the accuracy of the model is higher. This is because the more accurate features are more important in the model, so calculating these features instead of imputing them reduces their deviations and thereby improves the prediction stability of the model. Fig. 10 also shows that the accuracy of the model increases when more feature values are calculated as input; for example, when ReaxFF and AIREBO-M are used as input, the MAE is smaller than in the other cases, and the accuracy of the model is similar to that of the full-input GB model (Fig. 2). So, it is feasible to reduce the workload of obtaining the input through the imputation of missing data, although this increases the error to a certain extent. Finally, it is worth mentioning that some questions are not discussed in this paper, such as the feasibility of the ensemble learning method in MD simulations and structure-optimization problems; further research is needed to determine whether ensemble learning can perform these calculations.
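A minimal sketch of the 2-nearest-neighbor imputation step is shown below, assuming a complete training feature matrix and a new sample in which only two columns (standing in for ReaxFF and AIREBO-M) are computed by MD; the arrays, column indices, and values are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)
X_train = rng.normal(size=(58, 9)); y_train = X_train @ rng.normal(size=9)   # placeholders
rf = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Learn feature correlations from the complete training rows
imputer = KNNImputer(n_neighbors=2, weights="distance").fit(X_train)

x_new = np.full((1, 9), np.nan)
x_new[0, 7] = 0.12        # illustrative MD-computed feature (column index assumed for ReaxFF)
x_new[0, 3] = 0.10        # illustrative MD-computed feature (column index assumed for AIREBO-M)
x_filled = imputer.transform(x_new)          # remaining features imputed from 2 nearest neighbors
prediction = rf.predict(x_filled)
```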

Fig. 10: The MAEs of formation energy of RF under 2-nearest neighbors’ imputation.
figure 10

The x-axis represents different conditions; the computed features are obtained from MD simulations, and all other features in each condition come from 2-nearest-neighbor imputation. The RF model performs better when more features, or more accurate features, are calculated as input instead of imputed.

In summary, we explore the possibility of predicting the physical properties of a small set of carbon allotropes based on ensemble learning. The formation energy and elastic constants of carbon structures, taken as examples, can be predicted with this kind of method. In general, the ensemble methods perform better than the classical interatomic potentials used in this work, although at some points the predictions are inaccurate because a lack of training data, the high dimensionality of the features, and the locally greedy characteristic of the algorithm make it difficult for the model to learn the relationship between features and targets correctly. The PCA shows that the input, which consists of the values calculated by the different classical interatomic potentials, has a distribution similar to that of the corresponding target property. Furthermore, the Pearson correlation coefficient illustrates the linear correlation between input and output, and the regression trees can capture the relatively accurate features as the criteria for splitting nodes, as evidenced by the feature importance.

Methodology

Regression trees of ensemble learning

Regression trees, a type of decision tree, are used to predict outputs consisting of numerical values instead of categorical targets. They are also the base estimators in ensemble learning (the tree structures in Fig. 12). Figure 11 illustrates a regression tree with seven nodes in total. The tree starts from the top node; each node contains sorted samples and is split into two subsets based on a criterion (threshold) on the features until a terminal condition is reached. The blue nodes are parent nodes, each with two subsets called children. The green nodes are end nodes, representing numerical outputs determined by the targets. In scikit-learn, an optimized version of Classification and Regression Trees (CART)52 is used. This algorithm determines how to divide the sorted samples by trying different thresholds and calculating the MSE at each step. In this study, the feature vectors \({x}_{i}\in {{\mathbb{R}}}^{n}\) and target vector \(y\in {{\mathbb{R}}}^{k}\) are the properties calculated by the classical interatomic potentials and the corresponding DFT references, respectively, where the subscript i indexes the different materials, the superscript n is the number of input variables (the number of classical interatomic potentials), and the superscript k is the total number of materials. We denote by \({Q}_{m}\) the dataset at node m with \({N}_{m}\) samples, by \({Q}_{m}^{{left}}\) and \({Q}_{m}^{{right}}\) the children of \({Q}_{m}\), and by \({N}_{m}^{{left}}\) and \({N}_{m}^{{right}}\) the numbers of samples of these children. The children split \({Q}_{m}\) into two parts using a threshold. The quality of the split of node m is assessed by minimizing the weighted average of the impurity.

$$G\left({Q}_{m}\right)=\frac{{N}_{m}^{{left}}}{{N}_{m}}H\left({Q}_{m}^{{left}}\right)+\frac{{N}_{m}^{{right}}}{{N}_{m}}H\left({Q}_{m}^{{right}}\right)$$
(5)

where H is the loss function (such as the MSE). For example, at node m, the MSE of its left child \({Q}_{m}^{{left}}\) is given by:

$${\bar{y}}_{m}=\frac{1}{{N}_{m}^{{left}}}{\sum }_{y\in {Q}_{m}^{{left}}}y$$
(6)
$$H\left({Q}_{m}^{{left}}\right)=\frac{1}{{N}_{m}^{{left}}}{\sum }_{y\in {Q}_{m}^{{left}}}{(y-{\bar{y}}_{m})}^{2}$$
(7)

Here, \({\bar{y}}_{m}\) is the average value of the targets at node \({Q}_{m}^{{left}}\). By recursing over \({Q}_{m}^{{left}}\) and \({Q}_{m}^{{right}}\), the weighted average of the impurity changes as well, and the threshold that minimizes the impurity \(G\) is selected for node m. The same steps are repeated for each node until the terminal condition is reached, and finally a trained regression tree is obtained.
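The split search implied by Eqs. (5)-(7) can be written out for a single feature as in the sketch below; this is a didactic sketch, not the optimized CART implementation used by scikit-learn.

```python
import numpy as np

def best_split(feature, target):
    """Scan candidate thresholds on one feature; return (impurity, threshold) minimizing Eq. (5)."""
    order = np.argsort(feature)
    f, t = feature[order], target[order]
    best_g, best_thr = np.inf, None
    for i in range(1, len(f)):                           # split between consecutive sorted samples
        left, right = t[:i], t[i:]
        h_left = np.mean((left - left.mean()) ** 2)      # Eq. (7) for the left child
        h_right = np.mean((right - right.mean()) ** 2)   # and analogously for the right child
        g = (len(left) * h_left + len(right) * h_right) / len(t)   # weighted impurity, Eq. (5)
        if g < best_g:
            best_g, best_thr = g, (f[i - 1] + f[i]) / 2
    return best_g, best_thr
```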

Fig. 11: The schematic of a regression tree illustrates that the blue nodes (parent nodes) are split into two subsets (children) based on thresholds on the features until the terminal condition is reached.
figure 11

When the tree is used for prediction, the trained regression tree follows the inputs and the thresholds to select the children until one green node (output) is reached.

Bagging and boosting methods

The bagging and boosting methods shown in Fig. 12 are used to achieve better performance than a single regression tree in this work. In bagging methods, several regression trees are trained independently on their own subsets, in which the data can be chosen more than once, and the final prediction is obtained by averaging the predictions48 of all individual regression trees. In contrast, the regression trees in boosting methods are generated sequentially; each regression tree has limited depth and is related to the previous one. Instead of averaging the outputs of all regression trees, the final prediction is obtained as the weighted median49 of the predictions of all regression trees or as the sum of the predictions of all regression trees50.
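The two strategies map directly onto scikit-learn estimators, as in this illustrative sketch; the hyperparameters shown are defaults or arbitrary choices, not the tuned values used in this work.

```python
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor

# Bagging (Fig. 12a): independent trees on bootstrap subsets, predictions averaged
bagging = BaggingRegressor(DecisionTreeRegressor(), n_estimators=100, bootstrap=True)

# Boosting (Fig. 12b): shallow trees fitted sequentially, predictions summed
boosting = GradientBoostingRegressor(n_estimators=100, max_depth=3)
```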

Fig. 12
figure 12

Configurations of two regression-tree-based ensemble learning models: bagging (a) and boosting (b). a In bagging methods, several regression trees are trained independently on their own subsets, in which the data can be chosen more than once, and the final prediction is obtained by averaging the predictions of all regression trees. b In boosting methods, the trees are generated sequentially, and each regression tree is related to the previous one; the final prediction is the weighted median of all the trees' predictions or the sum of all the trees' predictions.

Data collections

For the formation energy, 58 carbon structures and 335 Si-O structures and their DFT references are extracted from the MP database. The nine classical potentials available for carbon in LAMMPS are used to perform energy minimization of each structure and obtain its energy per atom. The energy above hull in the Materials Project database is then used: the structure with a value of 0 is taken as the reference, and the values calculated by the nine potentials for this structure are used as their respective references to obtain the energy above hull of all structures; these values are used as the input features. For the elastic constants, the DFT references of 20 of the 58 carbon structures are used, owing to the absence of DFT references for the others and the removal of unstable or erroneous calculations65. For the features of each structure, the same nine potentials are used to calculate the elastic constants at 0 K with LAMMPS; the 21 elastic constants calculated by each potential are used as the input features.
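A sketch of how the per-atom energy features described above could be assembled is given below; `md_energy(structure, potential)` is a hypothetical helper wrapping a LAMMPS energy minimization, not an existing LAMMPS or pymatgen API, and the referencing scheme is a simplified reading of the procedure.

```python
import numpy as np

def build_energy_features(structures, n_atoms, potentials, ref_index):
    """Per-atom energies from each potential, referenced to the structure with zero energy above hull."""
    feats = np.zeros((len(structures), len(potentials)))
    for j, pot in enumerate(potentials):
        # md_energy is a hypothetical helper returning the minimized total energy from LAMMPS
        e_per_atom = np.array([md_energy(s, pot) / n for s, n in zip(structures, n_atoms)])
        feats[:, j] = e_per_atom - e_per_atom[ref_index]   # energy relative to the reference structure
    return feats
```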