Background & Summary

The pursuit of sustainable energy solutions necessitates efficient processes for hydrogen production1,2,3, with methane decomposition emerging as a pivotal pathway4,5,6. However, traditional methane conversion routes, such as dry reforming, which couples methane decomposition with CO2 conversion, face significant challenges, including stringent reaction conditions and complex catalytic networks that often result in undesired byproducts and decreased selectivity7,8,9.

To address these challenges, our research focuses on optimizing methane decomposition using single-atom alloy (SAA) catalysts. SAAs are renowned for their high selectivity and efficiency, attributed to their uniform active sites10,11,12, which offer precise control over the catalytic process and potentially simplify methane decomposition to produce hydrogen without CO2 emissions13,14,15.

Several databases, such as the Open Catalyst Project16 and Catalysis Hub17, provide substantial data on catalytic reactions, with a particular focus on thermodynamic properties such as adsorption and reaction energies. However, the underlying DFT calculations tend to emphasize these thermodynamic aspects, leaving gaps in kinetic parameters, especially reaction energy barriers. Data on methane dry reforming, in particular, are urgently needed to bridge these gaps.

To accelerate the development and optimization of SAAs for methane decomposition, machine learning (ML) and density functional theory (DFT) techniques are employed. These widely utilized computational tools predict and validate the dissociation energy barriers on various SAA surfaces, identifying catalysts with optimal performance18,19,20,21,22,23,24. This approach not only enhances the efficiency of catalyst screening but also significantly reduces the time and resources typically required for experimental testing25.

The process of dehydrogenating CHx species (where x ranges from 1 to 4) can be understood as a series of sequential steps, each involving the elimination of one hydrogen atom. Numerous studies have indicated that among the four-step dehydrogenation pathways for CHx, the dissociation barrier of CH is the highest26,27. Additionally, as shown in Supplementary Fig. 1, both the initial and final steps in the dehydrogenation of CH4 exhibit significant energy barriers that are positively correlated.

Based on these findings, a variety of SAAs are constructed, and DFT is used to calculate the energy barrier for C-H bond dissociation. The structures and adsorption sites of the different SAA surfaces are illustrated in Supplementary Figs. 2 and 3, yielding a dataset of 689 DFT-calculated energy barrier values. These DFT results serve as the training and test sets. The training set contains 623 energy barriers for C-H dissociation, obtained from a comprehensive sampling of transition metals, with several metal surfaces selected as substrates. The test set contains the remaining 66 energy barriers, concentrated on surfaces that are absent from the training set but appear in the prediction set. For each such metal, the surface with the largest area proportion is chosen: Ag(322), Au(332), Cd(001), Hg(101), Mn(110), Mo(110), Nb(321), Pd(111), Rh(111) and Tc(100). This test set is not involved in model training and is used only for validation. The distributions of the training, prediction and test sets are discussed below. All SAA surfaces are created by substituting one dopant atom for a substrate metal atom in the top layer. The substitution site is selected automatically: the surface is cut with Pymatgen, and the central surface atom is identified for replacement.

After training, the ML model makes predictions. The dopant and substrate (host) metals, covering 30 transition metals, are shown in Fig. 1. SAAs with host metals such as Fe, Co and Ni show high dissociation activity, in good agreement with previous experimental findings28,29,30,31,32,33. Among the tested models, the GBDT model performs best, with r2, RMSE and MAE of 0.921, 0.130 eV and 0.094 eV, respectively. The C-H dissociation energy barriers are predicted with this model. The distribution of C-H dissociation energy barriers is also depicted in Fig. 1, comprising 8,638 energy barrier values for reactions occurring on the top site. Among the dopant atoms loaded on these host metals, Co, Ru, Re, Os and Ir demonstrate noteworthy catalytic performance. Feature-importance analysis of all employed descriptors shows that doped_weighted_surface_energy is the most significant descriptor, accounting for over 40% of the overall feature importance (Supplementary Fig. 4). This descriptor is the weighted surface energy of the doped metal atom and relates to the atom’s ability to gain and lose electrons. Other highly ranked descriptors include com_top_d_e_number, host_molar_volume, com_top_d-band, CN-B3 + 1-top05, com_top_electronegativity, host_surface_energy and host_surface_work_function. A detailed discussion of the machine learning results can be found in our previous work28; only the parts relevant to the database presented here are outlined.

Fig. 1

(a) The metal types of host and doped atoms and (b) the distribution of C-H dissociation energy barriers, obtained by ML.

By leveraging advanced ML models, this database supports the optimization of methane decomposition over single-atom alloy catalysts. The ML models are trained on the DFT results, using optimized descriptors from the Materials Project34, Pymatgen35 and expert knowledge. Classifying the reaction active sites improves the accuracy of the ML models. C-H dissociation energy barriers on the top site of various SAA surfaces are then predicted and tested.

Methods

The methodology applied in this work comprises two parts: DFT calculations to generate the training and test data, and ML to make predictions. The overall workflow is shown in Fig. 2. Detailed information on each step is given below.

Fig. 2

Workflow for generating the C-H dissociation energy barrier database. We start by searching for the transition states of C-H dissociation on various SAA surfaces with DFT and obtain the energy barriers. These values are then used to train the ML models, with highly relevant descriptors selected through feature engineering; the origin of the descriptors is explained below. The prediction set is first classified by reaction site, and separate models are trained for the different reaction sites. The database, containing 8,638 top-site entries, is predicted by the resulting model and shows high accuracy in further tests.

DFT Calculations

First-principles calculations were performed using the Vienna Ab initio Simulation Package (VASP)36,37. The electron-ion interactions were treated using the Projector Augmented-Wave (PAW)38,39 method. The exchange-correlation interaction was treated within the Generalized Gradient Approximation (GGA)40 using the Perdew-Burke-Ernzerhof (PBE) functional. A plane-wave basis set with a cutoff energy of 400 eV was employed. Additionally, intermolecular van der Waals interactions were corrected by Grimme’s DFT-D3 method41.

For surface optimization, the energy convergence criterion for electronic self-consistent calculations was set to 1 × 10−5 eV, and the convergence criterion for the Hellmann–Feynman forces on the free atoms during ionic relaxation was set to 2 × 10−2 eV Å−1. Brillouin zone sampling was performed using the Monkhorst-Pack method, with the grid chosen such that k × a > 30, where k is the number of k-points along a lattice direction and a is the corresponding lattice constant42. Transition state searches were conducted using the Climbing Image-Nudged Elastic Band (CI-NEB) method43 and the dimer method44, with transition states validated by frequency analysis to ensure exactly one imaginary frequency.
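As an illustration of the k-point rule, the snippet below is a minimal sketch of our reading of the criterion (not the authors’ script): it picks the smallest Monkhorst-Pack subdivisions satisfying k × a > 30 along each in-plane lattice vector, with a single k-point assumed along the vacuum direction for slabs.

```python
import math

def monkhorst_pack_grid(a, b, c, slab=True, density=30.0):
    """Smallest subdivisions with k_i * a_i strictly greater than `density` (Angstrom)."""
    ka = math.floor(density / a) + 1          # floor + 1 enforces k * a > 30
    kb = math.floor(density / b) + 1
    kc = 1 if slab else math.floor(density / c) + 1   # Gamma-only along vacuum (assumption)
    return ka, kb, kc

# Example: a 2x2 surface cell with a = b = 7.2 A and roughly 15 A of vacuum
print(monkhorst_pack_grid(7.2, 7.2, 25.0))    # -> (5, 5, 1)
```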

Models of the various transition metals in their most stable phases were obtained from the Materials Project (MP) database34 and optimized using DFT. Candidate surfaces were generated by slicing these models and replacing one substrate metal atom in the topmost surface layer with a dopant atom. For each surface model, a vacuum layer of 15 Å was used. The lattice dimensions a and b were kept greater than 6 Å through supercell expansion, and enough layers were included to give a slab thickness greater than 5 Å. During surface optimization, the top two layers were fully relaxed, while the remaining lower layers of metal atoms were kept fixed at their bulk equilibrium positions.
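As a hedged illustration of this construction (a sketch, not the authors’ exact script: the fcc Cu host, Ni dopant and lattice constant are placeholders, and the “central top-layer atom” rule is our reading of the replacement-site selection), such slabs can be built with Pymatgen as follows:

```python
from pymatgen.core import Lattice, Structure
from pymatgen.core.surface import SlabGenerator

# Placeholder host: fcc Cu (in this work, bulk structures come from the MP database).
bulk = Structure.from_spacegroup("Fm-3m", Lattice.cubic(3.62), ["Cu"], [[0, 0, 0]])

# (111) slab with > 5 A of metal and a 15 A vacuum layer, as stated above.
gen = SlabGenerator(bulk, miller_index=(1, 1, 1),
                    min_slab_size=5.0, min_vacuum_size=15.0, center_slab=True)
slab = gen.get_slab()
slab.make_supercell([2, 2, 1])          # expand so that a, b > 6 A

# Replace the top-layer atom nearest the in-plane cell centre with the dopant.
zmax = max(site.frac_coords[2] for site in slab)
top = [i for i, s in enumerate(slab) if abs(s.frac_coords[2] - zmax) < 1e-2]
core = min(top, key=lambda i: (slab[i].frac_coords[0] - 0.5) ** 2
                            + (slab[i].frac_coords[1] - 0.5) ** 2)
slab.replace(core, "Ni")                # e.g. a Ni@Cu(111) single-atom alloy
```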

Machine learning

Feature engineering

An automated methodology was developed to select and refine descriptors for single-atom alloy catalysts. The descriptors fall into three main groups: those characterizing elemental properties and the substrate surface, those describing the coordination environment of the substrate surface, and combined descriptors constructed from elemental properties and coordination numbers based on domain expertise. Elemental and surface property descriptors were extracted from the MP database and Pymatgen34,35. Information on the metal surface structure, including the coordination numbers of the single atoms in the uppermost layer, was derived through automated surface analysis with Pymatgen35. These included the coordination number within a top 0.2 radius (CN-B3 + 1-top02), a top 0.5 radius (CN-B3 + 1-top05), a top 1 radius (CN-B3 + 1-top), and the total coordination number (CN-B3 + 1). The BrunnerNN_real and MinimumDistanceNN methods were employed to assess coordination numbers, and descriptors quantifying the coordination number within various radii atop the metal surface were formulated. Because the active center depends not only on the single-atom metal but also on the host metal atoms coordinated to it, the total electronegativity, number of d-electrons and d-band center of the single-atom metal and its coordinated host atoms were calculated using coordination-number weighting. These combined descriptors integrate the attributes of the single-atom metal with those of its host. After discarding descriptors with incomplete data, a set of 85 feature descriptors was compiled, all generated automatically without further theoretical computations.
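Continuing the construction sketch above, the snippet below illustrates how such coordination numbers and a combined descriptor might be computed with Pymatgen. The simple CN-weighted average is our assumption about the weighting scheme, and the property values are placeholders.

```python
from pymatgen.analysis.local_env import BrunnerNN_real, MinimumDistanceNN

# `slab` and `core` come from the slab-construction sketch above.
cn_brunner = BrunnerNN_real().get_cn(slab, core)      # coordination of the single atom
cn_mindist = MinimumDistanceNN().get_cn(slab, core)

# A combined (D3-type) descriptor in the spirit of com_top_electronegativity:
# a CN-weighted average over the dopant and its coordinated host atoms
# (assumed scheme; placeholder Pauling electronegativities for Ni and Cu).
chi_dopant, chi_host = 1.91, 1.90
com_top_en = (chi_dopant + cn_mindist * chi_host) / (1 + cn_mindist)
```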

To evaluate the efficacy of feature sets, the analysis utilized the Pearson correlation coefficient (p), calculated as outlined in formula (1):

$$p=\frac{{\sum }_{i}({f}_{i}-\bar{f})({F}_{i}-\bar{F})}{\sqrt{\sum _{i}{({f}_{i}-\bar{f})}^{2}}\sqrt{{\sum }_{i}{({F}_{i}-\bar{F})}^{2}}}$$
(1)

Here, fi and Fi represent the features being compared, with \(\bar{f}\) and \(\bar{F}\) their respective means. The p value ranges from −1.0 to 1.0, where a higher absolute value indicates a stronger correlation.
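Formula (1) is the standard sample correlation; as a quick check, it matches NumPy’s built-in implementation:

```python
import numpy as np

f = np.array([0.8, 1.2, 1.5, 0.9, 1.1])   # two example feature columns
F = np.array([0.7, 1.3, 1.6, 1.0, 1.0])

# Eq. (1): covariance of the centred features over the product of their norms
p = np.sum((f - f.mean()) * (F - F.mean())) / (
    np.sqrt(np.sum((f - f.mean()) ** 2)) * np.sqrt(np.sum((F - F.mean()) ** 2)))
assert np.isclose(p, np.corrcoef(f, F)[0, 1])
```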

The analysis of the Pearson correlation coefficients for the 85 descriptors revealed high correlations among some of them, indicating redundant information. To address this, two methods were applied: Mutual Information Regression (MIC) and Recursive Feature Elimination with Cross-Validation (RFECV)45, both facilitated by the Scikit-Learn package46. These methods aided in identifying and eliminating highly correlated features based on their importance rankings. After screening, the Pearson correlation coefficient between any two retained descriptors was kept below 0.8 (ref. 47): whenever the coefficient between two features exceeded this threshold, the feature carrying less mutual information with the target was discarded.
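A minimal sketch of this pruning rule, as we read it (using Scikit-Learn’s mutual_info_regression; the authors’ exact implementation may differ):

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

def prune_correlated(X: pd.DataFrame, y, threshold=0.8):
    """Of each feature pair with |Pearson| > threshold, drop the one
    carrying less mutual information with the target y."""
    mi = dict(zip(X.columns, mutual_info_regression(X, y, random_state=0)))
    corr = X.corr().abs()
    keep = list(X.columns)
    for i, a in enumerate(X.columns):
        for b in X.columns[i + 1:]:
            if a in keep and b in keep and corr.loc[a, b] > threshold:
                keep.remove(a if mi[a] < mi[b] else b)   # discard the low-MI feature
    return X[keep]
```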

Furthermore, mutual information is defined in Eq. (2):

$$I(X;Y)={D}_{KL}\left({P}_{(X,Y)}\parallel {P}_{X}\otimes {P}_{Y}\right)$$
(2)

It measures the dependency between a pair of random variables (X, Y). Here, P(X,Y) denotes their joint distribution, PX and PY their respective marginal distributions, and DKL the Kullback-Leibler divergence, a measure of how one probability distribution diverges from a second, reference distribution.

Model selection

In this research, we utilized a diverse array of ML techniques to assess and compare their prediction capabilities. We employed nine different ML classification algorithms: Multilayer Perceptron Classifier (MLPC), Gradient Boosting Classifier (GBC), Random Forest Classifier (RFC), Extra Tree Classifier (ETC), Decision Tree Classifier (DTC), K-Neighbors Classifier (KNC), Linear Support Vector Classifier (LSVC), Ridge Classifier (RidgeC), and ETC + KNC. In addition to these classification methods, we applied fifteen ML regression algorithms: Multilayer Perceptron Regression (MLP), Extreme Gradient Boosting Regression (XGB), Gradient Boosting Regression (GBDT), Random Forest Regression (RF), Extra Tree Regression (ETR), Adaptive Boost Regression (ADAB), Linear Regression (LR), Ridge Regression (RIDGE), Lasso Regression (LAS), Bayesian Ridge Regression (BAY), Bayesian ARD Regression (ARD), Gaussian Process Regression (GPR), Linear Support Vector Regression (LSVR), Support Vector Regression (SVR), and K-Neighbor Regression (KNN). Except for XGB, all other algorithms were implemented using the Scikit-Learn package46.

To optimize the performance of these algorithms, hyperparameter optimization was conducted through a grid search on a randomly sampled training set. Various combinations of hyperparameters, defined within a specific search space for each algorithm, were explored to identify the most effective settings. The objective was to determine the hyperparameters that produced the best results on held-out data from the training set, thereby enhancing the predictive accuracy of the models by systematically testing a wide range of configurations.
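A hedged sketch of this step for the GBDT model (the search space below is hypothetical, not the authors’; random placeholder data stand in for the descriptor matrix and DFT barriers):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X_train = rng.random((500, 20))     # placeholder: 500 samples x 20 descriptors
y_train = rng.random(500)           # placeholder barriers (eV)

param_grid = {                      # hypothetical search space
    "n_estimators": [100, 300],
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 5],
}
search = GridSearchCV(GradientBoostingRegressor(random_state=0),
                      param_grid, cv=5, scoring="neg_mean_absolute_error")
search.fit(X_train, y_train)
best_gbdt = search.best_estimator_
```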

Once the optimal hyperparameters were identified, we assessed the trained ML models’ performance, including predictive accuracy and generalization capability. The models were trained using random subsets constituting 80% of the dataset, termed the training set, with the remaining 20% serving as the test set for validating model predictions. Because prediction accuracy depends on the particular division into training and test sets, the training phase was repeated 1000 times for each ML regression algorithm to minimize the effect of this randomness.
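This protocol amounts to the following loop (a sketch with placeholder data; only the 80/20 split and the 1000 repetitions follow the text):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = rng.random((689, 20)), rng.random(689)   # placeholder descriptors/barriers

scores = []
for seed in range(1000):                        # 1000 random 80/20 splits
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    pred = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr).predict(X_te)
    scores.append((r2_score(y_te, pred),
                   np.sqrt(mean_squared_error(y_te, pred)),   # RMSE
                   mean_absolute_error(y_te, pred)))

mean_r2, mean_rmse, mean_mae = np.mean(scores, axis=0)
# The min/max over `scores` give the fluctuation benchmarks discussed below.
```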

The effectiveness of each ML regression model was quantified using three statistical measures: the coefficient of determination (r²), mean absolute error (MAE), and root mean square error (RMSE). These metrics were calculated to evaluate the models’ prediction errors. Furthermore, the mean, minimum, and maximum values of these indices were employed as benchmarks to assess performance comprehensively. The goal was to identify a model that consistently performs well and exhibits minimal fluctuations across a substantial number of tests, ensuring robustness and reliability in predictive scenarios.

The r2, RMSE, and MAE are defined as:

$${r}^{2}=1-\frac{\sum _{i}{({Y}_{i}-{y}_{i})}^{2}}{\sum _{i}{({Y}_{i}-\bar{Y})}^{2}}$$
(3)
$${RMSE}=\sqrt{\frac{1}{{\rm{N}}}\mathop{\sum }\limits_{i=1}^{N}{({Y}_{i}-{y}_{i})}^{2}}$$
(4)
$${MAE}=\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}|{Y}_{i}-{y}_{i}|$$
(5)

where Yi signifies the values obtained from DFT calculations, yi represents the predictions generated by the machine learning models, and Y̅ represents the average of all DFT values. For an effective model, the coefficient of determination, denoted as r2, should ideally approach 1, indicating that the model accounts for most of the variance in the observed data. Additionally, lower values of both RMSE and MAE, ideally approaching zero, reflect higher precision and accuracy of the model’s predictions.
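Eqs. (3)-(5) translate directly into NumPy:

```python
import numpy as np

def r2(Y, y):    # Eq. (3); Y: DFT values, y: ML predictions
    return 1 - np.sum((Y - y) ** 2) / np.sum((Y - np.mean(Y)) ** 2)

def rmse(Y, y):  # Eq. (4)
    return np.sqrt(np.mean((Y - y) ** 2))

def mae(Y, y):   # Eq. (5)
    return np.mean(np.abs(Y - y))
```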

For classification purposes, it was essential to account for the possibility of multiple reaction sites on a single SAA surface. Consequently, a multi-label classification approach was implemented to accommodate multiple potential reactions per SAA. In evaluating the ML model, the recall rate was prioritized as the primary metric to minimize any oversight, setting aside the precision measure. The Extra Trees Classifier (ETC) and K-Neighbors Classifier (KNC) emerged as the top models based on recall performance and were selected for the prediction tasks. The predictions made by these two models were aggregated to further reduce the risk of missing any reactions. The classification outcomes from these nine models on the test set are detailed in Supplementary Table 1.
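A hedged sketch of this multi-label setup and the aggregation of the two selected classifiers (site labels and data shapes are illustrative; taking the element-wise union of the predictions favours recall, as described):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.random((689, 20))                    # placeholder descriptors
Y = rng.integers(0, 2, size=(689, 3))        # one binary column per site type

etc = MultiOutputClassifier(ExtraTreesClassifier(random_state=0)).fit(X, Y)
knc = MultiOutputClassifier(KNeighborsClassifier()).fit(X, Y)

X_new = rng.random((10, 20))                 # SAAs in the prediction set
sites = np.maximum(etc.predict(X_new), knc.predict(X_new))   # union of predictions
```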

The recall score is defined as follows:

$${Recall}=\frac{{TP}}{{TP}+{FN}}$$
(6)

where TP is the number of true positives (correct positive predictions) and FN is the number of false negatives (positive cases that were missed).

Data Records

The structure files obtained from the DFT calculations can be accessed via Figshare48 at https://figshare.com/articles/dataset/SAA_C-H_/25836550. The DFT data utilized as the training set can be accessed at https://doi.org/10.6084/m9.figshare.27645831 (ref. 49). The descriptors used for ML and the resulting data can be downloaded at https://doi.org/10.6084/m9.figshare.26200007 (ref. 50). The scripts used to download data from the MP and process it using Pymatgen are available on Zenodo at https://zenodo.org/records/8133294 (ref. 51).

The dataset, organized in a CSV spreadsheet, includes 10,950 entries detailing descriptors and predictions of the C-H dissociation energy barriers on SAA surfaces28. Each row represents a different chemical composition, while each column denotes a property of that composition. Specifically, 8,638 entries focus on top sites, which are predicted by ML and tested.
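For example, the top-site entries can be pulled from the spreadsheet with pandas (the filename and site-column name below are placeholders; the actual names follow the Figshare record and Supplementary Table 2):

```python
import pandas as pd

df = pd.read_csv("SAA_C-H_barriers.csv")     # placeholder filename

# Site-indicator columns hold 1/0 flags for each predicted reaction site;
# filtering on the top-site flag should leave the 8,638 predicted entries.
top_sites = df[df["top"] == 1]
print(len(top_sites))
```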

The initial five columns of the dataset give each system’s base material, reaction surface, surface proportion, dopant atom, and the space group of the base material. These fundamental properties, essential for understanding the reaction context, are sourced from the MP database34 and are labelled F (for fundamental) in Supplementary Table 2. Note that these properties are not involved in training but are used to distinguish reaction systems.

Following these fundamental details, the dataset includes 85 columns describing various properties of the systems, derived from MP or computed using Pymatgen. These properties are categorized in the Methods section and in Supplementary Table 2 as D1, D2 and D3: descriptors characterizing elemental properties and the substrate surface are categorized as D1, those describing the coordination environment of the substrate surface as D2, and the combined descriptors built from elemental properties and coordination numbers based on domain expertise as D3. Most descriptors are named after the properties they describe and are self-explanatory; detailed descriptions can be found in Supplementary Table 2.

The dataset also includes columns indicating the adsorption sites based on ML predictions of the reaction pathways, marked ‘1’ for presence and ‘0’ for absence at each site. Finally, the predictive outcomes, such as the energy barriers, are provided. These outcomes, along with the reaction sites, are categorized as Outcome (O) in Supplementary Table 2, representing the results of the machine learning analyses.

The DFT test set is presented in Supplementary Table 3. The host surfaces and dopants define each catalytic system, and the C-H dissociation energy barriers obtained from DFT and predicted by ML are listed side by side for comparison.

Technical Validation

Comparison to DFT calculations

This work reports C-H dissociation energy barriers generated by ML. DFT is widely used in transition-state calculations because of its high accuracy and reproducibility; we therefore calculated the C-H dissociation energy barriers on various SAA surfaces with DFT and used them as the test set, excluding any data contained in the training set. The composition of the training, prediction and test sets is shown in Fig. 3 and detailed in Supplementary Table 3. Points falling on the reference line (y = x) indicate perfect agreement between the barriers obtained by the two methods. The r2 and RMSE were 0.881 and 0.145 eV, showing high consistency.

Fig. 3

Comparison of C-H dissociation energy barriers obtained from the DFT and ML methods. The test set, containing 66 energy barriers, is compared with the machine learning predictions. The r2 and RMSE were 0.881 and 0.145 eV, respectively.

The comparison of C-H dissociation barriers obtained by ML and DFT is shown in Fig. 4. DFT results are shown as yellow and green dots, and ML predictions as light blue dots; darker blue regions arise where light blue dots accumulate, reflecting many predictions on the same host surface. The training set was obtained by DFT first, and the prediction set was determined subsequently. The test set was randomly selected from SAAs that appear in the prediction set but not in the training set; SAAs represented by blue dots far from the yellow dots were more likely to be selected for testing. The dots representing DFT results cover the whole plotted area, showing that the sampling is reasonable and the samples are sufficiently representative.

Fig. 4

Dot plot of the training set, prediction set and test set. The metal labels in the figure indicate the host metals.

We also compared our database with previous studies, which investigated C-H dissociation barriers mainly through first-principles calculations. As shown in Table 1, our predicted barriers show strong agreement with these reports.

Table 1 The comparison of predicted and reported C-H dissociation energy barriers.