Introduction

Novel, functional structures at the nanoscale could be crucial for transforming a broad spectrum of economically significant processes into greener and more sustainable solutions. For instance, nanostructured materials hold the potential to significantly enhance the cost-effectiveness of fuel-cell devices1, enable the creation of highly efficient quantum-dot LEDs2, and pave the way for generating atom-precise efficient nanocatalysts for studying novel catalytic pathways in electrochemical applications3,4.

Because performance depends strongly on specific structural characteristics that often cannot easily be resolved in lab experiments, computational chemistry, most often through Density Functional Theory (DFT) based approaches, can be used to generate in-silico insights. Typical questions range from elucidating which feature of a given nanoparticle might improve catalytic performance to mechanistic explanations for key synthesis procedures, allowing tailored experiments to drive up experimental yields for optimal structures.

Commonly, these questions are associated with finding energetically favorable configurations for the potential energy surface (PES) of a system, which is a property relevant to solving a wide range of problems in computational chemistry.

The established methodology allows finding “docking” mechanisms between small molecules and large biomolecules, which is relevant for drug development5. Additionally, a large area of research revolves around the sensing of harmful gases by novel nanomaterials chosen according to their strength of interactions. In the field of catalysis, identifying and maximizing relevant catalytic sites with the lowest number of atoms results in highly efficient and financially viable catalysts6.

For the latter, the primary approach starts by finding symmetry points of a surface crystal plane and sampling a small area of the configuration space around them7 to find a minimal energy configuration. All system constituents are usually kept rigid in this approach, and only the final configurational pick is subject to a relaxation procedure and further study. Over time, several physics-based optimization techniques have evolved to allow for a more elaborate study of the PES, moving beyond simple sampling of enumerated configurations: the Nudged Elastic Band (NEB) method allows the description of catalysis pathways8, Minima-Hopping and genetic algorithms allow new stable material conformations to be explored9,10, and metadynamics can be used to examine the thermodynamics of a given system11,12. Despite these methodological improvements, the high number of necessary energy evaluations has relegated them to niche and specialist applications. At the same time, the original sampling-based workflows for configurational search are still in everyday use and can be performed in a highly automated fashion13,14. However, when applied to novel systems incorporating irregularly shaped nanostructures or more complex adsorbents, exhaustive sampling quickly reaches the computational limits of the average project. Here, data-driven machine learning techniques such as neural networks have been widely used15 and could be used to describe the PES and subsequently allow for more efficient exploration. While early research showed that this approach could successfully reproduce even complex PES16,17, practical application requires that this can be achieved without exhaustively sampling the PES of the system under examination for data generation.

One approach is using a large, pre-trained model describing the system and using that to significantly speed up an exhaustive sampling and relaxation procedure, as has been done to model adsorption dynamics18,19. However, this relies on the availability of a large-scale dataset to create a model with suitable generalization power towards the research problem. Without such prerequisites, one can also iteratively build up an uncertainty-quantifying model for the PES of a specific system using techniques such as active learning (AL) or Bayesian optimization (BO). The “classic” approach to this problem would be a Gaussian-process (GP) based regression model for the PES. The inputs would be features extracted from the atomistic structure20,21, with newer research suggesting the usage of end-to-end-trained graph-neural-network models22. These kinds of machine-learning models have been used to find stable crystal structures23,24,25, learn adsorbate-surface PES26 and have been integrated into techniques like NEB27,28 or minima-hopping29 resulting in significant computational speedup.

The basic problem of finding energetically minimal configurations for a variety of systems using BO has been extensively studied under the BOSS-handle using problem-specific features and Gaussian process-based PES surrogates. Starting by demonstrating that energetically minimal molecular structures can be found by optimizing a model based on dihedral angles30, the method was further demonstrated for adsorption problems by finding the optimal placement of large, rigid molecules on inorganic surfaces31,32 and has been refined to determine optimal adsorption structures for partly-flexible molecules on gold-clusters33. These approaches, however, rely on hand-crafted feature extraction functions, which can restrict the potential expressibility of the model and require a high level of prior physical understanding that might not be present in more complex systems. Further, this approach is not transferable, as it relies on a fixed, molecule-dependent size of input features.

Our method encodes the cluster-adsorbate system as a graph. This is processed by a state-of-the-art Graph Neural Network (GNN), which makes the representation symmetry, rotation, and translation invariant. Because the node features are pooled, the GNN output has the same dimensionality regardless of the molecule used as input. The output of the GNN is used as input to a Gaussian Process (GP) model, which creates the final predictions along with a quantification of their uncertainty. Finally, a Bayesian Optimization (BO) scheme produces suggestions for optimal adsorption sites.
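
The invariance and fixed-size properties described above can be illustrated with a deliberately simplified stand-in for the GNN. The sketch below is not the AUGUR architecture: the function name `embed`, the radial-basis edge features, the cutoff, and the single message-passing step are all assumptions chosen for illustration. It only shows why distance-based edge attributes plus pooling give an embedding that is invariant to rotation, translation, and atom ordering, with a length independent of the number of atoms.

```python
import numpy as np

def embed(positions, atomic_numbers, cutoff=3.0, n_rbf=8):
    """Toy invariant embedding: one distance-weighted message pass, then mean pooling.

    Hypothetical illustration only; AUGUR's actual GNN is not reproduced here.
    """
    pos = np.asarray(positions, dtype=float)
    z = np.asarray(atomic_numbers, dtype=float)
    # Edge attributes depend only on interatomic distances, so they do not
    # change when the whole system is rotated or translated.
    dist = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    centers = np.linspace(0.5, cutoff, n_rbf)
    rbf = np.exp(-((dist[..., None] - centers) ** 2) / 0.5)  # shape (N, N, n_rbf)
    neighbours = (dist > 0) & (dist < cutoff)                 # exclude self-pairs
    rbf = rbf * neighbours[..., None]
    # Message passing: each atom sums neighbour atomic numbers weighted by the edge features.
    messages = np.einsum("ijk,j->ik", rbf, z)                 # shape (N, n_rbf)
    node_features = np.concatenate([z[:, None], messages], axis=1)
    # Pooling over atoms removes the dependence on atom count and ordering.
    return node_features.mean(axis=0)                         # shape (1 + n_rbf,)

# Toy three-atom system (coordinates in angstrom are made up for the example).
pos = np.array([[0.0, 0.0, 0.0], [1.1, 0.0, 0.0], [0.0, 1.2, 0.3]])
z = np.array([8, 6, 30])
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                [np.sin(theta),  np.cos(theta), 0.0],
                [0.0, 0.0, 1.0]])
same = np.allclose(embed(pos, z), embed(pos @ rot.T + [1.0, 2.0, 3.0], z))
print("rotation/translation invariant:", same)
```

The resulting fixed-length vector is the kind of input a GP can consume, whatever the size of the adsorbate.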

We demonstrate the robustness of the AUGUR framework on four systems. The first is the family of the Chini clusters, which have been exhaustively studied as a precursor to generating atom-precise nanoclusters using a Metal-Organic Framework as a template in our previous study3. The atom-precise nanoclusters are crucial in developing the next generation of highly tailored, efficient, and economically viable nanocatalysts. The DFT simulations of Zn2+ - Chini cluster interactions found in the literature allow us to know, a priori, the energetic behavior of all chemically distinct sites. This, in turn, allows the validation of our algorithm. Second, we selected ZnO clusters, which are well known experimentally and are used by semiconductor companies for chips, sensors, and electrodes due to their observed high catalytic activity and stability34. In our study, the focus lies on the gas adsorption capability of the (ZnO)78 cluster, as it is one of the “magic clusters”35. However, due to the large size of the cluster, the conventional approach of exhaustive DFT investigation, covering the entire surface of the cluster, is computationally infeasible. The third and fourth use cases involve extended 2D surfaces with single and multiple defects, respectively. The third case investigates a hydrogen-deficient Covalent Organic Framework (COF), TpBD-FF, whose reactivity arises from this structural deficiency. Examining its interactions with Zn ions from the electrolyte provides valuable insights into its potential as an anode coating material in zinc-ion batteries36. Our fourth case study explores a multi-atom substituted Silicene, a 2D allotrope of Silicon whose enhanced electronic structure properties make it a promising candidate for many catalytic applications37,38. In this work, we specifically focus on formaldehyde adsorption on the defective Silicene layer.

The AUGUR pipeline has proven itself robust, transferable, and highly efficient in all case studies, identifying the optimal sites with approximately ten simulation runs and without the need for hand-crafted features.

Results

To show the generalizability of the method, we investigate two different adsorption scenarios, featuring both atomic and molecular adsorbates, along with four nanosystem adsorbents of varying complexity. The adsorbate distance was chosen based on an explicit literature survey of the optimal interaction distance between different atomic species. Alternatively, AUGUR offers the functionality of determining these distances as the sum of the van der Waals radii. This is a less efficient approach, as it places adsorbates further from the cluster/surface than necessary, increasing DFT times. It does, however, add another level of abstraction, allowing AUGUR to be used in an even more problem-agnostic manner.
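
As a rough illustration of the van der Waals fallback, the sketch below sums tabulated radii for an adsorbate/surface element pair. The radii are the commonly quoted Bondi values; whether AUGUR uses this particular tabulation is an assumption, and the names `VDW_RADII` and `default_distance` are hypothetical.

```python
# Bondi van der Waals radii in angstrom (a commonly used tabulation;
# the table actually used by AUGUR may differ).
VDW_RADII = {
    "H": 1.20, "C": 1.70, "N": 1.55, "O": 1.52,
    "F": 1.47, "Si": 2.10, "P": 1.80, "Zn": 1.39,
}

def default_distance(adsorbate_element, surface_element, radii=VDW_RADII):
    """Fallback placement distance: sum of the two van der Waals radii."""
    return radii[adsorbate_element] + radii[surface_element]

# Zn placed near an O atom: 1.39 + 1.52 = 2.91 angstrom.
print(round(default_distance("Zn", "O"), 2))
```

For Zn near O this gives 2.91 Å, which lies above the 1.6–2.4 Å literature range used later for the COF case, consistent with the remark that the fallback places adsorbates further away than necessary.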

Cluster adsorption optimization

Chini clusters

Chini clusters are a unique class of organometallic compounds known for their unusual bonding and electronic properties. These clusters serve as precursors for synthesizing atomically precise nanoclusters, with their model structures illustrated in Fig. 1. Understanding the interaction between Chini clusters and Zn2+ ions is crucial for unraveling the encapsulation process of platinum nanoparticles (Pt NPs) within the Zn-based ZIF-8 Metal-Organic Framework (MOF)3 template. The precise understanding and control of this interaction will aid the creation of highly efficient and specialized electrocatalysts that minimize the usage of rare earth elements.

Fig. 1: Case Study 1: Chini clusters \({[{{\rm{Pt}}}_{3{\rm{n}}}{({\rm{CO}})}_{6{\rm{n}}}]}^{2-}\) (n = 1–3); These nanoclusters consist of three Pt-Pt bonds forming a triangle, laterally protected with CO ligands.

As n increases, layers are progressively added, as seen in (a–c). The color scheme used is as follows: Pt (black); O (red); C (light grey).

For generating suggestions for the optimal adsorption positions of Zn ions on the Chini clusters, point generation was guided by the following physical principles: The Zn atom was positioned on the surface of a sphere centered on either Pt or O atoms. Carbon atoms were excluded as potential adsorption sites since they are unavailable for interaction on the cluster’s surface. A mixture of initial simulations and physical intuition determined the sphere’s radius. It must be noted that this constraint can err on the high side, as the only effect will be an increase in the computational time, with the final energy remaining constant. After evaluating these points, our pipeline suggested the optimal adsorption sites, which were then simulated using DFT.
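
The sphere-based point generation described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the toy coordinates, the 2.5 Å radius, and the function names are placeholders, since the text does not give the radius that was actually used.

```python
import math
import random

def random_point_on_sphere(center, radius, rng):
    """Uniform point on a sphere: normalise a 3-D Gaussian vector."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 1e-12:
            break
    return tuple(c + radius * x / norm for c, x in zip(center, v))

def generate_candidates(atoms, radius, n_points, allowed=("Pt", "O"), seed=0):
    """Place candidate adsorbate positions around allowed host atoms only.

    `atoms` is a list of (element, (x, y, z)); carbon is skipped because it is
    not available for interaction on the cluster surface.
    """
    rng = random.Random(seed)
    hosts = [(el, xyz) for el, xyz in atoms if el in allowed]
    candidates = []
    for _ in range(n_points):
        element, xyz = rng.choice(hosts)
        candidates.append((element, xyz, random_point_on_sphere(xyz, radius, rng)))
    return candidates

# Toy fragment (made-up coordinates in angstrom, not a real Chini geometry).
toy_atoms = [("Pt", (0.0, 0.0, 0.0)), ("C", (1.9, 0.0, 0.0)), ("O", (3.0, 0.0, 0.0))]
points = generate_candidates(toy_atoms, radius=2.5, n_points=5)
```

Each candidate records the host element and position, so the generated point can later be filtered or mapped back to a chemically distinct site.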

The interaction energies at these suggested sites were compared with those obtained from Monte Carlo (MC) sampling. As shown in Table 1, the lowest interaction energy predicted by AUGUR is consistently lower than that predicted by the MC approach, a trend that becomes more pronounced with increasing cluster size, from an improvement of 8.73% in Pt3 to 142.62% in Pt9. This demonstrates that our framework is more robust in finding the global minimum than the random MC sampling approach, ultimately saving both time and resources.

Table 1 Lowest interaction energies for Pt3, Pt6, and Pt9 nanoclusters as determined by the Monte Carlo sampling approach and their comparison with the AUGUR pipeline

In addition to the optimization process, the high level of symmetry allows for fairly accurate energy predictions. This was evaluated using Leave One Out (LOO) cross-validation. Although the uncertainty was relatively high due to the limited number of points required, the Mean Square Error (MSE), as shown in Table 2, was seen to be remarkably low. A more detailed representation of the LOO can be seen in the SI.
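
Leave-one-out cross-validation of a GP surrogate can be sketched as below. The example is a hedged illustration: it uses a plain RBF-kernel GP on a toy one-dimensional function, whereas in AUGUR the GP inputs are GNN embeddings and the targets are DFT interaction energies. The helper names and hyperparameter values are assumptions.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between two sets of feature vectors."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * d2 / length_scale**2)

def gp_predict(x_train, y_train, x_test, length_scale=1.0, noise=1e-4):
    """Zero-mean GP posterior mean and standard deviation."""
    k = rbf_kernel(x_train, x_train, length_scale) + noise * np.eye(len(x_train))
    k_s = rbf_kernel(x_test, x_train, length_scale)
    mean = k_s @ np.linalg.solve(k, y_train)
    var = 1.0 - np.sum(k_s * np.linalg.solve(k, k_s.T).T, axis=1)
    return mean, np.sqrt(np.clip(var, 0.0, None))

def loo_mse(x, y, **gp_kwargs):
    """Refit with each point held out in turn and average the squared errors."""
    errors = []
    for i in range(len(x)):
        keep = np.arange(len(x)) != i
        mean, _ = gp_predict(x[keep], y[keep], x[i:i + 1], **gp_kwargs)
        errors.append((mean[0] - y[i]) ** 2)
    return float(np.mean(errors))

# Toy data standing in for (embedding, interaction energy) pairs.
x_toy = np.linspace(0.0, 2.0 * np.pi, 12)[:, None]
y_toy = np.sin(x_toy[:, 0])
print("LOO MSE:", loo_mse(x_toy, y_toy))
```

With few training points the posterior standard deviation can remain large even when the held-out errors are small, which matches the behavior reported for the Chini clusters.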

Table 2 Mean Square Error (MSE) for models trained and tested on Pt3 and Pt6 clusters

Given the accuracy showcased in Table 2, our model can be leveraged to reconstruct the entire energy surface and quantify uncertainties in unexplored regions. This approach offers a more holistic understanding of the system as compared to the traditional trial-and-error sampling, which lacks an underlying model for further system interpretation.

Prior literature knowledge of the electronic structure of Chini clusters allows us to verify the results. These clusters have three distinct chemical sites: the “Top,” “Bridged,” and “Terminal”. Our study shows that the “Top” site is the most energetically favorable for Zn ion binding. This is due to the formation of strong Zn-Pt bonds at the exposed Pt atoms in the outer layer of the cluster, as concluded by our previous study3. Additionally, the “Bridged” and the “Terminal” positions show considerably weaker interaction energies due to the carbonyl ligands inhibiting direct Metal-Metal (M-M) bonding3. AUGUR successfully recreates these findings, accurately identifying the optimal and relative energetics of Zn-Chini cluster interactions.

In Fig. 2, the energy surface and optimal site for Pt3 are presented and verified with the predictions. Of note is that although the predictions are fairly accurate, the uncertainty is relatively high. This can be attributed to the low number of points, as well as to the different distances of Zn ion placement in the training and prediction sets. For a more detailed view of the energy surface and standard deviation, refer to the SI.

Fig. 2: Predicted energy surface for the Pt3-Zn cluster in [eV] (left).

The cluster figure depicts the most favorable adsorption position as predicted by AUGUR in the Pt3 cluster (middle up). Uncertainty quantification of the prediction (right). The color scheme used is as follows: Pt (black); O (red); C (light grey); Zn (Blue).

In Fig. 3, the energy surface and optimal site for Pt6 are presented. The uncertainty is equivalent to that in Fig. 2 despite the larger size. This is because Pt6 exhibits a strong symmetry, which reduces the number of unique sites to those of Pt3. For a more detailed visualization of the energy surface and standard deviation, refer to the SI.

Fig. 3: Predicted energy surface for the Pt6-Zn cluster in [eV] (left).

The cluster figure depicts the most favorable adsorption position as predicted by AUGUR in the Pt6 cluster (middle up). Uncertainty quantification of the prediction (right). The color scheme used is as follows: Pt (black); O (red); C (light grey); Zn (Blue).

The Pt9 cluster introduces two additional distinct sites due to the existence of the middle layer. These are the “Bridged Middle” and “Terminal Middle”. According to the prior analyses3, these sites are expected to be less energetically favorable but follow the same energy relationships as those in the upper and lower layers. The predicted energy surface is shown in Fig. 4, and a detailed view of the energy surface and standard deviation is provided in the SI. Of note is that this prediction was generated with only one simulation from Pt9 included in the training set, alongside Pt3 and Pt6. This finding is of potentially high importance, as it suggests that our model could be trained on comparatively inexpensive simulations supplemented by a few strategic points from the more computationally intensive cases to achieve physically consistent and time-efficient results. These models could then be deployed for the prediction and optimization of larger structures of the same family and related systems with higher nuclearity, which would otherwise be computationally infeasible to simulate.

Fig. 4: Predicted energy surface for the Pt9 - Zn cluster in [eV] (left).

The cluster figure depicts the most favorable adsorption position as predicted by AUGUR in the Pt9 cluster (middle up). Uncertainty quantification of the prediction (right). The color scheme used is as follows: Pt (black); O (red); C (light grey); Zn (blue).

Nevertheless, it is important to note that the Chini clusters system is relatively simple, and intuition alone would have converged to the optimal site in a reasonable number of iterations. Therefore, our pipeline was subsequently tested on a much more complex system, lacking obvious symmetries.

Zinc Oxide Cluster

Our second case study involves the Zinc oxide ((ZnO)78) cluster depicted in Fig. 5, with carbon monoxide (CO) as the adsorbate molecule. ZnO clusters are used experimentally and industrially in semiconductor companies for chips, sensors, and electrodes due to their observed high catalytic activity and stability34. Our study aims to aid the exploration of the gas adsorption potential of the large Zn oxide clusters, which would otherwise be very computationally expensive to pursue. The (ZnO)78 was mainly selected because it is one of the “magic clusters”39, yet remains relatively unexplored in the literature. Given the complexity and asymmetry of such an extensive system, manually identifying all the chemically distinct adsorption sites on the potential energy surface (PES) to begin the exploration is impractical. Through our investigation, we discovered two distinct sites in what appeared to be nearly identical locations. However, the adsorption energies varied significantly between these sites, with one being the optimal adsorption position and the other showing only half the absolute adsorption energy.

Fig. 5: Configuration of the Zinc Oxide Nanocluster (ZnO)78.

This figure depicts the relaxed structure of the bare (ZnO)78 cluster along (a) the x-axis and (b) the z-axis. No visible symmetries are seen in the whole cluster. The color scheme used is as follows: O (red); C (light grey); Zn (Blue).

The point generation in this study was done by positioning the carbon atom of the adsorbate on the surface of a sphere centered on a randomly selected atom within the cluster, with a radius of 1.40 Å. Subsequently, the oxygen atom of the adsorbate was placed on the surface of a sphere centered on the carbon atom, with a radius of 1.12 Å, corresponding to a triple bond between the C and O atoms of carbon monoxide. The placement was further constrained to maintain a minimum distance of 1.40 Å between the oxygen atom and the Zinc oxide cluster so that it does not coincide with the cluster atoms. Additionally, the carbon atom was positioned first as it shows a higher chemical affinity for the cluster. In this way, we ensure that our method does not preclude placements of the adsorbate where the O atom is closer to the cluster, but makes it less likely.
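
A minimal sketch of this two-step placement, using the radii quoted above and simple rejection sampling, is given below. The toy cluster coordinates and the function names are hypothetical, and only the stated constraint (O at least 1.40 Å from every cluster atom) is enforced; any further checks used in AUGUR are not reproduced.

```python
import numpy as np

def random_unit_vector(rng):
    """Uniformly distributed direction from a normalised Gaussian vector."""
    v = rng.normal(size=3)
    return v / np.linalg.norm(v)

def place_co(cluster_xyz, rng, r_c=1.40, r_co=1.12, min_o=1.40, max_tries=1000):
    """Place C on a sphere around a random cluster atom, then O on a sphere around C.

    A trial is rejected if O comes closer than `min_o` to any cluster atom.
    """
    cluster_xyz = np.asarray(cluster_xyz, dtype=float)
    for _ in range(max_tries):
        host = cluster_xyz[rng.integers(len(cluster_xyz))]
        c_pos = host + r_c * random_unit_vector(rng)
        o_pos = c_pos + r_co * random_unit_vector(rng)
        if np.linalg.norm(cluster_xyz - o_pos, axis=1).min() >= min_o:
            return c_pos, o_pos
    raise RuntimeError("no valid CO placement found")

# Toy four-atom "cluster" (made-up coordinates in angstrom).
toy_cluster = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0],
                        [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]])
c_pos, o_pos = place_co(toy_cluster, np.random.default_rng(0))
```

Because O is sampled on a full sphere around C and only filtered by the minimum distance, orientations with O nearer the cluster remain possible but are accepted less often.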

We present the comparison of energy optimization results obtained using both Monte Carlo (MC) sampling and the AUGUR framework in Table 3. In total, 19 samples were evaluated with the MC approach, compared to 13 with AUGUR. Despite the lower number of samples, the AUGUR framework consistently outperformed the MC method in both energy optimization and data efficiency. To ensure that the optimum identified by AUGUR is indeed the global minimum, approximately 70 additional simulations were conducted. After thoroughly investigating the structure, we are confident that the identified optimum is the global minimum.

Table 3 Lowest interaction energies and Mean Square Error (MSE) results of the (ZnO)78 clusters as determined by the Monte Carlo (MC) sampling approach and their comparison with the AUGUR pipeline

An additional advantage of this approach is its ability to generate data points that train an accurate model in the most data-efficient manner. To demonstrate this, we present the cross-validation results from both approaches in Table 3. Each AUGUR step involved two suggestions: one focused on extreme exploration (with a tradeoff of 100) and the other on extreme exploitation (with a tradeoff of 0.5). This method could be further refined by gradually decreasing the tradeoff and generating only one suggestion per step. However, the combined approach enabled rapid convergence, not in terms of the total number of simulations but in terms of the number of suggestion steps, by allowing the two extremes to be evaluated in parallel.
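
The two-suggestion step can be sketched with a confidence-bound acquisition function. The text does not name the acquisition function, so the lower confidence bound used here (predicted mean minus tradeoff times standard deviation, minimized because lower energy is better) is an assumption, as are the function names and the toy numbers.

```python
import numpy as np

def lower_confidence_bound(mean, std, tradeoff):
    """Assumed acquisition: small values are attractive for energy minimisation."""
    return mean - tradeoff * std

def suggest_pair(mean, std, explore=100.0, exploit=0.5):
    """Return the candidate indices picked by the exploratory and exploitative settings."""
    i_explore = int(np.argmin(lower_confidence_bound(mean, std, explore)))
    i_exploit = int(np.argmin(lower_confidence_bound(mean, std, exploit)))
    return i_explore, i_exploit

# Toy surrogate predictions (eV) for four candidate sites.
mean = np.array([-1.0, -2.0, -0.5, -1.5])
std = np.array([0.10, 0.05, 0.60, 0.20])
print(suggest_pair(mean, std))  # (2, 1)
```

With a tradeoff of 100 the most uncertain candidate wins regardless of its mean, while a tradeoff of 0.5 selects the candidate with the lowest predicted energy; both can then be simulated in parallel.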

In Fig. 6, we show the predicted energy surfaces and associated uncertainties of the investigated cluster, obtained from predictions made at 5000 points. Generating these predictions required approximately 150 seconds. By comparison, the average DFT simulation of one suggestion in this case study takes approximately two days, so evaluating the same 5000 points with DFT would amount to about 10,000 simulation-days, or roughly 27 years if run sequentially.

Fig. 6: Results for the energy surface of the (ZnO)78 cluster (left), the most favorable adsorption position (middle up), and the corresponding uncertainty (right).

For a more detailed view of the energy surface and standard deviation, refer to the SI. This case study, more than any other, highlights the advantages of the AUGUR framework. Our pipeline not only determines the best energy site at minimal data cost, but also gathers points efficiently for training the model, which can then be used to gain physical insight that would otherwise have taken years to achieve.

Surface adsorption optimization

In this section, we investigated 2D surface adsorption. The molecule’s placement was adjusted to prevent unphysical configurations, such as penetrating the surface, with periodic boundaries accounted for using the minimum image convention40. This periodic distance calculation was used both to calculate the edge attributes of the graphs and to ensure that the adsorbate was placed within the allowed distance of the surface. Geometric constraints that avoid unphysical overlaps or piercing of the surface were also implemented.
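
A minimum image distance for a slab that is periodic only in-plane can be sketched as follows. This is an illustrative implementation, not AUGUR's: the function name and the `pbc` flag layout are assumptions, and wrapping in fractional coordinates is exact for orthorhombic cells but can miss the nearest image in strongly skewed cells.

```python
import numpy as np

def minimum_image_distance(r_i, r_j, cell, pbc=(True, True, False)):
    """Distance between two points under the minimum image convention.

    `cell` holds the lattice vectors as rows; `pbc` marks the periodic directions
    (here x and y for a 2D surface with vacuum along z).
    """
    cell = np.asarray(cell, dtype=float)
    delta = np.asarray(r_j, dtype=float) - np.asarray(r_i, dtype=float)
    # Convert the displacement to fractional coordinates: delta = frac @ cell.
    frac = np.linalg.solve(cell.T, delta)
    periodic = np.asarray(pbc, dtype=bool)
    # Shift each periodic component into [-0.5, 0.5] to pick the nearest image.
    frac[periodic] -= np.round(frac[periodic])
    return float(np.linalg.norm(frac @ cell))

cell = np.diag([10.0, 10.0, 20.0])
# Two atoms on opposite edges of the cell are 1.0 angstrom apart through the boundary.
print(minimum_image_distance([0.5, 5.0, 3.0], [9.5, 5.0, 3.0], cell))
```

The same function can serve both purposes mentioned above: computing edge lengths for the graph and checking that a candidate adsorbate position respects the allowed distance to the surface.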

Defective COF surface

For our third case study, we examined the TpBD-FF Covalent Organic Framework (COF), a two-dimensional porous periodic system known for its use as a coating material for zinc-ion battery (ZIB) anodes. The structure of TpBD-FF is illustrated in Fig. 7a. TpBD-FF enhances cell efficiency by promoting the formation of well-defined ion transport channels36. Our goal is to understand this effect by studying the interaction of electrolytic Zn2+ ions with a single defective layer of TpBD-FF. We aim to identify the optimal adsorption site and its strength, which is essential for evaluating which specific functional groups in the COF layer influence its protective role in ZIB performance.

Fig. 7: Configuration of defective Covalent Organic Frameworks (COFs) (TpBD−FF).

This figure depicts the relaxed structure of the (a) bare TpBD-FF COF and (b) H-deficient TpBD-FF COF, viewed along the z-axis. The pink circle in (b) marks the position where the H-deficiency defect was introduced. The color scheme used is as follows: O (red); C (light grey); N (Blue); F (Fluorine); H (light pink).

To test the robustness of the AUGUR framework in defective periodic systems, we introduced a hydrogen deficiency defect, a commonly observed imperfection in COFs41. Specifically, a hydrogen atom was removed from one of the chemically reactive N-H bonds of TpBD-FF, as illustrated in Fig. 7b. Introducing such a defect not only mimics realistic conditions but also allows a more realistic assessment of the suitability of the TpBD-FF COF as an anode coating.

To systematically determine the optimal interacting site, Zn2+ ions were initially placed at three random positions on the TpBD-FF surface. The point generation process was performed similarly to the case study described in Section “Cluster adsorption optimization”, where a single-atom adsorbate was also considered. Specifically, Zn2+ ions were positioned at distances of 1.6–2.4 Å from all surface-available atomic species (H, O, and F) within TpBD-FF, ensuring sufficient sampling of the potential energy surface. This range was derived from experimental and computational distances reported in the literature (Zn-H42, Zn-O43, Zn-F44).

The energy optimization results comparing the Monte Carlo (MC) sampling approach and the AUGUR framework are summarized in Table 4. A total of 10 candidate configurations were evaluated using MC sampling, whereas the AUGUR framework converged in just 1 iteration. The optimal binding site found by AUGUR has an interaction energy of −8.00 eV, compared to −6.44 eV for the best MC sample. Despite the significantly reduced number of samples, AUGUR outperformed MC sampling in terms of both energy optimization and computational efficiency.
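
To put the two minima on a common scale, a relative improvement can be computed as in the short sketch below. The definition used (energy difference divided by the magnitude of the MC minimum) is an assumption; the text does not state how the percentages in Table 1 were calculated, so the numbers here are only an illustration based on the energies quoted in this section and in the Silicene case study.

```python
def relative_improvement(e_augur, e_mc):
    """Percentage by which the AUGUR minimum lies below the MC minimum, relative to |E_MC|."""
    return 100.0 * (e_mc - e_augur) / abs(e_mc)

cof = relative_improvement(-8.00, -6.44)       # TpBD-FF COF energies (eV)
silicene = relative_improvement(-2.19, -1.80)  # defective Silicene energies (eV)
print(f"COF: {cof:.2f} %, Silicene: {silicene:.2f} %")
```

Under this assumed definition the improvement is about 24% for the COF layer and about 22% for defective Silicene.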

Table 4 Lowest interaction energies and Mean Square Error (MSE) results of a single TpBD-FF COF layer as determined by the Monte Carlo (MC) sampling approach and their comparison with the AUGUR pipeline

To validate that the site identified by AUGUR corresponds to the global energy minimum, we exhaustively evaluated all unique adsorption sites on the TpBD-FF surface. This exhaustive search, as shown in Fig. 8, confirmed that AUGUR’s prediction aligns with the true global minimum, further demonstrating the robustness and reliability of the AUGUR framework in handling adsorption problems within complex 2D periodic systems. For a more detailed view of the energy surface and standard deviation, refer to the SI.

Fig. 8: Results for the energy surface of the TpBD-FF COF layer (left), the most favorable adsorption position (middle up), and the corresponding uncertainty (right). The orange circle pinpoints the site of the most stable interaction.

The color scheme used is as follows: O (red); C (light grey); N (Blue); F (Fluorine); H (light pink); Zn (Steel Blue).

Defective silicene

Our final case study focuses on Silicene, a 2D periodic material, visualized in Fig. 9a, that exemplifies the broad applicability and robustness of the AUGUR pipeline. Silicene features a buckled structure arising from the sp2-sp3 hybridization of silicon atoms. This structural corrugation endows Silicene with unique electronic properties and an active surface, making it a promising candidate for various chemical applications37,38. Additionally, its defect engineering has been extensively studied to enhance chemical reactivity for specific purposes45. These attributes render Silicene an ideal test case for validating the effectiveness of AUGUR in handling periodic systems, especially those with multiple defects.

Fig. 9: Configuration of Multi-point defective Silicene.

This figure depicts the relaxed structure of the (a) bare Silicene layer and (b) defective Silicene, viewed along the z-axis. Note: A higher number of defects ensures more distinct sites. The color scheme used is as follows: Si (Orange); B (Green); C (light grey); N (Blue); Al (Turquoise); P (Lavender).

To demonstrate AUGUR’s capability in such challenging use cases, we introduced five distinct point defects into the Silicene lattice, all of which have been previously characterized separately in the literature46,47,48,49,50. Specifically, we replaced five randomly selected silicon atoms with boron (B), aluminum (Al), carbon (C), nitrogen (N), and phosphorus (P). This defective Silicene structure was then employed as an adsorbent substrate to evaluate whether AUGUR could efficiently and accurately converge to the most favorable adsorption site for an incoming adsorbate molecule.

We selected formaldehyde (CH2O) as the adsorbate for two key reasons. First, formaldehyde is a well-known carcinogen, and studying its interaction with defective Silicene could offer insights into the potential use of Silicene as a sensor for atmospheric or biological formaldehyde detection51. Second, formaldehyde serves as a sufficiently complex test case, with its four constituent atoms allowing us to assess AUGUR’s efficacy in cases where higher-dimensional adsorbates are involved.

To address the complexity of the adsorption process, we implemented an enhanced placement strategy that positioned the entire adsorbate molecule in a single step, followed by a random rotational adjustment. This approach allowed for systematic exploration of adsorption configurations while maintaining computational efficiency. Furthermore, the placement methodology prioritized the aldehydic oxygen atom in formaldehyde, ensuring its alignment towards the defective Silicene surface, as it is the most reactive functional group in formaldehyde. To achieve this, the oxygen atom was initially positioned near a randomly selected Silicene surface atom at specific distances informed by the experimentally and computationally established bond lengths in the literature cited. The selected placement distance ranges were: O-Si (1.57–1.73 Å)52,53, O-B (1.28–1.46 Å)54,55, O-C (1.13–1.48 Å)56,57, O-N (1.1–1.44 Å)58, O-P (1.43–1.52 Å)59,60, and O-Al (1.68–2.1 Å)61,62. While these ranges were considered for completeness, placement was consistently performed at the lower bound of each range to ensure uniformity across configurations. Subsequently, the formaldehyde molecule was rotated randomly along its three principal axes. This rotation was carefully designed to preserve the proximity of the oxygen atom to the substrate, ensuring that it remained the closest point of interaction with the defective surface.
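
The rigid placement followed by a random rotation can be sketched as below. Several simplifications are assumptions made for illustration: the formaldehyde geometry is approximate, O is placed directly above the chosen site atom along z, and the "O remains closest" check is applied only against that single site atom rather than the whole surface. The lower-bound distances are the ones quoted above.

```python
import numpy as np

# Lower bounds of the O-X placement ranges quoted above (angstrom).
O_X_MIN = {"Si": 1.57, "B": 1.28, "C": 1.13, "N": 1.10, "P": 1.43, "Al": 1.68}

# Approximate planar formaldehyde geometry with O at the origin (illustrative values).
CH2O = {"O": [0.0, 0.0, 0.0], "C": [0.0, 0.0, 1.21],
        "H1": [0.94, 0.0, 1.80], "H2": [-0.94, 0.0, 1.80]}

def random_rotation(rng):
    """Random proper rotation matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))   # fix column signs so the factorisation is unique
    if np.linalg.det(q) < 0:      # flip one axis to obtain a rotation, not a reflection
        q[:, 0] = -q[:, 0]
    return q

def place_formaldehyde(site_xyz, site_element, rng, max_tries=1000):
    """Put O at the lower-bound distance from the site atom, then rotate the molecule about O."""
    site = np.asarray(site_xyz, dtype=float)
    o_pos = site + np.array([0.0, 0.0, O_X_MIN[site_element]])
    molecule = np.array(list(CH2O.values()), dtype=float)
    for _ in range(max_tries):
        rotated = molecule @ random_rotation(rng).T + o_pos  # O stays fixed at o_pos
        dists = np.linalg.norm(rotated - site, axis=1)
        if np.all(dists[1:] > dists[0]):                     # keep O as the closest atom
            return dict(zip(CH2O, rotated))
    raise RuntimeError("no valid orientation found")

placed = place_formaldehyde([0.0, 0.0, 0.0], "Si", np.random.default_rng(0))
```

Rotating about the oxygen atom leaves the O-site distance unchanged, so only the orientation of the C and H atoms is randomized.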

As can be seen from Table 5, AUGUR significantly outperformed the MC benchmark, with a lowest interaction energy of −2.19 eV compared to −1.80 eV with MC sampling.

Table 5 Lowest interaction energies and Mean Square Error (MSE) results of the multi-point defect Silicene Layer as determined by the Monte Carlo (MC) sampling approach and their comparison with the AUGUR pipeline

Lastly, the energy surface and the optimal adsorption position are presented in Fig. 10. For more details about the energy surface and standard deviation, refer to the SI.

Fig. 10: Results for the energy surface of the defective silicene surface (left), the most favorable adsorption position (middle up), and the corresponding uncertainty (right).

The blue circling pinpoints the site of the lowest interaction energy. The color scheme used is as follows: Si (Orange); B (Green); C (light grey); N (Blue); Al (Turquoise); P (Lavender); O (red).

Discussion

In summary, we have presented a novel deep kernel learning framework for automatically determining the optimal adsorption sites on molecules while minimizing the number of DFT simulations required. It seamlessly combines graph neural networks and Gaussian processes. In this way, we achieve rotation, symmetry, and translation invariance for the inputs of the Gaussian processes. Additionally, because of the properties of graphs, the same model can be used across different molecules regardless of their size. This alleviates the need for crude techniques such as padding or hand-crafted features that require a high level of physical intuition. Further, this allows for a significant dimensionality reduction of the inputs to the Gaussian processes, which famously scale poorly with increasing feature dimensions. This composite model enables us to fully investigate the energetic structure of the molecule and to selectively refine areas of high uncertainty if deemed necessary. The model is then used as the surrogate in a Bayesian Optimization scheme that iteratively refines both the model and the suggested sites for optimal adsorption until it converges to the global optimum.

We have shown that this algorithm performs robustly and efficiently, both determining the optimal sites and predicting the energy surface with a limited number of data points. The natural next step is to deploy this algorithm for cases of highly complex adsorbents and leverage AUGUR to further our chemical understanding of hitherto computationally prohibitive systems.

Methods

AUGUR pipeline

Our pipeline consists of four major components, also visualized in Fig. 11:

  • A graph neural network that is responsible for extracting a meaningful, rotation, translation, and symmetry invariant representation of the molecule. The output dimension of the graph is fixed regardless of the molecule’s size and, therefore, can be used across various molecules.

  • A Gaussian Process that receives the output of the graph neural network as input and predicts the system’s interaction energy along with the corresponding uncertainty in the prediction.

  • A Bayesian Optimizer that uses the above two models as a combined surrogate and generates suggestions for adsorption positions. These are simulated by the simulation model.

  • A Density Functional Theory model that provides a physically accurate but computationally expensive evaluation of the interaction energy of the suggestions.
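The interplay of the four components can be sketched as a generic loop. This is a minimal illustration under stated assumptions, not the AUGUR implementation: the function `optimization_loop` and its callables `fit`, `propose`, and `evaluate` are hypothetical names standing in for surrogate training (GNN plus GP), the Bayesian Optimizer, and the DFT evaluation, respectively, and the toy functions below replace all three with trivial one-dimensional stand-ins.

```python
from typing import Callable, List, Tuple

Sample = Tuple[float, ...]  # stand-in for an adsorption configuration


def optimization_loop(
    initial: List[Tuple[Sample, float]],
    fit: Callable[[List[Tuple[Sample, float]]], object],
    propose: Callable[[object], List[Sample]],
    evaluate: Callable[[Sample], float],
    n_iterations: int,
) -> List[Tuple[Sample, float]]:
    """Fit surrogate -> propose sites -> evaluate expensively -> grow dataset."""
    dataset = list(initial)
    for _ in range(n_iterations):
        surrogate = fit(dataset)              # role of the GNN + GP surrogate
        for candidate in propose(surrogate):  # role of the Bayesian Optimizer
            energy = evaluate(candidate)      # role of the DFT model
            dataset.append((candidate, energy))
    return dataset


# Toy stand-ins (assumptions for illustration only).
def toy_fit(data):
    # The "surrogate" is simply the best site seen so far.
    return min(data, key=lambda pair: pair[1])[0]


def toy_propose(best):
    # Suggest one site on each side of the current best.
    return [(best[0] - 0.25,), (best[0] + 0.25,)]


def toy_energy(site):
    # Quadratic well with its minimum at x = 1.
    return (site[0] - 1.0) ** 2


history = optimization_loop(
    [((0.0,), 1.0)], toy_fit, toy_propose, toy_energy, n_iterations=4
)
best_site, best_energy = min(history, key=lambda pair: pair[1])
```

Passing the three stages in as callables keeps the loop itself independent of how the surrogate is trained or how the expensive evaluation is carried out.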

Fig. 11: AUGUR pipeline summary.

The top row shows the optimization pipeline, from left to right: training the GNNs and the GPs, using them as surrogates for BO, evaluating the BO suggestions with DFT, adding the new results to the dataset, and repeating. The bottom row shows the point generation for BO, from left to right: define the cluster, place the first adsorbing atom at a physically consistent distance and outside the convex hull of the molecule, and repeat this process for all atoms of the adsorbent molecule.

It must be noted that the exact model architecture becomes less relevant as the number of samples goes to infinity. Since we are operating on a limited dataset, however, we need to decide on the most efficient architecture parameters. Some broad guidelines can be followed. The number of convolution layers, for example, can be chosen to roughly correspond to the maximum number of n-hop neighbors that affect each atom’s interaction. The GP kernel can be chosen to reflect the perceived smoothness or discontinuity of the interaction energy surface, and if the data are standardized, a zero mean can be chosen as the prior mean. These guidelines, however, are not guaranteed to yield optimal results, and the system-specific knowledge they rely on is often lacking. We therefore perform a model parameter sweep and choose the configuration that yields the lowest test-set MSE. Lastly, the pipeline provides a prediction of the overall interaction energy surface together with its uncertainty. Combined, these allow the user to decide on a case-by-case basis whether the predictions are sufficiently accurate for the desired application.

Density Functional Theory

Density Functional Theory (DFT) is an approximation method used to solve the Schrödinger equation for many-body systems.

The total energy (functional of density) of the system is given by:

$$E[\rho ]={T}_{{\rm{s}}}[\rho ]+{E}_{{\rm{H}}}[\rho ]+{E}_{{\rm{ext}}}[\rho ]+{E}_{{\rm{xc}}}[\rho ]$$
(1)

where $T_{\rm s}[\rho]$ is the kinetic energy of the non-interacting electrons, $E_{\rm H}[\rho]$ is the Hartree energy representing the electron-electron repulsion, $E_{\rm ext}[\rho]$ is the external potential energy from the nuclei, and $E_{\rm xc}[\rho]$ is the exchange-correlation energy that includes all complex quantum mechanical effects.

The variational principle used to determine the ground-state energy is given by:

$${E}_{0}=\mathop{\min }\limits_{\rho }\,\left\{E[\rho ]\right\}$$
(2)

DFT is particularly valued for its computational efficiency: its cost scales as $N^{3}$, which is more favorable than the scaling of traditional ab-initio methods. In this context, $N$ represents the system size, measured by the number of atoms, electrons, or basis functions. Consequently, if the system size doubles, the computational cost increases only by a factor of 8. This more favorable scaling allows DFT to perform calculations faster and handle larger systems more effectively than traditional ab-initio methods like Hartree-Fock (HF). Furthermore, DFT inherently accounts for electron correlation effects, which are often neglected by HF. This capability makes DFT especially useful for systems where electron correlation is significant, such as in transition metal chemistry, which is central to many chemical reactions and the use cases in our study.
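As a worked illustration of the cubic scaling quoted above, assume the cost of a calculation is $C(N)\propto N^{3}$; the symbol $C(N)$ is introduced here for illustration only. Doubling the system size then gives the first ratio below. The second ratio is an assumed point of comparison based on the formal $N^{4}$ scaling commonly quoted for Hartree-Fock:

$$\frac{C(2N)}{C(N)}=\frac{{(2N)}^{3}}{{N}^{3}}={2}^{3}=8,\qquad \frac{{(2N)}^{4}}{{N}^{4}}={2}^{4}=16$$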

Despite these advantages, DFT does have limitations, particularly in terms of system size. Although DFT simplifies the many-body problem of N electrons to a dependency on the electronic density (reducing the complexity to three spatial coordinates), it is generally constrained to handling systems with a few hundred atoms. The computational effort still scales approximately as $N^{3}$, making the study of electronic structure properties of large systems resource-intensive and time-consuming.

To mitigate these limitations, our study leverages AUGUR, a framework that guides the experimental process using Bayesian Optimization (BO) enhanced by graph-based Gaussian Processes. This approach reduces the number of required DFT simulations, significantly lowering the computational burden.

Bayesian Optimization

Bayesian Optimization (BO) is an optimization algorithm that has proven itself to be both data efficient and accurate even in non-convex/concave optimization problems. Consider a costly optimization problem, such as one relying on expensive simulations or arduous experimental work for the evaluation of the objective function. The natural solution in such cases is to train a surrogate model that provides sufficient accuracy at a fraction of the required time. Nevertheless, data-driven models typically require a lot of data to be trained sufficiently, which in turn defeats the main motivation for the surrogate model, i.e., minimizing the use of the time-consuming part of the pipeline. Bayesian optimization is particularly suited to this kind of problem because it simultaneously trains a surrogate model efficiently and identifies promising query points.

BO relies on two components. The first is a stochastic predictor, i.e., a predictor that can provide uncertainty quantification, typically a Gaussian Process (GP), a Bayesian Neural Network (BNN), or similar. The second is a function that evaluates each possible point on a) how optimal it is and b) how much new information it injects into the model. This function shall be referred to as the acquisition function. For the remainder of this section, we describe a minimization process, but everything applies without loss of generality to any optimization problem.

In the present work, we seek the adsorption position that results in the lowest possible energy. To this end, we generate a number of random positions and evaluate their respective energies using the uncertainty-aware predictor. The predicted energy is an indicator of the site’s optimality, while the standard deviation reflects how uncertain the prediction is. Focusing on points of high standard deviation leads to the most efficient injection of new information and, therefore, the fastest training of the surrogate model. This is called exploration and is typically the focus of the first queries of the ground truth process. As the surrogate improves, the focus shifts to determining the optimal points, and the choice is therefore increasingly determined by the prediction and less by the standard deviation. This phase is called exploitation. To accommodate both exploration and exploitation efficiently, we simulated one point with a tradeoff of 0.5 and one point with a tradeoff of 100. The first prioritizes areas of strong interaction and the second areas of high uncertainty. A more detailed derivation can be found in the SI.
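The role of the tradeoff parameter can be illustrated with a small numerical sketch. It assumes a lower-confidence-bound acquisition of the form mean minus tradeoff times standard deviation, which is a common choice for minimization; the exact acquisition function used in this work is derived in the SI and may differ. The function name `lower_confidence_bound` and the prediction values are hypothetical.

```python
import numpy as np


def lower_confidence_bound(mean, std, tradeoff):
    """Score candidates for minimization; lower scores are more attractive.

    Assumed form: mean - tradeoff * std (not taken from the paper).
    """
    return np.asarray(mean) - tradeoff * np.asarray(std)


# Hypothetical surrogate predictions for five candidate sites (arbitrary units).
mean = np.array([-1.20, -0.90, -0.30, -1.00, -0.10])
std = np.array([0.05, 0.10, 0.40, 0.02, 0.45])

# A small tradeoff favors the lowest predicted energy (exploitation).
exploit_idx = int(np.argmin(lower_confidence_bound(mean, std, tradeoff=0.5)))

# A large tradeoff favors the most uncertain prediction (exploration).
explore_idx = int(np.argmin(lower_confidence_bound(mean, std, tradeoff=100.0)))

print(exploit_idx, explore_idx)
```

With these illustrative numbers, the small tradeoff selects the site with the lowest predicted energy, whereas the large tradeoff selects the site with the largest standard deviation.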

Note that there are multiple, much more efficient ways of optimizing the acquisition functions, including gradient-based and evolutionary algorithms. Nevertheless, the complexity of the constraints of this problem made their application non-trivial. In comparison, random point generation proved very efficient, especially given that the deep kernel method allows for the fast and parallel processing of thousands of points in mere seconds. More information about how the points were generated can be found in Section “Point generation”.

Point generation

The convergence of Bayesian Optimization can be significantly sped up if we introduce a layer of physical understanding of the problem. The simplest way is to generate a number of points (in our case, 10,000) that adhere to a set of constraints. A random atom of the cluster is chosen. The adsorbing atom is placed on the surface of a sphere with a radius determined by the physical limits of the interaction distance between the two atom types. Next, we test whether the adsorbent lies inside or outside the cluster. This is done by performing a Delaunay tessellation on the cluster coordinates and determining the location of the adsorbent with respect to the resultant convex hull. If the adsorbent is found inside the hull, it is discarded, and the process starts anew. If this constraint is not violated, the distance of the adsorbent to the remaining cluster atoms is determined. If it is not closer than the acceptable limit to any atom, the new adsorbent position is accepted into the sample set and evaluated through the acquisition function. Otherwise, it is discarded, and the process starts anew. The sample generation process can be summarized as follows:

  1. Choose a random atom of the cluster, provided adsorption is physically possible on it.

  2. Place the adsorbent on the surface of a sphere with a radius determined by physical knowledge of the maximum and minimum interaction distance of these two atoms.

  3. Test if the adsorbent falls inside the convex hull of the cluster or passes through the cluster surface.

  4. If the above constraint is not violated, test if the adsorbent is closer than a distance of tolerance to the other atoms.

  5. If the above constraint is not violated, the sample is admitted into the list of viable samples.

If the adsorbent consists of more than one atom, we place every subsequent atom on a sphere with a radius chosen based on the expected bond length and repeat steps 3 and 4 until all the adsorbent atoms have been placed. Then, the total system is accepted into the list of viable samples. A concise flowchart of the process can be found in Fig. 11.
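The rejection sampling described in this section can be sketched for a single-atom adsorbent as follows. This is an illustrative sketch, not the AUGUR code: the function names, the uniform sampling of the radius between `r_min` and `r_max`, the numerical values, and the random stand-in cluster are all assumptions, and the filter for atoms on which adsorption is physically impossible is omitted. The hull test uses `scipy.spatial.Delaunay`, whose `find_simplex` method returns -1 for points outside the tessellation.

```python
import numpy as np
from scipy.spatial import Delaunay


def random_unit_vector(rng):
    """Uniformly distributed direction on the unit sphere."""
    v = rng.normal(size=3)
    return v / np.linalg.norm(v)


def generate_candidates(cluster, r_min, r_max, tolerance, n_samples, rng,
                        max_tries=100_000):
    """Rejection-sample single-atom adsorbent positions around a cluster."""
    hull = Delaunay(cluster)  # tessellation of the cluster coordinates
    accepted = []
    for _ in range(max_tries):
        if len(accepted) == n_samples:
            break
        # Step 1: pick a random cluster atom.
        anchor = rng.integers(len(cluster))
        # Step 2: place the adsorbent on a sphere around it (assumed radius law).
        radius = rng.uniform(r_min, r_max)
        point = cluster[anchor] + radius * random_unit_vector(rng)
        # Step 3: reject positions inside the convex hull.
        if hull.find_simplex(point) >= 0:
            continue
        # Step 4: reject positions too close to any other cluster atom.
        others = np.delete(cluster, anchor, axis=0)
        if np.any(np.linalg.norm(others - point, axis=1) < tolerance):
            continue
        # Step 5: admit the sample.
        accepted.append(point)
    return np.array(accepted)


rng = np.random.default_rng(0)
cluster = 2.0 * rng.normal(size=(12, 3))  # hypothetical 12-atom cluster
candidates = generate_candidates(cluster, r_min=1.8, r_max=2.5,
                                 tolerance=1.5, n_samples=50, rng=rng)
```

For a multi-atom adsorbent, the same hull and distance tests would be repeated for every subsequent atom before the full configuration is accepted.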

Gaussian processes

In the current project, we chose a Gaussian process as the surrogate model, as it combines robustness to overfitting with the well-documented accuracy of global kernel methods. It is a data-driven stochastic algorithm that models predictions as the posterior of the Bayes formula. The prior and the likelihood are modeled as Gaussian distributions whose parameters are optimized based on the already measured samples. In this way, the algorithm provides an inherent uncertainty quantification, which can be crucial for data sets of small size. A more detailed description can be found in the SI or in ref. 63.
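To make the uncertainty quantification concrete, the following sketch computes the posterior mean and standard deviation of a zero-mean GP in plain NumPy. It is an illustration under assumptions: the squared-exponential kernel, the fixed hyperparameters, and the function names `rbf_kernel` and `gp_posterior` are chosen here for brevity, whereas in practice the kernel parameters are optimized on the measured samples.

```python
import numpy as np


def rbf_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel between two sets of row vectors."""
    sq = (np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    return variance * np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)


def gp_posterior(x_train, y_train, x_test, noise=1e-6):
    """Zero-mean GP regression: posterior mean and standard deviation."""
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_ts = rbf_kernel(x_train, x_test)
    k_ss_diag = np.diag(rbf_kernel(x_test, x_test))
    chol = np.linalg.cholesky(k_tt)  # stable alternative to inverting k_tt
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    v = np.linalg.solve(chol, k_ts)
    mean = k_ts.T @ alpha
    var = k_ss_diag - np.sum(v**2, axis=0)
    return mean, np.sqrt(np.maximum(var, 0.0))


# Toy data: three observations of sin(x); query one seen and one distant point.
x_train = np.array([[0.0], [1.0], [2.0]])
y_train = np.sin(x_train[:, 0])
x_test = np.array([[1.0], [5.0]])
mean, std = gp_posterior(x_train, y_train, x_test)
```

At the training location the predicted standard deviation is close to zero, while far from the data it returns to the prior value, which is the behavior that the acquisition function exploits.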

It must be noted that AUGUR aims at providing predictions and optimization with a limited dataset. If the complexity of the system rises significantly, one must resort to the so-called batch GP64 or fully Bayesian neural networks65. These algorithms should be able to handle large datasets, though at the cost of somewhat decreased performance. However, with state-of-the-art algorithms implemented in GPyTorch66,67, GPs can easily handle datasets of a million samples with very limited approximations.

Graph Neural Networks

Gaussian processes are a versatile family of algorithms. Nevertheless, their performance degrades significantly in feature spaces of large dimensionality. Further, meaningfully representing a molecular structure using a set of scalar features is non-trivial. One class of data-driven algorithms that provides a natural encoding for molecules, however, is Graph Neural Networks (GNNs).

The inputs of this architecture are graph structures, which consist of nodes, edges, and edge attributes. Each node is a mathematical representation of an atom. The node features used in this work are the atomic radius, atomic mass, and electronegativity of each atom. Edges are tuples that describe which nodes are connected. Lastly, the edge attributes contain information about the strength of interaction between two nodes. In the present study, the relevant entry of the Coulomb matrix and the pairwise distance between two atoms were used as edge features.
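A graph of this kind can be assembled as in the sketch below. Several parts are assumptions made for illustration: connecting atoms within a distance `cutoff` is a hypothetical connectivity rule, the function name `build_graph` is invented here, distances are kept in ångström rather than the atomic units conventionally used for the Coulomb matrix, and the water geometry and node-feature values are approximate textbook numbers, not data from this study. The off-diagonal Coulomb-matrix entry is the product of the two nuclear charges divided by the interatomic distance.

```python
import numpy as np


def build_graph(numbers, positions, cutoff):
    """Return a directed edge list and [Coulomb entry, distance] edge attributes."""
    numbers = np.asarray(numbers, dtype=float)
    positions = np.asarray(positions, dtype=float)
    edges, attrs = [], []
    for i in range(len(numbers)):
        for j in range(len(numbers)):
            if i == j:
                continue
            dist = np.linalg.norm(positions[i] - positions[j])
            if dist <= cutoff:  # hypothetical connectivity rule
                coulomb = numbers[i] * numbers[j] / dist  # Z_i * Z_j / r_ij
                edges.append((i, j))
                attrs.append((coulomb, dist))
    return np.array(edges), np.array(attrs)


# Illustrative water molecule: O, H, H (approximate coordinates in ångström).
numbers = [8, 1, 1]
positions = [[0.0, 0.0, 0.0], [0.757, 0.586, 0.0], [-0.757, 0.586, 0.0]]

# Node features per atom: [radius, mass, electronegativity] (approximate values).
node_features = np.array([
    [0.66, 15.999, 3.44],  # O
    [0.31, 1.008, 2.20],   # H
    [0.31, 1.008, 2.20],   # H
])

edges, edge_attrs = build_graph(numbers, positions, cutoff=1.2)
```

With this cutoff, only the two O-H pairs are connected, each in both directions, so the graph has four directed edges.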

The resulting representations are processed with graph convolution layers. These layers operate in two steps. Initially, all the nodes connected to a specific node create messages that codify the effect of these nodes on the receiving one. The messages can be generated by a variety of algorithms, but in the present work, we chose a neural network inspired by ref. 68. This process is repeated for every node.

The second step is the aggregation process, where all the messages directed at one node are condensed into one update that is then applied to the node features. Typically, this can be done by averaging or selecting the maximum message values, but more involved strategies like transformers can be applied. The updated node features now contain information about their neighboring nodes, thus giving more context to the representation.

Finally, once sufficient context has been applied, the graph representation is pooled down to a fixed-dimension vector. Typically, this can be done by averaging or selecting the maximum value of every feature across the entire graph. This step condenses a graph of arbitrary size into a representation of fixed size. In this way, the same algorithm can process molecules of various sizes without relying on suboptimal techniques such as padding.
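The message, aggregation, and pooling steps can be sketched in NumPy as follows. This is a simplified illustration rather than the architecture used in this work: the message and update functions are single `tanh` layers with random weights, mean aggregation and mean pooling are assumed, and all names (`message_passing_layer`, `mean_pool`, `w_msg`, `w_upd`) are hypothetical.

```python
import numpy as np


def message_passing_layer(nodes, edges, edge_attrs, w_msg, w_upd):
    """One convolution: build a message per edge, mean-aggregate, update nodes."""
    n_nodes, hidden = nodes.shape[0], w_msg.shape[1]
    summed = np.zeros((n_nodes, hidden))
    counts = np.zeros(n_nodes)
    for (src, dst), attr in zip(edges, edge_attrs):
        inputs = np.concatenate([nodes[src], nodes[dst], attr])
        summed[dst] += np.tanh(inputs @ w_msg)  # message from src to dst
        counts[dst] += 1
    aggregated = summed / np.maximum(counts, 1)[:, None]  # mean aggregation
    return np.tanh(np.concatenate([nodes, aggregated], axis=1) @ w_upd)


def mean_pool(nodes):
    """Condense node features of any graph size into one fixed-size vector."""
    return nodes.mean(axis=0)


rng = np.random.default_rng(0)
F, E, H, OUT = 3, 2, 8, 4  # node, edge, hidden, and output feature sizes
w_msg = rng.normal(size=(2 * F + E, H))
w_upd = rng.normal(size=(F + H, OUT))


def random_graph(n_nodes):
    """Fully connected toy graph with random features."""
    nodes = rng.normal(size=(n_nodes, F))
    edges = [(i, j) for i in range(n_nodes) for j in range(n_nodes) if i != j]
    attrs = rng.normal(size=(len(edges), E))
    return nodes, edges, attrs


small = mean_pool(message_passing_layer(*random_graph(3), w_msg, w_upd))
large = mean_pool(message_passing_layer(*random_graph(5), w_msg, w_upd))
```

Both `small` and `large` have the same length even though the underlying graphs differ in size, and because aggregation and pooling are averages, relabeling the atoms leaves the pooled vector unchanged.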

Graph neural networks can naturally encode molecules but lack the stochastic nature of Gaussian processes. To retain it, we use a GNN as a feature extractor for the molecules, and the extracted features serve as inputs to the GP. The parameters of the full pipeline are trained simultaneously with the goal of minimizing the negative marginal log likelihood. In this way, we combine the advantages of both algorithms.
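For reference, the training objective of such a combined model can be written in its standard form for zero-mean GP regression. The notation is introduced here for illustration and is not taken from the original derivation: $g_{w}$ denotes the GNN feature extractor with weights $w$, $k_{\theta}$ the GP kernel with hyperparameters $\theta$, $\sigma^{2}$ the noise variance, and $n$ the number of training samples. Minimizing this quantity with respect to $w$, $\theta$, and $\sigma$ jointly is equivalent to maximizing the marginal log likelihood:

$$-\log p({\bf{y}}| X,\theta ,w)=\frac{1}{2}{{\bf{y}}}^{\top }{K}_{\theta ,w}^{-1}{\bf{y}}+\frac{1}{2}\log \left|{K}_{\theta ,w}\right|+\frac{n}{2}\log 2\pi ,\qquad {\left[{K}_{\theta ,w}\right]}_{ij}={k}_{\theta }\left({g}_{w}({x}_{i}),{g}_{w}({x}_{j})\right)+{\sigma }^{2}{\delta }_{ij}$$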