Introduction

The discovery of new materials with desirable properties is crucial to applications in areas such as energy storage and electronics manufacturing. However, virtual materials screening is still gated by time- and resource-consuming physics-based simulations1,2,3, high-throughput experimentation4,5,6,7,8, and the need for good learned representations of materials9,10,11,12. In recent years, machine learning (ML) has revolutionized inverse materials design, where given one or more target properties, a model is trained to generate candidate compounds that satisfy the desired objectives.

Generative neural networks such as autoencoders have been used to optimize organic molecules with desirable properties over a learned latent space13,14. However, optimizing over a non-convex objective function in a high-dimensional latent space is difficult15. Generative adversarial networks (GANs) have also seen success in molecular generation tasks. While GANs circumvent the need for latent space optimization, they suffer from issues such as training instability and mode collapse, which make it difficult to generate compounds with high diversity and validity16. In the inorganic composition space, variational autoencoders (VAEs), GANs, and autoregressive models have been leveraged to generate hypothetical inorganic materials17,18,19, 2D materials20, semiconductors21, reticular frameworks22, and cubic materials23. However, these prior works generate compositions satisfying materials property objectives by altering the distribution of the training set (e.g., only training on high band gap materials). This training strategy is data-inefficient, as it requires a comprehensive materials dataset of compositions exhibiting the property of interest within the target range. These prior works are also of limited applicability to real-world materials discovery, which often requires narrow synthesis or property windows or involves complex multi-objective tasks, for which it is difficult to curate large, high-quality materials science datasets containing sufficient data in the property regimes of interest. Furthermore, these works only consider materials property objectives without considering synthesis objectives for generated materials. Synthesis objectives such as synthesis temperature or time are important factors in inorganic materials design.
For instance, controlling calcination and sintering temperature and time is experimentally useful to influence materials properties (magnetic, electrical, optical, electrochemical), compound morphology (crystal structure, particle size, grain size, phase purity), and reaction yield12,24,25,26,27,28,29,30,31,32,33. Prior works also do not consider multi-objective reward functions, such as those jointly optimizing synthesis and property objectives, which are challenging for generative methods such as GANs and VAEs14,34.

Reinforcement learning (RL) is an ML approach which describes how intelligent agents take actions in an interactive environment to maximize an expected cumulative reward35. Recent advances in ML have paired the RL mathematical formalism of decision-making with advances in deep learning to train models which learn from complex, high-dimensional inputs and action spaces. RL approaches to molecular generation predict molecules with respect to a reward function. Prior work includes both policy gradient and Q-learning approaches to the design of organic molecules and has demonstrated these methods to be sample-efficient, stable during training, and high-performing15,36,37. Prior applications of RL to the inorganic materials space have been limited to specific case studies such as optimization of chemical vapor deposition synthesis conditions for MoS238, metamaterial design39, digital materials design40, metal-organic frameworks41, graphene oxide functional group distribution42, and nanopore design43. Our previous work44 showed initial promise for materials discovery using RL methods; to the best of our knowledge, no other prior work has leveraged RL for inorganic materials composition generation. Deep RL is particularly suited to guided materials generation tasks as it is adept at learning from the high-dimensional data commonly encountered in the materials space45. Specifically, Q-learning-based RL approaches are not constrained by prior knowledge or large initialization datasets, which is important in the data-scarce domains frequently encountered in materials science46. Policy gradient-based approaches can directly optimize a policy in large discrete and continuous action spaces, such as the space of all elements and coefficients, respectively46. Despite these advantages, there remains a dearth of ML-informed screening methods in the inorganic materials space which can suggest initial target compounds subject to both synthesis objectives and property objectives.
This is a crucial and necessary step for the advancement of high-throughput compound screening and laboratory automation, such as ML-accelerated self-driving laboratories4,5,6,8,47.

We use a data-driven RL approach to generate novel inorganic oxide materials satisfying both materials property (band gap, formation energy, bulk modulus, shear modulus) and synthesis (sintering temperature, calcination temperature) objectives. We present two RL approaches, a deep policy gradient network (PGN) based algorithm and a deep Q-network (DQN) based algorithm, which are employed to explore the inorganic chemical design space. We use template-based matching to propose potential crystal structures for the identified compositions of interest. We compare and contrast the two methods to highlight their respective benefits and shortcomings in different contexts of ML-enabled inorganic materials design. The RL-generated compositions in this work exhibit high validity, negative formation energy, and strong adherence to the target objectives. The proposed models outperform baseline ML methods in validity, diversity, and materials property satisfaction. Ultimately, this work proposes two key methods that can accelerate the discovery of novel materials satisfying multiple objectives to bridge synthesis-property design spaces.

Results

We frame the RL-assisted discovery of novel inorganic materials with desirable properties as a feedback loop (Fig. 1a). Each of the RL approaches developed in this work uses a generator model and a predictor model. The generator model generates novel, chemically valid materials satisfying one or more synthesis or property objectives. The predictor model is a supervised ML algorithm trained to predict materials synthesis or property values from a chemical composition. In our learning workflow, the generator model first suggests an inorganic material composition. The predictor model then assigns a reward to the generated chemical formula based on a user-specified reward function, such as maximizing or minimizing a particular property (e.g., maximize bulk modulus, minimize sintering temperature) or targeting a specific value (e.g., a band gap of 2.5 eV). The reward function can consist of a single objective or differently weighted multiple objectives depending on the preferences and end goals of the user. For instance, if an experimentalist wishes to design materials with a stringent objective of a large bulk modulus but a more forgiving objective of processing temperature, a reward function that places a strong emphasis on maximizing bulk modulus and a weaker emphasis on minimizing calcination temperature would be more appropriate.
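As a concrete illustration, this feedback loop can be sketched in a few lines of Python. Here `propose_composition` and `predict_property` are hypothetical stand-ins for the trained generator and predictor models (they are not the actual architectures used in this work), and the reward targets a band gap of 2.5 eV as in the example above.

```python
import random

TARGET_BAND_GAP = 2.5  # eV, the user-specified target

def propose_composition(rng):
    """Toy generator stand-in: returns a random oxide composition string."""
    metals = ["La", "Ti", "Sr", "Ba", "Zn"]
    return f"{rng.choice(metals)}2O3"

def predict_property(composition):
    """Toy predictor stand-in: deterministic pseudo-band-gap from the formula."""
    return (sum(map(ord, composition)) % 50) / 10.0  # eV, placeholder value

def reward(composition):
    """Higher reward the closer the predicted band gap is to the target."""
    return -abs(predict_property(composition) - TARGET_BAND_GAP)

# One turn of the loop: generate candidates, score them, keep the best.
rng = random.Random(0)
candidates = [propose_composition(rng) for _ in range(10)]
best = max(candidates, key=reward)
```

In the actual workflow, the reward signal is used to update the generator's parameters rather than simply to rank a fixed batch of candidates.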

Fig. 1: Overview of RL architectures and materials generation.
figure 1

a ML workflow for material design where an RL agent (material generator) is assigned rewards based on predicted material properties in a feedback loop. b Inorganic materials generation process. At each step of the generation sequence for both the policy gradient network (PGN) and deep Q-network (DQN), state s is represented by material composition while action a constitutes the addition of an element (e.g., La in step 1) and its corresponding composition (e.g., the coefficient of 3 in step 1). c Materials generation process and architecture overview of PGN. d Architecture overview of DQN (fc = fully connected layer, concat. = concatenation).

We formulate the material generation process as a sequence generation task where a trajectory is defined as τ = {s0, a0, s1, a1, …, aT−1, sT} with a horizon capped at T = 5 steps, as shown in Fig. 1b. Here, states s are represented by material compositions while actions a constitute the addition of an element and its corresponding composition to the existing incomplete material. At each step, an element (out of a set of 80 possible elements) and its corresponding composition (an integer from 0 to 9) are appended to the existing (or empty) material. If the corresponding composition is 0, the agent does not add the element. This approach allows the agent to generate materials containing ≤ 5 elements. In all experiments, we restrict generation to oxides: any element may be selected in the first four steps, while only oxygen is allowed at the final step.
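A minimal sketch of this generation scheme follows, assuming a small illustrative element subset rather than the full 80-element action space and a random policy in place of the trained agent; forcing a nonzero oxygen coefficient at the final step is an additional simplification to guarantee an oxide.

```python
import random

# Illustrative subset of the 80-element action space.
ELEMENTS = ["H", "Li", "Na", "Mg", "Al", "Si", "Ca", "Ti", "Fe", "La"]
HORIZON = 5  # T = 5 steps

def generate_composition(rng):
    """Build a composition step by step as a list of (element, coefficient)."""
    parts = []
    for step in range(HORIZON):
        if step == HORIZON - 1:
            # Final step: only oxygen, with a nonzero coefficient (assumption
            # made here so the sketch always yields an oxide).
            element, coeff = "O", rng.randint(1, 9)
        else:
            element, coeff = rng.choice(ELEMENTS), rng.randint(0, 9)
        if coeff > 0:  # a coefficient of 0 means "do not add this element"
            parts.append((element, coeff))
    return parts  # e.g., [("La", 3), ("Ti", 1), ("O", 6)]

rng = random.Random(42)
comp = generate_composition(rng)
```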

Here, we compare two different RL formulations: (1) a policy gradient approach, PGN, which aims to optimize the policy directly, and (2) a deep Q-learning approach, DQN, which aims to optimize the policy via learning a surrogate value function. In both approaches, we define the reward Rt at timestep t given values st and at as

$${R}_{t}\left({{\boldsymbol{s}}}_{t},{{\boldsymbol{a}}}_{t}\right)=\mathop{\sum }\limits_{i=1}^{N}{w}_{i}{R}_{i,t}\left({{\boldsymbol{s}}}_{t},{{\boldsymbol{a}}}_{t}\right)$$
(1)

where Ri,t is the reward from the ith objective and wi is the user-specified weight placed on the ith reward. For materials synthesis and property objectives, Ri,t = P(sT), where P is a predictor model for the particular objective. This reward formulation encompasses both single-objective (N = 1) and multi-objective (N > 1) tasks. We assign the reward Rt a value of zero at all non-terminal steps; a nonzero reward is only assigned at step 5 (the terminal step), when the compound is fully generated.
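The terminal-only, weighted reward of Eq. (1) can be sketched as follows; the predictor functions and weights below are hypothetical placeholders, not the trained models used in this work.

```python
def step_reward(composition, step, horizon, predictors, weights):
    """R_t = sum_i w_i * P_i(s_T) at the terminal step, 0 otherwise (Eq. 1)."""
    if step < horizon - 1:
        return 0.0  # non-terminal steps receive zero reward
    return sum(w * p(composition) for w, p in zip(weights, predictors))

# Hypothetical example: maximize bulk modulus, minimize sintering temperature.
# Minimization is expressed by negating the predictor's output.
predict_bulk_modulus = lambda c: 120.0  # GPa, placeholder predictor
predict_sinter_temp = lambda c: 900.0   # degrees C, placeholder predictor
predictors = [predict_bulk_modulus, lambda c: -predict_sinter_temp(c)]
weights = [0.7, 0.3]

r = step_reward("La2Ti2O7", step=4, horizon=5,
                predictors=predictors, weights=weights)
```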

Supplementary Fig. 1 depicts the materials property and synthesis parameter distributions across the preprocessed inorganic oxide data extracted from the Materials Project48. We select four materials property objectives (formation energy, bulk modulus, shear modulus, band gap) and two materials synthesis objectives (sintering temperature, calcination temperature). This subset contains objectives relevant to materials properties and their applications (negative formation energy; band gap: electronic devices, semiconductors; bulk/shear modulus: building materials, superhard materials) as well as synthesis objectives (calcination/sintering temperature). Lower processing temperatures are desirable due to energy savings, the realization of metastable structure types and desirable materials properties, and manufacturing requirements49,50,51,52,53,54. For example, in the processing of solid-state electrolytes, lower synthesis temperatures can reduce interdiffusion and improve battery component compatibility55,56. As the dataset consists of oxides, the majority of the compounds in the dataset have formation energies below zero and band gaps greater than zero. Sintering temperatures generally skew higher than calcination temperatures: in calcination, a mixture of compounds is heated to a high temperature to remove impurities and unwanted volatile substances, often through thermal decomposition, while in sintering, a compound is heated at a still higher temperature to induce densification and grain growth57,58,59.

Initiating models

We build upon the PGN model architecture proposed by Popova et al.36 to generate inorganic materials with desired properties. We frame this task as optimizing the parameters of a policy network to maximize the expected reward, which is calculated as the sum of rewards assigned to predicted material properties and synthesis parameters. The reward function (Supplementary Table 1) considers all possible terminal states and is approximated through sampling from the generative model.

Our PGN approach (Fig. 1c) employs a stack-augmented recurrent neural network (stack-RNN) as the generative model, which is well-suited for sequence prediction tasks such as generating complex molecules and materials due to its ability to effectively capture sequential patterns36,60,61. This is particularly advantageous in the context of inorganic materials generation, where the generator model must understand complex chemical concepts like ionic charge and electronegativity to predict chemically valid compounds. We initiate the generation process with a <START> token and iteratively predict the next element and composition until reaching a termination condition signaled by an <END> token or reaching a maximum time horizon. The final reward is estimated based on the generated material using a modified reward function. This comprehensive approach provides a framework for systematically generating compounds that not only meet synthesis and property objectives but also adhere to chemical constraints.

In the context of materials design, we also train a DQN agent to maximize expected rewards to generate compounds with desired properties. Similar to the PGN, the DQN generates materials which are then evaluated by a trained material property predictor. The DQN state representation (Fig. 1d) consists of the material composition and the current step of the generation process. The material composition is featurized using the Magpie62 framework, yielding a vector consisting of stoichiometric properties, elemental properties, electronic structure features, and ionic features. Moreover, the representation includes the current step to account for time-dependent policies, which have been shown to outperform time-independent policies15. The action space of the DQN mirrors that of the PGN and comprises an element and element coefficient component encoded as one-hot vectors.
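The one-hot action encoding described for the DQN can be sketched as below; the specific element ordering and index used in the example are assumptions for illustration, and the full state would additionally concatenate the Magpie composition features and the current step.

```python
import numpy as np

N_ELEMENTS = 80  # size of the element set described in the text
N_COEFFS = 10    # integer coefficients 0-9

def encode_action(element_idx, coeff):
    """One-hot encode (element, coefficient) and concatenate the two vectors."""
    elem_onehot = np.zeros(N_ELEMENTS)
    elem_onehot[element_idx] = 1.0
    coeff_onehot = np.zeros(N_COEFFS)
    coeff_onehot[coeff] = 1.0
    return np.concatenate([elem_onehot, coeff_onehot])  # shape (90,)

# Hypothetical example: the element at index 56 with coefficient 3.
a = encode_action(element_idx=56, coeff=3)
```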

To compare our modeling approaches to a suitable baseline, we implement three additional methods and investigate performance. The first is a random agent which selects actions at random and is used to ensure that the RL models are learning from the data.

The second is an implementation of “Deep learning enabled INorganic material Generator” (DING), a conditional variational autoencoder (CVAE) model by Pathak et al. which was demonstrated to generate inorganic material compositions with targeted formation energy, volume per atom, and energy per atom17. We adapt the model to target the synthesis and materials objectives selected in this work. Other general ML-enabled inorganic materials composition generation methods such as refs. 18,19 are not suitable baselines, as they rely on altering the training set distribution to generate materials with specified properties. Finally, we leveraged SMACT63, a Python library of rapid screening and informatics tools that uses chemical rules for high-performance computational screening, to conduct a comprehensive brute-force search across the composition space as a further point of comparison. We note that a brute-force enumeration with SMACT is an approximate upper bound for target property performance, as it systematically generates all composition combinations in the search space rather than conducting an informed search.
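To give a sense of why brute-force enumeration scales poorly, a toy exhaustive search over a tiny composition space might look like the following; this illustrates the combinatorics only and is not SMACT's actual API.

```python
from itertools import product

# Tiny illustrative search space: 4 metals, coefficients 1-3, always with
# oxygen as the final component. The real four- and five-component spaces
# contain > 10^10 and > 10^13 combinations even after chemical filtering.
elements = ["Li", "Na", "Mg", "Al"]
coeffs = range(1, 4)

compositions = [
    ((el1, c1), (el2, c2), ("O", c3))
    for (el1, el2) in product(elements, repeat=2)
    for (c1, c2, c3) in product(coeffs, repeat=3)
    if el1 != el2  # no repeated elements
]
# Even this toy space has 12 element pairs x 27 coefficient triples = 324 entries.
```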

Single-objective materials generation

Our example selected single-objective tasks are:

  1. Minimize formation energy per atom (Form. energy): compound stability relative to its elemental constituents.

  2. Maximize bulk modulus, shear modulus (Bulk mod., Shear mod.): maximize mechanical strength of material, important for structural applications and superhard or ultraincompressible materials.

  3. Maximize band gap (Band gap): large band gap materials are desirable for applications such as wide-bandgap semiconductors and other electronic materials.

  4. Minimize sintering, calcination temperatures (Sinter temp., Calcine temp.): lower temperatures are associated with lower energy consumption costs and greener synthesis pathways.

In Fig. 2 we apply t-distributed stochastic neighbor embedding (t-SNE) dimensionality reduction to 1,000 generated compounds from each of the four modeling approaches across the different objectives after training. When these plots are colored by their target objective (excluding Fig. 2d, as the random agent follows a random policy), clusters emerge. Fig. 2a and b show clear clustering of the generated compounds based on the target objective for the PGN and DQN approaches, respectively. In comparison, we see a higher degree of overlap between the compounds generated (Fig. 2c) using the different objectives for the DING baseline model. The higher degree of clustering by target objective supports that the PGN and DQN models more effectively explore the high-dimensional space and identify desirable compositions which maximize the target reward function as compared to the DING model. Moreover, the clusters identified by the PGN model span a larger portion of the lower-dimensional space as compared to the DQN model. Similar trends are observed when we plot the principal component analysis (PCA) dimensionality reduction of the generated compounds (Supplementary Fig. 2). To quantify these trends, we computed the cluster variances of each of the target objective clusters (Supplementary Table 2) in the original 145-dimensional feature space as well as after t-SNE and PCA dimensionality reduction for the three ML modeling techniques. In nearly all cases in both the original and reduced feature spaces, we find that DING clusters exhibit higher variance than PGN or DQN clusters. For instance, DING cluster variance is on average across all tasks approximately 57% and 13% higher than PGN and DQN, respectively, in the original feature space, 20% and 40% higher (respectively) after t-SNE, and 70% and 58% higher (respectively) after PCA. These results suggest that PGN and DQN better learn meaningful information from the reward function about the relationship between composition and the target property.

Fig. 2: t-SNE dimensionality reduction for generated compositions.
figure 2

Generated inorganic materials spaces as visualized by t-SNE dimensionality reduction for six tasks for the a PGN, b DQN, c DING, d random modeling approaches over 1,000 sampled compounds. Compounds are colored by their respective task.

Supplementary Fig. 3 depicts the differences in progression of the t-SNE plots over training for the PGN and DQN. Compounds generated by PGN during training cluster more broadly across the lower-dimensional space as compared to compounds generated by DQN which cluster more tightly, many of which do not overlap between the two strategies. Differences in RL training strategy result in unique subsets of identified compounds for the same training objectives. All three ML models (PGN, DQN, DING) learn more meaningful representations of the inorganic compositions and their mapping to the synthesis and property tasks than the random baseline.

From Fig. 2a–c, the synthesis objectives (calcination and sintering temperature) and formation energy objective clusters are more widely distributed across the low-dimensional space and overlap with the other materials property objective clusters. Formation energy and calcination/sintering temperature are broader materials science concepts which are more difficult to explicitly define and isolate, as they are correlated to other materials properties as well as crystal structure, electronic structure, and processing conditions. In comparison, bulk/shear modulus and band gap are well-defined for materials, and a number of composition families have been closely associated with large band gap (wide-bandgap semiconductors, insulators) and high bulk/shear modulus (superhard/ultraincompressible compounds), so these properties naturally cluster more closely64,65,66,67. Bulk and shear modulus also cluster more closely to each other than the other selected objectives because they are correlated properties in condensed matter physics. These observations correspond well to physical intuition and support that the ML models have learned general chemical guidelines governing materials composition, property and synthesis conditions.

To evaluate how well the generated compounds adhere to the objective of interest, we show in Fig. 3a the average target property values over all modeling strategies for each of the single-objective tasks. Three of the tasks target property maximization (bulk modulus, shear modulus, band gap) and the other three target property minimization (formation energy, sintering temperature, calcination temperature). We evaluate model performance by predicting target property values of the generated inorganic compositions using the trained property prediction models. For maximization tasks, a higher value of the target property is desired, while for minimization tasks, a lower value is desired. A comprehensive set of performance metrics for all four modeling approaches (PGN, DQN, DING, random) and SMACT enumeration can be found in Supplementary Tables 3–7, respectively.

Fig. 3: Materials property, validity, and diversity metrics for generated compositions.
figure 3

Average (a) (left) target property and (right) formation energy for independent target objectives, (b) charge neutrality and electronegativity balance, (c) uniqueness and Element Mover's Distance of generated compounds for each modeling approach. Metrics are computed for 1,000 generated compounds. Error bars are calculated as the standard deviation over three trials.

The DQN model performs the best for bulk/shear modulus and sinter/calcination temperature tasks, while PGN is most successful in the band gap and formation energy tasks. Both PGN and DQN outperform both the DING model and random baselines on all tasks, demonstrating the capability of the RL frameworks to effectively generate inorganic compositions satisfying property objectives. A training progression of the PGN over time (Supplementary Figs. 4-5) supports this observation. The RL approach allows the model to explore the high-dimensional composition space over time and iteratively locate inorganic compositions better satisfying the given objective, indicated by gradually optimizing the property value with lower average loss and higher average reward. Surprisingly, in certain tasks such as bulk/shear modulus, DING does not outperform the random baseline, motivating the need for more intelligent ML-based approaches to inorganic compound discovery. Bulk and shear modulus are complex physical phenomena with experimental data available for only a small subset of known inorganic compounds48,68, and ML predictor models are most reliable in the composition regime they are trained on. DING is empirically observed to generate less realistic compositions than PGN and DQN by exploiting the higher uncertainty regimes of the predictor model in these tasks, leading to lower performance.

In comparison to the brute-force search (Fig. 3, Supplementary Tables 3–7), which by nature is best-performing as it finds a near-optimal solution, PGN and DQN perform well across the different property tasks. With the SMACT enumeration as a reference point, we find absolute differences of 1.05 eV/atom, 41.1 GPa/0.19 log GPa, 43.2 GPa/0.39 log GPa, 1.29 eV, 90.2 °C, and 68.8 °C for formation energy, bulk modulus, shear modulus, band gap, sintering temperature, and calcination temperature, respectively, between the best-performing model and the exhaustive search. As calculated in Davies et al.63, even after the application of chemical filters such as charge neutrality and electronegativity balance, the space of four-component inorganic materials exceeds 10^10 combinations and the five-component space is estimated to exceed 10^13 combinations. Brute-force search is useful in smaller, more limited composition spaces where the number of components is small. RL-based methods can excel in larger, more complex search spaces where a complete enumeration of compositions is computationally expensive and more targeted exploration is desirable. In this light, it is encouraging that the RL models perform within a close range of the brute-force approach.

For each task, we evaluate model performance based on metrics related to the material discovery process. We focus on charge neutrality and electronegativity balance, two fundamental rules of inorganic compounds that have been used to filter out chemically implausible compositions18,19,20,63,69. We evaluate the charge neutrality and electronegativity balance of generated compounds following the methods in refs. 18,63 by querying each model for 1,000 inorganic compositions and evaluating the percentage adherence to each rule. In Fig. 3b, we compare the percentage of generated compounds that adhere to the charge neutrality and electronegativity balance rules, respectively. PGN consistently outperforms the other three approaches in both metrics, achieving > 90% charge neutral and electronegativity balanced compounds. In comparison, the DQN model shows greater charge neutrality than DING for the majority of the objectives but has comparable or better electronegativity balance than DING in only around half of the tasks. Furthermore, both PGN and DQN outperform the random agent baseline in terms of charge neutrality (random = 0.59 ± 0.01) and electronegativity balance (random = 0.42 ± 0.01) for the majority of the tasks, showing their utility in learning important chemical guidelines to generate valid inorganic materials. SMACT brute-force enumeration is guaranteed to generate compositions passing the charge neutrality and electronegativity balance checks.
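A simplified charge-neutrality filter in the spirit of these checks might look like the following; this is not SMACT's actual implementation, and the oxidation-state table is a small illustrative subset.

```python
from itertools import product

# Illustrative oxidation states for a handful of elements (not exhaustive).
OX_STATES = {"La": [3], "Ti": [2, 3, 4], "O": [-2], "Li": [1], "Fe": [2, 3]}

def is_charge_neutral(composition):
    """Return True if some assignment of oxidation states sums to zero.

    composition: list of (element, coefficient) pairs, e.g. [("La", 2), ("O", 3)].
    """
    elements = [el for el, _ in composition]
    coeffs = [c for _, c in composition]
    state_choices = [OX_STATES[el] for el in elements]
    # Try every combination of allowed oxidation states.
    return any(
        sum(q * c for q, c in zip(states, coeffs)) == 0
        for states in product(*state_choices)
    )

neutral = is_charge_neutral([("La", 2), ("O", 3)])      # La2O3: 2(+3) + 3(-2) = 0
not_neutral = is_charge_neutral([("La", 1), ("O", 1)])  # LaO: +3 - 2 != 0
```

The electronegativity-balance check is analogous: for the chosen oxidation states, cations must be less electronegative than anions.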

We investigate model performance through measuring the diversity of generated compounds. Generating a diverse selection of inorganic compounds that satisfy the desired objectives is crucial for experimentalists for applications in high-throughput screening and materials discovery. We compute the percentage of unique compounds generated in a fixed sample size of 1,000 attempts and compare between modeling approaches (Fig. 3c). Both DQN and the random agent obtain 100% uniqueness, which is not surprising because the random agent always takes random actions (virtually ensuring all generated compounds are unique) and the DQN samples from the top 20% of ranked actions at each step to introduce stochasticity in materials generation. SMACT enumeration is also guaranteed to have 100% uniqueness because we compute all possible combinations of elements. In the PGN formulation, each action is sampled from a multinomial distribution over the probability of the next action given the previous state pΘ(at|st−1), which introduces stochasticity but to a lesser degree. However, the PGN approach still outperforms DING in uniqueness percentage over the majority of the tasks, showcasing its ability to generate compounds that are both unique and exhibit desirable properties.

While uniqueness is a practical metric, it falls short in that many compounds can be strictly unique yet chemically similar in composition. To quantify chemical diversity, we leverage the Element Mover's Distance (EMD)70, a well-defined distance metric for chemical compositions which enables the measurement of inorganic composition similarity. We plot in Fig. 3c the average EMD for each of the modeling strategies over 1,000 generated compounds, with higher EMD signifying greater diversity. The random agent again exhibits the highest diversity because it always selects random actions. On average, DQN and DING show similar EMD ranges, with PGN exhibiting moderately lower EMD diversity for all objectives. SMACT brute-force enumeration displays EMD scores similar to the RL-based approaches, indicating that the RL models achieve the same level of composition diversity as exhaustive search and are neither over- nor under-targeting a particular subset of compositions. We note that it is possible to have high uniqueness and low EMD diversity (or vice versa), as in the PGN formation energy and band gap tasks, where uniqueness differs significantly by 30% while EMD is similar (∆EMD ≈ 1.5). For instance, the model could have found an optimal compositional subspace containing slight variations in chemical formula (e.g., doped versions of the same compound), which could be beneficial or detrimental depending on the use case. In early stage screening, an experimentalist might favor generation of highly diverse compound families for further downstream exploration, while in more stringent use cases variations of specific compound families may be preferred for compatibility or resource requirements. There is also a clear trade-off for the PGN model between generated compound diversity and property satisfaction.
As training progresses, the model further hones in on the optimal compositional subspace which results in the maximum reward, causing property satisfaction to increase over time (Supplementary Fig. 4) but uniqueness and EMD to decrease over time (Supplementary Fig. 5c). Hence, the experimentalist can tune such training hyperparameters to preferred values depending on the use case of interest.
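The intuition behind EMD can be illustrated with a simplified one-dimensional sketch; this is not the ElMD library itself, and the elemental scale below is illustrative rather than the modified Pettifor scale the real metric uses. Each composition is treated as a normalized weight distribution over positions on the scale, and the distance is the 1D Wasserstein distance, computed as the area between the two cumulative distributions.

```python
import numpy as np

# Illustrative (hypothetical) 1D positions for a few elements.
SCALE = {"Li": 1.0, "Na": 2.0, "Mg": 3.0, "Al": 4.0, "O": 8.0}

def emd_1d(comp_a, comp_b):
    """1D earth mover's distance between two compositions.

    comp_*: dict of element -> stoichiometric coefficient.
    """
    grid = np.array(sorted(SCALE.values()))

    def cdf(comp):
        w = np.zeros_like(grid)
        total = sum(comp.values())
        for el, c in comp.items():
            w[np.searchsorted(grid, SCALE[el])] += c / total  # normalized weight
        return np.cumsum(w)

    ca, cb = cdf(comp_a), cdf(comp_b)
    widths = np.diff(grid)
    # Area between the two CDFs = 1D Wasserstein distance.
    return float(np.sum(np.abs(ca - cb)[:-1] * widths))

d = emd_1d({"Li": 2, "O": 1}, {"Na": 2, "O": 1})  # small: Li and Na are adjacent
```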

We also evaluate model performance by measuring the predicted formation energies of generated compounds from each approach. Formation energy is a measure of the driving force to form an inorganic compound from its constituent elements, with lower (more negative) formation energies signifying a larger driving force. We note that a comprehensive comparison to neighboring compounds on the convex hull would be necessary to determine true compound stability71. For each model, we generate 1,000 inorganic compositions and use a machine-learned formation energy predictor model to predict their formation energies. Fig. 3a (right) depicts the average formation energies of compounds generated by models targeting various independent objectives: the x-axis corresponds to the independent objective targeted by the model and the y-axis corresponds to the average formation energy over 1,000 generated compositions. From Fig. 3a, explicitly biasing the models towards lower formation energy causes compounds generated by PGN, DQN, and DING to have more negative average formation energy compared to the random baseline, with PGN and DQN outperforming DING. This behavior is expected, as the models are either rewarded for (in the case of PGN and DQN) or conditioned to (in the case of DING) generate compounds with more negative formation energies. However, PGN-determined formation energy generally becomes more negative during training in all objectives (Supplementary Fig. 5b), even when formation energy is not rewarded. Furthermore, PGN generates compounds with significantly lower formation energies than DQN or DING over all single-property tasks (Fig. 3a), indicating that PGN can generate compounds that both have negative formation energies and exhibit the target property. DQN generates compounds with more negative formation energies than DING and random when it is explicitly rewarded but generates compounds of similar or more positive formation energies in all other tasks.
This is likely because in a number of cases DQN is more adept at locating composition subspaces better satisfying the target property (Fig. 3a) than PGN or DING, some of which may contain larger concentrations of unstable or metastable chemical formulas. We note that the random baseline is a strong baseline for formation energy, as our compound space is restricted to oxides which are generally stable compounds. It is therefore unsurprising that compounds generated by DQN and DING have comparable or less negative formation energies than the random agent in the cases where formation energy is not targeted. Furthermore, PGN and DQN formation energies are within close proximity to brute-force search. Excluding the case when the target property is explicitly formation energy, PGN generates compositions with a lower average formation energy than SMACT enumeration for four out of the five tasks.

Multi-objective materials generation

In materials discovery, experimentalists typically seek to design materials that exhibit multiple desirable properties, so we also use this RL framework to learn joint objectives. The generation protocol and evaluation metrics used in multi-objective tasks to analyze model performance are similar to those of the single-property tasks. The reward function is now a linear combination of individual rewards for each objective (Eq. 1), where the wi are hyperparameters that weigh the relative importance of the individual objectives in the joint task. For a materials generation task with two distinct objectives, this reduces to a single weight w. We investigate the influence of varying w for two selected objectives, sintering temperature (Tsint) and bulk modulus (K), where increasing w prioritizes maximizing bulk modulus and decreasing w prioritizes minimizing sintering temperature:

$${R}_{t}\left({{\boldsymbol{s}}}_{t},{{\boldsymbol{a}}}_{t}\right)=-\left(1-w\right){T}_{{sint}}+{wK}$$
(2)

Equation 2 contains a negative sign on the weight for Tsint because we aim to minimize sintering temperature. We selected the minimization of sintering temperature and the maximization of bulk modulus as our multi-objective task. High sintering temperatures increase production costs, maintenance fees, and equipment costs while making quality control more difficult72. As a result, materials with ultra-low sintering temperatures are desirable to save energy, reduce processing time, and enable further integration with semiconductors such as silicon or GaAs, metals, and polymer-based substrates73,74. Bulk modulus is a crucial parameter in condensed matter physics that measures volumetric elasticity and closely correlates with hardness and toughness75,76,77. “Ultraincompressible” and “superhard” materials with large bulk moduli, which demonstrate high incompressibility or extreme hardness, are of great interest to the broader materials science community in applications such as replacing diamond in cutting tools, abrasives, coatings, and other high-pressure devices78,79,80.
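
The weighted reward of Eq. 2 can be written directly as a function. Note that in practice the paper's reward operates on model-predicted (and typically scaled) property values; the raw units below are used only to illustrate the sign convention:

```python
def multi_objective_reward(t_sint, bulk_modulus, w):
    """Weighted two-objective reward from Eq. 2: R = -(1 - w) * T_sint + w * K.

    Larger w favors maximizing bulk modulus K; smaller w favors minimizing
    sintering temperature T_sint (hence the negative sign on its weight).
    """
    return -(1.0 - w) * t_sint + w * bulk_modulus

# At a low weight w = 0.2, a low-sintering-temperature compound earns a
# higher reward than a stiffer but harder-to-sinter one:
r_low_t = multi_objective_reward(t_sint=700, bulk_modulus=50, w=0.2)   # -550.0
r_high_k = multi_objective_reward(t_sint=1400, bulk_modulus=200, w=0.2)  # -1080.0
```

Sweeping w from 0 to 1 traces out the trade-off between the two objectives discussed below.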

Fig. 4 depicts the multi-objective property distributions of materials generated by PGN, DQN, and DING, color-coded by the objective weight w. As w increases, PGN and DQN prioritize maximizing bulk modulus over minimizing sintering temperature, as observed by the property distribution of the generated materials shifting up (larger bulk modulus) and to the right (higher sintering temperature). When w = 0.2, the generated materials skew towards low bulk modulus and low sintering temperature, as the majority of the reward favors minimizing sintering temperature. When w = 0.4 or w = 0.6, PGN and DQN generate materials with simultaneous improvements in both material properties over the baseline, as shown by the distribution shift to the upper-left corner of the subplots. Here, a more equally proportional weight is placed on each of the reward objectives of minimizing sintering temperature and maximizing bulk modulus. DQN materials have higher rates of validity (charge neutrality and electronegativity balance) but lower diversity (EMD). However, when w = 0.2 or w = 0.4, PGN and DQN can over-prioritize minimizing sintering temperature at the expense of bulk modulus. This illustrates the trade-off between the two properties, resulting in a characteristic Pareto front in multi-objective optimization, where w is a parameter that can be tuned according to the application of interest.

Fig. 4: Multi-objective property distributions.
figure 4

Property distributions generated by multi-objective materials generation for a PGN, b DQN, and c DING models. Multi-objective training is conducted under different objective weights, where w is the weight between the synthesis and property rewards in the multi-objective reward function. Each subplot contains 1,000 generated compositions. Starred compositions indicate near Pareto-optimal compounds with ≥ 80% confidence matched template crystal structures.

Multi-objective materials generation metrics encompassing validity, diversity, and properties of the materials generated by each of the agents are described in Supplementary Table 8. PGN and DQN outperform DING and the random agent in all metrics except EMD for generation of inorganic oxides with both high bulk modulus and low sintering temperature. DING evidently struggles to learn more selective composition spaces such as that required by a dual property objective, as the average bulk modulus and sintering temperature of DING-generated compounds are nearly unperturbed by changing the weight between the two target properties. In contrast, for both PGN- and DQN-generated compounds the sintering temperature and bulk modulus are directly influenced by changing the reward weight, with decreases in sintering temperature of 312 °C and 111 °C and increases in bulk modulus/log bulk modulus of 26.3 GPa/0.27 log GPa and 132.9 GPa/1.07 log GPa for PGN and DQN, respectively. The DQN model generated compounds with lower sintering temperatures and higher bulk moduli for w = 0.6 and w = 0.8, while the PGN model performed better in the w = 0.2 and w = 0.4 regimes. In terms of formation energy, the PGN model was able to consistently generate compounds with comparable or lower formation energy, higher charge neutrality, and higher electronegativity balance percentages across all objective weights for the multi-target task. Policy gradient pretraining allows the underlying recurrent neural network to learn a general understanding of the composition rules for forming stable, chemically valid inorganic compositions from a large dataset, while Q-learning relies only on exploration/exploitation and is not constrained by these rules. Compared to the inorganic materials subspace generated by the random policy (Supplementary Fig. 6), which spans the majority of the property space, the ML model architectures learn a more informed and restricted mapping of inorganic composition to desired materials properties.

Multi-objective RL provides a useful way to elucidate the effects of different chemistries in the laboratory by aiding experimentalists with initial suggestions on how to achieve or alter desired materials properties and synthesis parameters through compositional design choices. From an analysis of the generated compounds, we can visualize broader trends in how composition may affect the target properties by examining commonly found elements in regions of interest (Fig. 5). From these insights, we can qualitatively evaluate the extent of information loss between the property prediction models and the trained RL models. If the visualized trends did not corroborate expectations from materials theory or prior works, there may have been significant information loss during training, or the RL models may have learned from noise in the data. In particular, we expect the RL models to learn the expected trends in the data concerning the relationship between composition and materials properties. For the composition-property space learned by the PGN (Fig. 5a), we see a clear split between the upper-right and lower-left sides of the space. The PGN has learned that Fe-containing compounds tend to have a higher predicted bulk modulus and higher sintering temperature, while compounds containing Li tend to have a lower sintering temperature and lower predicted bulk modulus. Other smaller pockets in the space provide some insight as well, such as the addition of V and Na potentially leading to lower values of bulk modulus and sintering temperature.

Fig. 5: Visualizing the multi-objective RL composition space.
figure 5

Most commonly found elements in compounds generated by a PGN and b DQN models with w ∈ {0.2, 0.4, 0.6, 0.8} spanning the synthesis-property space. Elements are color-coded by their identity. Dark blue squares lacking a label indicate insufficient data to plot.

A similar investigation of the DQN model (Fig. 5b) exemplifies how the exploration/exploitation nature of Q-learning and stochastic sampling of the top actions allows for a wider diversity of elements to be probed for reward maximization. Bi, Rb, and Au are correlated with lower sintering temperature, likely due to correlations and trends in bonding strength and selective doping12,49,54,81, while Os, Fe, Si, Ge, and Pt are found to increase bulk modulus of generated compositions. Our results correspond well with previous works which identify trends in composition, bulk modulus and sintering temperature12,75,82,83. Moreover, our findings are consistent with empirical design rules in materials science for ultraincompressible and superhard materials which suggest that transition metal compounds are promising candidates for ultraincompressible materials84. However, sintering temperature and bulk modulus are inherently complex material characteristics, which can be determined by a combination of crystal structure, electronic structure, method of synthesis, and anthropogenic factors such as experimentalist bias12,81,85,86. Nevertheless, such a tool offers beneficial initial insights for both forward (assessing the impact of a specific additive on resulting properties) and inverse (determining the suitable additives to achieve a particular synthesis condition) inorganic materials design. While this analysis aims to highlight learned correlations without introducing additional assumptions, we hope these findings can be used in conjunction with model explainability methods such as SHAP87 or LIME88 on the property prediction models themselves to reveal important composition-property relationships.
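
The element-frequency analysis underlying Fig. 5 can be sketched as follows: for each cell of the sintering-temperature/bulk-modulus plane, count element occurrences over the generated formulas falling in that cell and report the most common one. `most_common_element` is an illustrative helper (not from the paper), shown here for a single hypothetical bin; oxygen is excluded since every oxide contains it:

```python
from collections import Counter
import re

def most_common_element(compositions):
    """Most frequent non-oxygen element across a set of generated formulas."""
    counts = Counter()
    for formula in compositions:
        # Split a formula like "LiFeO2" into (element, coefficient) pairs.
        for element, _ in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
            if element != "O":  # oxygen is present in every oxide
                counts[element] += 1
    return counts.most_common(1)[0][0]

# e.g. a hypothetical low-T_sint / low-K bin dominated by Li-containing oxides:
top = most_common_element(["Li2O", "LiFeO2", "Li3VO4", "Na2O"])  # "Li"
```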

Template-based crystal structure prediction

Due to the rapid growth in machine learning-enabled materials screening, recent works10,89,90,91,92,93 have made strides in predicting structural properties of inorganic materials solely given their composition. These methods take as input an inorganic composition and predict structural properties such as Bravais lattice, space group, and lattice parameters. To supplement the high-quality hypothetical compositions screened by our models, we adapt the template-based crystal structure prediction algorithm proposed in Kusaba et al.90 to obtain hypothetical substitution-based crystal structures for select generated Pareto-efficient novel materials. This structure prediction method relies on metric learning, an ML-based algorithm which automates the selection of template structures from a crystal structure database with high chemical replaceability to best match the unknown stable structure for a target chemical composition. For each target composition, the classifier predicts the closest match crystal structure and a percentage score which quantifies the confidence in the prediction.

We highlight a subset of compounds in Fig. 4a, b which are PGN- and DQN-generated near Pareto-optimal compositions satisfying both target objectives (large bulk modulus and low sintering temperature), ≥ 80% confidence matched template crystal structures, charge neutrality and electronegativity balance, negative predicted formation energy per atom, and no existing entry in the Materials Project composition dataset used in this work. We additionally show in Fig. 6 template crystal structure matches for a subset of the near Pareto-optimal inorganic compositions, with their predicted bulk moduli, sintering temperatures, formation energies, space groups, closest matching Materials Project ID and composition, and percent confidence match tabulated in Table 1. The selected compositions exhibit high predicted bulk modulus (≥ 4 log GPa/≥ 50 GPa), low predicted sintering temperature (< 900 °C), negative predicted formation energy per atom, and high confidence (> 85%) crystal structure matches. Another utility of the crystal structure matching is the connection to experimentally observed compounds sharing similar structures. For instance, the best match structure suggested for Ge2OsO5 by the crystal structure prediction algorithm has been experimentally synthesized for its template composition CrPb2O5, which could provide additional useful information in terms of synthesis route and conditions for experimentalists attempting to realize the new analogue in the laboratory94.
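
The screening criteria used to star compounds in Fig. 4 can be expressed as a simple predicate. `passes_screen` and the candidate dictionary keys are hypothetical names introduced for illustration; the thresholds follow the text:

```python
def passes_screen(candidate):
    """Screen a generated composition against the criteria described in the text:
    confident template match, valid chemistry, negative predicted formation
    energy per atom, and novelty with respect to the Materials Project subset.
    """
    return (candidate["match_confidence"] >= 0.80
            and candidate["charge_neutral"]
            and candidate["electroneg_balanced"]
            and candidate["formation_energy_per_atom"] < 0.0
            and not candidate["in_materials_project"])

good = passes_screen({
    "match_confidence": 0.91, "charge_neutral": True,
    "electroneg_balanced": True, "formation_energy_per_atom": -1.2,
    "in_materials_project": False,
})  # True
```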

Fig. 6: High confidence, near Pareto-optimal template matched crystal structures.
figure 6

Template crystal structure matches for a subset of the near Pareto-optimal inorganic compositions predicted by the PGN (a, b) and DQN (c, d) models. Crystal structure matches for a LiCr2FeO6, b NaLiMoO4, c Ge2OsO5, d MnPt3O4.

Table 1 Predicted bulk moduli, sintering temperatures, formation energies, space groups, closest matching Materials Project ID (MP-ID) and composition, and percent confidence match for a subset of the near Pareto-optimal compositions

We have demonstrated how RL-based screening methods can identify promising inorganic compositions constrained by target properties of interest. These target compositions can, along with hypothetical crystal structure matches, be used as starting points for further downstream computational calculations (i.e., Density Functional Theory (DFT) and first-principles methods) and experimental validation in the laboratory. ML methods are thus able to aid in filtering out unsuitable candidate compositions and suggest plausible structures to reduce computational and experimental burden and increase materials discovery throughput. The substitution-based crystal structure predictions can be used to filter and constrain the prototype structures used as starting points in first-principles calculations such as DFT calculations which can be prohibitively expensive for high-throughput screening approaches. Moreover, by providing predicted symmetry and structure parameters, these predictions can help significantly streamline the process of structure refinement from experimental data such as powder X-ray diffraction measurements90,91,95.

Discussion

We compare the strengths and shortcomings of the different modeling strategies discussed here to identify where different approaches may excel, as the contexts in which they are deployed could certainly differ. One important metric to evaluate is sample efficiency. Value-learning methods can use off-policy learning and experience replay to boost sample efficiency, while policy gradient-based methods cannot. Another trade-off to consider is the bias-variance trade-off, which impacts the optimality and diversity of the generated compositions. The PGN suffers from variance of gradient estimates, especially when directly sampling from high-dimensional action spaces96. On the other hand, the DQN suffers from bias, as it is overly optimistic when actions are chosen based on maximizing an approximate value function97. This can lead to differences in how diverse (how many different elements and composition families are explored) and how optimal (how well the generated compound distribution satisfies the target objectives) the resulting learned distributions are. Another crucial comparison is between the training methods of each modeling strategy. The PGN requires pretraining of an unbiased model on a large dataset (10³–10⁵ compositions) of inorganic materials, which is beneficial in data-rich use cases like inorganic oxides or alloys but could prove difficult in a data-scarce domain where the composition space is narrow18,98,99. The DQN model, however, can learn simply from trial and error since it uses exploration/exploitation, making it more suitable for materials science domains where the number of known compounds is limited to the tens or hundreds, such as thermoelectrics or solid-state electrolytes100,101,102.

From the performed experiments, the PGN model excels in generating compounds that conform well to the chemical guidelines we have outlined (charge neutrality, electronegativity balance, negative formation energy) while satisfying the target materials synthesis and property objectives at some cost of chemical diversity. One scenario of interest could be where experimentalists have a suite of existing materials satisfying their objective (for instance, a suite of Li-ion and Na-ion battery cathode materials) and wish to discover additional analogues or variations of the selected material families. In this context, the PGN would be more suitable because a pretraining dataset exists and validity is more crucial than diversity. The DQN model outperforms in identifying a diverse set of inorganic compositions with superior materials synthesis and property values but is more prone to error in terms of compound validity. This behavior could be useful in high-risk, high-reward or needle-in-a-haystack use cases (e.g., high ZT thermoelectrics) where high-throughput experimentation is readily available and finding a material which strongly satisfies the synthesis or property objectives outweighs the cost of validating additional predicted compositions102,103.

Another limitation is that our RL modeling strategies only consider generation of oxides with integer coefficients. However, materials with fractional coefficients such as doped compounds are prevalent across materials science, particularly in functional materials like Li-ion battery cathode materials, solid-state electrolytes, and heterogeneous catalysts. Generalizing our modeling strategy to a continuous action space for element coefficients would make the problem significantly more difficult, as there are potentially infinitely many compositions to attempt to satisfy the reward objective which would make the optimization landscape more complex. Future endeavors in this space could adapt new advances in RL which extend these methods to both continuous and discrete action spaces to address this problem104,105.

Furthermore, our methods rely on using learned surrogates (trained ML models) to inform our synthesis and property objectives because they are relatively inexpensive to train compared to more accurate physics-informed calculations. However, using learned surrogates to determine rewards can lead to generation of compounds outside the distribution of the surrogates' training set. Optimizing over the chemical space in this manner will naturally push the explored compounds away from the training set distribution, making prospective real-world errors in synthesis and property prediction higher than the original train-test split error. The RL algorithm could therefore exploit surrogate prediction errors rather than optimizing the true property values, hampering its real-world predictive abilities. Future work is required to explore more rigorous physics-based reward methods to inform our RL algorithms more effectively and increase the reliability of the suggested compounds and their objectives of interest. Quantum-mechanical methods could also be applied to validate compositions and their templated crystal structures to further determine their stability and exhibited properties.

Finally, RL is a fast-growing field, and the machine learning community has rapidly developed new algorithms and model training paradigms over the past years which have mitigated drawbacks and expanded its potential use cases. This work has explored two RL strategies, policy gradient and deep Q-learning, which have shown great success in a number of problem areas. As an initial effort to explore RL in the context of machine learning-assisted materials design, we have compared the advantages and drawbacks of the policy gradient and deep Q-network RL approaches and how they perform against existing ML-guided algorithms, particularly with multi-objective synthesis and property tasks. However, more advanced RL strategies (and newer derivative approaches) such as soft actor-critic106, double deep Q-network107, rainbow deep Q-network108 or proximal policy optimization109 have explored ways to improve in areas such as stability and sample efficiency. We intend to investigate more advanced RL algorithms for inverse inorganic materials design in future works, which we envision will further improve upon model performance and the quality of generated compositions.

We have demonstrated and contrasted two new RL approaches for the generation of inorganic compositions subject to single and multi-objective reward objectives. Our models satisfy common chemical guidelines of charge neutrality, electronegativity balance, and negative formation energy. They successfully learn the composition-synthesis-property space to suggest promising compositions of interest which are matched with template crystal structures to accelerate the screening of inorganic compositions. The developed models in this work capture physics-based trends in the relationships between composition, materials properties, and synthesis parameters and outperform baseline ML methods in the field. Our approach could help guide experimentalists with initial suggestions for inorganic compositions and composition families to further explore to aid in synthesis planning for the discovery and design of new materials. The demonstrated methods could be used as an initial step in high-throughput screening of inorganic compositions in combination with higher-fidelity quantum mechanical calculations and laboratory experiments to identify compounds exhibiting target properties and convex hull stability. These modeling strategies take an important step towards realizing true high-throughput inorganic synthesis, where machine learning can effectively screen large numbers of inorganic compounds to hasten the synthesis of inorganic materials with desirable properties. The initial methods of this work could be effectively applied to other disciplines in materials science which are hard-pressed to hasten the realization of novel materials for emerging technologies, such as renewable energy generation, catalysis, and carbon capture.

Methods

Policy gradient network overview

The model architecture used for the PGN in this work builds upon the work of Popova et al.36 We formulate the goal of generating inorganic materials with desirable properties as a task of finding the parameters Θ of the policy network which maximizes the expected reward

$$J\left(\Theta \right)={\mathbb{E}}\left[R\left({{\boldsymbol{s}}}_{T}\right)\left|{{\boldsymbol{s}}}_{0}\right.,\Theta \right]=\sum _{{{\boldsymbol{s}}}_{T}}{p}_{\Theta }\left({{\boldsymbol{s}}}_{T}\right)R({{\boldsymbol{s}}}_{T})\approx \mathop{\prod }\limits_{t=1}^{T}{{\mathbb{E}}}_{{{\boldsymbol{a}}}_{t} \sim {p}_{\Theta }({{\boldsymbol{a}}}_{t}\left|{{\boldsymbol{s}}}_{t-1}\right.)}R({{\boldsymbol{s}}}_{T})$$
(3)

where R(sT) is the reward assigned based on predicted material properties and synthesis parameters of sT. We sum over all possible terminal states sT and approximate the sum through sampling from the generative model. We use a stack-augmented recurrent neural network (stack-RNN) as the generative model trained with cross-entropy loss function minimization and the REINFORCE algorithm to conduct policy gradient updates during learning110,111.
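
The REINFORCE update direction for a single step of a categorical (softmax) policy can be computed in closed form, which is what the policy gradient estimate reduces to per sampled action. This is a minimal stdlib sketch of the textbook rule, not the paper's stack-RNN implementation:

```python
import math

def reinforce_gradient(logits, action, reward):
    """REINFORCE gradient w.r.t. softmax logits for one sampled action.

    For p = softmax(logits), the gradient of log p[action] w.r.t. the logits
    is one_hot(action) - p, so the REINFORCE ascent direction is
    reward * (one_hot(action) - p).
    """
    z = max(logits)                         # subtract max for numerical stability
    exps = [math.exp(l - z) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return [reward * ((1.0 if i == action else 0.0) - p)
            for i, p in enumerate(probs)]

# A positive terminal reward pushes probability toward the taken action:
grad = reinforce_gradient([0.0, 0.0], action=0, reward=1.0)  # [0.5, -0.5]
```

In the paper this gradient is accumulated over every token of a generated composition and scaled by the terminal reward R(sT).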

The stack-RNN architecture is particularly suited for sequence prediction problems where pattern matching is important, such as in context-free languages and SMILES string generation, due to its ability to “count”36,60,61. In inorganic materials generation, stack-RNNs are useful because the generator model must learn complex concepts such as ionic charge and electronegativity to predict chemically valid compounds. To further reinforce this concept and mirror the effects of a constraint oracle as used in the DQN model (see Q-learning RL framework), we include additional reward terms which positively weight generated compounds that are both charge neutral and electronegativity balanced, leading to the modified reward function

$${R}_{t}\left({{\boldsymbol{s}}}_{t},{{\boldsymbol{a}}}_{t}\right)=\left(1-{w}_{{ce}}\right)\left[\mathop{\sum }\limits_{i=1}^{N}{w}_{i}{R}_{i,t}\left({{\boldsymbol{s}}}_{t},{{\boldsymbol{a}}}_{t}\right)\right]+{w}_{{ce}}{R}_{t}^{{ce}}\left({{\boldsymbol{s}}}_{t},{{\boldsymbol{a}}}_{t}\right)$$
(4)

where Rtce is the reward for satisfying charge neutrality and electronegativity balance and wce is the weight placed on the charge neutrality and electronegativity objectives. In practice, we found that equally weighting materials synthesis/property objectives and charge neutrality/electronegativity balance objectives (i.e., setting wce = 0.5) resulted in the best performance.
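
Eq. 4 translates directly into code. The toy reward values below are illustrative only; in the paper the individual rewards come from the trained property predictors:

```python
def pgn_reward(property_rewards, weights, ce_reward, w_ce=0.5):
    """Modified PGN reward (Eq. 4).

    Combines the weighted synthesis/property rewards with a validity reward
    `ce_reward` for charge neutrality and electronegativity balance;
    w_ce = 0.5 weights the two groups equally, which the text reports
    performed best in practice.
    """
    weighted_properties = sum(w * r for w, r in zip(weights, property_rewards))
    return (1.0 - w_ce) * weighted_properties + w_ce * ce_reward

# Two property objectives weighted equally, plus a satisfied validity reward:
r = pgn_reward(property_rewards=[2.0, 4.0], weights=[0.5, 0.5], ce_reward=1.0)  # 2.0
```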

We initiate the generation process (Fig. 1c) by giving the model a <START> token, which signals to the model to begin a new sequence. At each time step 0 < t < T, the policy network takes the previously generated element and coefficient and produces a probability distribution pΘ(at|st−1) over the action space. We then sample from the predicted distribution to determine the element and coefficient to be added next. The sequence generation task finishes either when the maximum time horizon is reached or when the model generates an <END> token, and the generated material is the concatenation of all of the generated element-coefficient tokens. The final reward is then estimated using Eq. 4.
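
The sampling loop just described can be sketched as follows. `policy` stands in for the trained stack-RNN, and `toy_policy` is a hypothetical hand-written policy used only to make the sketch runnable:

```python
import random

def generate_composition(policy, max_steps=5, seed=0):
    """Autoregressive generation loop: sample element-coefficient tokens from
    the policy until <END> is emitted or the maximum horizon is reached.
    """
    rng = random.Random(seed)
    tokens = ["<START>"]
    for _ in range(max_steps):
        vocab, probs = policy(tokens)
        token = rng.choices(vocab, weights=probs)[0]
        if token == "<END>":
            break
        tokens.append(token)
    return "".join(tokens[1:])  # concatenate the element-coefficient tokens

# Toy policy strongly favoring a lithium oxide and then stopping:
def toy_policy(tokens):
    if len(tokens) == 1:
        return ["Li2", "Fe2"], [0.9, 0.1]
    if len(tokens) == 2:
        return ["O1", "O3"], [0.9, 0.1]
    return ["<END>", "O1"], [0.95, 0.05]

formula = generate_composition(toy_policy)
```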

Deep Q-network overview

Q-learning is a model-free reinforcement learning algorithm which trains an agent to behave optimally in a Markov decision process112. The agent performs actions to maximize the expected sum of all future discounted rewards given an objective function

$$J\left(\Theta \right)=\mathop{\sum }\limits_{t=0}^{T}{\gamma }^{t}{R}_{t}\left({{\boldsymbol{s}}}_{t},{{\boldsymbol{a}}}_{t}\right)$$
(5)

where γ ∈ [0, 1] is the discount factor and Rt represents the reward at step t given state st and action at. The Q-function satisfies the Bellman recursion

$${Q}^{\pi }\left({{\boldsymbol{s}}}_{t},{{\boldsymbol{a}}}_{t}\right)={R}_{t+1}+\gamma \mathop{\max }\limits_{{{\boldsymbol{a}}}_{t+1}}{Q}^{\pi }\left({{\boldsymbol{s}}}_{t+1},{{\boldsymbol{a}}}_{t+1}\right)$$
(6)

with \({Q}^{\pi }\left({{\boldsymbol{s}}}_{t+1},{{\boldsymbol{a}}}_{t+1}\right)\) being the expected sum of all the future rewards the agent receives in the resultant state \({{\boldsymbol{s}}}_{t+1}\). In the context of materials design, the agent uses a policy π to maximize the expected future reward (material property) by learning a DQN Qπ (Fig. 1a). The agent (material generator) generates materials, which are then evaluated by material property predictor(s), assigning the reward R to the agent. The goal is to learn a DQN material generator that maximizes expected rewards to generate compounds with desired properties.

In Q-learning, the Q-value is the expected discounted reward for a given state and action, and therefore the optimal policy π can be found using iterative updates with the Bellman equation (Eq. (6)). Upon convergence, the optimal Q-value Q is defined as:

$${Q}^{* }\left({{\boldsymbol{s}}}_{t},{{\boldsymbol{a}}}_{t}\right)={{\mathbb{E}}}_{{{\boldsymbol{s}}}_{t+1} \sim p}\left[{R}_{t+1}+\gamma \mathop{\max }\limits_{{{\boldsymbol{a}}}_{t+1}}{Q}^{* }\left({{\boldsymbol{s}}}_{t+1},{{\boldsymbol{a}}}_{t+1}\right)\left|{{\boldsymbol{s}}}_{t}\right.,{{\boldsymbol{a}}}_{t}\right]$$
(7)

Here, the Q-function Q(st,at) is approximated with a deep Q-network (DQN) Qθ parameterized by weights θ113.
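
The Bellman backup of Eq. 6 is easiest to see in tabular form. The paper approximates Q with a deep network trained on the same TD target; this stdlib sketch (hypothetical state/action labels) shows one update:

```python
def q_update(q_table, state, action, reward, next_q_values, gamma=0.9, lr=0.5):
    """One tabular Bellman backup toward the TD target
    R + gamma * max_a' Q(s', a')  (Eq. 6), at learning rate `lr`.
    """
    target = reward + gamma * max(next_q_values)
    old = q_table.get((state, action), 0.0)
    q_table[(state, action)] = old + lr * (target - old)
    return q_table[(state, action)]

q_table = {}
# Reward 1.0 for this step; best next-state Q-value is 2.0:
new_q = q_update(q_table, state="s0", action="add_Li", reward=1.0,
                 next_q_values=[0.0, 2.0])
# target = 1.0 + 0.9 * 2.0 = 2.8, so the entry moves halfway: 1.4
```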

Without constraints, a DQN may generate invalid materials through unbounded exploration. To address this, charge neutrality and electronegativity balance constraints are incorporated using a constraint oracle114,115 (a trained function) \({\hat{c}}_{j,t}\) which is formulated as

$${\hat{c}}_{j,t}=\mathop{\max }\limits_{{t}^{{\prime} }\ge t}{c}_{j,{t}^{{\prime} }}$$
(8)

with cj,t being the jth constraint to be satisfied at step t, and \({\hat{c}}_{j,t}\) the maximum level of violation occurring in all current and future steps t′ of the generation process. The constraint oracles Cj(s,a) are learned functions (trained on \({\hat{c}}_{j,t}\)) of the probability of an action satisfying a constraint in all future steps. Consequently, the constrained agent follows a modified policy

$$\pi \left({\boldsymbol{s}}\right)=\mathop{{\rm{argmax}}}\limits_{{\boldsymbol{a}}}\left[{Q}^{\pi }\left({\boldsymbol{s}},{\boldsymbol{a}}\right)+\infty \cdot \min \left(0,{C}_{j}\left({\boldsymbol{s}},{\boldsymbol{a}}\right)-T\right)\right]$$
(9)

where T ∈ [0, 1] is a user-defined threshold set at 0.5. This important modification to the DQN ensures that actions predicted by the constraint oracle Cj(s,a) to eventually lead to invalid compounds are penalized by the last term in Eq. 9, increasing the probability of generating chemically valid compounds.
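
In code, the "−∞ penalty" of Eq. 9 amounts to masking out actions whose oracle score falls below the threshold before taking the argmax. A minimal sketch with hypothetical Q-values and oracle scores:

```python
def constrained_greedy_action(q_values, oracle_scores, threshold=0.5):
    """Greedy action selection under the constraint oracle (Eq. 9).

    Actions with oracle score C_j(s, a) below the threshold T receive an
    effective value of -infinity and can never be chosen; among the rest,
    the action with the maximal Q-value is returned.
    """
    best_action, best_value = None, float("-inf")
    for action, (q, c) in enumerate(zip(q_values, oracle_scores)):
        value = q if c >= threshold else float("-inf")
        if value > best_value:
            best_action, best_value = action, value
    return best_action

# Action 0 has the highest Q-value but is predicted to violate a constraint,
# so the constrained policy picks action 1 instead:
chosen = constrained_greedy_action([5.0, 3.0, 1.0], [0.2, 0.9, 0.8])  # 1
```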

DQN agents typically take the greedy action that maximizes the expected return, rendering the policy deterministic in a static environment. For a materials generation task, however, a deterministic policy would be problematic, as the agent would always take the same actions and generate the same material repeatedly; materials generated by the agent must be chemically diverse. To address this, instead of taking the greedy action, the policy π samples the action from the top n% of actions, \(\pi \left({\boldsymbol{s}}\right)={random}.{sample}({{\boldsymbol{a}}}_{1}^{* },\ldots ,{{\boldsymbol{a}}}_{n}^{* })\), where \({{\boldsymbol{a}}}_{1}^{* },\ldots ,{{\boldsymbol{a}}}_{n}^{* }\) refer to the top n% of actions ranked by Qπ and the constraint oracles Cj(s,a). Empirically, it was found that n can be increased from 0 up to 20 without a deterioration of material properties, so n was set to 20 to maximize the diversity of generated materials. Any value larger than n = 20 would further improve diversity at the expense of material properties.
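
The top-n% sampling rule can be sketched in a few lines (function name and toy Q-values are illustrative):

```python
import random

def sample_top_fraction(q_values, fraction=0.2, seed=0):
    """Sample uniformly among the top `fraction` of actions ranked by Q-value.

    Replaces the deterministic argmax with stochastic sampling from the top
    n% of actions (n = 20 in the paper) to keep generated materials diverse.
    """
    rng = random.Random(seed)
    k = max(1, int(len(q_values) * fraction))
    ranked = sorted(range(len(q_values)), key=lambda i: q_values[i],
                    reverse=True)
    return rng.choice(ranked[:k])

# With 10 actions and fraction = 0.2, only the two highest-Q actions
# (indices 0 and 1 here) can ever be sampled:
action = sample_top_fraction([9, 8, 1, 2, 3, 0, 4, 5, 6, 7], fraction=0.2)
```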

The DQN state representation st (Fig. 1d) has two components. The first is the material composition, where each generated material (complete or incomplete) is featurized using the Magpie62 featurization framework to give an \({{\mathbb{R}}}^{145}\) vector consisting of stoichiometric properties, elemental properties, electronic structure features, and ionic features. The second is the current step t of the generation process. Previous works15 have reported that the inclusion of the step (time-dependent policy) outperforms its exclusion (time-independent policy). Since the horizon is capped at 5 steps, this is represented by a one-hot encoding of \({{\mathbb{R}}}^{5}\). The DQN action space is identical to the PGN action space, with both an element and an element coefficient component. The element component is represented by a one-hot encoding of \({{\mathbb{R}}}^{80}\) and the element coefficient component by a one-hot encoding of \({{\mathbb{R}}}^{10}\), corresponding to the sizes of the element set and the integer coefficient set (0–9), respectively.

Dataset acquisition and preprocessing

To train the RL models and materials property predictor models, we leveraged a subset of inorganic materials and their computed properties contained within the Materials Project (MP) database48. Materials Project is a widely used inorganic materials database containing crystal structures and materials properties data calculated from high-throughput quantum mechanical calculations. To train the sintering and calcination temperature predictor models, we used a publicly released inorganic solid-state synthesis database text-mined from scientific literature using a combination of NLP and rule-based extraction techniques116. We restricted our inorganic materials dataset to oxides, as the majority (> 97%) of the synthesis dataset consists of oxides and our synthesis predictor models would therefore exhibit high uncertainty for compounds not containing oxygen in multi-objective generation tasks.

To preprocess the materials chemical formula and property data, we use a similar preprocessing strategy to Dan et al.18, Jha et al.117, and Pathak et al.17. For a formula with multiple reported formation energies, we choose the lowest one as a heuristic for stability. We additionally removed all single-element compounds and any compounds with a formation energy outside the interval [µ − 5σ, µ + 5σ], where µ and σ are the average and standard deviation of the formation energies of all compounds in the MP dataset. For each property, all entries which did not contain a computed property value were discarded. After preprocessing, we obtained datasets containing the chemical formulas, formation energies, and band gaps for 22,555 compounds as well as the bulk and shear modulus values for 9,888 compounds. For bulk and shear moduli, property values were taken as their corresponding natural logarithms.
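
The deduplication and outlier-filtering steps can be sketched as follows (function name and toy records are illustrative; the real pipeline operates on the full MP subset):

```python
from statistics import mean, stdev

def preprocess_formation_energies(records, n_sigma=5):
    """Deduplicate and outlier-filter (formula, formation_energy) records.

    For duplicate formulas the lowest energy is kept as a stability
    heuristic, then entries outside [mu - n_sigma*sigma, mu + n_sigma*sigma]
    are dropped, mirroring the preprocessing described in the text.
    """
    lowest = {}
    for formula, e in records:
        if formula not in lowest or e < lowest[formula]:
            lowest[formula] = e
    energies = list(lowest.values())
    mu, sigma = mean(energies), stdev(energies)
    return {f: e for f, e in lowest.items()
            if mu - n_sigma * sigma <= e <= mu + n_sigma * sigma}

clean = preprocess_formation_energies(
    [("Li2O", -2.0), ("Li2O", -1.5), ("Fe2O3", -1.7), ("TiO2", -3.3)])
```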

To preprocess the calcination and sintering temperature data, we followed the procedure reported in Karpovich et al.12. Reactions without at least one relevant heating step with a reported temperature were removed. Temperatures were converted into units of Celsius (°C) and limited to between 200 °C and 2000 °C. Times were converted into units of hours (h) and limited to less than 100 h. If a relevant heating step occurred more than once in a recipe (e.g., multiple sintering steps), the last operation in chronological order was taken, and if a relevant heating step was reported with more than one temperature or time, the highest value for that step was taken. For reactions with more than one reported occurrence in literature (same target and precursors), the ground truth reaction condition was taken to be the average of the reported conditions. After preprocessing, a final reaction dataset consisting of 12,228 calcination temperatures and 12,296 sintering temperatures was obtained.
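A simplified sketch of these rules is shown below. It is illustrative only: each step is reduced to a single (temperature, time) pair, so the "highest value per step" rule is omitted, and all helper names are our own.

```python
from statistics import mean

T_MIN, T_MAX = 200.0, 2000.0   # Celsius bounds from the text
TIME_MAX = 100.0               # hours

def clean_recipe(steps):
    """steps: chronological list of (temperature_C, time_h) pairs for one
    relevant operation type (e.g., sintering). Returns the cleaned
    condition, or None if no step passes the bounds."""
    valid = [(T, t) for T, t in steps
             if T_MIN <= T <= T_MAX and t < TIME_MAX]
    if not valid:
        return None
    return valid[-1]  # last relevant step in chronological order

def aggregate_reports(conditions):
    """Average conditions over duplicate literature reports of one reaction."""
    temps = [T for T, _ in conditions]
    times = [t for _, t in conditions]
    return mean(temps), mean(times)

cleaned = clean_recipe([(150.0, 5.0), (900.0, 12.0), (1100.0, 24.0)])
averaged = aggregate_reports([(1000.0, 10.0), (1100.0, 20.0)])
```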

ML property prediction model training

We built ML property prediction models for both materials properties (formation energy, bulk/shear modulus, band gap) and synthesis objectives (sintering temperature, calcination temperature). For the materials property prediction, we leveraged Roost118, a message-passing neural network architecture that predicts materials properties directly from composition. Roost property prediction models were trained with default hyperparameters using a random 90/10 train-test split.

For sintering and calcination temperature prediction, inorganic compounds were featurized using Magpie compositional features, which are physically motivated descriptors that take the form of a 145-dimensional embedding containing stoichiometric properties, elemental properties, electronic structure features, and ionic compound features62. To predict temperature from materials composition, we trained a random forest (RF) model and optimized hyperparameters using 5-fold cross-validation. Optimized hyperparameters for the RF models can be found in Supplementary Table 9, and performance metrics for all ML property prediction models can be found in Supplementary Table 10.
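This training setup might be sketched with scikit-learn as follows; the hyperparameter grid and the synthetic 145-dimensional features are placeholders (the actual grid is in Supplementary Table 9), so this shows the workflow rather than the paper's exact configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Toy stand-in for Magpie-featurized compositions (145-d) and temperatures.
rng = np.random.default_rng(0)
X = rng.random((60, 145))
y = 200 + 1800 * X[:, 0]  # synthetic "temperature" target in [200, 2000]

# Illustrative hyperparameter grid, searched with 5-fold cross-validation.
grid = {"n_estimators": [50, 100], "max_depth": [5, None]}
search = GridSearchCV(RandomForestRegressor(random_state=0), grid, cv=5)
search.fit(X, y)

model = search.best_estimator_     # RF refit on all data with the best grid point
preds = model.predict(X[:5])
```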

Policy gradient RL training

The PGN training is conducted in two steps. We first pretrained a generative model on the 22,555 oxide compounds extracted from Materials Project. This pretraining step ensures the model can learn general chemical guidelines to produce valid material formulas. We then combine the generative and predictive models in an RL framework, where the predictive model is used to assign a reward value to each generated material and the generative model is trained to maximize the expected reward.
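The reward-maximization idea can be illustrated with a toy REINFORCE-style loop: a softmax policy over a small discrete action set is nudged toward actions that a stand-in reward function scores highly. This is a didactic sketch, not the paper's stack-RNN policy, and the reward function is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 10
theta = np.zeros(n_actions)  # policy logits

def reward(action):
    # Stand-in for the property-predictor reward; peaks at action 7.
    return 1.0 - abs(action - 7) / 10.0

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

lr = 0.1
for _ in range(2000):
    p = softmax(theta)
    a = rng.choice(n_actions, p=p)       # sample an action from the policy
    grad = -p.copy()                     # REINFORCE gradient of log pi(a)
    grad[a] += 1.0
    theta += lr * reward(a) * grad       # ascend the expected reward

rewards = np.array([reward(a) for a in range(n_actions)])
final_expected_reward = float(softmax(theta) @ rewards)
```

Under a uniform policy the expected reward is 0.69; after training, probability mass concentrates on the high-reward actions.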

The generative model is a stack-augmented recurrent neural network (RNN) consisting of a gated recurrent unit (GRU) layer with a hidden size of 1500, a stack width of 1500, and a stack depth of 10. The action representation has two components. The first is the element, or the atomic species being added to the material formula sequence. We consider the set of elements present in the pre-processed MP oxide data (80 elements) in the action space. The second is the element coefficient, the numerical subscript of the atomic species added to the sequence, which we restrict to integers from 0 to 9. The action space consists of all possible combinations of the 80 elements with the element coefficients. The state space consists of all possible strings over this alphabet with a maximum length of 5, as a maximum generation length of 5 tokens was used to limit compound generation to at most 5 unique elements.
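The token alphabet and length cap can be made concrete with a small sketch that samples a random (untrained) sequence of (element, coefficient) tokens; the element list is abridged and all names are illustrative.

```python
import random

ELEMENTS = ["Li", "O", "Fe", "Ti", "Ba", "Sr", "Mn", "Co"]  # abridged; 80 in the paper
COEFFS = list(range(10))                                     # integer subscripts 0-9
MAX_TOKENS = 5   # at most 5 (element, coefficient) tokens per formula

def sample_formula_tokens(rng):
    """Sample a random sequence of (element, coefficient) tokens, avoiding
    repeated elements, as an untrained stand-in for the PGN's output."""
    n = rng.randint(1, MAX_TOKENS)
    tokens, used = [], set()
    for _ in range(n):
        el = rng.choice([e for e in ELEMENTS if e not in used])
        used.add(el)
        coeff = rng.choice(COEFFS[1:])  # a zero coefficient would drop the element
        tokens.append((el, coeff))
    return tokens

rng = random.Random(0)
tokens = sample_formula_tokens(rng)
formula = "".join(f"{el}{c}" for el, c in tokens)
```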

The generative model was pretrained for 10,000 iterations and the PGN was trained for 500 iterations, both using a learning rate of 0.001 and the Adadelta optimizer. Models were trained on a single NVIDIA RTX A5000 GPU and training a PGN agent takes 2-3 h.

Deep Q-network RL training

For the DQN architecture, the leaky ReLU activation function is used in all fully connected layers. The four input vectors (the \({{\mathbb{R}}}^{145}\) Magpie composition embedding, the \({{\mathbb{R}}}^{5}\) step encoding, the \({{\mathbb{R}}}^{80}\) element encoding, and the \({{\mathbb{R}}}^{10}\) element coefficient encoding) are individually passed through fully connected layers, and the resulting 64-dimensional hidden states are concatenated into a single vector before passing through a final fully connected layer to give the Q-value.
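A NumPy forward-pass sketch of this architecture is given below. Weights are randomly initialized (a real implementation would train them), the final layer is taken as linear since the text does not specify its activation, and all names are illustrative.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

rng = np.random.default_rng(0)
SIZES = {"composition": 145, "step": 5, "element": 80, "coeff": 10}
HIDDEN = 64

# One fully connected layer per input branch, plus a final output layer.
W_in = {k: rng.normal(0, 0.1, (HIDDEN, d)) for k, d in SIZES.items()}
W_out = rng.normal(0, 0.1, (1, 4 * HIDDEN))

def q_value(inputs):
    """inputs: dict mapping each branch name to its feature vector.
    Each branch is projected to a 64-d hidden state with leaky ReLU,
    the four states are concatenated, and a linear layer gives Q."""
    hidden = [leaky_relu(W_in[k] @ inputs[k]) for k in SIZES]
    h = np.concatenate(hidden)           # shape (256,)
    return float((W_out @ h)[0])

inputs = {k: rng.random(d) for k, d in SIZES.items()}
q = q_value(inputs)
```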

The DQN was trained over 500 iterations using a replay buffer of size 50,000, which addresses the issue of correlated sequential samples119. In each iteration, 100 compounds were generated and stored in the replay buffer, and the Q-network was trained on 100 samples drawn randomly from the replay buffer using a smooth L1 loss and the Adam optimizer with a learning rate of 0.01. Initial exploration was encouraged using an ϵ-greedy policy starting with a high ϵ value of 0.99, which was decayed by a factor of 0.99 after each iteration to ensure eventual exploitation and convergence to the optimal policy. The discount factor γ was set to 0.9. Models were trained on a single NVIDIA RTX A5000 GPU and training a DQN agent takes 2-3 h.
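The replay buffer and ϵ-decay schedule can be sketched in a few lines of standard-library Python; the transition format is a placeholder and the class is illustrative, not the paper's implementation.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size FIFO buffer; random sampling breaks the correlation
    between sequentially generated transitions."""
    def __init__(self, capacity=50_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size, rng):
        return rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))

def epsilon_schedule(n_iters, eps0=0.99, decay=0.99):
    """ϵ value at each training iteration: eps0 * decay**i."""
    return [eps0 * decay ** i for i in range(n_iters)]

rng = random.Random(0)
buf = ReplayBuffer()
for i in range(300):  # placeholder transitions (state, action, reward, next_state)
    buf.push(("state", "action", float(i), "next_state"))
batch = buf.sample(100, rng)
eps = epsilon_schedule(500)
```

With these settings ϵ falls from 0.99 toward zero over the 500 iterations, shifting the agent from exploration to exploitation.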

DING model training

Code for the DING model implementation was adapted from the authors’ publicly released repository17. The CVAE models were trained based on default hyperparameters using a random 72/18/10 train/validation/test split over 150 epochs. The model with the best validation loss was used for each trial with reconstruction accuracy computed on the test set.

SMACT enumeration

With the five-component inorganic composition space estimated to exceed 10¹³ materials63, a comprehensive brute-force search across the five-component space is intractable; we approximated a brute-force search by limiting our search space to binary, ternary, and quaternary oxides, with up to three non-oxygen components and one oxygen component. Starting with our initial search space of 83 elements, we generated all combinations of the non-oxygen elements. We then used a custom multithreaded version of the smact.screening.smact_filter function with a stoichiometry threshold of 9 (to search for element coefficients of 1 to 9) to generate all possible binary, ternary, and quaternary oxide combinations for the initial element set. The SMACT filter function generates all possible compositions from the specified element set that pass both the SMACT charge neutrality and electronegativity balance filters. To make the search space more tractable, we randomly selected one coefficient combination per element combination, so for each unique element combination we generated at most one oxide containing those elements, barring any combinations rejected by the smact_filter function. For the final 87,088 generated oxides, we used the trained property prediction models to predict each of the target property values. For each target property, we sorted the generated composition list by the property of interest (ascending for bulk modulus, shear modulus, and band gap; descending for formation energy, sintering temperature, and calcination temperature) and selected the top 1,000 compositions.
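The charge-neutrality part of this screen can be illustrated with a simplified stand-in for smact_filter: for a fixed set of elements, enumerate all coefficient combinations from 1 to 9 and keep the charge-neutral ones. Unlike the real smact.screening.smact_filter, this sketch assigns a single oxidation state per element and omits the electronegativity-balance test; the oxidation-state table and function name are our own.

```python
from itertools import product

# One representative oxidation state per element (SMACT actually considers
# all known oxidation states and also applies an electronegativity test).
OX_STATES = {"Li": +1, "Mg": +2, "Al": +3, "Ti": +4, "O": -2}

def neutral_oxides(elements, max_coeff=9):
    """Enumerate charge-neutral coefficient combinations for the given
    non-oxygen elements plus oxygen, with coefficients 1..max_coeff."""
    species = list(elements) + ["O"]
    charges = [OX_STATES[e] for e in species]
    hits = []
    for coeffs in product(range(1, max_coeff + 1), repeat=len(species)):
        # Keep the combination if the total formal charge sums to zero.
        if sum(q * c for q, c in zip(charges, coeffs)) == 0:
            hits.append(dict(zip(species, coeffs)))
    return hits

combos = neutral_oxides(["Mg"])  # MgO, Mg2O2, ..., Mg9O9 under this model
```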

Template-based crystal structure prediction

Template-based crystal structure prediction was conducted using the authors’ publicly released repository, following the methodology detailed in Kusaba et al.90. Predictions were made using the provided ensemble of neural network models trained on the 33,153 stable compounds obtained from Materials Project. For each query composition, we predicted the top six candidate structures, ranked in descending order of template-match confidence. We then selected the subset of query compositions whose highest-confidence matched template crystal structure had a confidence of ≥ 80%. Crystal structures were visualized using the .cif files provided by the Materials Project database.