Introduction

The discovery of new materials with desirable properties is crucial to applications in areas such as energy storage and electronics manufacturing. However, virtual materials screening is still gated by time- and resource-consuming physics-based simulations1,2,3, high-throughput experimentation4,5,6,7,8, and the need for good learned representations of materials9,10,11,12. In recent years, machine learning (ML) has revolutionized inverse materials design, where given one or more target properties, a model is trained to generate candidate compounds that satisfy the desired objectives.

Generative neural networks such as autoencoders have been used to optimize organic molecules with desirable properties over a learned latent space13,14. However, optimizing over a non-convex objective function in a high-dimensional latent space is difficult15. Generative adversarial networks (GANs) have also seen success in molecular generation tasks. While GANs circumvent the need for latent space optimization, they suffer from issues such as training instability and mode collapse, which make it difficult to generate compounds with high diversity and validity16. In the inorganic composition space, variational autoencoders (VAEs), GANs, and autoregressive models have been leveraged to generate hypothetical inorganic materials17,18,19, 2D materials20, semiconductors21, reticular frameworks22, and cubic materials23. However, these prior works generate compositions satisfying materials property objectives by altering the distribution of the training set (e.g., only training on high band gap materials). This training strategy is data-inefficient, as it requires a comprehensive materials dataset of compositions exhibiting the property of interest within the target range. These prior works are also of limited applicability to real-world materials discovery, which often requires narrow synthesis or property windows or involves complex multi-objective tasks, for which it is difficult to curate large, high-quality materials science datasets containing sufficient data in the property regimes of interest. Furthermore, these works only consider materials property objectives without considering synthesis objectives for generated materials. Synthesis objectives such as synthesis temperature or time are important factors in inorganic materials design.
For instance, controlling calcination and sintering temperature and time is experimentally useful to influence materials properties (magnetic, electrical, optical, electrochemical), compound morphology (crystal structure, particle size, grain size, phase purity), and reaction yield12,24,25,26,27,28,29,30,31,32,33. Prior works also do not consider multi-objective reward functions, such as those jointly optimizing synthesis and property objectives, which are challenging for generative methods such as GANs and VAEs14,34.

Reinforcement learning (RL) is an ML approach which describes how intelligent agents take actions in an interactive environment to maximize an expected cumulative reward35. Recent advances in ML have paired the RL mathematical formalism of decision-making with advances in deep learning to train models which learn from complex, high-dimensional inputs and action spaces. RL approaches to molecular generation predict molecules with respect to a reward function. Prior work includes both policy gradient and Q-learning approaches to the design of organic molecules and has demonstrated these methods to be sample-efficient, stable during training, and high-performing15,36,37. Prior applications of RL to the inorganic materials space have been limited to specific case studies such as optimization of chemical vapor deposition synthesis conditions for MoS238, metamaterial design39, digital materials design40, metal-organic frameworks41, graphene oxide functional group distribution42, and nanopore design43. Our previous work44 showed initial promise for materials discovery using RL methods; to the best of our knowledge, no other prior work has leveraged RL for inorganic materials composition generation. Deep RL is particularly suited to guided materials generation tasks as it is adept at learning from the high-dimensional data commonly encountered in the materials space45. Specifically, Q-learning-based RL approaches are not constrained by prior knowledge or large initialization datasets, which is important in the data-scarce domains frequently encountered in materials science46. Policy gradient-based approaches can directly optimize a policy in large discrete and continuous action spaces, such as the space of all elements and coefficients, respectively46. Despite these advantages, there remains a dearth of ML-informed screening methods in the inorganic materials space which can suggest initial target compounds subject to both synthesis objectives and property objectives.
This is a crucial and necessary step for the advancement of high-throughput compound screening and laboratory automation, such as ML-accelerated self-driving laboratories4,5,6,8,47.

We use a data-driven RL approach to generate novel inorganic oxide materials satisfying both materials property (band gap, formation energy, bulk modulus, shear modulus) and synthesis (sintering temperature, calcination temperature) objectives. We present two RL approaches, a deep policy gradient network (PGN) based algorithm and a deep Q-network (DQN) based algorithm, which are employed to explore the inorganic chemical design space. We use template-based matching to propose potential crystal structures for the identified compositions of interest. We compare and contrast the two methods to highlight their respective benefits and shortcomings in different contexts of ML-enabled inorganic materials design. The RL-generated compositions in this work exhibit high validity, negative formation energy, and strong adherence to the target objectives. The proposed models outperform baseline ML methods in validity, diversity, and materials property satisfaction. Ultimately, this work proposes two key methods that can accelerate the discovery of novel materials satisfying multiple objectives to bridge synthesis-property design spaces.

Results

We frame the RL-assisted discovery of novel inorganic materials with desirable properties as a feedback loop (Fig. 1a). Each of the RL approaches developed in this work uses a generator model and a predictor model. The generator model generates novel, chemically valid materials satisfying one or more synthesis or property objectives. The predictor model is a supervised ML algorithm trained to predict materials synthesis or property values from a chemical composition. In our learning workflow, the generator model first suggests an inorganic material composition. The predictor model then assigns a reward to the generated chemical formula based on a user-specified reward function, such as maximizing or minimizing a particular property (e.g., maximize bulk modulus, minimize sintering temperature) or targeting a specific value (e.g., a band gap of 2.5 eV). The reward function can consist of a single objective or differently weighted multiple objectives depending on the preferences and end goals of the user. For instance, if an experimentalist wishes to design materials with a stringent objective of a large bulk modulus but a more forgiving objective of processing temperature, a reward function that places a strong emphasis on maximizing bulk modulus and a weaker emphasis on minimizing calcination temperature would be more appropriate.
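As a concrete illustration, this feedback loop can be sketched in a few lines of Python. Here `propose_composition` and `predict_property` are hypothetical stand-ins for the trained generator and predictor models (they are not the actual architectures used in this work), and the reward targets a band gap of 2.5 eV as in the example above.

```python
import random

TARGET_BAND_GAP = 2.5  # eV, the user-specified target

def propose_composition(rng):
    """Toy generator stand-in: returns a random oxide composition string."""
    metals = ["La", "Ti", "Sr", "Ba", "Zn"]
    return f"{rng.choice(metals)}2O3"

def predict_property(composition):
    """Toy predictor stand-in: deterministic pseudo-band-gap from the formula."""
    return (sum(map(ord, composition)) % 50) / 10.0  # eV, placeholder value

def reward(composition):
    """Higher reward the closer the predicted band gap is to the target."""
    return -abs(predict_property(composition) - TARGET_BAND_GAP)

# One turn of the loop: generate candidates, score them, keep the best.
rng = random.Random(0)
candidates = [propose_composition(rng) for _ in range(10)]
best = max(candidates, key=reward)
```

In the actual workflow, the reward signal is used to update the generator's parameters rather than simply to rank a fixed batch of candidates.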

Fig. 1: Overview of RL architectures and materials generation.
figure 1

a ML workflow for material design where an RL agent (material generator) is assigned rewards based on predicted material properties in a feedback loop. b Inorganic materials generation process. At each step of the generation sequence for both the policy gradient network (PGN) and deep Q-network (DQN), state s is represented by material composition while action a constitutes the addition of an element (e.g., La in step 1) and its corresponding composition (e.g., the coefficient of 3 in step 1). c Materials generation process and architecture overview of PGN. d Architecture overview of DQN (fc = fully connected layer, concat. = concatenation).

We formulate the material generation process as a sequence generation task where a trajectory is defined as τ = {s0, a0, s1, a1, …, aT−1, sT} with a horizon capped at T = 5 steps, as shown in Fig. 1b. Here, states s are represented by material compositions while actions a constitute the addition of an element and its corresponding composition to the existing incomplete material. At each step, an element (out of a set of 80 possible elements) and its corresponding composition (an integer from 0 to 9) are appended to the existing (or empty) material. If the corresponding composition is 0, the agent does not add the element. This approach allows the agent to generate materials containing ≤ 5 elements. In all experiments, we restrict generation to oxides: any element may be selected in the first four steps, while only oxygen is allowed at the final step.
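A minimal sketch of this generation scheme follows, assuming a small illustrative element subset rather than the full 80-element action space and a random policy in place of the trained agent; forcing a nonzero oxygen coefficient at the final step is an additional simplification to guarantee an oxide.

```python
import random

# Illustrative subset of the 80-element action space.
ELEMENTS = ["H", "Li", "Na", "Mg", "Al", "Si", "Ca", "Ti", "Fe", "La"]
HORIZON = 5  # T = 5 steps

def generate_composition(rng):
    """Build a composition step by step as a list of (element, coefficient)."""
    parts = []
    for step in range(HORIZON):
        if step == HORIZON - 1:
            # Final step: only oxygen, with a nonzero coefficient (assumption
            # made here so the sketch always yields an oxide).
            element, coeff = "O", rng.randint(1, 9)
        else:
            element, coeff = rng.choice(ELEMENTS), rng.randint(0, 9)
        if coeff > 0:  # a coefficient of 0 means "do not add this element"
            parts.append((element, coeff))
    return parts  # e.g., [("La", 3), ("Ti", 1), ("O", 6)]

rng = random.Random(42)
comp = generate_composition(rng)
```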

Here, we compare two different RL formulations: (1) a policy gradient approach, PGN, which aims to optimize the policy directly, and (2) a deep Q-learning approach, DQN, which aims to optimize the policy via learning a surrogate value function. In both approaches, we define the reward Rt at timestep t given values st and at as

$${R}_{t}\left({{\boldsymbol{s}}}_{t},{{\boldsymbol{a}}}_{t}\right)=\mathop{\sum }\limits_{i=1}^{N}{w}_{i}{R}_{i,t}\left({{\boldsymbol{s}}}_{t},{{\boldsymbol{a}}}_{t}\right)$$
(1)

where Ri,t is the reward from the ith objective and wi is the user-specified weight placed on the ith reward. For materials synthesis and property objectives, Ri,t = P(sT), where P is a predictor model for the particular objective. This reward formulation encompasses both single-objective (N = 1) and multi-objective (N > 1) tasks. We assign the reward Rt a value of zero at all non-terminal steps; a nonzero reward is only assigned at step 5 (the terminal step), when the compound is fully generated.
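The terminal-only, weighted reward of Eq. (1) can be sketched as follows; the predictor functions and weights below are hypothetical placeholders, not the trained models used in this work.

```python
def step_reward(composition, step, horizon, predictors, weights):
    """R_t = sum_i w_i * P_i(s_T) at the terminal step, 0 otherwise (Eq. 1)."""
    if step < horizon - 1:
        return 0.0  # non-terminal steps receive zero reward
    return sum(w * p(composition) for w, p in zip(weights, predictors))

# Hypothetical example: maximize bulk modulus, minimize sintering temperature.
# Minimization is expressed by negating the predictor's output.
predict_bulk_modulus = lambda c: 120.0  # GPa, placeholder predictor
predict_sinter_temp = lambda c: 900.0   # degrees C, placeholder predictor
predictors = [predict_bulk_modulus, lambda c: -predict_sinter_temp(c)]
weights = [0.7, 0.3]

r = step_reward("La2Ti2O7", step=4, horizon=5,
                predictors=predictors, weights=weights)
```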

Supplementary Fig. 1 depicts the materials property and synthesis parameter distributions across the preprocessed inorganic oxide data extracted from the Materials Project48. We select four materials property objectives (formation energy, bulk modulus, shear modulus, band gap) and two materials synthesis objectives (sintering temperature, calcination temperature). This subset contains objectives relevant to materials properties and their applications (negative formation energy; band gap: electronic devices, semiconductors; bulk/shear modulus: building materials, superhard materials) as well as synthesis objectives (calcination/sintering temperature). Lower processing temperatures are desirable due to energy savings, the realization of metastable structure types and desirable materials properties, and manufacturing requirements49,50,51,52,53,54. For example, in the processing of solid-state electrolytes, lower synthesis temperatures can reduce interdiffusion and improve battery component compatibility55,56. As the dataset consists of oxides, the majority of the compounds in the dataset have formation energies below zero and band gaps greater than zero. Sintering temperatures generally skew higher than calcination temperatures: in calcination, a mixture of compounds is heated to a high temperature to remove impurities and unwanted volatile substances, often through thermal decomposition, while in sintering, a compound is heated at a still higher temperature to induce densification and grain growth57,58,59.

Initiating models

We build upon the PGN model architecture proposed by Popova et al.36 to generate inorganic materials with desired properties. We frame this task as optimizing the parameters of a policy network to maximize the expected reward, which is calculated as the sum of rewards assigned to predicted material properties and synthesis parameters. The reward function (Supplementary Table 1) considers all possible terminal states and is approximated through sampling from the generative model.

Our PGN approach (Fig. 1c) employs a stack-augmented recurrent neural network (stack-RNN) as the generative model, which is well-suited for sequence prediction tasks such as generating complex molecules and materials due to its ability to effectively capture sequential patterns36,60,61. This is particularly advantageous in the context of inorganic materials generation, where the generator model must understand complex chemical concepts like ionic charge and electronegativity to predict chemically valid compounds. We initiate the generation process with a <START> token and iteratively predict the next element and composition until reaching a termination condition signaled by an <END> token or reaching a maximum time horizon. The final reward is estimated based on the generated material using a modified reward function. This comprehensive approach provides a framework for systematically generating compounds that not only meet synthesis and property objectives but also adhere to chemical constraints.

In the context of materials design, we also train a DQN agent to maximize expected rewards to generate compounds with desired properties. Similar to the PGN, the DQN generates materials which are then evaluated by a trained material property predictor. The DQN state representation (Fig. 1d) consists of the material composition and the current step of the generation process. The material composition is featurized using the Magpie62 framework, yielding a vector consisting of stoichiometric properties, elemental properties, electronic structure features, and ionic features. Moreover, the representation includes the current step to account for time-dependent policies, which have been shown to outperform time-independent policies15. The action space of the DQN mirrors that of the PGN and comprises an element and element coefficient component encoded as one-hot vectors.
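The one-hot action encoding described for the DQN can be sketched as below; the specific element ordering and index used in the example are assumptions for illustration, and the full state would additionally concatenate the Magpie composition features and the current step.

```python
import numpy as np

N_ELEMENTS = 80  # size of the element set described in the text
N_COEFFS = 10    # integer coefficients 0-9

def encode_action(element_idx, coeff):
    """One-hot encode (element, coefficient) and concatenate the two vectors."""
    elem_onehot = np.zeros(N_ELEMENTS)
    elem_onehot[element_idx] = 1.0
    coeff_onehot = np.zeros(N_COEFFS)
    coeff_onehot[coeff] = 1.0
    return np.concatenate([elem_onehot, coeff_onehot])  # shape (90,)

# Hypothetical example: the element at index 56 with coefficient 3.
a = encode_action(element_idx=56, coeff=3)
```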

To compare our modeling approaches to a suitable baseline, we implement three additional methods and investigate performance. The first is a random agent which selects actions at random and is used to ensure that the RL models are learning from the data.

The second is an implementation of “Deep learning enabled INorganic material Generator” (DING), a conditional variational autoencoder (CVAE) model by Pathak et al. which was demonstrated to generate inorganic material compositions with targeted formation energy, volume per atom, and energy per atom17. We adapt the model to target the synthesis and materials objectives selected in this work. Other general ML-enabled inorganic materials composition generation methods such as refs. 18,19 are not suitable baselines, as they rely on altering the training set distribution to generate materials with specified properties. Finally, we leveraged SMACT63, a Python library of rapid screening and informatics tools that uses chemical rules for high-performance computational screening, to conduct a comprehensive brute-force search across the composition space as a further point of comparison. We note that a brute-force enumeration with SMACT is an approximate upper bound for target property performance, as it systematically generates all composition combinations in the search space rather than conducting an informed search.
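To give a sense of why brute-force enumeration scales poorly, a toy exhaustive search over a tiny composition space might look like the following; this illustrates the combinatorics only and is not SMACT's actual API.

```python
from itertools import product

# Tiny illustrative search space: 4 metals, coefficients 1-3, always with
# oxygen as the final component. The real four- and five-component spaces
# contain > 10^10 and > 10^13 combinations even after chemical filtering.
elements = ["Li", "Na", "Mg", "Al"]
coeffs = range(1, 4)

compositions = [
    ((el1, c1), (el2, c2), ("O", c3))
    for (el1, el2) in product(elements, repeat=2)
    for (c1, c2, c3) in product(coeffs, repeat=3)
    if el1 != el2  # no repeated elements
]
# Even this toy space has 12 element pairs x 27 coefficient triples = 324 entries.
```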

Single-objective materials generation

Our example selected single-objective tasks are:

  1. Minimize formation energy per atom (Form. energy): compound stability relative to its elemental constituents.

  2. Maximize bulk modulus, shear modulus (Bulk mod., Shear mod.): maximize mechanical strength of material, important for structural applications and superhard or ultraincompressible materials.

  3. Maximize band gap (Band gap): large band gap materials are desirable for applications such as wide-bandgap semiconductors and other electronic materials.

  4. Minimize sintering, calcination temperatures (Sinter temp., Calcine temp.): lower temperatures are associated with lower energy consumption costs and greener synthesis pathways.

In Fig. 2 we apply t-distributed stochastic neighbor embedding (t-SNE) dimensionality reduction to 1,000 generated compounds from each of the four modeling approaches across the different objectives after training. When these plots are colored by their target objective (excluding Fig. 2d, as the random agent follows a random policy), clusters emerge. Fig. 2a and b show clear clustering of the generated compounds based on the target objective for the PGN and DQN approaches, respectively. In comparison, we see a higher degree of overlap between the compounds generated (Fig. 2c) using the different objectives for the DING baseline model. The higher degree of clustering by target objective supports that the PGN and DQN models more effectively explore the high-dimensional space and identify desirable compositions which maximize the target reward function as compared to the DING model. Moreover, the clusters identified by the PGN model span a larger portion of the lower-dimensional space as compared to the DQN model. Similar trends are observed when we plot the principal component analysis (PCA) dimensionality reduction of the generated compounds (Supplementary Fig. 2). To quantify these trends, we computed the cluster variances of each of the target objective clusters (Supplementary Table 2) in the original 145-dimensional feature space as well as after t-SNE and PCA dimensionality reduction for the three ML modeling techniques. In nearly all cases in both the original and reduced feature spaces, we find that DING clusters exhibit higher variance than PGN or DQN clusters. For instance, DING cluster variance is on average across all tasks approximately 57% and 13% higher than PGN and DQN, respectively, in the original feature space, 20% and 40% higher (respectively) after t-SNE, and 70% and 58% higher (respectively) after PCA. These results suggest that PGN and DQN better learn meaningful information from the reward function about the relationship between composition and the target property.

Fig. 2: t-SNE dimensionality reduction for generated compositions.
figure 2

Generated inorganic materials spaces as visualized by t-SNE dimensionality reduction for six tasks for the a PGN, b DQN, c DING, d random modeling approaches over 1,000 sampled compounds. Compounds are colored by their respective task.

Supplementary Fig. 3 depicts the differences in progression of the t-SNE plots over training for the PGN and DQN. Compounds generated by PGN during training cluster more broadly across the lower-dimensional space as compared to compounds generated by DQN which cluster more tightly, many of which do not overlap between the two strategies. Differences in RL training strategy result in unique subsets of identified compounds for the same training objectives. All three ML models (PGN, DQN, DING) learn more meaningful representations of the inorganic compositions and their mapping to the synthesis and property tasks than the random baseline.

From Fig. 2a–c, the synthesis objectives (calcination and sintering temperature) and formation energy objective clusters are more widely distributed across the low-dimensional space and overlap with the other materials property objective clusters. Formation energy and calcination/sintering temperature are broader materials science concepts which are more difficult to explicitly define and isolate, as they are correlated to other materials properties as well as crystal structure, electronic structure, and processing conditions. In comparison, bulk/shear modulus and band gap are well-defined for materials, and a number of composition families have been closely associated with large band gap (wide-bandgap semiconductors, insulators) and high bulk/shear modulus (superhard/ultraincompressible compounds), so these properties naturally cluster more closely64,65,66,67. Bulk and shear modulus also cluster more closely to each other than the other selected objectives because they are correlated properties in condensed matter physics. These observations correspond well to physical intuition and support that the ML models have learned general chemical guidelines governing materials composition, property and synthesis conditions.

To evaluate how well the generated compounds adhere to the objective of interest, we show in Fig. 3a the average target property values over all modeling strategies for each of the single-objective tasks. Three of the tasks target property maximization (bulk modulus, shear modulus, band gap) and the other three target property minimization (formation energy, sintering temperature, calcination temperature). We evaluate model performance by predicting target property values of the generated inorganic compositions using the trained property prediction models. For maximization tasks, a higher value of the target property is desired, while for minimization tasks, a lower value is desired. A comprehensive set of performance metrics for all four modeling approaches (PGN, DQN, DING, random) and SMACT enumeration can be found in Supplementary Tables 3–7, respectively.

Fig. 3: Materials property, validity, and diversity metrics for generated compositions.
figure 3

Average (a) (left) target property and (right) formation energy for independent target objectives, (b) charge neutrality and electronegativity balance, (c) uniqueness and Element Mover's Distance of generated compounds for each modeling approach. Metrics are computed for 1,000 generated compounds. Error bars are calculated as the standard deviation over three trials.

The DQN model performs the best for bulk/shear modulus and sinter/calcination temperature tasks, while PGN is most successful in the band gap and formation energy tasks. Both PGN and DQN outperform both the DING model and random baselines on all tasks, demonstrating the capability of the RL frameworks to effectively generate inorganic compositions satisfying property objectives. A training progression of the PGN over time (Supplementary Figs. 4-5) supports this observation. The RL approach allows the model to explore the high-dimensional composition space over time and iteratively locate inorganic compositions better satisfying the given objective, indicated by gradually optimizing the property value with lower average loss and higher average reward. Surprisingly, in certain tasks such as bulk/shear modulus, DING does not outperform the random baseline, motivating the need for more intelligent ML-based approaches to inorganic compound discovery. Bulk and shear modulus are complex physical phenomena with experimental data available for only a small subset of known inorganic compounds48,68, and ML predictor models are most reliable in the composition regime they are trained on. DING is empirically observed to generate less realistic compositions than PGN and DQN by exploiting the higher uncertainty regimes of the predictor model in these tasks, leading to lower performance.

In comparison to the brute-force search (Fig. 3, Supplementary Tables 3–7), which by nature is best-performing as it finds a near-optimal solution, PGN and DQN perform well across the different property tasks. With the SMACT enumeration as a reference point, we find absolute differences of 1.05 eV/atom, 41.1 GPa/0.19 log GPa, 43.2 GPa/0.39 log GPa, 1.29 eV, 90.2 °C, and 68.8 °C for formation energy, bulk modulus, shear modulus, band gap, sintering temperature, and calcination temperature, respectively, between the best-performing model and the exhaustive search. As calculated in Davies et al.63, even after the application of chemical filters such as charge neutrality and electronegativity balance, the space of four-component inorganic materials exceeds 10^10 combinations and the five-component space is estimated to exceed 10^13 combinations. Brute-force search is useful in smaller, more limited composition spaces where the number of components is small. RL-based methods can excel in larger, more complex search spaces where a complete enumeration of compositions is computationally expensive and more targeted exploration is desirable. In this light, it is encouraging that the RL models perform within a close range of the brute-force approach.

For each task, we evaluate model performance based on metrics related to the material discovery process. We focus on charge neutrality and electronegativity balance, two fundamental rules of inorganic compounds that have been used to filter out chemically implausible compositions18,19,20,63,69. We evaluate the charge neutrality and electronegativity balance of generated compounds following the methods in refs. 18,63 by querying each model for 1,000 inorganic compositions and evaluating the percentage adherence to each rule. In Fig. 3b, we compare the percentage of generated compounds that adhere to the charge neutrality and electronegativity balance rules, respectively. PGN consistently outperforms the other three approaches in both metrics, achieving > 90% charge neutral and electronegativity balanced compounds. In comparison, the DQN model shows greater charge neutrality than DING for the majority of the objectives but has comparable or better electronegativity balance than DING in only around half of the tasks. Furthermore, both PGN and DQN outperform the random agent baseline in terms of charge neutrality (random = 0.59 ± 0.01) and electronegativity balance (random = 0.42 ± 0.01) for the majority of the tasks, showing their utility in learning important chemical guidelines to generate valid inorganic materials. SMACT brute-force enumeration is guaranteed to generate compositions passing the charge neutrality and electronegativity balance checks.
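A simplified charge-neutrality filter in the spirit of these checks might look like the following; this is not SMACT's actual implementation, and the oxidation-state table is a small illustrative subset.

```python
from itertools import product

# Illustrative oxidation states for a handful of elements (not exhaustive).
OX_STATES = {"La": [3], "Ti": [2, 3, 4], "O": [-2], "Li": [1], "Fe": [2, 3]}

def is_charge_neutral(composition):
    """Return True if some assignment of oxidation states sums to zero.

    composition: list of (element, coefficient) pairs, e.g. [("La", 2), ("O", 3)].
    """
    elements = [el for el, _ in composition]
    coeffs = [c for _, c in composition]
    state_choices = [OX_STATES[el] for el in elements]
    # Try every combination of allowed oxidation states.
    return any(
        sum(q * c for q, c in zip(states, coeffs)) == 0
        for states in product(*state_choices)
    )

neutral = is_charge_neutral([("La", 2), ("O", 3)])      # La2O3: 2(+3) + 3(-2) = 0
not_neutral = is_charge_neutral([("La", 1), ("O", 1)])  # LaO: +3 - 2 != 0
```

The electronegativity-balance check is analogous: for the chosen oxidation states, cations must be less electronegative than anions.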

We investigate model performance through measuring the diversity of generated compounds. Generating a diverse selection of inorganic compounds that satisfy the desired objectives is crucial for experimentalists for applications in high-throughput screening and materials discovery. We compute the percentage of unique compounds generated in a fixed sample size of 1,000 attempts and compare between modeling approaches (Fig. 3c). Both DQN and the random agent obtain 100% uniqueness, which is not surprising because the random agent always takes random actions (virtually ensuring all generated compounds are unique) and the DQN samples from the top 20% of ranked actions at each step to introduce stochasticity in materials generation. SMACT enumeration is also guaranteed to have 100% uniqueness because we compute all possible combinations of elements. In the PGN formulation, each action is sampled from a multinomial distribution over the probability of the next action given the previous state pΘ(at|st−1), which introduces stochasticity but to a lesser degree. However, the PGN approach still outperforms DING in uniqueness percentage over the majority of the tasks, showcasing its ability to generate compounds that are both unique and exhibit desirable properties.

While uniqueness is a practical metric, it falls short in that many compounds can be strictly unique yet chemically similar in composition. To quantify chemical diversity, we leverage the Element Mover's Distance (EMD)70, a well-defined distance metric for chemical compositions which enables the measurement of inorganic composition similarity. We plot in Fig. 3c the average EMD for each of the modeling strategies over 1,000 generated compounds, with higher EMD signifying greater diversity. The random agent again exhibits the highest diversity because it always selects random actions. On average, DQN and DING show similar EMD ranges, with PGN exhibiting moderately lower EMD diversity for all objectives. SMACT brute-force enumeration displays EMD scores similar to the RL-based approaches, indicating that the RL models achieve the same level of composition diversity as exhaustive search and are neither over- nor under-targeting a particular subset of compositions. We note that it is possible to have high uniqueness and low EMD diversity (or vice versa), as in the PGN formation energy and band gap tasks, where uniqueness differs significantly by 30% while EMD is similar (∆EMD ≈ 1.5). For instance, the model could have found an optimal compositional subspace containing slight variations in chemical formula (e.g., doped versions of the same compound), which could be beneficial or detrimental depending on the use case. In early stage screening, an experimentalist might favor generation of highly diverse compound families for further downstream exploration, while in more stringent use cases variations of specific compound families may be preferred for compatibility or resource requirements. There is also a clear trade-off for the PGN model between generated compound diversity and property satisfaction.
As training progresses, the model further hones in on the optimal compositional subspace which results in the maximum reward, causing property satisfaction to increase over time (Supplementary Fig. 4) but uniqueness and EMD to decrease over time (Supplementary Fig. 5c). Hence, the experimentalist can tune such training hyperparameters to preferred values depending on the use case of interest.
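The intuition behind EMD can be illustrated with a simplified one-dimensional sketch; this is not the ElMD library itself, and the elemental scale below is illustrative rather than the modified Pettifor scale the real metric uses. Each composition is treated as a normalized weight distribution over positions on the scale, and the distance is the 1D Wasserstein distance, computed as the area between the two cumulative distributions.

```python
import numpy as np

# Illustrative (hypothetical) 1D positions for a few elements.
SCALE = {"Li": 1.0, "Na": 2.0, "Mg": 3.0, "Al": 4.0, "O": 8.0}

def emd_1d(comp_a, comp_b):
    """1D earth mover's distance between two compositions.

    comp_*: dict of element -> stoichiometric coefficient.
    """
    grid = np.array(sorted(SCALE.values()))

    def cdf(comp):
        w = np.zeros_like(grid)
        total = sum(comp.values())
        for el, c in comp.items():
            w[np.searchsorted(grid, SCALE[el])] += c / total  # normalized weight
        return np.cumsum(w)

    ca, cb = cdf(comp_a), cdf(comp_b)
    widths = np.diff(grid)
    # Area between the two CDFs = 1D Wasserstein distance.
    return float(np.sum(np.abs(ca - cb)[:-1] * widths))

d = emd_1d({"Li": 2, "O": 1}, {"Na": 2, "O": 1})  # small: Li and Na are adjacent
```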

We also evaluate model performance by measuring the predicted formation energies of generated compounds from each approach. Formation energy is a measure of the driving force to form an inorganic compound from its constituent elements, with lower (more negative) formation energies signifying a larger driving force. We note that a comprehensive comparison to neighboring compounds on the convex hull would be necessary to determine true compound stability71. For each model, we generate 1,000 inorganic compositions and use a machine-learned formation energy predictor model to predict their formation energies. Fig. 3a (right) depicts the average formation energies of compounds generated by models targeting various independent objectives: the x-axis corresponds to the independent objective targeted by the model and the y-axis corresponds to the average formation energy over 1,000 generated compositions. From Fig. 3a, explicitly biasing the models towards lower formation energy causes compounds generated by PGN, DQN, and DING to have more negative average formation energy compared to the random baseline, with PGN and DQN outperforming DING. This behavior is expected, as the models are either rewarded for (in the case of PGN and DQN) or conditioned to (in the case of DING) generate compounds with more negative formation energies. However, PGN-determined formation energy generally becomes more negative during training in all objectives (Supplementary Fig. 5b), even when formation energy is not rewarded. Furthermore, PGN generates compounds with significantly lower formation energies than DQN or DING over all single-property tasks (Fig. 3a), indicating that PGN can generate compounds that both have negative formation energies and exhibit the target property. DQN generates compounds with more negative formation energies than DING and random when it is explicitly rewarded but generates compounds of similar or more positive formation energies in all other tasks.
This is likely because in a number of cases DQN is more adept at locating composition subspaces better satisfying the target property (Fig. 3a) than PGN or DING, some of which may contain larger concentrations of unstable or metastable chemical formulas. We note that the random baseline is a strong baseline for formation energy, as our compound space is restricted to oxides which are generally stable compounds. It is therefore unsurprising that compounds generated by DQN and DING have comparable or less negative formation energies than the random agent in the cases where formation energy is not targeted. Furthermore, PGN and DQN formation energies are within close proximity to brute-force search. Excluding the case when the target property is explicitly formation energy, PGN generates compositions with a lower average formation energy than SMACT enumeration for four out of the five tasks.

Multi-objective materials generation

In materials discovery, experimentalists typically seek to design materials that exhibit multiple desirable properties, so we also use this RL framework to learn joint objectives. The generation protocol and evaluation metrics used in multi-objective tasks to analyze model performance are similar to those of the single-property tasks. The reward function is now a linear combination of individual rewards for each objective (Eq. 1), where the wi are hyperparameters that weigh the relative importance of the individual objectives in the joint task. For a materials generation task with two distinct objectives, this reduces to a single weight w. We investigate the influence of varying w for two selected objectives, sintering temperature (Tsint) and bulk modulus (K), where increasing w prioritizes maximizing bulk modulus and decreasing w prioritizes minimizing sintering temperature:

$${R}_{t}\left({{\boldsymbol{s}}}_{t},{{\boldsymbol{a}}}_{t}\right)=-\left(1-w\right){T}_{{sint}}+{wK}$$
(2)

Equation 2 contains a negative sign on the weight for Tsint because we aim to minimize sintering temperature. We selected the minimization of sintering temperature and the maximization of bulk modulus as our multi-objective task. High sintering temperatures increase production costs, maintenance fees, and equipment costs while making quality control more difficult72. As a result, materials with ultra-low sintering temperatures are desirable to save energy, reduce processing time, and enable further integration with semiconductors such as silicon or GaAs, metals, and polymer-based substrates73,74. Bulk modulus is a crucial parameter in condensed matter physics that measures volumetric elasticity and closely correlates with hardness and toughness75,76,77. “Ultraincompressible” and “superhard” materials with large bulk moduli, which demonstrate high incompressibility or extreme hardness, are of great interest to the broader materials science community in applications such as replacing diamond in cutting tools, abrasives, coatings, and other high-pressure devices78,79,80.
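
The weighted reward of Eq. 2 can be written directly as a function. Note that in practice the paper's reward operates on model-predicted (and typically scaled) property values; the raw units below are used only to illustrate the sign convention:

```python
def multi_objective_reward(t_sint, bulk_modulus, w):
    """Weighted two-objective reward from Eq. 2: R = -(1 - w) * T_sint + w * K.

    Larger w favors maximizing bulk modulus K; smaller w favors minimizing
    sintering temperature T_sint (hence the negative sign on its weight).
    """
    return -(1.0 - w) * t_sint + w * bulk_modulus

# At a low weight w = 0.2, a low-sintering-temperature compound earns a
# higher reward than a stiffer but harder-to-sinter one:
r_low_t = multi_objective_reward(t_sint=700, bulk_modulus=50, w=0.2)   # -550.0
r_high_k = multi_objective_reward(t_sint=1400, bulk_modulus=200, w=0.2)  # -1080.0
```

Sweeping w from 0 to 1 traces out the trade-off between the two objectives discussed below.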

Fig. 4 depicts the multi-objective property distributions of materials generated by PGN, DQN, and DING, color-coded by the objective weight w. As w increases, PGN and DQN prioritize maximizing bulk modulus over minimizing sintering temperature, as observed by the property distribution of the generated materials shifting up (larger bulk modulus) and to the right (higher sintering temperature). When w = 0.2, the generated materials skew towards low bulk modulus and low sintering temperature, as the majority of the reward favors minimizing sintering temperature. When w = 0.4 or w = 0.6, PGN and DQN generate materials with simultaneous improvements in both material properties over the baseline, as shown by the distribution shift to the upper-left corner of the subplots. Here, a more equally proportional weight is placed on each of the reward objectives of minimizing sintering temperature and maximizing bulk modulus. DQN materials have higher rates of validity (charge neutrality and electronegativity balance) but lower diversity (EMD). However, when w = 0.2 or w = 0.4, PGN and DQN can over-prioritize minimizing sintering temperature at the expense of bulk modulus. This illustrates the trade-off between the two properties, resulting in a characteristic Pareto front in multi-objective optimization, where w is a parameter that can be tuned according to the application of interest.

Fig. 4: Multi-objective property distributions.
figure 4

Property distributions generated by multi-objective materials generation for a PGN, b DQN, and c DING models. Multi-objective training is conducted under different objective weights, where w is the weight between the synthesis and property rewards in the multi-objective reward function. Each subplot contains 1,000 generated compositions. Starred compositions indicate near Pareto-optimal compounds with ≥ 80% confidence matched template crystal structures.

Multi-objective materials generation metrics encompassing validity, diversity, and properties of the materials generated by each of the agents are described in Supplementary Table 8. PGN and DQN outperform DING and the random agent in all metrics except EMD for generation of inorganic oxides with both high bulk modulus and low sintering temperature. DING evidently struggles to learn more selective composition spaces such as that required by a dual property objective, as the average bulk modulus and sintering temperature of DING-generated compounds are nearly unperturbed by changing the weight between the two target properties. In contrast, for both PGN- and DQN-generated compounds the sintering temperature and bulk modulus are directly influenced by changing the reward weight, with decreases in sintering temperature of 312 °C and 111 °C and increases in bulk modulus/log bulk modulus of 26.3 GPa/0.27 log GPa and 132.9 GPa/1.07 log GPa for PGN and DQN, respectively. The DQN model generated compounds with lower sintering temperatures and higher bulk moduli for w = 0.6 and w = 0.8, while the PGN model performed better in the w = 0.2 and w = 0.4 regimes. In terms of formation energy, the PGN model was able to consistently generate compounds with comparable or lower formation energy, higher charge neutrality, and higher electronegativity balance percentages across all objective weights for the multi-target task. Policy gradient pretraining allows the underlying recurrent neural network to learn a general understanding of the composition rules for forming stable, chemically valid inorganic compositions from a large dataset, while Q-learning relies only on exploration/exploitation and is not constrained by these rules. Compared to the inorganic materials subspace generated by the random policy (Supplementary Fig. 6), which spans the majority of the property space, the ML model architectures learn a more informed and restricted mapping of inorganic composition to desired materials properties.

Multi-objective RL provides a useful way to elucidate the effects of different chemistries in the laboratory by aiding experimentalists with initial suggestions on how to achieve or alter desired materials properties and synthesis parameters through compositional design choices. From an analysis of the generated compounds, we can visualize broader trends in how composition may affect the target properties by examining commonly found elements in regions of interest (Fig. 5). From these insights, we can qualitatively evaluate the extent of information loss between the property prediction models and the trained RL models. If the visualized trends did not corroborate expectations from materials theory or prior works, there may have been significant information loss during training, or the RL models may have learned from noise in the data. In particular, we expect the RL models to learn the expected trends in the data concerning the relationship between composition and materials properties. For the composition-property space learned by the PGN (Fig. 5a), we see a clear split between the upper-right and lower-left sides of the space. The PGN has learned that Fe-containing compounds tend to have a higher predicted bulk modulus and higher sintering temperature, while compounds containing Li tend to have a lower sintering temperature and lower predicted bulk modulus. Other smaller pockets in the space provide some insight as well, such as the addition of V and Na potentially leading to lower values of bulk modulus and sintering temperature.

Fig. 5: Visualizing the multi-objective RL composition space.
figure 5

Most commonly found elements in compounds generated by a PGN and b DQN models with w ∈ {0.2, 0.4, 0.6, 0.8} spanning the synthesis-property space. Elements are color-coded by their identity. Dark blue squares lacking a label indicate insufficient data to plot.

A similar investigation of the DQN model (Fig. 5b) exemplifies how the exploration/exploitation nature of Q-learning and stochastic sampling of the top actions allows for a wider diversity of elements to be probed for reward maximization. Bi, Rb, and Au are correlated with lower sintering temperature, likely due to correlations and trends in bonding strength and selective doping12,49,54,81, while Os, Fe, Si, Ge, and Pt are found to increase bulk modulus of generated compositions. Our results correspond well with previous works which identify trends in composition, bulk modulus and sintering temperature12,75,82,83. Moreover, our findings are consistent with empirical design rules in materials science for ultraincompressible and superhard materials which suggest that transition metal compounds are promising candidates for ultraincompressible materials84. However, sintering temperature and bulk modulus are inherently complex material characteristics, which can be determined by a combination of crystal structure, electronic structure, method of synthesis, and anthropogenic factors such as experimentalist bias12,81,85,86. Nevertheless, such a tool offers beneficial initial insights for both forward (assessing the impact of a specific additive on resulting properties) and inverse (determining the suitable additives to achieve a particular synthesis condition) inorganic materials design. While this analysis aims to highlight learned correlations without introducing additional assumptions, we hope these findings can be used in conjunction with model explainability methods such as SHAP87 or LIME88 on the property prediction models themselves to reveal important composition-property relationships.
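
The element-frequency analysis underlying Fig. 5 can be sketched as follows: for each cell of the sintering-temperature/bulk-modulus plane, count element occurrences over the generated formulas falling in that cell and report the most common one. `most_common_element` is an illustrative helper (not from the paper), shown here for a single hypothetical bin; oxygen is excluded since every oxide contains it:

```python
from collections import Counter
import re

def most_common_element(compositions):
    """Most frequent non-oxygen element across a set of generated formulas."""
    counts = Counter()
    for formula in compositions:
        # Split a formula like "LiFeO2" into (element, coefficient) pairs.
        for element, _ in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
            if element != "O":  # oxygen is present in every oxide
                counts[element] += 1
    return counts.most_common(1)[0][0]

# e.g. a hypothetical low-T_sint / low-K bin dominated by Li-containing oxides:
top = most_common_element(["Li2O", "LiFeO2", "Li3VO4", "Na2O"])  # "Li"
```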

Template-based crystal structure prediction

Due to the rapid growth in machine learning-enabled materials screening, recent works10,89,90,91,92,93 have made strides in predicting structural properties of inorganic materials solely given their composition. These methods take as input an inorganic composition and predict structural properties such as Bravais lattice, space group, and lattice parameters. To supplement the high-quality hypothetical compositions screened by our models, we adapt the template-based crystal structure prediction algorithm proposed in Kusaba et al.90 to obtain hypothetical substitution-based crystal structures for select generated Pareto-efficient novel materials. This structure prediction method relies on metric learning, an ML-based algorithm which automates the selection of template structures from a crystal structure database with high chemical replaceability to best match the unknown stable structure for a target chemical composition. For each target composition, the classifier predicts the closest match crystal structure and a percentage score which quantifies the confidence in the prediction.

We highlight a subset of compounds in Fig. 4a, b which are PGN- and DQN-generated near Pareto-optimal compositions satisfying both target objectives (large bulk modulus and low sintering temperature), ≥ 80% confidence matched template crystal structures, charge neutrality and electronegativity balance, negative predicted formation energy per atom, and no existing entry in the Materials Project composition dataset used in this work. We additionally show in Fig. 6 template crystal structure matches for a subset of the near Pareto-optimal inorganic compositions, with their predicted bulk moduli, sintering temperatures, formation energies, space groups, closest matching Materials Project ID and composition, and percent confidence match tabulated in Table 1. The selected compositions exhibit high predicted bulk modulus (≥ 4 log GPa/≥ 50 GPa), low predicted sintering temperature (< 900 °C), negative predicted formation energy per atom, and high confidence (> 85%) crystal structure matches. Another utility of the crystal structure matching is the connection to experimentally observed compounds sharing similar structures. For instance, the best match structure suggested for Ge2OsO5 by the crystal structure prediction algorithm has been experimentally synthesized for its template composition CrPb2O5, which could provide additional useful information in terms of synthesis route and conditions for experimentalists attempting to realize the new analogue in the laboratory94.
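
The screening criteria used to star compounds in Fig. 4 can be expressed as a simple predicate. `passes_screen` and the candidate dictionary keys are hypothetical names introduced for illustration; the thresholds follow the text:

```python
def passes_screen(candidate):
    """Screen a generated composition against the criteria described in the text:
    confident template match, valid chemistry, negative predicted formation
    energy per atom, and novelty with respect to the Materials Project subset.
    """
    return (candidate["match_confidence"] >= 0.80
            and candidate["charge_neutral"]
            and candidate["electroneg_balanced"]
            and candidate["formation_energy_per_atom"] < 0.0
            and not candidate["in_materials_project"])

good = passes_screen({
    "match_confidence": 0.91, "charge_neutral": True,
    "electroneg_balanced": True, "formation_energy_per_atom": -1.2,
    "in_materials_project": False,
})  # True
```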

Fig. 6: High confidence, near Pareto-optimal template matched crystal structures.
figure 6

Template crystal structure matches for a subset of the near Pareto-optimal inorganic compositions predicted by the PGN (a, b) and DQN (c, d) models. Crystal structure matches for a LiCr2FeO6, b NaLiMoO4, c Ge2OsO5, d MnPt3O4.

Table 1 Predicted bulk moduli, sintering temperatures, formation energies, space groups, closest matching Materials Project ID (MP-ID) and composition, and percent confidence match for a subset of the near Pareto-optimal compositions

We have demonstrated how RL-based screening methods can identify promising inorganic compositions constrained by target properties of interest. These target compositions can, along with hypothetical crystal structure matches, be used as starting points for further downstream computational calculations (i.e., Density Functional Theory (DFT) and first-principles methods) and experimental validation in the laboratory. ML methods are thus able to aid in filtering out unsuitable candidate compositions and suggest plausible structures to reduce computational and experimental burden and increase materials discovery throughput. The substitution-based crystal structure predictions can be used to filter and constrain the prototype structures used as starting points in first-principles calculations such as DFT calculations which can be prohibitively expensive for high-throughput screening approaches. Moreover, by providing predicted symmetry and structure parameters, these predictions can help significantly streamline the process of structure refinement from experimental data such as powder X-ray diffraction measurements90,91,95.

Discussion

We compare the strengths and shortcomings of the different modeling strategies discussed here to identify where different approaches may excel, as the contexts in which they are deployed could certainly differ. One important metric to evaluate is sample efficiency. Value-learning methods can use off-policy learning and experience replay to boost sample efficiency, while policy gradient-based methods cannot. Another trade-off to consider is the bias-variance trade-off, which impacts the optimality and diversity of the generated compositions. The PGN suffers from variance of gradient estimates, especially when directly sampling from high-dimensional action spaces96. On the other hand, the DQN suffers from bias, as it is overly optimistic when actions are chosen based on maximizing an approximate value function97. This can lead to differences in how diverse (how many different elements and composition families are explored) and how optimal (how well the generated compound distribution satisfies the target objectives) the resulting learned distributions are. Another crucial comparison is between the training methods of each modeling strategy. The PGN requires pretraining of an unbiased model on a large dataset (10³–10⁵ compositions) of inorganic materials, which is beneficial in data-rich use cases like inorganic oxides or alloys but could prove difficult in a data-scarce domain where the composition space is narrow18,98,99. The DQN model, however, can learn simply from trial and error since it uses exploration/exploitation, making it more suitable for materials science domains where the number of known compounds is limited to the tens or hundreds, such as thermoelectrics or solid-state electrolytes100,101,102.

From the performed experiments, the PGN model excels in generating compounds that conform well to the chemical guidelines we have outlined (charge neutrality, electronegativity balance, negative formation energy) while satisfying the target materials synthesis and property objectives at some cost of chemical diversity. One scenario of interest could be where experimentalists have a suite of existing materials satisfying their objective (for instance, a suite of Li-ion and Na-ion battery cathode materials) and wish to discover additional analogues or variations of the selected material families. In this context, the PGN would be more suitable because a pretraining dataset exists and validity is more crucial than diversity. The DQN model outperforms in identifying a diverse set of inorganic compositions with superior materials synthesis and property values but is more prone to error in terms of compound validity. This behavior could be useful in high-risk, high-reward or needle-in-a-haystack use cases (e.g., high ZT thermoelectrics) where high-throughput experimentation is readily available and finding a material which strongly satisfies the synthesis or property objectives outweighs the cost of validating additional predicted compositions102,103.

Another limitation is that our RL modeling strategies only consider generation of oxides with integer coefficients. However, materials with fractional coefficients such as doped compounds are prevalent across materials science, particularly in functional materials like Li-ion battery cathode materials, solid-state electrolytes, and heterogeneous catalysts. Generalizing our modeling strategy to a continuous action space for element coefficients would make the problem significantly more difficult, as there are potentially infinitely many compositions to attempt to satisfy the reward objective which would make the optimization landscape more complex. Future endeavors in this space could adapt new advances in RL which extend these methods to both continuous and discrete action spaces to address this problem104,105.

Furthermore, our methods rely on using learned surrogates (trained ML models) to inform our synthesis and property objectives because they are relatively inexpensive to train compared to more accurate physics-informed calculations. However, using learned surrogates to determine rewards can lead to generation of compounds outside the distribution of the surrogates' training set. Optimizing over the chemical space in this manner will naturally push the explored compounds away from the training set distribution, making prospective real-world errors in synthesis and property prediction higher than the original train-test split error. The RL algorithm could therefore exploit surrogate prediction errors rather than optimizing the true property values, hampering its real-world predictive abilities. Future work is required to explore more rigorous physics-based reward methods to inform our RL algorithms more effectively and increase the reliability of the suggested compounds and their objectives of interest. Quantum-mechanical methods could also be applied to validate compositions and their templated crystal structures to further determine their stability and exhibited properties.

Finally, RL is a fast-growing field, and the machine learning community has rapidly developed new algorithms and model training paradigms over the past years which have mitigated drawbacks and expanded its potential use cases. This work has explored two RL strategies, policy gradient and deep Q-learning, which have shown great success in a number of problem areas. As an initial effort to explore RL in the context of machine learning-assisted materials design, we have compared the advantages and drawbacks of the policy gradient and deep Q-network RL approaches and how they perform against existing ML-guided algorithms, particularly with multi-objective synthesis and property tasks. However, more advanced RL strategies (and newer derivative approaches) such as soft actor-critic106, double deep Q-network107, rainbow deep Q-network108 or proximal policy optimization109 have explored ways to improve in areas such as stability and sample efficiency. We intend to investigate more advanced RL algorithms for inverse inorganic materials design in future works, which we envision will further improve upon model performance and the quality of generated compositions.

We have demonstrated and contrasted two new RL approaches for the generation of inorganic compositions subject to single and multi-objective reward objectives. Our models satisfy common chemical guidelines of charge neutrality, electronegativity balance, and negative formation energy. They successfully learn the composition-synthesis-property space to suggest promising compositions of interest which are matched with template crystal structures to accelerate the screening of inorganic compositions. The developed models in this work capture physics-based trends in the relationships between composition, materials properties, and synthesis parameters and outperform baseline ML methods in the field. Our approach could help guide experimentalists with initial suggestions for inorganic compositions and composition families to further explore to aid in synthesis planning for the discovery and design of new materials. The demonstrated methods could be used as an initial step in high-throughput screening of inorganic compositions in combination with higher-fidelity quantum mechanical calculations and laboratory experiments to identify compounds exhibiting target properties and convex hull stability. These modeling strategies take an important step towards realizing true high-throughput inorganic synthesis, where machine learning can effectively screen large numbers of inorganic compounds to hasten the synthesis of inorganic materials with desirable properties. The initial methods of this work could be effectively applied to other disciplines in materials science which are hard-pressed to hasten the realization of novel materials for emerging technologies, such as renewable energy generation, catalysis, and carbon capture.

Methods

Policy gradient network overview

The model architecture used for the PGN in this work builds upon the work of Popova et al.36 We formulate the goal of generating inorganic materials with desirable properties as a task of finding the parameters Θ of the policy network which maximizes the expected reward

$$J\left(\Theta \right)={\mathbb{E}}\left[R\left({{\boldsymbol{s}}}_{T}\right)\left|{{\boldsymbol{s}}}_{0}\right.,\Theta \right]=\sum _{{{\boldsymbol{s}}}_{T}}{p}_{\Theta }\left({{\boldsymbol{s}}}_{T}\right)R({{\boldsymbol{s}}}_{T})\approx \mathop{\prod }\limits_{t=1}^{T}{{\mathbb{E}}}_{{{\boldsymbol{a}}}_{t} \sim {p}_{\Theta }({{\boldsymbol{a}}}_{t}\left|{{\boldsymbol{s}}}_{t-1}\right.)}R({{\boldsymbol{s}}}_{T})$$
(3)

where R(sT) is the reward assigned based on predicted material properties and synthesis parameters of sT. We sum over all possible terminal states sT and approximate the sum through sampling from the generative model. We use a stack-augmented recurrent neural network (stack-RNN) as the generative model trained with cross-entropy loss function minimization and the REINFORCE algorithm to conduct policy gradient updates during learning110,111.
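
The REINFORCE update direction for a single step of a categorical (softmax) policy can be computed in closed form, which is what the policy gradient estimate reduces to per sampled action. This is a minimal stdlib sketch of the textbook rule, not the paper's stack-RNN implementation:

```python
import math

def reinforce_gradient(logits, action, reward):
    """REINFORCE gradient w.r.t. softmax logits for one sampled action.

    For p = softmax(logits), the gradient of log p[action] w.r.t. the logits
    is one_hot(action) - p, so the REINFORCE ascent direction is
    reward * (one_hot(action) - p).
    """
    z = max(logits)                         # subtract max for numerical stability
    exps = [math.exp(l - z) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return [reward * ((1.0 if i == action else 0.0) - p)
            for i, p in enumerate(probs)]

# A positive terminal reward pushes probability toward the taken action:
grad = reinforce_gradient([0.0, 0.0], action=0, reward=1.0)  # [0.5, -0.5]
```

In the paper this gradient is accumulated over every token of a generated composition and scaled by the terminal reward R(sT).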

The stack-RNN architecture is particularly suited for sequence prediction problems where pattern matching is important, such as in context-free languages and SMILES string generation, due to its ability to “count”36,60,61. In inorganic materials generation, stack-RNNs are useful because the generator model must learn complex concepts such as ionic charge and electronegativity to predict chemically valid compounds. To further reinforce this concept and mirror the effects of a constraint oracle as used in the DQN model (see Q-learning RL framework), we include additional reward terms which positively weight generated compounds that are both charge neutral and electronegativity balanced, leading to the modified reward function

$${R}_{t}\left({{\boldsymbol{s}}}_{t},{{\boldsymbol{a}}}_{t}\right)=\left(1-{w}_{{ce}}\right)\left[\mathop{\sum }\limits_{i=1}^{N}{w}_{i}{R}_{i,t}\left({{\boldsymbol{s}}}_{t},{{\boldsymbol{a}}}_{t}\right)\right]+{w}_{{ce}}{R}_{t}^{{ce}}\left({{\boldsymbol{s}}}_{t},{{\boldsymbol{a}}}_{t}\right)$$
(4)

where Rtce is the reward for satisfying charge neutrality and electronegativity balance and wce is the weight placed on the charge neutrality and electronegativity objectives. In practice, we found that equally weighting materials synthesis/property objectives and charge neutrality/electronegativity balance objectives (i.e., setting wce = 0.5) resulted in the best performance.
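
Eq. 4 translates directly into code. The toy reward values below are illustrative only; in the paper the individual rewards come from the trained property predictors:

```python
def pgn_reward(property_rewards, weights, ce_reward, w_ce=0.5):
    """Modified PGN reward (Eq. 4).

    Combines the weighted synthesis/property rewards with a validity reward
    `ce_reward` for charge neutrality and electronegativity balance;
    w_ce = 0.5 weights the two groups equally, which the text reports
    performed best in practice.
    """
    weighted_properties = sum(w * r for w, r in zip(weights, property_rewards))
    return (1.0 - w_ce) * weighted_properties + w_ce * ce_reward

# Two property objectives weighted equally, plus a satisfied validity reward:
r = pgn_reward(property_rewards=[2.0, 4.0], weights=[0.5, 0.5], ce_reward=1.0)  # 2.0
```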

We initiate the generation process (Fig. 1c) by giving the model a <START> token, which signals to the model to begin a new sequence. At each time step 0 < t < T, the policy network takes the previously generated element and coefficient and produces a probability distribution pΘ(at|st−1) over the action space. We then sample from the predicted distribution to determine the element and coefficient to be added next. The sequence generation task finishes either when the maximum time horizon is reached or when the model generates an <END> token, and the generated material is the concatenation of all of the generated element-coefficient tokens. The final reward is then estimated using Eq. 4.
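
The sampling loop just described can be sketched as follows. `policy` stands in for the trained stack-RNN, and `toy_policy` is a hypothetical hand-written policy used only to make the sketch runnable:

```python
import random

def generate_composition(policy, max_steps=5, seed=0):
    """Autoregressive generation loop: sample element-coefficient tokens from
    the policy until <END> is emitted or the maximum horizon is reached.
    """
    rng = random.Random(seed)
    tokens = ["<START>"]
    for _ in range(max_steps):
        vocab, probs = policy(tokens)
        token = rng.choices(vocab, weights=probs)[0]
        if token == "<END>":
            break
        tokens.append(token)
    return "".join(tokens[1:])  # concatenate the element-coefficient tokens

# Toy policy strongly favoring a lithium oxide and then stopping:
def toy_policy(tokens):
    if len(tokens) == 1:
        return ["Li2", "Fe2"], [0.9, 0.1]
    if len(tokens) == 2:
        return ["O1", "O3"], [0.9, 0.1]
    return ["<END>", "O1"], [0.95, 0.05]

formula = generate_composition(toy_policy)
```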

Deep Q-network overview

Q-learning is a model-free reinforcement learning algorithm which trains an agent to behave optimally in a Markov decision process112. The agent performs actions to maximize the expected sum of all future discounted rewards given an objective function

$$J\left(\Theta \right)=\mathop{\sum }\limits_{t=0}^{T}{\gamma }^{t}{R}_{t}\left({{\boldsymbol{s}}}_{t},{{\boldsymbol{a}}}_{t}\right)$$
(5)

where γ ∈ [0, 1] is the discount factor and Rt represents the reward at step t given state st and action at. The Q-function satisfies the Bellman recursion

$${Q}^{\pi }\left({{\boldsymbol{s}}}_{t},{{\boldsymbol{a}}}_{t}\right)={R}_{t+1}+\gamma \mathop{\max }\limits_{{{\boldsymbol{a}}}_{t+1}}{Q}^{\pi }\left({{\boldsymbol{s}}}_{t+1},{{\boldsymbol{a}}}_{t+1}\right)$$
(6)

with \({Q}^{\pi }\left({{\boldsymbol{s}}}_{t+1},{{\boldsymbol{a}}}_{t+1}\right)\) being the expected sum of all the future rewards the agent receives in the resultant state \({{\boldsymbol{s}}}_{t+1}\). In the context of materials design, the agent uses a policy π to maximize the expected future reward (material property) by learning a DQN Qπ (Fig. 1a). The agent (material generator) generates materials, which are then evaluated by material property predictor(s), assigning the reward R to the agent. The goal is to learn a DQN material generator that maximizes expected rewards to generate compounds with desired properties.

In Q-learning, the Q-value is the expected discounted reward for a given state and action, and therefore the optimal policy π can be found using iterative updates with the Bellman equation (Eq. (6)). Upon convergence, the optimal Q-value Q is defined as:

$${Q}^{* }\left({{\boldsymbol{s}}}_{t},{{\boldsymbol{a}}}_{t}\right)={{\mathbb{E}}}_{{{\boldsymbol{s}}}_{t+1} \sim p}\left[{R}_{t+1}+\gamma \mathop{\max }\limits_{{{\boldsymbol{a}}}_{t+1}}{Q}^{* }\left({{\boldsymbol{s}}}_{t+1},{{\boldsymbol{a}}}_{t+1}\right)\left|{{\boldsymbol{s}}}_{t}\right.,{{\boldsymbol{a}}}_{t}\right]$$
(7)

Here, the Q-function Q(st,at) is approximated with a deep Q-network (DQN) Qθ parameterized by weights θ113.
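
The Bellman backup of Eq. 6 is easiest to see in tabular form. The paper approximates Q with a deep network trained on the same TD target; this stdlib sketch (hypothetical state/action labels) shows one update:

```python
def q_update(q_table, state, action, reward, next_q_values, gamma=0.9, lr=0.5):
    """One tabular Bellman backup toward the TD target
    R + gamma * max_a' Q(s', a')  (Eq. 6), at learning rate `lr`.
    """
    target = reward + gamma * max(next_q_values)
    old = q_table.get((state, action), 0.0)
    q_table[(state, action)] = old + lr * (target - old)
    return q_table[(state, action)]

q_table = {}
# Reward 1.0 for this step; best next-state Q-value is 2.0:
new_q = q_update(q_table, state="s0", action="add_Li", reward=1.0,
                 next_q_values=[0.0, 2.0])
# target = 1.0 + 0.9 * 2.0 = 2.8, so the entry moves halfway: 1.4
```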

Without constraints, a DQN may generate invalid materials through unbounded exploration. To address this, charge neutrality and electronegativity balance constraints are incorporated using a constraint oracle114,115 (a trained function) \({\hat{c}}_{j,t}\) which is formulated as

$${\hat{c}}_{j,t}=\mathop{\max }\limits_{{t}^{{\prime} }\ge t}{c}_{j,{t}^{{\prime} }}$$
(8)

with cj,t being the jth constraint to be satisfied at step t, and \({\hat{c}}_{j,t}\) the maximum level of violation occurring in all current and future steps t′ of the generation process. The constraint oracles Cj(s,a) are learned functions (trained on \({\hat{c}}_{j,t}\)) of the probability of an action satisfying a constraint in all future steps. Consequently, the constrained agent follows a modified policy

$$\pi \left({\boldsymbol{s}}\right)=\mathop{{\rm{argmax}}}\limits_{{\boldsymbol{a}}}\left[{Q}^{\pi }\left({\boldsymbol{s}},{\boldsymbol{a}}\right)+\infty \cdot \min \left(0,{C}_{j}\left({\boldsymbol{s}},{\boldsymbol{a}}\right)-T\right)\right]$$
(9)

where T ∈ [0, 1] is a user-defined threshold set at 0.5. This important modification to the DQN ensures that actions predicted by the constraint oracle Cj(s,a) to eventually lead to invalid compounds are penalized by the last term in Eq. 9, increasing the probability of generating chemically valid compounds.
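
In code, the "−∞ penalty" of Eq. 9 amounts to masking out actions whose oracle score falls below the threshold before taking the argmax. A minimal sketch with hypothetical Q-values and oracle scores:

```python
def constrained_greedy_action(q_values, oracle_scores, threshold=0.5):
    """Greedy action selection under the constraint oracle (Eq. 9).

    Actions with oracle score C_j(s, a) below the threshold T receive an
    effective value of -infinity and can never be chosen; among the rest,
    the action with the maximal Q-value is returned.
    """
    best_action, best_value = None, float("-inf")
    for action, (q, c) in enumerate(zip(q_values, oracle_scores)):
        value = q if c >= threshold else float("-inf")
        if value > best_value:
            best_action, best_value = action, value
    return best_action

# Action 0 has the highest Q-value but is predicted to violate a constraint,
# so the constrained policy picks action 1 instead:
chosen = constrained_greedy_action([5.0, 3.0, 1.0], [0.2, 0.9, 0.8])  # 1
```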

DQN agents typically take the greedy action that maximizes the expected return, rendering the policy deterministic in a static environment. For a materials generation task, however, a deterministic policy would be problematic, as the agent would always take the same actions and generate the same material repeatedly; materials generated by the agent must be chemically diverse. To address this, instead of taking the greedy action, the policy π samples the action from the top n% of actions, \(\pi \left({\boldsymbol{s}}\right)={random}.{sample}({{\boldsymbol{a}}}_{1}^{* },\ldots ,{{\boldsymbol{a}}}_{n}^{* })\), where \({{\boldsymbol{a}}}_{1}^{* },\ldots ,{{\boldsymbol{a}}}_{n}^{* }\) refer to the top n% of actions ranked by Qπ and the constraint oracles Cj(s,a). Empirically, it was found that n can be increased from 0 up to 20 without a deterioration of material properties, so n was set to 20 to maximize the diversity of generated materials. Any value larger than n = 20 would further improve diversity at the expense of material properties.
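
The top-n% sampling rule can be sketched in a few lines (function name and toy Q-values are illustrative):

```python
import random

def sample_top_fraction(q_values, fraction=0.2, seed=0):
    """Sample uniformly among the top `fraction` of actions ranked by Q-value.

    Replaces the deterministic argmax with stochastic sampling from the top
    n% of actions (n = 20 in the paper) to keep generated materials diverse.
    """
    rng = random.Random(seed)
    k = max(1, int(len(q_values) * fraction))
    ranked = sorted(range(len(q_values)), key=lambda i: q_values[i],
                    reverse=True)
    return rng.choice(ranked[:k])

# With 10 actions and fraction = 0.2, only the two highest-Q actions
# (indices 0 and 1 here) can ever be sampled:
action = sample_top_fraction([9, 8, 1, 2, 3, 0, 4, 5, 6, 7], fraction=0.2)
```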

The DQN state representation st (Fig. 1d) has two components. The first is the material composition, where each generated material (complete or incomplete) is featurized using the Magpie62 featurization framework to give an \({{\mathbb{R}}}^{145}\) vector consisting of stoichiometric properties, elemental properties, electronic structure features, and ionic features. The second is the current step t of the generation process. Previous works15 have reported that the inclusion of the step (time-dependent policy) outperforms its exclusion (time-independent policy). Since the horizon is capped at 5 steps, this is represented by a one-hot encoding of \({{\mathbb{R}}}^{5}\). The DQN action space is identical to the PGN action space, with both an element and an element coefficient component. The element component is represented by a one-hot encoding of \({{\mathbb{R}}}^{80}\) and the element coefficient component by a one-hot encoding of \({{\mathbb{R}}}^{10}\), corresponding to the sizes of the element set and the integer coefficient set (0–9), respectively.

Dataset acquisition and preprocessing

To train the RL models and materials property predictor models, we leveraged a subset of inorganic materials and their computed properties contained within the Materials Project (MP) database48. Materials Project is a widely used inorganic materials database containing crystal structures and materials properties data calculated from high-throughput quantum mechanical calculations. To train the sintering and calcination temperature predictor models, we used a publicly released inorganic solid-state synthesis database text-mined from scientific literature using a combination of NLP and rule-based extraction techniques116. We restricted our inorganic materials dataset to oxides, as the majority (> 97%) of the synthesis dataset consists of oxides and our synthesis predictor models would therefore exhibit high uncertainty for compounds not containing oxygen in multi-objective generation tasks.

To preprocess the materials chemical formula and property data, we use a similar preprocessing strategy to Dan et al.18, Jha et al.117, and Pathak et al.17. For a formula with multiple reported formation energies, we choose the lowest one as a heuristic for stability. We additionally removed all single-element compounds and any compounds with a formation energy outside the interval [µ − 5σ, µ + 5σ], where µ and σ are the average and standard deviation of the formation energies of all compounds in the MP dataset. For each property, all entries which did not contain a computed property value were discarded. After preprocessing, we obtained datasets containing the chemical formulas, formation energies, and band gaps for 22,555 compounds as well as the bulk and shear modulus values for 9,888 compounds. For bulk and shear moduli, property values were taken as their corresponding natural logarithms.
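
The deduplication and outlier-filtering steps can be sketched as follows (function name and toy records are illustrative; the real pipeline operates on the full MP subset):

```python
from statistics import mean, stdev

def preprocess_formation_energies(records, n_sigma=5):
    """Deduplicate and outlier-filter (formula, formation_energy) records.

    For duplicate formulas the lowest energy is kept as a stability
    heuristic, then entries outside [mu - n_sigma*sigma, mu + n_sigma*sigma]
    are dropped, mirroring the preprocessing described in the text.
    """
    lowest = {}
    for formula, e in records:
        if formula not in lowest or e < lowest[formula]:
            lowest[formula] = e
    energies = list(lowest.values())
    mu, sigma = mean(energies), stdev(energies)
    return {f: e for f, e in lowest.items()
            if mu - n_sigma * sigma <= e <= mu + n_sigma * sigma}

clean = preprocess_formation_energies(
    [("Li2O", -2.0), ("Li2O", -1.5), ("Fe2O3", -1.7), ("TiO2", -3.3)])
```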

To preprocess the calcination and sintering temperature data, we followed the procedure reported in Karpovich et al.12. Reactions without at least one relevant heating step with a reported temperature were removed. Temperatures were converted into units of Celsius (°C) and limited to between 200 °C and 2000 °C. Times were converted into units of hours (h) and limited to less than 100 h. If a relevant heating step occurred more than once in a recipe (e.g., multiple sintering steps), the last operation in chronological order was taken, and if a relevant heating step was reported with more than one temperature or time, the highest value for that step was taken. For reactions with more than one reported occurrence in literature (same target and precursors), the ground truth reaction condition was taken to be the average of the reported conditions. After preprocessing, a final reaction dataset consisting of 12,228 calcination temperatures and 12,296 sintering temperatures was obtained.
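A simplified sketch of these rules is shown below. It is illustrative only: each step is reduced to a single (temperature, time) pair, so the "highest value per step" rule is omitted, and all helper names are our own.

```python
from statistics import mean

T_MIN, T_MAX = 200.0, 2000.0   # Celsius bounds from the text
TIME_MAX = 100.0               # hours

def clean_recipe(steps):
    """steps: chronological list of (temperature_C, time_h) pairs for one
    relevant operation type (e.g., sintering). Returns the cleaned
    condition, or None if no step passes the bounds."""
    valid = [(T, t) for T, t in steps
             if T_MIN <= T <= T_MAX and t < TIME_MAX]
    if not valid:
        return None
    return valid[-1]  # last relevant step in chronological order

def aggregate_reports(conditions):
    """Average conditions over duplicate literature reports of one reaction."""
    temps = [T for T, _ in conditions]
    times = [t for _, t in conditions]
    return mean(temps), mean(times)

cleaned = clean_recipe([(150.0, 5.0), (900.0, 12.0), (1100.0, 24.0)])
averaged = aggregate_reports([(1000.0, 10.0), (1100.0, 20.0)])
```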

ML property prediction model training

We built ML property prediction models for both materials properties (formation energy, bulk/shear modulus, band gap) and synthesis objectives (sintering temperature, calcination temperature). For the materials property prediction, we leveraged Roost118, a message-passing neural network architecture that predicts materials properties directly from composition. Roost property prediction models were trained with default hyperparameters using a random 90/10 train-test split.

For sintering and calcination temperature prediction, inorganic compounds were featurized using Magpie compositional features, which are physically motivated descriptors that take the form of a 145-dimensional embedding containing stoichiometric properties, elemental properties, electronic structure features, and ionic compound features62. To predict temperature from materials composition, we trained a random forest (RF) model and optimized hyperparameters using 5-fold cross-validation. Optimized hyperparameters for the RF models can be found in Supplementary Table 9, and performance metrics for all ML property prediction models can be found in Supplementary Table 10.
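This training setup might be sketched with scikit-learn as follows; the hyperparameter grid and the synthetic 145-dimensional features are placeholders (the actual grid is in Supplementary Table 9), so this shows the workflow rather than the paper's exact configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Toy stand-in for Magpie-featurized compositions (145-d) and temperatures.
rng = np.random.default_rng(0)
X = rng.random((60, 145))
y = 200 + 1800 * X[:, 0]  # synthetic "temperature" target in [200, 2000]

# Illustrative hyperparameter grid, searched with 5-fold cross-validation.
grid = {"n_estimators": [50, 100], "max_depth": [5, None]}
search = GridSearchCV(RandomForestRegressor(random_state=0), grid, cv=5)
search.fit(X, y)

model = search.best_estimator_     # RF refit on all data with the best grid point
preds = model.predict(X[:5])
```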

Policy gradient RL training

The PGN training is conducted in two steps. We first pretrained a generative model on the 22,555 oxide compounds extracted from Materials Project. This pretraining step ensures the model can learn general chemical guidelines to produce valid material formulas. We then combine the generative and predictive models in an RL framework, where the predictive model is used to assign a reward value to each generated material and the generative model is trained to maximize the expected reward.
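The reward-maximization idea can be illustrated with a toy REINFORCE-style loop: a softmax policy over a small discrete action set is nudged toward actions that a stand-in reward function scores highly. This is a didactic sketch, not the paper's stack-RNN policy, and the reward function is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 10
theta = np.zeros(n_actions)  # policy logits

def reward(action):
    # Stand-in for the property-predictor reward; peaks at action 7.
    return 1.0 - abs(action - 7) / 10.0

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

lr = 0.1
for _ in range(2000):
    p = softmax(theta)
    a = rng.choice(n_actions, p=p)       # sample an action from the policy
    grad = -p.copy()                     # REINFORCE gradient of log pi(a)
    grad[a] += 1.0
    theta += lr * reward(a) * grad       # ascend the expected reward

rewards = np.array([reward(a) for a in range(n_actions)])
final_expected_reward = float(softmax(theta) @ rewards)
```

Under a uniform policy the expected reward is 0.69; after training, probability mass concentrates on the high-reward actions.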

The generative model is a stack-augmented recurrent neural network (RNN) consisting of a gated recurrent unit (GRU) layer with a hidden size of 1500, a stack width of 1500, and a stack depth of 10. The action representation has two components. The first is the element, or the atomic species being added to the material formula sequence. We consider the set of elements present in the pre-processed MP oxide data (80 elements) in the action space. The second is the element coefficient, the numerical subscript of the atomic species added to the sequence, which we restrict to integers from 0 to 9. The action space consists of all possible combinations of the 80 elements with the element coefficients. The state space consists of all possible strings over this alphabet with a maximum length of 5, as a maximum generation length of 5 tokens was used to limit compound generation to at most 5 unique elements.
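The token alphabet and length cap can be made concrete with a small sketch that samples a random (untrained) sequence of (element, coefficient) tokens; the element list is abridged and all names are illustrative.

```python
import random

ELEMENTS = ["Li", "O", "Fe", "Ti", "Ba", "Sr", "Mn", "Co"]  # abridged; 80 in the paper
COEFFS = list(range(10))                                     # integer subscripts 0-9
MAX_TOKENS = 5   # at most 5 (element, coefficient) tokens per formula

def sample_formula_tokens(rng):
    """Sample a random sequence of (element, coefficient) tokens, avoiding
    repeated elements, as an untrained stand-in for the PGN's output."""
    n = rng.randint(1, MAX_TOKENS)
    tokens, used = [], set()
    for _ in range(n):
        el = rng.choice([e for e in ELEMENTS if e not in used])
        used.add(el)
        coeff = rng.choice(COEFFS[1:])  # a zero coefficient would drop the element
        tokens.append((el, coeff))
    return tokens

rng = random.Random(0)
tokens = sample_formula_tokens(rng)
formula = "".join(f"{el}{c}" for el, c in tokens)
```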

The generative model was pretrained for 10,000 iterations and the PGN was trained for 500 iterations, both using a learning rate of 0.001 and the Adadelta optimizer. Models were trained on a single NVIDIA RTX A5000 GPU and training a PGN agent takes 2-3 h.

Deep Q-network RL training

For the DQN architecture, the leaky ReLU activation function is used in all fully connected layers. The four input vectors (the \({{\mathbb{R}}}^{145}\) Magpie composition embedding, the \({{\mathbb{R}}}^{5}\) step encoding, the \({{\mathbb{R}}}^{80}\) element encoding, and the \({{\mathbb{R}}}^{10}\) element coefficient encoding) are individually passed through fully connected layers, and the resulting 64-dimensional hidden states are concatenated into a single vector before passing through a final fully connected layer to give the Q-value.
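A NumPy forward-pass sketch of this architecture is given below. Weights are randomly initialized (a real implementation would train them), the final layer is taken as linear since the text does not specify its activation, and all names are illustrative.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

rng = np.random.default_rng(0)
SIZES = {"composition": 145, "step": 5, "element": 80, "coeff": 10}
HIDDEN = 64

# One fully connected layer per input branch, plus a final output layer.
W_in = {k: rng.normal(0, 0.1, (HIDDEN, d)) for k, d in SIZES.items()}
W_out = rng.normal(0, 0.1, (1, 4 * HIDDEN))

def q_value(inputs):
    """inputs: dict mapping each branch name to its feature vector.
    Each branch is projected to a 64-d hidden state with leaky ReLU,
    the four states are concatenated, and a linear layer gives Q."""
    hidden = [leaky_relu(W_in[k] @ inputs[k]) for k in SIZES]
    h = np.concatenate(hidden)           # shape (256,)
    return float((W_out @ h)[0])

inputs = {k: rng.random(d) for k, d in SIZES.items()}
q = q_value(inputs)
```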

The DQN was trained over 500 iterations using a replay buffer of size 50,000, which addresses the issue of correlated sequential samples119. In each iteration, 100 compounds were generated and stored in the replay buffer, and the Q-network was trained on 100 samples drawn randomly from the replay buffer using a smooth L1 loss and the Adam optimizer with a learning rate of 0.01. Initial exploration was encouraged using an ϵ-greedy policy starting with a high ϵ value of 0.99, which was decayed by a factor of 0.99 after each iteration to ensure eventual exploitation and convergence to the optimal policy. The discount factor γ was set to 0.9. Models were trained on a single NVIDIA RTX A5000 GPU and training a DQN agent takes 2-3 h.
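The replay buffer and ϵ-decay schedule can be sketched in a few lines of standard-library Python; the transition format is a placeholder and the class is illustrative, not the paper's implementation.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size FIFO buffer; random sampling breaks the correlation
    between sequentially generated transitions."""
    def __init__(self, capacity=50_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size, rng):
        return rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))

def epsilon_schedule(n_iters, eps0=0.99, decay=0.99):
    """ϵ value at each training iteration: eps0 * decay**i."""
    return [eps0 * decay ** i for i in range(n_iters)]

rng = random.Random(0)
buf = ReplayBuffer()
for i in range(300):  # placeholder transitions (state, action, reward, next_state)
    buf.push(("state", "action", float(i), "next_state"))
batch = buf.sample(100, rng)
eps = epsilon_schedule(500)
```

With these settings ϵ falls from 0.99 toward zero over the 500 iterations, shifting the agent from exploration to exploitation.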

DING model training

Code for the DING model implementation was adapted from the authors’ publicly released repository17. The CVAE models were trained based on default hyperparameters using a random 72/18/10 train/validation/test split over 150 epochs. The model with the best validation loss was used for each trial with reconstruction accuracy computed on the test set.

SMACT enumeration

With the five-component inorganic composition space estimated to exceed 10¹³ materials63, a comprehensive brute-force search across the five-component space is intractable; we approximated a brute-force search by limiting our search space to binary, ternary, and quaternary oxides, with up to three non-oxygen components and one oxygen component. Starting with our initial search space of 83 elements, we generated all combinations of the non-oxygen elements. We then used a custom multithreaded version of the smact.screening.smact_filter function with a stoichiometry threshold of 9 (to search for element coefficients of 1 to 9) to generate all possible binary, ternary, and quaternary oxide combinations for the initial element set. The SMACT filter function generates all possible compositions from the specified element set that pass both the SMACT charge neutrality and electronegativity balance filters. To make the search space more tractable, we randomly selected one coefficient combination per element combination, so for each unique element combination we generated at most one oxide containing those elements, barring any combinations rejected by the smact_filter function. For the final 87,088 generated oxides, we used the trained property prediction models to predict each of the target property values. For each target property, we sorted the generated composition list by the property of interest (ascending for bulk modulus, shear modulus, and band gap; descending for formation energy, sintering temperature, and calcination temperature) and selected the top 1,000 compositions.
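The charge-neutrality part of this screen can be illustrated with a simplified stand-in for smact_filter: for a fixed set of elements, enumerate all coefficient combinations from 1 to 9 and keep the charge-neutral ones. Unlike the real smact.screening.smact_filter, this sketch assigns a single oxidation state per element and omits the electronegativity-balance test; the oxidation-state table and function name are our own.

```python
from itertools import product

# One representative oxidation state per element (SMACT actually considers
# all known oxidation states and also applies an electronegativity test).
OX_STATES = {"Li": +1, "Mg": +2, "Al": +3, "Ti": +4, "O": -2}

def neutral_oxides(elements, max_coeff=9):
    """Enumerate charge-neutral coefficient combinations for the given
    non-oxygen elements plus oxygen, with coefficients 1..max_coeff."""
    species = list(elements) + ["O"]
    charges = [OX_STATES[e] for e in species]
    hits = []
    for coeffs in product(range(1, max_coeff + 1), repeat=len(species)):
        # Keep the combination if the total formal charge sums to zero.
        if sum(q * c for q, c in zip(charges, coeffs)) == 0:
            hits.append(dict(zip(species, coeffs)))
    return hits

combos = neutral_oxides(["Mg"])  # MgO, Mg2O2, ..., Mg9O9 under this model
```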

Template-based crystal structure prediction

Template-based crystal structure prediction was conducted using the authors’ publicly released repository, following the methodology detailed in Kusaba et al.90. Predictions were made using the provided ensemble of neural network models trained on the 33,153 stable compounds obtained from Materials Project. For each query composition, we predicted the top six candidate structures, ranked in descending order of template-match confidence. We then selected the subset of query compositions whose highest-confidence matched template crystal structure had a confidence of ≥ 80%. Crystal structures were visualized using the .cif files provided by the Materials Project database.