Abstract
Materials design often becomes an expensive black-box optimization problem due to the difficulty of balancing exploration-exploitation trade-offs in high-dimensional spaces. We propose a reinforcement learning (RL) framework that effectively navigates complex design spaces through two complementary approaches: a model-based strategy utilizing surrogate models for sample-efficient exploration, and an on-the-fly strategy when direct experimental feedback is available. This approach demonstrates better performance in high-dimensional spaces (D ≥ 6) than Bayesian optimization (BO) with the Expected Improvement (EI) acquisition function, through more dispersed sampling patterns and better landscape learning capabilities. Furthermore, we observe a synergistic effect when combining BO's early-stage exploration with RL's adaptive learning. Evaluations on both standard benchmark functions (Ackley, Rastrigin) and real-world high-entropy alloy data demonstrate statistically significant improvements (p < 0.01) over traditional BO with EI, particularly in complex, high-dimensional scenarios. This work addresses limitations of existing methods while providing practical tools for guiding experiments.
Introduction
Optimization and discovery are key to finding materials for critical applications ranging from energy storage1,2,3,4,5 to aerospace engineering6. The process faces two challenges: first, it heavily relies on experimental data acquisition and analysis, often lacking closed-form physical models for precise predictions7,8; second, it involves vast and complex design spaces9,10,11 for which the true function is unknown or a “black box”. The relationship between input parameters (chemical compositions and processing conditions) and resulting materials performance is non-trivial, and objective evaluation is both time-consuming and resource-intensive12,13,14.
Bayesian Optimization (BO) has been widely used in materials science to navigate complex design spaces12,15,16,17,18. While BO efficiently balances exploration and exploitation in single-step optimization scenarios, making it suitable for materials optimization tasks with relatively small search spaces13,19, it faces several inherent limitations. First, its performance can degrade significantly as the dimensionality of the search space increases, struggling to efficiently explore high-dimensional spaces and potentially missing promising regions20,21. Second, despite recent advances in multi-step optimization, BO primarily focuses on myopic (single-step) decision-making, which may not capture the interdependence of consecutive design decisions22,23. Furthermore, the absence of a clear stopping criterion in BO frameworks makes it challenging to determine when the optimization process should terminate.
In recent years, a number of artificial intelligence techniques have been increasingly applied to materials design to address these limitations24,25,26,27. Among these, Reinforcement Learning (RL) has emerged as a promising paradigm for tackling complex optimization problems10,28,29,30. It has demonstrated remarkable success in various domains, from playing top-level adversarial games31 to controlling real-time robotics32, where it has shown an ability to navigate large, complex state spaces and make decisions under uncertainty. Notably, RL's capacity to incorporate future outcome predictions into its current decision-making process offers unique advantages in sequential optimization tasks. However, the application of RL in materials design optimization remains largely unexplored, particularly its potential to learn and adapt from environmental feedback in materials discovery processes. While both RL and BO can leverage models for exploration, RL's ability to develop adaptive strategies through environmental interactions and consider long-term consequences makes it particularly suitable for materials optimization. Furthermore, the continuous development of autonomous self-driving laboratories (SDLs) in materials science and the possibility of on-the-fly experimental feedback in the near future require a timely examination of whether RL can navigate high-dimensional design spaces more efficiently than traditional BO. In this work, we explore the potential of RL to unlock the black box of materials design spaces beyond the capabilities of BO.
We address several aspects of the aforementioned challenges:
- We conceptualize the application of RL to iterative materials design in a framework that encompasses both model-based and on-the-fly RL approaches. It provides an overview of how RL integrates into the materials design process.
- Using optimization functions with very different feature landscapes as well as materials data, we demonstrate that RL is particularly promising in navigating high-dimensional spaces. By focusing on high-value regions, RL offers a more efficient exploration strategy compared to BO, uncovering optimal solutions missed by BO. This aligns well with the emerging trend of autonomous experimentation in materials science, where RL's efficient decision-making can fully leverage the increased throughput of automated laboratories.
- We propose a hybrid algorithm that combines the strengths of both BO and RL. This aims to leverage the early-stage exploration strength of BO with the later-stage adaptive self-learning of RL, offering a more robust and efficient strategy combining the separate strengths of both methods.
In this study, we implement BO with Expected Improvement (EI) as the acquisition function, which is one of the most commonly used acquisition functions in materials design7,33. Our numerical experiments on the benchmark functions and materials data reveal several key findings that emphasize the advantages of using RL for black-box optimization. First, model-based RL consistently outperforms traditional BO in high-dimensional cases (D ≥ 6) through more dispersed sampling patterns. Second, our RL agent exploits the high-value regions, and our surrogate model, when interacting with the agent, learns the overall landscape better than BO's surrogate model. This is particularly evident in complex landscapes, such as the Rastrigin function with dimensionality 10. Furthermore, we discover that a hybrid strategy creates a synergistic effect, leveraging the complementary strengths of each optimizer. Finally, in applications such as high-entropy alloy design, our framework shows increasing advantages as the number of components grows, achieving statistically significant improvements (p < 0.01) over BO in a 10-component design.
Results
Overall framework
Figure 1 shows two alternative training approaches. In the model-based approach (orange inner training loop), the RL agent interacts with a surrogate model trained on existing materials data to predict materials properties y(x) = f(x) + ε(x). The agent learns to navigate the design space by receiving rewards only from the surrogate model's predictions. The surrogate model enables RL to achieve sample-efficient exploration of the candidate search space without the need for costly experimental validation at each action step. That is, each action step assigns a specific value to one dimension of the N-dimensional input vector x ∈ ℝ^N, for example a composition or processing condition. The surrogate model is updated at the end of each design episode, where an episode concludes after determining all dimensions of x, thereby completing one full design iteration. This stands in contrast to a recent application of RL to the design of multi-step chemical processes for quantum dots, where the problem naturally possesses a sequential structure, and immediate experimental outcomes are available following each action, which then immediately update a belief model for the reward10.
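To make the episode structure concrete, the sketch below (an illustration under assumed interfaces, not the authors' implementation) frames one design episode as a sequential decision process in which each action fixes one dimension of x and the surrogate supplies the reward once the vector is complete:

```python
import numpy as np

class DesignEpisodeEnv:
    """Minimal sketch of a model-based design episode: each action assigns one
    dimension of x; a surrogate prediction provides the reward only when the
    full design vector has been specified (a simplification of the reward
    scheme described in the text)."""

    def __init__(self, n_dims, candidate_values, surrogate_predict):
        self.n_dims = n_dims                        # N, dimensionality of x
        self.values = candidate_values              # discrete values allowed per dimension
        self.surrogate_predict = surrogate_predict  # callable: x -> predicted property
        self.reset()

    def reset(self):
        self.x = np.full(self.n_dims, np.nan)       # partially specified design vector
        self.t = 0                                  # index of the dimension to set next
        return self._state()

    def _state(self):
        # State = design vector so far (unset entries encoded as 0) plus a progress counter.
        filled = np.nan_to_num(self.x, nan=0.0)
        return np.append(filled, self.t)

    def step(self, action_index):
        self.x[self.t] = self.values[action_index]
        self.t += 1
        done = self.t == self.n_dims
        # Reward only at the end of the episode, from the surrogate model.
        reward = float(self.surrogate_predict(self.x)) if done else 0.0
        return self._state(), reward, done
```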
The framework shows two alternative training approaches: an inner loop utilizing a surrogate model for sample-efficient RL agent learning (orange), and an on-the-fly loop interacting directly with the experimental environment (green). The RL agent can be trained through either approach or their combination to optimize materials properties through iterative design decisions. The dashed blue arrow and the gray argmax formula represent the ranking of candidates using an acquisition function (α(·)) based on model predictions in Bayesian Optimization.
As an alternative or complementary approach, the on-the-fly training loop (green) enables direct interaction between the RL agent and the experimental environment, which consists of actual materials synthesis, characterization, and/or computer simulations. In this loop, the Q-function (state-action value function) Q(s, a) ∈ ℝ is defined in terms of the current state s and the action a, and gives the expected cumulative discounted reward for taking action a in state s and following the optimal policy thereafter. Formally, \( Q(s,a)=\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t} r_{t} \mid s_{0}=s,\, a_{0}=a\right] \), where γ ∈ [0, 1) is the discount factor and \(r_t\) is the reward at time step t. The agent's actions a, which maximize the Q-function (\(\arg\max_{a} Q(s,a)\)), directly determine the materials to be synthesized or simulated, with the resulting performance measurements serving as environmental rewards. While this approach may require more resources and time compared to model-based training, it provides more accurate feedback for the RL agent's learning process. Depending on the specific materials optimization task and available resources, one can choose to implement either approach independently or combine them to leverage their respective advantages.
Problem formulation
Materials optimization is often formulated as a black-box optimization task over a complex design space. Let x ∈ X ⊆ ℝ^N denote an N-dimensional design vector representing material compositions and/or processing parameters, where X is the feasible design space with both physical and practical constraints. The objective is to find x that maximizes an unknown and expensive-to-evaluate function f: X → ℝ:

\( \mathbf{x}^{*}=\mathop{\arg\max}_{\mathbf{x}\in X} f(\mathbf{x}), \)
where f(x) also represents the material property or figure of merit of interest.
Model-based design
Model-based materials design relies on surrogate models to approximate expensive-to-evaluate objective functions f(x). While Gaussian Process Regression (GPR) has been widely adopted in BO frameworks, where an acquisition function α(·) guides the search for optimal designs (Fig. 1), this approach faces two nested optimization problems: inner-loop acquisition function maximization (Table 1 Line 6) and outer-loop experimental exploration (Table 1 Line 7). As an alternative approach, model-based RL can leverage the same surrogate models while offering more flexible exploration strategies through, for example, reward engineering. We implement a Deep Q-Network (DQN) agent that employs a neural network to approximate Q(s, a), mapping state-action pairs to their expected cumulative rewards34. As shown in Table 2, this RL agent learns a policy (represented by the Q table) to navigate the design space by receiving rewards based on the surrogate model’s predictions, which can be later adapted for on-the-fly optimization with real experimental measurement results as reward feedback. We specifically choose GPR as our surrogate model due to its prediction accuracy with small sample sizes—a common scenario in materials design. Unlike traditional model-based RL methods that place significant emphasis on modeling dynamics of stochastic state transitions, our approach leverages the deterministic nature of materials state transitions and prioritizes prediction accuracy. Also, in contrast to a previous study that applied model-based Proximal Policy Optimization (PPO) to train an RL agent for DNA sequence design35, our method differs in two aspects: (1) We employ a value-based approach (DQN) to explicitly model the Q-function, whereas the policy-based approach (PPO) learns the probability of selecting an action for a given state; (2) For exploration, we adopt an epsilon-greedy strategy, in contrast to the density-based exploration reward bonus scheme.
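As a concrete illustration of these ingredients (a sketch under assumed interfaces, not the authors' implementation), a GPR surrogate can be refit on the designs evaluated so far, and the DQN-style agent can act ε-greedily against its Q-network rather than maximizing an acquisition function:

```python
import numpy as np
import torch
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

def fit_surrogate(X_eval, y_eval):
    """Fit a GPR surrogate on the evaluated designs (X_eval: n x N, y_eval: n).
    The kernel choice here is an illustrative assumption."""
    kernel = ConstantKernel(1.0) * RBF(length_scale=1.0)
    gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    gpr.fit(X_eval, y_eval)
    return gpr

def select_action(q_network, state, n_actions, epsilon=0.1):
    """Epsilon-greedy action selection against a Q-network: value-based
    exploration, in contrast to BO's acquisition-function maximization."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)        # random exploratory action
    with torch.no_grad():
        q_values = q_network(torch.as_tensor(state, dtype=torch.float32))
    return int(torch.argmax(q_values).item())      # greedy action
```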
Test environments
To evaluate our framework’s effectiveness, we established a two-tier validation strategy combining benchmark test functions and a neural network predictor for alloys. The benchmark functions serve as controlled testing environments, allowing us to rigorously assess the framework’s optimization capabilities across different dimensionalities and landscape complexities. We evaluated our framework on two widely-used functions: the Ackley function and the Rastrigin function (Fig. 2 right panels), with dimensionalities ranging from 4 to 10. For each dimension, we discretized the search space into 51 evenly spaced values between [−5.0, 5.0], resulting in design spaces containing from 51^4 (~6.8 × 10^6) to 51^10 (~1.2 × 10^17) candidates. These functions, both featuring numerous local optima but with different landscape characteristics, provide challenging test beds for evaluating optimization algorithms in discrete, high-dimensional spaces.
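For reference, the grid construction and the quoted design-space sizes can be reproduced with a few lines (a simple sketch, not the authors' code):

```python
import numpy as np

# 51 evenly spaced candidate values per dimension on [-5.0, 5.0].
grid_values = np.linspace(-5.0, 5.0, 51)

for d in (4, 6, 8, 10):
    n_candidates = 51 ** d
    print(f"D = {d}: {n_candidates:.2e} candidate designs")
# D = 4 gives ~6.8e6 and D = 10 gives ~1.2e17 candidates, matching the sizes above.
```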
Performance comparison on (a) Ackley and (b) Rastrigin benchmark functions across different dimensionalities (D = 4–10). The y-axis shows the best-so-far value (global solution is at zero) over 200 experimental iterations, with shaded areas representing standard deviations across multiple runs. The rightmost 3-dimensional surface plots show the characteristic landscapes of Ackley (top) and Rastrigin (bottom) functions in 2-dimensional input space, illustrating their distinct optimization challenges.
Additionally, we adopted a neural network architecture from ref. 36 for high-entropy alloys (HEAs) that has given reliable predictions for multi-component alloys and which maps compositional parameters of up to ten elements to key mechanical properties: yield strength (\(\sigma_Y\)), ultimate tensile strength (\(\sigma_U\)), and elongation (\(\varepsilon\)). An HEA figure of merit (FOM_HEA) can be defined to capture the inherent trade-offs between strength and ductility as a weighted combination of the three normalized properties: \( \mathrm{FOM}_{\mathrm{HEA}}=\frac{1}{3}\left(\frac{\sigma_Y}{\sigma_{Y,N}}+\frac{\sigma_U}{\sigma_{U,N}}+\frac{\varepsilon}{\varepsilon_N}\right) \), where the subscript N indicates the normalization value for the corresponding property. While real materials optimization scenarios can present additional challenges, this combined validation approach provides valuable insights into both the fundamental performance characteristics of our framework and its applicability to materials-specific optimization problems. The complementary nature of these test environments (abstract benchmark functions and materials-informed neural network models) establishes a foundation for future applications in more extensive (automated) materials discovery campaigns.
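A direct transcription of this figure of merit is sketched below; the normalization constants are placeholders, not the values used in the paper:

```python
def fom_hea(sigma_y, sigma_u, elongation,
            sigma_y_norm, sigma_u_norm, elongation_norm):
    """Equal-weight figure of merit for HEA design: average of the three
    properties, each scaled by its normalization value."""
    return (sigma_y / sigma_y_norm
            + sigma_u / sigma_u_norm
            + elongation / elongation_norm) / 3.0

# Example with purely illustrative (hypothetical) values:
# fom = fom_hea(sigma_y=900.0, sigma_u=1200.0, elongation=0.25,
#               sigma_y_norm=1000.0, sigma_u_norm=1500.0, elongation_norm=0.5)
```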
Performance comparison of model-based RL and BO
Figure 2 compares the performance of model-based RL with Bayesian Optimization using the EI acquisition function on the Ackley and Rastrigin functions across different dimensionalities (D = 4–10). The y-axis shows the best-so-far value over 200 experimental iterations, with the green dashed line indicating the global optimum. The shaded areas represent the standard deviation across multiple runs. In lower dimensions (D = 4, 5) of the Ackley function, the final performance of the methods is relatively comparable, with model-based RL showing notably faster convergence and consistently settling closer to the global optimum. As the dimensionality increases (D ≥ 6), the performance gap between the two methods becomes more pronounced. While model-based RL maintains its ability to approximate the global optimum even in higher dimensions, BO-EI’s performance deteriorates significantly, as shown in Fig. 2a. This can be attributed to two key technical challenges in the BO-EI implementation (based on the BoTorch14 implementation in Daulton et al.37): First, the EI values often approach zero in later iterations (Fig. S1), creating a numerical issue that significantly hinders gradient-based inner optimization. Second, we speculate that the “gradient optimization followed by discretization near the best” strategy employed for the inner argmax operation in BO-EI may not effectively handle the discrete nature of our search space, particularly in higher dimensions. This behavior in higher dimensions suggests that model-based RL’s sequential decision-making strategy, which leverages updated surrogate models at each iteration, is more robust to the curse of dimensionality compared to BO’s myopic optimization approach. The results highlight the advantages of RL’s handling of discrete, high-dimensional spaces, while also revealing important numerical and implementation challenges of traditional BO-EI approaches in such scenarios.
Similar performance patterns are observed on the Rastrigin function (Fig. 2b), although with some notable differences. Despite the function’s more pronounced local optima structure compared to Ackley’s relatively flat landscape (Fig. 2), model-based RL maintains its performance advantage, particularly in higher dimensions (D ≥ 6).
To address the issue of vanishing EI values shown in Fig. S1, we replaced EI with the logarithmic Expected Improvement (logEI)38 as the modified acquisition function. Because EI values that underflow toward zero in floating-point arithmetic carry almost no gradient information, working with the logarithm of EI (which tends to negative infinity as EI approaches zero from ℝ+) keeps the acquisition surface numerically well behaved for gradient-based optimization. Our comparative analysis (Fig. 3) reveals interesting patterns: For the Ackley function, BO-logEI shows improved convergence compared to standard BO-EI, suggesting that the numerical challenges can be mitigated through this modification. However, for the Rastrigin function, both BO variants still struggle to match the performance of RL, particularly in higher dimensions (D = 10). This suggests that the dimensional issues for BO-EI extend beyond the numerical limitations observed in the Ackley function, likely stemming from the difficulty of optimizing acquisition functions over discrete, high-dimensional spaces.
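The substitution of logEI for EI amounts to swapping the acquisition object in the BO loop. The sketch below assumes a recent BoTorch version (≥ 0.9, which provides LogExpectedImprovement), a maximization convention, and a simple rounding step onto the discrete grid to mirror the "gradient optimization followed by discretization" strategy described above; it is illustrative, not the configuration used in the paper:

```python
import torch
from botorch.models import SingleTaskGP
from botorch.fit import fit_gpytorch_mll
from botorch.acquisition import ExpectedImprovement, LogExpectedImprovement
from botorch.optim import optimize_acqf
from gpytorch.mlls import ExactMarginalLogLikelihood

def propose_candidate(train_X, train_Y, bounds, use_log_ei=True):
    """Fit a GP on evaluated designs (train_X: n x D, train_Y: n x 1) and
    maximize EI or logEI over the box, then snap to the 51-point grid."""
    model = SingleTaskGP(train_X, train_Y)
    fit_gpytorch_mll(ExactMarginalLogLikelihood(model.likelihood, model))
    best_f = train_Y.max()
    acq = (LogExpectedImprovement(model, best_f=best_f) if use_log_ei
           else ExpectedImprovement(model, best_f=best_f))
    cand, _ = optimize_acqf(acq_function=acq, bounds=bounds, q=1,
                            num_restarts=10, raw_samples=256)
    grid = torch.linspace(-5.0, 5.0, 51)
    idx = torch.argmin((cand.unsqueeze(-1) - grid).abs(), dim=-1)
    return grid[idx]                               # nearest grid point per dimension

# Box bounds for a 10-dimensional problem on [-5, 5]^10.
bounds = torch.stack([torch.full((10,), -5.0), torch.full((10,), 5.0)])
```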
a Performance on the Ackley function (D = 10); b Performance on the Rastrigin function (D = 10). The solid lines represent the mean performance over multiple runs, while the shaded areas indicate the standard deviation over 96 parallel runs. The dashed green line represents the global optimal value.
Analysis of search patterns
To better understand the two methods’ different exploration behaviors, we visualized their search patterns in the 10-dimensional Rastrigin function space using t-SNE dimensionality reduction (Fig. 4). The scatter plots (Fig. 4c, d) show the sampled points across different experimental iterations (#30–#150), with colors indicating their proximity to the global optimum (red being closer). The corresponding density estimation maps (Fig. 4e) highlight the relative sampling preferences of both methods, where yellow and blue regions indicate areas more frequently explored by model-based RL and BO-EI, respectively. The visualization reveals distinct exploration strategies: BO-EI tends to concentrate its sampling in a more centralized region of the search space, whereas model-based RL exhibits a more dispersed pattern. Both methods identify different high-value regions in the complex landscape of the Rastrigin function, though model-based RL appears to maintain better performance in finding near-optimal solutions in this high-dimensional setting, as evidenced by our previous performance analysis.
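The projection itself can be produced with scikit-learn's t-SNE; the sketch below is only a schematic of the idea, and the perplexity and the joint embedding of both methods' samples are assumptions:

```python
import numpy as np
from sklearn.manifold import TSNE

def embed_samples(samples_rl, samples_bo, perplexity=30, seed=0):
    """Project the 10-D sampled design vectors of both methods into 2-D with a
    shared t-SNE embedding so their search patterns can be compared directly."""
    X = np.vstack([samples_rl, samples_bo])
    emb = TSNE(n_components=2, perplexity=perplexity,
               init="pca", random_state=seed).fit_transform(X)
    return emb[:len(samples_rl)], emb[len(samples_rl):]
```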
To understand why model-based RL outperforms BO-EI in high-dimensional Rastrigin function optimization, we investigated the ability of each method’s surrogate model to reconstruct the function landscape across different regions. We defined five nested regions centered around x = 0 (Fig. 5f), where region ‘a’ represents a high-value area near the global optimum, and regions ‘b’ through ‘e’ progressively expand to cover larger portions of the design space. The root mean square error (RMSE) values between surrogate predictions and true function values were calculated using random sampling points within each region throughout the black-box optimization process. Interestingly, while both methods show comparable prediction accuracy in the high-value region ‘a’ (p > 0.05), BO’s surrogate model demonstrates superior landscape reconstruction in intermediate regions ‘b–d’ (p < 0.05). This aligns with our previous observation that BO tends to concentrate its exploration in more centralized areas (Fig. 4c). However, when considering the entire design space (region ‘e’), model-based RL’s surrogate exhibits significantly better modeling capability (p < 0.05), suggesting a more balanced trade-off between local accuracy and global representation.
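A sketch of this nested-region evaluation is shown below; the sampling count and the region half-widths are illustrative assumptions:

```python
import numpy as np

def region_rmse(surrogate_predict, true_fn, half_width,
                n_dims=10, n_samples=1000, seed=0):
    """RMSE between surrogate predictions and true function values on random
    points drawn from [-half_width, half_width]^n_dims, centered at the optimum
    (x = 0). Half-widths from small ('a') up to 5.0 ('e', the full space)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-half_width, half_width, size=(n_samples, n_dims))
    pred = np.asarray([surrogate_predict(x) for x in X])
    true = np.asarray([true_fn(x) for x in X])
    return float(np.sqrt(np.mean((pred - true) ** 2)))
```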
RMSE values between surrogate predictions and true function values are shown for (a) high-value region near the optimum, (b–d) intermediate regions, and (e) entire design space. f A schematic illustration of the one-dimensional Rastrigin function showing the five nested regions (a–e). g Box plots comparing the final RMSE values between BO and model-based RL across the five regions. Statistical significance was assessed using unpaired T-tests, with check marks (✓) indicating significant differences and cross marks (×) indicating non-significant differences between the two methods.
Hybrid optimization combining BO and RL
Building upon our previous observations of the complementary strengths of BO and model-based RL, we investigated a hybrid approach that combines both methods. Given that BO typically performs better in early iterations while potentially facing challenges in later stages, we propose a sequential combination strategy. Three different approaches were compared: (1) random sampling followed by BO, (2) random sampling followed by model-based RL, and (3) a three-stage approach combining random sampling, BO, and model-based RL. The third choice is motivated by several practical considerations: BO can effectively initiate the optimization process with small sample sizes, while RL, despite its initial sample inefficiency, provides robust performance once properly trained39. This is evidenced by the frequency distribution plots (Fig. 4e, iteration #150), where RL demonstrates more focused behavior in later stages, with notably higher sampling frequency in the vicinity of global optima compared to BO. As shown in Fig. 6, the results on the Rastrigin function with dimensionality 10 demonstrate that the hybrid approach (Random 20 + BO 80 + RL 200) outperforms the other two strategies until about 225 experimental samples and then performs equally well for the remainder of the optimization process. This superior performance suggests that leveraging BO’s early-stage exploration capabilities followed by model-based RL’s robust late-stage optimization creates a synergistic effect. While the optimal timing for strategy switching remains an open question, these results provide evidence that combining BO and model-based RL offers a promising solution for complex design space optimization problems.
On-the-fly RL for HEA design
Finally, to further validate the effectiveness of our RL framework independent of surrogate model quality, we implemented an on-the-fly DQN approach where rewards are directly obtained from experimental measurements (in this case, simulated by the neural network predictor for HEA). Figure 7 compares the optimization performance across different numbers of compositional elements (4–10 components), using the previously defined FOM_HEA that combines yield strength, ultimate tensile strength, and elongation.
The on-the-fly optimization follows a similar framework as Algorithm 2, but replaces the GPR surrogate model predictions with actual experimental measurements for reward calculation. Specifically, instead of using r = M(s′) − M(s) from the surrogate model, the episodic reward is derived directly from experimental results. This modification allows the RL agent to learn directly from materials performance, though at the cost of increased experimental iterations. In turn, the model-based approach (Algorithm 2 in Table 2) can be viewed as a pre-training step for on-the-fly optimization, potentially accelerating the learning process when transitioning to real experiments.
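As a minimal illustration of this substitution (function names such as run_experiment and surrogate_predict are assumptions, not the authors' code), the episodic reward can be switched between the two loops as follows:

```python
def episodic_reward(x_design, run_experiment=None, surrogate_predict=None):
    """Reward granted at the end of a design episode.
    On-the-fly loop: the measured figure of merit of the completed design
    (here the HEA neural-network predictor stands in for real synthesis and
    characterization). Model-based loop: a surrogate-based value; note that
    Algorithm 2 uses incremental rewards r = M(s') - M(s), which this sketch
    simplifies to a terminal prediction."""
    if run_experiment is not None:
        return float(run_experiment(x_design))       # on-the-fly feedback
    return float(surrogate_predict(x_design))        # surrogate-based feedback
```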
Starting with 20 initial experimental points, both BO-EI and on-the-fly DQN significantly outperform random sampling across all dimensionalities. Notably, while BO-EI shows strong early-stage performance, particularly in lower-dimensional spaces (4–6 components), the advantages of RL become increasingly pronounced as the number of components increases. In the 10-component case, on-the-fly DQN achieves statistically superior performance (p < 0.01) compared to both BO-EI and random sampling, suggesting that the RL framework’s sequential decision-making strategy is particularly effective in navigating complex, high-dimensional composition spaces.
It is important to note that this superior performance comes at the cost of requiring substantially more experimental iterations for the RL agent to learn an effective Q-table, particularly in high-dimensional spaces. This limitation highlights a key challenge in applying RL to real-world materials optimization problems and suggests an important direction for future research: developing more sample-efficient RL algorithms while maintaining their advantages in handling complex design spaces.
While Fig. 7 demonstrates the effectiveness of on-the-fly RL for HEA composition design, it is necessary to understand how RL explores the design space compared to BO-EI. We therefore visualize the sampling patterns of both methods using t-SNE dimensionality reduction for the 10-component HEA composition case (Fig. 8). The patterns reveal distinct search tendencies: BO-EI tends to focus on exploitation in regions with initially promising FOM values, while RL maintains a more balanced exploration-exploitation approach across the composition space. Notably, RL identifies and explores promising high-FOM regions while maintaining broader search coverage globally. This difference is particularly evident in the density maps (Fig. 8e), where RL shows a broader distribution of sampling points while still maintaining sufficient sampling density in high-FOM regions. This pattern aligns with RL’s rapid convergence to optimal compositions in later iterations. Between iterations #900 and #1200, RL shifts from a more exploratory pattern to an exploitative one as it starts to converge towards the best FOM found.
a Complete sampling distribution with the best FOM point marked with a star. b Density estimation map of sampling preferences at iteration #1500. c, d Time evolution of sampling points for BO-EI and model-based RL, respectively, shown at iterations #300, #600, #900, #1200, and #1500. e Corresponding density maps showing the relative exploration preferences of both methods. Colors in scatter plots indicate the FOM values (red: high, blue: low), while density maps show the differences in sampling frequency (yellow: RL-preferred, blue: BO-preferred regions).
Parametric study of exploration strategies
The exploration-exploitation trade-off is fundamental to optimization. While BO leverages uncertainty for efficient exploration, our model-based RL strategy uses a different mechanism: random actions in the proposing stage with probability ε. Previously, to choose a candidate for one round of iterative experiments (corresponding to one outer loop in Fig. 1), we used a 10% probability to let the agent choose a random action, and a 90% probability to choose a greedy action based on the trained Q-value network. We further investigate the impact of different ε values on optimization performance. The results shown in Fig. 9a reveal that moderate exploration rates (ε = 0.1–0.3) consistently achieve superior performance, with the best-so-far values converging more rapidly and reliably towards the global optimum. Higher ε values (>0.5) lead to increasingly dispersed sampling patterns, resulting in slower convergence and suboptimal final performance. This behavior suggests that maintaining a balanced exploration-exploitation trade-off by tuning ε is crucial for efficient optimization.

Another critical aspect is the ability to conduct parallel experiments, where simultaneous synthesis and characterization can significantly reduce the total experimental time and resource consumption. For proposing parallel experiments with batch size B, the “Design stage” in Table 2 should be executed B times, allowing the algorithm to propose multiple designs before updating the surrogate model. Figure 9b examines parallel experiments through varying batch sizes (1, 2, 4, 8, 16). The x-axis represents the cumulative number of synthesized samples across all batch experiments, where each point reflects the total number of materials that have been synthesized regardless of batch size. The inset of Fig. 9b provides a complementary view by plotting the best-so-far values against the number of experimental rounds (where each round corresponds to one batch of parallel experiments), focusing on the initial 40 rounds to highlight the early-stage convergence behavior. The results show that smaller batch sizes lead to fewer synthesized samples per round but require more total rounds of batch experiments, underutilizing available experimental resources. Notably, intermediate batch sizes (4–8) provide a balanced performance between the number of batch rounds and synthesized samples. The marginal performance gain observed with larger batch sizes (16) can be attributed to delayed feedback. These findings have significant implications for practical implementation in automated materials platforms, where efficient utilization of parallel synthesis and characterization capabilities is essential. These parametric studies not only provide practical guidelines for implementing our framework but also offer empirical insights into the exploration-exploitation trade-off in materials optimization.
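A sketch of such batched proposing, in which the design stage is executed B times with a frozen agent and surrogate before the surrogate is refit, is shown below; the interfaces (env.reset/step, agent.select_action) follow the earlier episode sketch and are assumptions:

```python
def propose_batch(agent, env, batch_size, epsilon=0.1):
    """Run the design stage batch_size times, collecting B candidate designs to
    synthesize in parallel; the surrogate is refit only after the whole batch
    has been evaluated."""
    batch = []
    for _ in range(batch_size):
        state, done = env.reset(), False
        while not done:
            action = agent.select_action(state, epsilon=epsilon)
            state, _, done = env.step(action)
        batch.append(env.x.copy())   # completed design vector (cf. the episode sketch)
    return batch
```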
a Impact of different epsilon values (ε = 0.05–0.9) during the design stage, where higher ε indicates more random exploration. b Effect of batch size (1, 2, 4, 8, 16) on optimization performance, investigating the trade-off between parallel experimentation and learning efficiency. In both plots, the y-axis shows the best-so-far values (lower is better) over the number of conducted experiments, with the global optimum indicated by the dashed line.
Discussion
Our results reveal several insights into the application of RL for materials design. The primary finding is that RL-based approaches show considerable promise in navigating high-dimensional materials design spaces (D ≥ 6) compared to BO with EI acquisition function. This advantage stems from RL’s ability to adapt through environmental interactions, leading to more dispersed sampling patterns and better landscape learning. Particularly noteworthy is the synergistic effect observed when combining BO’s systematic early-stage exploration with RL’s adaptive learning capabilities, suggesting a promising direction for hybrid model-based strategies in materials discovery. For complex design tasks with dimensionality exceeding 6, our findings indicate an RL-based approach is favored. Furthermore, the HEA design application indicates its potential applicability to broader materials systems where composition-property relationships are complex. As self-driving laboratories with automated synthesis and characterization for alloys continue to advance with the prospect of enabling on-the-fly experimental feedback, we expect RL to play an increasingly crucial role for exploring high-dimensional design spaces.
Methods
Standard test functions
The analytical form of the Ackley function is defined as

\( f(\mathbf{x})=-20\exp\left(-0.2\sqrt{\frac{1}{d}\sum_{i=1}^{d}x_{i}^{2}}\right)-\exp\left(\frac{1}{d}\sum_{i=1}^{d}\cos(2\pi x_{i})\right)+20+e, \)

where d denotes the dimensionality, and the analytical form of the Rastrigin function is defined as

\( f(\mathbf{x})=10d+\sum_{i=1}^{d}\left[x_{i}^{2}-10\cos(2\pi x_{i})\right]. \)
Both functions achieve their global minimum value of f(x*) = 0 at x* = (0, …, 0). The two functions exhibit distinctly different landscape characteristics, making them ideal benchmarks for evaluating optimization algorithm performance. The Ackley function (Fig. 2a) presents a funnel-like global structure with numerous local minima, while the Rastrigin function (Fig. 2b) displays a periodic distribution of local minima. In our study, the input space x was discretized into 51 values for each dimension within the range of [−5, 5]. The test functions help validate algorithm performance across both low- and high-dimensional optimization problems (D = 3–10 dimensions). When the dimensionality reaches 10, the search space expands to 1.19 × 10^17 possible combinations, presenting a compelling challenge for the optimization task.
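For reference, a NumPy implementation of the two benchmark functions in their standard forms (consistent with the definitions above) looks as follows:

```python
import numpy as np

def ackley(x, a=20.0, b=0.2, c=2.0 * np.pi):
    """Ackley function; global minimum f(0) = 0."""
    x = np.asarray(x, dtype=float)
    d = x.size
    return (-a * np.exp(-b * np.sqrt(np.sum(x ** 2) / d))
            - np.exp(np.sum(np.cos(c * x)) / d) + a + np.e)

def rastrigin(x):
    """Rastrigin function; global minimum f(0) = 0."""
    x = np.asarray(x, dtype=float)
    return 10.0 * x.size + np.sum(x ** 2 - 10.0 * np.cos(2.0 * np.pi * x))

# Both evaluate to ~0 at the origin, e.g. ackley(np.zeros(10)) and rastrigin(np.zeros(10)).
```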
HEA test environment
We followed the procedures outlined in ref. 36 to train the neural networks that establish the HEA test environment.
On-the-fly DQN agent
The on-the-fly DQN agent used for testing consists of a Q-network with three fully connected layers (256-1024-N, where N is the number of usable composition values that varies with the dimensionality of the design space, as detailed in Table S1), utilizing Rectified Linear Unit (ReLU) activation functions between layers. The network weights are initialized using a uniform distribution, with limits calculated from layer-specific parameters. To balance exploration and exploitation, we employed an ε-greedy policy, where ε is annealed from 0.95 to 0.01 with a decay rate of 0.985. The agent uses experience replay with a buffer size of 20,000 transitions and a batch size of 128 for training. The learning process employs the Adam optimizer with a learning rate of 10^-3. To stabilize training, a target network is updated every 10 steps, with a discount factor (γ = 0.8) for future rewards. Pseudo-experiments are conducted only at the end of each episode, so the agent learns from episodic rewards. The Q-network is trained to predict action values directly from experimental measurements, eliminating the need for a separate surrogate model during the optimization process.
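A minimal PyTorch sketch of a Q-network with the stated layer widths and optimizer settings is given below; the state dimensionality and the number of actions shown are illustrative assumptions, and the full training loop (replay buffer, target-network sync, ε annealing) is omitted:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Q-network with the layer widths stated above (256-1024-N outputs)."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 1024), nn.ReLU(),
            nn.Linear(1024, n_actions),   # one Q-value per usable composition value
        )

    def forward(self, state):
        return self.net(state)

q_net = QNetwork(state_dim=11, n_actions=51)     # example sizes (assumptions)
target_net = QNetwork(state_dim=11, n_actions=51)
target_net.load_state_dict(q_net.state_dict())   # target network, synced every 10 steps
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.8                                      # discount factor from the text
# ε annealed from 0.95 towards 0.01, e.g. epsilon = max(0.01, 0.95 * 0.985 ** episode)
```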
Model-based DQN agent
The model-based DQN agent employs a similar Q-network structure with two smaller hidden layers (64 and 32 neurons). Similar to the on-the-fly agent, this network employs ReLU activation functions between layers, with weights initialized through a scaled uniform distribution to enhance convergence stability. We implemented an ε-greedy policy with annealing from 0.95 to 0.01, but with a gentler decay rate of 0.99. The agent maintains a replay memory of 20,000 transitions with a training batch size of 128. Learning uses the Adam optimizer at a reduced learning rate of 5 × 10^-4 to improve stability. Another distinguishing feature of this agent is the discount factor of γ = 1.0 (no discounting), facilitating better long-term reward tracking.
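For quick comparison, the hyperparameters that differ between the two agents, as stated above, can be summarized as follows (a summary only, not the authors' configuration files):

```python
# Settings stated in the Methods text; everything not listed is shared
# (replay buffer of 20,000, batch size 128, ReLU activations, ε from 0.95 to 0.01).
MODEL_BASED_DQN = dict(hidden_layers=(64, 32),   eps_decay=0.99,  learning_rate=5e-4, gamma=1.0)
ON_THE_FLY_DQN  = dict(hidden_layers=(256, 1024), eps_decay=0.985, learning_rate=1e-3, gamma=0.8)
```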
BO with EI acquisition function
For the BO with EI acquisition function, we utilized the code from ref. 37, making only necessary modifications to implement the black-box tests considered in this work.
Data availability
All elemental features and experimental composition-property datasets used in developing the HEA test environment are publicly available in the ‘OnTheFlyRL’ directory of our open-source repository at: https://github.com/wsxyh107165243/RLvsBO4Materials.
Code availability
Codes used in this study can be found at: https://github.com/wsxyh107165243/RLvsBO4Materials.
References
Gao, Y.-C. et al. Data-driven insight into the reductive stability of ion–solvent complexes in lithium battery electrolytes. J. Am. Chem. Soc. 145, 23764–23770 (2023).
Gurnani, R. et al. AI-assisted discovery of high-temperature dielectrics for energy storage. Nat. Commun. 15, 6107 (2024).
Chen, A., Zhang, X. & Zhou, Z. Machine learning: accelerating materials development for energy storage and conversion. InfoMat 2, 553–576 (2020).
Sekine, S. et al. Na[Mn0.36Ni0.44Ti0.15Fe0.05]O2 predicted via machine learning for high energy Na-ion batteries. J. Mater. Chem. A. 12, 31103–31107 (2024).
Sendek, A. D. et al. Machine learning modeling for accelerated battery materials design in the small data regime. Adv. Energy Mater. 12, 2200553 (2022).
Brunton, S. L. et al. Data-driven aerospace engineering: reframing the industry with machine learning. AIAA J. 59, 2820–2847 (2021).
Lookman, T., Balachandran, P. V., Xue, D. & Yuan, R. Active learning in materials science with emphasis on adaptive sampling using uncertainties for targeted design. npj Comput. Mater. 5, 21 (2019).
Wang, H. et al. Scientific discovery in the age of artificial intelligence. Nature 620, 47–60 (2023).
Liu, X., Zhang, J. & Pei, Z. Machine learning for high-entropy alloys: Progress, challenges and opportunities. Prog. Mater. Sci. 131, 101018 (2023).
Volk, A. A. et al. AlphaFlow: autonomous discovery and optimization of multi-step chemistry using a self-driven fluidic lab guided by reinforcement learning. Nat. Commun. 14, 1403 (2023).
Lee, D., Chen, W., Wang, L., Chan, Y. C. & Chen, W. Data‐Driven Design for Metamaterials and Multiscale Systems: A Review. Adv. Mater. 36, 2305254 (2024).
Lookman, T. Mesoscopic Phenomena in Multifunctional Materials: Synthesis, Characterization, Modeling and Applications (Springer, 2014).
Frazier, P. I. & Wang, J. Information science for materials discovery and design (Springer, 2016).
Balandat, M. et al. BoTorch: A framework for efficient Monte-Carlo Bayesian optimization. Proc. Adv. Neural Inform. Process. Syst. 33 (2020).
Xue, D. et al. Accelerated search for materials with targeted properties by adaptive design. Nat. Commun. 7, 1–9 (2016).
Lookman, T., Balachandran, P. V., Xue, D., Hogden, J. & Theiler, J. Statistical inference and adaptive design for materials discovery. Curr. Opin. Solid St. M. 21, 121–128 (2017).
Yuan, X. et al. Active Learning-Based Guided Synthesis of Engineered Biochar for CO2 Capture. Environ. Sci. Technol. 58, 6628–6636 (2024).
Gongora, A. E. et al. A Bayesian experimental autonomous researcher for mechanical design. Sci. Adv. 6, eaaz1708 (2020).
Frazier, P. I. A tutorial on Bayesian optimization Preprint at https://arxiv.org/abs/1807.02811 (2018).
Eriksson, D., Pearce, M., Gardner, J., Turner, R. D. & Poloczek, M. Scalable global optimization via local Bayesian optimization. In Proc. Advances in Neural Information Processing Systems Vol. 32 (NeurIPS, 2019).
Kirschner, J., Mutny, M., Hiller, N., Ischebeck, R. & Krause, A. Adaptive and safe Bayesian optimization in high dimensions via one-dimensional subspaces. In International Conference on Machine Learning (PMLR, 2019).
Gelbart, M. A., Snoek, J. & Adams, R. P. Bayesian optimization with unknown constraints Preprint at https://arxiv.org/abs/1403.5607 (2014).
Kandasamy, K. et al. Myopic posterior sampling for adaptive goal oriented design of experiments. In International Conference on Machine Learning (PMLR, 2019).
Rao, Z. et al. Machine learning–enabled high-entropy alloy discovery. Science 378, 78–85 (2022).
Sanchez-Lengeling, B. & Aspuru-Guzik, A. Inverse molecular design using machine learning: Generative models for matter engineering. Science 361, 360–365 (2018).
Moosavi, S. M., Jablonka, K. M. & Smit, B. The role of machine learning in the understanding and design of materials. J. Am. Chem. Soc. 142, 20273–20287 (2020).
Wei, Q. et al. Divide and conquer: Machine learning accelerated design of lead-free solder alloys with high strength and high ductility. npj Comput. Mater. 9, 201 (2023).
Popova, M., Isayev, O. & Tropsha, A. Deep reinforcement learning for de novo drug design. Sci. Adv. 4, eaap7885 (2018).
Xian, Y. et al. Compositional design of multicomponent alloys using reinforcement learning. Acta Mater. 274, 120017 (2024).
Rajak, P. et al. Autonomous reinforcement learning agent for stretchable kirigami design of 2D materials. npj Comput. Mater. 7, 1–8 (2021).
Vinyals, O. et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575, 350–354 (2019).
Won, D.-O., Müller, K.-R. & Lee, S.-W. An adaptive deep reinforcement learning framework enables curling robots with human-like performance in real-world conditions. Sci. Robot. 5, eabb9764 (2020).
Tian, Y. et al. Determining multi‐component phase diagrams with desired characteristics using active learning. Adv. Sci. 8, 2003165 (2021).
Mnih, V. et al. Playing Atari with deep reinforcement learning Preprint at https://arxiv.org/abs/1312.5602 (2013).
Angermueller, C. et al. Model-based reinforcement learning for biological sequence design. In Proc. International Conference on Learning Representations (ICLR, 2020).
Wang, J., Kwon, H., Kim, H. S. & Lee, B.-J. A neural network model for high entropy alloy design. npj Comput. Mater. 9, 60 (2023).
Daulton, S. et al. Bayesian optimization over discrete and mixed spaces via probabilistic reparameterization. Proc. Adv. Neural Inform. Process. Syst. 35, 12760–12774 (2022).
Ament, S., Daulton, S., Eriksson, D., Balandat, M. & Bakshy, E. Unexpected improvements to expected improvement for bayesian optimization. Proc. Adv. Neural Inform. Process. Syst. 36 (2023).
Guo, D. et al. DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning Preprint at https://arxiv.org/abs/2501.12948 (2025).
Acknowledgements
We thank Peter Frazier (Cornell) for discussion. We also acknowledge the support of National Key Research and Development Program of China (2021YFB3802102), National Natural Science Foundation of China (Nos.51931004, 52350710205, 52173228, and 52271190), Innovation Capability Support Program of Shaanxi (2024ZG-GCZX-01(1)-06) and Natural Science Foundation Project of Shaanxi Province (Grant No. 2022JM-205).
Author information
Authors and Affiliations
Contributions
Y.X. developed methodology, wrote software code, performed investigation and validation. X.D. acquired funding. X.J. contributed to methodology development. Y.Z. conducted investigation. J.S. acquired funding. D.X. developed methodology, acquired funding and provided supervision. T.L. conceptualized the project, developed methodology, provided supervision, wrote the original draft and performed writing–review & editing. All authors reviewed and approved the final manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Xian, Y., Ding, X., Jiang, X. et al. Unlocking the black box beyond Bayesian global optimization for materials design using reinforcement learning. npj Comput Mater 11, 143 (2025). https://doi.org/10.1038/s41524-025-01639-w