Abstract
Materials design often becomes an expensive black-box optimization problem due to the difficulty of balancing exploration-exploitation trade-offs in high-dimensional spaces. We propose a reinforcement learning (RL) framework that effectively navigates complex design spaces through two complementary approaches: a model-based strategy utilizing surrogate models for sample-efficient exploration, and an on-the-fly strategy when direct experimental feedback is available. This approach demonstrates better performance in high-dimensional spaces (D ≥ 6) than Bayesian optimization (BO) with the Expected Improvement (EI) acquisition function, through more dispersed sampling patterns and better landscape learning capabilities. Furthermore, we observe a synergistic effect when combining BO's early-stage exploration with RL's adaptive learning. Evaluations on both standard benchmark functions (Ackley, Rastrigin) and real-world high-entropy alloy data demonstrate statistically significant improvements (p < 0.01) over traditional BO with EI, particularly in complex, high-dimensional scenarios. This work addresses limitations of existing methods while providing practical tools for guiding experiments.
Introduction
Optimization and discovery are key to finding materials for critical applications ranging from energy storage1,2,3,4,5 to aerospace engineering6. The process faces two challenges: first, it heavily relies on experimental data acquisition and analysis, often lacking closed-form physical models for precise predictions7,8; second, it involves vast and complex design spaces9,10,11 for which the true function is unknown or a “black box”. The relationship between input parameters (chemical compositions and processing conditions) and resulting materials performance is non-trivial, and objective evaluation is both time-consuming and resource-intensive12,13,14.
Bayesian Optimization (BO) has been widely used in materials science to navigate complex design spaces12,15,16,17,18. While BO efficiently balances exploration and exploitation in single-step optimization scenarios, making it suitable for materials optimization tasks with relatively small search spaces13,19, it faces several inherent limitations. First, its performance can degrade significantly as the dimensionality of the search space increases, struggling to efficiently explore high-dimensional spaces and potentially missing promising regions20,21. Second, despite recent advances in multi-step optimization, BO primarily focuses on myopic (single-step) decision-making, which may not capture the interdependence of consecutive design decisions22,23. Furthermore, the absence of a clear stopping criterion in BO frameworks makes it challenging to determine when the optimization process should terminate.
In recent years, a number of artificial intelligence techniques have been increasingly applied to materials design to address these limitations24,25,26,27. Among these, Reinforcement Learning (RL) has emerged as a promising paradigm for tackling complex optimization problems10,28,29,30. It has demonstrated remarkable success in various domains, from playing top-level adversarial games31 to controlling real-time robotics32, where it has shown an ability to navigate large, complex state spaces and make decisions under uncertainty. Notably, RL's capacity to incorporate future outcome predictions into its current decision-making process offers unique advantages in sequential optimization tasks. However, the application of RL in materials design optimization remains largely unexplored, particularly its potential to learn and adapt from environmental feedback in materials discovery processes. While both RL and BO can leverage models for exploration, RL's ability to develop adaptive strategies through environmental interactions and consider long-term consequences makes it particularly suitable for materials optimization. Furthermore, the continuous development of autonomous self-driving laboratories (SDLs) in materials science and the possibility of on-the-fly experimental feedback in the near future require a timely examination of whether RL can navigate high-dimensional design spaces more efficiently than traditional BO. In this work, we explore the potential of RL to unlock the black box of materials design spaces beyond the capabilities of BO.
We address several aspects of the aforementioned challenges:
- We conceptualize the application of RL to iterative materials design in a framework that encompasses both model-based and on-the-fly RL approaches. It provides an overview of how RL integrates into the materials design process.
- Using optimization functions with very different feature landscapes as well as materials data, we demonstrate that RL is particularly promising in navigating high-dimensional spaces. By focusing on high-value regions, RL offers a more efficient exploration strategy compared to BO, uncovering optimal solutions missed by BO. This aligns well with the emerging trend of autonomous experimentation in materials science, where RL's efficient decision-making can fully leverage the increased throughput of automated laboratories.
- We propose a hybrid algorithm that combines the strengths of both BO and RL. This aims to leverage the early-stage exploration strength of BO with the later-stage adaptive self-learning of RL, offering a more robust and efficient strategy combining the separate strengths of both methods.
In this study, we implement BO with Expected Improvement (EI) as the acquisition function, which is one of the most commonly used acquisition functions in materials design7,33. Our numerical experiments on the benchmark functions and materials data reveal several key findings that emphasize the advantages of using RL for black-box optimization. First, model-based RL consistently outperforms traditional BO in high-dimensional cases (D ≥ 6) through more dispersed sampling patterns. Second, our RL agent exploits the high-value regions, and our surrogate model, when interacting with the agent, learns the overall landscape better than BO's surrogate model. This is particularly evident in complex landscapes, such as the Rastrigin function with dimensionality 10. Furthermore, we discover that a hybrid strategy creates a synergistic effect, leveraging the complementary strengths of each optimizer. Finally, in applications such as high-entropy alloy design, our framework shows increasing advantages as the number of components grows, achieving statistically significant improvements (p < 0.01) over BO in a 10-component design.
Results
Overall framework
Figure 1 shows two alternative training approaches. In the model-based approach (orange inner training loop), the RL agent interacts with a surrogate model trained on existing materials data to predict materials properties y(x) = f(x) + ε(x). The agent learns to navigate the design space by receiving rewards only from the surrogate model's predictions. The surrogate model enables RL to achieve sample-efficient exploration of the candidate search space without the need for costly experimental validation at each action step. That is, each action step assigns a specific value to one dimension of the N-dimensional input vector x ∈ ℝ^N, for example a composition or processing condition. The surrogate model is updated at the end of each design episode, where an episode concludes after determining all dimensions of x, thereby completing one full design iteration. This stands in contrast to a recent application of RL to the design of multi-step chemical processes for quantum dots, where the problem naturally possesses a sequential structure, and immediate experimental outcomes are available following each action, which then immediately update a belief model for the reward10.
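To make the episode structure concrete, the sketch below (an illustration under assumed interfaces, not the authors' implementation) frames one design episode as a sequential decision process in which each action fixes one dimension of x and the surrogate supplies the reward once the vector is complete:

```python
import numpy as np

class DesignEpisodeEnv:
    """Minimal sketch of a model-based design episode: each action assigns one
    dimension of x; a surrogate prediction provides the reward only when the
    full design vector has been specified (a simplification of the reward
    scheme described in the text)."""

    def __init__(self, n_dims, candidate_values, surrogate_predict):
        self.n_dims = n_dims                        # N, dimensionality of x
        self.values = candidate_values              # discrete values allowed per dimension
        self.surrogate_predict = surrogate_predict  # callable: x -> predicted property
        self.reset()

    def reset(self):
        self.x = np.full(self.n_dims, np.nan)       # partially specified design vector
        self.t = 0                                  # index of the dimension to set next
        return self._state()

    def _state(self):
        # State = design vector so far (unset entries encoded as 0) plus a progress counter.
        filled = np.nan_to_num(self.x, nan=0.0)
        return np.append(filled, self.t)

    def step(self, action_index):
        self.x[self.t] = self.values[action_index]
        self.t += 1
        done = self.t == self.n_dims
        # Reward only at the end of the episode, from the surrogate model.
        reward = float(self.surrogate_predict(self.x)) if done else 0.0
        return self._state(), reward, done
```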
The framework shows two alternative training approaches: an inner loop utilizing a surrogate model for sample-efficient RL agent learning (orange), and an on-the-fly loop interacting directly with the experimental environment (green). The RL agent can be trained through either approach or their combination to optimize materials properties through iterative design decisions. The dashed blue arrow and the gray argmax formula represent the ranking of candidates using an acquisition function (α(·)) based on model predictions in Bayesian Optimization.
As an alternative or complementary approach, the on-the-fly training loop (green) enables direct interaction between the RL agent and the experimental environment, which consists of actual materials synthesis, characterization, and/or computer simulations. In this loop, the Q-function (state-action value function) Q(s, a) ∈ ℝ is defined in terms of the current state s and the action a, and gives the expected cumulative discounted reward for taking action a in state s and following the optimal policy thereafter. Formally, \( Q(s,a)=\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t} r_{t} \mid s_{0}=s,\, a_{0}=a\right] \), where γ ∈ [0, 1) is the discount factor and \(r_t\) is the reward at time step t. The agent's actions a, which maximize the Q-function (\(\arg\max_{a} Q(s,a)\)), directly determine the materials to be synthesized or simulated, with the resulting performance measurements serving as environmental rewards. While this approach may require more resources and time compared to model-based training, it provides more accurate feedback for the RL agent's learning process. Depending on the specific materials optimization task and available resources, one can choose to implement either approach independently or combine them to leverage their respective advantages.
Problem formulation
Materials optimization is often formulated as a black-box optimization task over a complex design space. Let x ∈ X ⊆ ℝ^N denote an N-dimensional design vector representing material compositions and/or processing parameters, where X is the feasible design space with both physical and practical constraints. The objective is to find x that maximizes an unknown and expensive-to-evaluate function f: X → ℝ:

\( \mathbf{x}^{*}=\mathop{\arg\max}_{\mathbf{x}\in X} f(\mathbf{x}), \)
where f(x) also represents the material property or figure of merit of interest.
Model-based design
Model-based materials design relies on surrogate models to approximate expensive-to-evaluate objective functions f(x). While Gaussian Process Regression (GPR) has been widely adopted in BO frameworks, where an acquisition function α(·) guides the search for optimal designs (Fig. 1), this approach faces two nested optimization problems: inner-loop acquisition function maximization (Table 1 Line 6) and outer-loop experimental exploration (Table 1 Line 7). As an alternative approach, model-based RL can leverage the same surrogate models while offering more flexible exploration strategies through, for example, reward engineering. We implement a Deep Q-Network (DQN) agent that employs a neural network to approximate Q(s, a), mapping state-action pairs to their expected cumulative rewards34. As shown in Table 2, this RL agent learns a policy (represented by the Q table) to navigate the design space by receiving rewards based on the surrogate model’s predictions, which can be later adapted for on-the-fly optimization with real experimental measurement results as reward feedback. We specifically choose GPR as our surrogate model due to its prediction accuracy with small sample sizes—a common scenario in materials design. Unlike traditional model-based RL methods that place significant emphasis on modeling dynamics of stochastic state transitions, our approach leverages the deterministic nature of materials state transitions and prioritizes prediction accuracy. Also, in contrast to a previous study that applied model-based Proximal Policy Optimization (PPO) to train an RL agent for DNA sequence design35, our method differs in two aspects: (1) We employ a value-based approach (DQN) to explicitly model the Q-function, whereas the policy-based approach (PPO) learns the probability of selecting an action for a given state; (2) For exploration, we adopt an epsilon-greedy strategy, in contrast to the density-based exploration reward bonus scheme.
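As a concrete illustration of these ingredients (a sketch under assumed interfaces, not the authors' implementation), a GPR surrogate can be refit on the designs evaluated so far, and the DQN-style agent can act ε-greedily against its Q-network rather than maximizing an acquisition function:

```python
import numpy as np
import torch
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

def fit_surrogate(X_eval, y_eval):
    """Fit a GPR surrogate on the evaluated designs (X_eval: n x N, y_eval: n).
    The kernel choice here is an illustrative assumption."""
    kernel = ConstantKernel(1.0) * RBF(length_scale=1.0)
    gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    gpr.fit(X_eval, y_eval)
    return gpr

def select_action(q_network, state, n_actions, epsilon=0.1):
    """Epsilon-greedy action selection against a Q-network: value-based
    exploration, in contrast to BO's acquisition-function maximization."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)        # random exploratory action
    with torch.no_grad():
        q_values = q_network(torch.as_tensor(state, dtype=torch.float32))
    return int(torch.argmax(q_values).item())      # greedy action
```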
Test environments
To evaluate our framework’s effectiveness, we established a two-tier validation strategy combining benchmark test functions and a neural network predictor for alloys. The benchmark functions serve as controlled testing environments, allowing us to rigorously assess the framework’s optimization capabilities across different dimensionalities and landscape complexities. We evaluated our framework on two widely-used functions: the Ackley function and the Rastrigin function (Fig. 2 right panels), with dimensionalities ranging from 4 to 10. For each dimension, we discretized the search space into 51 evenly spaced values between [−5.0, 5.0], resulting in design spaces containing from 51^4 (~6.8 × 10^6) to 51^10 (~1.2 × 10^17) candidates. These functions, both featuring numerous local optima but with different landscape characteristics, provide challenging test beds for evaluating optimization algorithms in discrete, high-dimensional spaces.
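For reference, the grid construction and the quoted design-space sizes can be reproduced with a few lines (a simple sketch, not the authors' code):

```python
import numpy as np

# 51 evenly spaced candidate values per dimension on [-5.0, 5.0].
grid_values = np.linspace(-5.0, 5.0, 51)

for d in (4, 6, 8, 10):
    n_candidates = 51 ** d
    print(f"D = {d}: {n_candidates:.2e} candidate designs")
# D = 4 gives ~6.8e6 and D = 10 gives ~1.2e17 candidates, matching the sizes above.
```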
Performance comparison on (a) Ackley and (b) Rastrigin benchmark functions across different dimensionalities (D = 4–10). The y-axis shows the best-so-far value (global solution is at zero) over 200 experimental iterations, with shaded areas representing standard deviations across multiple runs. The rightmost 3-dimensional surface plots show the characteristic landscapes of Ackley (top) and Rastrigin (bottom) functions in 2-dimensional input space, illustrating their distinct optimization challenges.
Additionally, we adopted a neural network architecture from ref. 36 for high-entropy alloys (HEAs) that has given reliable predictions for multi-component alloys and which maps compositional parameters of up to ten elements to key mechanical properties: yield strength (\(\sigma_Y\)), ultimate tensile strength (\(\sigma_U\)), and elongation (\(\varepsilon\)). An HEA figure of merit (FOM_HEA) can be defined to capture the inherent trade-offs between strength and ductility as a weighted combination of the three normalized properties: \( \mathrm{FOM}_{\mathrm{HEA}}=\frac{1}{3}\left(\frac{\sigma_Y}{\sigma_{Y,N}}+\frac{\sigma_U}{\sigma_{U,N}}+\frac{\varepsilon}{\varepsilon_N}\right) \), where the subscript N indicates the normalization value for the corresponding property. While real materials optimization scenarios can present additional challenges, this combined validation approach provides valuable insights into both the fundamental performance characteristics of our framework and its applicability to materials-specific optimization problems. The complementary nature of these test environments (abstract benchmark functions and materials-informed neural network models) establishes a foundation for future applications in more extensive (automated) materials discovery campaigns.
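A direct transcription of this figure of merit is sketched below; the normalization constants are placeholders, not the values used in the paper:

```python
def fom_hea(sigma_y, sigma_u, elongation,
            sigma_y_norm, sigma_u_norm, elongation_norm):
    """Equal-weight figure of merit for HEA design: average of the three
    properties, each scaled by its normalization value."""
    return (sigma_y / sigma_y_norm
            + sigma_u / sigma_u_norm
            + elongation / elongation_norm) / 3.0

# Example with purely illustrative (hypothetical) values:
# fom = fom_hea(sigma_y=900.0, sigma_u=1200.0, elongation=0.25,
#               sigma_y_norm=1000.0, sigma_u_norm=1500.0, elongation_norm=0.5)
```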
Performance comparison of model-based RL and BO
Figure 2 compares the performance of model-based RL with Bayesian Optimization using the EI acquisition function on the Ackley and Rastrigin functions across different dimensionalities (D = 4–10). The y-axis shows the best-so-far value over 200 experimental iterations, with the green dashed line indicating the global optimum. The shaded areas represent the standard deviation across multiple runs. In lower dimensions (D = 4, 5) of the Ackley function, the final performance of the methods is relatively comparable, with model-based RL showing notably faster convergence and consistently settling closer to the global optimum. As the dimensionality increases (D ≥ 6), the performance gap between the two methods becomes more pronounced. While model-based RL maintains its ability to approximate the global optimum even in higher dimensions, BO-EI’s performance deteriorates significantly, as shown in Fig. 2a. This can be attributed to two key technical challenges in the BO-EI implementation (based on the BoTorch14 implementation in Daulton et al.37): First, the EI values often approach zero in later iterations (Fig. S1), creating a numerical issue that significantly hinders gradient-based inner optimization. Second, we speculate that the “gradient optimization followed by discretization near the best” strategy employed for the inner argmax operation in BO-EI may not effectively handle the discrete nature of our search space, particularly in higher dimensions. This behavior in higher dimensions suggests that model-based RL’s sequential decision-making strategy, which leverages updated surrogate models at each iteration, is more robust to the curse of dimensionality compared to BO’s myopic optimization approach. The results highlight the advantages of RL’s handling of discrete, high-dimensional spaces, while also revealing important numerical and implementation challenges of traditional BO-EI approaches in such scenarios.
Similar performance patterns are observed on the Rastrigin function (Fig. 2b), although with some notable differences. Despite the function’s more pronounced local optima structure compared to Ackley’s relatively flat landscape (Fig. 2), model-based RL maintains its performance advantage, particularly in higher dimensions (D ≥ 6).
To address the issue of vanishing EI values shown in Fig. S1, we replaced EI with the logarithmic Expected Improvement (logEI)38 as the modified acquisition function. Because EI values that underflow toward zero in floating-point arithmetic carry almost no gradient information, working with the logarithm of EI (which tends to negative infinity as EI approaches zero from ℝ+) keeps the acquisition surface numerically well behaved for gradient-based optimization. Our comparative analysis (Fig. 3) reveals interesting patterns: For the Ackley function, BO-logEI shows improved convergence compared to standard BO-EI, suggesting that the numerical challenges can be mitigated through this modification. However, for the Rastrigin function, both BO variants still struggle to match the performance of RL, particularly in higher dimensions (D = 10). This suggests that the dimensional issues for BO-EI extend beyond the numerical limitations observed in the Ackley function, likely stemming from the difficulty of optimizing acquisition functions over discrete, high-dimensional spaces.
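The substitution of logEI for EI amounts to swapping the acquisition object in the BO loop. The sketch below assumes a recent BoTorch version (≥ 0.9, which provides LogExpectedImprovement), a maximization convention, and a simple rounding step onto the discrete grid to mirror the "gradient optimization followed by discretization" strategy described above; it is illustrative, not the configuration used in the paper:

```python
import torch
from botorch.models import SingleTaskGP
from botorch.fit import fit_gpytorch_mll
from botorch.acquisition import ExpectedImprovement, LogExpectedImprovement
from botorch.optim import optimize_acqf
from gpytorch.mlls import ExactMarginalLogLikelihood

def propose_candidate(train_X, train_Y, bounds, use_log_ei=True):
    """Fit a GP on evaluated designs (train_X: n x D, train_Y: n x 1) and
    maximize EI or logEI over the box, then snap to the 51-point grid."""
    model = SingleTaskGP(train_X, train_Y)
    fit_gpytorch_mll(ExactMarginalLogLikelihood(model.likelihood, model))
    best_f = train_Y.max()
    acq = (LogExpectedImprovement(model, best_f=best_f) if use_log_ei
           else ExpectedImprovement(model, best_f=best_f))
    cand, _ = optimize_acqf(acq_function=acq, bounds=bounds, q=1,
                            num_restarts=10, raw_samples=256)
    grid = torch.linspace(-5.0, 5.0, 51)
    idx = torch.argmin((cand.unsqueeze(-1) - grid).abs(), dim=-1)
    return grid[idx]                               # nearest grid point per dimension

# Box bounds for a 10-dimensional problem on [-5, 5]^10.
bounds = torch.stack([torch.full((10,), -5.0), torch.full((10,), 5.0)])
```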
a Performance on the Ackley function (D = 10); b Performance on the Rastrigin function (D = 10). The solid lines represent the mean performance over multiple runs, while the shaded areas indicate the standard deviation over 96 parallel runs. The dashed green line represents the global optimal value.
Analysis of search patterns
To better understand the two methods’ different exploration behaviors, we visualized their search patterns in the 10-dimensional Rastrigin function space using t-SNE dimensionality reduction (Fig. 4). The scatter plots (Fig. 4c, d) show the sampled points across different experimental iterations (#30–#150), with colors indicating their proximity to the global optimum (red being closer). The corresponding density estimation maps (Fig. 4e) highlight the relative sampling preferences of both methods, where yellow and blue regions indicate areas more frequently explored by model-based RL and BO-EI, respectively. The visualization reveals distinct exploration strategies: BO-EI tends to concentrate its sampling in a more centralized region of the search space, whereas model-based RL exhibits a more dispersed pattern. Both methods identify different high-value regions in the complex landscape of the Rastrigin function, though model-based RL appears to maintain better performance in finding near-optimal solutions in this high-dimensional setting, as evidenced by our previous performance analysis.
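The projection itself can be produced with scikit-learn's t-SNE; the sketch below is only a schematic of the idea, and the perplexity and the joint embedding of both methods' samples are assumptions:

```python
import numpy as np
from sklearn.manifold import TSNE

def embed_samples(samples_rl, samples_bo, perplexity=30, seed=0):
    """Project the 10-D sampled design vectors of both methods into 2-D with a
    shared t-SNE embedding so their search patterns can be compared directly."""
    X = np.vstack([samples_rl, samples_bo])
    emb = TSNE(n_components=2, perplexity=perplexity,
               init="pca", random_state=seed).fit_transform(X)
    return emb[:len(samples_rl)], emb[len(samples_rl):]
```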
To understand why model-based RL outperforms BO-EI in high-dimensional Rastrigin function optimization, we investigated the ability of each method’s surrogate model to reconstruct the function landscape across different regions. We defined five nested regions centered around x = 0 (Fig. 5f), where region ‘a’ represents a high-value area near the global optimum, and regions ‘b’ through ‘e’ progressively expand to cover larger portions of the design space. The root mean square error (RMSE) values between surrogate predictions and true function values were calculated using random sampling points within each region throughout the black-box optimization process. Interestingly, while both methods show comparable prediction accuracy in the high-value region ‘a’ (p > 0.05), BO’s surrogate model demonstrates superior landscape reconstruction in intermediate regions ‘b–d’ (p < 0.05). This aligns with our previous observation that BO tends to concentrate its exploration in more centralized areas (Fig. 4c). However, when considering the entire design space (region ‘e’), model-based RL’s surrogate exhibits significantly better modeling capability (p < 0.05), suggesting a more balanced trade-off between local accuracy and global representation.
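A sketch of this nested-region evaluation is shown below; the sampling count and the region half-widths are illustrative assumptions:

```python
import numpy as np

def region_rmse(surrogate_predict, true_fn, half_width,
                n_dims=10, n_samples=1000, seed=0):
    """RMSE between surrogate predictions and true function values on random
    points drawn from [-half_width, half_width]^n_dims, centered at the optimum
    (x = 0). Half-widths from small ('a') up to 5.0 ('e', the full space)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-half_width, half_width, size=(n_samples, n_dims))
    pred = np.asarray([surrogate_predict(x) for x in X])
    true = np.asarray([true_fn(x) for x in X])
    return float(np.sqrt(np.mean((pred - true) ** 2)))
```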
RMSE values between surrogate predictions and true function values are shown for (a) high-value region near the optimum, (b–d) intermediate regions, and (e) entire design space. f A schematic illustration of the one-dimensional Rastrigin function showing the five nested regions (a–e). g Box plots comparing the final RMSE values between BO and model-based RL across the five regions. Statistical significance was assessed using unpaired T-tests, with check marks (✓) indicating significant differences and cross marks (×) indicating non-significant differences between the two methods.
Hybrid optimization combining BO and RL
Building upon our previous observations of the complementary strengths of BO and model-based RL, we investigated a hybrid approach that combines both methods. Given that BO typically performs better in early iterations while potentially facing challenges in later stages, we propose a sequential combination strategy. Three different approaches were compared: (1) random sampling followed by BO, (2) random sampling followed by model-based RL, and (3) a three-stage approach combining random sampling, BO, and model-based RL. The third choice is motivated by several practical considerations: BO can effectively initiate the optimization process with small sample sizes, while RL, despite its initial sample inefficiency, provides robust performance once properly trained39. This is evidenced by the frequency distribution plots (Fig. 4e, iteration #150), where RL demonstrates more focused behavior in later stages, with notably higher sampling frequency in the vicinity of global optima compared to BO. As shown in Fig. 6, the results on the Rastrigin function with dimensionality 10 demonstrate that the hybrid approach (Random 20 + BO 80 + RL 200) outperforms the other two strategies until about 225 experimental samples and then performs equally well for the remainder of the optimization process. This superior performance suggests that leveraging BO’s early-stage exploration capabilities followed by model-based RL’s robust late-stage optimization creates a synergistic effect. While the optimal timing for strategy switching remains an open question, these results provide evidence that combining BO and model-based RL offers a promising solution for complex design space optimization problems.
On-the-fly RL for HEA design
Finally, to further validate the effectiveness of our RL framework independent of surrogate model quality, we implemented an on-the-fly DQN approach where rewards are directly obtained from experimental measurements (in this case, simulated by the neural network predictor for HEA). Figure 7 compares the optimization performance across different numbers of compositional elements (4–10 components), using the previously defined FOM_HEA that combines yield strength, ultimate tensile strength, and elongation.
The on-the-fly optimization follows a similar framework as Algorithm 2, but replaces the GPR surrogate model predictions with actual experimental measurements for reward calculation. Specifically, instead of using r = M(s′) − M(s) from the surrogate model, the episodic reward is derived directly from experimental results. This modification allows the RL agent to learn directly from materials performance, though at the cost of increased experimental iterations. In turn, the model-based approach (Algorithm 2 in Table 2) can be viewed as a pre-training step for on-the-fly optimization, potentially accelerating the learning process when transitioning to real experiments.
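As a minimal illustration of this substitution (function names such as run_experiment and surrogate_predict are assumptions, not the authors' code), the episodic reward can be switched between the two loops as follows:

```python
def episodic_reward(x_design, run_experiment=None, surrogate_predict=None):
    """Reward granted at the end of a design episode.
    On-the-fly loop: the measured figure of merit of the completed design
    (here the HEA neural-network predictor stands in for real synthesis and
    characterization). Model-based loop: a surrogate-based value; note that
    Algorithm 2 uses incremental rewards r = M(s') - M(s), which this sketch
    simplifies to a terminal prediction."""
    if run_experiment is not None:
        return float(run_experiment(x_design))       # on-the-fly feedback
    return float(surrogate_predict(x_design))        # surrogate-based feedback
```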
Starting with 20 initial experimental points, both BO-EI and on-the-fly DQN significantly outperform random sampling across all dimensionalities. Notably, while BO-EI shows strong early-stage performance, particularly in lower-dimensional spaces (4–6 components), the advantages of RL become increasingly pronounced as the number of components increases. In the 10-component case, on-the-fly DQN achieves statistically superior performance (p < 0.01) compared to both BO-EI and random sampling, suggesting that the RL framework’s sequential decision-making strategy is particularly effective in navigating complex, high-dimensional composition spaces.
It is important to note that this superior performance comes at the cost of requiring substantially more experimental iterations for the RL agent to learn an effective Q-table, particularly in high-dimensional spaces. This limitation highlights a key challenge in applying RL to real-world materials optimization problems and suggests an important direction for future research: developing more sample-efficient RL algorithms while maintaining their advantages in handling complex design spaces.
While Fig. 7 demonstrates the effectiveness of on-the-fly RL for HEA composition design, it is necessary to understand how RL explores the design space compared to BO-EI. We therefore visualize the sampling patterns of both methods using t-SNE dimensionality reduction for the 10-component HEA composition case (Fig. 8). The patterns reveal distinct search tendencies: BO-EI tends to focus on exploitation in regions with initially promising FOM values, while RL maintains a more balanced exploration-exploitation approach across the composition space. Notably, RL identifies and explores promising high-FOM regions while maintaining broader search coverage globally. This difference is particularly evident in the density maps (Fig. 8e), where RL shows a broader distribution of sampling points while still maintaining sufficient sampling density in high-FOM regions. This pattern aligns with RL’s rapid convergence to optimal compositions in later iterations. Between iterations #900 and #1200, RL shifts from a more exploratory pattern to an exploitative one as it starts to converge towards the best FOM found.
a Complete sampling distribution with the best FOM point marked with a star. b Density estimation map of sampling preferences at iteration #1500. c, d Time evolution of sampling points for BO-EI and model-based RL, respectively, shown at iterations #300, #600, #900, #1200, and #1500. e Corresponding density maps showing the relative exploration preferences of both methods. Colors in scatter plots indicate the FOM values (red: high, blue: low), while density maps show the differences in sampling frequency (yellow: RL-preferred, blue: BO-preferred regions).
Parametric study of exploration strategies
The exploration-exploitation trade-off is fundamental to optimization. While BO leverages uncertainty for efficient exploration, our model-based RL strategy uses a different mechanism: random actions in the proposing stage with probability ε. Previously, to choose a candidate for one round of iterative experiments (corresponding to one outer loop in Fig. 1), we used a 10% probability to let the agent choose a random action, and a 90% probability to choose a greedy action based on the trained Q-value network. We further investigate the impact of different ε values on optimization performance. The results shown in Fig. 9a reveal that moderate exploration rates (ε = 0.1–0.3) consistently achieve superior performance, with the best-so-far values converging more rapidly and reliably towards the global optimum. Higher ε values (>0.5) lead to increasingly dispersed sampling patterns, resulting in slower convergence and suboptimal final performance. This behavior suggests that maintaining a balanced exploration-exploitation trade-off by tuning ε is crucial for efficient optimization.

Another critical aspect is the ability to conduct parallel experiments, where simultaneous synthesis and characterization can significantly reduce the total experimental time and resource consumption. For proposing parallel experiments with batch size B, the “Design stage” in Table 2 should be executed B times, allowing the algorithm to propose multiple designs before updating the surrogate model. Figure 9b examines parallel experiments through varying batch sizes (1, 2, 4, 8, 16). The x-axis represents the cumulative number of synthesized samples across all batch experiments, where each point reflects the total number of materials that have been synthesized regardless of batch size. The inset of Fig. 9b provides a complementary view by plotting the best-so-far values against the number of experimental rounds (where each round corresponds to one batch of parallel experiments), focusing on the initial 40 rounds to highlight the early-stage convergence behavior. The results show that smaller batch sizes lead to fewer synthesized samples per round but require more total rounds of batch experiments, underutilizing available experimental resources. Notably, intermediate batch sizes (4–8) provide a balanced performance between the number of batch rounds and synthesized samples. The marginal performance gain observed with larger batch sizes (16) can be attributed to delayed feedback. These findings have significant implications for practical implementation in automated materials platforms, where efficient utilization of parallel synthesis and characterization capabilities is essential. These parametric studies not only provide practical guidelines for implementing our framework but also offer empirical insights into the exploration-exploitation trade-off in materials optimization.
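A sketch of such batched proposing, in which the design stage is executed B times with a frozen agent and surrogate before the surrogate is refit, is shown below; the interfaces (env.reset/step, agent.select_action) follow the earlier episode sketch and are assumptions:

```python
def propose_batch(agent, env, batch_size, epsilon=0.1):
    """Run the design stage batch_size times, collecting B candidate designs to
    synthesize in parallel; the surrogate is refit only after the whole batch
    has been evaluated."""
    batch = []
    for _ in range(batch_size):
        state, done = env.reset(), False
        while not done:
            action = agent.select_action(state, epsilon=epsilon)
            state, _, done = env.step(action)
        batch.append(env.x.copy())   # completed design vector (cf. the episode sketch)
    return batch
```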
a Impact of different epsilon values (ε = 0.05–0.9) during the design stage, where higher ε indicates more random exploration. b Effect of batch size (1, 2, 4, 8, 16) on optimization performance, investigating the trade-off between parallel experimentation and learning efficiency. In both plots, the y-axis shows the best-so-far values (lower is better) over the number of conducted experiments, with the global optimum indicated by the dashed line.
Discussion
Our results reveal several insights into the application of RL for materials design. The primary finding is that RL-based approaches show considerable promise in navigating high-dimensional materials design spaces (D ≥ 6) compared to BO with EI acquisition function. This advantage stems from RL’s ability to adapt through environmental interactions, leading to more dispersed sampling patterns and better landscape learning. Particularly noteworthy is the synergistic effect observed when combining BO’s systematic early-stage exploration with RL’s adaptive learning capabilities, suggesting a promising direction for hybrid model-based strategies in materials discovery. For complex design tasks with dimensionality exceeding 6, our findings indicate an RL-based approach is favored. Furthermore, the HEA design application indicates its potential applicability to broader materials systems where composition-property relationships are complex. As self-driving laboratories with automated synthesis and characterization for alloys continue to advance with the prospect of enabling on-the-fly experimental feedback, we expect RL to play an increasingly crucial role for exploring high-dimensional design spaces.
Methods
Standard test functions
The analytical form of the Ackley function is defined as

\( f(\mathbf{x})=-20\exp\left(-0.2\sqrt{\frac{1}{d}\sum_{i=1}^{d}x_{i}^{2}}\right)-\exp\left(\frac{1}{d}\sum_{i=1}^{d}\cos(2\pi x_{i})\right)+20+e, \)

where d denotes the dimensionality, and the analytical form of the Rastrigin function is defined as

\( f(\mathbf{x})=10d+\sum_{i=1}^{d}\left[x_{i}^{2}-10\cos(2\pi x_{i})\right]. \)
Both functions achieve their global minimum value of f(x*) = 0 at x* = (0, …, 0). The two functions exhibit distinctly different landscape characteristics, making them ideal benchmarks for evaluating optimization algorithm performance. The Ackley function (Fig. 2a) presents a funnel-like global structure with numerous local minima, while the Rastrigin function (Fig. 2b) displays a periodic distribution of local minima. In our study, the input space x was discretized into 51 values for each dimension within the range of [−5, 5]. The test functions help validate algorithm performance across both low- and high-dimensional optimization problems (D = 3–10 dimensions). When the dimensionality reaches 10, the search space expands to 1.19 × 10^17 possible combinations, presenting a compelling challenge for the optimization task.
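For reference, a NumPy implementation of the two benchmark functions in their standard forms (consistent with the definitions above) looks as follows:

```python
import numpy as np

def ackley(x, a=20.0, b=0.2, c=2.0 * np.pi):
    """Ackley function; global minimum f(0) = 0."""
    x = np.asarray(x, dtype=float)
    d = x.size
    return (-a * np.exp(-b * np.sqrt(np.sum(x ** 2) / d))
            - np.exp(np.sum(np.cos(c * x)) / d) + a + np.e)

def rastrigin(x):
    """Rastrigin function; global minimum f(0) = 0."""
    x = np.asarray(x, dtype=float)
    return 10.0 * x.size + np.sum(x ** 2 - 10.0 * np.cos(2.0 * np.pi * x))

# Both evaluate to ~0 at the origin, e.g. ackley(np.zeros(10)) and rastrigin(np.zeros(10)).
```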
HEA test environment
We followed the procedures outlined in ref. 36 to train the neural networks that establish the HEA test environment.
On-the-fly DQN agent
The on-the-fly DQN agent used for testing consists of a Q-network with three fully connected layers (256-1024-N, where N is the number of usable composition values that varies with the dimensionality of the design space, as detailed in Table S1), utilizing Rectified Linear Unit (ReLU) activation functions between layers. The network weights are initialized using a uniform distribution, with limits calculated from layer-specific parameters. To balance exploration and exploitation, we employed an ε-greedy policy, where ε is annealed from 0.95 to 0.01 with a decay rate of 0.985. The agent uses experience replay with a buffer size of 20,000 transitions and a batch size of 128 for training. The learning process employs the Adam optimizer with a learning rate of 10^-3. To stabilize training, a target network is updated every 10 steps, with a discount factor (γ = 0.8) for future rewards. Pseudo-experiments are conducted only at the end of each episode, so the agent learns from episodic rewards. The Q-network is trained to predict action values directly from experimental measurements, eliminating the need for a separate surrogate model during the optimization process.
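A minimal PyTorch sketch of a Q-network with the stated layer widths and optimizer settings is given below; the state dimensionality and the number of actions shown are illustrative assumptions, and the full training loop (replay buffer, target-network sync, ε annealing) is omitted:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Q-network with the layer widths stated above (256-1024-N outputs)."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 1024), nn.ReLU(),
            nn.Linear(1024, n_actions),   # one Q-value per usable composition value
        )

    def forward(self, state):
        return self.net(state)

q_net = QNetwork(state_dim=11, n_actions=51)     # example sizes (assumptions)
target_net = QNetwork(state_dim=11, n_actions=51)
target_net.load_state_dict(q_net.state_dict())   # target network, synced every 10 steps
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.8                                      # discount factor from the text
# ε annealed from 0.95 towards 0.01, e.g. epsilon = max(0.01, 0.95 * 0.985 ** episode)
```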
Model-based DQN agent
The model-based DQN agent employs a similar Q-network structure with two smaller hidden layers (64 and 32 neurons). Similar to the on-the-fly agent, this network employs ReLU activation functions between layers, with weights initialized through a scaled uniform distribution to enhance convergence stability. We implemented an ε-greedy policy with annealing from 0.95 to 0.01, but with a gentler decay rate of 0.99. The agent maintains a replay memory of 20,000 transitions with a training batch size of 128. Learning uses the Adam optimizer at a reduced learning rate of 5 × 10^-4 to improve stability. Another distinguishing feature of this agent is the discount factor of γ = 1.0 (no discounting), facilitating better long-term reward tracking.
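For quick comparison, the hyperparameters that differ between the two agents, as stated above, can be summarized as follows (a summary only, not the authors' configuration files):

```python
# Settings stated in the Methods text; everything not listed is shared
# (replay buffer of 20,000, batch size 128, ReLU activations, ε from 0.95 to 0.01).
MODEL_BASED_DQN = dict(hidden_layers=(64, 32),   eps_decay=0.99,  learning_rate=5e-4, gamma=1.0)
ON_THE_FLY_DQN  = dict(hidden_layers=(256, 1024), eps_decay=0.985, learning_rate=1e-3, gamma=0.8)
```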
BO with EI acquisition function
For the BO with EI acquisition function, we utilized the code from ref. 37, making only necessary modifications to implement the black-box tests considered in this work.
Data availability
All elemental features and experimental composition-property datasets used in developing the HEA test environment are publicly available in the ‘OnTheFlyRL’ directory of our open-source repository at: https://github.com/wsxyh107165243/RLvsBO4Materials.
Code availability
Codes used in this study can be found at: https://github.com/wsxyh107165243/RLvsBO4Materials.
References
Gao, Y.-C. et al. Data-driven insight into the reductive stability of ion–solvent complexes in lithium battery electrolytes. J. Am. Chem. Soc. 145, 23764–23770 (2023).
Gurnani, R. et al. AI-assisted discovery of high-temperature dielectrics for energy storage. Nat. Commun. 15, 6107 (2024).
Chen, A., Zhang, X. & Zhou, Z. Machine learning: accelerating materials development for energy storage and conversion. InfoMat 2, 553–576 (2020).
Sekine, S. et al. Na[Mn0.36Ni0.44Ti0.15Fe0.05]O2 predicted via machine learning for high energy Na-ion batteries. J. Mater. Chem. A. 12, 31103–31107 (2024).
Sendek, A. D. et al. Machine learning modeling for accelerated battery materials design in the small data regime. Adv. Energy Mater. 12, 2200553 (2022).
Brunton, S. L. et al. Data-driven aerospace engineering: reframing the industry with machine learning. AIAA J. 59, 2820–2847 (2021).
Lookman, T., Balachandran, P. V., Xue, D. & Yuan, R. Active learning in materials science with emphasis on adaptive sampling using uncertainties for targeted design. npj Comput. Mater. 5, 21 (2019).
Wang, H. et al. Scientific discovery in the age of artificial intelligence. Nature 620, 47–60 (2023).
Liu, X., Zhang, J. & Pei, Z. Machine learning for high-entropy alloys: Progress, challenges and opportunities. Prog. Mater. Sci. 131, 101018 (2023).
Volk, A. A. et al. AlphaFlow: autonomous discovery and optimization of multi-step chemistry using a self-driven fluidic lab guided by reinforcement learning. Nat. Commun. 14, 1403 (2023).
Lee, D., Chen, W., Wang, L., Chan, Y. C. & Chen, W. Data‐Driven Design for Metamaterials and Multiscale Systems: A Review. Adv. Mater. 36, 2305254 (2024).
Lookman, T. Mesoscopic Phenomena in Multifunctional Materials: Synthesis, Characterization, Modeling and Applications (Springer, 2014).
Frazier, P. I. & Wang, J. Information science for materials discovery and design (Springer, 2016).
Balandat, M. et al. BoTorch: A framework for efficient Monte-Carlo Bayesian optimization. Proc. Adv. Neural Inform. Process. Syst. 33 (2020).
Xue, D. et al. Accelerated search for materials with targeted properties by adaptive design. Nat. Commun. 7, 1–9 (2016).
Lookman, T., Balachandran, P. V., Xue, D., Hogden, J. & Theiler, J. Statistical inference and adaptive design for materials discovery. Curr. Opin. Solid St. M. 21, 121–128 (2017).
Yuan, X. et al. Active Learning-Based Guided Synthesis of Engineered Biochar for CO2 Capture. Environ. Sci. Technol. 58, 6628–6636 (2024).
Gongora, A. E. et al. A Bayesian experimental autonomous researcher for mechanical design. Sci. Adv. 6, eaaz1708 (2020).
Frazier, P. I. A tutorial on Bayesian optimization Preprint at https://arxiv.org/abs/1807.02811 (2018).
Eriksson, D., Pearce, M., Gardner, J., Turner, R. D. & Poloczek, M. Scalable global optimization via local Bayesian optimization. In Proc. Advances in Neural Information Processing Systems Vol. 32 (NeurIPS, 2019).
Kirschner, J., Mutny, M., Hiller, N., Ischebeck, R. & Krause, A. Adaptive and safe Bayesian optimization in high dimensions via one-dimensional subspaces. In International Conference on Machine Learning (PMLR, 2019).
Gelbart, M. A., Snoek, J. & Adams, R. P. Bayesian optimization with unknown constraints Preprint at https://arxiv.org/abs/1403.5607 (2014).
Kandasamy, K. et al. Myopic posterior sampling for adaptive goal oriented design of experiments. In International Conference on Machine Learning (PMLR, 2019).
Rao, Z. et al. Machine learning–enabled high-entropy alloy discovery. Science 378, 78–85 (2022).
Sanchez-Lengeling, B. & Aspuru-Guzik, A. Inverse molecular design using machine learning: Generative models for matter engineering. Science 361, 360–365 (2018).
Moosavi, S. M., Jablonka, K. M. & Smit, B. The role of machine learning in the understanding and design of materials. J. Am. Chem. Soc. 142, 20273–20287 (2020).
Wei, Q. et al. Divide and conquer: Machine learning accelerated design of lead-free solder alloys with high strength and high ductility. npj Comput. Mater. 9, 201 (2023).
Popova, M., Isayev, O. & Tropsha, A. Deep reinforcement learning for de novo drug design. Sci. Adv. 4, eaap7885 (2018).
Xian, Y. et al. Compositional design of multicomponent alloys using reinforcement learning. Acta Mater. 274, 120017 (2024).
Rajak, P. et al. Autonomous reinforcement learning agent for stretchable kirigami design of 2D materials. npj Comput. Mater. 7, 1–8 (2021).
Vinyals, O. et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575, 350–354 (2019).
Won, D.-O., Müller, K.-R. & Lee, S.-W. An adaptive deep reinforcement learning framework enables curling robots with human-like performance in real-world conditions. Sci. Robot. 5, eabb9764 (2020).
Tian, Y. et al. Determining multi‐component phase diagrams with desired characteristics using active learning. Adv. Sci. 8, 2003165 (2021).
Mnih, V. et al. Playing Atari with deep reinforcement learning Preprint at https://arxiv.org/abs/1312.5602 (2013).
Angermueller, C. et al. Model-based reinforcement learning for biological sequence design. In Proc. International Conference on Learning Representations (ICLR, 2020).
Wang, J., Kwon, H., Kim, H. S. & Lee, B.-J. A neural network model for high entropy alloy design. npj Comput. Mater. 9, 60 (2023).
Daulton, S. et al. Bayesian optimization over discrete and mixed spaces via probabilistic reparameterization. Proc. Adv. Neural Inform. Process. Syst. 35, 12760–12774 (2022).
Ament, S., Daulton, S., Eriksson, D., Balandat, M. & Bakshy, E. Unexpected improvements to expected improvement for bayesian optimization. Proc. Adv. Neural Inform. Process. Syst. 36 (2023).
Guo, D. et al. DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning Preprint at https://arxiv.org/abs/2501.12948 (2025).
Acknowledgements
We thank Peter Frazier (Cornell) for discussion. We also acknowledge the support of National Key Research and Development Program of China (2021YFB3802102), National Natural Science Foundation of China (Nos.51931004, 52350710205, 52173228, and 52271190), Innovation Capability Support Program of Shaanxi (2024ZG-GCZX-01(1)-06) and Natural Science Foundation Project of Shaanxi Province (Grant No. 2022JM-205).
Author information
Authors and Affiliations
Contributions
Y.X. developed methodology, wrote software code, performed investigation and validation. X.D. acquired funding. X.J. contributed to methodology development. Y.Z. conducted investigation. J.S. acquired funding. D.X. developed methodology, acquired funding and provided supervision. T.L. conceptualized the project, developed methodology, provided supervision, wrote the original draft and performed writing–review & editing. All authors reviewed and approved the final manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Xian, Y., Ding, X., Jiang, X. et al. Unlocking the black box beyond Bayesian global optimization for materials design using reinforcement learning. npj Comput Mater 11, 143 (2025). https://doi.org/10.1038/s41524-025-01639-w