Table 2 Algorithm: model-based DQN for materials design
[1] Input: Design space X, initial dataset D0, GPR surrogate model M, DQN agent A, number of RL training episodes N, episode length L (number of design decisions/actions^a per episode), batch size Btrain for Q-network updates, optimization type: minimization | |
[2] Output: Best material design x* | |
[3] Initialize GPR surrogate model M with D0, D = D0 | |
[4] Initialize DQN agent A with Q-network Q(s, a) | |
[5] Initialize replay buffer R | |
[6] while not terminated do | // Until max iterations or target performance met |
[7] Train/Update GPR model M using D | |
[8] // Agent training stage: learn from surrogate model; train DQN for N episodes | |
[9] for episode = 1 to N do | // Training budget for DQN agent |
[10] s = get_initial_state() | // Initialize with predefined or random state |
[11] for step = 1 to L do | // Make L sequential design decisions/actions |
[12] a = ε-greedy(Q(s,·), X) | // Select action within design space |
[13] s' = next_state(s, a) | // State transition from s to s' |
[14] r = M(s) - M(s') | // Get reward using GPR, assuming minimization |
[15] Store (s, a, r, s') in R | |
[16] Update Q(s, a) using a random minibatch of Btrain transition tuples (s, a, r, s') sampled from R | |
[17] s = s' | |
[18] end for | |
[19] end for | |
[20] // Design stage: propose and evaluate one new material design^b | |
[21] s = get_initial_state() | // Initialize with predefined or random state |
[22] for step = 1 to L do | // Make L sequential design decisions/actions |
[23] a = ε-greedy(Q(s,·), X) | // Propose materials design action |
[24] s' = next_state(s, a) | // State transition from s to s' |
[25] s = s' | |
[26] end for | |
[27] xnext = s | |
[28] ynext = f(xnext) | // Evaluate the new design |
[29] D = D ∪ {(xnext, ynext)} | // Update dataset with experiment results |
[30] end while | |
[31] return x* = argmin_{x ∈ D} f(x) | // Best design found; argmin because the objective is minimized |
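To make the steps of Table 2 concrete, the following Python sketch implements the same loop under several illustrative assumptions that are not from the source: the design x is a vector of d discrete levels, each action raises or lowers one component by one level, f_true is a toy quadratic standing in for the real experiment, the GPR surrogate is scikit-learn's GaussianProcessRegressor, and the Q-network is a small PyTorch MLP trained without a target network for brevity. All names and hyperparameter values (d, N_LEVELS, f_true, N_OUTER, etc.) are hypothetical placeholders.

```python
# Minimal sketch of the model-based DQN loop in Table 2 (toy assumptions, see text).
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

d, N_LEVELS = 4, 8                                     # design variables, levels per variable
ACTIONS = [(i, s) for i in range(d) for s in (-1, 1)]  # (component index, +/-1 step)
GAMMA, B_TRAIN, N_EPISODES, L_STEPS, N_OUTER = 0.95, 32, 50, 10, 20

def f_true(x):
    """Expensive 'experiment' -- here a toy quadratic to be minimized."""
    return float(np.sum((np.asarray(x) - 3.0) ** 2))

def random_state():
    return tuple(random.randrange(N_LEVELS) for _ in range(d))

def next_state(s, a):
    i, step = ACTIONS[a]
    s2 = list(s)
    s2[i] = int(np.clip(s2[i] + step, 0, N_LEVELS - 1))
    return tuple(s2)

class QNet(nn.Module):
    """Q-network mapping a state vector to one Q-value per discrete action."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, len(ACTIONS)))
    def forward(self, s):
        return self.net(s)

def eps_greedy(q, s, eps=0.1):
    if random.random() < eps:
        return random.randrange(len(ACTIONS))
    with torch.no_grad():
        return int(q(torch.tensor(s, dtype=torch.float32)).argmax())

# [3]-[5] initial dataset D0, Q-network, replay buffer
D = [(x, f_true(x)) for x in (random_state() for _ in range(5))]
q = QNet()
opt = torch.optim.Adam(q.parameters(), lr=1e-3)
R = deque(maxlen=10_000)

for _ in range(N_OUTER):                               # [6] fixed budget of real evaluations
    X = np.array([x for x, _ in D], dtype=float)
    y = np.array([v for _, v in D])
    M = GaussianProcessRegressor(kernel=RBF(), alpha=1e-6, normalize_y=True).fit(X, y)  # [7]

    # [8]-[19] agent training stage: rewards come only from the GPR surrogate
    for _ in range(N_EPISODES):
        s = random_state()                             # [10] random initial state
        for _ in range(L_STEPS):
            a = eps_greedy(q, s)                       # [12] epsilon-greedy action
            s2 = next_state(s, a)                      # [13] state transition
            r = float(M.predict([list(s)])[0] - M.predict([list(s2)])[0])  # [14] minimization
            R.append((s, a, r, s2))                    # [15] store transition
            if len(R) >= B_TRAIN:                      # [16] Q-update on a random minibatch
                batch = random.sample(list(R), B_TRAIN)
                sb = torch.tensor([b[0] for b in batch], dtype=torch.float32)
                ab = torch.tensor([b[1] for b in batch])
                rb = torch.tensor([b[2] for b in batch], dtype=torch.float32)
                s2b = torch.tensor([b[3] for b in batch], dtype=torch.float32)
                with torch.no_grad():
                    target = rb + GAMMA * q(s2b).max(dim=1).values
                pred = q(sb).gather(1, ab[:, None]).squeeze(1)
                loss = nn.functional.mse_loss(pred, target)
                opt.zero_grad(); loss.backward(); opt.step()
            s = s2

    # [20]-[29] design stage: roll out the trained policy, then run one real experiment
    s = random_state()
    for _ in range(L_STEPS):
        s = next_state(s, eps_greedy(q, s))            # [23] epsilon-greedy, as in the table
    D.append((s, f_true(s)))                           # [28]-[29] evaluate and augment dataset

x_best, y_best = min(D, key=lambda xy: xy[1])          # [31] best (minimum) design found
print("best design:", x_best, "objective:", y_best)
```

In an actual application, the state encoding, the action set, and the fixed N_OUTER budget on line [6] would be replaced by the problem-specific design space, design actions, and stopping criterion (maximum iterations or a target property value) described in the main text.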