Table 1. The Q algorithm.
Input: state set S, action set A

Training phase:
    Initialize the E and D matrices as zero matrices; set the discount factor γ = 0.8;
    For each episode, do
        Randomly select the initial state s0;
        s := s0;
        While convergence is not reached, do
            Select an action a for the current state s;
            Execute a;
            Observe the resulting next state s';
            Calculate the new value of Q[s, a];
            s := s';
        end while
    end for

Usage phase:
    s := s0;
    Determine the current optimal action a' = arg max_a Q[s, a];
    Execute a' to reach the next state s', then set s := s';
    Repeat until convergence;

Output: the E and D matrices
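Table 1 does not spell out the Q-update rule or the exact roles of the E and D matrices, so the sketch below is only one possible reading of the training and usage phases. It uses the standard tabular Q-learning update Q[s, a] = R[s, a] + γ·max_a' Q[s', a'] with γ = 0.8; the reward matrix R, the toy transition rule (taking action a leads to state a), and the names train_q and greedy_path are illustrative assumptions, not definitions taken from the table.

```python
import numpy as np

# Minimal tabular Q-learning sketch following the structure of Table 1.
# The E and D matrices from the table are not defined there; this sketch uses
# the standard Q (value) and R (reward) matrices instead, which is an assumption.

def train_q(R, gamma=0.8, n_episodes=500, rng=None):
    """Training phase over a reward matrix R of shape (n_states, n_actions).

    Assumption: taking action a in state s moves the agent to state a
    (a common toy setup); Table 1 does not specify the transition model.
    """
    rng = rng or np.random.default_rng(0)
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))      # initialize Q as a zero matrix

    for _ in range(n_episodes):              # "For each episode, do"
        s = rng.integers(n_states)           # randomly select the initial state s0
        for _ in range(n_states):            # fixed inner loop in place of a convergence test
            a = int(rng.integers(n_actions)) # select an action a for the current state
            s_next = a                       # assumed transition: action index = next state
            # Q-learning update: immediate reward plus discounted best future value
            Q[s, a] = R[s, a] + gamma * Q[s_next].max()
            s = s_next                       # s := s'
    return Q

def greedy_path(Q, s0, goal, max_steps=20):
    """Usage phase: repeatedly pick a' = argmax_a Q[s, a] until the goal state."""
    s, path = s0, [s0]
    for _ in range(max_steps):
        if s == goal:
            break
        s = int(np.argmax(Q[s]))             # optimal action, assumed to equal the next state
        path.append(s)
    return path
```

For example, with a 6-by-6 reward matrix R, `Q = train_q(R)` followed by `greedy_path(Q, 2, 5)` would return the sequence of states visited when acting greedily from state 2 toward state 5 under the assumptions above.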