Table 1. The Q algorithm.
Input: state set S, action set A

Training phase:
    Initialize the E and D matrices as zero matrices; set the discount factor γ = 0.8;
    For each episode, do
        Randomly select the initial state s0;
        s := s0;
        While convergence is not reached, do
            Select an action a for the current state s;
            Execute a;
            Observe the resulting next state s';
            Calculate the new value of Q[s, a];
            s := s';
        end while
    end for

Usage phase:
    s := s0;
    Determine the current optimal action a' = arg max_a Q[s, a];
    Execute a' to reach the next state s', then set s := s';
    Repeat until convergence;

Output: the E and D matrices
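Table 1 does not spell out the Q-update rule or the exact roles of the E and D matrices, so the sketch below is only one possible reading of the training and usage phases. It uses the standard tabular Q-learning update Q[s, a] = R[s, a] + γ·max_a' Q[s', a'] with γ = 0.8; the reward matrix R, the toy transition rule (taking action a leads to state a), and the names train_q and greedy_path are illustrative assumptions, not definitions taken from the table.

```python
import numpy as np

# Minimal tabular Q-learning sketch following the structure of Table 1.
# The E and D matrices from the table are not defined there; this sketch uses
# the standard Q (value) and R (reward) matrices instead, which is an assumption.

def train_q(R, gamma=0.8, n_episodes=500, rng=None):
    """Training phase over a reward matrix R of shape (n_states, n_actions).

    Assumption: taking action a in state s moves the agent to state a
    (a common toy setup); Table 1 does not specify the transition model.
    """
    rng = rng or np.random.default_rng(0)
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))      # initialize Q as a zero matrix

    for _ in range(n_episodes):              # "For each episode, do"
        s = rng.integers(n_states)           # randomly select the initial state s0
        for _ in range(n_states):            # fixed inner loop in place of a convergence test
            a = int(rng.integers(n_actions)) # select an action a for the current state
            s_next = a                       # assumed transition: action index = next state
            # Q-learning update: immediate reward plus discounted best future value
            Q[s, a] = R[s, a] + gamma * Q[s_next].max()
            s = s_next                       # s := s'
    return Q

def greedy_path(Q, s0, goal, max_steps=20):
    """Usage phase: repeatedly pick a' = argmax_a Q[s, a] until the goal state."""
    s, path = s0, [s0]
    for _ in range(max_steps):
        if s == goal:
            break
        s = int(np.argmax(Q[s]))             # optimal action, assumed to equal the next state
        path.append(s)
    return path
```

For example, with a 6-by-6 reward matrix R, `Q = train_q(R)` followed by `greedy_path(Q, 2, 5)` would return the sequence of states visited when acting greedily from state 2 toward state 5 under the assumptions above.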