Fig. 6

DQN training reward evolution over 2500 episodes, demonstrating stable convergence to a near-optimal policy without significant oscillations.
