Figure 9 | Scientific Reports

Figure 9

From: Exploring optimal control of epidemic spread using reinforcement learning

Comparison of the agents' loss (upper) and reward (lower) during training. Each agent was evaluated after every 250 episodes. Each evaluation is reported as the mean of ten runs, and every model was tested on the same environment scenario. The M30 agent was trained for 7000 episodes. The M7, M15, M45, and M60 agents were initialized with the M30 agent's trained weights and therefore converged to an optimal state within 3000 episodes.
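
The evaluation protocol described in the caption can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that the shared environment scenario is obtained by fixing random seeds, and that evaluations are recorded at 250-episode intervals. The names `EVAL_SEEDS`, `run_episode`, `evaluate`, and `evaluation_points` are hypothetical, and the episode itself is a toy stand-in for the epidemic environment.

```python
import random
from statistics import mean

# Assumption: ten fixed seeds stand in for "the same environment scenario".
EVAL_SEEDS = list(range(10))


def run_episode(policy, seed):
    """Toy stand-in for one evaluation episode; returns the total reward."""
    rng = random.Random(seed)  # same seed gives the same scenario to every agent
    total = 0.0
    for _ in range(20):  # 20 steps per episode, an arbitrary illustrative length
        state = rng.random()
        total += policy(state)
    return total


def evaluate(policy, seeds=EVAL_SEEDS):
    """Mean total reward over the fixed evaluation scenarios (ten runs)."""
    return mean(run_episode(policy, s) for s in seeds)


def evaluation_points(total_episodes, every=250):
    """Episode counts at which an evaluation is recorded (assumed interval)."""
    return list(range(every, total_episodes + 1, every))
```

Under these assumptions, a 7000-episode run yields 28 evaluation points and a 3000-episode run yields 12, and two agents evaluated with `evaluate` see identical scenarios, so differences in their scores reflect the policies and not the sampled environment.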