Fig. 10: Structure and operation of the PPO algorithm used in this work.

In the Collect phase, the agent interacts with the environments to gather (observation, action, reward) triples. In the Update phase, these triples are used to update the parameters of the neural networks via gradient ascent on the PPO objective.
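The two phases can be illustrated with a minimal sketch. It is not the implementation used in this work, and it makes several simplifying assumptions: a hypothetical two-state toy environment in which the reward is 1 when the action matches the observation, a tabular softmax policy `theta` in place of the neural networks, an advantage estimate equal to the reward minus the batch mean (no value network), and the clipped-surrogate form of the PPO objective. All function names and hyperparameters (`collect`, `update`, `train`, `lr`, `clip_eps`, `n_epochs`) are illustrative. In addition to the triples, the sketch stores the log-probability of each sampled action, which the probability ratio in the objective requires.

```python
import numpy as np

N_STATES, N_ACTIONS = 2, 2  # hypothetical toy problem sizes


def policy_probs(theta, obs):
    """Softmax action probabilities for a batch of integer observations."""
    logits = theta[obs]
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)


def collect(theta, n_steps, rng):
    """Collect phase: gather (observation, action, reward) triples."""
    obs = rng.integers(N_STATES, size=n_steps)
    probs = policy_probs(theta, obs)
    actions = np.array([rng.choice(N_ACTIONS, p=p) for p in probs])
    rewards = (actions == obs).astype(float)  # assumed toy reward
    # Log-probabilities under the collecting policy, kept for the PPO ratio.
    old_logp = np.log(probs[np.arange(n_steps), actions])
    return obs, actions, rewards, old_logp


def update(theta, batch, lr=0.5, clip_eps=0.2, n_epochs=4):
    """Update phase: gradient ascent on the clipped surrogate objective."""
    obs, actions, rewards, old_logp = batch
    adv = rewards - rewards.mean()  # assumed baseline: batch-mean reward
    idx = np.arange(len(obs))
    theta = theta.copy()
    for _ in range(n_epochs):
        probs = policy_probs(theta, obs)
        ratio = np.exp(np.log(probs[idx, actions]) - old_logp)
        clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
        # The gradient is nonzero only where the unclipped term is the minimum.
        active = ratio * adv <= clipped * adv
        # d log pi(a|s) / d logits[s] = onehot(a) - pi(.|s)
        grad_logp = -probs
        grad_logp[idx, actions] += 1.0
        coef = (active * ratio * adv)[:, None]
        grad = np.zeros_like(theta)
        np.add.at(grad, obs, coef * grad_logp)  # accumulate per-state rows
        theta += lr * grad / len(obs)  # ascent step (maximization)
    return theta


def train(n_iters=30, n_steps=256, seed=0):
    """Alternate Collect and Update phases, starting from a uniform policy."""
    rng = np.random.default_rng(seed)
    theta = np.zeros((N_STATES, N_ACTIONS))
    for _ in range(n_iters):
        batch = collect(theta, n_steps, rng)
        theta = update(theta, batch)
    return theta


theta = train()
print(policy_probs(theta, np.arange(N_STATES)))
```

Each row of the printed matrix gives the action probabilities for one observation. Under the assumptions above, the probability mass should concentrate on the diagonal as training proceeds, since the rewarded action is the one that matches the observation.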