Fig. 2: Conceptual illustration of the policy extraction process in the proposed framework.
From: Dynamic optimizers for complex industrial systems via direct data-driven synthesis

The learning algorithm evaluates the optimality of each trajectory in the historical operational data using a learned value function. Trajectories with lower long-term costs are assigned higher weights. The optimization policy (dynamic optimizer) is then derived using weighted regression, prioritizing trajectories with the highest expected returns. The resulting policy selects plant setpoints that lead to lower long-term production costs.