Fig. 3
From: Reinforcement learning-based optimal control for stochastic opinion dynamics

Cumulative discounted cost trajectories of the RL policy and the theoretical optimal controller from a fixed initial condition \(x_0=[1,2]^\top\) (mean ± one standard deviation over 5 rollouts). The vertical axis shows the time-accumulated cost \(\sum _{k=0}^{t} \delta ^k c_k\), which reflects transient cost evolution along representative trajectories. The percentage values indicate the relative difference in the expected total discounted cost, \(\Delta = (J_{\textrm{RL}}-J_{\textrm{opt}})/J_{\textrm{opt}}\times 100\%\), where \(J=\mathbb {E}[\sum _{k=0}^{T-1}\delta ^k c_k]\) is estimated by Monte Carlo simulation over randomly sampled initial states.
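
The two quantities in this caption can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the discount factor, and the per-step costs below are hypothetical values chosen only to make the arithmetic easy to follow. The first function computes the plotted curve \(\sum _{k=0}^{t} \delta ^k c_k\) for one rollout, the second averages the final totals over rollouts as a simple Monte Carlo estimate of \(J\), and the third computes \(\Delta\).

```python
from itertools import accumulate


def cumulative_discounted_cost(costs, delta):
    """Running sums sum_{k=0}^{t} delta**k * c_k for t = 0, ..., len(costs) - 1."""
    discounted = [delta ** k * c for k, c in enumerate(costs)]
    return list(accumulate(discounted))


def estimate_total_cost(rollouts, delta):
    """Monte Carlo estimate of J: average of the final discounted totals."""
    totals = [cumulative_discounted_cost(costs, delta)[-1] for costs in rollouts]
    return sum(totals) / len(totals)


def relative_gap_percent(j_rl, j_opt):
    """Delta = (J_RL - J_opt) / J_opt * 100, expressed in percent."""
    return (j_rl - j_opt) / j_opt * 100.0


# Hypothetical per-step costs c_k (not taken from the paper); each inner
# list is one rollout, standing in for one sampled initial state.
delta = 0.5
rollouts_rl = [[4.0, 2.0, 1.0], [2.0, 2.0, 2.0]]
rollouts_opt = [[4.0, 1.0, 1.0], [2.0, 1.0, 2.0]]

curve_rl = cumulative_discounted_cost(rollouts_rl[0], delta)
curve_opt = cumulative_discounted_cost(rollouts_opt[0], delta)

j_rl = estimate_total_cost(rollouts_rl, delta)
j_opt = estimate_total_cost(rollouts_opt, delta)
gap = relative_gap_percent(j_rl, j_opt)

print(curve_rl, curve_opt)
print(f"J_RL={j_rl}, J_opt={j_opt}, gap={gap:.2f}%")
```

In this sketch the curves come from a single rollout while the gap is computed from averaged totals, mirroring the caption's distinction between the plotted trajectories from a fixed \(x_0\) and the percentage values estimated over randomly sampled initial states.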