Introduction

The control of opinion dynamics in social networks has garnered significant attention in recent years. In particular, the search for optimal control strategies that minimize cost functions under dynamic models has found applications across various domains, including autonomous energy consumption optimization and robotics1,2. From a control-theoretic perspective, opinion evolution can be formalized as a networked dynamical system in which external interventions steer collective states toward desired targets3,4,5; from the perspective of computational social science, classical models such as the DeGroot model6 and the Friedkin–Johnsen model7 provide linear dynamical foundations for analyzing opinion formation and consensus building.

Traditionally, optimal control design has relied on well-established techniques such as dynamic programming8, the Pontryagin Maximum Principle9 and the classic Linear Quadratic Regulator (LQR)10. These methods have been widely applied to opinion dynamics regulation, including communication structure optimization11 and Stackelberg game-based intervention with stubborn agents12. However, these classical methods face several limitations when applied to the complex and dynamic nature of social interactions. For instance, the classic LQR method requires precise knowledge of the system, which is often impractical in the context of social interactions that exhibit non-stationary statistical characteristics. Moreover, existing stochastic control approaches rely heavily on prior knowledge of probability distributions13, which is difficult to obtain in real-world social scenarios where interaction strengths fluctuate randomly due to user emotions, external events, and contextual information14. In some cases, linear models fail to adequately approximate the complex nonlinear dynamics of social interactions, or the underlying system dynamics remain unknown. These challenges have motivated researchers to explore alternative data-driven approaches, particularly Reinforcement Learning (RL)15 and Adaptive Dynamic Programming (ADP)16,17,18, often combined with Neural Networks (NNs)19,20. These approaches offer a promising framework for solving optimal control problems and have a wide range of applications. For instance, ref. 21 provides an overview of the application of RL and optimal adaptive control in robotics; in social network analysis, data-driven control methods have been used to learn hidden influence structures in large-scale dynamical networks22.

Reinforcement Learning (RL) is a powerful data-driven method that enables agents to learn optimal control policies through interactions with their environment, thereby minimizing cumulative costs. The motivation for employing RL in optimal control design stems from several key factors, including its adaptability, ability to handle complex dynamics, capacity to learn from real-world experiences, and robustness against inaccuracies in system models17. These advantages position RL as an ideal candidate for addressing the limitations of traditional control methods in the context of social network opinion dynamics.

In this paper, we propose a hierarchical control framework that integrates the strengths of control theory and modern data-driven techniques. The framework covers three settings. The first is model-based stochastic optimal control, where the probability density function (PDF) of the random parameters in the agents’ opinion dynamics is known, enabling the derivation of analytical optimal control policies via dynamic programming with expectation integrals. The second is model-free Reinforcement Learning (RL), where the PDF of the stochastic interactions is unknown but the system dynamics (e.g., network structure and opinion update rules) are available; here, we employ RL to learn optimal control strategies from observed interactions. The third is data-driven RL under fully unknown dynamics, where neither the system model nor the statistical properties of the randomness are available, necessitating a purely data-driven approach to discover optimal policies from real-world interaction data. We adopt the linear quadratic (LQ) framework for its natural alignment with linear opinion dynamics models (e.g., DeGroot) and the tractability of quadratic cost functions, which enable convex optimization formulations and closed-form policy updates.

Compared with existing work, our method combines the convex optimization advantages of Asri et al.’s approach23 with the theoretical completeness of the classic LQR. Unlike the approach proposed by Asri and Rodrigues23, which employs quadratic neural networks to approximate the Q-function, this work directly parameterizes the Q-function as a quadratic form. This choice is motivated by the intrinsic quadratic structure of stochastic linear–quadratic regulation problems, under which the optimal Q-function remains quadratic even in the presence of random system parameters. As a result, policy evaluation reduces to a convex least-squares problem with a unique global solution, and policy improvement admits a closed-form analytical expression. This avoids the non-convex optimization and local optimality issues inherent in neural network-based approaches, while enabling rigorous guarantees on global convergence and almost sure optimality. Yaghmaie et al.24 study deterministic linear quadratic regulation with known system matrices, where the quadratic structure of the Q-function and the optimal policy can be derived analytically from the system model. Their policy iteration framework is Riccati-free but remains fundamentally model-based. In contrast, our work considers stochastic linear systems with randomly varying and unknown system parameters. Since the Riccati equation cannot be constructed in this setting, we adopt a data-driven policy iteration approach, where the Q-function is estimated via convex least-squares using sampled transitions. This enables policy evaluation and improvement without knowledge of the system dynamics or the distribution of the random parameters.

The core innovations of our approach are threefold: (1) We leverage the structural properties of quadratic cost functions to transform policy evaluation into a convex least-squares problem, avoiding non-convex neural network approximation; (2) We achieve analytical policy improvement through matrix block operations, thereby overcoming the non-convex optimization challenges associated with deep learning; (3) We prove the global convergence of the \({\varvec{H}}\)-matrix estimation, with theoretical guarantees of almost sure convergence to the optimal policy.

The remainder of this paper is organized as follows. In Section 2, we analyze the optimal control problem with known probability distributions based on stochastic dynamic programming theory. In Section 3, we consider the scenario where the probability distributions are unknown and propose a data-driven RL approach. Building on this, Section 4 generalizes the method to address optimal control problems in systems with completely unknown dynamics. Section 5 validates the proposed methods through illustrative examples. Finally, Section 6 concludes the paper and discusses potential future work.

Stochastic optimal control with known probability distributions

This section studies the model-based stochastic optimal control problem when the PDF \(\phi (p \mid \theta ^*)\) of the stochastic parameter \(p\) is known. In social networks, this corresponds to cases where the influence of opinion leaders can be quantified from historical interaction data, allowing a known statistical model of opinion dynamics to guide precise intervention strategies.

We consider a social network consisting of \(n\) agents. Let \(x_i(t) \in \mathbb {R}\) denote the opinion of agent \(i\) at discrete time \(t = 0, 1, 2, \ldots\). The collective opinion state is represented by the vector \({\varvec{x}}(t) = [x_1(t), \ldots , x_n(t)]^\top \in \mathbb {R}^n\), with a known initial state \({\varvec{x}}(0)\). Inter-agent influence is characterized by a row-stochastic trust matrix \({\varvec{A}} \in \mathbb {R}^{n \times n}\), where each element \(A_{ij} \ge 0\) quantifies the trust weight that agent \(i\) assigns to agent \(j\). The row-stochastic property, \({\varvec{A}} {\varvec{1}}_n = {\varvec{1}}_n\), ensures that the trust weights assigned by each agent sum to one. A player (an external decision-maker, e.g., a platform moderator) applies a scalar control input \(u(t) \in \mathbb {R}\), whose effect is distributed via a constant input vector \({\varvec{B}} \in \mathbb {R}^{n \times 1}\). The resulting discrete-time opinion dynamics are:

$$\begin{aligned} {\varvec{x}}(t+1) = {\varvec{A}}{\varvec{x}}(t) + {\varvec{B}}u(t), \end{aligned}$$
(1)

The player aims to steer all agents’ opinions toward a target consensus value \(s \in \mathbb {R}\), formalized by minimizing an infinite-horizon discounted quadratic cost:

$$\begin{aligned} J(u) = \sum _{t=0}^{\infty } \delta ^{t} \Big [ ({\varvec{x}}(t) - s {\varvec{1}}_n)^\top {\varvec{Q}} ({\varvec{x}}(t) - s {\varvec{1}}_n) + u(t)^T {\varvec{R}} u(t) \Big ], \end{aligned}$$
(2)

where \({\varvec{Q}} \succ 0\) (typically \({\varvec{Q}} = {\varvec{I}}_n\)) weights state deviations, \({\varvec{R}}= \gamma> 0\) penalizes control effort, and \(\delta \in (0, 1]\) is a discount factor.
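As a minimal illustration of the dynamics (1) and the cost (2), the following Python sketch simulates a two-agent network and accumulates a truncated discounted cost. The function name `discounted_cost`, the numerical values of \({\varvec{A}}\), \(s\) and \({\varvec{x}}(0)\), the finite horizon, and the simple proportional feedback are all illustrative assumptions, not quantities taken from this paper; \({\varvec{Q}} = {\varvec{I}}_n\) and \({\varvec{R}} = \gamma\) follow the typical choice mentioned above.

```python
import numpy as np

def discounted_cost(A, B, x0, s, policy, gamma=1.0, delta=0.9, horizon=200):
    """Truncated version of cost (2) with Q = I_n and R = gamma (illustrative choices)."""
    x = np.asarray(x0, dtype=float)
    total = 0.0
    for t in range(horizon):
        u = policy(x)                      # scalar control input u(t)
        dev = x - s                        # deviation from the target consensus s
        total += delta**t * (dev @ dev + gamma * u**2)
        x = A @ x + B * u                  # opinion update, Eq. (1)
    return total, x

A = np.array([[0.6, 0.4], [0.3, 0.7]])    # hypothetical row-stochastic trust matrix
B = np.array([1.0, 0.0])                  # control enters through agent 1 only
s = 1.0                                   # hypothetical target consensus value
x0 = [0.0, 0.5]                           # hypothetical initial opinions

# No intervention versus a hand-picked (not optimal) proportional feedback.
cost_free, x_free = discounted_cost(A, B, x0, s, policy=lambda x: 0.0)
cost_fb, x_fb = discounted_cost(A, B, x0, s, policy=lambda x: -0.5 * (x[0] - s))
```

Without control the opinions settle on a consensus inside the range of the initial opinions, so the deviation from \(s\) persists; the feedback law trades control effort for a smaller deviation.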

Here, we only consider the case of two agents (n=2). We adopt this minimal yet functionally complete system as a testbed, aiming to reveal the core mechanisms by which random interactions affect control and learning processes with maximum analytical and conceptual clarity. This simplified model avoids the representational complexity and visualization difficulties brought about by high-dimensional state spaces, enabling the solution to the stochastic Riccati equation and the algorithmic behaviors to be presented directly and clearly. It should be noted that the mathematical structure of the framework proposed in this paper, which is based on quadratic value function parameterization and policy iteration, is itself independent of the number of agents. Therefore, the insights gained here lay a solid foundation for understanding similar behaviors in larger-scale networks.

In this scenario, the matrix \({\varvec{A}}\) is a \(2 \times 2\) row-stochastic matrix with first row \([p, 1 - p]\) and second row \([q, 1 - q]\), and the matrix \({\varvec{B}}\) is a \(2 \times 1\) column vector with entries \([1, 0]^\top\). We assume that the player knows the value of q, the distribution of the random variable p, and the actual value of \(\theta ^*\), which describes the characteristics of the probability distribution of p. Specifically, \(\theta ^*\) can represent the mean, variance, probability of success, or any other key parameter of the distribution. For example, if p follows a Beta distribution, \(\theta ^*\) collects its shape parameters \(\alpha>0\) and \(\beta>0\).

The goal is to find a control strategy such that the opinion states of the two agents eventually converge to the desired consensus state s set by the player, while minimizing the cumulative cost function. The objective function of the player can be written in expected value form as follows:

$$\begin{aligned} J(u(t)) = \min _{u(t)}\mathbb {E}\left[ \sum _{t=0}^{\infty } \delta ^{t} \left[ ({\varvec{x}}(t) - s {\varvec{1}}_n)^T {\varvec{Q}} ({\varvec{x}}(t) - s {\varvec{1}}_n) + u(t)^T {\varvec{R}} u(t) \right] \right] , \end{aligned}$$
(3)

where \(\mathbb {E}\) is the expectation operator with respect to the stochastic variables in the model. We assume that p is a realization of a random variable \(\tilde{p}\), whose conditional probability distribution is given by the function \(\phi (p \mid \theta ^*)\), where \(\theta ^{*} \in \Theta \subset \mathbb {R}^k\) is the vector of sufficient parameters of the probability density function (PDF) \(\phi\). Let the support of this PDF be given by \(\mathscr {H}\), such that \(p \in \mathscr {H} \subseteq (0, 1]\). The player does not know the future realizations of the random variable \(\tilde{p}\)13.

We can construct a value function to evaluate the performance of the optimal strategy under the given probability distribution of p and the value of q. This value function is a key tool in designing and analyzing the optimal control strategy. The value function V is the dynamic version of the objective function J: it accounts for the cumulative costs of all future time steps starting from the current state \({\varvec{x}}(t)\). The Bellman equation is a recursive expression of the value function that links the value of a state to the value of its successor states; it expresses the value function as the immediate cost of the current state plus the expected discounted cumulative cost of the next state. The specific form is as follows:

$$\begin{aligned} \begin{aligned} V\left( {\varvec{x}}(t), \theta ^*\right) = \min _{u(t)}\left\{ L({\varvec{x}}(t), u(t))+\delta \mathbb {E}\left[ V\left( {\varvec{x}}(t+1), \theta ^*\right) \right] \right\} , \end{aligned} \end{aligned}$$
(4)

The integral form can be expressed as:

$$\begin{aligned} V\left( {\varvec{x}}(t), \theta ^*\right) = \min _{u(t)}\left\{ L({\varvec{x}}(t), u(t))+\delta \int _ \mathscr {H} V({\varvec{x}}(t+1),\theta ^*)\phi (p \mid \theta ^*) d p\right\} , \end{aligned}$$
(5)

where \(L({\varvec{x}}(t),u(t)) = ({\varvec{x}}(t) - s {\varvec{1}}_n)^T {\varvec{Q}} ({\varvec{x}}(t) - s {\varvec{1}}_n) + u(t)^T {\varvec{R}} u(t)\) is the immediate cost function under the current state and control.

Theorem 1

Consider the dynamic system (1) having two agents with random parameter \(p \sim \phi (p|\theta ^*)\) and fixed parameter q. The optimal control \(u^*(t)\) minimizing the cost function (3) takes the state feedback form:

$$\begin{aligned} u^*(t) = \tilde{e}x_1(t) + \tilde{d}x_2(t) + \tilde{c}, \end{aligned}$$
(6)

where the coefficients \(\tilde{e}\), \(\tilde{d}\), and \(\tilde{c}\) are determined by the following integral equations:

$$\begin{aligned} \begin{aligned}&\tilde{e}=-\frac{\delta \int _ \mathscr {H} \left( k_{11} p+k_{12} q\right) \phi (p \mid \theta ^*)dp}{\gamma +\delta k_{11}}, \\&\tilde{d}=-\frac{\delta \int _ \mathscr {H}\left[ k_{11}(1-p)+k_{12}(1-q)\right] \phi (p \mid \theta ^*)dp}{\gamma +\delta k_{11}}, \\&\tilde{c}=-\frac{\delta k_1 }{2(\gamma +\delta k_{11})}. \end{aligned} \end{aligned}$$

and the quadratic value function coefficients \(k_{11}\), \(k_{12}\), \(k_{22}\), \(k_1\), \(k_2\), and \(k_0\) solve the coupled stochastic Riccati equations defined by expectation integrals (Equation (12)).

Remark 1

Defining the deviation state \(\textbf{z}(t) = \textbf{x}(t) - s\textbf{1}_2\), the optimal control problem formed by system equation (1) and the cost function (2) is transformed into a standard LQR problem. Consequently, the optimal control can be equivalently expressed as \(u^*(t) = -\textbf{K}\textbf{z}(t)\), where \(\textbf{K} = -[\tilde{e}, \tilde{d}]\). The constant term \(\tilde{c}\) in (6) originates from \(\textbf{K}(s\textbf{1}_2)\) and accounts for the nonzero target state. This theorem generalizes the classical discrete-time LQR solution \(u^*(t) = -\textbf{K}\textbf{x}(t)\) to systems with stochastic parameters and nonzero target states.
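The reduction stated in Remark 1 rests on the row-stochastic property \({\varvec{A}}{\varvec{1}}_2 = {\varvec{1}}_2\), which holds for every realization of p. The short derivation below is a supplementary sketch of that step, written with the symbols already defined above; it is not part of the original proof.

$$\begin{aligned} \begin{aligned} \textbf{z}(t+1)&= {\varvec{A}}{\varvec{x}}(t) + {\varvec{B}}u(t) - s\textbf{1}_2 = {\varvec{A}}\left( \textbf{z}(t) + s\textbf{1}_2\right) + {\varvec{B}}u(t) - s\textbf{1}_2 \\&= {\varvec{A}}\textbf{z}(t) + {\varvec{B}}u(t) + s\left( {\varvec{A}}\textbf{1}_2 - \textbf{1}_2\right) = {\varvec{A}}\textbf{z}(t) + {\varvec{B}}u(t). \end{aligned} \end{aligned}$$

The deviation state therefore obeys the same linear dynamics with no constant offset, and the stage cost becomes \(\textbf{z}(t)^\top {\varvec{Q}}\textbf{z}(t) + \gamma u^2(t)\). Substituting \(\textbf{z}(t) = {\varvec{x}}(t) - s\textbf{1}_2\) into \(u^*(t) = \tilde{e}z_1(t) + \tilde{d}z_2(t)\) and comparing with (6) gives \(\tilde{c} = -s(\tilde{e} + \tilde{d})\).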

Proof

Considering the objective function structure of our model in (3), we conjecture that the value function is quadratic and specified as follows:

$$\begin{aligned} V\left( {\varvec{x}}(t), \theta ^*\right) =k_{11}x_1^2(t) + 2k_{12}x_1(t)x_2(t) + k_{22}x_2^2(t) + k_{1}x_1(t) + k_{2}x_2(t) + k_0, \end{aligned}$$
(7)

where \({\varvec{x}}(t)=(x_1(t),x_2(t))\). Plugging the conjectured value function into (5), we obtain

$$\begin{aligned} \begin{aligned}&k_{11}x_1^2(t) + 2k_{12}x_1(t)x_2(t) + k_{22}x_2^2(t) + k_{1}x_1(t) + k_{2}x_2(t) + k_0 \\&\quad =\min _{u(t)}\left\{ \underbrace{(x_1(t)-s)^2+(x_2(t)-s)^2+ \gamma u^2(t)}_{L({\varvec{x}}(t), u(t))}+\delta \int _ \mathscr {H} [k_{11}x_1^2(t+1) + 2k_{12}x_1(t+1)x_2(t+1)\right. \\&\qquad \left. + k_{22}x_2^2(t+1) + k_{1}x_1(t+1) + k_{2}x_2(t+1) + k_0]\phi (p \mid \theta ^*) d p\right\} . \end{aligned} \end{aligned}$$
(8)

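Next, the first-order optimality condition is obtained by differentiating the right-hand side of the above equation with respect to u(t), substituting the dynamics (1) for \({\varvec{x}}(t+1)\), setting the derivative to zero, and dividing by 2: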

$$\begin{aligned} \gamma u(t) + \delta \int _\mathscr {H}\left[ k_{11}(p x_1(t) + (1-p) x_2(t) + u(t)) + k_{12}(q x_1(t) + (1-q) x_2(t)) + \frac{k_1}{2} \right] \phi (p \mid \theta ^*) dp = 0. \end{aligned}$$
(9)

Separating out the terms containing u(t) and solving for the optimal solution \(u^*(t)\) to make the Bellman equation (5) valid yields:

$$\begin{aligned} \begin{aligned} {u}^*(t)&= \frac{-\delta }{\gamma + \delta k_{11} \int _{\mathscr {H}} \phi (p \mid \theta ^*) \, dp} \int _{\mathscr {H}} \Bigg [ k_{11}(p x_1(t) + (1-p) x_2(t)) \\&\quad + k_{12}(q x_1(t) + (1-q) x_2(t)) + \frac{1}{2} k_1 \Bigg ] \phi (p \mid \theta ^*) \, dp. \end{aligned} \end{aligned}$$
(10)

For any Probability Density Function (PDF), the integral over the entire domain is 1. Thus \(\int _ \mathscr {H}\phi (p \mid \theta ^*) d p = 1\). Let

$$\begin{aligned} \begin{aligned}&\tilde{e}=-\frac{\delta \int _ \mathscr {H} \left( k_{11} p+k_{12} q\right) \phi (p \mid \theta ^*)dp}{\gamma +\delta k_{11}}, \\&\tilde{d}=-\frac{\delta \int _ \mathscr {H}\left[ k_{11}(1-p)+k_{12}(1-q)\right] \phi (p \mid \theta ^*)dp}{\gamma +\delta k_{11}}, \\&\tilde{c}=-\frac{\delta k_1 }{2(\gamma +\delta k_{11})}. \end{aligned} \end{aligned}$$

Thus, the optimal control \(u^*(t)\) can be denoted as

$$\begin{aligned} u^*(t) =\tilde{e}x_1(t) +\tilde{d}x_2(t)+\tilde{c}. \end{aligned}$$
(11)

Next, to find the coefficients of the value function, we substitute \(u^*(t)\) into (8) and equate the coefficients of like terms in \(x_1(t)\) and \(x_2(t)\) on both sides. This leads to a system of 6 equations in 6 unknowns: \(k_{11},k_{12},k_{22},k_{1},k_{2},k_0\). That is,

$$\begin{aligned} \begin{aligned} k_{11}&=1+\gamma \tilde{e}^2+\delta \int _ \mathscr {H}\left[ k_{11}(p+\tilde{e})^2+2 k_{12}(p+\tilde{e}) q+k_{22} q^2\right] \phi (p \mid \theta ^*)dp\\ k_{12}&= \gamma \tilde{e} \tilde{d}+\delta \int _ \mathscr {H}\left[ k_{11}(p+\tilde{e})(1-p+\tilde{d})+k_{12}[(p+\tilde{e})(1-q)+(1-p+\tilde{d})q]\right. +k_{22} q(1-q)] \phi (p \mid \theta ^*)dp\\ k_{22}&=1+\gamma \tilde{d}^2+\delta \int _ \mathscr {H}\left[ k_{11}(1-p+\tilde{d})^2+2 k_{12}(1-p+\tilde{d})(1-q)+k_{22}(1-q)^2\right] \phi (p \mid \theta ^*)dp \\ k_1&=-2 s+2 \gamma \tilde{e} \tilde{c}+\delta \int _ \mathscr {H}\left[ 2 k_{11} \tilde{c}(p+\tilde{e})+2 k_{12} \tilde{c} q+k_1(p+\tilde{e})+k_{2}q\right] \phi (p \mid \theta ^*)dp\\ k_2&=-2 s+2 \gamma \tilde{d} \tilde{c}+\delta \int _ \mathscr {H}\left[ 2 k_{11} \tilde{c}(1-p+\tilde{d})+2 k_{12} \tilde{c}(1-q)+k_1(1-p+\tilde{d})\right. +k_2(1-q)] \phi (p \mid \theta ^*)dp\\ k_0&=2 s^2+\gamma \tilde{c}^2+\delta \int _ \mathscr {H}\left[ k_{11} \tilde{c}^2+k_1 \tilde{c}+k_0\right] \phi (p \mid \theta ^*)dp \end{aligned} \end{aligned}$$
(12)

\(\square\)
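One possible way to solve the coupled equations (12) numerically is value iteration on the Bellman equation (5). The Python sketch below is an illustration under stated assumptions rather than the procedure used in this paper: it rewrites the conjectured value function (7) as \(y^\top P y\) on the augmented state \(y = [x_1, x_2, 1]^\top\), so that \(k_{11} = P_{11}\), \(k_{12} = P_{12}\), \(k_{22} = P_{22}\), \(k_1 = 2P_{13}\), \(k_2 = 2P_{23}\) and \(k_0 = P_{33}\). Because the integrands in (12) are polynomials of degree at most two in p, only the moments \(\mathbb {E}[p]\) and \(\mathbb {E}[p^2]\) are needed. The function name `solve_stochastic_lq`, the Beta(2, 2) distribution and the values of q, s, \(\gamma\) and \(\delta\) are hypothetical.

```python
import numpy as np

def solve_stochastic_lq(m1, m2, q, s, gamma=1.0, delta=0.9, tol=1e-12, max_iter=10000):
    """Value iteration for V(x) = y^T P y with augmented state y = [x1, x2, 1].

    m1 = E[p] and m2 = E[p^2] are the only features of phi(p | theta*) that enter.
    """
    # Augmented dynamics y(t+1) = F(p) y(t) + G u(t), with F(p) = F0 + p * F1.
    F0 = np.array([[0.0, 1.0, 0.0], [q, 1.0 - q, 0.0], [0.0, 0.0, 1.0]])
    F1 = np.array([[1.0, -1.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
    G = np.array([1.0, 0.0, 0.0])
    # State cost (x1 - s)^2 + (x2 - s)^2 written as y^T Qa y.
    Qa = np.array([[1.0, 0.0, -s], [0.0, 1.0, -s], [-s, -s, 2.0 * s**2]])
    Fbar = F0 + m1 * F1                      # E[F(p)]
    P = np.zeros((3, 3))
    for _ in range(max_iter):
        # E[F(p)^T P F(p)] expanded in the first two moments of p.
        EFPF = F0.T @ P @ F0 + m1 * (F0.T @ P @ F1 + F1.T @ P @ F0) + m2 * (F1.T @ P @ F1)
        g = delta * (G @ P @ Fbar)           # coefficient of y in the first-order condition (9)
        denom = gamma + delta * P[0, 0]      # gamma + delta * k11
        P_new = Qa + delta * EFPF - np.outer(g, g) / denom
        done = np.max(np.abs(P_new - P)) < tol
        P = P_new
        if done:
            break
    gains = -delta * (G @ P @ Fbar) / (gamma + delta * P[0, 0])   # [e, d, c] of Eq. (6)
    return P, gains

# Hypothetical example: p ~ Beta(2, 2), q = 0.5, target s = 1.
alpha, beta = 2.0, 2.0
m1 = alpha / (alpha + beta)
m2 = alpha * (alpha + 1.0) / ((alpha + beta) * (alpha + beta + 1.0))
P, (e, d, c) = solve_stochastic_lq(m1, m2, q=0.5, s=1.0)
```

A quick consistency check is that the resulting control vanishes at the target, i.e. \(\tilde{e}s + \tilde{d}s + \tilde{c} \approx 0\), in line with Remark 1.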

Reinforcement learning (RL) under partially known system dynamics

This section focuses on the complementary partially known setting, where the network structure and update rules are available but the PDF \(\phi (p \mid \theta )\) of the stochastic parameter \(p\) is unknown. This mirrors actual social networks, where the influence of opinion leaders on their followers is often affected by various psychological and social factors. These factors may be impossible to observe or quantify directly, resulting in unknown parameter distributions.

Problem formulation under partially known model

Since the system contains random terms, the model can be rewritten as follows:

$$\begin{aligned} {\varvec{x}}(t+1) = {\varvec{A}}(p(t)){\varvec{x}}(t) + {\varvec{B}}u(t), \end{aligned}$$
(13)

where the matrix \({\varvec{A}}(p(t))\) is a \(2 \times 2\) row-stochastic matrix with first row \([p(t), 1 - p(t)]\) and second row \([q, 1 - q]\), and the matrix \({\varvec{B}}\) is a \(2 \times 1\) column vector with entries \([1, 0]^\top\). Here, p(t) is a random variable with unknown distribution at each time step, u(t) is the control input, and q is a fixed parameter.

The objective is to minimize the expected discounted cumulative cost:

$$\begin{aligned} J = \mathbb {E}\left[ \sum _{t=0}^\infty \delta ^t \left( (x_1(t)-s)^2 + (x_2(t)-s)^2 + \gamma u(t)^2\right) \right] , \end{aligned}$$
(14)

where s is the target value of the state variable, \(\delta\) is the discount factor, and \(\gamma\) is the weight of the control input.

The Q-function is parameterized as a quadratic form, \(Q({\varvec{x}}, u) = [{\varvec{x}}; u]^T {\varvec{H}} [{\varvec{x}}; u]\). This choice is motivated by the structure of stochastic linear–quadratic (LQ) optimal control: under random system parameters, the cost function remains quadratic and the dynamics linear, which together preserve the quadratic nature of the optimal Q-function. Consequently, the quadratic parameterization is consistent with the intrinsic mathematical structure of the problem. Beyond theoretical alignment, this formulation offers analytical tractability, enabling closed-form policy updates, while maintaining computational efficiency and facilitating theoretical analysis.

Using the policy iteration method, policy evaluation and policy improvement are performed alternately. In policy evaluation, Q-learning25 is used to estimate the Q-function of the current policy; in policy improvement, the policy is updated by minimizing the Q-function. Since the Q-function is quadratic, the policy improvement step may have an analytical solution, similar to the case in Linear Quadratic Regulation (LQR)10.

According to linear quadratic regulator (LQR) theory, for linear time-invariant systems with quadratic performance indices, there exists an optimal state feedback control law of the form \({\varvec{u}}(t) = -{\varvec{K}}{\varvec{x}}(t)\). Here, \({\varvec{K}}\) is a constant matrix obtained by solving the Riccati equation, which serves as the feedback gain matrix (equivalently termed “control law matrix” in control theory or “policy matrix” in Reinforcement Learning) to parameterize the mapping from system states to control inputs. This is a standard conclusion for the infinite-horizon LQR problem.

Thus, in this section, for \(q \ne 0\), the pair \(({\varvec{A}}(p(t)), {\varvec{B}})\) remains controllable. The learned policy retains an LQR-like structure, though \({\varvec{K}}\) is derived empirically rather than via Riccati equations. The policy u(t) is a linear feedback control in deviation form:

$$\begin{aligned} u(t) = -{\varvec{K}}{\varvec{z}}(t) = - K_1(x_1(t)-s) - K_2(x_2(t)- s), \end{aligned}$$
(15)

where \({\varvec{K}} = [K_1, K_2]\) is the feedback gain matrix to be optimized and \({\varvec{z}}(t) =[x_1(t)-s,x_2(t)- s]^T\) represents the deviation state.
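The controllability claim for \(q \ne 0\) can be checked directly. The worked computation below is a supplementary sketch that treats each realization of p(t) separately; the symbol \(\mathscr {C}\) for the controllability matrix is introduced here for illustration only.

$$\begin{aligned} \mathscr {C} = \begin{bmatrix} {\varvec{B}}&{\varvec{A}}(p(t)){\varvec{B}} \end{bmatrix} = \begin{bmatrix} 1 & p(t) \\ 0 & q \end{bmatrix}, \qquad \det \mathscr {C} = q. \end{aligned}$$

Hence \(\mathscr {C}\) has full rank 2 exactly when \(q \ne 0\), irrespective of the realized value of p(t): the control acts on agent 1 directly and reaches agent 2 only through the trust weight q.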

Policy iteration for stochastic linear quadratic regulation

The objective of policy iteration is to find the optimal gain matrix \({\varvec{K}}\) that minimizes the total cost J. Policy iteration is an efficient dynamic programming method that gradually improves the policy by alternating between Policy Evaluation (PE) and Policy Improvement (PI) until the optimal policy is found. Below we give the specific steps for solving the Stochastic Linear Quadratic Regulation (SLQR) problem using RL-based policy iteration.

Policy evaluation (PE)

During the policy evaluation (PE) step, sample data are collected to fit the parameter matrix \({\varvec{H}}\) of the Q-function. Owing to the quadratic structure of the cost function in (14), the Q-function can be parameterized exactly as a quadratic form. The parameter estimation of the Q-function then becomes a convex least-squares problem, which has a unique solution and can be solved efficiently by linear regression, ensuring the stability and reliability of the estimation process. The Q-function is parameterized as:

$$\begin{aligned} Q({\varvec{z}}, u) = \begin{bmatrix} {\varvec{z}} \\ u \end{bmatrix}^T {\varvec{H}} \begin{bmatrix} {\varvec{z}} \\ u \end{bmatrix}, \end{aligned}$$
(16)

where \({\varvec{H}}\) is a symmetric matrix that needs to be learned as a parameter.

Data Collection: Under the current policy \({\varvec{K}}_j\) (with \({\varvec{K}}_0\) initialized from the average dynamics matrix), apply the control

$$\begin{aligned} u(t) = - {\varvec{K}}_j{\varvec{z}}(t) + \text {exploration noise}, \end{aligned}$$
(17)

where \({\varvec{z}}(t) ={\varvec{x}}(t)-{\varvec{s}}\) with \({\varvec{s}} = s{\varvec{1}}_2\), and exploration noise is added to satisfy the persistent excitation condition. Record \({\varvec{x}}(t)\), u(t), \({\varvec{x}}(t+1)\), and the cost \(c(t) = (x_1(t) - s)^2 + (x_2(t) - s)^2 + \gamma u^2(t)\).

Training Q-function: Using the collected data, update the \({\varvec{H}}\)-matrix by minimizing the Bellman error26. The Bellman equation is:

$$\begin{aligned} Q({\varvec{z}}, u) = c({\varvec{z}}, u) + \delta \mathbb {E}[Q({\varvec{z}}', u')], \end{aligned}$$
(18)

where \(u'\) is the control input at the next time step under the current policy. Because policy evaluation estimates the Q-function of the current policy, the next action \(u'\) is the one selected by that policy at the next state \({\varvec{z}}'\), i.e., \(u' = \pi ({\varvec{z}}')\).

Policy evaluation: Solve for \({\varvec{H}}\) such that:

$$\begin{aligned} \begin{bmatrix} {\varvec{z}} \\ u \end{bmatrix}^T {\varvec{H}} \begin{bmatrix} {\varvec{z}} \\ u \end{bmatrix} = c({\varvec{z}}, u) + \delta \mathbb {E}\left[ \begin{bmatrix} {\varvec{z}}' \\ u' \end{bmatrix}^T {\varvec{H}} \begin{bmatrix} {\varvec{z}}' \\ u' \end{bmatrix} \right] , \end{aligned}$$
(19)

where \({\varvec{z}}'\) is the next state and \(u'\) is the next control input according to the current policy. Since the distribution of p is unknown, the expectation \(\mathbb {E}\) cannot be computed analytically and must be estimated using samples:

$$\begin{aligned} \begin{bmatrix} {\varvec{z}} \\ u \end{bmatrix}^T {\varvec{H}} \begin{bmatrix} {\varvec{z}} \\ u \end{bmatrix} \approx c({\varvec{z}}, u) + \delta \begin{bmatrix} {\varvec{z}}' \\ \pi ({\varvec{z}}') \end{bmatrix}^T {\varvec{H}} \begin{bmatrix} {\varvec{z}}' \\ \pi ({\varvec{z}}') \end{bmatrix}. \end{aligned}$$
(20)

In RL, the goal is to learn an optimal policy \(\pi ^*\). Here \(\pi ({\varvec{z}}')\) represents the action \(u'\) chosen by the policy \(\pi\) in state \({\varvec{z}}'\). By collecting multiple samples \(({\varvec{z}}, u, {\varvec{z}}')\), a least-squares problem27 can be constructed to solve for \({\varvec{H}}\). This approach is similar to the method described by Asri and Rodrigues23 for training a Quadratic Neural Network (QNN): because the quadratic form is linear in the entries of \({\varvec{H}}\), these entries can be obtained by convex optimization.

Lemma 2

(Least-Squares Solution for Q-function Parameter Learning) Assume the Q-function is parameterized as a quadratic form:

$$\begin{aligned} Q({\varvec{z}}, u) = \begin{bmatrix} {\varvec{z}} \\ u \end{bmatrix}^T {\varvec{H}} \begin{bmatrix} {\varvec{z}} \\ u \end{bmatrix} = {\varvec{\phi }}({\varvec{z}}, u)^T {\varvec{h}}, \end{aligned}$$
(21)

where \({\varvec{\phi }}({\varvec{z}}, u)\) contains all quadratic basis terms of \(({\varvec{z}}, u)\), and \({\varvec{h}}\) is the vectorized form of the symmetric matrix \({\varvec{H}}\). Given a dataset \(\{{\varvec{\phi }}_t, c_t, {\varvec{\phi }}_{t+1}\}_{t=1}^N\) generated under policy \(\pi _j\), where \(c_t\) is the instantaneous cost and \(\delta\) is the discount factor, the parameter vector \({\varvec{h}}\) can be learned by solving the following least-squares problem:

$$\begin{aligned} \min _{{\varvec{h}}} \sum _{t=1}^N \left\| {\varvec{\phi }}_t^T {\varvec{h}} - \left( c_t + \delta {\varvec{\phi }}_{t+1}^T {\varvec{h}} \right) \right\| ^2. \end{aligned}$$
(22)

If the data satisfies the persistent excitation condition and the covariance matrix of \(\{{\varvec{\phi }}_t\}\) is full-rank, then the problem admits a unique solution \({\varvec{h}}^*\), which can be efficiently computed via linear regression.
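The least-squares problem (22) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the function names `features`, `h_to_H` and `evaluate_policy` are hypothetical, the states and exploratory controls are drawn independently from a uniform distribution instead of being recorded along a closed-loop trajectory, and `sample_p` stands in for whatever mechanism generates p(t). The code only observes realizations of p; it never uses their distribution.

```python
import numpy as np

def features(z, u):
    """Quadratic basis phi(z, u) with phi^T h = [z; u]^T H [z; u] for symmetric 3x3 H."""
    v = np.array([z[0], z[1], u])
    i, j = np.triu_indices(3)
    scale = np.where(i == j, 1.0, 2.0)     # off-diagonal entries of H appear twice
    return scale * v[i] * v[j]

def h_to_H(h):
    """Rebuild the symmetric matrix H from its upper-triangular vectorization h."""
    H = np.zeros((3, 3))
    H[np.triu_indices(3)] = h
    return H + np.triu(H, 1).T

def evaluate_policy(K, sample_p, q, gamma=1.0, delta=0.9, n_samples=500, seed=0):
    """Fit h by minimizing the sampled Bellman residual of problem (22) with u' = -K z'."""
    rng = np.random.default_rng(seed)
    B = np.array([1.0, 0.0])
    rows, costs = [], []
    for _ in range(n_samples):
        z = rng.uniform(-1.0, 1.0, size=2)          # exploratory deviation state (assumption)
        u = rng.uniform(-1.0, 1.0)                  # exploratory control for excitation
        p = sample_p(rng)                           # observed realization of p
        A = np.array([[p, 1.0 - p], [q, 1.0 - q]])
        z_next = A @ z + B * u                      # deviation dynamics (uses A 1 = 1)
        u_next = -K @ z_next                        # next action from the current policy
        rows.append(features(z, u) - delta * features(z_next, u_next))
        costs.append(z @ z + gamma * u**2)
    h, *_ = np.linalg.lstsq(np.array(rows), np.array(costs), rcond=None)
    return h_to_H(h)

# Hypothetical usage: evaluate the zero policy when p ~ Beta(2, 2) and q = 0.5.
H_hat = evaluate_policy(np.zeros(2), lambda rng: rng.beta(2.0, 2.0), q=0.5)
```

When p is constant the regression equations hold exactly, so the fit reproduces the Q-function of the evaluated policy up to numerical precision; with random p the result is a sample-based estimate whose quality depends on the number and spread of the samples.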

Policy improvement

In the policy improvement step, the control law \({\varvec{K}}\) that minimizes the Q-function with respect to u is derived. This involves taking the derivative of \(Q({\varvec{z}}, u)\) with respect to u and setting it to zero, resulting in a new \({\varvec{K}}\). The optimal control \(u^*\) is given by:

$$\begin{aligned} u^* = \mathop {\mathrm {arg\,min}}\limits _u Q({\varvec{z}}, u). \end{aligned}$$
(23)

Since \(Q({\varvec{z}}, u)\) is a quadratic function of u, we can obtain an analytical solution by taking the derivative with respect to u and setting it to zero. First, partition the symmetric \({\varvec{H}}\) matrix into block matrices based on state and control:

$$\begin{aligned} {\varvec{H}} = \begin{bmatrix} {\varvec{H}}_{zz} & {\varvec{H}}_{zu} \\ {\varvec{H}}_{zu}^T & {\varvec{H}}_{uu} \end{bmatrix}, \end{aligned}$$
(24)

where \({\varvec{H}}_{zz} \in \mathbb {R}^{2 \times 2}\) is the state-state block, \({\varvec{H}}_{zu} \in \mathbb {R}^{2 \times 1}\) is the state-control cross block, and \({\varvec{H}}_{uu} \in \mathbb {R}\) is the control-control block. Expanding \(Q({\varvec{z}}, u)\) yields:

$$\begin{aligned} Q({\varvec{z}}, u) = \begin{bmatrix} {\varvec{z}} \\ u \end{bmatrix}^T {\varvec{H}} \begin{bmatrix} {\varvec{z}} \\ u \end{bmatrix} = {\varvec{z}}^T {\varvec{H}}_{zz} {\varvec{z}} + 2 {\varvec{z}}^T {\varvec{H}}_{zu} u + u^T {\varvec{H}}_{uu} u. \end{aligned}$$
(25)

Taking the derivative with respect to u results in:

$$\begin{aligned} 2 {\varvec{H}}_{zu}^T {\varvec{z}} + 2 {\varvec{H}}_{uu} u = 0 \quad \Rightarrow \quad u = - ({\varvec{H}}_{uu})^{-1} {\varvec{H}}_{zu}^T {\varvec{z}}. \end{aligned}$$
(26)

Therefore, the updated policy matrix \({\varvec{K}}_{j+1}\) is given by:

$$\begin{aligned} {\varvec{K}}_{j+1} = ({\varvec{H}}_{uu})^{-1} {\varvec{H}}_{zu}^T, \end{aligned}$$
(27)

where \({\varvec{H}}_{zu}\) is the cross-term of the \({\varvec{H}}\) matrix involving the state and u.

Lemma 3

(Policy Improvement via Quadratic Q-function) Given a quadratic Q-function parameterized as:

$$\begin{aligned} Q({\varvec{z}}, u) = \begin{bmatrix} {\varvec{z}} \\ u \end{bmatrix}^T {\varvec{H}} \begin{bmatrix} {\varvec{z}} \\ u \end{bmatrix} = {\varvec{z}}^T {\varvec{H}}_{zz} {\varvec{z}} + 2 {\varvec{z}}^T {\varvec{H}}_{zu} u + u^T {\varvec{H}}_{uu} u, \end{aligned}$$
(28)

where \({\varvec{H}}\) is symmetric and partitioned into blocks \({\varvec{H}}_{zz}\), \({\varvec{H}}_{zu}\), \({\varvec{H}}_{uu}\), the optimal control policy \(u^*\) minimizing \(Q({\varvec{z}}, u)\) is analytically given by:

$$\begin{aligned} u^* = - {\varvec{H}}_{uu}^{-1} {\varvec{H}}_{zu}^T {\varvec{z}}. \end{aligned}$$
(29)

Consequently, the updated policy matrix \({\varvec{K}}_{j+1}\) at iteration \(j+1\) is:

$$\begin{aligned} {\varvec{K}}_{j+1} = {\varvec{H}}_{uu}^{-1} {\varvec{H}}_{zu}^T. \end{aligned}$$
(30)

Note that \({\varvec{H}}_{uu}\) must be invertible, which is guaranteed if \(Q({\varvec{z}}, u)\) is strictly convex in u (i.e., \({\varvec{H}}_{uu} \succ 0\)). This lemma provides the explicit policy improvement step in a policy iteration algorithm and complements Lemma 2, which covers policy evaluation.
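The closed-form update (30) amounts to a single block extraction and division. The following Python sketch is a minimal illustration for the two-agent, scalar-input case; the function name `improve_policy` and the numerical matrix are hypothetical and do not come from the experiments in this paper.

```python
import numpy as np

def improve_policy(H):
    """Return K_{j+1} = H_uu^{-1} H_zu^T for a symmetric 3x3 H (two states, one input)."""
    H = np.asarray(H, dtype=float)
    H_zu = H[:2, 2]                  # state-control cross block
    H_uu = H[2, 2]                   # scalar control-control block
    if H_uu <= 0.0:
        raise ValueError("H_uu must be positive for Q to be strictly convex in u")
    return H_zu / H_uu

# Hypothetical H-matrix estimate, for illustration only.
H = np.array([[3.0, 0.5, 1.2],
              [0.5, 2.0, 0.4],
              [1.2, 0.4, 2.0]])
K_next = improve_policy(H)           # the improved control is u* = -K_next @ z
```

The explicit check on \({\varvec{H}}_{uu}\) reflects the invertibility requirement of Lemma 3: a noisy estimate with a non-positive control block would not define a minimizer.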

Iteration until convergence and solution algorithm

We now use the Least-Squares Policy Iteration (LSPI)32 algorithm, which integrates least-squares Q-learning for policy evaluation (Lemma 2) with analytical policy improvement (Lemma 3) to iteratively optimize the control gain \({\varvec{K}}\). The detailed procedure is summarized in Algorithm 1.

Algorithm 1. LSPI for partially known stochastic opinion dynamics.
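The alternation of Lemma 2 and Lemma 3 can be sketched as a short loop. The Python code below is an illustrative reconstruction based only on the steps described in the text and should not be read as the authors' Algorithm 1: the function names `phi` and `lspi`, the zero initial gain, the episodic restarts of the state, the Gaussian exploration noise, the stopping tolerance and the Beta(5, 5) distribution in the usage line are all assumptions.

```python
import numpy as np

I_UP, J_UP = np.triu_indices(3)
SCALE = np.where(I_UP == J_UP, 1.0, 2.0)     # off-diagonal entries of H appear twice

def phi(z, u):
    """Quadratic basis with phi^T h = [z; u]^T H [z; u]."""
    v = np.array([z[0], z[1], u])
    return SCALE * v[I_UP] * v[J_UP]

def lspi(sample_p, q, gamma=1.0, delta=0.9, n_iter=20, n_samples=500,
         episode_len=25, noise_std=0.5, tol=1e-8, seed=0):
    """Alternate least-squares policy evaluation and closed-form policy improvement."""
    rng = np.random.default_rng(seed)
    B = np.array([1.0, 0.0])
    K = np.zeros(2)                                   # initial gain (assumption of this sketch)
    H = np.zeros((3, 3))
    for _ in range(n_iter):
        rows, costs = [], []
        for t in range(n_samples):
            if t % episode_len == 0:
                z = rng.uniform(-1.0, 1.0, size=2)    # restart to keep the data well excited
            u = -K @ z + noise_std * rng.standard_normal()   # control with exploration, Eq. (17)
            p = sample_p(rng)                         # observed realization; its law is not used
            A = np.array([[p, 1.0 - p], [q, 1.0 - q]])
            z_next = A @ z + B * u
            rows.append(phi(z, u) - delta * phi(z_next, -K @ z_next))
            costs.append(z @ z + gamma * u**2)
            z = z_next
        # Policy evaluation: least-squares fit of h, problem (22).
        h, *_ = np.linalg.lstsq(np.array(rows), np.array(costs), rcond=None)
        H = np.zeros((3, 3))
        H[I_UP, J_UP] = h
        H = H + np.triu(H, 1).T
        # Policy improvement: K_{j+1} = H_uu^{-1} H_zu^T, Eq. (30).
        K_new = H[:2, 2] / H[2, 2]
        converged = np.linalg.norm(K_new - K) < tol
        K = K_new
        if converged:
            break
    return K, H

# Hypothetical usage with p ~ Beta(5, 5) and q = 0.5.
K_learned, H_learned = lspi(lambda rng: rng.beta(5.0, 5.0), q=0.5)
```

In the degenerate case of a constant p, the loop reduces to policy iteration for a deterministic discounted LQR problem, which gives a convenient sanity check against the gain obtained from the corresponding Riccati recursion.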

Global convergence of \(\textbf{H}\)-matrix estimation and almost sure convergence to optimal policy

This section establishes the theoretical guarantees for the proposed method, including the global convergence of the \({\varvec{H}}\)-matrix estimation and the almost sure convergence of the policy to the optimal one, which are the core theoretical supports for the third innovation.

Lemma 4

(Consistency of \({\varvec{H}}\)-matrix Least-Squares Estimation) Under the persistent excitation condition of the exploration noise (ensuring the covariance matrix of \(\{{\varvec{\phi }}_t\}\) is full-rank) and sufficient sample size N, the least-squares estimator \(\hat{{\varvec{h}}}\) of the \({\varvec{H}}\)-matrix vectorization \({\varvec{h}}\) (defined in Lemma 2) is consistent, i.e., \(\hat{{\varvec{h}}} \rightarrow {\varvec{h}}^*\) almost surely as \(N \rightarrow \infty\). Here, \({\varvec{h}}^*\) is the true parameter vector corresponding to the optimal Q-function.

Proof

The complete proof is presented in Appendix A.1. \(\square\)

Theorem 5

(Global Convergence of \({\varvec{H}}\)-matrix Estimation and Almost Sure Convergence to Optimal Policy) For the stochastic linear quadratic system in (1) with the cost function in (3), under the policy iteration framework (Lemmas 2 and 3) and the persistent excitation condition, the following conclusions hold:

1. The sequence of \({\varvec{H}}\)-matrix estimates \(\{\hat{{\varvec{H}}}_j\}\) converges globally to the optimal \({\varvec{H}}^*\) (corresponding to the optimal Q-function \(Q^*\));

2. The policy sequence \(\{{\varvec{K}}_j\}\) converges almost surely to the optimal policy \({\varvec{K}}^*\).

Proof

The complete proof is presented in Appendix A.2. \(\square\)

Remark 2

Theorem 5 provides rigorous theoretical guarantees for the third innovation. It confirms that the \({\varvec{H}}\)-matrix estimation converges globally without becoming trapped in local optima, and that the learned policy converges almost surely to the optimal one. This guarantee follows from the convexity of the least-squares problem and the monotonicity of policy iteration, in contrast to non-convex deep learning-based methods, which lack global convergence guarantees.

In summary, this part addresses a linear quadratic regulation problem with stochastic dynamics using policy iteration. The approach alternates data-driven policy evaluation and policy improvement to find the optimal control strategy. The traditional LQR solution serves as the initial policy, but because the system is stochastic, the optimal strategy may require further adjustment through policy iteration. The effectiveness of the final policy, which stabilizes the state near the target \(s\), is verified through simulation in the next section, where the learned policy is also compared with the theoretical LQR solution (which is valid for the deterministic average case).

Model-free reinforcement learning (RL) for unknown system dynamics

Recent advances in the field of data-driven control, particularly Reinforcement Learning (RL), offer promising alternatives for model-free optimal control28,29,30,31. Based on these developments, this section focuses on the most extreme scenario in social network opinion dynamics: when both the system matrix \({\varvec{A}}\) (governing opinion interactions) and the control matrix \({\varvec{B}}\) (describing the effects of interventions) are completely unknown. This setting is motivated by real social platforms, where operators typically observe only opinion trajectories, while the underlying interaction mechanisms remain inaccessible. We investigate how to derive an optimal intervention policy using a model-free Reinforcement Learning approach in this setting.

Problem formulation with complete model uncertainty

The generalized system dynamics maintain the same form but with expanded uncertainty:

$$\begin{aligned} {\varvec{x}}(t+1) = {\varvec{A}}{\varvec{x}}(t) + {\varvec{B}}u(t), \end{aligned}$$
(31)

where \({\varvec{A}}\) and \({\varvec{B}}\) are time-varying random matrices with unknown distributions, but with known structural constraints.

The cost function remains identical to Equation (14), preserving the quadratic structure that enables our Q-learning approach.

Algorithm framework and implementation

Although the system model is unknown, the policy iteration framework introduced in Section 3 remains applicable, but its implementation requires key adaptations for the model-free setting. The core idea is to implicitly capture the effects of unknown dynamics through data-driven approximation of the Q-function, while leveraging the analytic convenience of quadratic parameterization for policy improvement.

Policy evaluation (PE)

Using collected state-action-transition data, a quadratic action-value function \(Q({\varvec{z}}, u)\) is estimated directly via Least-Squares Policy Iteration (LSPI). This function implicitly encodes all information about the system dynamics and cost.
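To make the quadratic parameterization concrete, the sketch below builds a six-term quadratic basis for \([z_1, z_2, u]\) and maps a coefficient vector back to a symmetric \({\varvec{H}}\). The names `quad_features` and `h_to_H` and the ordering of the monomials are assumptions made for illustration; the exact basis used in the algorithms may differ.

```python
import numpy as np

def quad_features(z, u):
    """Monomials v_i * v_j (i <= j) of v = [z1, z2, u]; six terms in total."""
    v = np.append(np.asarray(z, dtype=float), float(u))
    i, j = np.triu_indices(v.size)
    return v[i] * v[j]

def h_to_H(h, dim=3):
    """Rebuild the symmetric H from the coefficient vector h.

    An off-diagonal monomial v_i v_j appears once in the basis but twice
    in v^T H v, so its coefficient is split evenly between H_ij and H_ji.
    """
    H = np.zeros((dim, dim))
    H[np.triu_indices(dim)] = h
    return 0.5 * (H + H.T)

# Consistency check with hypothetical coefficients: phi(z, u) . h == v^T H v.
h_demo = np.array([1.0, 0.2, 0.4, 1.5, 0.6, 0.8])
z_demo, u_demo = np.array([0.3, -0.5]), 0.7
v_demo = np.append(z_demo, u_demo)
q_from_features = quad_features(z_demo, u_demo) @ h_demo
q_from_matrix = v_demo @ h_to_H(h_demo) @ v_demo
```

Because the Q-function is linear in the coefficient vector, fitting it reduces to a linear least-squares problem in these features.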

Policy improvement (PI)

The estimated quadratic Q-function is used to compute the greedily improved policy analytically via differentiation, yielding an updated feedback gain matrix \({\varvec{K}}\).
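For reference, the differentiation step can be written out explicitly in the block notation of Eq. (28). This is the standard first-order condition for a quadratic that is strictly convex in u, restated here under the assumption \({\varvec{H}}_{uu} \succ 0\):

$$\begin{aligned} \frac{\partial Q({\varvec{z}}, u)}{\partial u} = 2 {\varvec{H}}_{zu}^T {\varvec{z}} + 2 {\varvec{H}}_{uu} u = 0 \quad \Longrightarrow \quad u^* = - {\varvec{H}}_{uu}^{-1} {\varvec{H}}_{zu}^T {\varvec{z}} = -{\varvec{K}} {\varvec{z}}, \qquad {\varvec{K}} = {\varvec{H}}_{uu}^{-1} {\varvec{H}}_{zu}^T. \end{aligned}$$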

Data collection

An off-policy, offline data collection paradigm is adopted. A fixed behavior policy (typically a conservative initial policy augmented with exploration noise) is used to collect a large batch of interaction data in a single pass, which is then reused throughout the iterations. This approach contrasts with the on-policy, online data collection used in Section 3, and is better suited for obtaining stable, efficient, and coverage-rich data when the model is unknown.

Overall implementation of the algorithm

When the system dynamics are completely unknown, the algorithm initialization requires special attention: the policy is initialized conservatively as \({\varvec{K}}_0 = [0,0]\). The algorithm then iteratively improves the policy using offline data and LSPI updates. Convergence is determined based on the norm of successive policy gain updates.
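The overall loop can be sketched as follows. This is a minimal illustration rather than the implementation behind Algorithm 2: it uses the standard LSTD-Q form of least-squares policy evaluation (which may differ in detail from Lemma 2), a small ridge term, and a toy environment in which p(t), q(t), and b(t) are all drawn uniformly from [0, 1]. The function names, the ridge value, and the toy distributions are assumptions.

```python
import numpy as np

def features(Z, U):
    """Row-wise quadratic monomials of v = [z, u] (upper triangle of v v^T)."""
    V = np.column_stack([Z, U])
    i, j = np.triu_indices(V.shape[1])
    return V[:, i] * V[:, j]

def lspi(Z, U, C, Z_next, K0, discount=0.9, ridge=1e-6, tol=1e-6, max_iter=50):
    """Offline LSPI for a scalar control u = -K z on a fixed batch."""
    K = np.asarray(K0, dtype=float)
    n = K.size
    Phi = features(Z, U)
    for _ in range(max_iter):
        # Policy evaluation (LSTD-Q): next actions follow the current policy.
        Phi_next = features(Z_next, -Z_next @ K)
        M = Phi.T @ (Phi - discount * Phi_next) + ridge * np.eye(Phi.shape[1])
        h = np.linalg.solve(M, Phi.T @ C)
        # Rebuild the symmetric H and improve the policy analytically.
        H = np.zeros((n + 1, n + 1))
        H[np.triu_indices(n + 1)] = h
        H = 0.5 * (H + H.T)
        K_new = H[:n, n] / H[n, n]        # K = H_uu^{-1} H_zu^T for scalar u
        converged = np.linalg.norm(K_new - K) < tol
        K = K_new
        if converged:
            break
    return K

# Toy offline batch (assumed distributions, not those of the experiments).
rng = np.random.default_rng(0)
N, gamma = 3000, 0.2
Z = rng.uniform(-1.0, 1.0, size=(N, 2))
U = rng.normal(0.0, 0.8, size=N)          # zero-gain policy plus exploration noise
p, q, b = rng.uniform(0.0, 1.0, size=(3, N))
Z_next = np.column_stack([p * Z[:, 0] + (1 - p) * Z[:, 1] + b * U,
                          q * Z[:, 0] + (1 - q) * Z[:, 1] + (1 - b) * U])
C = (Z ** 2).sum(axis=1) + gamma * U ** 2
K_learned = lspi(Z, U, C, Z_next, K0=np.zeros(2))
```

The same batch is reused in every iteration; only the next-state features change with the current gain, which is what makes the procedure off-policy.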

These improvements allow the algorithm to still guarantee reliability in completely model-free scenarios and show stronger robustness than in the previous section. The algorithm is executed as follows:

Algorithm 2

Model-Free LSPI for Unknown Stochastic Dynamics.

Algorithm 1 and Algorithm 2 are tailored for distinct scenarios of stochastic opinion dynamics, serving as complementary solutions rather than a sequential improvement. This design is motivated by addressing different practical constraints and application requirements, which is elaborated as follows to clarify their respective positioning.

Algorithm 1 is designed for scenarios where the stochastic opinion dynamics are partially known and the initial operating condition is fixed: it adopts an on-policy online data sampling paradigm, which ensures tight alignment between collected data and the current policy, and uses a 9-dimensional full quadratic basis function for precise Q-function estimation under scenario-specific assumptions.

In contrast, Algorithm 2 targets more general and challenging scenarios with unknown stochastic dynamics and variable initial conditions: it employs an off-policy offline data collection framework, where a single batch of data is collected once and reused for all iterations, significantly reducing computational overhead in complex unknown environments. To enhance policy generalization across diverse initial states, Algorithm 2 samples initial states uniformly from a bounded set instead of fixing them. A simplified 6-dimensional quadratic basis function is adopted to mitigate computational complexity and avoid overfitting in model-free settings. Additionally, Algorithm 2 standardizes hyperparameters (e.g., separate random seeds for data generation and evaluation) and adopts a stricter convergence tolerance to ensure reproducibility and control precision in unknown dynamics.

Overall, the two algorithms cover different application scenarios of stochastic opinion dynamics: Algorithm 1 is suitable for scenario-specific tasks with known partial dynamics, while Algorithm 2 is applicable to general tasks with unknown dynamics and variable initial conditions. Their complementary design enhances the applicability and robustness of the proposed framework to diverse practical requirements.

Numerical simulations

This section presents two numerical experiments to validate the performance of the Least Squares Policy Iteration (LSPI) algorithm for the stochastic optimal control of two-agent opinion dynamics. Both experiments are built upon the same core system and cost function formulations but employ distinct configurations of unknown dynamics to demonstrate the general applicability of the model-free RL approach.

The stochastic opinion dynamics adhere to the following discrete-time state-space model:

$$\begin{aligned} {\varvec{x}}(t+1) = {\varvec{A}}(t) {\varvec{x}}(t) + {\varvec{B}}(t)u(t), \end{aligned}$$
(32)

where \({\varvec{x}}(t)=[x_1(t), x_2(t)]^\top\) denotes the vector of agent opinions, u(t) is the global control input, \({\varvec{A}}(t)=\begin{bmatrix} p(t) & 1-p(t) \\ q(t) & 1-q(t) \end{bmatrix}\) is a row-stochastic transition matrix capturing the endogenous interaction between agents, and \({\varvec{B}}(t)=[b(t), 1-b(t)]^\top\) is the column-stochastic control allocation vector with \(b(t) \in [0,1]\).

The objective is to find a linear control policy \(u^*(t) = -{\varvec{K}}^* {\varvec{z}}(t)\), where \({\varvec{z}}(t) = {\varvec{x}}(t) - s {\varvec{1}}_2\) is the deviation from the target consensus \(s {\varvec{1}}_2\), that minimizes the expected discounted infinite-horizon cost:

$$\begin{aligned} J^* = \min _{{\varvec{K}}} \mathbb {E}\left[ \sum _{t=0}^{\infty } \delta ^{t} \Big ( \Vert {\varvec{x}}(t) - s {\varvec{1}}_2\Vert _2^2 + \gamma u(t)^2 \Big )\right] , \end{aligned}$$
(33)

This objective is implemented in Algorithm 1 and Algorithm 2 through the Q-function parameterization and policy iteration framework described in Sections 3 and 4, with target opinion \(s=1.2\), discount factor \(\delta =0.9\), and control penalty weight \(\gamma =0.2\).
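A single transition of model (32) and the stage cost in (33) can be sketched as below. Because \({\varvec{A}}(t)\) is row-stochastic, \({\varvec{A}}(t) s {\varvec{1}}_2 = s {\varvec{1}}_2\), so the deviation obeys the same recursion, \({\varvec{z}}(t+1) = {\varvec{A}}(t){\varvec{z}}(t) + {\varvec{B}}(t)u(t)\); the code checks this numerically. The helper names and the particular realization of (p, q, b) are hypothetical.

```python
import numpy as np

S_TARGET, GAMMA = 1.2, 0.2   # target opinion and control weight from the text

def transition(p, q, b):
    """Build A(t) and B(t) of Eq. (32) from one realization of p, q, b."""
    A = np.array([[p, 1.0 - p], [q, 1.0 - q]])
    B = np.array([b, 1.0 - b])
    return A, B

def step(x, u, p, q, b):
    """One step of the opinion dynamics in the original coordinates."""
    A, B = transition(p, q, b)
    return A @ x + B * u

def stage_cost(x, u):
    """Per-step cost ||x - s 1||^2 + gamma u^2."""
    z = x - S_TARGET
    return float(z @ z + GAMMA * u ** 2)

# Hypothetical realization of the random parameters at one time step.
p, q, b = 0.5, 0.2, 1.0
x = np.array([1.0, 2.0])
K = np.array([0.4, 0.5415])            # initial gain reported for the first experiment
u = float(-K @ (x - S_TARGET))
x_next = step(x, u, p, q, b)

# The same update written directly in deviation coordinates.
A, B = transition(p, q, b)
z_next = A @ (x - S_TARGET) + B * u
```

Working in deviation coordinates is what allows the Q-function to be a pure quadratic form in \(({\varvec{z}}, u)\) without linear or constant terms.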

Model-free RL with unknown distributions

In this first experiment, we consider a common scenario in social networks where the precise strength of interpersonal influence is unpredictable and time-varying. Specifically, we model the case where the influence weight p(t) that Agent 1 places on its own opinion versus that of Agent 2 is stochastic and its distribution is completely unknown to the learning agent. The system (32) is partially known: the control allocation vector is set to \({\varvec{B}} = [1, 0]^\top\), meaning the controller can only directly influence Agent 1; the second row of \({\varvec{A}}(t)\) is fixed and known (\(q(t)=0.2\)); and the initial state is \({\varvec{x}}(0)=[1.0, 2.0]^\top\). The unknown parameter p(t) is randomly sampled at each time step from one of three latent probability distributions (Uniform(0,1), Beta(3,5), or Truncated Normal(0.3,0.7)). These distributions are used solely to define the test environment and are never revealed to the RL algorithm, which must learn an effective control policy solely from observed state transitions and costs.

The learning procedure follows the standard LSPI framework (Algorithm 1). A dataset of experience tuples is generated through environment interaction to serve as the basis for policy iteration. For each tuple, an initial state deviation \({\varvec{z}}_t\) is uniformly sampled from \([-1, 1]^2\). The agent, using its current policy \({\varvec{K}}_j\), selects an exploratory action \(u_t = -{\varvec{K}}_j {\varvec{z}}_t + \eta (t)\), where \(\eta (t) \sim \mathscr {N}(0, 0.0025)\) is additive Gaussian noise. The environment then computes the next state: it first instantiates \({\varvec{A}}(t)\) by sampling p(t) from the latent (unknown) distribution and using the known \(q(t)=0.2\), then applies (32) to obtain \({\varvec{z}}_{t+1}\). The immediate cost \(c_t = {\varvec{z}}_t^\top {\varvec{z}}_t + \gamma u_t^2\) is computed. Crucially, only the tuple \(({\varvec{z}}_t, u_t, c_t, {\varvec{z}}_{t+1})\) is recorded; the sampled p(t) value and the resulting \({\varvec{A}}(t)\) matrix are never stored or used by the learning algorithm. This process is repeated to build a sample set of size \(M=3000\).

Algorithm 1 is applied to this dataset. For numerical stability, ridge regression is employed during the policy evaluation step and an adaptive learning rate (\(\alpha \in [0.05, 0.3]\)) smooths the policy updates. The initial policy gain \({\varvec{K}}_0 = [0.4, 0.5415]\) is derived from the LQR solution for the nominal system with \(\mathbb {E}[p]=0.5\), providing a reasonable starting point. Algorithm hyperparameters, chosen in line with standard practices, are detailed in Table 1. To ensure statistical reliability, results are averaged over 5 independent runs with different random seeds.

Table 1 Hyperparameters for LSPI (Unknown p(t) Distribution).

Table 2 presents the quantitative comparison between the learned policies and the theoretical LQR solutions (computed assuming known distribution). The percentage deviations are calculated as:

$$\begin{aligned} \Delta K_i = \frac{|K_{i,\text {learned}} - K_{i,\text {LQR}}|}{|K_{i,\text {LQR}}|} \times 100\%, \end{aligned}$$
(34)

and all results are reported as mean ± standard deviation over 5 independent runs.
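The deviation metric in (34) can be computed as in the sketch below. The gains in `runs` and `K_ref` are hypothetical placeholders, and taking the mean and standard deviation of the per-run deviations (rather than the deviation of the mean gain) is an assumption about how the statistics are aggregated.

```python
import numpy as np

def gain_deviation_percent(K_learned, K_lqr):
    """Element-wise relative deviation of Eq. (34), in percent."""
    K_learned = np.asarray(K_learned, dtype=float)
    K_lqr = np.asarray(K_lqr, dtype=float)
    return np.abs(K_learned - K_lqr) / np.abs(K_lqr) * 100.0

# Hypothetical learned gains from five runs (rows) and a reference gain.
runs = np.array([[0.52, 0.61],
                 [0.49, 0.58],
                 [0.51, 0.62],
                 [0.50, 0.59],
                 [0.53, 0.60]])
K_ref = np.array([0.50, 0.60])
dev = gain_deviation_percent(runs, K_ref)     # shape (5, 2)
dev_mean = dev.mean(axis=0)
dev_std = dev.std(axis=0, ddof=1)             # sample standard deviation
```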

Table 2 Quantitative comparison of learned policies and theoretical LQR solutions (Mean ± Std over 5 runs).

Table 2 compares the performance of learned RL policies against theoretical LQR benchmarks across three stochastic environments. The RL algorithm consistently learns effective control gains with errors below 20% relative to the model-based benchmarks. Under the Uniform distribution, the policy achieves near-optimal performance with only a 1.8% cost increase, validating learning efficacy in symmetric environments. For the Beta distribution, the 18.7% cost degradation reflects the challenge of skewed dynamics, yet stability is maintained. Most notably, under the Truncated Normal distribution, the RL policy attains a cost 28.0% lower than the theoretical LQR benchmark, demonstrating that adaptive exploration can exploit bounded distribution properties more effectively than model-based approaches reliant on Gaussian assumptions. The low standard deviations across multiple runs confirm statistical robustness, with steady-state control variances below \(10^{-3}\) for all distributions.

Fig. 1

Convergence of agent opinion dynamics under different distributions (mean ± 1 std over 5 runs). Solid lines: \(x_1(t)\), dashed lines: \(x_2(t)\). The shaded regions represent one standard deviation around the mean, indicating statistical variability across independent runs. Time is measured in simulation steps.

Figure 1 illustrates the convergence of opinion dynamics starting from initial values \(x_1(0) = 1.0\) and \(x_2(0) = 2.0\). All trajectories successfully converge to the target \(s = 1.2\) within 100 simulation steps. Key observations include: (a) All distributions achieve convergence despite different stochastic characteristics. (b) Beta(3,5) shows slower initial convergence due to its skewed mean (\(\mathbb {E}[p] = 0.375\)), while Truncated Normal exhibits the smoothest convergence with minimal oscillations. (c) The narrow shaded regions (\(\pm 1\) std) indicate low variability across 5 independent runs, confirming algorithm stability.

Fig. 2

Control input trajectories \(u(t)\) under different distributions (mean ± 1 std over 5 runs). The control signals are negative initially (driving opinions downward from above-target values) and converge to near-zero steady-state values. The steady-state control magnitudes (last 20 steps) are approximately zero for all distributions, confirming successful stabilization. Time is measured in simulation steps.

Figure 2 shows the corresponding control input trajectories. Key characteristics include: (a) The control signals are consistent with the objective of steering opinions from the initial values (1.0, 2.0) to the target 1.2. (b) All control signals approach zero in steady state, with final values \(u(t) \approx 0\) for \(t> 80\). (c) Beta(3,5) requires larger initial control magnitudes due to its lower mean \(p\)-value, while Uniform and Truncated Normal distributions show more moderate control signals.

Fig. 3

Cumulative discounted cost trajectories of the RL policy and the theoretical optimal controller from a fixed initial condition \(x_0=[1,2]^\top\) (mean ± one standard deviation over 5 rollouts). The vertical axis shows the time-accumulated cost \(\sum _{k=0}^{t} \delta ^k c_k\), which reflects transient cost evolution along representative trajectories. The percentage values indicate the relative difference in the expected total discounted cost \(\Delta = (J_{\textrm{RL}}-J_{\textrm{opt}})/J_{\textrm{opt}}\times 100\%\), where \(J=\mathbb {E}[\sum _{k=0}^{T-1}\delta ^k c_k]\) is estimated by Monte-Carlo simulation over randomly sampled initial states.

Figure 3 reports two complementary performance metrics. The curves show the cumulative discounted cost trajectories \(\sum _{k=0}^{t}\delta ^k c_k\) from a fixed initial condition \(x_0=[1,2]^\top\), which illustrate the transient evolution of control effort and state deviation along representative rollouts. In contrast, the percentage value \(\Delta\) summarizes the relative difference in the expected total discounted cost \(J=\mathbb {E}[\sum _{k=0}^{T-1}\delta ^k c_k]\), estimated via Monte-Carlo simulations over randomly sampled initial states.
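A Monte-Carlo estimate of \(J\) and the relative gap \(\Delta\) can be sketched as follows. The initial-state distribution (uniform on \([-1,1]^2\) in deviation coordinates), the horizon, the episode count, and the two gains being compared are assumptions made for illustration; `sample_params` is a hypothetical callback returning one realization of (p, q, b).

```python
import numpy as np

def mc_cost(K, sample_params, rng, n_episodes=200, horizon=100,
            delta=0.9, gamma=0.2):
    """Average truncated discounted cost over random initial deviations."""
    totals = np.empty(n_episodes)
    for e in range(n_episodes):
        z = rng.uniform(-1.0, 1.0, size=2)   # assumed initial-state distribution
        total = 0.0
        for t in range(horizon):
            u = float(-K @ z)
            total += delta ** t * float(z @ z + gamma * u ** 2)
            p, q, b = sample_params(rng)
            A = np.array([[p, 1.0 - p], [q, 1.0 - q]])
            B = np.array([b, 1.0 - b])
            z = A @ z + B * u
        totals[e] = total
    return float(totals.mean())

def relative_gap(J_rl, J_opt):
    """Relative difference in percent, as used for the Delta values."""
    return (J_rl - J_opt) / J_opt * 100.0

# Example: p(t) ~ Uniform(0, 1), q = 0.2, B = [1, 0]; both gains are hypothetical.
uniform_env = lambda rng: (rng.uniform(0.0, 1.0), 0.2, 1.0)
J_a = mc_cost(np.array([0.40, 0.54]), uniform_env, np.random.default_rng(0))
J_b = mc_cost(np.array([0.45, 0.50]), uniform_env, np.random.default_rng(0))
gap = relative_gap(J_a, J_b)
```

Reusing the same seed for both policies gives them identical initial states and parameter draws, which reduces the variance of the estimated gap.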

For the Uniform(0, 1) distribution, the RL policy achieves a cost nearly identical to the theoretical controller (\(\Delta =+1.8\%\)), indicating successful learning of the average system dynamics. Under the Beta(3, 5) distribution, the RL policy incurs a moderately higher expected cost (\(\Delta =+18.7\%\)), which is attributable to the strong skewness and heavier tail of the distribution, which violate the linear–quadratic optimality assumptions.

Under the Truncated Normal distribution, the RL policy attains a significantly lower expected discounted cost (\(\Delta =-28.0\%\)), despite occasional crossings in the trajectory-level curves. This indicates that, although individual rollout trajectories may exhibit transient variability, the model-free RL controller achieves a substantially smaller expected cumulative cost when averaged over stochastic initial conditions and realizations of the uncertainty.

Model-free RL with unknown system dynamics

The second experiment evaluates the proposed approach in a more challenging setting where the dynamics of system (32) are entirely unknown to the agent. In this scenario, all time-varying parameters—p(t), q(t), b(t)—are unspecified, simulating an opinion network where interpersonal influence strengths and the allocation of control effects are not only hidden but may also evolve stochastically over time. Under these conditions, model-based methods are infeasible. Instead, the Least-Squares Policy Iteration (LSPI) algorithm (Algorithm 2) must learn the optimal control policy \(u(t) = -{\varvec{K}} {\varvec{z}}(t)\) directly from offline interaction data, demonstrating its capability as a fully model-free solver.

To construct a reproducible yet opaque environment, we define the latent parameter distributions as: \(p(t) \sim \text {Uniform}(0,1)\), \(q(t) \sim \text {Beta}(3,5)\), and \(b(t) \sim \text {TruncNormal}(0.3,0.7,0,1)\). This selection covers uniform, skewed, and bounded-normal regimes, providing a comprehensive test bed. The agent has no prior knowledge of these distributions or their temporal variations.

Key hyperparameters for LSPI are listed in Table 3. A convergence threshold \(\epsilon = 10^{-6}\) ensures that gain fluctuations remain below \(1.5\%\) across trials. We conduct \(n_{\text {trials}}=20\) independent runs with different random seeds to ensure statistical reliability.

Table 3 Hyperparameters for LSPI under Fully Unknown Dynamics.

Data generation follows a standard offline RL paradigm: using an initial zero-gain policy (\({\varvec{K}}_0 = [0, 0]\)) augmented with clipped Gaussian noise (\(\mathscr {N}(0, 0.8^2)\)), we collect a dataset of 2000 episodes (each of 220 steps), totaling 440,000 state-transition tuples \(({\varvec{z}}_t, u_t, c_t, {\varvec{z}}_{t+1})\). Algorithm 2 processes this offline dataset, using ridge regression for policy evaluation. No information about p(t), q(t), b(t) or their distributions is used. Each learned policy is evaluated over 1000 episodes to compute its expected discounted cost.
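The data-generation step can be sketched as below. Several details are assumptions rather than facts from the experiments: TruncNormal(0.3, 0.7, 0, 1) is read as a normal with mean 0.3 and standard deviation 0.7 truncated to [0, 1], the noise clipping bound `noise_clip` is hypothetical, and initial deviations are drawn uniformly from \([-1,1]^2\). The episode count is reduced here so the sketch runs quickly.

```python
import numpy as np

def trunc_normal(rng, mean, std, low, high):
    """Rejection sampler for a normal distribution truncated to [low, high]."""
    while True:
        x = rng.normal(mean, std)
        if low <= x <= high:
            return x

def collect_offline(n_episodes, n_steps, rng, gamma=0.2, noise_std=0.8,
                    noise_clip=2.0, K0=None):
    """Collect (z, u, c, z_next) tuples under a fixed behavior policy."""
    K0 = np.zeros(2) if K0 is None else np.asarray(K0, dtype=float)
    data = []
    for _ep in range(n_episodes):
        z = rng.uniform(-1.0, 1.0, size=2)           # assumed bounded initial set
        for _t in range(n_steps):
            noise = float(np.clip(rng.normal(0.0, noise_std),
                                  -noise_clip, noise_clip))
            u = float(-K0 @ z) + noise
            # Latent parameters: sampled by the environment, never stored.
            p = rng.uniform(0.0, 1.0)
            q = rng.beta(3.0, 5.0)
            b = trunc_normal(rng, 0.3, 0.7, 0.0, 1.0)   # assumed reading
            A = np.array([[p, 1.0 - p], [q, 1.0 - q]])
            B = np.array([b, 1.0 - b])
            z_next = A @ z + B * u
            data.append((z, u, float(z @ z + gamma * u ** 2), z_next))
            z = z_next
    return data

# 20 episodes instead of the 2000 used in the experiment, to keep the sketch fast.
batch = collect_offline(n_episodes=20, n_steps=220, rng=np.random.default_rng(0))
```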

The algorithm demonstrates strong performance, reproducibility, and robustness. Aggregated results over 20 trials are shown in Table 4. The low coefficients of variation (CV < 2% for gains, < 0.5% for cost) and narrow 95% confidence intervals confirm high reproducibility. All trials converged within \(5 \pm 1\) iterations.

Table 4 Statistical Results: Learned Policy over 20 Trials (Fully Unknown Dynamics).

The small CV values (below \(2\%\) for control gains and \(0.45\%\) for cost) indicate high output consistency across different random seeds in an unknown stochastic environment, confirming the reproducibility of the model-free LSPI approach. The narrow confidence intervals (0.6–\(0.7\%\) of the mean) are substantially smaller than the inherent cost variation induced by the stochastic parameters (0.17–0.18), demonstrating the algorithm’s ability to suppress disturbances from unknown dynamics. All trials converged within \(5 \pm 1\) iterations under the condition \(\Vert \Delta K\Vert < 10^{-6}\), with no instances of divergence, verifying stable policy iteration within the model-free framework.
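The reproducibility statistics can be computed as in the following sketch. It uses a normal-approximation 95% interval (critical value 1.96); whether the reported intervals use this or a Student-t critical value is not stated, so this choice is an assumption, and the cost values are hypothetical.

```python
import numpy as np

def summarize(samples, z_crit=1.96):
    """Mean, sample std, coefficient of variation (%), and approximate 95% CI."""
    x = np.asarray(samples, dtype=float)
    mean = float(x.mean())
    std = float(x.std(ddof=1))
    half_width = z_crit * std / np.sqrt(x.size)
    return {
        "mean": mean,
        "std": std,
        "cv_percent": 100.0 * std / abs(mean),
        "ci95": (mean - half_width, mean + half_width),
    }

# Hypothetical per-trial costs for illustration only.
trial_costs = [0.95, 0.94, 0.96, 0.95, 0.94, 0.95]
stats = summarize(trial_costs)
```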

To illustrate the detailed dynamic behavior—such as policy convergence and state regulation—we present results from a representative single trial whose outputs lie within one standard deviation of the aggregated means. This avoids obscuring iterative details that can be lost in averaged plots while maintaining statistical representativeness.

Fig. 4

Cumulative discounted cost (mean over 1000 evaluation episodes) as a function of LSPI iteration. The blue solid line denotes the mean cost of the current policy; the red dashed line marks the convergence iteration (5th iteration, satisfying \(\Vert \Delta {\varvec{K}}\Vert < 10^{-6}\)); the green dashed line denotes the stable mean cost after convergence. Note that each LSPI iteration consists of multiple environment time steps collected under a fixed policy.

Figure 4 shows rapid convergence of LSPI: the cumulative cost drops sharply from \(\sim 3.7\) (initial policy) to \(\sim 1.0\) within 1 iteration, then stabilizes near 1.0 for subsequent iterations. By the 5th iteration, cost variations fall below 0.01, confirming that the learned policy has approached the optimal cost. This trend aligns with the statistical result (mean cost \(= 0.948\)) in Table 4, verifying the algorithm’s ability to quickly approximate the optimal control objective in unknown stochastic dynamics.

Fig. 5

Opinion values of two agents as a function of simulation time step under the converged LSPI policy. The red solid line denotes Agent 1’s opinion; the blue dashed line denotes Agent 2’s opinion; the black dotted line denotes the target opinion \(s = 1.2\).

As shown in Figure 5, both agents’ opinions converge to the target \(s=1.2\) within 10 steps, despite the unknown stochastic dynamics. These time steps represent the inner-loop simulation under the policy learned from LSPI iterations (as detailed in Figure 4). After convergence, opinions fluctuate by less than 0.02 around the target—a variation attributable to system stochasticity rather than policy instability, consistent with the low cost variance in Table 4. This confirms that the learned policy effectively drives the system to the desired state in an unknown environment.

Fig. 6

Evolution of control gains \(K_1\) (red solid line) and \(K_2\) (blue dashed line) as a function of LSPI iteration. The dotted lines denote the stable gain values after convergence (5th iteration).

Figure 6 illustrates the convergence of the control gains: \(K_1\) and \(K_2\) stabilize at 0.562 and 0.710, respectively, within 5 iterations. Both gains rise sharply in the first iteration and then adjust toward their final values. Post-convergence fluctuations remain below \(10^{-6}\), consistent with the chosen threshold \(\epsilon = 10^{-6}\). This confirms that LSPI learns a stable feedback policy without knowledge of the underlying dynamics.

Overall, this experiment validates LSPI as a practical model-free method for stochastic optimal control when system dynamics are completely unknown and time-varying. The consistent convergence, low performance variance across trials, and effective state regulation demonstrate that the algorithm can reliably learn robust control policies from data alone, without any prior model knowledge. This makes it well-suited for real-world applications such as opinion dynamics or other socio-technical systems where precise interaction models are unavailable.

Conclusion

In this study, we address the optimal control problem of opinion dynamics in social systems by establishing a hierarchical control framework that transitions from model-based to data-driven approaches. We explore optimal control solutions across stochastic models and scenarios with unknown system dynamics. A key contribution of this work is the development of a convex optimization-based RL algorithm, leveraging the structural properties of quadratic cost functions. This algorithm achieves global convergence of policy evaluation through least-squares policy iteration, thereby reducing computational complexity. Our research provides a novel methodological foundation for controlling complex social systems.

Future work will be extended along two complementary tracks. On the one hand, we plan to scale our framework to larger and more realistic social networks, incorporating agent heterogeneity and validating its performance on real-world datasets. On the other hand, we aim to expand the theoretical scope of our method by generalizing it to nonlinear opinion dynamics and integrating game-theoretic considerations in multi-agent reinforcement learning settings. Together, these directions will bridge our current model with richer social phenomena and enhance its practical applicability.