Introduction

The consensus optimal control problem of multi-agent systems (MASs) has been a highly active research area due to its wide range of applications, such as satellite scheduling1, wireless sensor networks2, multi-quadrotor formation flight3, robotics4, and vehicle formation control5. The primary objective of optimal control for MASs is to stabilize the system while minimizing both the tracking error and the energy consumed by the designed controller6. Typically, the optimal controller is derived from the Hamilton-Jacobi-Bellman (HJB) equation, which is formulated based on a predefined performance index function7. However, solving the HJB equation is challenging due to its inherent nonlinearity and partial derivatives, making an analytical solution practically infeasible8,9. As an alternative, the policy iteration (PI) algorithm has been widely adopted as an effective method for approximating the solution to the HJB equation10. Instead of solving the equation directly, PI iteratively alternates between policy evaluation and policy improvement until convergence is achieved11. Nevertheless, most PI algorithms12,13,14 require the dynamics of the controlled MASs to be known. It should be pointed out that accurate dynamics of MASs are hard to obtain because of complex structures and aging components15,16. Therefore, developing a PI scheme that operates independently of the system dynamics is a critical and meaningful research direction.

To address the issue mentioned above, neural networks are often introduced to design controllers for systems with unknown or uncertain dynamics17,18,19,20. For example, Ferede et al.21 designed an end-to-end neural network controller for an aggressive high-speed quadcopter, Zishan et al.22 implemented a densely connected neural network to detect arrhythmia on a low-compute device, and Hu et al.23 investigated a neural network-based robust tracking control algorithm for multi-motor driving servo systems. While these approaches effectively reduce control errors for systems with partially unknown dynamics, they often overlook the energy consumption of the designed controllers21,22,23. Motivated by this limitation, this paper aims to integrate a neural network-based approach into the PI framework to achieve optimal control while considering energy efficiency.

Recently, Modares et al.24 studied an integral reinforcement learning (IRL) method to address the challenges above, constructing an actor-critic neural network framework based on the PI algorithm. Building on this foundation, several advanced IRL approaches have been proposed. For example, Shen et al.25 studied an IRL method for nonlinear Markov jump singularly perturbed systems, Lin et al.26 designed a dynamic compensator-based IRL approach for unknown nonaffine nonlinear systems, and Yan et al.27 investigated a graphical game-based IRL bipartite containment scheme for high-order nonlinear MASs. However, most existing IRL methods28,29,30,31 face the risk of being trapped in local optima. While exploration and target strategies32,33,34 have been proposed to mitigate this issue, they are primarily suited for offline learning, which demands significant computational and storage resources. Consequently, a key motivation of this paper is to develop an online learning IRL approach that avoids local optima while maintaining computational efficiency by utilizing historical information over a specific period.

This paper investigates an optimal consensus control problem for MASs with unknown dynamics and proposes a PI algorithm based on online integral reinforcement learning with experience data. The main contributions of this article are listed as follows:

  1. An actor-critic neural network-based PI algorithm is designed for nonlinear MASs. Unlike existing methods35,36, the proposed approach does not require prior knowledge of the system dynamics or rely on a neural network-based identifier to approximate the unknown dynamics, thereby avoiding additional cumulative errors.

  2. An experience-based IRL method is formulated that bridges online and offline learning schemes for nonlinear MASs. Compared with existing methods37,38, the proposed approach selectively utilizes historical information over a specific period, effectively preventing convergence to local optima.

Notations: \(R^n\) and \(R^{n \times m}\) denote the sets of n-dimensional real vectors and \(n \times m\) real matrices, respectively. \(\underline{1} \in R^n\) denotes the n-dimensional vector whose elements are all 1. ||a|| denotes the Euclidean norm of \(a \in R^n\), \({I_n}\) denotes the \(n \times n\) identity matrix, and \(\otimes\) denotes the Kronecker product.

Preliminaries and problem formulations

Here, we introduce fundamental knowledge and notations used in this article and describe the consensus issue of MASs.

Graph theory

Define a directed graph \(\mathscr {G} = \left( {\mathscr {V},\mathscr {E},\mathscr {A}} \right)\), which is composed of a finite nonempty node set \(\mathscr {V} = \{ 1,2,...,N\}\), an edge set \(\mathscr {E} \subseteq \mathscr {V} \times \mathscr {V}\), and a weighted adjacency matrix \(\mathscr {A} = \left[ {{a_{ij}}} \right] \in {\mathscr {R}^{N \times N}}\). If node j can send information to node i, then \((j,i) \in \mathscr {E}\) and \({a_{ij}}> 0\) (otherwise \({a_{ij}} = 0\)), and node j is a neighbor of node i. Define \({\mathscr {N}_i} = \left\{ {{j}:\left( {{j},{i}} \right) \in \mathscr {E}} \right\}\) as the set of all neighbors of node i. Define the in-degree matrix \(\mathscr {D}\, = diag\left( {{d_1},...,{d_N}} \right)\), where \({d_i} = \sum \nolimits _{j \in {\mathscr {N}_i}} {{a_{ij}}}\), and the Laplacian matrix is defined as \(L = \mathscr {D} - \mathscr {A} = [l_{ij}]\).
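For concreteness, the following is a minimal NumPy sketch of these definitions on a hypothetical four-node graph (not the topology used later in Fig. 1):

```python
import numpy as np

# Hypothetical weighted adjacency matrix: a_ij > 0 iff node j can send
# information to node i.
A = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])

D = np.diag(A.sum(axis=1))   # in-degree matrix, d_i = sum_{j in N_i} a_ij
L = D - A                    # Laplacian matrix L = D - A

# Every row of L sums to zero, a basic property used in the consensus analysis.
assert np.allclose(L.sum(axis=1), 0.0)
```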

Consensus of MASs

Consider MASs in the form of a communication network \({\mathscr {G}}\) consisting of N agents, where the dynamics of agent i is given as

$$\begin{aligned} {\dot{x}_i} = f\left( {{x_i}} \right) + g\left( {{x_i}} \right) {u_i},i = 1,...,N \end{aligned}$$
(1)

where \(f\left( {{x_i}} \right) \in {R^n}\) is a partially unknown smooth nonlinear function, and \(g\left( {{x_i}} \right) \in {R^{n \times m}}\) is a partially unknown control effectiveness function satisfying \(||g\left( {{x_i}} \right) ||<\bar{g}\). Moreover, \(f\left( {{x_i}} \right)\) and \(g\left( {{x_i}} \right)\) are Lipschitz continuous. \({x_i} = {x_i}\left( t \right) \in {R^n}\) is the state vector, and \({u_i} = {u_i}\left( t \right) \in {R^m}\) is the control input vector of agent i.

The global network dynamics is

$$\begin{aligned} \dot{x} = f\left( x \right) + g\left( x \right) u \end{aligned}$$
(2)

where \(x = {\left[ {x_1^ \top , \hspace{5.0pt}x_2^ \top , \hspace{5.0pt}..., \hspace{5.0pt}x_N^ \top } \right] ^ \top } \in {R^{Nn}}\), \(f\left( x \right) = {\left[ {{f^ \top }\left( {{x_1}} \right) ,\hspace{5.0pt}{f^ \top }\left( {{x_2}} \right) ,\hspace{5.0pt}...,\hspace{5.0pt}{f^ \top }\left( {{x_N}} \right) } \right] ^ \top } \in {R^{Nn}}\), \(g\left( x \right) = diag\left( {g\left( {{x_i}} \right) } \right) \in {R^{Nn \times Nm}}\) with \(i = 1,2,...,N\), and \(u = {\left[ {u_1^ \top \hspace{5.0pt}u_2^ \top \hspace{5.0pt}...\hspace{5.0pt}u_N^ \top } \right] ^ \top } \in {R^{Nm}}\).

Assumption 1

The system (2) is controllable on a set \(\Omega \subseteq {R^{Nn}}\), which implies that there exists a control policy that can asymptotically stabilize the system39.

Assumption 2

The communication network \({\mathscr {G}}\) contains a directed spanning tree, and all agents have direct or indirect access to the leader agent’s information40.

The leader state vector \({x_0} = {x_0}\left( t \right) \in {R^n}\) satisfies that

$$\begin{aligned} {\dot{x}_0} = k\left( {{x_0}} \right) \end{aligned}$$
(3)

where \(k\left( {{x_0}} \right) \in {R^n}\). The tracking error of agent i is defined as

$$\begin{aligned} {\delta _i} = \sum \limits _{j \in {\mathscr {N}_i}} {{a_{ij}}\left( {{x_i} - {x_j}} \right) + {b_i}\left( {{x_i} - {x_0}} \right) } \end{aligned}$$
(4)

where \({\delta _i} = {\left[ {{\delta _{i1}}\hspace{5.0pt}{\delta _{i2}}\hspace{5.0pt}...\hspace{5.0pt}{\delta _{in}}} \right] ^ \top } \in {R^n}\). Note that \({b_i}> 0\) if and only if the leader agent can send information to agent i; otherwise, \({b_i} = 0\). The global error vector is given as

$$\begin{aligned} \delta = \mathscr {L}\left( {x - \underline{x} _0} \right) \end{aligned}$$
(5)

where \(\mathscr {L} = \left( {L + B} \right) \otimes {I_n}\), \(\underline{x} _0 = \underline{I} {x_0}\) with \(\underline{I} = \underline{1} \otimes {I_n} \in R^{Nn \times n}\), \(\delta = {\left[ {\delta _1^ \top \hspace{5.0pt}\delta _2^ \top \hspace{5.0pt}...\hspace{5.0pt}\delta _N^ \top } \right] ^ \top } \in {R^{Nn}}\), and \(B = diag\left\{ {{b_1},{b_2},...,{b_N}} \right\} \in {R^{N \times N}}\) (\({b_{ii}} = {b_i}\) and \({b_{ij}} = 0,i \ne j\)).
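The sketch below evaluates Eqs. (4)-(5) numerically for hypothetical states and pinning gains, reusing the Laplacian of the sketch above; it only illustrates that the stacked form (5) reproduces the local errors (4).

```python
import numpy as np

N, n = 4, 2
A = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])
L = np.diag(A.sum(axis=1)) - A
B = np.diag([1., 0., 0., 0.])            # only agent 1 is pinned to the leader (hypothetical)

rng = np.random.default_rng(0)
x = rng.normal(size=(N, n))              # follower states x_i
x0 = rng.normal(size=n)                  # leader state x_0

# Local errors, Eq. (4)
delta_local = np.array([
    sum(A[i, j] * (x[i] - x[j]) for j in range(N)) + B[i, i] * (x[i] - x0)
    for i in range(N)])

# Global error, Eq. (5): delta = ((L + B) ⊗ I_n)(x - 1 ⊗ x_0)
delta_global = np.kron(L + B, np.eye(n)) @ (x.reshape(-1) - np.kron(np.ones(N), x0))

assert np.allclose(delta_local.reshape(-1), delta_global)
```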

By differentiating Eq. (4), the dynamics of \({\delta _i}\) is

$$\begin{aligned} {{\dot{\delta }} _i} =&\; \left( {{L_i} + {B_i}} \right) \otimes {I_n}\left( \dot{x} - \dot{\underline{x}}_0 \right) \nonumber \\ =&\; \left( {{L_i} + {B_i}} \right) \otimes {I_n} \left( f\left( x \right) + g\left( x \right) u - \underline{k}(x_0) \right) \nonumber \\ =&\; \left( {{L_i} + {B_i}} \right) \otimes {I_n} \left( f_e\left( x \right) + g\left( x \right) u \right) \nonumber \\ =&\; \sum \limits _{j \in \{\mathscr {N}_i,i\}}\left( \left( l_{ij}+b_{ij} \right) \otimes {I_n}\right) \left( f(x_j)-k(x_0)+g(x_j)u_j\right) \nonumber \\ =&\; {\mathscr {L}_i}{f_e}\left( x \right) + \left( {{d_i} + {b_i}} \right) g\left( {{x_i}} \right) {u_i} - \sum \limits _{j \in {\mathscr {N}_i}} {{a_{ij}}g\left( {{x_j}} \right) {u_j}} \end{aligned}$$
(6)

where \({\mathscr {L}_i} = \left( {{L_i} + {B_i}} \right) \otimes {I_n}\). \({L_i} = \left[ {{l_{i1}}\hspace{5.0pt}...\hspace{5.0pt}{l_{ii}}\hspace{5.0pt}...\hspace{5.0pt}{l_{iN}}} \right]\) and \({B_i} = \left[ {{b_{i1}}\hspace{5.0pt}...\hspace{5.0pt}{b_{ii}}\hspace{5.0pt}...\hspace{5.0pt}{b_{iN}}} \right]\) are denoted as the ith row vector of L and B, respectively. Moreover, \({f_e}\left( x \right) = f\left( x \right) - \underline{k} \left( {{x_0}} \right)\) with \(\underline{k} \left( {{x_0}} \right) = \underline{I} k\left( {{x_0}} \right)\).

Problem statements

This paper aims to address the following problems and challenges.

  1. How to solve the HJB equation for nonlinear MASs. To address this issue, we propose a PI scheme that is independent of the system dynamics.

  2. How to design a controller for MASs with unknown dynamics. To address this problem, we establish an actor-critic neural network-based PI algorithm, in which the system dynamics are no longer needed.

  3. How to prevent the designed IRL method from getting trapped in local optima. To address this drawback, we formulate an experience-based IRL method that selectively utilizes historical information to avoid local optima.

To sum up, the objective of this study is to design a control strategy that ensures the uniform ultimate boundedness (UUB) of the tracking error \({\delta _i}\) for all \(i \in \left\{ 1,..., N \right\}\) without relying on the system model. Specifically, there exist positive constants \(\epsilon\) and \(\bar{t}\) such that, for all initial conditions and under Assumption 1, the tracking error \({\delta _i}\) satisfies the following condition:

$$||\delta _i(t)||\le \epsilon , \forall t>\bar{t}$$

In other words, \({\delta _i}\) will enter and remain within a bounded region after time \(\bar{t}\), thereby achieving the consensus control.

Controller design and convergence analysis

Discount factor-based optimal control policy

Here, a distributed performance function with a discount factor is proposed to guarantee that each agent minimizes its own performance function.

Define the discounted local performance index function of agent i as

$$\begin{aligned} {J_i}\left( {{\delta _i}\left( 0 \right) ,{u_i},{u_{ - i}}} \right) = \int _0^\infty {{e^{ - \alpha v}}r\left( {{\delta _i}\left( v \right) ,{u_i}\left( v \right) ,{u_{ - i}}\left( v \right) } \right) dv} \end{aligned}$$
(7)

where \(r\left( {{\delta _i}\left( v \right) ,{u_i}\left( v \right) ,{u_{ - i}}\left( v \right) } \right) = {r_1}\left( {{\delta _i}} \right) + {r_2}\left( {{u_i},{u_{ - i}}} \right)\) denotes the cost function. \({r_1}\left( {{\delta _i}} \right) = \delta _i^ \top {Q_{ii}}{\delta _i}\) denotes the error cost, and \({r_2}\left( {{u_i},{u_{ - i}}} \right) = u_i^ \top {R_{ii}}{u_i} + \sum \nolimits _{j \in {\mathscr {N}_i}} {u_j^ \top {R_{ij}}{u_j}}\) denotes the control cost. \(\alpha> 0\) is the discount factor. \({Q_{ii}} \geqslant 0\) is a symmetric positive semidefinite matrix, and \({R_{ii}}> 0\) and \({R_{ij}}> 0\) are symmetric positive definite matrices. For brevity, the collection of the neighbors' control inputs \(\{ u_j,\; j \in {\mathscr {N}_i}\}\) is abbreviated as \({u_{ - i}}\).

We need to design a distributed optimal consensus method for each agent that ensures all agents reach consensus with the leader while minimizing the local performance function. In other words, this paper aims to minimize each agent’s local performance function with a designed control input set \(\{ u_1, u_2,..., u_N \}\).

Definition 1

(Global Nash Equilibrium) For all \(i = 1,2,...,N\), a set of control inputs \(\{u_1^ *,u_2^ *,...,u_N^ * \}\) constitutes a global Nash equilibrium if the following inequality is satisfied:

$$\begin{aligned} J_i^* = {J_i}\left( {{\delta _i}\left( {{t_0}} \right) ,u_i^*,u_{-i}^*} \right) \leqslant {J_i}\left( {{\delta _i}\left( {{t_0}} \right) ,u_i,u_{-i}^*} \right) ,\left( u_i \ne u_i^* \right) \end{aligned}$$
(8)

The local discounted value function of each agent i is defined as

$$\begin{aligned} {V_i}\left( {{\delta _i}\left( t \right) } \right) = \int _t^\infty {{e^{ - \alpha \left( {v - t} \right) }}r\left( {{\delta _i}\left( v \right) ,{u_i}\left( v \right) ,{u_{ - i}}\left( v \right) } \right) dv} \end{aligned}$$
(9)

The optimal local discounted value function is defined as

$$\begin{aligned} V_i^ * \left( {{\delta _i}\left( t \right) } \right) = \mathop {\min }\limits _{{u_1},{u_2},...,{u_N}} \int _t^\infty {{e^{ - \alpha \left( {v - t} \right) }}r\left( {{\delta _i}\left( v \right) ,{u_i}\left( v \right) ,{u_{ - i}}\left( v \right) } \right) dv} \end{aligned}$$
(10)

The local coupled Hamiltonian function is constructed as

$$\begin{aligned} {H_i}\left( {{\delta _i},\nabla {V_i},{u_i},{u_{ - i}}} \right)&= \delta _i^ \top {Q_{ii}}{\delta _i} + u_i^ \top {R_{ii}}{u_i} + \sum \limits _{j \in \mathscr {N}_i} {u_j^ \top {R_{ij}}{u_j}} \nonumber \\&\quad \; + \nabla V_i^ \top \left( {{\mathscr {L}_i}{f_e}\left( x \right) + \left( {{d_i} + {b_i}} \right) g\left( {{x_i}} \right) {u_i} - \sum \limits _{j \in {\mathscr {N}_i}} {{a_{ij}}g\left( {{x_j}} \right) {u_j}} } \right) \end{aligned}$$
(11)

The gradient of the value function \({V_i}\) with respect to \({\delta _i}\left( t \right)\) is denoted by \(\nabla {V_i}\). \(V_i^ * \left( {{\delta _i}} \right)\) denotes the local optimal value function that satisfies

$$\begin{aligned} \mathop {\min }\limits _{{u_i}} {H_i}\left( {{\delta _i},\nabla V_i^ *,{u_i},{u_{ - i}}} \right) = 0 \end{aligned}$$
(12)

The optimal control policy minimizes the Hamiltonian in Eq. (12) and is obtained from the stationarity condition

$$\begin{aligned} \frac{{\partial {H_i}}}{{\partial {u_i}}} = 0 \rightarrow u_i^ * = - \frac{1}{2}\left( {{d_i} + {b_i}} \right) R_{ii}^{ - 1}{g^ \top }\left( {{x_i}} \right) \nabla V_i^ * \end{aligned}$$
(13)
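For completeness, the stationarity condition behind Eq. (13) follows directly from differentiating the Hamiltonian (11) with respect to \(u_i\); only the quadratic control cost and the input term of the error dynamics depend on \(u_i\):

$$\begin{aligned} \frac{\partial {H_i}}{{\partial {u_i}}} = 2{R_{ii}}{u_i} + \left( {{d_i} + {b_i}} \right) {g^ \top }\left( {{x_i}} \right) \nabla V_i^ * = 0 \end{aligned}$$

Solving for \(u_i\) and evaluating at the optimum yields the policy in Eq. (13).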

An IRL-based PI algorithm

Substituting Eq. (13) into Eq. (12), we have

$$\begin{aligned} \delta _i^ \top {Q_{ii}}\delta _i&+ \frac{1}{4}{\left( {{d_i} + {b_i}} \right) ^2}{\left( {\nabla V_i^ * } \right) ^ \top }g\left( {{x_i}} \right) R_{ii}^{ - 1}g{\left( {{x_i}} \right) ^ \top }\nabla V_i^ * \nonumber \\&+ \frac{1}{4}\sum \limits _{j \in {\mathscr {N}_i}} {{{\left( {{d_j} + {b_j}} \right) }^2}{{\left( {\nabla V_j^ * } \right) }^ \top }g\left( {{x_j}} \right) R_{jj}^{ - 1}{R_{ij}}R_{jj}^{ - 1}g{{\left( {{x_j}} \right) }^ \top }\nabla V_j^ * } \nonumber \\&+ {\left( {\nabla V_i^ * } \right) ^ \top }\left( {{\mathscr {L}_i}{f_e}\left( x \right) - \frac{1}{2}{{\left( {{d_i} + {b_i}} \right) }^2}g\left( {{x_i}} \right) R_{ii}^{ - 1}g{{\left( {{x_i}} \right) }^ \top }\nabla V_i^ * } \right) \nonumber \\&+ {\left( {\nabla V_i^ * } \right) ^ \top }\left( {\frac{1}{2}\sum \limits _{j \in {\mathscr {N}_i}} {{a_{ij}}\left( {{d_j} + {b_j}} \right) g\left( {{x_j}} \right) R_{jj}^{ - 1}g{{\left( {{x_j}} \right) }^ \top }\nabla V_j^ * } } \right) = 0 \end{aligned}$$
(14)

Solving Eq. (14) yields the optimal control policy of the agents. However, since Eq. (14) contains the system dynamics, which are unknown, its solution is difficult to obtain. A common approach to this problem is the PI algorithm, which approximates the solution through successive iterations. The PI algorithm consists of two steps: 1) policy evaluation and 2) policy improvement. In the policy evaluation step, according to Eqs. (11) and (12), the given control policy \(u_i^{(k)}(t)\) is evaluated by

$$\begin{aligned} 0&= {r_i}\left( {{\delta _i},u_i^{\left( k \right) },u_j^{\left( k \right) }} \right) + {\left( {\nabla V_i^{\left( k \right) }} \right) ^ \top }\nonumber \\&\quad \;\times \left[ {{{{\mathscr {L}}}_i}{f_e}\left( x \right) + \left( {{d_i} + {b_i}} \right) g\left( {{x_i}} \right) u_i^{\left( k \right) } - \sum \limits _{j \in {{{\mathscr {N}}}_i}} {{a_{ij}}g\left( {{x_j}} \right) u_j^{\left( k \right) }} } \right] \end{aligned}$$
(15)

and updates it by

$$\begin{aligned} u_i^{\left( {k + 1} \right) }\left( t \right) = - \frac{1}{2}\left( d_i+b_i\right) R_{ii}^{ - 1}{g^ \top }\left( {{x_i}} \right) \nabla V_i^{\left( k \right) } \end{aligned}$$
(16)

From Eqs. (15) and (16), it is found that both incorporate the dynamics of the controlled system. To remove the dependence on the system dynamics, an IRL formulation with an integration interval T is introduced into Eq. (9), which yields

$$\begin{aligned} {V_i}\left( {{\delta _i}\left( t \right) } \right) = \int _{t}^{t+T} {{e^{ - \alpha \left( {v - t} \right) }}r\left( {{\delta _i}\left( v \right) ,{u_i}\left( v \right) ,{u_{ - i}}\left( v \right) } \right) dv} + e^{-\alpha T}{V_i}\left( {{\delta _i}\left( t+T \right) } \right) \end{aligned}$$
(17)

where the controlled system’s dynamics are not contained.

Then, an IRL-based PI algorithm is designed as Algorithm 1.

Algorithm 1

IRL-Based Policy Iteration

Because Algorithm 1 involves an integration interval T, the control inputs are updated discontinuously. Thus, the designed IRL-based PI scheme can be regarded as a time-triggered control strategy, where T represents a special sampling period.
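To make the policy-evaluation step of Algorithm 1 concrete, the following is a minimal single-agent sketch of how the IRL relation (17) evaluates a fixed control policy from recorded data alone, assuming a linear-in-parameters value approximation \(\hat{V}_i(\delta _i) = W^\top \phi (\delta _i)\) (as in the critic network introduced later). Stacking Eq. (17) over several intervals of length T gives a least-squares problem in W in which neither f nor g appears; the trajectory segments and rewards below are synthetic placeholders, and the exact iteration of Algorithm 1 (Eqs. (18)-(19)) is not reproduced.

```python
import numpy as np

alpha, T = 0.1, 0.05                     # discount factor and integration interval

def phi(d):                              # hypothetical value-function basis
    d1, d2 = d
    return np.array([d1 * d1, d1 * d2, d2 * d2])

rng = np.random.default_rng(0)
K = 50                                   # number of recorded intervals
delta_t  = rng.normal(size=(K, 2))                         # delta_i(t), placeholder data
delta_tT = 0.9 * delta_t + 0.01 * rng.normal(size=(K, 2))  # delta_i(t+T), placeholder data
rho = np.array([d @ d * T for d in delta_t])               # discounted integral rewards (placeholder)

# Eq. (17): W^T phi(delta(t)) = rho + e^{-alpha*T} W^T phi(delta(t+T)), i.e.
# (phi(delta(t)) - e^{-alpha*T} phi(delta(t+T)))^T W = rho for every interval.
Phi = np.array([phi(a) - np.exp(-alpha * T) * phi(b)
                for a, b in zip(delta_t, delta_tT)])
W, *_ = np.linalg.lstsq(Phi, rho, rcond=None)              # policy-evaluation step
```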

Remark 1

Compared with common PI, the designed IRL-based PI eliminates the requirement for system dynamics. Moreover, according to the analysis in the work of Lewis et al.42, Eqs. (15) and (18) are equivalent. However, this also introduces a new problem: how to solve Eq. (19), which is addressed below.

Model-free distributed consensus control algorithm

Here, we construct a critic neural network and an actor neural network for Eqs. (18) and (19), respectively. Moreover, the weight tuning laws are designed, and their convergence is analyzed.

According to the universal approximation theorem43, neural networks can approximate smooth functions on a compact set. Thus, we construct neural networks to estimate the objective functions, where each established neural network consists of three layers: 1) the input layer, 2) the hidden layer, and 3) the output layer. The input-to-hidden weights are fixed at 1 and are not tuned. The hidden layer contains an activation function, and the weights connecting the hidden layer to the output are tuned to minimize the approximation error. Eqs. (18) and (19) are approximated as

$$\begin{aligned} & {\hat{V}}_i^{(k)}\left( {\delta _i}\left( t \right) \right) = {\left( {{\hat{W}}_{ci}^{(k),(p)}} \right) ^ \top }{\phi _{ci}}\left( {{z_{ci}}\left( t \right) } \right) \end{aligned}$$
(20)
$$\begin{aligned} & {\hat{u}}_i^{(k)}\left( t \right) = {\left( {{\hat{W}}_{ai}^{(k),(p)}} \right) ^ \top }{\psi _{ai}}\left( {{z_{ai}}\left( t \right) } \right) \ \end{aligned}$$
(21)

where \({\phi _{ci}}\left( {{z_{ci}}\left( t \right) } \right) \in {R^{{h_{vi}}}}\) and \({\psi _{ai}}\left( {{z_{ai}}\left( t \right) } \right) \in {R^{{h_{di}}}}\) denote the activation function vectors, and \({h_{vi}}\) and \({h_{di}}\) are the numbers of neurons. \({\hat{W}}_{ci}^{(k)}\) and \({\hat{W}}_{ai}^{(k)}\) denote the weight vectors. \({z_{ci}}\left( t \right)\) is a vector composed of the information from \({\delta _i}\left( t \right)\), \(u_i^{(k)}\left( t \right)\), and \(u_{ - i}^{(k)}\left( t \right)\), and \({z_{ai}}\left( t \right)\) is a vector composed of the information from \({\delta _i}\left( t \right)\).
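A minimal sketch of the approximators (20)-(21) with the three-layer structure described above: unit (untuned) input-to-hidden weights, sigmoid activations with distinct fixed biases, and tunable hidden-to-output weights. The layer sizes, biases, and inputs are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical activation vectors: each hidden neuron applies a sigmoid to the
# unit-weighted sum of its inputs plus a distinct fixed bias, so no input-side
# weights need to be trained.
def phi_c(z_c, h_v=15):
    return sigmoid(np.sum(z_c) + np.linspace(-2.0, 2.0, h_v))

def psi_a(z_a, h_d=18):
    return sigmoid(np.sum(z_a) + np.linspace(-2.0, 2.0, h_d))

m = 1                                        # control dimension (illustrative)
W_c = 0.1 * np.random.randn(15)              # tunable critic output weights
W_a = 0.1 * np.random.randn(18, m)           # tunable actor output weights

delta_i = np.array([0.3, -0.1])              # placeholder local tracking error
u_i, u_neighbors = np.array([0.05]), np.array([0.02, -0.04])
z_ci = np.concatenate([delta_i, u_i, u_neighbors])   # critic input: delta_i, u_i, u_{-i}
z_ai = delta_i                                        # actor input: delta_i

V_hat = W_c @ phi_c(z_ci)                    # Eq. (20): scalar value estimate
u_hat = W_a.T @ psi_a(z_ai)                  # Eq. (21): control estimate
```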

Based on Eq. (18), the approximation error of the critic neural network is defined as

$$\begin{aligned} e_{ci}^{(k)}\left( t \right)&= \int _t^{t + T} {{e^{ - \alpha \left( {v - t} \right) }}{r_i}\left( {{\delta _i}\left( v \right) ,u_i^{(k)}\left( v \right) ,u_{ - i}^{(k)}\left( v \right) } \right) } dv \nonumber \\&\quad \; + {e^{ - \alpha T}}{\hat{V}}_i^{(k)}\left( {t + T} \right) - {\hat{V}}_i^{(k)}\left( t \right) \end{aligned}$$
(22)

To minimize the squared error \(E_{ci}^{(k)}\left( t \right) = \left( {1/2} \right) {\left( {e_{ci}^{(k)}\left( t \right) } \right) ^ \top }e_{ci}^{(k)}\left( t \right)\), a gradient-based update rule with history data for the critic neural network weights of agent i is derived as

$$\begin{aligned} {\hat{W}}_{ci}^{(k + 1)} =&\; {\hat{W}}_{ci}^{(k)} - {\beta _{ci}}\int _{t - s}^t {\frac{{{e^{v - t + s}}}}{{\int _{t - s}^t {{e^{v - t + s}}dv} }}\left[ {\frac{{\partial E_{ci}^{(k)}\left( v \right) }}{{\partial {\hat{W}}_{ci}^{(k)}}}} \right] dv} \nonumber \\ =&\; {\hat{W}}_{ci}^{(k)} - {\beta _{ci}}\int _{t - s}^t {\frac{{{e^{v - t + s}}}}{{{e^s} - 1}}\frac{{\partial E_{ci}^{(k)}\left( v \right) }}{{\partial e_{ci}^{(k)}\left( v \right) }}\frac{{\partial e_{ci}^{(k)}\left( v \right) }}{{\partial {\hat{V}}_i^{(k)}\left( v \right) }}\frac{{\partial {\hat{V}}_i^{(k)}\left( v \right) }}{{\partial {\hat{W}}_{ci}^{(k)}}}dv} \nonumber \\ =&\; {\hat{W}}_{ci}^{(k)} - {\beta _{ci}}\int _{t - s}^t {\frac{{{e^{v - t + s}}}}{{{e^s} - 1}}{e^{ - \alpha T}}e_{ci}^{(k)}\left( v \right) {\phi _{ci}}\left( {{z_{ci}}\left( v \right) } \right) dv} \end{aligned}$$
(23)

where \({\beta _{ci}}> 0\) is the learning rate, and s denotes the length of the history window whose data are used to update the weights.
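A discretized sketch of the history-weighted critic update (23), assuming the window \([t-s, t]\) is sampled every dt seconds and that the required quantities (the errors \(e_{ci}^{(k)}(v)\) of Eq. (22) and the activations \(\phi _{ci}(z_{ci}(v))\)) have been stored in buffers; all array shapes and the placeholder data in the usage lines are assumptions.

```python
import numpy as np

def critic_update(W_c, e_c_hist, phi_hist, beta_c, s, dt, alpha, T):
    """One sweep of the history-weighted critic update, Eq. (23).

    W_c      : (h_v,)    current critic output weights
    e_c_hist : (K,)      errors e_ci(v) of Eq. (22) over the window [t-s, t]
    phi_hist : (K, h_v)  activations phi_ci(z_ci(v)) over the same window
    """
    K = e_c_hist.shape[0]
    v_minus_t = np.arange(K) * dt - s              # v - t for samples in [t-s, t]
    w = np.exp(v_minus_t + s) / (np.exp(s) - 1.0)  # e^{v-t+s} / (e^s - 1)
    # discretized integral of  w(v) * e^{-alpha*T} * e_ci(v) * phi_ci(z_ci(v)) dv
    grad = (w * np.exp(-alpha * T) * e_c_hist) @ phi_hist * dt
    return W_c - beta_c * grad

# Usage with placeholder buffers: s = 5 s of history sampled at dt = 0.001 s.
rng = np.random.default_rng(1)
K = 5000
W_c = critic_update(rng.normal(size=15), rng.normal(size=K), rng.normal(size=(K, 15)),
                    beta_c=0.001, s=5.0, dt=0.001, alpha=0.1, T=0.001)
```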

The approximation error of the actor neural network is defined as

$$\begin{aligned} e_{ai}^{(k)}\left( t \right) = {\hat{V}}_i^{(k)}\left( t \right) \end{aligned}$$
(24)

Define the square error as \(E_{ai}^{(k)} = \left( {1/2} \right) {\left( {e_{ai}^{(k)}} \right) ^ \top }e_{ai}^{(k)}\), and the update rule for the actor neural network weights of agent i is derived as

$$\begin{aligned} {\hat{W}}_{ai}^{(k + 1)}=&\; {\hat{W}}_{ai}^{(k)} - {\beta _{ai}}\int _{t - s}^t {\frac{{{e^{v - t + s}}}}{{\int _{t - s}^t {{e^{v - t + s}}dv} }}\left[ {\frac{{\partial E_{ai}^{(k)}\left( v \right) }}{{\partial {\hat{W}}_{ai}^{(k)}}}} \right] dv} \nonumber \\ =&\; {\hat{W}}_{ai}^{(k)} - {\beta _{ai}}\int _{t - s}^t {\frac{{{e^{v - t + s}}}}{{{e^s} - 1}}\frac{{\partial E_{ai}^{(k)}\left( v \right) }}{{\partial e_{ai}^{(k)}\left( v \right) }}\frac{{\partial e_{ai}^{(k)}\left( v \right) }}{{\partial {\hat{V}}_i^{(k)}\left( v \right) }}} \nonumber \\&\times \frac{{\partial {\hat{V}}_i^{(k)}\left( v \right) }}{{\partial {\phi _{ci}}\left( {{z_{ci}}\left( v \right) } \right) }}\frac{{\partial {\phi _{ci}}\left( {{z_{ci}}\left( v \right) } \right) }}{{\partial {z_{ci}}\left( v \right) }}\frac{{\partial {z_{ci}}\left( v \right) }}{{\partial {\hat{u}}_i^{(k)}\left( v \right) }}\frac{{\partial {\hat{u}}_i^{(k)}\left( v \right) }}{{\partial {\hat{W}}_{ai}^{(k)}}}dv \nonumber \\ =&\; {\hat{W}}_{ai}^{(k)} - {\beta _{ai}}\int _{t - s}^t {\frac{{{e^{v - t + s}}}}{{{e^s} - 1}}{\psi _{ai}}\left( {{z_{ai}}\left( v \right) } \right) {{\left( {{\hat{W}}_{ci}^{(k)}} \right) }^ \top }} \nonumber \\&\times {{\phi '}_{ci}}\left( {{z_{ci}}\left( v \right) } \right) {Z_i}{\left[ {{{\left( {{\hat{W}}_{ci}^{(k)}} \right) }^ \top }{\phi _{ci}}\left( {{z_{ci}}\left( v \right) } \right) } \right] ^ \top }dv \end{aligned}$$
(25)

where \({\beta _{ai}}> 0\) is the learning rate, \({Z_i} = \partial {z_{ci}}\left( t \right) /\partial {\hat{u}}_i^{(k)}\left( t \right)\), and \({\phi '_{ci}}\left( {{z_{ci}}\left( t \right) } \right) = \partial {\phi _{ci}}\left( {{z_{ci}}\left( t \right) } \right) /\partial {z_{ci}}\left( t \right)\).
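The actor update (25) follows the same weighted-history pattern; below is a minimal sketch for a scalar control input (m = 1), again with hypothetical buffer shapes. For m = 1, the quantity \(({\hat{W}}_{ci}^{(k)})^\top {\phi '_{ci}} {Z_i}\) reduces to a scalar derivative of the critic output with respect to the control.

```python
import numpy as np

def actor_update(W_a, W_c, psi_hist, phi_hist, dphi_dz_hist, dz_du_hist, beta_a, s, dt):
    """One sweep of the history-weighted actor update, Eq. (25), for m = 1.

    W_a          : (h_d,)        actor output weights
    W_c          : (h_v,)        critic output weights (held fixed here)
    psi_hist     : (K, h_d)      activations psi_ai(z_ai(v)) over [t-s, t]
    phi_hist     : (K, h_v)      activations phi_ci(z_ci(v))
    dphi_dz_hist : (K, h_v, n_z) gradients phi'_ci(z_ci(v))
    dz_du_hist   : (K, n_z)      Z_i = d z_ci / d u_i (scalar control input)
    """
    K = psi_hist.shape[0]
    v_minus_t = np.arange(K) * dt - s
    w = np.exp(v_minus_t + s) / (np.exp(s) - 1.0)        # same exponential weighting
    grad = np.zeros_like(W_a)
    for k in range(K):
        V_hat = W_c @ phi_hist[k]                        # critic estimate (scalar)
        dV_du = W_c @ dphi_dz_hist[k] @ dz_du_hist[k]    # dV/du (scalar for m = 1)
        grad += w[k] * psi_hist[k] * dV_du * V_hat * dt  # integrand of Eq. (25)
    return W_a - beta_a * grad
```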

Combining the critic-actor neural network framework with Algorithm 1 yields Algorithm 2.

Algorithm 2

Model-free Distributed Consensus Control Algorithm.

Remark 2

Notably, this weight update rule utilizes historical data, and a replay buffer is established to store these data, similar to references37,38. However, those update rules do not assign different ratios to different periods of historical data. This paper introduces the exponential term for importance sampling, so that the weight assigned to a historical instant \(s^{\prime }\) \((t-s<s^{\prime }<t)\) decreases with its age \(t-s^{\prime }\) relative to the current moment t. The reason for this importance sampling is that control inputs and states closer to the current moment have a more substantial influence on the current state. This weight update rule prevents the critic neural network from making inaccurate approximations of the past states of the agents, improves the control policy estimated by the actor neural network for real-time control, and accelerates the convergence of the weights.
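A small numerical check of the weighting described above, assuming the history window \([t-s, t]\) is discretized with step dt: the normalized exponential weights decay with sample age and integrate to approximately one over the window.

```python
import numpy as np

s, dt = 5.0, 0.001
age = np.arange(0.0, s, dt)                # age t - v of each stored sample
w = np.exp(s - age) / (np.exp(s) - 1.0)    # e^{v-t+s} / (e^s - 1) with v = t - age

print(np.isclose(w.sum() * dt, 1.0, atol=1e-2))   # weights integrate to ~1 over [t-s, t]
print(w[0] / w[-1])                               # newest sample is weighted ~e^s times the oldest
```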

Convergence analysis

Theorem 1

Let the update rules for the critic and actor neural network weights be as in Eqs. (23) and (25). Define the weight estimation errors \({\tilde{W}}_{ci}^{(k)} = {\hat{W}}_{ci}^{(k)} - W_{ci}^ *\) and \({\tilde{W}}_{ai}^{(k)} = {\hat{W}}_{ai}^{(k)} - W_{ai}^ *\). Then, \({\tilde{W}}_{ci}^{(k)}\), \({\tilde{W}}_{ai}^{(k)}\), and \(\delta _i\) are UUB as \({k} \rightarrow \infty\), provided that there exists a scalar \(W> 0\) satisfying

$$\begin{aligned} 2\beta _i^ \top \int _{t - s}^t {{\rho _i}\left( v \right) {\eta _i}\left( v \right) dv}>&\; {\left\| {\int _{t - s}^t {\beta _i^ \top {\rho _i}\left( v \right) {\lambda _i}\left( v \right) dv} } \right\| ^2} + \beta _i^ \top \int _{t - s}^t {{\rho _i}\left( v \right) {\lambda _i}\left( v \right) dv} \,\underline{W} \end{aligned}$$
(26)

Proof

By Eqs. (23) and (25), \({\tilde{W}}_{ci}^{(k)}\) and \({\tilde{W}}_{ai}^{(k)}\) are rewritten as

$$\begin{aligned} & {\tilde{W}}_{ci}^{(k + 1)} = {\tilde{W}}_{ci}^{(k)} - {\beta _{ci}}\int _{t - s}^t {{\rho _{ci}}\left( v \right) e_{ci}^{(k)}\left( v \right) dv} \end{aligned}$$
(27)
$$\begin{aligned} & {\tilde{W}}_{ai}^{(k + 1)} = {\tilde{W}}_{ai}^{(k)} - {\beta _{ai}}\int _{t - s}^t {{\rho _{ai}}\left( v \right) {\hat{V}}_i^{(k)}\left( v \right) dv} \end{aligned}$$
(28)

where

$$\begin{aligned} & {\rho _{ci}}\left( v \right) = \frac{e^{v - t + s}}{e^s - 1} {e^{ - \alpha T}}{\phi _{ci}}\left( {{z_{ci}}\left( v \right) } \right) \\ & {\rho _{ai}}\left( v \right) = \frac{e^{v - t + s}}{e^s - 1} {\psi _{ai}}\left( {{z_{ai}}\left( v \right) } \right) {\left( {{\hat{W}}_{ci}^{(k)}} \right) ^ \top }{\phi '_{ci}}\left( {{z_{ci}}\left( v \right) } \right) {Z_i} \end{aligned}$$

Choose a Lyapunov function as

$$\begin{aligned} {L_i}\left( k \right) = {L_{i,1}}\left( k \right) + {L_{i,2}}\left( k \right) +{L_{i,3}}\left( k \right) \end{aligned}$$
(29)

where \({L_{i,1}}\left( k \right) = {\left( {{\tilde{W}}_{ci}^{(k)}} \right) ^ \top }{\tilde{W}}_{ci}^{(k)}\), \({L_{i,2}}\left( k \right) = {\left( {{\tilde{W}}_{ai}^{(k)}} \right) ^ \top }{\tilde{W}}_{ai}^{(k)}\), and \({L_{i,3}}\left( k \right) = \delta _i^\top \delta _i + \Theta _iV_i(\delta _i(t))\).

The gradient of \({L_{i,1}}\left( k \right)\) is obtained as

$$\begin{aligned} \nabla {L_{i,1}}\left( k \right) =&{\left( {{\tilde{W}}_{ci}^{(k + 1)}} \right) ^\top }{\tilde{W}}_{ci}^{(k + 1)} - {\left( {{\tilde{W}}_{ci}^{(k)}} \right) ^\top }{\tilde{W}}_{ci}^{(k)} \nonumber \\ =&- 2{\beta _{ci}}\int _{t - s}^t { \left( {\rho _{ci}}\left( v \right) e_{ci}^{(k)}\left( v \right) \right) ^\top dv} \left( {{\hat{W}}_{ci}^{(k)} - {W_{ci}^{(k)}} + {W_{ci}^{(k)}} - W_{ci}^ * } \right) + \beta _{ci}^2{\left\| {\int _{t - s}^t {{\rho _{ci}}\left( v \right) e_{ci}^{(k)}\left( v \right) dv} } \right\| ^2} \end{aligned}$$
(30)

where \({W_{ci}^{(k)}}\) satisfies

$$\begin{aligned} \left( {{\phi _{ci}}\left( {{z_{ci}}\left( t \right) } \right) - {e^{ - \alpha T}}{\phi _{ci}}\left( {{z_{ci}}\left( {t + T} \right) } \right) } \right) ^\top W_{ci}^{(k)} = \int _t^{t + T} {{e^{ - \alpha \left( {v - t} \right) }}{r_i}\left( {{\delta _i}\left( v \right) ,u_i^{(k)}\left( v \right) ,u_{ - i}^{(k)}\left( v \right) } \right) } dv \end{aligned}$$
(31)

Because \(W_{ci}^{(k)}\) is bounded, there exists a scalar \({W_1}> 0\) which satisfies \(\left\| {W_{ci}^{(k)} - W_{ci}^ * } \right\| ^2 \leqslant {W_1}\). Substituting Eq. (31) into Eq. (22), one has

$$\begin{aligned} e_{ci}^{(k)}\left( t \right) = \left( {{e^{ - \alpha T}}{\phi _{ci}}\left( {{z_{ci}}\left( {t + T} \right) } \right) - {\phi _{ci}}\left( {{z_{ci}}\left( t \right) } \right) } \right) ^\top \left( {{\hat{W}}_{ci}^{(k)} - W_{ci}^{(k)}} \right) \end{aligned}$$
(32)

According to Eq. (32), Eq. (30) is rewritten as

$$\begin{aligned} \nabla {L_{i,1}}\left( k \right)&=\; {\beta _{ci}}\left( {{\beta _{ci}}{{\left\| {\int _{t - s}^t {{\rho _{ci}}\left( v \right) e_{ci}^{(k)}\left( v \right) dv} } \right\| }^2} - 2\int _{t - s}^t {{\rho ^\top _{ci}}\left( v \right) {\eta _{ci}}\left( v \right) dv} } \right) \nonumber \\&+ \beta _{ci}^{}\int _{t - s}^t {\left( {\rho _{ci}}\left( v \right) e_{ci}^{(k)}\left( v \right) \right) ^\top dv} \left( {W_{ci}^ * - W_{ci}^{(k)}} \right) \nonumber \\ \leqslant&\; {\beta _{ci}}\left( {{\beta _{ci}}{{\left\| {\int _{t - s}^t {{\rho _{ci}}\left( v \right) e_{ci}^{(k)}\left( v \right) dv} } \right\| }^2} - 2\int _{t - s}^t {{\rho ^\top _{ci}}\left( v \right) {\eta _{ci}}\left( v \right) dv} } \right) + \beta _{ci}^{}\int _{t - s}^t {\left( {\rho _{ci}}\left( v \right) e_{ci}^{(k)}\left( v \right) \right) ^\top dv} \underline{W_1} \end{aligned}$$
(33)

where \({\eta _{ci}} = {{e^{ - \alpha T}}{\phi _{ci}}\left( {{z_{ci}}\left( {v + T} \right) } \right) - {\phi _{ci}}\left( {{z_{ci}}\left( v \right) } \right) }\), and \(\underline{W_1} = \underline{1}\otimes {W_1}\in R^{h_{vi}}\).

The gradient of \({L_{i,2}}\left( k \right)\) is given as

$$\begin{aligned} \nabla {L_{i,2}}\left( k \right) =&\; {\left( {{\tilde{W}}_{ai}^{(k + 1)}} \right) ^\top }{\tilde{W}}_{ai}^{(k + 1)} - {\left( {{\tilde{W}}_{ai}^{(k)}} \right) ^\top }{\tilde{W}}_{ai}^{(k)} \nonumber \\ =&\; - 2{\beta _{ai}}\int _{t - s}^t { \left( {\rho _{ai}}\left( v \right) {\hat{W}}_{ai}^{(k)}\left( v \right) \right) ^\top dv} \left( {{\hat{W}}_{ai}^{(k)} - {W_{ai}^{(k)}} + {W_{ai}^{(k)}} - W_{ai}^ * } \right) + \beta _{ai}^2{\left\| {\int _{t - s}^t {{\rho _{ai}}\left( v \right) {\hat{W}}_{ai}^{(k)}\left( v \right) dv} } \right\| ^2} \end{aligned}$$
(34)

where \({W_{ai}^{(k)}}\) makes the following equation hold

$$\begin{aligned} e_{ci}^{(k)}\left( t \right) - {e^{ - \alpha T}}{\hat{V}}_i^{(k)}\left( {t + T} \right) =&\;\int _t^{t + T} {{e^{ - \alpha \left( {v - t} \right) }}\left( {{r_1}\left( {{\delta _i}\left( v \right) } \right) + \sum \limits _{j \in {\mathscr {N}_i}} {{{\left( {u_j^{(k)}} \right) }^\top }{R_{ij}}u_j^{(k)}} } \right) } dv \nonumber \\&+ \int _t^{t + T} {{e^{ - \alpha \left( {v - t} \right) }}{{\left( {u_i^{(k)}} \right) }^\top }{R_{ii}}\psi _{ai}^\top \left( {{z_{ai}}\left( v \right) } \right) } dv W_{ai}^{(k)} \end{aligned}$$
(35)

Since \({\hat{u}}_i^{(k)}\) is bounded, there exists a scalar \({W_2}> 0\) which satisfies \(\left\| {W_{ai}^{(k)} - W_{ai}^ * } \right\| ^2 \leqslant {W_2}\).

Combining Eqs. (22) and (35), one has

$$\begin{aligned} {\hat{V}}_i^{(k)}\left( t \right) = \int _t^{t + T} {{e^{ - \alpha \left( {v - t} \right) }}{{\left( {u_i^{(k)}} \right) }^\top }{R_{ii}}\psi _{ai}^\top \left( {{z_{ai}}\left( v \right) } \right) dv} \left( {{\hat{W}}_{ai}^{(k)} - W_{ai}^{(k)}} \right) \end{aligned}$$
(36)

Substituting Eq. (36) into Eq. (34), one has

$$\begin{aligned} \nabla {L_{i,2}}\left( k \right) =&\; {\beta _{ai}}\left( {{\beta _{ai}}{{\left\| {\int _{t - s}^t {{\rho _{ai}}\left( v \right) e_{ai}^{(k)}\left( v \right) dv} } \right\| }^2} - 2\int _{t - s}^t {{\rho _{ai}^\top }\left( v \right) {\eta _{ai}}\left( v \right) dv} } \right) \nonumber \\&+ \beta _{ai}^{}\int _{t - s}^t {\left( {\rho _{ai}}\left( v \right) e_{ai}^{(k)}\left( v \right) \right) ^\top dv} \left( {W_{ai}^ * - W_{ai}^{(k)}} \right) \nonumber \\ \leqslant&\; {\beta _{ai}}\left( {{\beta _{ai}}{{\left\| {\int _{t - s}^t {{\rho _{ai}}\left( v \right) e_{ai}^{(k)}\left( v \right) dv} } \right\| }^2} - 2\int _{t - s}^t {{\rho _{ai}^\top }\left( v \right) {\eta _{ai}}\left( v \right) dv} } \right) + \beta _{ai}^{}\int _{t - s}^t {\left( {\rho _{ai}}\left( v \right) e_{ai}^{(k)}\left( v \right) \right) ^\top dv} \underline{W_2} \end{aligned}$$
(37)

where \({\eta _{ai}}\left( v \right) = \int _t^{t + T} {{e^{ - \alpha \left( {v - t} \right) }}{{\left( {u_i^{(k)}} \right) }}{R_{ii}}\psi _{ai}\left( {{z_{ai}}\left( v \right) } \right) } dv\), and \(\underline{W_2} = \underline{1}\otimes {W_2}\in R^{h_{di}}\).

In addition, the gradient of \({L_{i,3}}\left( k \right)\) is derived as follows:

$$\begin{aligned} \nabla {L_{i,3}}\left( k \right) =&\; 2\delta _i^\top {\dot{\delta }}_i+2\Theta _i \dot{V}_i(\delta _i)\nonumber \\ =&\; 2\delta _i^\top {\mathscr {L}_i}\left( f_e(x)+g(x)u \right) -2\Theta _i e^{-\alpha T}\left( \delta _i^ \top {Q_{ii}}{\delta _i} + u_i^ \top {R_{ii}}{u_i} + \sum \limits _{j \in {\mathscr {N}_i},i} {u_j^ \top {R_{ij}}{u_j}} \right) \nonumber \\ =&\;2\delta _i^\top {\mathscr {L}_i}f_e(x) + 2\delta _i^\top {\mathscr {L}_i}\left( \left( {{d_i} + {b_i}} \right) g\left( {{x_i}} \right) {u_i} - \sum \limits _{j \in {\mathscr {N}_i}} {{a_{ij}}g\left( {{x_j}} \right) {u_j}} \right) - 2 \Theta _i e^{-\alpha T}\left( \delta _i^ \top {Q_{ii}}{\delta _i}+\sum \limits _{j \in \{\mathscr {N}_i,i\}} {u_j^ \top {R_{ij}}{u_j}} \right) \nonumber \\ \le&\; ||{\mathscr {L}_i}f_e(x)||^2 + ||{\mathscr {L}_i}\left( d_i+b_i-\sum \limits _{j \in {\mathscr {N}_i}}a_{ij} \right) \bar{g}_i ||^2-2\Theta _i \lambda _{\text {min}}(Q_{ii}) ||\delta _i||^2 - 2\Theta _i\lambda _{\text {min}}(R_{ij})||u_i||^2 \end{aligned}$$
(38)

If \(\Theta _i\) satisfies

$$\Theta _i> \text {max}\left\{ \frac{||{\mathscr {L}_i}f_e(x)||^2}{2\lambda _{\text {min}}(Q_{ii})}, \frac{\left\| {\mathscr {L}_i}\left( d_i+b_i-\sum \limits _{j \in {\mathscr {N}_i}}a_{ij} \right) \bar{g}_i \right\| ^2}{2\lambda _{\text {min}}(R_{ij})} \right\}$$

and the inequality \(||\delta _i||^2> \frac{||{\mathscr {L}_i}f_e(x)||^2}{2\Theta _i\lambda _{\text {min}}(Q_{ii})}\) holds, one has \(\nabla {L_{i,3}}\left( k \right) <0\).

Based on Eqs. (33), (37), and (38), one has

$$\begin{aligned} \nabla {L_i}\left( k \right) <&\; {\left\| {\int _{t - s}^t {\beta _i^{\top } {\rho _i ^{}}\left( v \right) \lambda _i\left( v \right) dv} } \right\| ^2} - 2 \beta _i^\top \int _{t - s}^t {{\rho _i}\left( v \right) \eta _i \left( v \right) dv} + \beta _i^\top \int _{t - s}^t {{\rho _i}\left( v \right) \lambda _i \left( v \right) dv} \underline{W} \end{aligned}$$
(39)

where \({\beta _i} = {\left[ {\beta _{ci}^\top \hspace{5.0pt}\beta _{ai}^\top } \right] ^\top }\), \(\rho _i \left( v \right) = diag \{ {\rho _{ci}^\top \left( v \right) , \rho _{ai}^\top \left( v \right) } \}\), \(\lambda _i \left( v \right) = [ ( e_{ci}^{(k)}\left( v \right) )^\top\) \(( e_{ai}^{(k)}( v ) )^\top ]^\top\), \(\eta _i \left( v \right) = {\left[ {\eta _{ci}^\top \left( v \right) \hspace{5.0pt}\eta _{ai}^\top \left( v \right) } \right] ^\top }\), \(W = \max \left\{ {{W_1},{W_2}} \right\}\), and \(\underline{W} = \underline{1} \otimes {W}\in R^{h_{vi+di}}\).

It is clear that \(\nabla {L_i}\left( k \right) < 0\) if Eq. (26) holds; hence, \({\tilde{W}}_{ci}^{(k)}\), \({\tilde{W}}_{ai}^{(k)}\), and \(\delta _i\) are UUB, that is, Algorithm 2 approximates the optimal solution rather than finding the exact optimal solution. \(\square\)

Remark 3

According to the Lyapunov-based convergence analysis25,26,27, \({\tilde{W}}_{ci}^{(k)}\), \({\tilde{W}}_{ai}^{(k)}\), and \(\delta _i\) are UUB if the gradient of \(L_i(k)\) can be proved to be negative under certain conditions.

Remark 4

It is worthwhile to note that the selections of \(Q_{ii}\) and \(R_{ij}\) affect the size of the ultimate bounded region \(\frac{||{\mathscr {L}_i}f_e(x)||^2}{2\Theta _i\lambda _{\text {min}}(Q_{ii})}\). To minimize this region, it is recommended to select larger \(Q_{ii}\) and smaller \(R_{ij}\).

Remark 5

It should be pointed out that this study focuses on how to solve the HJB equation for nonlinear MASs, how to design an IRL controller, and how to prevent the designed IRL method from getting trapped in local optima. Meanwhile, it is noteworthy that the designed IRL method can be further optimized and extended to address more interesting issues44,45,46, such as limited communication resources, unknown disturbances, and constrained actuators.

Simulation studies

In this section, two simulation examples are given to demonstrate the effectiveness of the designed method. The simulation platform is PyCharm with Python, where the sampling period, T, is set to 0.001 seconds. The controlled MASs consist of five agents, which are connected as shown in Fig. 1.

Fig. 1

The communication topology of MASs.

Example I

In this part, a nonlinear MAS47 is considered, where the communication topology of the MAS is set as shown in Fig. 1. \(f\left( {{x_i}} \right)\) and \(g\left( {{x_i}} \right)\) of each agent are given as \(f\left( {{x_i}} \right) = \left[ \begin{aligned} {x_{i2}} \\ 0.2{x_{i2}} - 5{x_{i1}} \\ \end{aligned} \right]\) and \(g\left( {{x_i}} \right) = \left[ \begin{aligned} 0 \\ 5\cos {\left( {{x_{i2}}{x_{i1}}} \right) ^3} \\ \end{aligned} \right]\).

The leader dynamics is \(k\left( {{x_0}} \right) = \left[ {x_{02}}, - 5{x_{01}} \right] ^\top\). Set \({Q_{11}} = {Q_{22}} = {Q_{33}} = {Q_{44}} = {Q_{55}} = 5\), \({R_{11}} = {R_{22}} = {R_{33}} = {R_{44}} = {R_{55}} = 0.5\), \({R_{ij}} = 0.2\) (note that \({R_{ij}} = 0.2\) only when \(j \in {\mathscr {N}_i}\); otherwise \({R_{ij}} = 0\)), and the discount factor \(\alpha = 0.1\). Select sigmoid functions as the activation functions \({\phi _{ci}}\left( \cdot \right)\) and \({\psi _{ai}}\left( \cdot \right)\) of the critic network and actor network, respectively. Set \({h_{vi}} = 15\), \({h_{di}} = 18\), and \(s=5\). Set the learning rates \({\beta _{ci}} = 0.001\) and \(\beta _{ai} = 0.0005\). Set \({N_{c,\max }} = 150\) and \({N_{a,\max }} = 100\). The thresholds are set as \({E_{c,thr}} = {E_{a,thr}} = 0.001\), and the computation error is set as \(\varepsilon = 0.01\). The initial states of the leader and followers are selected in the vicinity of (0, 0.5).
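For reference, the Example I settings above can be restated in code; the dynamics follow Eq. (1) with the f and g given earlier (here \(5\cos (x_{i2}x_{i1})^3\) is read as \(5(\cos (x_{i2}x_{i1}))^3\)), and the dictionary simply collects the hyperparameters listed above.

```python
import numpy as np

# Example I follower dynamics (Eq. (1)) and leader dynamics (Eq. (3)).
def f(x):                          # x = [x_i1, x_i2]
    return np.array([x[1], 0.2 * x[1] - 5.0 * x[0]])

def g(x):
    return np.array([[0.0], [5.0 * np.cos(x[1] * x[0]) ** 3]])

def k(x0):                         # leader: k(x_0) = [x_02, -5 x_01]^T
    return np.array([x0[1], -5.0 * x0[0]])

# Hyperparameters of Example I, as listed above.
params = dict(Q=5.0, R_self=0.5, R_neigh=0.2, alpha=0.1,
              h_v=15, h_d=18, s=5.0, beta_c=1e-3, beta_a=5e-4,
              N_c_max=150, N_a_max=100, E_thr=1e-3, eps=0.01, T=0.001)
```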

Fig. 2

The local neighborhood tracking errors of five agents.

Fig. 3

Evolutions of the control input of five agents.

Fig. 4

States of five agents.

The local neighborhood tracking errors are shown in Fig. 2, where it can be observed that all errors converge within about 6 seconds. Fig. 3 shows the control inputs of all agents. The states are shown in Fig. 4, which shows that all agents reach consensus with the leader within 6 seconds.

Fig. 5

Weights of actor-critic framework.

Fig. 6

Weights’ change trend of different update rules.

Fig. 5 shows the weight curves of each agent’s critic and actor neural networks, demonstrating that the weights converge. To illustrate the merits of the proposed weight update rule, we compare it with the existing method28. The variables \(\bigtriangleup \hat{W}_{ci}(t) = {\textstyle \sum _{i\in N}^{}} || \hat{W}_{ci}(t+T) - \hat{W}_{ci}(t) ||\) and \(\bigtriangleup \hat{W}_{ai}(t) = {\textstyle \sum _{i\in N}^{}}|| \hat{W}_{ai}(t+T) - \hat{W}_{ai}(t) ||\) are introduced to express the total change of the weight values. The comparison results are shown in Fig. 6, where it is apparent that the proposed weight update rule outperforms the existing method28.

Example II

In this section, a load frequency control simulation for a multi-area interconnected power system, shown in Fig. 7, is given to verify the effectiveness of the proposed scheme in practical systems.

The dynamics model48 of the ith power system is described as

$$\begin{aligned} {\dot{x}_i} = f_i\left( {{x_i}} \right) + g_i\left( {{x_i}} \right) {u_i}+h_i\left( {{x_i}} \right) w_i \end{aligned}$$
(40)

where \(x_i^\top = [ \Delta {F}_i, \Delta {P}_{\text{ tie-i } }, \Delta {P}_{m i}, \Delta {P}_{g i}]\) denotes the system states, \(w_i^\top =[\Delta P_{d i}, \sum _{j=1, j \ne i}^N T_{i j} \Delta F_j]\) denotes the external disturbances, \(f_i(x_i) =\left[ \begin{array}{cccc} -\frac{D_i}{M_i} & -\frac{1}{M_i} & \frac{1}{M_i} & 0 \\ 2 \pi \sum _{j=1, j \ne i}^N T_{i j} & 0 & 0 & 0 \\ 0 & 0 & -\frac{1}{T_{t i}} & \frac{1}{T_{t i}} \\ -\frac{1}{R_i T_{g i}} & 0 & 0 & -\frac{1}{T_{g i}} \end{array}\right] x_i\), \(g_i =\left[ \begin{array}{llll} 0&0&0&\frac{1}{T_{g i}} \end{array}\right] ^{\top }\), and \(h_i =\left[ \begin{array}{llll} \beta _i&1&0&0 \end{array}\right]\).

Fig. 7

Three-area power system49,50.

Moreover, \(\Delta F_i\), \(\Delta P_{\text{ tie }-i}\), \(\Delta P_{m i}\), \(\Delta P_{g i}\), and \(\Delta P_{d i}\) denote the frequency deviation, the tie-line power deviation, the generator output power deviation, the governor valve position deviation, and the load disturbance, respectively. The system parameters are set as \(D_1 = 0.031\), \(D_2 = 0.035\), \(D_3 = 0.037\), \(M_1 = 0.076\), \(M_2 = 0.085\), \(M_3 = 0.083\), \(R_1 = 1.48\), \(R_2 = 1.53\), \(R_3 = 1.62\), \(T_{g1} = 0.071\), \(T_{g2} = 0.074\), \(T_{g3} = 0.076\), \(T_{t1} = 0.51\), \(T_{t2} = 0.47\), \(T_{t3} = 0.46\), \(T_{12}=T_{21}=0.22\), \(T_{13}=T_{31}= 0.31\), and \(T_{23}=T_{32}=0.23\). The parameters of the controller are set as \({Q_{11}} = {Q_{22}} = {Q_{33}} = {Q_{44}} = {Q_{55}} = 5\) and \({R_{11}} = {R_{22}} = {R_{33}} = {R_{44}} = {R_{55}} = 0.5\). Select sigmoid functions as the activation functions \({\phi _{ci}}\left( \cdot \right)\) and \({\psi _{ai}}\left( \cdot \right)\) of the critic and actor neural networks, respectively. Set \({h_{vi}} = 6\), \({h_{di}} = 3\), and \(s=10\). Set the discount factor \(\alpha = 0.1\), and the learning rates \({\beta _{ci}} = 0.001\) and \(\beta _{ai} = 0.0005\). The weights of the critic and actor neural networks are updated with \({N_{c,\max }} = 150\) and \({N_{a,\max }} = 100\), the thresholds are set as \({E_{c,thr}} = {E_{a,thr}} = 0.001\), and the computation error is \(\varepsilon = 0.01\). The initial states of the power systems are selected as 0, and the load disturbances are \(\Delta P_{di} = 0.03\) for all areas. The generation rate constraint is selected as \(|\Delta \dot{P}_{c i}|\le 0.01\).
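As a concreteness check, the nominal matrices for area 1 of the load-frequency model (40) can be assembled from the parameters above; the sketch builds only \(f_1\) (as \(A_1 x_1\)) and \(g_1\), and the disturbance channel \(h_1 w_1\) is omitted.

```python
import numpy as np

# Area-1 matrices of the load-frequency model (40), using the parameters listed above.
D1, M1, R1 = 0.031, 0.076, 1.48
Tg1, Tt1 = 0.071, 0.51
T12, T13 = 0.22, 0.31      # tie-line coefficients to areas 2 and 3

# State x_1 = [dF_1, dP_tie-1, dP_m1, dP_g1]; f_1(x_1) = A1 @ x_1
A1 = np.array([
    [-D1 / M1,                -1.0 / M1, 1.0 / M1,   0.0],
    [2 * np.pi * (T12 + T13),  0.0,      0.0,        0.0],
    [0.0,                      0.0,     -1.0 / Tt1,  1.0 / Tt1],
    [-1.0 / (R1 * Tg1),        0.0,      0.0,       -1.0 / Tg1],
])
g1 = np.array([0.0, 0.0, 0.0, 1.0 / Tg1])   # control enters through the governor

def area1_dynamics(x1, u1):
    """Nominal (disturbance-free) dynamics of area 1."""
    return A1 @ x1 + g1 * u1
```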

Fig. 8

Curves of \(\Delta F_i\) in three areas.

Fig. 9

Curves of \(\Delta F_i\) in three areas with the existing data-driven method48.

Fig. 10

Evolution of the control input for three areas.

Fig. 11

Convergence of NNs weights.

Figure 8 illustrates the frequency deviations of the three areas, indicating that the proposed algorithm stabilizes the system within 7 seconds. Fig. 9 presents the control results of the existing data-driven method48, which exhibit higher overshoot and slower convergence compared to the proposed method. Fig. 10 shows the control input curves for the three areas, which stabilize after approximately 5 seconds. Fig. 11 illustrates the convergence of the weight parameters of the actor-critic NNs, which stabilize around 100 seconds, indicating the completion of the training process. Collectively, Figs. 8 to 11 demonstrate that the proposed method effectively achieves load frequency control in a multi-area power system, verifying its effectiveness and applicability.

Conclusions

This paper investigated an optimal consensus control problem for nonlinear multi-agent systems with unknown dynamics. By employing the integral reinforcement learning algorithm and the policy iteration method, the control policy and the value function have been approximated using an actor-critic neural network framework, and historical data have been utilized in the update rule for the weights of the actor and critic neural networks. Compared to existing methods, the proposed method exhibits a faster convergence speed and effectively leverages historical information to prevent falling into local optima. Extending the scheme to handle limited communication resources, unknown disturbances, and constrained actuators is a meaningful direction for our future work.