Introduction

A decrease in reservoir pressure is one of the main causes of production decline during oilfield development. To maintain reservoir pressure and improve recovery, waterflooding has been the main development method in Chinese oilfields, accounting for 74% of total production1,2,3. As water-drive development has deepened, China’s onshore oilfields have successively entered the high-water-cut stage, characterized by highly dispersed residual oil and great difficulty in controlling water4. Layered water injection technology is an effective method for improving the efficiency of oilfield development. By setting up multiple injection segments in a well, differentiated water injection can be carried out in different layers to achieve optimal injection efficiency and reservoir pressure balance, and to improve the balanced utilization of each oil layer5. With the promotion and application of the fourth-generation layered water injection device, layered water injection has been digitalized: the status of injection wells can be monitored in real time and the flow rate of each layer segment can be regulated6, so that injection efficiency and injection conformance have been significantly improved. However, flow scheduling in layered water injection systems is a complex optimization problem. The downhole environment is complex, and the dynamic adjustment of water nozzle openings in multi-layer segments is affected by multiple factors such as wellhead pressure, formation pressure, and the structure of the water injection device. Traditional flow scheduling methods usually rely on manual operation or empirical formulas that lack adaptivity and intelligence; their inability to cope with reservoir variations and injection non-uniformity leads to unsatisfactory flow scheduling. With the deepening of intelligent oilfield construction and the increasing number of injection wells, the original way of managing injection wells can no longer meet the needs of oilfield production. Solving the difficulty that water injection devices have in achieving the target flow allocation across multiple layer segments is therefore an important topic in oilfield production.

In recent years, many scholars have conducted in-depth studies on the flow dynamics of water injection devices. An Runze et al.7 used a fluid-solid coupling method to reveal the dynamic characteristics of the valve spool; Zhao Guangyuan et al.8 established a water injection volume calculation model to improve field applicability; Zhou Lizhi et al.9 proposed an injection algorithm based on physical property parameters; Jiang Xiufang et al.10 verified the power-law relationship between flow rate and differential pressure. These studies provide an important reference for this paper, but they exhibit significant limitations in dealing with the complex nonlinear problems of the actual water injection process. Traditional methods struggle with the strong nonlinearity caused by time-varying downhole pressure, fail to establish accurate mathematical models of complex downhole environments, and cannot adapt to uncertain, dynamically changing reservoir conditions. In contrast, deep reinforcement learning (DRL) shows unique advantages: it can learn the optimal control strategy directly from interaction with the system without relying on accurate physical models, and the inherent nonlinear processing ability of deep neural networks enables online adaptive adjustment to the dynamic environment. This ability to overcome the limitations of traditional methods and solve the nonlinear adaptive control problem of the water injection system is the main driving force of our research.

With the rapid development of artificial intelligence and machine learning, deep learning algorithms are increasingly used in intelligent control. Among them, reinforcement learning studies intelligent control systems that learn through continuous interaction with the environment, enabling the system to adjust its behavior according to environmental feedback and finally achieve the optimal control strategy11. In recent years, deep reinforcement learning has gained significant attention in robotics12, where the agent model serves as the controller of the robot, and the robot together with its surroundings constitutes the deep reinforcement learning environment. A robotic agent operates in a randomized environment by sequentially selecting actions over a series of time steps. This framework allows robotic agents to learn skills and solve tasks autonomously through trial and error13. The approach shows potential for applications in areas such as multi-agent collaboration14,15, autonomous vehicles16,17,18, and robot control19,20,21.

Duy Quang Tran et al.17 proposed a deep reinforcement learning model that integrates the Flow framework with the PPO algorithm and demonstrated that fully automated driving significantly improves efficiency compared with purely manual driving while effectively suppressing the stop-and-go wave phenomenon at unsignalized intersections. Chengyi Zhao et al.19 proposed an inverse kinematics solving method for a robotic arm based on the MAPPO algorithm, which significantly improves generalization and computational efficiency compared with traditional algorithms, supports real-time unique solution generation, and enables path planning and intelligent obstacle avoidance in dynamic environments. Jiao Huanyan et al.22 proposed a reinforcement learning-based air conditioning control strategy for metro stations, which uses neural networks to simulate the environment and achieves temperature control and energy saving by training the agents; simulation experiments show that the strategy can effectively reduce energy consumption.

Compared with other reinforcement learning methods, the Soft Actor-Critic (SAC) algorithm has remarkable advantages in layered water injection control23. Its entropy maximization mechanism enables more efficient exploration in the continuous action space, making it suitable for fine-tuned water injection24. The twin Q-network solves the over-estimation problem of traditional value functions. The automatic temperature adjustment mechanism can dynamically balance exploration and exploitation. These advantages make the SAC algorithm perform better in terms of stability and control when dealing with water injection control problems that feature sparse rewards and high-dimensional action spaces.

In order to solve the flow scheduling problem in layered water injection systems, this paper proposes a reinforcement learning-based adaptive control algorithm as shown in Fig. 1. The proposed method models the flow scheduling problem as a Markov decision process, uses a deep neural network as a value function approximator, and learns the optimal water nozzle opening strategy by interacting with the environment.

Fig. 1

Adaptive control of water injection device, in which the water injection device was modeled using SolidWorks 2024 (Dassault Systèmes, https://www.solidworks.com); the Deep RL Model was created with Visio 2019 (Microsoft Corporation, https://www.microsoft.com/visio); and the water injection well and other elements were created in PowerPoint 2019 (Microsoft Corporation, https://www.microsoft.com/zh-cn/microsoft-365/powerpoint).

Hydrodynamic model of layered water injection string

Structure and working principle of water injection device

The reservoirs penetrated by an injection well differ in permeability, pressure, and other injection conditions, and each reservoir requires a different injection volume.

In order to realize layered water injection, a packer is used to separate the formation into several layer segments, and a water injection device is placed in each layer segment to ensure the smooth advancement of the water drive front by adjusting the injection volume for each segment. The structure of the layered water injection well is shown in Fig. 2.

Fig. 2

Water injection well structure diagram.

Layered water injection is an effective method for improving the structure of injection and extraction during oilfield exploitation and increasing the oilfield recovery rate. This technique is widely used in the development of high-water-cut oilfields. The water injection device is located between two packers, and the injection volume for each layer segment is adjusted by changing the water nozzle opening. The injection well parameters are shown in Table 1.

Table 1 List of symbols for water injection well.

Water injection string pressure loss

Considering the water injection process as incompressible flow, once the water nozzle opening is no longer changed, the velocity and pressure of the fluid do not change with time. According to Bernoulli’s principle, the total energy per unit weight of fluid is conserved between any two points along a streamline. Thus, the hydraulic equilibrium relationship between the wellhead and the nozzle outlet of each layer segment can be derived from Bernoulli’s equation25:

$$\frac{{{p_0}}}{{\rho g}}+{h_i}=\frac{{{p_{ai}}}}{{\rho g}}+\frac{{v_{i}^{2}}}{{2g}}+{h_{wi}}$$
(1)


Where \({p_0}\) is the wellhead pressure, Pa; \(\rho\) is the density of the injected fluid, kg/m³; \(g\) is the gravitational acceleration, m/s²; \({p_{ai}}\) is the formation pressure of the i-th layer, Pa; \({v_i}\) is the average flow velocity at the outlet of the nozzle of the i-th layer, m/s; \({h_i}\) is the depth from the wellhead to layer i, m; \({h_{wi}}\) is the head loss from the wellhead to layer i, m. The head loss consists of frictional and local losses and is expressed as:

$${h_{wi}}={h_{fi}}+{h_{ji}}$$
(2)

Where \({h_{fi}}\) is the frictional head loss from the wellhead to layer i, m; \({h_{ji}}\) is the local loss at the outlet of layer i, m.

The local pressure loss at the nozzle outlet of the water injection device, caused by the change in pipe diameter, is expressed as26,27:

$${h_{ji}}={\zeta _i}\frac{{v_{i}^{2}}}{{2g}}$$
(3)


Where \({\zeta _i}\) is the local loss coefficient of the i-th layer nozzle; \({v_i}\) is the average flow velocity at the outlet of the nozzle of the i-th layer, m/s.

The frictional (along-path) resistance loss is the energy loss due to relative motion within the liquid and viscous friction between the liquid and the pipe wall28, expressed as:

$$h_{fi} = \begin{cases}\lambda_i \dfrac{h_i}{d_0} \dfrac{\nu_{oi}^2}{2g}, & i = 1 \\[10pt]\lambda_1 \dfrac{h_1}{d_0} \dfrac{\nu_{o1}^2}{2g} + \sum\limits_{i=2}^{n} \lambda_i \dfrac{h_i - h_{i-1}}{d_0} \dfrac{\nu_{oi}^2}{2g}, & i > 1\end{cases}$$
(4)

Where \({\lambda _i}\) is the friction loss coefficient of the i-th section; \({d_0}\) is the inner diameter of the water injection string, m; \({v_{oi}}\) is the average velocity in the water injection string of the i-th layer, m/s. The flow regime of the fluid, which has a large influence on the frictional loss, is usually described by the Reynolds number; for the i-th section it is:

$$R{e_i}=\frac{{\rho {v_{oi}}{d_0}}}{\mu }=\frac{{{v_{oi}}{d_0}}}{\nu }$$
(5)

Where \(\rho\) is the fluid density, kg/m³; \({v_{oi}}\) is the average velocity in the injection string of the i-th layer, m/s; \(\mu\) is the dynamic viscosity, \({\text{Pa}} \cdot {\text{s}}\); \(\nu\) is the kinematic viscosity, m²/s; \({d_0}\) is the inner diameter of the injection string, m.

As pipe flow transitions from laminar to transitional and turbulent flow, the friction loss coefficient \({\lambda _i}\) depends on the Reynolds number \(R{e_i}\) and on the relative roughness of the pipe, which differs among flow regimes. The relative roughness of the water injection string is expressed as:

$$\bar {\varepsilon }=\frac{\varepsilon }{{{d_0}}}$$
(6)

Where \(\bar {\varepsilon }\) is the dimensionless relative roughness; \(\varepsilon\) is the roughness of the inner surface of the pipe, m; and \({d_0}\) is the inner diameter of the water injection string, m.

When the Reynolds number \(Re<2300\), the flow in the injection string is laminar; when \(Re \geqslant 2300\), the flow develops into turbulence, and the friction coefficient is calculated with the Haaland formula10. For the i-th layer, the friction loss coefficient \({\lambda _{oi}}\) can be expressed as:

$${\lambda _{oi}}=\left\{ {\begin{array}{*{20}{l}} {\dfrac{{64}}{{R{e_i}}},}&{R{e_i}<2300} \\ {{{\left\{ { - 1.8\lg \left[ {{{\left( {\dfrac{{\bar {\varepsilon }}}{{3.7}}} \right)}^{1.11}}+\dfrac{{6.9}}{{R{e_i}}}} \right]} \right\}}^{ - 2}},}&{R{e_i} \geqslant 2300} \end{array}} \right.$$
(7)

Flow model of water injection device

From the continuity equation, the fluid velocity \({v_{oi}}\) in the injection string is expressed as:

$${v_{oi}}=\frac{{4{v_i}{S_i}}}{{\pi d_{0}^{2}}}$$
(8)

Where \({v_i}\) is the average flow velocity at the outlet of the i-th nozzle, m/s; \({S_i}\) is the flow area of the i-th nozzle, m².

Substituting Eq. (8) into Eq. (4) yields the relationship between the nozzle outlet velocity \({v_i}\) and the frictional loss; substituting Eqs. (4) and (3) into Eq. (1) then yields an expression for the outlet velocity:

$${v_i}({p_0},{h_i},{p_{ai}},{\zeta _i},{\lambda _{oi}})=\sqrt {\frac{{2{p_0}+2\rho g{h_i} - 2{p_{ai}}}}{{\rho +{\zeta _i}\rho +{\lambda _{oi}}\frac{{{h_i}A_{i}^{2}}}{{{d_0}A_{0}^{2}}}\rho }}}$$
(9)

With the nozzle outlet velocity \({v_i}\) obtained, the injection rate of the i-th layer is calculated as:

$${Q_i}{\text{=}}{A_i}{v_i}$$
(10)


The friction loss coefficient \({\lambda _{oi}}\) of the injection string is a function of the in-string velocity \({v_{oi}}\) and therefore cannot be obtained directly when the nozzle outlet velocity is unknown. An iterative method is used to solve for the friction loss coefficient of the water injection string. The calculation process is shown in Algorithm 1:

Algorithm 1

Iterative calculation method for layer segment flow rate.
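To make the iterative procedure concrete, the sketch below implements the single-segment case (the i = 1 branch of Eq. (4)) in Python: the friction coefficient is initialized, the outlet velocity is evaluated from Eq. (9), and the Reynolds number (Eq. (5)) and the Haaland formula (Eq. (7)) are used to update the coefficient until convergence. The function name, default fluid properties, and convergence settings are ours for illustration and do not reproduce the authors' implementation.

```python
import math

def segment_flow(p0, h_i, p_ai, zeta_i, A_i, d0, rho=1000.0, nu=1e-6,
                 eps=2e-4, g=9.81, tol=1e-6, max_iter=100):
    """Iteratively solve the nozzle outlet velocity v_i and injection rate Q_i
    of one layer segment from Eqs. (5)-(10).

    p0: wellhead pressure, Pa; h_i: depth to layer i, m; p_ai: formation pressure, Pa;
    zeta_i: local loss coefficient; A_i: nozzle flow area, m^2; d0: string inner diameter, m;
    rho: density, kg/m^3; nu: kinematic viscosity, m^2/s; eps: pipe roughness, m
    (the fluid and pipe defaults are illustrative).
    """
    A0 = math.pi * d0 ** 2 / 4.0          # cross-sectional area of the injection string
    rel_rough = eps / d0                   # relative roughness, Eq. (6)
    lam = 0.02                             # initial guess of the friction coefficient
    v_i = 0.0

    for _ in range(max_iter):
        # Outlet velocity from Eq. (9) for the current friction coefficient
        numerator = 2.0 * p0 + 2.0 * rho * g * h_i - 2.0 * p_ai
        denominator = rho * (1.0 + zeta_i + lam * h_i * A_i ** 2 / (d0 * A0 ** 2))
        v_new = math.sqrt(max(numerator, 0.0) / denominator)

        v_oi = v_new * A_i / A0            # velocity inside the string, Eq. (8)
        Re = v_oi * d0 / nu                # Reynolds number, Eq. (5)
        # Friction coefficient, Eq. (7): laminar law or Haaland formula
        if Re < 2300:
            lam_new = 64.0 / max(Re, 1e-9)
        else:
            lam_new = (-1.8 * math.log10((rel_rough / 3.7) ** 1.11 + 6.9 / Re)) ** -2

        converged = abs(v_new - v_i) < tol
        v_i, lam = v_new, lam_new
        if converged:
            break

    Q_i = A_i * v_i                        # injection rate, Eq. (10), m^3/s
    return v_i, Q_i * 86400.0              # return Q_i in m^3/d
```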

Reinforcement learning control algorithm

Water injection device state space

The environmental state information includes ground information, information about the structure of the layered water injection unit, real-time sensor information, formation information, and information about the target injection volume, and the state space is defined as:

$$S=\{ {p_0},{x_i},{h_i},{p_{ai}},{p_{bi}},{q_{tai}}\} \quad \forall i \in \{ 1,2,...,n\}$$
(11)

Where \({p_0}\) is the wellhead injection pressure, Pa; \({x_i}\) is the nozzle opening percentage of the i-th layer; \({h_i}\) is the depth of the water distributor of the i-th layer, m; \({p_{ai}}\) is the formation pressure of the i-th layer, Pa; \({p_{bi}}\) is the pressure in the injection string at the i-th layer, Pa; \({q_{tai}}\) is the target injection volume of the i-th layer, m³/d.

Water injection device action space

The layered water injection device changes the pressure difference across the nozzle by controlling the nozzle opening, thereby meeting the injection flow requirements of the different layer segments. Here the nozzle movement time \(r{o_i}\) is the parameter of the agent's continuous action space, and the continuous action space \(A\) is defined as:

$$A=\{ r{o_i}\} \quad \forall i \in \{ 1,2,...,n\}$$
(12)

Where \(r{o_i} \in [ - 5,5]\); \(r{o_i}>0\) means that the spool in the water distributor moves in the direction of increasing the flow area for \(r{o_i}\) seconds, \(r{o_i}=0\) means that the spool does not move, and \(r{o_i}<0\) means that the spool moves in the direction of decreasing the flow area for \(|r{o_i}|\) seconds.
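As a concrete illustration of the state and action definitions in Eqs. (11) and (12), the following sketch declares them as Gymnasium spaces. The numeric bounds and the choice of three layer segments are illustrative assumptions, not values taken from the paper.

```python
import numpy as np
from gymnasium import spaces

n_layers = 3  # number of layer segments (example value)

# State: wellhead pressure plus, for each layer, opening, depth, formation pressure,
# string pressure, and target injection volume -> 1 + 5*n parameters, matching Eq. (11).
obs_low = np.concatenate(([0.0], np.tile([0.0, 0.0, 0.0, 0.0, 0.0], n_layers)))
obs_high = np.concatenate(([30e6], np.tile([100.0, 3000.0, 30e6, 30e6, 80.0], n_layers)))
observation_space = spaces.Box(low=obs_low.astype(np.float32),
                               high=obs_high.astype(np.float32))

# Action: spool movement time ro_i in [-5, 5] seconds for each layer, Eq. (12).
action_space = spaces.Box(low=-5.0, high=5.0, shape=(n_layers,), dtype=np.float32)
```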

Water injection device reward function

In the design of the water injection algorithm, the agent (water injection device) has two objectives: the smallest possible injection error in each layer segment and the smallest possible number of adjustments.

The motion reward \({R_r}\) rewards the direction of nozzle movement: since the nozzle opening is proportional to the flow rate, a reward is given when the movement direction is consistent with the difference between the target injection volume \({q_{ta}}\) and the actual injection volume \({q_{re}}\). The motion reward is expressed as:

$${R_{ri}}=\left\{ {\begin{array}{*{20}{c}} {1,({q_{tai}} - {q_{rei}})>0{\text{ }}and{\text{ }}r{o_i}>0} \\ {1,({q_{tai}} - {q_{rei}})<0{\text{ }}and{\text{ }}r{o_i}<0} \end{array}} \right.$$
(13)

Where \({q_{ta}} \in \{ {q_{ta1}},{q_{ta2}}, \cdots ,{q_{tai}}\}\) denotes the set of target injection volumes of the layer segments, m³/d; \({q_{re}} \in \{ {q_{re1}},{q_{re2}}, \ldots ,{q_{rei}}\}\) denotes the set of actual injection volumes of the layer segments used during regulation of the waterflooding model, m³/d; \(r{o_i}\) denotes the rotation time and direction of the i-th layer: \(r{o_i}>0\) means rotating for \(r{o_i}\) seconds in the direction of increasing the nozzle opening, and \(r{o_i}<0\) means rotating for \(|r{o_i}|\) seconds in the opposite direction.

The position reward \({R_l}\) prevents the nozzle from moving to a critical value (fully closed or fully open). When the opening is zero, the injection of that layer segment is zero, which seriously affects the service life of the water injection device. To prevent this situation, a penalty is given to the agent when \({x_i}=0\) or \({x_i}=100\). The position reward is expressed as:

$$({x_i}=0 \vee {x_i}=100) \Rightarrow {R_{li}}= - k$$
(14)

The purpose of the time reward \({R_t}\) is to make the agent reach the goal in as few adjustment steps as possible. Different numbers of layer segments have different time rewards; with n denoting the number of layer segments, the reward is expressed as:

$${R_t}= - 0.5 \times n$$
(15)

The error reward \({R_e}\) reflects the distance between the actual value and the target value; as a continuous reward it gives the agent a smooth signal of how far it is from the target. Ten times the ratio of the error to the target value is taken as the penalty value, and the error reward is expressed as:

$${R_{ei}}=10 \times \frac{{|{q_{tai}} - {q_{rei}}|}}{{{q_{tai}}}}$$
(16)

The target reward \({R_{ta}}\) is the reward given when the agent reaches the error range of the target value, measured by the mean absolute error; the mean absolute error of the injection well is expressed as:

$$MAE({q_{tai}},{q_{rei}})=\frac{1}{n}\sum\limits_{{i=1}}^{n} | {q_{tai}} - {q_{rei}}|$$
(17)

When the agent reaches the target value, a reward is given; the target reward is expressed as:

$${R_{ta}}=\left\{ {\begin{array}{*{20}{r}} {100,MAE \leqslant 0.05} \\ {0,MAE>0.05} \end{array}} \right.$$
(18)

Based on the motion, position, time, error, and target rewards, the total reward function is expressed as:

$$R=\left\{ {\begin{array}{*{20}{l}} {{R_{ta}},}&{MAE \leqslant 0.05} \\ {{R_t}+\sum\limits_{{i=1}}^{n} {{R_{ri}}} +{R_{li}}+{R_{ei}},}&{MAE>0.05} \end{array}} \right.$$
(19)
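The reward components above combine into Eq. (19); the following sketch assembles them in Python for one control step. The penalty magnitude k and the positive sign of the error term follow Eqs. (14)–(19) as printed; since the text describes the error reward as a penalty, its sign may need to be flipped in practice. The function and variable names are ours for illustration.

```python
import numpy as np

def total_reward(q_ta, q_re, ro, x, n, k=10.0, mae_tol=0.05):
    """Total reward of Eq. (19) assembled from Eqs. (13)-(18).

    q_ta, q_re: target and actual injection volumes per layer, m^3/d
    ro: spool movement times (action), s; x: nozzle openings, % (0-100)
    n: number of layer segments; k: position penalty magnitude (value not given in the paper)
    """
    q_ta, q_re, ro, x = map(np.asarray, (q_ta, q_re, ro, x))

    mae = np.mean(np.abs(q_ta - q_re))                  # Eq. (17)
    if mae <= mae_tol:
        return 100.0                                    # target reward, Eq. (18)

    # Motion reward, Eq. (13): +1 when the movement direction matches the sign of the error
    r_motion = ((q_ta - q_re) * ro > 0).astype(float)

    # Position reward, Eq. (14): penalty at the mechanical limits of the nozzle
    r_position = np.where((x == 0) | (x == 100), -k, 0.0)

    # Error reward, Eq. (16): ten times the relative error
    # (described as a penalty in the text; negate here if it is applied as one)
    r_error = 10.0 * np.abs(q_ta - q_re) / q_ta

    r_time = -0.5 * n                                   # time reward, Eq. (15)
    return r_time + float(np.sum(r_motion + r_position + r_error))
```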

Combined with SAC algorithm

The SAC (Soft Actor-Critic) algorithm is a maximum-entropy, model-free deep reinforcement learning algorithm, which can be trained off-policy from replayed experience and can solve reinforcement learning problems in discrete and continuous action spaces well29,30,31,32.

Fig. 3

Reinforcement learning SAC algorithm structure.

The network structure of the SAC algorithm increases the exploration space and avoids falling into local optima. It maximizes the trade-off between expected return and entropy and has achieved leading results in a number of standard environments33,34,35. The structure of the SAC algorithm is shown in Fig. 3.

The policy network receives the state s from the environment as input. For a water injection environment with n layer segments, the number of input parameters is \(1+n \times 5\). The number of fully connected hidden layers matches the number of layer segments, with 64 neurons per layer for one and two layer segments and 128 neurons for three layer segments; LeakyReLU and Tanh are used as the activation functions of the hidden and output layers, respectively.

Entropy measures the degree of randomness of a random variable and is defined as36,37:

$$\mathcal{H}\left( {\pi ( \cdot |{s_{t+1}})} \right)= - {\text{log}}\pi ({a_{t+1}}|{s_{t+1}})$$
(20)

The SAC algorithm maximizes the cumulative expected reward while making the strategy more stochastic, and the optimization objective of the strategy is defined as38:

$${\pi ^*}=\arg \mathop {\hbox{max} }\limits_{\pi } {{\mathbb{E}}_\pi }\left[ {\sum\limits_{t} r ({s_t},{a_t})+\alpha \mathcal{H}\left( {\pi ( \cdot |{s_t})} \right)} \right]$$
(21)

Where \(\gamma \in (0,1)\) is the discount factor, which reflects the effect of future rewards on the current return; \(\alpha \in (0,1)\) is the temperature coefficient, which controls the importance of entropy; and \(\mathcal{H}(\pi ( \cdot |{s_t}))\) denotes the degree of randomness of strategy \(\pi\) in state \({s_t}\).

SAC uses two action-value functions \({Q_{\omega i}}({s_t},{a_t})\) and, whenever a Q network is used, picks the one with the smaller Q value, thus mitigating the overestimation of Q values. The loss function of either Q function is39,40:

$$\begin{gathered} {L_Q}(\omega )={E_{({s_t},{a_t},{r_t},{s_{t+1}})\sim D,{a_{t+1}}\sim {\pi _\theta }( \cdot |{s_{t+1}})}}\Big[\frac{1}{2}\big({Q_\omega }({s_t},{a_t}) - \hfill \\ ({r_t}+\gamma (\mathop {\hbox{min} }\limits_{{j=1,2}} {Q_{\omega _{j}^{ - }}}({s_{t+1}},{a_{t+1}}) - \alpha \log \pi ({a_{t+1}}|{s_{t+1}})))\big)^{2}\Big] \hfill \\ \end{gathered}$$
(22)

Where \(\mathcal{D}\) is the replay buffer of data collected by the policy in the past, and \({Q_{\omega _{j}^{ - }}}\) is the target Q network with parameters \(\omega _{j}^{ - }\) used to approximate the target value; its update is denoted as:

$$\omega _{j}^{ - } \leftarrow \tau {\omega _j}+(1 - \tau )\omega _{j}^{ - }$$
(23)

Where \(\tau\) is the soft-update coefficient: the target network is updated as a weighted average of the current Q network and the previous target network. The Q network itself is updated by gradient descent, and its gradient is expressed as:

$${\nabla _\omega }{L_Q}(\omega )=\frac{1}{{|\mathcal{B}|}}\sum\limits_{{i \in \mathcal{B}}} {{\nabla _\omega }{Q_\omega }({s_i},{a_i})\left( {{Q_\omega }({s_i},{a_i}) - {y_i}} \right)}$$
(24)

Where \(\mathcal{B}\) denotes a fixed-size batch of samples drawn from buffer \(\mathcal{D}\); for each sample, the target networks are used to compute \({y_i}\) as:

$${y_i}={r_i}+\gamma {\hbox{min} _{j=1,2}}{Q_{{\omega _j}}}({s_{i+1}},{a_{i+1}}) - \alpha {\text{log}}{\pi _\theta }({a_{i+1}}|{s_{i+1}})$$
(25)

The policy network is a state-to-action mapping. The policy \(\pi\) is updated by minimizing the KL divergence, and the policy loss function is expressed as:

$${L_{\pi (\theta )}}={E_{{s_t}\sim \mathcal{D},{a_t}\sim {\pi _\theta }}}[\alpha \log ({\pi _\theta }({a_t},{s_t})) - {Q_\omega }({s_t},{a_t})]$$
(26)

Since sampling from a Gaussian distribution is not differentiable, the SAC algorithm uses the reparameterization trick to make the sampling process differentiable for the policy function; the action \({a_t}\) is expressed as:

$${a_t}={f_\theta }({\epsilon _t};{s_t})$$
(27)

Where \({\epsilon _t}\) is a noise random variable. Considering both Q functions, the policy loss function is rewritten as:

$$L_{\pi}(\theta)=\mathbb{E}_{{s_t}\sim D,\,{\epsilon _t}\sim \mathcal{N}}\left[ {\alpha \log \left( {{\pi _\theta }\left( {{f_\theta }({\epsilon _t};{s_t})|{s_t}} \right)} \right) - \mathop {\hbox{min} }\limits_{{j=1,2}} {Q_{{\omega _j}}}\left( {{s_t},{f_\theta }({\epsilon _t};{s_t})} \right)} \right]$$
(28)

The expression for the gradient \({\nabla _\theta }{L_\pi }\) of the policy network at time slot t is:

$$\begin{gathered} {\nabla _\theta }{L_\pi }(\theta )=\sum {\frac{1}{{|\mathcal{B}|}}} {\nabla _\theta }\alpha \log \left( {{\pi _\theta }({a_t},{s_t})} \right)+ \hfill \\ \left( {{\nabla _{{a_t}}}\alpha \log \left( {{\pi _\theta }({a_t},{s_t})} \right) - {\nabla _{{a_t}}}Q({s_t},{a_t})} \right){\nabla _\theta }{f_\theta }({\epsilon _t};{s_t}) \hfill \\ \end{gathered}$$
(29)

The temperature coefficient of entropy is important in the SAC algorithm, and different sizes of temperature coefficients are chosen for different states. In order to automatically adjust the temperature coefficient of entropy, the SAC algorithm constructs an optimization problem with constraints as:

$$\mathop {\hbox{max} }\limits_{\pi } {{\mathbb{E}}_\pi }\left[ {\sum\limits_{t} r ({s_t},{a_t})} \right]\quad {\text{s}}.{\text{t}}.\quad {{\mathbb{E}}_{({s_t},{a_t})\sim {\rho _\pi }}}\left[ { - \log \left( {{\pi _t}({a_t}|{s_t})} \right)} \right] \geqslant {\mathcal{H}_0}$$
(30)


Transforming Eq. (30) into its dual problem through the Lagrangian dual method leads to the loss function of the temperature coefficient at time slot t41,42:

$$L(\alpha )={{\text{E}}_{{s_t}\sim \mathcal{D},{a_t}\sim \pi ( \cdot |{s_t})}}[ - \alpha {\text{log}}\pi ({a_t}|{s_t}) - \alpha {\mathcal{H}_0}]$$
(31)


Where \({\mathcal{H}_0}\) denotes the minimum policy entropy threshold.
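For readers who want to connect Eqs. (22)–(31) to an implementation, the following PyTorch sketch performs one SAC update step: critic regression toward the target of Eq. (25), the reparameterized policy loss of Eq. (28), the temperature loss of Eq. (31), and the soft target update of Eq. (23). The network sizes, the squashed-Gaussian policy, and the target entropy value are illustrative assumptions and do not reproduce the authors' code.

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """Small fully connected network used for both actor and critics (illustrative sizes)."""
    def __init__(self, in_dim, out_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.LeakyReLU(),
                                 nn.Linear(hidden, hidden), nn.LeakyReLU(),
                                 nn.Linear(hidden, out_dim))
    def forward(self, x):
        return self.net(x)

def gaussian_action(actor, state, act_limit=5.0):
    """Reparameterized squashed-Gaussian policy: a_t = f_theta(eps_t; s_t), Eq. (27)."""
    mu, log_std = actor(state).chunk(2, dim=-1)
    log_std = log_std.clamp(-20, 2)
    std = log_std.exp()
    eps = torch.randn_like(mu)                          # eps_t ~ N(0, I)
    pre_tanh = mu + std * eps
    action = torch.tanh(pre_tanh) * act_limit           # ro_i in [-5, 5]
    # Gaussian log-prob with tanh correction (constant scaling term omitted)
    log_prob = (-0.5 * ((pre_tanh - mu) / std) ** 2 - log_std - 0.9189385).sum(-1)
    log_prob -= (2 * (0.6931472 - pre_tanh - nn.functional.softplus(-2 * pre_tanh))).sum(-1)
    return action, log_prob

def sac_update(batch, actor, q1, q2, q1_targ, q2_targ, log_alpha,
               opt_actor, opt_q, opt_alpha, gamma=0.9, tau=0.005, target_entropy=-3.0):
    """One SAC update: critic loss (Eqs. 22, 25), policy loss (Eq. 28),
    temperature loss (Eq. 31), and soft target update (Eq. 23)."""
    s, a, r, s2 = batch                                  # tensors sampled from the replay buffer
    alpha = log_alpha.exp()

    # Critic update: target y from Eq. (25), squared error as in Eq. (22)
    with torch.no_grad():
        a2, logp2 = gaussian_action(actor, s2)
        q_targ = torch.min(q1_targ(torch.cat([s2, a2], -1)),
                           q2_targ(torch.cat([s2, a2], -1))).squeeze(-1)
        y = r + gamma * (q_targ - alpha * logp2)
    q1_pred = q1(torch.cat([s, a], -1)).squeeze(-1)
    q2_pred = q2(torch.cat([s, a], -1)).squeeze(-1)
    loss_q = 0.5 * ((q1_pred - y) ** 2 + (q2_pred - y) ** 2).mean()
    opt_q.zero_grad(); loss_q.backward(); opt_q.step()

    # Policy update: Eq. (28) with the minimum of the two critics
    a_new, logp = gaussian_action(actor, s)
    q_new = torch.min(q1(torch.cat([s, a_new], -1)),
                      q2(torch.cat([s, a_new], -1))).squeeze(-1)
    loss_pi = (alpha.detach() * logp - q_new).mean()
    opt_actor.zero_grad(); loss_pi.backward(); opt_actor.step()

    # Temperature update: Eq. (31) with H0 = target_entropy
    loss_alpha = (-log_alpha.exp() * (logp.detach() + target_entropy)).mean()
    opt_alpha.zero_grad(); loss_alpha.backward(); opt_alpha.step()

    # Soft update of the target critics, Eq. (23)
    with torch.no_grad():
        for net, net_targ in ((q1, q1_targ), (q2, q2_targ)):
            for p, p_t in zip(net.parameters(), net_targ.parameters()):
                p_t.mul_(1 - tau).add_(tau * p)
```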

Simulation results and analysis

Example verification of iterative calculation method for layer segment flow rate

In the intelligent regulation system for layered water injection, accurate calculation of segment flow is the core foundation for constructing the training environment of reinforcement learning. The flow in multiple downhole segments is affected by strongly coupled factors such as pressure, pipe diameter, and nozzle opening, making it difficult for traditional analytical methods to characterize the nonlinear flow behavior. Therefore, an iterative calculation model is constructed based on Bernoulli’s equation and fluid mechanics theory. By iteratively solving the coupling relationship between the friction loss coefficient and the flow rate, it provides flow-state feedback of water injection wells for reinforcement learning algorithms and supports the training of intelligent regulation strategies such as SAC in dynamic environments. The real well data of a two-segment well in an oilfield in Daqing are shown in Table 2. The correctness of the iterative calculation method for segment flow is verified by comparison with the real well data.

Table 2 Parameter table of a real well in Daqing oilfield.

To verify the accuracy of the iterative calculation method for layer-segment flow, a calculation model is constructed based on the real-well parameters in Table 2. This well includes Wellbore 1 (well depth 890.25 m) and Wellbore 2 (well depth 921.1 m). The inner diameter of the tubing is 0.062 m and the roughness is 0.2 m. Two sets of water nozzles are equipped; the opening range of each set is 0–0.002 m, divided into 100 scales. The flow coefficient is \({C_d}=0.31 - 0.01d+1.59 \times {10^{ - 4}}{d^2} - 7.43 \times {10^{ - 7}}{d^3}\), and the corresponding layer-segment flow range is 0–80 m³/d. The injection flow rate at the wellhead is 50.2–77.29 m³/d, the pressure is 11.01–11.09 MPa, the fluid density is 980 kg/m³, and the dynamic viscosity is 0.001 \({\text{Pa}} \cdot {\text{s}}\). The simulated layer-segment flows output by the model are compared with the flow data actually measured at the wellhead. The comparison between the actual and simulated values is shown in Table 3.

Table 3 Comparison table of simulated results and measured data.
The deviations between the simulated and measured values arise mainly from the following factors:

1. The water quality of the injection well affects the fluid viscosity, which changes the flow regime.

2. The roughness of the pipe wall causes deviations in the calculation of the frictional resistance.

3. The fitted flow coefficient does not match perfectly, leading to inaccurate calculation of the nozzle flow.

These factors collectively cause errors in the iterative calculation method for layer-segment flow. The absolute errors under various working conditions are between 1% and 6%, so the model can reflect the flow variation trend of each layer segment in the real environment.

Fig. 4

Flow-pressure difference square root fitting relationship diagram with fluctuating characteristics.

During the actual water injection process, sensor data often exhibits fluctuating characteristics due to factors such as equipment vibration and sensor accuracy. To simulate this working condition, a randomly generated coefficient error of -0.05 to 0.05 is added to the flow coefficient. The flow-pressure difference relationship with fluctuating characteristics is shown in Fig. 4.
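A minimal sketch of this fluctuation test is shown below: a uniformly distributed error of ±0.05 is added to a nominal flow coefficient, and the resulting flow is fitted against the square root of the pressure difference, as in Fig. 4. The orifice-flow relation, the nominal coefficient, the flow area, and the pressure range are illustrative assumptions, not the paper's values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pressure differences across the nozzle (Pa) and a nominal flow coefficient;
# the numeric values here are illustrative, not taken from the field data.
dp = np.linspace(0.2e6, 2.0e6, 50)
cd_nominal = 0.6
area, rho = 1.0e-4, 980.0                   # nozzle flow area (m^2) and fluid density (kg/m^3)

# Add a uniformly distributed coefficient error of -0.05 to 0.05, as described in the text
cd_noisy = cd_nominal + rng.uniform(-0.05, 0.05, size=dp.size)

# Orifice-type relation: flow proportional to the square root of the pressure difference
q = cd_noisy * area * np.sqrt(2.0 * dp / rho) * 86400.0   # m^3/d

# Fit Q = k * sqrt(dp) to recover the square-root relationship shown in Fig. 4
k = np.linalg.lstsq(np.sqrt(dp)[:, None], q, rcond=None)[0][0]
print(f"fitted slope k = {k:.4f} (m^3/d per Pa^0.5)")
```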

Training environment and parameters

To determine the optimal hyperparameters of the SAC algorithm, we first identify feasible learning rates. Extensive testing shows that the model converges only with learning rates of 3e-5, 3e-10, and 3e-15 in the environments with different layer-segment attributes. With the learning rate fixed, we then analyze the neural network parameters by training for 500 steps with 32–512 neurons and 1–4 hidden layers and computing the average reward of the last 50 steps to determine the optimal network parameters. The statistical results are shown in Table 4.

Table 4 Reinforcement learning network training hyperparameters.

In a single-layer environment, the average reward reaches its maximum with 64 neurons and 3 hidden layers; when the number of neurons increases to 128, the reward value changes little. In the two-layer and three-layer environments, the reward value fluctuates as the number of neurons increases, and when the number of neurons reaches 256 and 1024, the average reward in both environments decreases significantly, indicating overfitting. Based on these comparison results, a layered water injection environment is constructed in Python for agent training. The optimal neural network parameters are shown in Table 5.

Table 5 Reinforcement learning network training hyperparameters.

The size of the experience replay buffer increases with the number of layer segments (50,000 for three layer segments) to store high-dimensional data and enhance generalization. The batch size is increased from 64 to 512 to reduce gradient variance. A soft-update parameter of 0.005 balances the retention of historical experience and the absorption of new policy information. With 500 training episodes and a maximum of 200 steps per episode, a discount factor of 0.9 emphasizes short-term rewards. These hyperparameters, optimized for layered water injection well regulation with the SAC algorithm, were determined through extensive trial training.
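Collected in one place, the reported training settings look as follows; the dictionary and its key names are our own arrangement of the values given in the text, and the values that vary with the number of layer segments (buffer and batch size) are shown at their three-layer settings.

```python
# Training configuration assembled from the hyperparameters reported in the text (Table 5);
# the dictionary keys are our own naming, shown here only for readability.
sac_config = {
    "learning_rate": 3e-5,         # one of the converging rates identified above
    "replay_buffer_size": 50_000,  # value used for the three-layer-segment environment
    "batch_size": 512,             # upper end of the 64-512 range
    "soft_update_tau": 0.005,
    "discount_gamma": 0.9,
    "episodes": 500,
    "max_steps_per_episode": 200,
}
```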

Training result analysis

Each agent was placed in the water injection environment for 500 training steps, with a stopping condition of achieving less than 5% error. To verify the performance of the algorithms, 100 groups of water injection environments were randomly generated under the same conditions, and the algorithms were compared in terms of regulation error, number of adjustment steps, and other performance aspects. Some initial data of the injection environments are shown in Table 6.

Table 6 Initial parameters of partial water injection environment.

To analyze the performance of different continuous-action-space reinforcement learning algorithms for flow regulation in layered water injection wells, the rewards of the PPO, DDPG, and SAC algorithms during training are compared, as shown in Fig. 5, which presents the reward values over 500 training steps in environments with one to three layer segments.

Fig. 5

The reward curve under different water injection layers.

The PPO, DDPG, and SAC algorithms all reach the reward ceiling and converge after 500 training steps. As the number of layer segments increases, the reward fluctuations of the algorithms increase significantly. As shown in Fig. 5a, the maximum reward value of each algorithm is close to 0, but PPO fluctuates the most drastically; Fig. 5b shows that the SAC rewards are concentrated in the range of -500 to 0 (with the slightest fluctuation), DDPG lies in the range of -2500 to 0, and PPO extends to -5000 to -500; in Fig. 5c, SAC rapidly rises to a stable value near 0 in the first 50 steps, DDPG stays in the range of -2500 to -500 with small fluctuations, while PPO oscillates violently between -8000 and -1500. As the number of layer segments and thus the complexity of the water injection environment increases, the spread of the reward values of each algorithm widens, but the stability varies significantly: SAC maintains the highest reward and the lowest fluctuation, DDPG ranks second, and PPO has the lowest reward and the largest fluctuation.

Regulation error analysis

To analyze the absolute error performance of each algorithm under different numbers of water injection layer segments, the same water injection environment was used with a regulation error target of 0% and 50 consecutive control steps. The error evolution of the PPO, DDPG, and SAC algorithms was analyzed for different numbers of layer segments, and the error curves are shown in Fig. 6. The horizontal axis represents the number of adjustment steps, and the vertical axis represents the relative absolute error (RAE).

Fig. 6

Error curves under different water injection layers.

As shown in Fig. 6a–i, the experimental results indicate that in the single-layer segment scenario the convergence of the PPO algorithm is accompanied by significant oscillations, while both DDPG and SAC converge to zero relative absolute error within 20 steps. In the two-layer segment environment, the minimum error of PPO is as high as 0.2, whereas DDPG and SAC achieve simultaneous dual-channel convergence, with SAC converging faster. In the three-layer segment scenario, PPO shows a linear decrease in error, but the interlayer error difference is significant (0.1–0.4). In contrast, both DDPG and SAC show exponential convergence characteristics, and SAC achieves near-zero error control multiple times in layer segment 2, demonstrating better multi-objective coordination capability.

Fig. 7

Analysis of water injection error under different algorithms.

To compare the robustness of the algorithms' regulation performance, the relative absolute error distributions under 100 sets of random initial conditions are analyzed, with a maximum of 100 adjustment steps and a stopping error of 5%; the regulation errors of the different algorithms are shown in Fig. 7. The SAC algorithm performs best in terms of convergence stability (average relative absolute error 0.05 ± 0.005) and adaptability to complex environments, whereas the PPO algorithm exhibits the largest error fluctuation (0.37 ± 0.1). The error difference between DDPG and SAC in the single-layer segment scenario is only 0.01, but as the number of layer segments increases to three, their error increases are 0.07 and 0.04, respectively, both significantly lower than the 0.26 increase for PPO.

Table 7 Pass-rate comparison of different algorithms.

The results of the pass-rate comparison are shown in Table 7, with SAC leading in the single-layer segment scenario with a 98% pass rate (DDPG 95%, PPO 79%). As the complexity increases to three layer segments, SAC still maintains an 81% pass rate (mean 88%), significantly higher than DDPG (32%/60%) and PPO (14%/45%). The data indicate that the SAC algorithm has a stronger robustness advantage in multi-objective cooperative control scenarios.

Step analysis

The regulation performance analysis is shown in Fig. 8. With the nozzle opening as the core control variable, its dynamic response characteristics, adjustment frequency, and convergence trajectory directly determine the service life of the water injection device and the measurement and adjustment efficiency. This experiment compares the regulation performance of each algorithm under the zero-error constraint and reveals how deep reinforcement learning optimizes the dynamic characteristics of the actuator.

Fig. 8

Analysis of water injection opening variation under different algorithms.

As shown in Fig. 8a, the opening under the PPO algorithm fluctuates significantly but approaches the optimal value several times, while both DDPG and SAC achieve accurate tracking. Figure 8b shows that for PPO, layer segment 1 exhibits a descending trend whereas the deviation of layer segment 2 is prominent; meanwhile, DDPG converges to the optimal opening in about 60 steps, and SAC converges earlier, within about 20 steps. Figure 8c indicates that the deviation of PPO in layer segments 1 and 3 is about 30% and reaches 60% in layer segment 2. By contrast, DDPG maintains a stable deviation of 2%, while SAC nearly coincides with the optimal opening, demonstrating the best regulation performance.

Taking a relative absolute error of 5% as the adjustment target, the distribution of adjustment steps for the different algorithms is analyzed. The distribution of adjustment steps for each algorithm is shown in Fig. 9.

Fig. 9

The distribution of adjustment steps under different algorithm.

The average number of adjustment steps of the SAC algorithm is 16.04 in the single-layer segment environment, 50.40 in the two-layer segment environment, and 55.32 in the three-layer segment environment, an increase of 39.26 steps. The DDPG algorithm requires an average of 23.85 steps in the one-layer segment environment and 85.19 steps in the three-layer segment environment, an increase of 61.34 steps. The PPO algorithm requires 36.17 steps in the one-layer segment environment and 95.02 steps in the three-layer segment environment, an increase of 58.85 steps. The average number of adjustment steps of the SAC algorithm across the different environments is 40.59, much smaller than 68.86 for the PPO algorithm and 61.34 for the DDPG algorithm.

Table 8 Probability distribution table for different algorithms.

As shown in Table 8, the SAC algorithm completes regulation within 1–20 steps with a probability of 31% and within 1–50 steps with a probability of 69%, much higher than PPO (33%) and DDPG (47%). The probability of SAC falling in the 90–100 step range is 15%, much lower than PPO (55%) and DDPG (41%). Thus, the SAC algorithm adjusts faster than both the PPO and DDPG algorithms.

Conclusion

This study compares and analyzes the performance of three reinforcement learning algorithms, PPO, DDPG, and SAC, in layered water injection regulation. Experiments show that all three algorithms can converge to a stable state in basic training, but SAC demonstrates outstanding advantages in complex scenarios, including the fastest convergence of its regulation error, with an average error of 5% (standard deviation 0.005), and an average qualification rate of 88% when an injection error of < 5% is taken as the qualification criterion, significantly higher than PPO (45%) and DDPG (60%). Meanwhile, SAC reaches the target with fewer adjustment steps (a 69% completion probability within 1–50 steps under the 5% error threshold) and performs best in control accuracy, stability, and environmental adaptability, providing an efficient solution for water injection regulation.

This study is currently limited to the theoretical simulation stage of intelligent regulation for layered water injection wells. However, the real-world application of intelligent regulation remains essential. To advance its practical implementation, the intelligent regulation algorithm will be deployed locally on near-wellbore surface hosts or embedded water distribution systems via serial communication, thereby avoiding IoT-related latency. Meanwhile, we will explore model optimization techniques to enhance edge deployment feasibility and further develop a hardware testing platform and field trials for intelligent control of layered water injection wells.