Introduction

A durian orchard is a large-area, high-density agricultural plantation. Its management and maintenance require efficient patrol operations to ensure the healthy growth and high yield of durian1. However, because durian orchards usually cover large areas, considerable manpower must be invested in patrolling, especially during peak periods. In addition, the cost of agricultural labor is rising year by year, which makes manual patrols of durian orchards increasingly expensive. It is also difficult to ensure full coverage of the entire orchard, so some areas are missed and the patrol effect suffers. Because the orchard environment is large, manual patrols require long walks and consume a lot of physical strength2. With the application of drone technology in agriculture, durian orchards have also begun to use drones for patrol operations. Drones can cover large areas quickly, provide high-resolution real-time images and data, and improve patrol efficiency and accuracy. However, drone patrol operations still have problems. The autonomous flight path planning and navigation accuracy of drone swarms is insufficient, so they easily deviate from the planned route or fail to cover all target areas3. Due to the complex terrain of the durian orchard, the drone swarm needs to fly at different altitudes, which increases the difficulty of path planning. Weather conditions (such as wind speed, wind direction and rainfall) also affect the flight stability of the drone swarm, causing trajectory deviation, mutual interference and collisions. Therefore, drone swarms need to recalculate and optimize paths to cope with changes in the dynamic environment. They also need to share real-time location information and status data; if this synchronization is not timely, the drones cannot coordinate and perform their tasks4.

Literature review

In order to solve the above-mentioned motion-cruising problems, scholars have made progress in several application scenarios. Miao et al.5 proposed a heuristic algorithm for the slice admission control (SAC) problem in 5G/B5G networks. It uses the resource efficiency of the admitted service to correct priority violations, then sets a target CSAR for each service type and improves its actual CSAR to improve accuracy. Sun et al.6 designed a task scheduling algorithm based on a proportional-fairness-aware auction and proximal policy optimization (PFAPPO). It allocates computing resources reasonably to each drone so that the drone learns the computing resources available at each offloading destination, solving the problems of extremely long queue delays and load imbalance. Li et al.7 proposed an integrated decision-making and motion planning framework with non-oscillation capability to overcome the shortcomings of autonomous driving in lane change/keeping operations. They also designed a belief decision planner with predicted trajectory uncertainty, which provides more appropriate information for autonomous driving planning and solves for the optimal motion sequence. Xu et al.8 used sequential images of adjacent stations and the bundle adjustment (BA) method to obtain the precise position of the lunar rover, and proposed a cross-scale cost aggregation stereo matching network to obtain disparity maps and extract impact crater indicators, so as to realize precise positioning and sample collection at impact craters on the lunar surface by the lunar rover. Wang et al.9 explored basic characteristics of human legs, such as linearity and nonlinearity during movement, and used them for stability analysis and accurate motion prediction of robots and rehabilitation exoskeletons. Wu et al.10 proposed an anonymous clustering algorithm for obstacle avoidance based on the locations of obstacle boundary points. It divides the consensus term into speed and speed unit direction and designs gradient-based terms to achieve separation and aggregation of agents and obstacle boundary points, which guides all agents to achieve group target following.

Lu et al.11 described the current principles and development of neuromorphic computing technology and explored its potential applications and future development routes in smart agriculture. Li et al.12 introduced a highly configurable intelligent agricultural robotic arm system (CARA). It integrates a highly configurable robotic arm, an image acquisition module and a deep processing center to facilitate accurate and efficient agricultural tasks. Zhou et al.13 developed a hybrid architecture inspired by vehicle lateral dynamics, embedding data-driven models into physical models for parameter identification and error characterization, and achieving accurate and interpretable modeling. Chen et al.14 designed a spatial attention mechanism with a feature fusion module to calculate the weights of different channel features. They also developed a hybrid model combining physical and dual-attention neural networks to model vehicle lateral dynamics, addressing the difficulty of achieving accurate prediction with neural networks trained on limited data. Meng et al.15 combined mobile navigation with visual perception, using advanced algorithms to grasp objects in a way that suits human preferences and using path planning and obstacle avoidance to navigate back to the human user. Li et al.16 proposed a road segmentation method based on centroid Voronoi tessellation (CVT) for brain-controlled robot navigation via an asynchronous BCI, generating optional navigation targets in the road area for arbitrary target selection. Zhou et al.17 proposed a drone anomaly detection method based on wavelet decomposition and a stacked denoising autoencoder. It takes into account the negative impact of noisy data and the feature extraction ability of deep learning models, aiming to improve the accuracy of the anomaly detection method.

Chen et al.18 proposed a fair and efficient MAC protocol based on CSMA/CA, using multi-user MIMO to achieve concurrent uplink transmission from different drones. Wang et al.19 proposed a drone-assisted URLLC scheme for edge users, using the age of information as an indicator of system delay to achieve the performance requirements of ultra-reliable low-latency communication. Gao et al.20 proposed a theoretical closed-form energy model for rotorcraft UAVs and obtained its modeling parameters through curve fitting based on the existing literature. Yin et al.21 formulated the autonomous navigation of UAVs with adaptive control in a three-dimensional environment as a Markov decision process and proposed a deep reinforcement learning algorithm. They also proposed a new speed constraint loss function and added it to the original actor loss to improve the speed control ability of the UAV. Zhang et al.22 proposed an adaptive pseudo-inverse control scheme based on a fuzzy logic system (FLS) and a barrier Lyapunov function (BLF) for a class of state-constrained hysteresis nonlinear systems, which shows great application potential in the fields of soft bionic robots and rehabilitation robots. Ji et al.23 proposed a multi-agent deterministic policy gradient (MADPG) method based on an actor-critic network and proved its convergence and optimality by minimizing the local cost (Q-function), thereby improving the data utilization of the network. Liang et al.24 proposed an integrated framework that combines three basic modules (behavior decision-making, path planning and motion control) to improve the safety of AVs in mixed-traffic high-speed cruising scenarios.

Literature conclusion

This section compares the EN-MASCA algorithm with other studies in terms of main findings, quantitative results, and key similarities and differences. Although some of these studies (such as11,13,17,21) also combine reinforcement learning, deep learning or hybrid modeling techniques to improve model performance, as this study does, they aim to optimize path planning, obstacle avoidance and task completion efficiency through advanced algorithms. Other studies (such as12,15,18) emphasize the practical application of smart agriculture and multi-UAV swarm operations, which is similar to the research direction of this study. However, the above studies focus on smart agriculture, brain-controlled robot navigation, and vehicle dynamic modeling, while this study focuses on UAV swarm tasks in complex agricultural scenarios. This study also combines the DQN and PPO algorithms for the first time to optimize UAV swarm path planning and obstacle avoidance, and designs a virtual navigator model to enhance environmental adaptability. Table 1 shows the comparison results of the EN-MASCA algorithm with some key studies.

Table 1 The comparison results of the EN-MASCA algorithm with some key studies.

The above studies are similar to this study in terms of overall goals: all aim to improve navigation, path planning and task completion efficiency of autonomous systems such as drone swarms through advanced algorithms. They lay the theoretical foundation for this study, especially in terms of reinforcement learning, dynamic path planning and group behavior modeling. Most of them use reinforcement learning, deep neural networks, heuristic algorithms and similar techniques for path planning and obstacle avoidance, and some combine task allocation or data fusion technology to improve algorithm performance. Although they have achieved some success, most are aimed at ideal or relatively simple path planning and task allocation problems and lack applicability to complex dynamic environments (such as agricultural scenes or high-density obstacle environments). Some studies struggle to meet the requirements of real-time, efficient computing in practical applications because of high algorithmic complexity or the need for large-scale computing resources. Other studies focus on generalized task planning and navigation problems, lack optimization designs for specific fields (such as agricultural drone patrols or disaster monitoring) and cannot fully meet the needs of those fields. Therefore, this study designs a virtual navigator model and a six-degree-of-freedom drone motion simulation model, and achieves real-time path optimization and efficient collaboration by introducing an enhanced multi-agent swarm control algorithm (EN-MASCA) that integrates the DQN and PPO algorithms. It solves the problem of poor adaptability of the above methods in complex scenarios and greatly improves the stability and efficiency of drone swarms in complex agricultural scenarios such as durian orchards.

Methods and materials

Drone model construction

This study combines ROS with the GAZEBO physical simulation platform16: the drone is controlled and its sensor data obtained through ROS, and a 3D simulation model of the drone is built in GAZEBO. It also introduces the PX4 flight control system, and the overall architecture is divided into the human-computer interaction layer, the cluster algorithm layer, the PX4 flight control layer and the physical simulation layer. The top layer, such as the ground station or offboard node, outputs the desired state of the drone and passes it to the PX4 flight control layer17. Finally, the flight controller transmits its attitude information to the GAZEBO simulator for 3D display. In order to control the drones in the cluster, this study sets three control variables: speed \({{\text{V}}_i}\), yaw angle \({{\text{R}}_i}~\) and altitude \({{\text{H}}_i}\). The pitch angle, yaw angle and throttle are used to control the PX4 flight controller. To enable the clustering algorithm to control PX4, this study converts the algorithm's output commands into the input commands of the PX4 attitude control loop, using a PI control loop and its control parameters to complete this conversion. Figure 1 shows the relationship between the cluster algorithm, the PX4 flight controller and the PI control loop.

Fig. 1
figure 1

The relationship between the cluster algorithm, PX4 flight controller and PI control loop.

Simulation environment construction

This study also collected data on the various obstacles and terrain in the durian garden, used height maps to create the real terrain, and made 3D models of the durian garden and obstacles. It used GPS equipment and drone aerial photography to build the terrain and topography model of the durian garden. It used laser rangefinders and tape measures to measure the height, width and depth of obstacles and to obtain the position coordinates (x, y, z) of obstacles in the actual environment. It also used GPS equipment to determine the exact position of each obstacle and target, and took corresponding pictures in the center of the experimental area for model data correction18. This study then used GAZEBO's modeling tools to create the corresponding geometric models (such as trees and rocks) based on the manually measured dimensions. It used the surface texture and color information captured on-site to attach corresponding materials and maps to the geometric models, and used GAZEBO's material editor for detailed adjustments. It placed the geometric models at the corresponding positions in the virtual environment, used the measured coordinates for precise positioning, and set the physical properties of the models (such as hardness and reflectivity). It also introduced dynamically changing factors such as wind speed, wind direction and moving obstacles to simulate real operating scenarios. The durian garden area and surrounding environment patrolled by the drone group are shown in Fig. 2.

Fig. 2
figure 2

The durian garden area and surrounding environment patrolled by the drone group.
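To illustrate the environment construction step described above, the following is a minimal sketch (not the authors' actual tooling) of how a surveyed obstacle, with its measured coordinates and box dimensions, could be turned into a static GAZEBO SDF model; the obstacle name and dimensions shown are illustrative assumptions.

```python
# A minimal sketch (not the authors' tooling) of turning a surveyed obstacle
# (measured position and box dimensions) into a static GAZEBO SDF model.
def obstacle_to_sdf(name: str, x: float, y: float, z: float,
                    size_x: float, size_y: float, size_z: float) -> str:
    """Return an SDF <model> string placing a static box obstacle at (x, y, z)."""
    return f"""
<model name="{name}">
  <static>true</static>
  <pose>{x} {y} {z} 0 0 0</pose>
  <link name="link">
    <collision name="collision">
      <geometry><box><size>{size_x} {size_y} {size_z}</size></box></geometry>
    </collision>
    <visual name="visual">
      <geometry><box><size>{size_x} {size_y} {size_z}</size></box></geometry>
    </visual>
  </link>
</model>"""

# Example: a hypothetical rock obstacle measuring 1.2 x 0.8 x 0.6 m at surveyed coordinates.
print(obstacle_to_sdf("rock_01", 4.0, 5.0, 0.3, 1.2, 0.8, 0.6))
```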

Multi-agent swarm control algorithm

Bio-clustering behavior is a natural phenomenon: it is a social behavior that biological groups use to adapt to their environment. The survival ability of animals such as bird flocks and fish schools, evolved over a long period, is based on cohesion, separation and alignment18. The multi-agent swarm algorithm simulates the characteristics of biological swarms and the synchronous motion of autonomous systems composed of multiple particles19. Its rules are as follows:

  • The agents moving in the system have a constant velocity \({\text{s}}\);

  • Any pair of agents in the system has an influence radius \({{{\upomega}}}\). They only influence each other when the straight-line distance is less than \({{{\upomega}}}\).

  • The movement direction of each agent at each moment tends to be consistent with the average movement direction of all other agents that influence it.

In this model, agent \({\text{m}}\) has a constant speed \({\text{v}}\); its displacement is \({{\text{W}}_m}(t)\) and its velocity is \({{\text{V}}_m}(t)\), which satisfy:

$$\frac{{{\text{d}}{{\text{W}}_m}(t)}}{{dt}}={{\text{V}}_m}(t)$$
(1)
$$\frac{{{\text{d}}{{\text{V}}_m}(t)}}{{dt}}=\mu \mathop \sum \limits_{{n \in K,n \ne m}} {{\text{R}}_{mn}}\left( {\left| {{{\text{W}}_m}(t) - {{\text{W}}_n}(t)} \right|} \right) \times \left( {{{\text{V}}_n}(t) - {{\text{V}}_m}(t)} \right)~$$
(2)
$${{\text{R}}_{mn}}(t)=\frac{{H\left( {\left| {{{\text{W}}_m}(t) - {{\text{W}}_n}(t)} \right|} \right)}}{K}$$
(3)
$${\text{H}}\left( {{{\upomega}}} \right)=\frac{1}{{{{\left( {1+{{\left| \omega \right|}^2}} \right)}^\rho }}}~$$
(4)
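The following minimal sketch (not the authors' code) illustrates the dynamics of Formulas (1)–(4): each agent's velocity relaxes toward the velocities of its neighbors within the influence radius, weighted by the distance-dependent step function H. The parameter values (mu, rho, omega, dt) are illustrative assumptions.

```python
# A minimal sketch of the swarm alignment dynamics in Formulas (1)-(4).
import numpy as np

def step_function(dist_sq: np.ndarray, rho: float = 1.0) -> np.ndarray:
    """H(w) = 1 / (1 + |w|^2)^rho, Formula (4)."""
    return 1.0 / (1.0 + dist_sq) ** rho

def swarm_step(W: np.ndarray, V: np.ndarray, mu: float = 0.5,
               rho: float = 1.0, omega: float = 5.0, dt: float = 0.05):
    """One integration step of Formulas (1)-(3) for K agents in 3D.

    W: (K, 3) displacements, V: (K, 3) velocities."""
    K = W.shape[0]
    diff = W[:, None, :] - W[None, :, :]          # pairwise W_m - W_n
    dist_sq = np.sum(diff ** 2, axis=-1)          # |W_m - W_n|^2
    H = step_function(dist_sq, rho)               # Formula (4)
    H *= (np.sqrt(dist_sq) < omega)               # only neighbors within omega interact
    np.fill_diagonal(H, 0.0)                      # no self-interaction
    R = H / K                                     # Formula (3)
    # dV_m/dt = mu * sum_n R_mn * (V_n - V_m), Formula (2)
    dV = mu * np.einsum('mn,mnd->md', R, V[None, :, :] - V[:, None, :])
    V_new = V + dt * dV
    W_new = W + dt * V_new                        # Formula (1)
    return W_new, V_new
```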

In Formulas (1)–(4), \({{\text{W}}_m}\left( t \right)\) and \({{\text{W}}_n}\left( t \right)\) represent the displacements of agents m and n; \({{\text{V}}_m}\left( t \right)\) and \({{\text{V}}_n}\left( t \right)\) represent the velocities of agents m and \({\text{n}}\); \(\mu\) is a constant; \({{\text{R}}_{mn}}\) represents the mutual influence coefficient between agents \(~m\) and \({\text{n}}\); K is a constant; \(~H\left( {\left| {{{\text{W}}_m}\left( t \right) - {{\text{W}}_n}\left( t \right)} \right|} \right)\) represents the step size function; \({{{\upomega}}}\) represents the influence radius; and ρ is a constant. According to the above rules, the patrol operation space of the drone swarm in the durian garden is modeled as a three-dimensional Euclidean space. The dynamics of each agent are modeled as a second-order integrator, as shown in Formula (5):

$$\left\{ {\begin{array}{*{20}{c}} {\overline {{{a_m}}} ={b_m}} \\ {\overline {{{b_m}}} ={c_m}} \end{array}} \right.,\quad ~m=1,\;2, \ldots ,\;{\text{K}}$$
(5)

In Formula (5), \(~{a_m}\), \({b_m}\) and \({c_m} \in {D^k}\) represent the position, velocity and control input of the m-th agent. The agent m can only communicate with the adjacent agents in the communication area. At time t, the set of adjacent agents is shown in Formula (6):

$${\text{K}}_{m}^{\varphi }\left( t \right)=\left\{ {n:\left| {{a_m} - {a_n}} \right| \leqslant {{{\upomega}}},\;\;n=1,\;2, \ldots ,\;X,\;n \ne m} \right\}$$
(6)

In Formula (6), \(~\left| {{a_m} - {a_n}} \right|\) represents the Euclidean distance between agents m and n, and \({{{\upomega}}}\) represents the maximum interaction range or critical distance. The desired geometric configuration of the cluster requires that each agent be equidistant from all of its neighbors, which satisfies the following constraint:

$$\left| {{a_m} - {a_n}} \right|=\tau ,\quad \forall m,n \in {K_m}\left( t \right)$$
(7)

In Formula (7), \(~\tau\) is a positive constant representing the desired (minimum allowable) distance between each pair of adjacent agents, with \(\tau \leqslant \omega\). In a multi-obstacle environment, the control input of each agent in the multi-agent control algorithm is divided into three parts20:

$${{\text{P}}_m}=P_{m}^{x}+P_{m}^{y}+P_{m}^{z}$$
(8)

In Formula (8), following Olfati-Saber's theory, \({\text{~}}x,\;y\) and z denote three types of agents. Agent \(x~\) represents any physical agent in the swarm; agent y is a virtual agent formed by the projection of agent x onto the obstacle surface and is used to avoid physical obstacles; agent z is used to construct the navigation feedback and represents the target to be tracked. \(P_{m}^{x}\) represents the \(\left( {{\text{x}},{\text{x}}} \right)\) interaction term, \(P_{m}^{y}\) represents the \({\text{~}}\left( {{\text{x}},{\text{y}}} \right)\) interaction term, and \(P_{m}^{z}\) represents the distributed navigation feedback. \(P_{m}^{x}\), \(P_{m}^{y}\) and \(P_{m}^{z}\) are defined in Formulas (9), (10) and (11):

$$P_{m}^{x}= - e_{m}^{x}\mathop \sum \limits_{{n \in {\text{K}}_{m}^{x}}} {{\text{S}}_\alpha }\left( {{a_m}} \right){{\text{H}}_{\text{x}}}\left( {{a_m}} \right) - P_{m}^{x}\mathop \sum \limits_{{n \in {\text{K}}_{m}^{\varphi }}} {R_{mn}}\left( {{a_m}} \right)\left( {{b_m} - {b_n}} \right)$$
(9)
$$P_{m}^{y}= - e_{m}^{y}\mathop \sum \limits_{{n \in {\text{K}}_{m}^{y}}} {{\text{O}}_{m,i}}\left( {{a_m}} \right){{\text{H}}_{\text{y}}}\left( {{a_m}} \right) - P_{m}^{x}\mathop \sum \limits_{{n \in {\text{K}}_{m}^{\varphi }}} {{\text{O}}_{m,i}}\left( {{a_m}} \right)\left( {{b_m} - \overline {{{b_{m,i}}}} } \right)$$
(10)
$$P_{m}^{z}= - e_{m}^{z}{{{\uprho}}}\left( {{a_m} - {a_z}} \right) - e_{m}^{z}\left( {{b_m} - {b_z}} \right) - e_{m}^{z}{A_\varepsilon }$$
(11)

In Formulas (9)–(11), \({\text{~}}e_{m}^{x}\) represents a constant, \({\text{K}}_{m}^{x}\) represents the set of adjacent agents in the direction \({\text{x}}\), \({{\text{S}}_\alpha }\left( {{a_m}} \right)\) represents the impact function, \({\text{~}}{{\text{H}}_{\text{x}}}\left( {{a_m}} \right)\) represents the step function, \({R_{mn}}\) represents the mutual influence coefficient between agents m and \({\text{n}}\), and \({b_m}\) and \({b_n}\) represent the speeds of agents m and \({\text{n}}\). \(e_{m}^{y}\), \(e_{m}^{z}\) and \({A_\varepsilon }\) represent constants; \({\text{K}}_{m}^{y}\) represents the set of neighboring agents in the direction \(~y\); \({{\text{O}}_{m,i}}\left( {{a_m}} \right)\) represents the influence coefficient; \(\overline {{{b_{m,i}}}}\) represents the speed of the virtual agent; \({{{\uprho}}}\left( {{a_m} - {a_z}} \right)\) represents the distance function; and \({\text{~}}{a_z}\) and \({b_z}\) represent the position and speed of the virtual navigator. \({\text{~}}P_{m}^{x}\) is the aggregation term, which has two parts: the first part regulates the distance between agents, and the second part makes the agent's speed consistent with the speed of its neighbors. The specific expressions are as follows:

$${{\text{H}}_{\text{x}}}\left( {{a_m}} \right)=\frac{{{L_{mn}}}}{{\sqrt {1+{\alpha _x}{{\left| {{L_{mn}}} \right|}^2}} }}$$
(12)
$${L_{mn}}=\left( {{a_m} - {a_n}} \right) - \frac{{{a_m} - {a_n}}}{{\left| {{a_m} - {a_n}} \right|}}~ \times \beta$$
(13)
$${{\text{S}}_\alpha }\left( {{a_m}} \right)=\frac{{{{\left( {\left| {{a_m} - {a_n}} \right| - \beta ~} \right)}^2}}}{\alpha }+1$$
(14)

In Formulas (12)–(14), \(\alpha\), \({\text{~}}e_{m}^{x}\), \({\text{~}}e_{m}^{y}\) and \(e_{m}^{z}\) represent constants, and the value of \(\alpha\) is greater than \(\beta\). Fragmentation is a known drawback of the Olfati-Saber clustering algorithm; it can be effectively prevented by introducing \({{\text{S}}_\alpha }\left( {{a_m}} \right)\): when the distance between agents increases, the value of \({{\text{S}}_\alpha }\left( {{a_m}} \right)\) also increases rapidly. The second component of \(P_{m}^{x}\) uses \({L_{mn}}\left( {{a_m}} \right)={{\text{S}}_\alpha }\left( {\frac{{\left| {{a_m} - {a_n}} \right|}}{{{{\upomega}}}}} \right) \in \left[ {0,1} \right],\;m \ne n\). \({{\text{S}}_\alpha }\left( \gamma \right)\) is an impact function defined as follows21:

$${S_\alpha }\left( \gamma \right)=\left\{ {\begin{array}{*{20}{c}} {1,}&{\gamma \in \left[ {0,\alpha } \right)} \\ {0.5 \times \left[ {1+\cos \left( {\pi \frac{{\gamma - \alpha }}{{1 - \alpha }}} \right)} \right],}&{\gamma \in \left[ {\alpha ,1} \right]} \\ {0,}&{{\text{otherwise}}} \end{array}} \right.~$$
(15)
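As a concrete illustration, the impact (bump) function of Formula (15) can be implemented as follows; this is a sketch under the stated piecewise definition, with the value of α as an illustrative assumption.

```python
# A minimal sketch of the impact (bump) function S_alpha in Formula (15),
# which decays smoothly from 1 to 0 and keeps the interaction weights bounded.
import math

def impact_function(gamma: float, alpha: float = 0.2) -> float:
    """S_alpha(gamma): 1 on [0, alpha), a cosine roll-off on [alpha, 1], else 0."""
    if 0.0 <= gamma < alpha:
        return 1.0
    if alpha <= gamma <= 1.0:
        return 0.5 * (1.0 + math.cos(math.pi * (gamma - alpha) / (1.0 - alpha)))
    return 0.0
```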

In Formula (15), γ represents the input of the impact function and \(\alpha\) represents a constant. \(P_{m}^{z}\) makes the agent track the virtual navigator or the desired trajectory, where \(e_{m}^{z}\), \(e_{n}^{z}\) and \(e_{\alpha }^{z}\) are positive constants and \({a_z}\) and \({b_z}\) represent the position and speed of the virtual navigator. The distance function \({{{\uprho}}}\) and \({A_\varepsilon }\) are given as follows22:

$${{{\uprho}}}\left( {{a_m} - {a_n}} \right)=\frac{{{a_m} - {a_n}}}{{\sqrt {1+{\alpha _z}{{\left| {{a_m} - {a_n}} \right|}^2}} }}$$
(16)
$${{\text{A}}_\alpha }=\left[ {\begin{array}{*{20}{c}} 0 \\ 0 \\ {a_{m}^{\alpha }} \end{array}} \right] - \left[ {\begin{array}{*{20}{c}} 0 \\ 0 \\ {a_{z}^{\alpha }} \end{array}} \right]$$
(17)

In Formulas (16) and (17), \({{{\uprho}}}\) represents the distance function, \({a_m}\) and \({\text{~}}{a_n}\) represent the positions of agents m and n, and \({\alpha _z}\) represents a constant. \(~a_{m}^{\alpha }\) represents the height of agent m and \(a_{z}^{\alpha }~\) represents the height of the virtual navigator. The purpose of \({{\text{A}}_\alpha }\) is to minimize the height difference between agents, making them track the height of the virtual navigator. \(P_{m}^{y}\) makes the agent bypass obstacles, where \(e_{m}^{y}\) and \(e_{n}^{y}\) are positive constants. A virtual agent \(\partial\) with its own position and velocity is constructed on the obstacle surface within the detection range of the agent. The construction method is as follows:

(1) For an obstacle with a hyperplane boundary that has a unit normal \({a_i}\) and passes through the point \({{\text{A}}_\alpha }\), the position and velocity of the virtual agent \(\partial\) are determined by:

$$\overline {{{{\text{a}}_{m,i}}}} =F \times {a_m}+\left( {1 - F} \right){{\text{A}}_\alpha },\quad ~\overline {{{{\text{b}}_{m,i}}}} ~=F{b_m}$$
(18)

In Formula (18), \({\text{~}}\overline {{{{\text{a}}_{m,i}}}}\) and \(\overline {{{{\text{b}}_{m,i}}}}\) represent the position and velocity of the virtual agent \(\partial {\text{~}}\), \(F~=\delta - {a_i}a_{i}^{T}~\) is a projection matrix, and \({{\text{A}}_\alpha }\) represents the point through which the hyperplane boundary passes.

(2) For a spherical obstacle with radius \({{\text{Q}}_i}\) centered at position \({{\text{d}}_i}\), the position and velocity of the virtual agent \(\partial\) are given as follows:

$$\overline {{{{\text{a}}_{m,i}}}} =P \times {a_m}+\left( {1 - P} \right){{\text{d}}_i},\quad ~\overline {{{{\text{b}}_{m,i}}}} ~=P{b_m}$$
(19)

In Formula (19), \({{\text{d}}_i}\) represents the center point of the obstacle. Following the Olfati-Saber construction, the projection operator P is built from the scalar factor \({\text{~}}\frac{{{{\text{Q}}_i}}}{{\left| {{a_m} - {{\text{d}}_i}} \right|}}\), the unit normal \(\frac{{{a_m} - {{\text{d}}_i}}}{{\left| {{a_m} - {{\text{d}}_i}} \right|}}\) and the matrix \(\delta - {a_i}a_{i}^{T}\); with these quantities the virtual agent \(\partial\) is constructed. It keeps the speed of each individual in the cluster consistent with that of the virtual agents while maintaining a certain distance from the obstacle, as shown in Fig. 3.

Fig. 3
figure 3

The position and velocity of the agent \(\partial\).
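The construction of the virtual obstacle agent in Formulas (18) and (19) can be sketched as follows (not the authors' code); it follows the Olfati-Saber-style projection onto a hyperplane or a sphere, and the symbol names mirror the text while the numerical values are illustrative.

```python
# A minimal sketch of constructing the virtual obstacle agent (Formulas 18-19).
import numpy as np

def virtual_agent_on_plane(a_m, b_m, normal, point):
    """Project agent position/velocity onto a hyperplane obstacle (Formula 18)."""
    n = normal / np.linalg.norm(normal)
    F = np.eye(3) - np.outer(n, n)                 # projection matrix F = I - n n^T
    a_hat = F @ a_m + (np.eye(3) - F) @ point      # position of the virtual agent
    b_hat = F @ b_m                                # velocity of the virtual agent
    return a_hat, b_hat

def virtual_agent_on_sphere(a_m, b_m, center, radius):
    """Project agent position/velocity onto a spherical obstacle (Formula 19)."""
    rel = a_m - center
    dist = np.linalg.norm(rel)
    mu = radius / dist                             # scalar factor Q_i / |a_m - d_i|
    n = rel / dist                                 # unit normal
    P = np.eye(3) - np.outer(n, n)                 # projection matrix
    a_hat = mu * a_m + (1.0 - mu) * center         # position of the virtual agent
    b_hat = mu * (P @ b_m)                         # velocity of the virtual agent
    return a_hat, b_hat
```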

The \({{\text{O}}_{m,i}}\left( {{a_m}} \right)\) and \({{\text{H}}_{\text{y}}}\left( {{a_m}} \right)\) terms of \(P_{m}^{y}~\) are defined as:

$${{\text{H}}_{\text{y}}}\left( {{a_m}} \right)=\frac{{{a_m} - \overline {{{a_{m,i}}}} }}{{\sqrt {1+{\alpha _y}{{\left| {{a_m} - \overline {{{a_{m,i}}}} } \right|}^2}} }} - 1$$
(20)
$${{\text{O}}_{m,i}}\left( {{a_m}} \right)={{\text{S}}_\alpha } \times \left( {\frac{{\left| {{a_m} - \overline {{{a_{m,i}}}} } \right|}}{\delta },\;{\varepsilon _y}} \right)$$
(21)

In Formulas (20) and (21), \({{\text{H}}_{\text{y}}}\left( {{a_m}} \right)\) represents the step size function; \(\overline {{{a_{m,i}}}}\) represents the projection of the agent \({a_m}\) on the obstacle surface; \({\alpha _y}\) represents a positive constant; \({\text{~}}{{\text{O}}_{m,i}}\left( {{a_m}} \right)\) represents the influence function; \({{\text{S}}_\alpha }\) represents the impact function; \({\varepsilon _y}\) is the constant parameter of the impact function; and \(\delta\) represents the maximum detection distance of the drone relative to the obstacle.

Improved multi-agent swarm control algorithm

In order to improve the performance of the multi-agent swarm control algorithm, this study introduces DQN (Deep Q Network)23, a reinforcement learning algorithm that combines Q-learning with neural networks. It learns the behavior value function corresponding to the optimal strategy by minimizing the loss between the network output and the Q-learning target value. The experience replay memory and target network make DQN more stable and powerful. Its state includes the current position of the drone, its speed, the positions of neighboring obstacles, the positions of neighboring drones, and the current position of the virtual navigator. Its actions include adjusting the flight direction (angle change) and flight speed. The reward function is shown in Formula (22):

$$D={D_{approach}} - {D_{obstacle~}}+{D_{cluster}}$$
(22)

In Formula (22), \({D_{approach}}\) represents the reward for shortening the distance between the drone and the target point; \({D_{obstacle~}}\) represents the penalty for approaching obstacles; \({D_{cluster}}\) represents the reward for maintaining an appropriate distance and formation with neighboring drones. Based on this reward mechanism for the cruise path, the algorithm controls the virtual navigator to avoid obstacles and navigate; the navigator receives the cluster's detection information about the environment and forms an interactive network with the drone cluster. This feedback enables the drone swarm to adapt to complex and changing environments, as shown in Fig. 4.

Fig. 4
figure 4

The obstacle avoidance algorithm control system of drone swarms.
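A minimal sketch (not the authors' implementation) of the reward in Formula (22) is given below; the safety radius, desired spacing and implicit unit weights are illustrative assumptions.

```python
# A minimal sketch of the DQN reward in Formula (22):
# D = D_approach - D_obstacle + D_cluster.
import numpy as np

def dqn_reward(pos, prev_pos, goal, obstacles, neighbors,
               safe_radius=2.0, desired_spacing=4.0):
    """Reward for one drone given its position, the goal, obstacle and neighbor positions."""
    # D_approach: positive when the drone gets closer to the target point.
    d_approach = np.linalg.norm(prev_pos - goal) - np.linalg.norm(pos - goal)
    # D_obstacle: penalty that grows as the drone enters an obstacle's safety radius.
    d_obstacle = 0.0
    for obs in obstacles:
        dist = np.linalg.norm(pos - obs)
        if dist < safe_radius:
            d_obstacle += (safe_radius - dist)
    # D_cluster: reward for keeping roughly the desired spacing to neighbors.
    d_cluster = 0.0
    for nb in neighbors:
        d_cluster -= abs(np.linalg.norm(pos - nb) - desired_spacing)
    return d_approach - d_obstacle + d_cluster
```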

This study uses the PPO algorithm to control the virtual navigator24. As the guide of the drone cluster, the virtual navigator adjusts the patrol path according to environmental information and provides dynamic reference points for the cluster. This study first uses the DQN algorithm to adjust the position of the virtual navigator so that it can adapt to the dynamic environment and avoid obstacles in real time. It also broadcasts the virtual navigator's target information to each drone through real-time communication to ensure that the cluster can respond quickly to environmental changes. The network structure is shown in Fig. 5. It has two networks. The Critic network first processes the input with an LSTM layer of 128 hidden units; a fully connected (FC) layer of 128 hidden units then uses TanH as its activation function25. The Actor network is composed of a neural network and a normal distribution; it also contains a 128-unit LSTM layer, an FC layer and a TanH layer. Its output is the mean of a normal distribution whose covariance matrix is C = 0.05 I, where I is the identity matrix, and the action is sampled from this distribution. The Actor output is used to obtain the velocity vector of the navigator. Therefore, this study designs the Actor network output so that the projections of a sphere radius onto the three coordinate axes form the velocity vector. The output of the Actor network is taken as the radius of the sphere \({{{\upsigma}}}\) and two angles \(\left( {{{{\uptau}}},{{{\upmu}}}} \right)\): \({{{\uptau}}}\) is the angle between the sphere radius \({{{\upsigma}}}\) and the z-axis, and \({{{\upmu}}}\) is the angle between the projection of the radius on the x–y plane and the x-axis, which yields the velocity vector \(\left[ {{{{\upsigma}}}\cos {{{\uptau}}},\;{{{\upsigma}}}\sin {{{\uptau}}}\sin {{{\upmu}}},\;{{{\upsigma}}}\sin {{{\uptau}}}\cos {{{\upmu}}}} \right]\). Considering that the drone's speed \({{{\upsigma}}}\) is limited to [30.0 m, 50.0 m] and the angles are limited to \(\left[ { - {{{\uppi}~rad}},\;{{{\uppi}~rad}}} \right]\), the mean values of the radius and angles use TanH as the activation function.

Fig. 5
figure 5

The network structure of PPO algorithm.
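The following PyTorch sketch (an illustration under assumptions, not the authors' code) mirrors the described actor-critic structure: a 128-unit LSTM, a 128-unit fully connected layer with TanH, a Gaussian actor head with fixed covariance C = 0.05 I, and a scalar critic head. The observation and action dimensions are assumptions.

```python
# A minimal sketch of the PPO actor-critic network described in the text.
import torch
import torch.nn as nn

class NavigatorActorCritic(nn.Module):
    def __init__(self, obs_dim: int = 9, act_dim: int = 3):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, 128, batch_first=True)     # 128 hidden units
        self.fc = nn.Sequential(nn.Linear(128, 128), nn.Tanh())  # FC + TanH
        self.actor_mean = nn.Sequential(nn.Linear(128, act_dim), nn.Tanh())
        self.critic = nn.Linear(128, 1)
        # Fixed diagonal covariance C = 0.05 * I, i.e. std = sqrt(0.05).
        self.register_buffer("action_std", torch.full((act_dim,), 0.05 ** 0.5))

    def forward(self, obs_seq, hidden=None):
        """obs_seq: (batch, time, obs_dim) sequence of navigator observations."""
        out, hidden = self.lstm(obs_seq, hidden)
        feat = self.fc(out[:, -1])                 # use the last time step
        mean = self.actor_mean(feat)               # mean of the Gaussian policy
        dist = torch.distributions.Normal(mean, self.action_std)
        value = self.critic(feat)                  # state value estimate
        return dist, value, hidden
```

An action sampled from this distribution can then be interpreted as the radius and two angles \(\left( {{{{\upsigma}}},{{{\uptau}}},{{{\upmu}}}} \right)\) and converted into the navigator velocity vector as described above.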

In this study, the navigator is a point mass described by its position and velocity vectors. The training objective is to enable the virtual navigator to approach the target area while the swarm avoids obstacles and follows the virtual navigator, reducing the distance between the navigator and the swarm26. The reward function is constructed as follows:

$$G={G_{obstacle}}+{G_{leader~}}+{G_{center}}$$
(23)

In Formula (23), \({G_{leader~}}\) rewards the leader (navigator) for getting closer to the destination, \({G_{center}}\) rewards the cluster center for shrinking its distance from the leader, and \({G_{obstacle}}\) rewards the cluster for avoiding obstacles. The network input consists of the navigator's position, the drone cluster's center position and the distance vector between the cluster and the obstacle, and the output is the velocity vector of the navigator. In order to achieve cooperative behavior of drone clusters, this study improves the multi-agent control strategy. Distributed control is achieved between drones through local communication, avoiding excessive reliance on centralized control. It also dynamically adjusts parameters by calculating the relative distance and speed difference between agents to ensure that the cluster maintains an appropriate formation during patrols. Finally, based on Olfati-Saber's group behavior model, it introduces the gradient descent method to dynamically adjust the flight path to avoid collisions between drones and obstacles. By combining the DQN and PPO algorithms, this study uses the target network and experience replay mechanism to reduce the instability of model training, and uses parallel computing to accelerate the execution of path planning and obstacle avoidance strategies. It also uses Bayesian optimization to dynamically adjust parameters such as the learning rate and discount factor to improve the robustness of the algorithm in complex environments.
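For illustration, the navigator reward of Formula (23) could be sketched as follows (not the authors' implementation); the safety radius and implicit unit weights are assumptions.

```python
# A minimal sketch of the navigator reward in Formula (23):
# G = G_obstacle + G_leader + G_center.
import numpy as np

def navigator_reward(nav_pos, prev_nav_pos, goal, cluster_center, obstacles,
                     safe_radius=3.0):
    # G_leader: reward for the navigator getting closer to the destination.
    g_leader = np.linalg.norm(prev_nav_pos - goal) - np.linalg.norm(nav_pos - goal)
    # G_center: reward for the cluster center staying close to the navigator.
    g_center = -np.linalg.norm(cluster_center - nav_pos)
    # G_obstacle: penalty when the cluster center gets too close to any obstacle.
    g_obstacle = 0.0
    for obs in obstacles:
        dist = np.linalg.norm(cluster_center - obs)
        if dist < safe_radius:
            g_obstacle -= (safe_radius - dist)
    return g_obstacle + g_leader + g_center
```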

Enhanced multi-agent cluster model construction

Before building the model, this study set the learning rates of the DQN and PPO algorithms to 0.001 and 0.0003. The learning rate of DQN is mainly based on the convergence requirements of the algorithm in a dynamic environment, and a higher learning rate is selected to accelerate initial learning. The learning rate of PPO is based on empirical values and experimental tuning results, and a lower learning rate is selected to avoid oscillation or instability caused by too fast gradient updates in complex scenarios. This study also set the batch size of DQN and PPO to 64 and 128. The smaller batch size of DQN is intended to improve the real-time update capability of the model, while the larger batch size of PPO helps to enhance the adaptability to complex environments and training stability. The discount factor is used to weigh short-term rewards and long-term benefits. This study set it to 0.99, highlighting the overall performance of the drone swarm in long-term patrol missions and retaining the effectiveness of short-term path optimization. The target network of the DQN algorithm is updated every 100 iterations. Based on the consideration of model training stability, it delays the update of the target network to avoid training oscillations caused by frequent updates. The PPO algorithm uses gradient clipping technology and sets the gradient clipping threshold to 0.2 to ensure that the gradient update does not exceed a reasonable range, thereby improving the stability of the model.
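For reference, the hyperparameters reported above can be collected in a single configuration sketch (variable names are illustrative, not the authors' code):

```python
# Hyperparameters as reported in the text, gathered into one place.
DQN_CONFIG = {
    "learning_rate": 1e-3,       # higher rate to accelerate initial learning
    "batch_size": 64,            # smaller batches for more frequent, real-time updates
    "gamma": 0.99,               # discount factor balancing short- and long-term rewards
    "target_update_every": 100,  # delayed target-network update for training stability
}

PPO_CONFIG = {
    "learning_rate": 3e-4,       # lower rate to avoid oscillation in complex scenes
    "batch_size": 128,           # larger batches for adaptability and stability
    "gamma": 0.99,
    "clip_threshold": 0.2,       # clipping threshold reported in the text
}
```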

In order to verify the effectiveness and correctness of the algorithm, this study also built a multi-drone motion simulation model using a four-degree-of-freedom, 8-state drone dynamics model27. It contains 8 state variables \(\left[ {{\text{s}}1,\;{\text{s}}2,\;{\text{s}}3,\;{\text{s}}4,\;{\text{s}}5,\;{\text{s}}6,\;{\text{s}}7,\;{\text{s}}8} \right]\) and four input variables: the aileron deflection command \({\emptyset _a}\), the elevator deflection command \({\emptyset _b}\), the rudder deflection command \({\emptyset _c}\) and the throttle command \({\emptyset _d}\). In order to control the drone swarm, this study uses three cluster control quantities: speed \({{\text{V}}_i}\), yaw angle \({{\text{R}}_i}\) and altitude \({{\text{H}}_i}\). The dynamic model of the drone with attitude control capability is controlled through these three cluster control quantities. The output command of the cluster algorithm is converted into the input of the attitude control loop in the drone dynamic model, and a PI control loop is built as the transition. The relationship between the cluster algorithm, PI control loop, attitude control loop and state dynamics model is shown in Fig. 6. In the implementation, the four-degree-of-freedom drone dynamics model is constructed first: by defining the drone's 8 state variables and 4 input variables, the dynamics equations are established to describe the relationship between the state variables and the input variables. Then the PI control loop is designed by setting the proportional gain (\({{\text{K}}_p}\)) and integral gain (\({{\text{K}}_i}\)) of the PI controller, which takes the output command of the cluster algorithm as input, calculates the error (e) and the integral of the error (\(\smallint {\text{edt}}\)), and computes the control output through the formula \({\text{u}}\left( {\text{t}} \right)={{\text{K}}_p}{\text{e}}\left( {\text{t}} \right)+{\text{~}}{{\text{K}}_i}\smallint {\text{e}}\left( {\text{t}} \right){\text{dt}}\). Finally, the attitude control loop is implemented: it converts the output command of the PI controller into the attitude control commands of the drone to control the pitch angle, yaw angle and throttle, achieving precise control of the drone's movement.

Fig. 6
figure 6

The relationship between swarm algorithm, PI control loop, attitude control loop and UAV 8-state dynamics model.
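A minimal sketch (not the authors' implementation) of the PI transition loop described above, \({\text{u}}\left( {\text{t}} \right)={{\text{K}}_p}{\text{e}}\left( {\text{t}} \right)+{{\text{K}}_i}\smallint {\text{e}}\left( {\text{t}} \right){\text{dt}}\), is given below; the gains and time step are illustrative assumptions.

```python
# A minimal sketch of the PI control loop that converts a cluster-algorithm
# command (desired speed, yaw angle or altitude) into an attitude-loop input.
class PIController:
    def __init__(self, kp: float, ki: float, dt: float = 0.02):
        self.kp, self.ki, self.dt = kp, ki, dt
        self.integral = 0.0

    def update(self, desired: float, measured: float) -> float:
        error = desired - measured             # e(t)
        self.integral += error * self.dt       # accumulate the integral of e(t)
        return self.kp * error + self.ki * self.integral  # u(t) = Kp*e + Ki*integral(e)

# Example: one PI loop per cluster control quantity (speed, yaw angle, altitude).
speed_loop = PIController(kp=0.8, ki=0.1)
attitude_command = speed_loop.update(desired=10.0, measured=9.2)
```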

Experimental result

In order to verify the performance of the EN-MASCA algorithm, it is compared with the unimproved MASCA (multi-agent swarm control algorithm), NNCA (nonlinear neural control algorithm)28 and NSGAII (non-dominated sorting genetic algorithm II)29 algorithms. The results are as follows:

Flight traces

Figure 7 shows the obstacle avoidance flight routes controlled by the four algorithms. Drone1, controlled by the MASCA algorithm, deviated from the drone swarm by 1.874 m at the 1500 m mark, a significant deviation. The maximum height difference occurred at 2250 m, where the height difference between drone2 and drone5 reached 3.495 m, indicating that the flight altitude fluctuated with a large amplitude. Drone1, controlled by the NNCA algorithm, deviated from the drone swarm by 1.690 m at the 750 m mark but then gradually returned to the cluster. The maximum height difference, between drone3 and drone5, reached 2.473 m at 1500 m, indicating a relatively large amplitude of height change. Drone2, controlled by the NSGAII algorithm, deviated from the cluster by 1.744 m at 1500 m, showing a certain deviation; the height difference between drone3 and drone4 reached 2.431 m at 2000 m, showing that its height change had obvious periodicity and a moderate fluctuation amplitude. Drone6, controlled by the EN-MASCA algorithm, deviated from the drone swarm by 0.781 m at 500 m but quickly returned to the cluster. Its maximum height difference, between drone3 and drone6, was 1.524 m at 1500 m, indicating that its altitude was the most stable, with the smallest height difference and deviation and good cluster consistency.

Fig. 7
figure 7

The height variation of unmanned swarms controlled by four algorithms. (a) MASCA, (b) NNCA, (c) NSGAII, (d) EN-MASCA.

Flight stability

Figure 8 shows the changes in speed, yaw angle, height change rate and relative distance of the drone swarms controlled by the four algorithms. The average speed of the EN-MASCA algorithm is 9.214–11.315 m/s, with an expected value of 10 m/s. Compared with the other three algorithms, whose range is 7.624–12.990 m/s, its maximum and minimum values are reduced by 14.80% and 10.59%. The average yaw angle of the EN-MASCA algorithm is −0.705 to 1.929 rad, with an expected range of −0.845 to 2.092 rad. Compared with the other three algorithms, whose range is −2.404 to 2.674 rad, its maximum and minimum values are reduced by 38.61% and 86.38%. The average relative distance of the EN-MASCA algorithm is 1.777–5.357 m, with an expected value of 4 m. Compared with the other three algorithms in the 2–6 m range, its maximum and minimum values are reduced by 37.99% and 14.86%. The average height change rate of the EN-MASCA algorithm is 0.550–1.597 m/s, with an expected value of 1.0 m/s. Compared with the other three algorithms in the range of 0.394–3.739 m/s, its maximum and minimum values are reduced by 134.13% and 4.06%. The above results show that the speed of the drone swarm controlled by the EN-MASCA algorithm quickly converges to the expected flight speed after fluctuating within the allowable range, demonstrating a good speed-tracking effect. After fluctuating, the yaw angle quickly converges to the expected yaw angle, and the height change rate shows that the swarm can track the leading navigator with little fluctuation, avoiding the tracking delay of the pitch angle. During the whole flight, the minimum relative distance between drones is always greater than the minimum collision-avoidance distance, so collisions between drones are avoided.

Fig. 8
figure 8

Changes in the navigation speed, altitude change rate, yaw angle, and relative distance of the unmanned swarm controlled by the four algorithms. (a) Average navigation speed, (b) Average yaw angle, (c) Average relative distance, (d) Average altitude change rate.

Cluster stability

Figure 9 shows the changes in the distance between the cluster center and the virtual navigator, the distance between the cluster center and obstacles, and the cluster roll angle and pitch angle under the four algorithms. The average distance between the cluster center and obstacles for the EN-MASCA algorithm is between 12.171 and 16.700 m, with an expected value of 15 m. Compared with the other three algorithms, whose range is 11.033–21.624 m, its maximum and minimum values are reduced by 29.48% and 5.12%. The average distance between the cluster center and the virtual navigator for the EN-MASCA algorithm is between 8.106 and 11.915 m, with an expected value of 10 m. Compared with the other three algorithms, whose range is 6.014–15.847 m, its maximum and minimum values are reduced by 33.02% and 19.19%. The average roll angle of the EN-MASCA algorithm is 8.95–14.87 rad, with an expected value of 12 rad. Compared with the other three algorithms in the range of 6.12–17.78 rad, its maximum and minimum values are reduced by 22.85% and 31.97%. The average pitch angle of EN-MASCA is 16.952–22.959 rad, with an expected value of 20 rad. Compared with the other three algorithms in the range of 12.041–29.754 rad, its maximum and minimum values are reduced by 29.60% and 5.49%. These results show that the EN-MASCA algorithm performs better in controlling the navigation of drone swarms, with a smaller fluctuation range and values closer to the expected ones. Because the virtual navigator is trained through the PPO algorithm, its network structure is complex and has strong learning ability. The navigator receives environmental information and adjusts its behavior according to the reward function. The reward function covers multiple aspects, such as the navigator's approach to the target area, the distance between the cluster center and the navigator, and the cluster center avoiding obstacles, which ensures the navigation effect of the navigator so that the entire cluster can complete the task more effectively. The algorithm also introduces the projections of the virtual navigator and obstacles, optimizes the control input of the cluster and enables the agents to better follow the navigator and avoid obstacles.

Fig. 9
figure 9

The control changes under four algorithms in the distance between the cluster center and obstacles, the distance between the cluster center and virtual navigator, and the cluster roll angle and pitch angle. (a) Average distance between the cluster center and obstacles, (b) Average distance between the cluster center and virtual navigator, (c) Average roll angle, (d) Average pitch angle.

Simulation effect

Figure 10a–c shows the GAZEBO simulation navigation performance of the drone swarm controlled by the EN-MASCA algorithm. The starting and ending points of the drone swarm are (1, 9, 0.15) and (15, 1, 0.1), marked by red dots and triangles. The positions of obstacles 1–5 are (1, 6, 0.5), (4, 5, 0.6), (4.2, 6.5, 0.6), (7, 7.5, 0.6) and (10, 3, 0.8). The drones start from the starting point. First, the drone swarm bypasses obstacle 1, passing it on its right side. Then the swarm sails toward obstacles 2 and 3; these two obstacles are close to each other, so the swarm passes around them via the upper and lower paths to avoid the obstacle area. Then, the drone swarm sails to obstacle 4. This obstacle is high, so the swarm adjusts its path in advance and chooses a safe bypass route. Before approaching the end point, the drone swarm avoids obstacle 5, passing it on its right side. Finally, the drone swarm bypasses all obstacles and reaches the end point. During the entire navigation process, the drone swarm constantly adjusts its flight altitude and direction to adapt to terrain changes and avoid obstacles. Figure 10d shows the iteration curve of the EN-MASCA algorithm in the simulated navigation. After 200 iterations, the objective function value becomes stable, indicating that the navigation path of the drone swarm converges to the optimal path during the optimization process, avoiding all obstacles and meeting the requirements of the shortest path and minimum energy consumption. Figure 10 thus shows the obstacle avoidance path, terrain adaptability and iterative optimization process of the EN-MASCA algorithm controlling the drone swarm from multiple perspectives, indicating that it can find the optimal path in changing terrain and performs well in complex environments.

Fig. 10
figure 10

GAZEBO simulation navigation effect and iteration of EN-MASCA algorithm controlling UAV swarm (a) 3D navigation trajectory map, (b) Two-dimensional navigation plan, (c) Contour navigation chart, (d) Performance graph.

The path generated in GAZEBO is imported into the Rflysim3D software30 for demonstration. The Rflysim3D environment is set up as the simulation environment, and the drones' flight altitude and speed are set to be consistent. Five drones depart from the initial position, which is consistent with the position in GAZEBO. As shown in Fig. 11, the drone group's flight path is consistent with the path simulated in GAZEBO. Although drones 2 and 4 fly in another direction at the beginning, they pass through obstacles 1–3 after adjustment and return to the flight path of drone 1 (the navigator). Finally, the whole drone group flies to the end point. This shows that the drone group can fly according to the planned path and complete the cruising process.

Fig. 11
figure 11

3D simulation effect of EN-MASCA algorithm controlling drone swarm (a) Obstacle distribution map, (b) Initial navigation path diagram, (c) Obstacle avoidance navigation path map, (d) Optimal navigation route map.

Discussion

This study is consistent with Zhou et al.17 and Yin et al.21 in terms of path planning and obstacle avoidance, but by introducing a virtual navigator and reinforcement learning algorithms, it achieves higher accuracy and greater adaptability. Compared with the studies of Meng et al.15 and Chen et al.18, this study optimizes the cluster collaboration mechanism, enabling drones to maintain consistency and stability in complex agricultural scenarios. Building on the smart agriculture results of Lu et al.11 and Li et al.12, this study focuses on the specific application scenario of durian orchards and designs a more practical optimization strategy. Compared with the dynamic modeling studies of Zhou et al.13 and Chen et al.14, this study shows stronger robustness in dynamic environments. Building on Yin et al.21 and Ji et al.23, this study significantly improves the efficiency and real-time performance of the algorithm through a distributed control strategy and an experience replay mechanism. The above experimental results show that the EN-MASCA algorithm performs better than the other algorithms in multiple dimensions such as flight trajectory deviation, altitude change stability and flight speed range, demonstrating good accuracy and stability. In terms of relative distance consistency and safe distance control between the cluster center and obstacles, EN-MASCA significantly outperforms MASCA, NNCA and NSGAII, indicating that it is more suitable for clustering tasks in complex environments. It achieves fast path adjustment and efficient navigation, especially in complex scenarios, by introducing a virtual navigator model and reinforcement learning algorithms (DQN and PPO). Table 2 shows the comparison results between EN-MASCA and the other two algorithms.

Table 2 The comparison results between EN-MASCA and the other two algorithms.

Although this study simulated the basic terrain characteristics of a durian orchard, there are some limitations compared to the real orchard scene. The terrain undulations and obstacle distribution in the simulated environment are designed based on average values and typical samples, and fail to fully cover the extreme terrain changes that may occur in the orchard (such as steep slopes and deep gullies). There are irregularly distributed soft soils or waterlogged areas in the real environment. These factors put higher requirements on the flight control and path planning of the drone, which have not been fully reflected in the simulation. Although moving obstacles were added to the experiment to simulate a dynamic environment, the types and behavior patterns of dynamic obstacles are relatively limited. The moving obstacles in this experiment used a fixed speed and a simple linear motion model, while nonlinear and irregular motion (such as random movement of people, animals, and mechanical equipment) may occur in reality. In order to narrow the gap between the experimental setting and the real conditions, subsequent research will build a more complex and realistic simulation environment, including more sophisticated terrain and obstacle modeling. It introduces experimental designs with multiple scenarios and multiple meteorological conditions to comprehensively evaluate the robustness and adaptability of the algorithm. It also conducts long-term field tests to accumulate real data to optimize the algorithm performance.

Conclusions

This study proposes an enhanced multi-agent swarm control algorithm (EN-MASCA) for the coordinated patrol operation of drone swarms in durian orchards. It introduces DQN and PPO algorithms to optimize drones’ navigation and obstacle avoidance strategies. It guides the drone swarm through the virtual navigator model to improve its adaptability and stability. It also constructs a six-degree-of-freedom drone motion simulation model and uses the PI control loop to achieve the attitude control of drone swarms. Compared with MASCA, NNCA and NSGAII algorithms, the results show that the EN-MASCA algorithm is superior to the other three algorithms regarding flight trajectory, flight stability, cluster stability and simulation effect. The drone swarm controlled by the EN-MASCA algorithm can effectively avoid obstacles, maintain a tight formation and complete patrol tasks. Its speed, altitude and yaw angle change rate are closer to the expected value and have less fluctuation. The distance between the cluster center and the virtual navigator or obstacles remains stable, which ensures the safety and stability of patrol operations. It enables drone swarms to learn and optimize flight paths, avoid collisions and misjudgments and complete large-scale patrol tasks quickly, which reduces the labor intensity and costs of manual patrols and improves patrol efficiency and safety. Moreover, it enables the drone to respond quickly, detect abnormal conditions and send alerts to managers promptly, which provides the accurate location information to help managers respond quickly, ultimately improving the economic benefits of the durian orchard.