Introduction

Rice serves as the primary food source for more than half of the global population, making the prevention of rice diseases and pests essential for maintaining food security1. However, the emergence of rice diseases and pests poses a significant threat to crop yields and quality, leading to incalculable annual economic losses due to their impact on agriculture2. Common rice diseases can cause leaves to wither, affecting photosynthesis and damaging the stems, nodes, panicles, young branches, and grains3. In severe cases, these diseases can lead to total crop failure. Rice pests can damage the plant’s vascular tissues, and large numbers of pests can cause the rice to fall over, leading to growth deformities and significantly impacting yield and quality4.

To combat these threats more effectively, the development of accurate and efficient disease and pest detection methods has become a central focus of current research5. Traditional monitoring of rice diseases and pests relies largely on visual inspection by plant protection personnel, which has several shortcomings. First, it is highly subjective: identification results are strongly affected by individual factors such as the observer’s experience and fatigue, which easily leads to misjudgment. Second, in large-scale planting scenarios, field-by-field inspection requires considerable manpower, resulting in high monitoring costs. Third, timeliness is poor: disease and pest outbreaks are often sudden, while manual inspection has a long cycle and slow response, so the optimal window for prevention and control is easily missed6. Although traditional computer vision-assisted methods attempt to improve efficiency through image processing techniques such as threshold segmentation and morphological feature extraction, they remain highly dependent on hand-crafted features and show poor robustness in complex field environments. In the realm of target detection, deep learning algorithms offer innovative solutions for identifying rice diseases and pests, leveraging their strength in feature extraction and pattern recognition7. First, early identification and intervention are crucial to safeguard rice growth and yield, mitigate losses caused by diseases and pests, and improve agricultural production efficiency8. Furthermore, deep learning-based identification can reduce dependence on chemical pesticides, which benefits environmental protection and promotes sustainable agriculture9. Finally, applying deep learning strengthens the research basis of smart agriculture, promotes the automation and intelligence of agricultural production, and is expected to improve food security on a global scale, with important social value for economic development and quality of life10. In recent years, rapid progress in computer hardware, particularly in processing speed, along with advances in software, has propelled deep learning and image processing to the forefront of research on disease and pest prevention and control11.

Zhang et al.12 took advantage of both CNN and transformer architectures for accurate detection of leaf blast and brown spot diseases; precision, recall, and F1-score were all above 0.96, with an AUC of up to 0.9987 and a loss of 0.0042. Uddin et al.13 proposed a novel end-to-end training of convolutional neural network (CNN) and attention (E2ETCA) ensemble framework that fuses the features of two CNN-based state-of-the-art (SOTA) models with those of an attention-based vision transformer model. Rajasekhar et al.14 developed a Spider Monkey-based Random Forest (SMbRF) model for the precise detection and classification of rice leaf diseases, achieving a remarkable accuracy of 99.29%, sensitivity of 99.52%, precision of 98.76%, and specificity of 99%, demonstrating the model’s scalability. Sangaiah et al.15 enriched the T-YOLOv4 network by integrating SPP (Spatial Pyramid Pooling)16, CBAM (Convolutional Block Attention Module), SCFEM (Sand Clock Feature Extraction Module), Ghost modules, and additional convolutional layers, improving the network’s precision in identifying three types of rice leaf diseases and reaching a testing mean average precision (mAP) of 86%. Gao et al. further enhanced YOLOv5 by adding CBAM, optimizing the main branch gradient flow with the BottleNeck Block module of CSP17 (Cross Stage Partial) in the Neck, and substituting the SPPF (Spatial Pyramid Pooling-Fast)18 module with an S-SPPF module to detect three kinds of rice leaf diseases; the detection result increased by 9.90% in terms of mAP. Song et al.19 introduced a novel YOLOv8-SCS architecture that incorporates SPD-Conv (Space-to-Depth Convolution)20, the CG block (Context Guided block)21, and Slide Loss, enabling the detection of 10 distinct rice pests with an mAP of 87.9%, a 5.7% improvement over the original YOLOv8. Trinh et al.22 altered the loss function of the YOLOv8n model by incorporating Alpha-IoU and deployed the enhanced model on IoT devices to detect the three most prevalent rice leaf diseases in Vietnam, reaching a precision of 89.9% on a data set of 3175 images23. Table 1 summarizes recent AI models for agricultural disease and pest detection. However, these detection methods depend on accurate extraction of characteristic parameters, suffer from difficult feature extraction, weak anti-interference ability, and poor generality, and cover relatively few types of diseases and pests; their detection accuracy and applicability still need to be improved.

Table 1 Summary of AI models for agricultural diseases and pests detection.

This article first introduces the materials and methods, then presents the experiments, results and comparisons, and finally draws conclusions. The proposed model is built on YOLOv8n: the Triplet Attention mechanism is used to model channel and spatial attention, the GLSA (Global to Local Spatial Aggregation) module is used to effectively fuse feature information at different levels, the C2f-BottleNeck is improved using WTConv (Wavelet Transform Convolution), and the loss function is replaced with EIoU (Enhanced Intersection over Union), yielding a new model called YOLO-DP. The model can detect 15 different types of diseases and pests, and a visualization system was built on top of it to detect diseases and pests in images and videos. Compared with the original YOLOv8n model, the accuracy rate increased by 2.9% to 77.8%, so the model can effectively detect more types of rice diseases and pests.

Materials and methods

Experimental environment configuration

During the experiments, in order to ensure training efficiency, the AutoDL cloud server was selected as the experimental platform, equipped with an RTX 4090 (24 GB) graphics card and an 18-vCPU AMD EPYC 9754 128-core processor. The platform runs the Ubuntu 20.04 LTS operating system, the deep learning framework is PyTorch 1.11.0, the development language is Python 3.8, and the computing platform is CUDA 11.3. During network training, the initial learning rate was set to 0.01, the number of iterations to 300, and the training batch size to 64. The rice diseases and pests detection system was developed with PyQt5 5.15.2, Python 3.9.20, and PyTorch 1.9.0.
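For readers who wish to reproduce the training setup, the sketch below shows how these hyperparameters could be passed through the standard Ultralytics Python API; the dataset YAML and the model configuration name are placeholders, not the authors’ actual files.

```python
# Minimal training sketch (assumption: standard Ultralytics YOLOv8 Python API).
# Paths and the custom model YAML are placeholders, not the authors' actual files.
from ultralytics import YOLO

model = YOLO("yolov8n.yaml")          # base architecture; YOLO-DP would use a modified YAML
model.train(
    data="rice_diseases_pests.yaml",  # hypothetical dataset config (15 classes, 7:2:1 split)
    epochs=300,                       # number of iterations reported above
    batch=64,                         # training batch size
    lr0=0.01,                         # initial learning rate
    imgsz=640,                        # input resolution 640 x 640
    device=0,                         # single RTX 4090 GPU
)
```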

Data set construction

The construction of a high-quality data set is the prerequisite for model training. Existing public data sets do not cover rice diseases and rice pests at the same time, so this study constructed a new data set by combining online collection with field photography, gathering data on fifteen common rice diseases and pests. The field photographs were taken in Dongping Town, Chongming District, Shanghai, China with an Apple iPhone 15 and cover nine types of rice diseases: Rice blast, Bacterial blight, Brown spot, Rice dead heart, Rice downy mildew, Rice false smut, Rice sheath blight, Rice bacterial leaf streak and Rice tungro, as well as six types of rice pests: Brown-planthopper, Green-leafhopper, Leaf-folder, Rice-bug, Stem-borer and Whorl-maggot; these images are shown in Fig. 1. Combined with the public data, they cover the characteristics of diseases and pests in different regions and environments, ensuring the diversity and practicability of the data set and providing strong support for subsequent research and analysis.

Fig. 1
figure 1

Fifteen types of rice diseases and pests. This figure covers nine types of rice diseases: Rice blast, Bacterial blight, Brown spot, Rice dead heart, Rice downy mildew, Rice false smut, Rice sheath blight, Rice bacterial leaf streak and Rice tungro, as well as six types of rice pests: Brown-planthopper, Green-leafhopper, Leaf-folder, Rice-bug, Stem-borer and Whorl-maggot.

The data set was divided into training, validation, and test sets in a 7:2:1 ratio. Baidu’s intelligent data service EasyData was used to annotate the missing labels of rice disease and pest images; the annotations mark the specific lesion areas. To enhance the model’s generalization and robustness, the original images were flipped and rotated, and four further methods were applied: brightening, darkening, salt noise and Gaussian noise. In addition, the H (Hue), S (Saturation) and V (Value) components in the HSV color space were adjusted to increase color diversity and to emulate different rice field scenes such as sunny, rainy, morning, dusk and night. These methods and transformations are shown in Fig. 2, and through them the data set was expanded to 20,129 images. The final composition of the data set is presented in Table 2. Although these enhancement strategies enrich data diversity to a certain extent, there are still potential limitations: the current environmental simulation mainly covers basic changes in light intensity and weather conditions and does not yet fully cover the complex actual scenarios in the field.
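The following minimal sketch illustrates how the augmentation types described above could be implemented with OpenCV and NumPy; the gain factors, noise levels, and HSV offsets are illustrative assumptions rather than the exact settings used to build the data set.

```python
# Illustrative augmentation sketch (assumptions: OpenCV + NumPy; the gain, noise,
# and HSV offsets are example values, not the exact settings used for the data set).
import cv2
import numpy as np

def brighten(img, gain=1.3):
    return np.clip(img.astype(np.float32) * gain, 0, 255).astype(np.uint8)

def darken(img, gain=0.6):
    return np.clip(img.astype(np.float32) * gain, 0, 255).astype(np.uint8)

def salt_noise(img, ratio=0.02):
    out = img.copy()
    mask = np.random.rand(*img.shape[:2]) < ratio   # randomly whiten a small fraction of pixels
    out[mask] = 255
    return out

def gaussian_noise(img, sigma=15):
    noise = np.random.normal(0, sigma, img.shape)
    return np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)

def hsv_shift(img, dh=10, ds=20, dv=20):
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV).astype(np.int16)
    hsv[..., 0] = (hsv[..., 0] + dh) % 180           # hue wraps around in OpenCV
    hsv[..., 1] = np.clip(hsv[..., 1] + ds, 0, 255)  # saturation
    hsv[..., 2] = np.clip(hsv[..., 2] + dv, 0, 255)  # value (simulates lighting changes)
    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)

def flip_and_rotate(img):
    return cv2.rotate(cv2.flip(img, 1), cv2.ROTATE_90_CLOCKWISE)
```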

Fig. 2
figure 2

Six types of data augmentation methods. This figure shows six types of data augmentation methods used to emulate different rice field scenes such as sunny, rainy, morning, dusk and night.

Table 2 Composition of the data set.

YOLOv8n network structure improvement

YOLOv8n (You Only Look Once v8 Nano) is a real-time target detection model released by Ultralytics that offers frontier performance in terms of accuracy and speed. As a new iteration of the YOLO series, YOLOv8n is composed of modules such as Input, Backbone, Neck, Head, Loss and Output, which together improve feature extraction and object detection performance24.

The YOLO-DP model in this article builds on YOLOv8n. Triplet Attention is introduced into the Backbone of the network to capture cross-dimensional interactions between the channel and spatial dimensions through three branching structures, improving the model’s ability to understand and process features while adding negligible computational overhead. The GLSA module is used to effectively fuse feature information at different levels, improve the model’s context perception, and enhance the effectiveness of the feature representation. The C2f-BottleNeck is improved with WTConv, whose cascading wavelet decomposition significantly expands the receptive field of the network without significantly increasing the number of parameters. Finally, the original CIoU loss function is replaced with EIoU, which directly minimizes the width and height differences between the predicted box and the ground-truth box, comprehensively considers the matching of position and size, reduces position offset and shape mismatch of the predicted box, and makes the model converge faster. Based on these improvements, we propose the new model YOLO-DP; Fig. 3 shows its structure.

Fig. 3
figure 3

Structure diagram of YOLO-DP. This figure illustrates the structure of YOLO-DP, including the Backbone, Neck, Head, Loss, Input and Output. The Backbone consists of Conv: a standard convolutional layer for extracting features; WTConv: the improved convolutional layer; GLSA: the Global to Local Spatial Aggregation module; Concat: a feature stitching operation that fuses features at different levels; C2f: a feature processing module; Detect: the detection head used to generate target detection results; and EIoU: the loss function. The input and output image size is 640 × 640 × 3.

The main steps are: collecting images of rice diseases and pests, resizing them to a resolution of 640 × 640, and augmenting the data set through image algorithms to enhance the model’s generalization and robustness; training the detection network on the preprocessed images and corresponding labels to obtain the model weight files, using specific loss functions and optimizers during training to ensure optimal performance for the task; and finally validating and testing the trained weights on test images to assess the model’s performance and accuracy. Training metrics such as average precision, average recall, and mean average precision are recorded to comprehensively evaluate the model’s performance25.

Triplet attention

In the complex rice field environment, the YOLOv8n model faces several challenges: the subtle diversity of disease characteristics on rice leaves and ears, the low contrast between infected and healthy leaves, and the similarity in color between pests and rice ears or leaves. These factors make it difficult for the original model to accurately detect and locate the characteristics of rice diseases and pests. To address these challenges, this study used Triplet Attention, an innovative attention mechanism that leverages a three-pronged architecture to discern cross-dimensional interplay within the input data and calculate attention weights. This technique establishes interdependencies among input channels and spatial positions while maintaining a relatively low computational burden. In this work, Triplet Attention is incorporated into the YOLOv8n Backbone, enriching the model’s information exchange and bolstering its ability to discern intricate structures26.

Triplet Attention’s primary concept is cross-dimensional interaction: the mechanism captures the interplay between the channel dimension (C) and the spatial dimensions (height H and width W) via three distinct branches. Its rotation operation captures dependencies across dimensions without dimensionality reduction, thereby reducing the parameter count while preserving the integrity of the information. Lastly, a residual connection plays a pivotal role: the output of each branch is integrated with the input, which bolsters the model’s learning capacity. Figure 4 illustrates the core mechanism of Triplet Attention, cross-dimensional interaction.

Fig. 4
figure 4

Diagram of cross-dimensional interaction in triplet attention. In this figure Triplet Attention consists of three parallel branches. Two of these branches capture cross-dimensional interactions between the channel (C) and spatial dimensions (H or W). The last branch constructs spatial attention. The outputs of the three branches are aggregated by averaging.

Let the input feature map be \(\chi\), with dimensions C × H × W; formulas (1) to (10) describe the mechanism. First Branch (Channel Attention): compress channel information into two dimensions through \(ZPool\), which combines max pooling and average pooling:

$$\begin{array}{*{20}c} {\chi_{1}^{*} = ZPool\left( \chi \right)} \\ \end{array}$$
(1)

Generate attention weights through convolution and activation functions, where \(\sigma\) is the Sigmoid activation function and \({\psi }_{1}\) is the convolution operation; the branch output is then obtained, where \(\odot\) denotes element-wise multiplication:

$$\begin{array}{c}{\omega }_{1}=\sigma \left({\psi }_{1}\left({\chi }_{1}^{*}\right)\right)\end{array}$$
(2)
$$\begin{array}{c}\overline{{\chi }_{1}}=\chi \odot {\omega }_{1}\end{array}$$
(3)

Second Branch (Channel and Height Interaction): Permute the feature map along the height dimension, where \({\chi }_{perm2}\) is the permutation of \(\chi\) along the height dimension:

$$\begin{array}{c}{\chi }_{2}={\chi }_{perm2}\end{array}$$
(4)

Generate attention weights through \(ZPool\) and convolution operations:

$$\begin{array}{c}{\omega }_{2}=\sigma \left({\psi }_{2}\left({\chi }_{2}^{*}\right)\right)\end{array}$$
(5)

Output of the second branch:

$$\begin{array}{c}\overline{{\chi }_{2}}=\chi \odot {\omega }_{2}\end{array}$$
(6)

Third Branch (Channel and Width Interaction): permute the feature map along the width dimension, where \({\chi }_{perm3}\) is the permutation of \(\chi\) along the width dimension, then generate attention weights through \(ZPool\) and convolution operations:

$$\begin{array}{c}{\chi }_{3}={\chi }_{perm3}\end{array}$$
(7)
$$\begin{array}{c}{\omega }_{3}=\sigma \left({\psi }_{3}\left({\chi }_{3}^{*}\right)\right)\end{array}$$
(8)

Output of the third branch:

$$\begin{array}{c}\overline{{\chi }_{3}}=\chi \odot {\omega }_{3}\end{array}$$
(9)

Finally, in the output stage, the outputs of the three branches are averaged:

$$\begin{array}{c}y=\frac{1}{3}\left(\overline{{\chi }_{1}}+\overline{{\chi }_{2}}+\overline{{\chi }_{3}}\right)\end{array}$$
(10)
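As a concrete illustration of Eqs. (1)–(10), the following minimal PyTorch sketch implements the three branches and their averaged output; the 7 × 7 convolution kernel and batch normalization are common choices assumed here, not details specified in this paper.

```python
# A compact PyTorch sketch of Triplet Attention following Eqs. (1)-(10);
# kernel size and normalization are assumptions, not the exact configuration.
import torch
import torch.nn as nn

class ZPool(nn.Module):
    """Concatenate max- and average-pooling along the first feature dimension."""
    def forward(self, x):
        return torch.cat([x.max(dim=1, keepdim=True)[0],
                          x.mean(dim=1, keepdim=True)], dim=1)

class AttentionGate(nn.Module):
    """ZPool -> conv -> sigmoid, producing the attention weights omega."""
    def __init__(self, k=7):
        super().__init__()
        self.pool = ZPool()
        self.conv = nn.Conv2d(2, 1, kernel_size=k, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(1)

    def forward(self, x):
        return x * torch.sigmoid(self.bn(self.conv(self.pool(x))))

class TripletAttention(nn.Module):
    """Spatial branch plus two rotated branches (channel-height and channel-width
    interactions, Eqs. (1)-(9)), averaged as in Eq. (10)."""
    def __init__(self):
        super().__init__()
        self.branch_hw = AttentionGate()
        self.branch_rot1 = AttentionGate()
        self.branch_rot2 = AttentionGate()

    def forward(self, x):
        y1 = self.branch_hw(x)
        # rotate the tensor so the gate acts across a different pair of dimensions,
        # then rotate the result back to (B, C, H, W)
        y2 = self.branch_rot1(x.permute(0, 2, 1, 3).contiguous()).permute(0, 2, 1, 3)
        y3 = self.branch_rot2(x.permute(0, 3, 2, 1).contiguous()).permute(0, 3, 2, 1)
        return (y1 + y2 + y3) / 3.0   # Eq. (10)
```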

Global to local spatial aggregation (GLSA)

The efficacy of YOLOv8n is significantly influenced by its Neck architecture, which is tasked with merging feature maps across various scales to distill pertinent spatial details. Conventional Neck structures might fall short in effectively capturing both global and local features, particularly when encountering objects with substantial scale disparities. To address this challenge, we propose incorporating the Global to Local Spatial Aggregation (GLSA) module within the Neck of YOLOv8n, thus bolstering the model’s capacity to discern global and local features; Fig. 5 gives an overview of GLSA27.

Fig. 5
figure 5

Overview of the global-to-local spatial aggregation module (GLSA). The module consists of two parallel branches: a GSA (Global Spatial Attention) branch, which attends to pixels based on their content, and an LSA (Local Spatial Attention) branch, which focuses on the spatial location of the pixels. The output of the module is the sum of the outputs of these two branches.

The GLSA module refines feature representation by isolating and combining global and local spatial features. This module is composed of two primary components: Global Spatial Attention (GSA) and Local Spatial Attention (LSA). The GSA component utilizes a self-attention mechanism to capture global dependencies, whereas the LSA component concentrates on the extraction of local features, which is crucial for the identification of small objects. The formulas applied range from formula 11 to 18.

Firstly, the Backbone of YOLOv8n is used to extract the multiscale feature maps \({F}_{1}\), \({F}_{2}\), \({F}_{3}\). The GLSA module is applied to each feature map to extract global and local features, where \({F}_{gi}\) and \({F}_{li}\) represent the global and local features of the ith feature map, respectively:

$$\begin{array}{c}{F}_{gi}=GSA\left({F}_{i}\right)\end{array}$$
(11)
$$\begin{array}{c}{F}_{li}=LSA\left({F}_{i}\right)\end{array}$$
(12)

Fusion of global and local features for richer representation of features:

$$\begin{array}{c}{{F}_{i}}^{\prime}={F}_{gi}+{F}_{li}\end{array}$$
(13)

The multi-head attention mechanism is applied to further fuse the features:

$$\begin{array}{c}{F}_{multi}=MultiHead\left({{F}_{1}}^{\prime},{{F}_{2}}^{\prime},{{F}_{3}}^{\prime}\right)\end{array}$$
(14)

The key formulas in the GLSA module are as follows. Global Spatial Attention (GSA):

$$\begin{array}{c}AttG\left({F}_{i}\right)=Softmax\left(Transpose\left({C}_{1\times 1}\left({F}_{i}\right)\right)\right)\end{array}$$
(15)
$$\begin{array}{c}GSA\left({F}_{i}\right)=MLP\left(AttG\left({F}_{i}\right)\otimes {F}_{i}\right)+{F}_{i}\end{array}$$
(16)

Local Spatial Attention (LSA):

$$\begin{array}{c}AttL\left({F}_{i}\right)=\sigma \left({C}_{1\times 1}\left({F}_{c}\left({F}_{i}\right)\right)+{F}_{i}\right)\end{array}$$
(17)
$$\begin{array}{c}LSA\left({F}_{i}\right)=AttL\left({F}_{i}\right)\odot {F}_{i}+{F}_{i}\end{array}$$
(18)

Among them, \({C}_{1\times 1}\) represents a \(1\times 1\) convolution, \(\otimes\) represents matrix multiplication, \(\odot\) represents element-wise multiplication, \(\sigma\) represents the sigmoid activation function, and \(MLP\) represents a multilayer perceptron.
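To make Eqs. (11)–(18) concrete, the sketch below gives a simplified PyTorch implementation of the GSA and LSA branches and their sum; the channel widths and the exact composition of the local convolution block \({F}_{c}\) are assumptions for illustration, not the authors’ exact configuration.

```python
# A simplified PyTorch sketch of the GLSA module following Eqs. (11)-(18).
import torch
import torch.nn as nn

class GSA(nn.Module):
    """Global spatial attention: a softmax over all positions builds a global
    context vector that is mixed back into every location (Eqs. 15-16)."""
    def __init__(self, c):
        super().__init__()
        self.to_attn = nn.Conv2d(c, 1, kernel_size=1)          # C_1x1 in Eq. (15)
        self.mlp = nn.Sequential(nn.Conv2d(c, c, 1), nn.GELU(), nn.Conv2d(c, c, 1))

    def forward(self, f):
        b, c, h, w = f.shape
        attn = self.to_attn(f).view(b, 1, h * w)               # (B, 1, HW)
        attn = torch.softmax(attn, dim=-1).transpose(1, 2)     # (B, HW, 1)
        ctx = torch.bmm(f.view(b, c, h * w), attn)             # (B, C, 1) global context
        return self.mlp(ctx.view(b, c, 1, 1)) + f              # broadcast add, Eq. (16)

class LSA(nn.Module):
    """Local spatial attention: a small conv block produces a sigmoid gate that
    re-weights each position (Eqs. 17-18)."""
    def __init__(self, c):
        super().__init__()
        self.fc = nn.Sequential(                               # F_c, a local conv block (assumed)
            nn.Conv2d(c, c, 3, padding=1, groups=c), nn.Conv2d(c, c, 1), nn.ReLU())
        self.gate = nn.Conv2d(c, c, kernel_size=1)             # C_1x1 in Eq. (17)

    def forward(self, f):
        att = torch.sigmoid(self.gate(self.fc(f)) + f)         # Eq. (17)
        return att * f + f                                     # Eq. (18)

class GLSA(nn.Module):
    """Sum of the global and local branches, Eq. (13)."""
    def __init__(self, c):
        super().__init__()
        self.gsa, self.lsa = GSA(c), LSA(c)

    def forward(self, f):
        return self.gsa(f) + self.lsa(f)
```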

Wavelet transform convolution (WTConv)

The C2f-BottleNeck module in YOLOv8n requires feature fusion operations, which increase the computational complexity of the model, lengthen training and inference time, and demand more storage and computing resources. We therefore introduce Wavelet Transform Convolution (WTConv), which essentially replaces the traditional convolution kernel with wavelet basis functions for signal or image processing. By substituting the fixed convolution kernel of a traditional convolutional neural network with wavelet bases that have clear mathematical meaning, features can be extracted at multiple scales and in multiple directions, combining the advantage of convolution’s local receptive field with wavelets’ multi-resolution analysis. Figure 6 shows a WTConv operation example.

Fig. 6
figure 6

WTConv operation example. This figure shows that the WTConv operation separates the convolution across frequency components and allows smaller convolution kernels to operate over a larger region of the original input, thereby increasing the receptive field relative to the input. WT (Wavelet Transform) expands the receptive field of the convolution through wavelet transformation, and IWT (Inverse Wavelet Transform) is the linear inverse of WT, which reconstructs the convolution results back to the original space without loss.

In this research, the WTConv module took the place of the deep convolutional layer within the C2f-BottleNeck architecture of YOLOv8n. This enhancement enables the model to encompass a broader spectrum of contextual information and to respond more adeptly to low-frequency features, a crucial aspect for object detection tasks. Specifically, the WTConv module augments the model’s sensitivity to shape via wavelet transformation, thereby enhancing the detection of small objects and of objects situated within intricate backgrounds. The introduction of WTConv modules improves network performance at a low computing cost in terms of floating point operations28.

The formula for WTConv can be represented by the following key steps:

Wavelet Transform (WT), where \({X}_{LL}^{0}=X\) is the input to this layer and \({X}_{H}^{i}\) represents all three high-frequency maps of level i:

$$\begin{array}{c}{X}_{LL}^{i},{X}_{H}^{i}=WT\left({X}_{LL}^{i-1}\right)\end{array}$$
(19)

In the wavelet domain convolution, \(X\) is the input tensor, and \(W\) is a weight tensor of a \(k\times k\) depth-wise convolution kernel with four times the number of input channels. This operation not only separates the convolution between the frequency components, but also allows a smaller convolution kernel to operate over a larger area of the original input, increasing the receptive field relative to the input:

$$\begin{array}{c}Y=IWT\left(Conv\left(W,WT\left(X\right)\right)\right)\end{array}$$
(20)

Cascading wavelet decomposition, where \({X}_{LL}^{i}\) and \({X}_{H}^{i}\) are the wavelet transform results of level i:

$$\begin{array}{c}{Y}_{LL}^{i},{Y}_{H}^{i}=Conv\left({W}^{i},\left({X}_{LL}^{i},{X}_{H}^{i}\right)\right)\end{array}$$
(21)

Combination of the different frequency outputs, which sums the convolutions from the different levels, where \({Z}^{i}\) is the aggregated output from level i onwards:

$$\begin{array}{c}{Z}^{i}=IWT\left({Y}_{LL}^{i}+{Z}^{i+1},{Y}_{H}^{i}\right)\end{array}$$
(22)
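The sketch below illustrates a single-level version of Eqs. (19)–(22) with the Haar wavelet in PyTorch; the real WTConv cascades several decomposition levels, and the filter choice, normalisation, and kernel size here are assumptions for illustration only.

```python
# A minimal single-level WTConv sketch following Eqs. (19)-(22), using Haar filters;
# normalisation and kernel size are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def haar_filters(channels):
    """Depthwise 2x2 Haar analysis filters producing LL, LH, HL, HH sub-bands."""
    ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
    lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])
    hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])
    hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
    bank = torch.stack([ll, lh, hl, hh])                     # (4, 2, 2)
    return bank.repeat(channels, 1, 1).unsqueeze(1)          # (4C, 1, 2, 2)

class WTConv(nn.Module):
    def __init__(self, channels, k=3):
        super().__init__()
        self.c = channels
        self.register_buffer("wt", haar_filters(channels))   # analysis filters (WT)
        # depthwise convolution applied in the wavelet domain, Eqs. (20)-(21)
        self.conv = nn.Conv2d(4 * channels, 4 * channels, k,
                              padding=k // 2, groups=4 * channels)

    def forward(self, x):
        # WT: stride-2 depthwise conv splits each channel into 4 half-resolution bands
        bands = F.conv2d(x, self.wt, stride=2, groups=self.c)
        bands = self.conv(bands)                              # convolve in the wavelet domain
        # IWT: transposed conv with the same orthonormal filters maps back, Eq. (22)
        return F.conv_transpose2d(bands, self.wt, stride=2, groups=self.c)
```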

Enhanced intersection over union (EIoU)

The YOLOv8n model adopts the CIoU (Complete Intersection over Union) loss function, which combines the original IoU (Intersection over Union), an aspect-ratio term, and the distance between center points to evaluate the similarity between the predicted box and the ground-truth box for bounding box regression. However, during bounding box regression, if the height-width ratio of the predicted box and that of the ground-truth box are linearly correlated, the penalty term of the CIoU loss function may become invalid and fall to zero29.

The EIoU loss function used in this paper further improves on CIoU: it considers not only the distance between the center points of the predicted box and the ground-truth box but also the differences between their widths and heights. This improvement allows EIoU to decouple the geometric factors, optimize more accurately, and evaluate the predicted box more comprehensively; in particular, when the predicted box overlaps little with the ground-truth box, EIoU provides more effective gradient information and accelerates model convergence. Figure 7 shows the EIoU loss function schematic diagram.

Fig. 7
figure 7

EIoU loss function schematic diagram. This figure shows EIoU Loss focuses on higher-quality anchor boxes by decomposing the aspect ratio influence factor of the prediction box and the ground-truth box, and separately calculating the differences in width and height.

The EIoU loss function is calculated as follows:

$$\begin{array}{c}EIoU=1-\left(\frac{intersection}{union}\right)+\frac{{\left({w}^{pred}-{w}^{gt}\right)}^{2}}{{C}_{w}^{2}}+\frac{{\left({h}^{pred}-{h}^{gt}\right)}^{2}}{{C}_{h}^{2}}+\frac{{\rho }^{2}\left({b}^{pred},{b}^{gt}\right)}{{c}^{2}}\end{array}$$
(23)

In this formula, \(intersection\) is the area where the predicted box intersects the ground-truth box, and \(union\) is the area of their union. \(\rho \left({b}^{pred},{b}^{gt}\right)\) is the Euclidean distance between the box centers, and \(c\) is the diagonal length of the smallest enclosing box. \({\left({w}^{pred}-{w}^{gt}\right)}^{2}\) is the squared difference between the width of the predicted box and the width of the ground-truth box, and \({\left({h}^{pred}-{h}^{gt}\right)}^{2}\) is the squared difference between their heights. \({C}_{w}\) and \({C}_{h}\) are the width and height of the smallest enclosing box that contains both the predicted box and the ground-truth box.
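A direct translation of Eq. (23) into code may help clarify the individual terms; the sketch below assumes boxes in (x1, y1, x2, y2) format and adds a small epsilon for numerical stability, which is not part of the original formula.

```python
# A direct sketch of Eq. (23) for axis-aligned boxes given as (x1, y1, x2, y2) tensors
# of shape (N, 4); eps is an assumption added only to avoid division by zero.
import torch

def eiou_loss(pred, gt, eps=1e-7):
    # intersection and union (IoU term)
    ix1, iy1 = torch.max(pred[:, 0], gt[:, 0]), torch.max(pred[:, 1], gt[:, 1])
    ix2, iy2 = torch.min(pred[:, 2], gt[:, 2]), torch.min(pred[:, 3], gt[:, 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    iou = inter / (area_p + area_g - inter + eps)

    # smallest enclosing box: width C_w, height C_h, squared diagonal c^2
    cw = torch.max(pred[:, 2], gt[:, 2]) - torch.min(pred[:, 0], gt[:, 0])
    ch = torch.max(pred[:, 3], gt[:, 3]) - torch.min(pred[:, 1], gt[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # squared centre distance rho^2(b_pred, b_gt)
    pcx, pcy = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    gcx, gcy = (gt[:, 0] + gt[:, 2]) / 2, (gt[:, 1] + gt[:, 3]) / 2
    rho2 = (pcx - gcx) ** 2 + (pcy - gcy) ** 2

    # width / height difference terms
    wp, hp = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    wg, hg = gt[:, 2] - gt[:, 0], gt[:, 3] - gt[:, 1]

    return (1 - iou + rho2 / c2
            + (wp - wg) ** 2 / (cw ** 2 + eps)
            + (hp - hg) ** 2 / (ch ** 2 + eps))
```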

Model and result analysis

Evaluation indicators

In order to test the performance of the improved network model, the frames per second at inference time, the model size, Precision (P), Recall (R), the mean average precision of all detection categories at 50% overlap between the predicted box and the ground-truth box (mAP50), and the mean average precision of all detection categories averaged over IoU thresholds from 50 to 95% in steps of 0.05 (mAP95) are used as the final evaluation indexes of model performance. In the formulas below, TP, FP, and FN denote true positives, false positives, and false negatives, and k is the number of detection classes. The calculation formulas can be expressed as:

$$\begin{array}{c}P=\frac{TP}{TP+FP}\end{array}$$
(24)
$$\begin{array}{c}R=\frac{TP}{TP+FN}\end{array}$$
(25)
$$\begin{array}{c}AP={\int }_{0}^{1}P\left(R\right)dR\end{array}$$
(26)
$$\begin{array}{c}mAP50=\frac{\sum_{j=1}^{k}A{P}_{i}}{k\left(classes\right)}\left(IoU=0.5\right)\end{array}$$
(27)
$$\begin{array}{c}mAP95=\frac{\sum_{\text{j}=1}^{\text{k}}\sum_{t=0}^{9}A{P}_{i}}{10k\left(classes\right)}\left(IoU=0.5+0.05t\right)\end{array}$$
(28)
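The evaluation indexes above can be computed as in the following sketch; the precision-envelope interpolation used for the AP integral is a common convention assumed here, not a detail specified in the paper.

```python
# Illustrative computation of P, R and AP following Eqs. (24)-(28); the interpolation
# of the precision-recall curve is an assumed convention.
import numpy as np

def precision_recall(tp, fp, fn):
    p = tp / (tp + fp) if (tp + fp) > 0 else 0.0   # Eq. (24)
    r = tp / (tp + fn) if (tp + fn) > 0 else 0.0   # Eq. (25)
    return p, r

def average_precision(recall, precision):
    """Area under the P(R) curve, Eq. (26), via trapezoidal integration over
    a monotonically increasing recall array."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]        # precision envelope
    return float(np.trapz(p, r))

def mean_ap(per_class_ap):
    """mAP over k classes, Eq. (27); mAP95 additionally averages the per-class AP
    over the IoU thresholds 0.50:0.05:0.95, as in Eq. (28)."""
    return float(np.mean(per_class_ap))
```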

Results analysis

Comparative experiment results and analysis

To verify the effect of the improved model, TOOD (Task-aligned One-stage Object Detection)30, Faster R-CNN31 as a typical representative of two-stage object detectors, RT-DETR (Real-Time Detection Transformer)32, YOLOv8n, YOLOv10n33 and YOLO-DP were selected for a comparative test; the results are shown in Table 3. It can be seen that YOLO-DP outperforms the other popular models in the detection of rice diseases and pests, improving on the original YOLOv8n by 2.9% in mAP50 and 4% in mAP95.

Table 3 Experimental results of the comparative test.

Figure 8 illustrates the results of detecting rice pests and diseases across six different models in various scenarios and times, including multiple detection targets, close-range, long-distance, direct sunlight, and nighttime. In conditions of long-distance or direct sunlight, RT-DETR and YOLOv10n may fail to detect rice pests and diseases. TOOD and Faster R-CNN have relatively lower accuracy, whereas YOLO-DP, based on YOLOv8n, offers higher precision in identifying rice pests and diseases, making it more effective for detection.

Fig. 8
figure 8

Comparison of six models for rice diseases and pests detection. This detection comparison chart shows the results of the six models for rice diseases and pests under various conditions, evaluated with at least 50% overlap between the predicted box and the ground-truth box. It can be seen that YOLO-DP performs better.

Figure 9 plots mAP50 against training epochs for the six models. As can be seen from the curves, the mAP50 value of YOLO-DP is higher than that of the other models, showing its superiority. Figure 10 compares the Precision-Recall curves of YOLOv8n and YOLO-DP, showing that the model performs well and improves detection for all fifteen types of rice diseases and pests.

Fig. 9
figure 9

Curve plot of the training epochs against the values of mAP50. This figure plots mAP50 against training epochs for the six models. It can be seen that YOLO-DP performs better.

Fig. 10
figure 10

Precision-recall curve comparison between YOLOv8n and YOLO-DP. This figure shows the values of mAP50 for the 15 kinds of rice diseases and pests before and after the model improvement, demonstrating improved detection for all 15 categories.

Figure 11 shows the visualization system that loads the YOLO-DP model to detect rice diseases and pests in images. The system also supports video and camera detection and displays the location, quantity, confidence, and detection time of the targets.

Fig. 11
figure 11

Visual system of rice diseases and pests detection. This figure shows the visualization system that loads the YOLO-DP model to detect rice diseases and pests in images. The system also supports video and camera detection and displays the location, quantity, confidence, and detection time of the targets. It can be seen that YOLO-DP performs well.

Ablation experiment

In order to verify the effectiveness of the improved algorithm, four networks were designed for ablation experiments, all using the same rice diseases and pests data set, batch size, and training period. The experimental results are shown in Table 4: except for the average precision of YOLOv8n + EIoU, which decreased by 0.66%, every other improvement point brought a gain.

Table 4 Experimental results of the ablation experiment.

Comparison of different attention mechanisms

To comprehensively evaluate the performance of Triplet Attention in YOLOv8n model, this study conducted comparative experiments including three types of attention mechanism: YOLOv8n + AgentAttention34, YOLOv8n + LocalWindowAttention35, and YOLOv8n + Triplet Attention. The three attention mechanisms were added separately to the tail of the network Backbone, the experimental results are shown in Table 5.

Table 5 Experimental results of different attention mechanisms.

According to the experiment, Triplet Attention is superior to other attention mechanisms when used to detect diseases and pests in rice.

Comparison of different loss functions

To comprehensively evaluate the performance of EIoU in enhancing the YOLOv8n model, this study conducted comparative experiments involving four loss functions: YOLOv8n + EMASlideLoss36, YOLOv8n + SlideLoss37, YOLOv8n + DIoU38, and YOLOv8n + EIoU. The objective was to analyze the performance differences among these various loss functions in the context of rice diseases and pests detection tasks, the experimental results are shown in Table 6.

Table 6 Experimental results of different loss functions.

According to the experimental results, the indicators of YOLOv8n + EIoU are higher than those of the other loss functions, so EIoU is more suitable for detecting rice diseases and pests.

Heat map analysis

Grad-CAM (Gradient-weighted Class Activation Mapping)39 heat maps were used to visualize the features of different rice diseases and pests output by the YOLO-DP model; the results are shown in Fig. 12. The figure visually displays the level of attention paid to each region, with more salient features corresponding to higher heat at the respective positions; the confidence scores and prediction boxes for rice disease and pest detection are also shown.

Fig. 12
figure 12

Heat map analysis between YOLOv8n and YOLO-DP. This figure shows the Grad-CAM technique was used to analyze and compare the performance of the original YOLOv8n model and YOLO-DP in object detection tasks. Heatmap areas from red to yellow represent the primary focus of the model, while blue areas indicate lower attention. YOLO-DP’s attention is more focused on the center of the diseases and pests, demonstrating more precise feature extraction.

As can be seen in Fig. 12, compared with YOLOv8n, YOLO-DP locates the characteristics of diseases and pests and identifies them more accurately, with higher detection confidence, while also reducing the impact of complex backgrounds on detection.

Analysis of other data set

To verify the generalizability of the method proposed in this study, comparative experiments were conducted on another data set using YOLOv8n and YOLO-DP. The experiment utilized the 2020 Rice Leaf Disease Image Samples40, which contain 5932 images of four kinds of rice leaf diseases: Bacterial blight, Blast, Brown Spot and Tungro. The experiment selected 3849 images and divided them into training, validation, and test sets in a 7:2:1 ratio, with the experimental environment and parameter settings consistent with those described in the “Experimental environment configuration” section. The results are shown in Table 7. Compared to YOLOv8n, YOLO-DP improved the average precision and recall by 1.3% and 0.7%, and achieved a significant 1.4% improvement in mean average precision over the original YOLOv8n model, indicating higher detection accuracy.

Table 7 Experimental results of 2020 rice leaf disease image samples.

Conclusions

Given the current lack of effective methods for simultaneously detecting rice diseases and pests, this study innovatively constructed a dataset that meticulously includes images of the fifteen most common rice diseases and pests. The YOLO-DP model was then introduced, which significantly improved the accuracy of detection under various lighting conditions, distances, and target quantities. By comparing with multiple mainstream object detection models, attention mechanisms, and loss functions, and through ablation studies, the effectiveness of the rice disease and pest detection model was validated. Finally, a rice disease and pest detection system was developed, supporting image, video, and camera detection.

Statistical analysis of the experimental results shows that the YOLO-DP model exhibits stable and excellent performance in rice disease and pest detection tasks. In comparative experiments with the original YOLOv8n model, it achieves a 1.9% increase in precision, a 3.7% increase in recall, a 2.9% increase in mAP50, and a 4.0% increase in mAP95, reaching 90.5%, 88.1%, 92.5%, and 72.1%, respectively. This indicates that the improvement of the YOLO-DP model is not due to accidental fluctuations but is a stable performance enhancement achieved through structural optimization. In addition, the model’s decision-making process is analyzed using the Grad-CAM visualization tool, which further verifies the accuracy and consistency of the model in capturing disease and pest features, providing quantitative support for the interpretability of the results.

In conclusion, the YOLO-DP model shows excellent performance in rice disease and pest detection tasks, providing strong technical support for intelligent agriculture and real-time disease monitoring, and opening the possibility of applying these advanced model technologies to other crops and diseases41.

Future research could extend the YOLO-DP model to detect diseases in other crops, such as wheat and corn, to validate and optimize its applicability and effectiveness across a broader range of crop diseases, and more specialized models for small-scale disease and pest detection will be introduced into the experimental comparison. For extreme field scenarios that are not yet covered, such as blurred images and dew reflection on foggy days, the data set is planned to be expanded, and adversarial training strategies will be introduced to improve the model’s resistance to unknown interference42. Additionally, incorporating multi-modal image detection technology, which utilizes information from various imaging modes (such as visible light, infrared, and hyperspectral images), can enrich the model’s input dimensions, enhance the distinguishability of pest and disease features, and provide forward-looking ideas for developing more precise and versatile intelligent monitoring systems for crop pests and diseases43. Finally, the model can be connected to agricultural Internet of Things (IoT) devices; through model quantization and knowledge distillation it can be adapted to low-computing-power devices, and the lightweight YOLO-DP model can be deployed on drones and field edge-computing terminals, enabling full-process automation from real-time image collection to local rapid detection44.