Introduction

Maize is not only a food crop but also plays a significant role in the production of feed, fuel, and bioproducts. Its cultivation and production are directly linked to global food security and the stability of the agricultural economy1. Given that leaf diseases pose major threats to maize cultivation, the timely and accurate detection of affected areas is crucial for controlling the spread of these diseases2,3.

The frequency of maize leaf disease has increased due to environmental factors and inappropriate farming practices, posing challenges to healthy crop growth4. Traditional manual detection methods are time-consuming, labor-intensive and prone to misjudgment, especially in large-scale cultivation areas. Therefore, the use of intelligent technology for maize leaf disease detection can not only improve the precision and speed of detection, but also detect the leaf disease at an early stage to prevent its spread and reduce economic losses5.

Smart agriculture achieves intelligent management of agricultural activities by integrating advanced technologies such as artificial intelligence6,7, the Internet of Things8,9 and big data10. Maize leaf disease area detection is an important application of smart agriculture: through deep learning and image recognition technology11,12,13, leaf disease areas can be detected quickly and accurately, helping farmers understand the health of the crop in real time and take corresponding prevention and control measures. Smart agriculture not only improves the efficiency of agricultural production, but also reduces the use of pesticides and promotes the sustainable development of agriculture. In recent years, many scholars at home and abroad have applied deep learning and target detection techniques to identify leaf-disease-infested areas in crops and to support leaf disease management14.

Literature15 proposed a multilayer deep information feature fusion network (DFN-PSAN), combining YOLOv5 with a pixel-level attention mechanism, to achieve efficient classification of plant diseases in natural field environments, with an average accuracy and F1 score above 95.27%, and improved model interpretability through SHAP and t-SNE methods. Dai’s research team16 focused on multimodal data fusion for agricultural applications and proposed the ITF-WPI model, which improves the accuracy of wolfberry pest identification through cross-modal feature fusion, reaching 97.98% accuracy while significantly reducing model complexity. Literature17 constructed a PRF-SVM model based on support vector machines to address the environmental complexity of corn leaf disease detection, achieving an average accuracy of 96.67% on the PlantVillage dataset and demonstrating robustness in complex field environments.

Similarly, Yang et al.18 recognized leaf disease with an enhanced YOLOv8 algorithm, which introduces an attention module and a Slim-neck module to replace the original ordinary convolution module. The algorithm was effective in identifying maize leaf disease in complex field environments; however, its mAP was only 71.62%, leaving much room for improvement. Existing deep learning-based leaf disease detection algorithms usually require large amounts of computational resources and memory. Jia et al.19 constructed a lightweight YOLO MSM model based on GhostNetV2. Compared with other models, it significantly reduces model complexity and achieves 97.1% recognition precision with an inference time of 45.3 ms, which shows that balancing precision and lightweight design is extremely difficult. In addition, the dataset also affects the progress of leaf disease detection. To address the problems of difficult dataset acquisition, insufficient samples, and low recognition precision in apple leaf disease image recognition, Li et al.20 used a data augmentation method combining an improved cycle-consistent generative adversarial network with affine transformation. This augmentation method effectively expands the dataset, and it was verified on commonly used networks that the augmented dataset improves the precision of apple leaf disease image recognition, providing a new idea for subsequent crop disease recognition.

Tian21 proposed a new method called multi-scale dense YOLO for detecting three typical small-target Lepidoptera pests on sticky insect panels. Although it can effectively represent local features, it cannot capture global correlation information between pixels at long distances. In the literature22, a deep learning-based artificial intelligence algorithm provides a new approach for maize detection. The authors used an unmanned aerial vehicle to construct a diverse dataset for maize detection and applied YOLOv5 for detection, but because the dataset consisted of small targets, the detection precision was low, reaching only 59%.

Nowadays, many studies have demonstrated the feasibility and efficiency of detecting crop diseases through target detection techniques. However, maize leaf disease detection still faces significant challenges. Environmental and weather influences make it difficult to collect a large-scale, high-quality dataset for maize leaf disease. Existing detection algorithms struggle to balance high precision and fast detection speed, which are crucial for real-time agricultural applications. To address these challenges, this work proposes the YOLO MSM algorithm, which introduces the following key innovations:

(1) Innovative MKConv Module: A novel MKConv module is proposed to enhance feature extraction performance without being constrained by a specific kernel size. This approach reduces network complexity, achieves algorithm lightweighting, and improves the adaptability of the model to complex agricultural scenes.

(2) SK Attention Mechanism in the C2f Residual Module: The incorporation of the SK attention mechanism into the C2f residual module eliminates redundant maize leaf disease feature information, enhancing the feature extraction process. This significantly boosts the model’s detection precision and speed.

(3) MPDIoU Loss Function: The model utilizes the MPDIoU loss function, which simplifies the similarity comparison between bounding boxes. This improvement accelerates convergence, enhances regression precision, and effectively improves the localization performance for maize leaf disease detection.

Data acquisition and preparation

Study area for data collection

To advance maize leaf disease detection and control techniques, the field data of this study were collected from a maize cultivation site in Ruhu Town, Huicheng District, Huizhou City, Guangdong Province, China (22° 30’ ~ 23° 10’ N, 114° 10’ ~ 114° 40’ E). Through field visits and data collection at the maize planting sites, the extent of the leaf disease can be assessed more effectively and more advanced detection and control strategies can be explored.

Data preparation

To ensure the representativeness of the collected data, as shown in Fig. 1, the images were captured using a high-definition device, Canon EOS 6D, with an image resolution of 6960 pixels ×4640 pixels. The characteristics of maize leaf disease are influenced by different weather conditions, lighting, and background complexity. For instance, under warm sunlight or in a complex background, shading and lighting conditions resembling maize leaf disease features can increase the difficulty of identification. Conversely, under low light with a simple background, the disease features are more distinct and clearly shaped, making feature areas easier to detect.

Fig. 1
figure 1

Preparation of datasets.

The data collection was conducted over a period of 13 days, during which researchers captured images under varying conditions, including rainy, sunny, and cloudy weather, as well as at different times of the day. This approach ensured the dataset’s diversity and representativeness. Through screening and organization, a total of 14,700 high-quality images of maize leaf disease were selected as the dataset to guarantee the completeness and variability of the data. The dataset was divided into three parts: 80% for training, 10% for validation, and 10% for testing. This division strategy is commonly used in the field of object detection to balance training needs and evaluation reliability. The 80% training set provides sufficient data for model training, the 10% validation set monitors performance during training to prevent overfitting and optimize hyperparameters, and the 10% test set evaluates the model’s final performance using independent data.
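The 80/10/10 split described above can be sketched as follows (a minimal illustration; the function name `split_dataset` and the fixed random seed are our own choices, not part of the original pipeline):

```python
import random

def split_dataset(num_images, seed=42):
    """Shuffle image indices and split them 80% train / 10% val / 10% test."""
    indices = list(range(num_images))
    random.Random(seed).shuffle(indices)
    n_train = int(num_images * 0.8)
    n_val = int(num_images * 0.1)
    return (indices[:n_train],
            indices[n_train:n_train + n_val],
            indices[n_train + n_val:])

# For the 14,700-image dataset this yields 11,760 / 1,470 / 1,470 images.
train_ids, val_ids, test_ids = split_dataset(14700)
```

Shuffling before splitting avoids any ordering bias from the 13-day collection schedule leaking into a single partition.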

The annotation process was carried out using LabelImg software, chosen for its ease of use, support for multiple annotation formats (such as Pascal VOC, YOLO, and Create ML), and facilitation of group collaboration and batch processing. Trained researchers manually annotated the dataset by drawing bounding boxes around the target areas in each image, strictly following the edge range of the targets to minimize background interference. The annotation mode was set to YOLO format, ensuring compatibility with the model’s input requirements. Each annotation file was saved in txt format, corresponding one-to-one with the image file. These files contained the category number and normalized coordinates of the target, including the x and y coordinates of the center point and the width and height of the bounding box. In total, 14,700 annotation files were generated, providing comprehensive coverage of the target areas in all samples. The annotation format is as follows: <object-class-id> <x> <y> <width> <height>. This meticulous annotation process ensured high-quality input data, laying a solid foundation for subsequent iterative training and performance optimization of the algorithm.
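As a concrete illustration of the YOLO annotation format above, the following sketch converts a pixel-space bounding box into one normalized annotation line (the helper name `to_yolo_line` and the example coordinates are hypothetical):

```python
def to_yolo_line(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel bounding box to one YOLO annotation line:
    <object-class-id> <x> <y> <width> <height>, all coordinates normalized."""
    x_c = (x_min + x_max) / 2 / img_w   # normalized center x
    y_c = (y_min + y_max) / 2 / img_h   # normalized center y
    w = (x_max - x_min) / img_w         # normalized box width
    h = (y_max - y_min) / img_h         # normalized box height
    return f"{class_id} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}"

# A hypothetical box on a 6960 x 4640 pixel image (the EOS 6D resolution):
line = to_yolo_line(0, 1740, 1160, 3480, 2320, 6960, 4640)
# -> "0 0.375000 0.375000 0.250000 0.250000"
```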

Intelligent detection methods

The YOLOv8 algorithm represents the most cost-effective iteration within the YOLO series, providing substantial enhancements in detection precision and speed relative to its predecessors. While the YOLOv10 algorithm20 is a more advanced version, YOLOv8 maintains a more pronounced overall advantage due to its effective balance of precision, speed, and computational resource consumption, thereby remaining highly competitive across a range of application scenarios.

In the design of the backbone network, YOLOv8n adopts the CSPDarknet architecture, which combines the gradient-shunting technique of YOLOv724 with cross-layer connection design. A new C2f residual module is also introduced, which not only effectively reduces the number of parameters in the model, but also greatly enhances the efficiency of feature extraction. YOLOv8 adopts the PANet structure based on the feature pyramid network (FPN) and path aggregation network (PAN), which significantly improves the transfer of information from the bottom to the top layer by incorporating a bottom-up path mechanism and achieves multi-scale contextual feature fusion. Finally, YOLOv8 innovatively introduces an anchor-free matching mechanism, using Binary Cross Entropy (BCE) loss in the classification branch and Distribution Focal Loss (DFL) together with the CIoU loss function in the detection branch, which improves the precision of the model in classification and localization. The YOLOv8 series consists of the n, s, m, l, and x versions, among which YOLOv8n is the lightest and is suitable for application scenarios with high real-time requirements. For the task of detecting maize leaf disease areas, this study adopts YOLOv8n as the baseline model, aiming to ensure detection speed while achieving a leap in detection effect.

MKConv structure

Traditional convolution uses a fixed sampling grid, which limits its local information extraction. Although the receptive field can be expanded indirectly by stacking multiple layers, this tends to be inefficient when dealing with large scales or complex spatial relationships in images. In addition, as the size of the convolution kernel increases, the number of parameters grows quadratically, leading to a significant increase in computational resource demands.

To overcome the limitations of traditional convolutional operations, this study proposes an innovative convolutional method called MKConv (Multi-scale Variable Kernel Convolution), which takes full advantage of the self-attention mechanism. The core idea of MKConv is to allow the convolution kernel to operate at multiple scales, enabling better adaptation to features of different sizes and shapes. This approach not only provides more diverse parameter configuration options, but also allows the sampling shape to be flexibly adapted to the characteristics of specific data.

Irregularly sampled coordinates may not be suitable for standard convolution operations of a particular size, e.g., 5 × 5 or 7 × 7, because irregular grids may lead to an uneven distribution of sampling points, which affects the efficiency and effectiveness of the convolution operation.

MKConv improves conventional convolutional operations in convolutional neural networks, in particular to accommodate irregularly shaped convolution kernels. In MKConv, a regular 3 × 3 sampling grid (G) is first generated. As shown in Eq. (1), this grid is the same as the one used in traditional convolutional operations, usually centered at the point (0,0) and laid out in the traditional way ((-1,-1) in the upper left corner and (1,1) in the lower right corner).

$$G=[(1,1),(0,1), \cdot \cdot \cdot ,( - 1,0),( - 1, - 1)]$$
(1)

Next, the convolution constructs an irregular grid for the remaining sampling points. These points need not follow a fixed grid layout, but are dynamically adjusted according to the needs of the input features. MKConv can adjust the positions of the sampling points based on the spatial distribution of the features to better capture complex features such as edges or corner points. The regular and irregular sampling grids are then stitched together to form a complete sampling grid.

In the irregular convolution operation, each sampling point Sn is associated with a convolution parameter v, and S0 denotes the initial (center) sampling point. These parameters are dynamically adjusted according to the location and contribution of the sampling points to maximize the effect of feature extraction. The corresponding convolution operation can be defined as follows:

$$Conv({S_0})=\sum {v({S_0}+{S_n})}$$
(2)
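Equations (1) and (2) can be sketched as follows (a simplified illustration with scalar weights on a single-channel feature map; the function names are ours, not from the original implementation):

```python
import numpy as np

def regular_grid(k=3):
    """Regular k x k sampling offsets centered at (0, 0), as in Eq. (1)."""
    r = k // 2
    return [(dx, dy) for dy in range(-r, r + 1) for dx in range(-r, r + 1)]

def conv_at(feature, center, weights, grid):
    """Weighted sum of sampled values around `center`, as in Eq. (2)."""
    cy, cx = center
    return sum(v * feature[cy + dy, cx + dx]
               for v, (dx, dy) in zip(weights, grid))
```

An irregular grid would simply append further (dx, dy) offsets that do not lie on the 3 × 3 lattice, which is what MKConv's dynamically adjusted sampling points amount to.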

As shown in Fig. 2, MKConv provides a flexible approach to enhance the adaptability and performance of convolutional neural networks, enabling them to handle a wide range of complex and variable input data more efficiently.

In MKConv, the offsets are first added to the original coordinates, and the features at the corresponding locations are then extracted by interpolation and resampling. Notably, this method is designed for square sampling shapes, so features can be stacked along rows or columns, and the features required for irregular sampling shapes can be extracted by row or column convolution. To extract the features, a convolution kernel of suitable size and stride can be selected. Alternatively, the features can be converted to a four-dimensional format (C × N × H × W) and then extracted using a convolution with stride and kernel size set to (N × 1 × 1).

Fig. 2
figure 2

Schematic diagram of the MKConv structure. In MKConv, the offset of the convolution kernel is obtained by a specific convolution operation with dimensions (C × 2N × H × W), where C, N, H, and W denote the number of channels, the convolution kernel size, and the height and width of the feature map, respectively. The figure shows the operation unfolded for N = 7.

Another option is to concatenate the features along the channel dimension as (CN × H × W) and then use a 1 × 1 convolution to reduce the dimensions to (C × H × W). The resampled features are stacked along the column direction and subsequently convolved with row convolutions of size (N × 1) and stride (N × 1).

As a result, the position of the convolution kernel in MKConv can be dynamically adjusted by the learned offsets. During training, MKConv learns the most effective offsets for a particular image and target, and thus automatically adjusts the position of the convolution kernel at sampling time. This adaptive tuning allows MKConv to more accurately align with and extract key features in an image, especially when the shape and size of the target vary from image to image. By supporting linear tuning of the convolutional parameters, MKConv provides an efficient way to reduce the model’s parameters and computational overhead without sacrificing detection performance.
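The interpolation-and-resampling step described above can be sketched with plain bilinear interpolation on a single-channel map (a minimal illustration; real implementations operate on whole tensors and batches):

```python
import numpy as np

def bilinear_sample(feature, y, x):
    """Sample a 2-D feature map at a fractional (y, x) position, e.g. an
    integer grid coordinate plus a learned offset. Assumes (y, x) lies at
    least one cell away from the bottom/right border."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = y0 + 1, x0 + 1
    wy, wx = y - y0, x - x0
    # Weighted average of the four neighbouring grid values
    return ((1 - wy) * (1 - wx) * feature[y0, x0] +
            (1 - wy) * wx * feature[y0, x1] +
            wy * (1 - wx) * feature[y1, x0] +
            wy * wx * feature[y1, x1])
```

Because the interpolation is differentiable with respect to (y, x), gradients flow back to the offset-predicting convolution, which is what lets the offsets be learned end to end.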

C2f-SK attention mechanisms

To maintain high detection precision for leaf disease areas, the SK attention mechanism25 is introduced into the C2f residual module to form the novel C2f-SK module. C2f-SK adaptively adjusts the size of the receptive field according to the input, effectively capturing target objects at different scales and thus enhancing the detection effect.

The SK attention mechanism mimics the property of cortical neurons to dynamically regulate the receptive field for different stimuli by designing the Selective Kernel. As shown in Fig. 3, the mechanism generates multiple paths with different kernel sizes through three steps (Split, Fuse, Select) and calculates the weights of each channel to effectively fuse the outputs of each convolution kernel.

First, multiple grouped convolutions with different receptive fields are applied to the input feature map F. Specifically, two sets of feature maps V1 and V2 are obtained when the convolution kernels are set to 3 × 3 and 5 × 5. In the Split operation, convolutions of kernel size 3 and 5 are performed on any input feature map \(F \in {M^{{\text{H}} \times {\text{W}} \times {\text{C}}}}\), as shown in Eqs. (3) and (4):

$${F_1}:\beta (C{\text{onv }}3 \times 3(F)) \to {V_1} \in {M^{{\text{H}} \times {\text{W}} \times {\text{C}}}}$$
(3)
$${F_2}:\beta (C{\text{onv }}5 \times 5(F)) \to {V_2} \in {M^{{\text{H}} \times {\text{W}} \times {\text{C}}}}$$
(4)

where β is the ReLU activation function.

Subsequently, in the Fuse step, V1 and V2 are summed element-wise to form the feature map V, as shown in Eq. (5). V is then globally average-pooled (Avg) to embed the global information, yielding the channel feature S (1 × 1 × C), as shown in Eq. (6).

$$V={V_1}+{V_2}$$
(5)
$$S=Avg(V)=\frac{1}{{{\text{W}} \times {\text{H}}}}\sum\limits_{{j=1}}^{{\text{W}}} {\sum\limits_{{i=1}}^{{\text{H}}} {V(j,i)} }$$
(6)
$$Z=fc(S)=\beta (BN({W_s}S))$$
(7)

Next, a fully connected (fc) layer compresses the channel feature and obtains the dimensionality-reduced feature Z to improve efficiency. This process is shown in Eq. (7), where BN denotes batch normalization and Ws is the fc weight matrix. In the Select step, the mechanism adaptively selects information at different spatial scales using soft attention across channels, multiplies the channel attention with the corresponding feature maps, and finally adds the two attention-weighted feature maps U1 and U2 to form the feature map U. Branches with different kernel sizes and their corresponding channel information are fused by softmax attention, so that the individual branches receive different levels of attention, resulting in differences in the effective receptive field size of neurons in the fusion layer.
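The Fuse and Select steps of Eqs. (5)–(7) can be sketched as follows (a NumPy illustration in which random matrices stand in for the learned fc layers, and batch normalization is omitted for brevity):

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sk_fuse_select(V1, V2, W_z, W_a, W_b):
    """SK Fuse + Select over two branch outputs of shape (C, H, W)."""
    V = V1 + V2                            # Eq. (5): element-wise fusion
    S = V.mean(axis=(1, 2))                # Eq. (6): global average pooling
    Z = np.maximum(W_z @ S, 0.0)           # Eq. (7): fc + ReLU, BN omitted
    logits = np.stack([W_a @ Z, W_b @ Z])  # per-branch channel logits (2, C)
    a = softmax(logits, axis=0)            # soft attention across branches
    U = a[0][:, None, None] * V1 + a[1][:, None, None] * V2
    return U, a
```

Since the softmax runs across the two branches, the attention weights for each channel sum to one, making U a per-channel convex combination of the 3 × 3 and 5 × 5 branch outputs.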

Therefore, the SK attention mechanism effectively utilizes the limited computational resources to focus on the leaf disease region and suppresses irrelevant background information, so that this improvement can help to improve the recognition ability of the model in complex field environments without significantly increasing the computational burden, ensuring the precision and efficiency of leaf disease detection.

MPDIoU loss function optimization

In the YOLOv8 algorithm, CIoU is used as the bounding box regression loss function. CIoU mainly focuses on bounding box regression metrics such as the distance between the predicted box and the ground-truth box, the overlap area, and the aspect ratio.

Fig. 3
figure 3

Structure of the SK attention mechanism.

However, these loss functions do not take into account the directionality of the mismatch between the ground-truth and predicted boxes. This shortcoming leads to slower convergence and lower efficiency, as the predicted boxes may wander across different locations during training, ultimately degrading model performance.

MPDIoU26 is a novel loss function designed based on the principle of minimum point distance. The method simplifies the similarity comparison between two bounding boxes by minimizing the distances between the upper-left and lower-right corner points of the predicted bounding box and the ground-truth bounding box, as shown in Fig. 4. This simplification not only significantly improves the convergence speed of the model, but also improves the regression precision. MPDIoU is optimized for target detection tasks and aims to enhance the alignment precision of the bounding boxes. It improves the model’s localization ability by considering the minimum point distances between corresponding corners of the predicted and ground-truth boxes, and performs particularly well when the bounding boxes are highly or only partially overlapping. Its core equations are shown in Eqs. (8)–(11).

$${\text{d}}_{1}^{2}={({x_1} - x_{1}^{{gt}})^2}+{({y_1} - y_{1}^{{gt}})^2}$$
(8)
$${\text{d}}_{2}^{2}={({x_2} - x_{2}^{{gt}})^2}+{({y_2} - y_{2}^{{gt}})^2}$$
(9)
$${\text{MPDIoU=IoU-}}\frac{{{\text{d}}_{1}^{2}+{\text{d}}_{2}^{2}}}{{{{\text{h}}^2}+{{\text{w}}^2}}}=\frac{{{\text{A}} \cap {\text{B}}}}{{{\text{A}} \cup {\text{B}}}} - \frac{{{\text{d}}_{1}^{2}}}{{{{\text{h}}^2}+{{\text{w}}^2}}} - \frac{{{\text{d}}_{2}^{2}}}{{{{\text{h}}^2}+{{\text{w}}^2}}}$$
(10)
$${L_{{\text{MPDIoU}}}}=1 - {\text{MPDIoU}}$$
(11)

where d1 and d2 denote the Euclidean distances between the corresponding corner points of the predicted and ground-truth bounding boxes. (x1, y1) and (x2, y2) are the coordinates of the upper-left and lower-right corners of the predicted bounding box, respectively, and the superscript gt denotes the corresponding ground-truth box. h and w are the height and width of the bounding box, respectively, and IoU denotes the intersection-over-union ratio between the predicted and ground-truth bounding boxes. A denotes the ground-truth box and B the predicted box.
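Equations (8)–(11) can be implemented directly, as in the sketch below (boxes are (x1, y1, x2, y2) with the top-left corner first; w and h are the width and height used in the denominator of Eq. (10) — the original MPDIoU formulation takes them as the input image size):

```python
def mpdiou_loss(pred, gt, w, h):
    """MPDIoU loss per Eqs. (8)-(11) for axis-aligned boxes (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    # Standard IoU from intersection and union areas
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    union = (px2 - px1) * (py2 - py1) + (gx2 - gx1) * (gy2 - gy1) - inter
    iou = inter / union
    # Squared distances between matching corners, Eqs. (8) and (9)
    d1 = (px1 - gx1) ** 2 + (py1 - gy1) ** 2
    d2 = (px2 - gx2) ** 2 + (py2 - gy2) ** 2
    mpdiou = iou - (d1 + d2) / (h ** 2 + w ** 2)   # Eq. (10)
    return 1.0 - mpdiou                            # Eq. (11)
```

For perfectly aligned boxes both corner distances vanish and the loss is zero; any misalignment adds a penalty even when the overlap area is unchanged.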

Fig. 4
figure 4

Diagram of CIoU and MPDIoU loss function.

Unlike the traditional IoU loss function, which relies heavily on the overlap area between the predicted and real bounding boxes to optimize the model, in some cases this approach may not adequately reflect the geometrical alignment state of the bounding boxes.

MPDIoU, on the other hand, introduces an additional distance metric that focuses on the minimum distance between corresponding vertices of the predicted and ground-truth boxes, which allows the loss function to focus more on the exact alignment of the bounding box during optimization. Through this design, MPDIoU takes into account not only the degree of overlap of the bounding boxes but also their relative positions and shapes, thus further strengthening the focus on bounding box alignment precision while retaining the existing advantages of the IoU loss function. This enables MPDIoU to evaluate and optimize bounding box predictions more accurately than the traditional IoU loss function in tasks such as target detection and instance segmentation.

YOLO MSM: a real-time and high-precision detection algorithm for maize leaf disease

Despite its excellent performance in terms of detection precision and speed, the YOLOv8n algorithm still presents certain challenges when confronted with maize leaf disease targets in complex scenes. These challenges mainly stem from the diversity of leaf disease in terms of morphology and scale, coupled with the fact that they are easily occluded, which makes it possible for the model to ignore small occluded targets during the feature extraction process, thus leading to frequent omissions and false detections. For this reason, this work proposes the YOLO MSM algorithm to address these issues.

The YOLO MSM algorithm consists of four main components: the image input, the backbone feature extraction network, the feature fusion network, and the output prediction layer. The structure of this network is shown in Fig. 5. First, a 640 pixel × 640 pixel maize leaf disease image is normalized as input and passed to the backbone feature extraction network for feature extraction. The extracted leaf disease feature maps are then fed into the feature fusion network to fuse shallow and deep features, and finally the feature information enters the output layer to generate prediction boxes with their corresponding categories.

First, to cope with the low resolution of deep feature maps caused by insufficient pixel information, MKConv is proposed. MKConv is a flexible convolutional kernel design that enables the kernel to fully adapt to the specific features of the input data. Second, considering the irregular arrangement and morphological variability of maize leaf disease, feature extraction is further enhanced by integrating the SK attention mechanism into the C2f module.

Fig. 5
figure 5

YOLO MSM model structure diagram.

The C2f module is the core component of the YOLOv8 network; the large number of stacked convolutional operations inside it leads to overly similar features between neighboring channels and thus to redundant features. The loss function is also optimized to cope with low-quality samples in the dataset, such as uneven lighting and green leaf backgrounds. This loss function balances the training process across samples of different quality, improves the convergence speed of the model, and enhances the generalization ability of the algorithm. The YOLO MSM backbone feature extraction network consists of CBS, MKConv, C2f-SK, and SPPF modules. After processing through the backbone network, the maize leaf disease image produces three effective feature layers, which are integrated using top-down and bottom-up fusion methods involving up-sampling, concatenation, regular C2f, and MKConv. On this basis, this study proposes a leaf disease detection framework based on the improved YOLOv8 algorithm, referred to as YOLO MSM.

The framework initially establishes key parameters for the training phase through data labeling and pre-processing, including 500 training cycles and a batch size of 16 images. Subsequently, the network parameters are initialized, encompassing the learning rate and weight decay coefficient. The image data were obtained from a standardized YOLO dataset, and various network components, such as the CBS, MKConv, C2f-SK, and SPPF modules, were employed for feature extraction and fusion during training. The loss function was computed using MPDIoU. During the model validation phase, the learning rate and training strategy were dynamically adjusted to ensure model optimization. Ultimately, the model successfully accomplishes the tasks of leaf disease identification, localization, and classification, demonstrating convergence through the result curves. The best model weights and the final model are saved, providing an effective solution for leaf disease detection.

Experiments and analysis of results

The experimental environment and parameters of this study are shown in Table 1. To comprehensively evaluate the performance of the YOLO MSM algorithm for maize leaf disease detection, special attention is paid to the lightweight design, precision, and real-time detection performance of the algorithm model.

Among them, lightweight performance is evaluated by the number of parameters, floating-point operations (Flops), number of network layers, model storage size (Volume), and training time of the algorithm. In terms of precision, Precision (P), Recall (R), mean average precision (mAP), and the harmonic mean (F1) are used as key evaluation criteria. Real-time detection performance is measured by the number of image or video frames processed per second (fps), ensuring that the response speed of the detection system meets the requirements of practical applications.

Table 1 Experimental environment and parameters.

In addition, mAP@0.5 denotes that the average precision of each category is calculated at an IoU threshold of 0.5 and then averaged across categories. For mAP@0.5:0.95, the IoU threshold is varied from 0.5 to 0.95 in ten equal steps, the average precision is calculated at each threshold, and the results are averaged. This metric accounts for model performance under different IoU thresholds, enabling a more detailed assessment of the model’s ability to recognize prediction boxes with different levels of overlap and providing richer information for model optimization. Furthermore, to assess the detection performance for targets of different sizes more accurately, APS, APM, and APL are used to evaluate the average precision of small, medium, and large targets, respectively.
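The mAP@0.5:0.95 metric amounts to averaging AP over ten IoU thresholds, which can be sketched as follows (`ap_per_threshold` would come from the evaluator and is a placeholder here):

```python
# Ten IoU thresholds from 0.50 to 0.95 in steps of 0.05
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]

def map_50_95(ap_per_threshold):
    """Average the AP values computed at each IoU threshold."""
    assert len(ap_per_threshold) == len(IOU_THRESHOLDS)
    return sum(ap_per_threshold) / len(ap_per_threshold)
```

Because the stricter thresholds (0.9, 0.95) demand near-perfect box overlap, mAP@0.5:0.95 is always at most mAP@0.5 and rewards precise localization rather than loose hits.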

Comparative experiment

To verify the rationality of choosing YOLOv8n as the baseline model, comparison experiments are conducted on the same test platform with the same parameters and dataset. The comparisons cover leading YOLO-series models (YOLOv3n, YOLOv5n, YOLOv6n, YOLOv7n, YOLOv8n, YOLOv9c, YOLOv10n, and YOLOv11n) as well as the RT-DETR-18 model. In addition, to further validate performance, the YOLO-SDW27, CA-YOLO28, and GAM YOLO29 algorithms were reproduced and included in the comparative experiments.

As shown in Table 2, compared with the other algorithms, the number of parameters and Flops of the YOLOv8n model are 1.65 million and 8.2G, respectively, making YOLOv8n more lightweight than the YOLOv3n model. The volume of the YOLOv8n model is only 12.2% of that of the YOLOv9c model. The YOLOv11n algorithm reached only 87.42 fps and took the longest time, 5.37 h. Meanwhile, the YOLOv8n model has the shortest training time, completing iterative training in 0.93 h, and its real-time detection speed exceeds 200 fps, far above the 30 fps threshold required for real-time detection. In terms of lightweight design, YOLOv8n therefore has a significant advantage, making it suitable for real-time deployment in smart agriculture and laying a foundation for future deployment on hardware devices with limited computing resources.

In terms of detection precision, the F1 score of the YOLOv8n model exceeds 85%, striking a good balance between precision and recall. Compared with the recent YOLOv10n model, the mAP@0.5 and mAP@0.5:0.95 of the YOLOv8n model improve by 0.17% and 0.46%, respectively. The precision of the YOLOv8n model is 89.52%, which is 2.43% higher than that of the RT-DETR-18 model, and its recall is 0.82% higher than that of YOLOv9c. Overall, YOLOv8n delivers the best overall performance, and adopting it as the baseline model opens a new window for intelligent regional detection.

Table 2 Comparative test results of different models.
Table 3 Results of ablation experiments.

Maize leaf diseases vary widely in type and severity, ranging from tiny spots to large patches, which poses significant challenges to the generalization capability of detection algorithms; pushing precision beyond 90% is typically difficult. The proposed YOLO MSM algorithm achieves a detection precision of 90.11%, demonstrating its effectiveness in maize leaf disease detection: it improves precision by 3.39%, 5.10% and 3.65% over the YOLO-SDW, CA-YOLO and GAM YOLO algorithms, respectively. The recall of YOLO MSM is 82.64%, which is 0.39% higher than that of YOLOv11n (82.32%), 7.85% and 6.39% higher than those of YOLO-SDW and GAM YOLO, respectively, and 12.91% higher than that of CA-YOLO, indicating greater stability in target area recognition.

Ablation experiments

In order to verify the contribution of each module to maize leaf disease detection performance, this study designs a series of ablation experiments based on the YOLOv8n model. Three key modules are defined: Module A employs MKConv to dynamically tune the convolution operation and enhance the capture of features of different scales and shapes; Module B integrates C2f-SK to optimise the feature extraction process and reduce redundant maize leaf disease features; Module C introduces the MPDIoU mechanism to improve detection precision.

As shown in Table 3, compared with the baseline model, Model A effectively improves the precision of maize leaf disease area detection by adding the MKConv module, while reducing the number of parameters and Flops by 1.82% and 10.84%, respectively. Model B, which adopts the C2f-SK module fused with the SK attention mechanism for feature extraction, achieves a real-time detection speed of 245.20 fps and a precision of 90.13%, demonstrating that the attention mechanism better extracts both the high-level semantic features and the shallow fine-grained features of the maize leaf disease region. Model C, with the replaced loss function, reaches a detection precision of 90.27%, a significant improvement obtained at the expense of real-time detection speed. Models AB, AC and BC each fuse two of the modules. Compared with them, the proposed YOLO MSM algorithm achieves the best combination of detection precision and real-time performance.

MKConv comparison experiment

During the study, the plug-and-play MKConv module was used to replace all 3 × 3 regular convolution operations in the YOLOv5 and YOLOv8 models. In the YOLOv7 model, only the first 3 × 3 convolution in the ELAN structure of the backbone network was replaced.
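The plug-and-play replacement described above can be sketched in PyTorch as a recursive module swap. Note that `MKConvStub` below is a hypothetical placeholder with the same interface as the 3 × 3 convolution it replaces, not the paper's actual MKConv implementation:

```python
import torch
import torch.nn as nn

class MKConvStub(nn.Module):
    """Hypothetical stand-in for MKConv; the real module dynamically mixes
    several kernel shapes. Here we only preserve the drop-in interface."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, stride, 1)

    def forward(self, x):
        return self.conv(x)

def replace_3x3_convs(model: nn.Module) -> nn.Module:
    """Swap every plain 3x3 nn.Conv2d in place, mirroring the plug-and-play
    replacement described for YOLOv5/YOLOv8."""
    for name, child in model.named_children():
        if isinstance(child, nn.Conv2d) and child.kernel_size == (3, 3):
            setattr(model, name, MKConvStub(child.in_channels,
                                            child.out_channels,
                                            child.stride[0]))
        else:
            replace_3x3_convs(child)  # recurse into nested blocks
    return model
```

For YOLOv7, the same `setattr` pattern would instead target only the first 3 × 3 convolution inside each ELAN block of the backbone.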

The experimental results on different models are shown in Table 4. With the YOLOv5n algorithm as the base, the introduction of the MKConv module brings obvious advantages: YOLOv5n-MKConv attains the highest precision for small and large targets, with APS and APL reaching 14.0% and 37.9%, respectively. This shows that the proposed MKConv helps the network recognize the key regions of the target better than other attention-based convolution methods.

Table 4 Various average precision results.

Similarly, on the more stable and cutting-edge YOLOv8n model, the variants introducing the CBAM, CA, and MK modules all effectively improve precision. The YOLOv8n-AKConv model, however, is markedly less effective, with precision degradation for both small and medium targets. In contrast, the APS, APM, and APL of the YOLOv8n-MKConv model improve by 0.63%, 2.98%, and 9.18%, respectively, over the baseline model.

Attention mechanisms comparison experiment

In order to assess the effect of fusing the SK attention mechanism into the C2f residual module, comparative experiments were performed by fusing different attention mechanisms (ECA30, CBAM31, CA32, SE33, AIFI34, and SimAM35) at the same position. These experiments were conducted under the same conditions, with MKConv introduced and the loss function optimized.
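For reference, the core of SK-style selective fusion (the mechanism fused into C2f-SK) can be sketched in NumPy as follows; the weight shapes and reduction dimension are illustrative assumptions, not the exact configuration used in this study:

```python
import numpy as np

def sk_fuse(branches, W_reduce, W_select):
    """Minimal sketch of Selective Kernel fusion, assuming the multi-kernel
    branch outputs have already been computed.
    branches: (K, C, H, W) feature maps from K kernel branches.
    W_reduce: (d, C) squeeze weights; W_select: (K, C, d) per-branch weights."""
    K, C, H, W = branches.shape
    U = branches.sum(axis=0)                       # fuse the K branches
    s = U.mean(axis=(1, 2))                        # global average pooling -> (C,)
    z = np.maximum(W_reduce @ s, 0.0)              # squeeze + ReLU -> (d,)
    logits = np.einsum('kcd,d->kc', W_select, z)   # per-branch channel logits
    a = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)  # softmax over K
    V = (a[:, :, None, None] * branches).sum(axis=0)  # selective aggregation
    return V, a
```

The softmax over the branch axis is what lets the module softly select between kernel sizes per channel, which is the behaviour the C2f-SK module exploits to focus on disease-relevant features.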

According to the results in Table 5, although the model incorporating the AIFI mechanism reaches a notably high real-time detection speed with the same F1 score of 85%, its Flops are 0.2G higher than those of the YOLO MSM model. By comparison, the YOLO MSM model detects maize leaf disease regions faster in real time, reaching 289.1 frames per second, and also has the lowest Flops. In terms of accurate detection, the proposed YOLO MSM algorithm improves mAP@0.5 by 0.41%, 0.79%, 0.69%, 2.04%, 0.12%, and 0.52% over the models incorporating the ECA, CBAM, CA, SE, AIFI, and SimAM attention mechanisms, respectively. This shows that fusing the SK mechanism focuses the network on the key features of the maize leaf disease area, improving both detection accuracy and processing speed.

Table 5 Comparison of attention mechanisms.

Analysis of results

The loss function measures the error between the predicted and actual values, feeds back the quality of the model's decisions, and helps the neural network adaptively adjust its weights to make predictions more accurate. The bounding box loss during maize leaf disease detection is shown in Fig. 6.

As shown in Fig. 6, the loss curves for the training and validation sets follow roughly the same convergence trend: the loss values decrease rapidly in the first 100 epochs and then fall slowly until convergence. When the network is unfrozen at the 50th epoch and the entire structure joins the training, the loss increases slightly, then continues to decrease gradually, converging at about 180 epochs, where the training effect reaches its optimum. The training weights corresponding to the epoch with the smallest validation loss are saved as the input weights for the subsequent test experiments. After optimizing the loss function to MPDIoU, the proposed YOLO MSM algorithm converges faster and attains a smaller loss value: compared with the YOLOv8n algorithm, the final convergence value during training is reduced from 1.7391 to 1.5225. Moreover, the improved loss function imposes no additional computational load on the network structure, making it a completely cost-free improvement.

Fig. 6

Boundary box loss curves.
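The MPDIoU loss shown above can be sketched for a single box pair as follows; this is a minimal illustration assuming boxes in (x1, y1, x2, y2) format and the standard MPDIoU definition of IoU penalised by the normalised squared distances between corresponding corners, not the exact training implementation:

```python
def mpdiou_loss(pred, target, img_w, img_h):
    """pred, target: boxes as (x1, y1, x2, y2); img_w, img_h: image size.
    Returns 1 - MPDIoU, where MPDIoU = IoU - d1^2/norm - d2^2/norm."""
    # intersection area
    ix1, iy1 = max(pred[0], target[0]), max(pred[1], target[1])
    ix2, iy2 = min(pred[2], target[2]), min(pred[3], target[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_t = (target[2] - target[0]) * (target[3] - target[1])
    iou = inter / (area_p + area_t - inter + 1e-9)
    # squared distances between the top-left and bottom-right corner pairs
    d1 = (pred[0] - target[0]) ** 2 + (pred[1] - target[1]) ** 2
    d2 = (pred[2] - target[2]) ** 2 + (pred[3] - target[3]) ** 2
    norm = img_w ** 2 + img_h ** 2  # normalise by the image diagonal squared
    return 1.0 - (iou - d1 / norm - d2 / norm)
```

Because the corner-distance penalties vanish only when the two boxes coincide exactly, the loss keeps providing a useful gradient even for non-overlapping boxes, which is consistent with the faster convergence observed in Fig. 6.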

In this study, the mAP@0.5 curves of the YOLOv10n, YOLOv9c, YOLOv8n, and YOLO MSM models are compared over the first 100 training epochs. As shown in Fig. 7, the proposed YOLO MSM algorithm converges faster, improves precision more quickly, and fluctuates within a smaller range than the other YOLO algorithms during training of the maize leaf disease area detection model.

Fig. 7

mAP@0.5 Curve Diagram.

The R-P curve visualises the performance of the algorithm under different threshold settings, and the area under the curve gives the Average Precision (AP) of the model. Figure 8 shows the R-P curves of YOLOv8n and YOLO MSM. It can be observed that the area under the YOLO MSM curve is significantly larger than that of the YOLOv8n model, indicating that the Average Precision of YOLO MSM is higher.

Fig. 8

Comparison of R-P curves.

In addition, connecting the points (0,0) and (1,1) in the R-P plot gives the balance line, which intersects the R-P curves of the YOLOv8n and YOLO MSM models at (0.8485, 0.8485) and (0.8691, 0.8691), respectively. This shows that the YOLO MSM model balances precision and recall in the detection of maize leaf disease areas, with the more favourable balance point.
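The two quantities discussed above, AP as the area under the R-P curve and the balance point where the curve crosses the (0,0)-(1,1) diagonal (i.e., where precision equals recall), can be sketched as follows, assuming a sampled curve with increasing recall values:

```python
import numpy as np

def ap_and_balance(recall, precision):
    """recall: increasing sample points; precision: matching values.
    Returns (AP, balance point), with AP as the trapezoidal area under the
    R-P curve and the balance point interpolated where precision == recall."""
    recall = np.asarray(recall, dtype=float)
    precision = np.asarray(precision, dtype=float)
    # trapezoidal area under the curve
    ap = float(np.sum((recall[1:] - recall[:-1])
                      * (precision[1:] + precision[:-1]) / 2.0))
    diff = precision - recall
    idx = np.where(np.diff(np.sign(diff)) != 0)[0]  # sign change -> crossing
    if len(idx) == 0:
        return ap, None
    i = idx[0]
    # linear interpolation between the two samples bracketing the crossing
    t = diff[i] / (diff[i] - diff[i + 1])
    bal = float(recall[i] + t * (recall[i + 1] - recall[i]))
    return ap, bal
```

On the balance line the coordinates coincide, so a single value such as 0.8691 fully describes the intersection point reported above.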

Fig. 9

Performance curves for real-time detection.

During the maize leaf disease area detection process, 36 images in the test set were selected for the detection speed test; the real-time detection times of the algorithms are shown in Fig. 9. The baseline YOLOv8n and YOLOv10n algorithms reach high real-time detection speeds of 201.65 fps and 223.53 fps, respectively, while the proposed YOLO MSM algorithm is faster still at 279.56 fps, with per-image detection times ranging from 0.003 s to 0.005 s and 17 images stable at 0.003 s. The detection time of the YOLOv9c algorithm ranges from 0.007 s to 0.009 s, and the YOLOv7n algorithm is stable at 0.008 s for 19 images. This shows that the improved algorithm has excellent real-time performance in detecting maize leaf disease areas, with lower and more stable detection times.
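The relationship between the per-image timings in Fig. 9 and the reported fps figures can be summarised with a trivial helper; the timing values in the example below are hypothetical:

```python
def summarize_speed(times_s):
    """times_s: per-image detection times in seconds, one per test image.
    Returns (average fps, min time, max time)."""
    avg = sum(times_s) / len(times_s)   # mean time per image
    return 1.0 / avg, min(times_s), max(times_s)
```

For instance, a run with every image processed in 0.004 s corresponds to 250 fps, well above the 30 fps real-time threshold cited earlier.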

Visualization results

The results of detecting maize leaf disease areas with various algorithms are presented in Fig. 10. The enhanced YOLO MSM algorithm demonstrates significant improvements over the original YOLOv8n algorithm, with higher precision and confidence in detection and a reduced rate of missed and false detections.

As shown in Fig. 10(a), YOLO MSM directly provides a confidence level of 0.84 for the same maize leaf disease region. In Fig. 10(b), both the YOLOv5n and YOLOv8n models suffer from missed detections. In contrast, YOLO MSM demonstrates higher detection precision under dynamic field conditions, effectively tracking the maize leaf disease while its bounding box fits the disease area more accurately. In Fig. 10(c), leaf disease in the field may appear in overlapping, high-density clusters; while the YOLOv5n and YOLOv8n models suffer from false detections, the attention mechanism incorporated in the YOLO MSM algorithm enables it to identify and locate the major disease areas more precisely.

The YOLO MSM detection method can be applied to real production environments for real-time detection in agricultural fields. The model is encapsulated with PyQt5 to enable maize leaf disease detection and spot counting. The system interface, shown in Fig. 11, provides configuration of detection types and parameters, dynamic adjustment of detection thresholds, and instant display of detection results.

Fig. 10

Comparison of test results.

In short, the enhanced YOLO MSM algorithm combines the advantages of a lightweight design with reduced extraction of unnecessary feature data. By integrating the attention mechanism and cross-channel information fusion, it enhances the feature selection ability in complex environments, thus reducing the occurrence of false edge detection.

Fig. 11

System interface and detection.

Conclusion

This study constructs a large-scale maize leaf disease dataset by sampling and photographing diseased maize leaves in the field at different growth stages and under a variety of environmental conditions, covering different angles and lighting. The dataset aims to provide a rich resource for agricultural disease control, academic research and technology development. An algorithm for detecting maize leaf disease areas is also proposed. Based on YOLOv8, it combines an attention mechanism to construct a new lightweight residual module and introduces a new convolution module to improve feature extraction, thereby improving the precision and detection speed of the model. By optimising the loss function, the regression performance of leaf disease area detection is effectively improved.

The YOLO MSM algorithm shows excellent performance on the maize leaf disease dataset, as verified by the ablation and comparison experiments. In terms of model lightweighting, the number of parameters of the YOLO MSM model is greatly reduced, the model volume is only 5.4 MB, and the training time is 0.92 h, which greatly improves training efficiency. In terms of detection accuracy, compared with other mainstream target detection algorithms such as YOLOv5n, YOLOv6n, and RT-DETR-18, the YOLO MSM model improves detection precision by 1.60, 9.58, and 3.12 percentage points, respectively. The proposed YOLO MSM model has few parameters and a fast detection speed, meeting the practical needs of high-precision, real-time detection of maize leaf diseases and providing efficient technical support for the precise monitoring and control of agricultural diseases. In addition, the designed real-time detection system offers reliable technical support for leaf disease monitoring in real production.

In future research, we will work on the integration of deep learning and large models to further improve the precision and real-time detection performance of the neural network model, in order to promote the efficient development of intelligent agricultural disease control.