Introduction

The peach tree, native to China, belongs to the Rosaceae family of stone fruits and is one of the most economically important fruit trees worldwide. Efficient production in the peach industry is important for ensuring sustainable development, increasing farmers’ income, and promoting rural revitalization and agricultural modernization. In addition, peach trees provide important ecological functions and ornamental value in greening. Planting peach trees in urban, rural, and public areas creates attractive landscapes, improves the quality of the urban environment, and contributes to a livable environment, thereby protecting the ecosystem and promoting economic development. However, pests and diseases hinder the vigorous development of peach trees and substantially reduce fruit quality and yield, resulting in significant financial losses. These diseases and infections primarily affect tree crowns, leaves, fruits, and trunks. Implementing accurate and effective identification and control measures for these diseases and pests is essential to ensure the sustainable development of the peach planting industry.

Traditional techniques for crop disease diagnosis predominantly depend on the subjective professional expertise of fruit growers and specialists, resulting in a deficiency of scientific rigor and objectivity in visual inspections. With the rapid evolution of computer science and information technology, image recognition technologies have become widespread in contemporary agriculture. In peach tree disease detection, deep learning (DL) has proven to be efficient and accurate for identifying diseases in leaves and fruits. Yadav et al.1 used convolutional neural networks (CNNs) to detect bacterial diseases in peach leaf images, achieving an accuracy of 98.75%. Alosaimi et al.2 developed a new CNN model to detect diseases in peach plants and fruits and to localize diseased areas. However, despite the progress made in disease detection in leaves and fruits, relatively few studies have been conducted on disease detection in peach trunks. Peach tree Botryosphaeria gummosis, commonly known as peach gummosis, is a common disease affecting peach trees and occurs worldwide. The disease is characterized by the production of a large amount of gum under the bark, which escapes through the pores and causes typical gum exudation symptoms. As gum exudation consumes a significant amount of the tree’s carbohydrate nutrition, it weakens the tree’s vitality, potentially leading to branch drying and tree death, which seriously restricts the healthy development of the peach industry in China3.

Recently, CNNs, a key component of DL, have been extensively used in smart agriculture for tasks such as monitoring and detection, significantly expanding its reach in the agricultural sector. Compared with traditional machine vision technology, they offer faster detection and higher accuracy4. Target detection models can be divided into two-stage and one-stage algorithms. The first category is represented by two-stage detection algorithms such as R-CNN5 and Faster R-CNN6. For instance, Wu et al.7 improved Res2Net as a backbone network based on the Mask R-CNN network and introduced a boundary loss function to construct an Asian soybean rust detection model; the improved Mask R-CNN achieved an average detection accuracy of 98.14%. Le et al.8 used a Faster R-CNN model with the Inception-ResNetV2 feature extraction module to construct a weed detection model in complex environments, and the improved Faster R-CNN achieved an average accuracy of 55.5%. Both methods use two-stage algorithms similar to R-CNN: in the first stage, the primary task is to generate a set of candidate regions or proposal boxes that potentially contain targets; in the second stage, these candidate regions undergo more refined processing to determine the precise locations of the targets. Such algorithms typically have high detection accuracy but a relatively slow processing speed. The other category is the single-stage algorithm, represented by the You Only Look Once (YOLO) algorithm9, which is a mainstream algorithm in the current field of target detection. Unlike two-stage detection algorithms, YOLO treats the target detection task as a whole: everything from initial feature extraction to final target localization and classification is completed in a single step.
It does not require generating candidate regions first but directly predicts the bounding boxes and categories of targets on the feature map, resulting in a faster processing speed, although detection accuracy may be sacrificed in some cases. Given its precision, speed, brief training period, and minimal computational demands, it is particularly well suited for agricultural tasks, such as detecting maturity and identifying pests and diseases. Yang et al.10 developed a method for automated tomato detection using a refined YOLOv8s model. The improved YOLOv8s achieved a mean average precision (mAP) value of 93.4%, significantly reducing the model size from 22 to 16 MB while maintaining a detection speed of 138.8 frames per second (FPS). The challenge of reliably locating and recognizing apple leaf diseases against complicated natural scene backgrounds was tackled by Li et al.11, who proposed a method based on an upgraded YOLOv5s model to detect apple leaf diseases of various sizes and shapes. The proposed model exhibits an improvement in mAP of 12.74%, 48.84%, 24.44%, and 4.2% over traditional detection methods such as SSD, Faster R-CNN, YOLOv4-tiny, and YOLOx, respectively.

Despite the progress made in pest and disease identification using YOLO technology, research on diseases associated with fruit tree stems remains limited. Taking peach gummosis as an example, given its economic importance and the challenges of early detection, along with the complex growing environment of peach orchards and the variety of lesion shapes, this study proposes a lightweight peach trunk gummosis detection model based on an improved YOLOv8 that targets trees in different locations in Henan Province. The goal was to decrease false detections resulting from multiscale lesions, dense lesions, and ambiguous features in tasks involving the detection of gummosis on peach trunks, thereby enhancing the precision and effectiveness of the model. The network structure was modified to further reduce the number of model parameters and increase the target detection speed to meet the requirements of deployment on edge devices with limited computing power, such as pest and disease control robots. Consequently, this study offers technical assistance for the treatment and management of peach gummosis at an early stage, laying the foundation for the realization of unmanned intelligent prevention and control of peach gummosis in peach trees.

Consequently, we propose a lightweight model, designated as YOLO-Gum, for the detection of peach gummosis in tree branches, which was based on the enhancement of YOLOv8. The primary contributions of this study are as follows:

  1.

    A peach gummosis disease image dataset was constructed to provide data support for the training and validation of the peach gummosis disease detection model.

  2.

    The SENetV2 module is introduced into the backbone network of YOLOv8, replacing some of the convolutional layers, aiming to improve the fineness of feature representation and the integration of global information through its multibranch architecture. In addition, the CCFM is incorporated into the neck structure of YOLOv8, which effectively integrates fine-grained details and context information, reduces the parameter count, and enhances accuracy while maintaining a lightweight model, making it better suited for deployment on resource-limited devices.

  3.

    We compare the performance of YOLO-Gum with other models and methods through comparison and ablation experiments (19 validation experiments). YOLO-Gum achieves an accuracy of 92.5% on PeachGumDisease, with 2.79 M parameters, a 5.57 MB model size, and 7.6 GFLOPs, demonstrating the effectiveness of the model.

YOLO-Gum, as a model for detecting peach gum disease, has a smaller size and higher accuracy, and successfully achieves an effective balance between model accuracy and lightweight. This study provides technical assistance for the development of robotic vision systems for managing peach tree growth and preventing and controlling diseases.

Methods

Data acquisition and annotation

The data collection was primarily conducted in various peach orchards in Zhongmu County, Zhengzhou City, Henan Province, using an Apple iPhone. The collection dates were December 15, 2023, and April 10, 2024. To ensure the integrity of the collected dataset, images were captured under different weather conditions, such as sunny, cloudy, rainy, and snowy. The gum exuded from the trees is initially a translucent, soft, yellow substance; on exposure to air it gradually turns brown, eventually drying into hard brown and black gum blocks. The varying degrees of oxidation in different parts and the possible presence of multiple lesions on the same tree trunk increase the difficulty of identifying peach gummosis disease. As shown in Fig. 1, images of healthy peach tree branches and peach gummosis with varying degrees of oxidation are displayed. Since the collected images were taken randomly in complex orchard environments, photos with insufficient lighting, backlight, severe obstruction of the disease, or blurriness were deleted. Additionally, to ensure the applicability of the algorithm in real-world scenarios, for some sample photos containing non-disease objects, only the excess parts were cropped out, and the remaining space was filled with a black background to ensure a consistent pixel size for all samples.

Fig. 1
figure 1

Partial dataset display. (a) Healthy; (b) Yellow colloid; (c) Brown colloid; (d) Black colloid; (e) Color-mixed colloid 1; (f) Color-mixed colloid 2; (g) Shaded by branches and leaves; (h) High branch and trunk lesions.

Pictures with variable light intensities across time, those taken from various shooting angles, those with diverse disease intensities, and those showing color changes at various disease stages were selected from the gathered dataset to guarantee the depth and diversity of the dataset. Ultimately, 3420 images of peach gummosis disease were selected. These images were annotated using labeling software, with 6354 diseased instances marked in total. Table 1 lists the annotation details. The labeled dataset was randomly partitioned into training, validation, and testing sets in an 8:1:1 ratio.
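The 8:1:1 random split described above can be sketched with the Python standard library; the function name and fixed seed are illustrative assumptions, not the authors' actual script:

```python
import random

def split_dataset(items, ratios=(0.8, 0.1, 0.1), seed=42):
    """Randomly partition a list of labeled images into
    training/validation/test subsets in the given ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)          # deterministic shuffle
    n = len(items)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])
```

For the 3420 annotated images, this yields 2736 training, 342 validation, and 342 test images.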

Table 1 Distribution of images and labeled instances for peach gummosis detection.

Data enhancement

In the training task of a detection model, insufficient training image data leads to overfitting, and it is difficult for the original images in the training set to cover all factors in the natural environment, such as light intensity, weather, noise, and clarity. Therefore, it is necessary to augment the image data of branch and stem peach gummosis disease to obtain sufficient sample data and improve the detection performance and generalization ability of the target detection model12. Using Python, we applied various random augmentation methods to the original images of peach gummosis disease on tree branches, including flipping, rotating, translating, and adjusting the brightness, hue, and sharpness, so as to cover the maximum number of images of peach tree branches with gummosis disease under various complex backgrounds in natural orchards. Additionally, we added Gaussian noise to the images to mimic the blurring that may occur in practical applications because of device or tree branch movements when capturing images. Examples are shown in Fig. 2. The enhanced dataset for peach gummosis disease was increased from 3420 labeled images to 27,360 images. The numbers of images in the enhanced training, validation, and test sets were 21,888, 2736, and 2736, respectively, and the models were trained and tested using this dataset, known as the ‘PeachGumDisease’ dataset.
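A minimal NumPy sketch of the augmentation operations described above is given below. It is illustrative only: rotation is limited to 90 degrees and hue/sharpness adjustments are omitted, since arbitrary-angle rotation and color operations would require an image library such as Pillow.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img):
    """Return augmented variants of an (H, W, 3) uint8 image."""
    variants = []
    variants.append(np.flipud(img))   # vertical flip
    variants.append(np.rot90(img))    # rotation (90 degrees here)
    # Brightness shift, clipped back to the valid uint8 range
    bright = np.clip(img.astype(np.int16) + 40, 0, 255).astype(np.uint8)
    variants.append(bright)
    # Additive Gaussian noise to mimic motion/device blur artifacts
    noisy = img.astype(np.float32) + rng.normal(0.0, 10.0, img.shape)
    variants.append(np.clip(noisy, 0, 255).astype(np.uint8))
    return variants
```

Each call produces four variants per input image; applying several such transforms per original image is how one dataset of 3420 images can grow to 27,360.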

Fig. 2
figure 2

Original and enhanced images. (a) Original image. (b) Vertical flip. (c) Rotation by 30 degrees. (d) Translation. (e) Brightness enhancement. (f) Chroma enhancement. (g) Sharpness enhancement. (h) Gaussian blur.

Comparison of the YOLOv8 model and performance

YOLO, a critical one-stage object detection algorithm, was first introduced by Redmon et al. in 201613. YOLOv8, an open-source iteration of YOLO developed by Ultralytics (Washington, DC, USA) and released on January 10, 2023, represents a state-of-the-art design14. In the backbone network, the C2f module serves as the key tool for extracting input features, inspired by the ELAN design philosophy used in YOLOv715. The SPPF module markedly decreases computational complexity by integrating convolution and pooling processes while augmenting the receptive field of the backbone network. In the neck network, the FPN + PAN structure16 is employed to bolster the feature fusion process of the model. The detection head component17 utilizes YOLOX’s decoupled detection head18, which separates the regression branch from the prediction branch, thereby expediting model convergence.

YOLOv8 is available in five distinct versions: YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x. These variants differ mainly in the depth and width configurations of the underlying network structures. The present investigation evaluated the performance of these five versions by training and testing them on the custom-built “PeachGumDisease” dataset. The results are summarized in Table 2.

Table 2 YOLOv8 model performance comparison.

Table 2 summarizes a comparative analysis of the five YOLOv8 versions in terms of the number of parameters and model size. YOLOv8n is the most lightweight architecture, with the fewest parameters and the smallest model size among the five versions, and its mAP50 is only 2.5% lower than that of YOLOv8s, 5.6% lower than that of YOLOv8l, and 4.4% lower than that of YOLOv8x. Compared with YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x, YOLOv8n has 71.31%, 87.65%, 92.69%, and 95.32% fewer parameters, and its model size is 61.78%, 83.46%, 89.95%, and 93.71% smaller, respectively. Fig. 3 illustrates that YOLOv8n possesses the fewest parameters and the smallest model size relative to the other four versions. Considering all the assessment criteria, balancing detection accuracy with model size, and accounting for the lightweight requirement of deploying peach gummosis detection on mobile devices19, YOLOv8n was selected for further improvement in this study.

Fig. 3
figure 3

YOLOv8 model performance comparison.

Improved YOLOv8 model

This study proposes an enhanced algorithm, YOLO-Gum, to detect peach gummosis, building on the original advantages of the YOLOv8 model. The procedure not only maintains a swift pace but also enhances the precision of detecting peach gummosis in a multifaceted environment. The enhancements primarily focus on two components: the SENetV2 attention mechanism and the CCFM. First, in the backbone layer of YOLOv8, we replaced the original C2f with C2f_SENetV2, which not only addresses the shortcomings of the lightweight network in feature extraction but also enhances feature representation by strengthening inter-channel interactions, thereby improving object detection accuracy. In addition, the CCFM structure is introduced into the neck to integrate features at different scales through fusion operations, which improves the overall performance of the model, reduces the number of parameters, and increases the detection speed. Notably, the SPPF module is retained in the YOLOv8 architecture as an enhanced feature extraction component that addresses the limitations of traditional fixed-size pooling while maintaining computational efficiency. Located at the junction of the backbone and neck subnetworks, SPPF employs a cascaded multiscale pooling strategy to progressively aggregate context information across the spatial hierarchy. The use scenario and operation of the disease control robot were simulated in a laboratory environment. The improved structure is shown in Fig. 4.

Fig. 4
figure 4

The improved YOLOv8 model structure.

Squeeze-and-excitation networks (SENet)

SENet is an architecture devised to bolster the performance of CNNs by adaptively refining the significance of channel weights within feature maps, thereby enhancing the relationships between channels20. This study integrated the SENet attention mechanism to strengthen the interplay between the salient features identified by the model and their ability to convey meaningful information. As shown in Fig. 5, sourced from21, before entering the SENet attention module, all channels in the feature map from the backbone network are equally important. After processing through SENet, however, the significance of individual feature channels is differentiated with varying weights, represented by distinct colors, guiding the network to prioritize channels with higher weights.

SENetV2 is an improved SENet architecture that enhances the network representation by introducing a new module known as squeeze aggregated excitation22. This module integrates the features of ResNeXt and SENet by employing a multibranch fully connected (FC) layer for the squeezing and excitation processes, culminating in feature scaling. Experimental results on benchmark datasets demonstrated a significant improvement in the classification accuracy of the SENetV2 model compared with existing models. The SENetV2 module is introduced into the backbone network of YOLOv8, replacing some of the original convolutional layers, to further improve the fineness of feature expression and the integration of global information through the multibranch structure; Fig. 6 shows its internal working schematic.

During the squeeze process, a reduced FC layer processes the outcomes of global average pooling23. This transformation helps retain key features as they traverse the module, which in turn strengthens the representational power of the network. Building on this concept, we increase the cardinality of the FC layer during this operation. Through the incorporation of a novel layer architecture, namely multibranch dense layers coupled with reduction mechanisms, the model is empowered to capture a broader range of global representations within the network. The layers amassed during the squeezing process undergo a concatenation procedure and are thereafter passed on to the FC layer, as depicted in Fig. 7. Subsequently, the output generated by the FC layer is multiplied with the input layer of the given module, ultimately restoring the dimensionality. The final output is subjected to a scaling technique similar to that of SENet. This series of processes in the residual module is shown in Eq. (1):

$$\begin{aligned} \text {SENetV2} = x + F\left( x \cdot \text {Ex}\left( \sum \text {Sq}(x)\right) \right) \end{aligned}$$
(1)

where x represents the input; F( ) represents the operations that transform the input, such as batch normalization and dropout; Ex( ) denotes the excitation operation; and Sq( ) denotes the squeeze operation, comprising an FC layer.
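The squeeze, multibranch aggregation, excitation, and residual scaling of Eq. (1) can be illustrated with a minimal NumPy sketch. This is a simplified illustration of the mechanism, not the paper's implementation: the F( ) operations (batch normalization, dropout) are omitted, and the branch count, reduction ratio, and weight shapes are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def senetv2_block(x, W_sq, W_ex):
    """x: feature map (C, H, W); W_sq: list of per-branch squeeze
    weight matrices, each (C, C // r); W_ex: excitation weights
    mapping the concatenated branch outputs back to C channels."""
    s = x.mean(axis=(1, 2))                            # squeeze: global average pooling -> (C,)
    branches = [np.maximum(s @ W, 0.0) for W in W_sq]  # multibranch reduced FC layers with ReLU
    z = np.concatenate(branches)                       # aggregate (concatenate) branch outputs
    w = sigmoid(z @ W_ex)                              # excitation -> per-channel weights (C,)
    return x + x * w[:, None, None]                    # channel-wise scaling plus residual, per Eq. (1)
```

The sigmoid keeps each channel weight in (0, 1), so the block re-weights channels without discarding any of them.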

Fig. 5
figure 5

Schematic diagram of the SENet module.

Fig. 6
figure 6

SENetV2 internal working schematic (The squeezed output is subsequently directed into multiple branch FC layers, after which the excitation process takes place. The split input is transmitted to the end for restoring its original form).

Fig. 7
figure 7

Aggregated fully connected layers within SENetV2 (This merges with concatenation at the end. The compressed conv layers post-squeeze are input into this module. The layers have a reduction size of 32 and a cardinality of 4).

Cross-scale convolutional feature fusion module (CCFM)

A CCFM is introduced into the neck structure of YOLOv8. The primary principle of the CCFM is to amalgamate characteristics of varying scales via fusion operations, thereby augmenting the model’s resilience to scale variations and its capacity to identify small-scale objects24. The CCFM can efficiently incorporate intricate characteristics and contextual data, thereby enhancing the model’s overall performance and reducing the parameter count. Training convergence is accelerated, and performance is improved with the incorporation of multiscale features25. Nonetheless, although deformable attention mitigates computational expense, the significantly extended sequence length continues to render the encoder a computational bottleneck. Because the multiscale transformer encoder is computationally redundant and high-level features are distilled from low-level features, performing feature interactions on the concatenated multiscale features is redundant. A series of variants was designed to validate this idea, demonstrating that intra-scale and cross-scale feature interactions can be computed separately24. An efficient hybrid encoder comprising attention-based intra-scale feature interaction and a CNN-based CCFM is thus employed, as shown in Fig. 8.

Fig. 8
figure 8

The encoder structure for each variant (SSE represents the single-scale Transformer encoder, and CSF represents cross-scale fusion. AIFI and CCFM are the two modules designed for our hybrid encoder).

The CCFM is enhanced with the cross-scale fusion module, which incorporates multiple convolutional fusion blocks into the fusion pathway. The fusion block’s function is to amalgamate two contiguous scale features into a novel feature, as depicted in Fig. 9. The fusion block comprises two \(1 \times 1\) convolutions to modify the channel count; the N RepBlocks, each consisting of RepConv26, facilitate feature fusion, and the outputs from the two paths are combined using element-wise addition. The CCFM module integrates lightweight designs in both its channel-attention and spatial-feature-extraction paths, thereby significantly reducing the parameter count. Rather than applying a full \(C \times C\) mapping to mix C channels, the channel-attention path first uses a \(1 \times 1\) convolution to compress the channels from C to C/r, performs the attention computation, and thereafter restores them to C with another \(1 \times 1\) convolution, reducing the attention submodule’s parameters from \(C^2\) to \(2C^2/r\), that is, by roughly a factor of r/2. In the spatial-feature-extraction path, the conventional \(C^2k^2\) full convolution is replaced by two \(1 \times 1\) convolutions for channel projection with a reparameterizable RepBlock inserted between them, resulting in only \(O(Ck^2)\) parameters. Combined, the total parameter count of the CCFM decreases from \(O(C^2k^2)\) to approximately \(2C^2/r + C \cdot k^2\), that is, \(O(C^2/r + Ck^2)\), achieving a multifold reduction in model size.
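The channel-compression arithmetic above can be checked with a few lines of Python; the concrete values of C and r are illustrative, not taken from the model:

```python
def full_mixing_params(C):
    """A single full C -> C channel-mixing layer (1x1 conv): C^2 weights."""
    return C * C

def attention_path_params(C, r):
    """Compress C -> C/r, attend, restore C/r -> C with two 1x1 convs:
    2 * C^2 / r weights in total."""
    return C * (C // r) + (C // r) * C
```

For example, with C = 256 and r = 16, the full mapping needs 65,536 weights while the compressed path needs 8192, an 8-fold (r/2) reduction in the channel-mixing weights.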

Fig. 9
figure 9

CCFM structure.

Experimental equipment and selection of hyperparameters

Training configuration

The model was trained on a Windows-based computer and assessed using the PyTorch 1.12.1 DL framework. The following specifications were utilized for training and evaluation: Intel Core i7-13700 CPU, 32 GB of random-access memory, and an NVIDIA GTX 4090Ti graphics card with 24 GB of video memory. The software ran on CUDA 12.1 and Python 3.9, with the model training environment summarized in Table 3.

Table 3 Experimental environment configuration.

Selection of hyperparameters

All the models were trained under standardized experimental settings using a learning rate scheduling strategy. The hyperparameters were configured as follows: initial learning rate of 0.01, final learning rate factor of 0.01, momentum of 0.937, weight decay coefficient of 0.0005, batch size of 32, and 150 training epochs.
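For reference, the hyperparameters above can be collected into a single configuration. The key names follow the Ultralytics convention (e.g., "lr0", "lrf") as an assumption; only the values are taken from the text.

```python
# Training hyperparameters used for all runs in this study.
HYPERPARAMETERS = {
    "lr0": 0.01,             # initial learning rate
    "lrf": 0.01,             # final learning rate factor
    "momentum": 0.937,       # SGD momentum
    "weight_decay": 0.0005,  # weight decay coefficient
    "batch": 32,             # batch size
    "epochs": 150,           # training epochs
}
```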

Model evaluation metrics

Performance evaluation and complexity assessment are the two main categories of evaluation metrics. The F1 score, mAP, recall, and precision were the performance indicators used to evaluate the model, whereas model size, parameter count, and FPS were used to evaluate its computational efficiency and image-processing speed, that is, its complexity. Precision is defined as the ratio of the number of samples correctly predicted as positive to the total number of samples predicted as positive and serves as an indicator of a model’s classification capability, whereas recall is the ratio of accurately predicted positive samples to the total number of actual positive samples. Average precision (AP) is the area under the precision–recall curve, and mAP is the average of the AP values across classes, serving as an indicator of the model’s comprehensive performance in accurately identifying and classifying targets. The F1 score is the harmonic mean of precision and recall, combining both metrics to assess model performance27,28. Equations (2)–(9) present the computational formulas.

$$\begin{aligned} P&= \frac{{TP}}{{TP + FP}} \times 100\% \end{aligned}$$
(2)
$$\begin{aligned} R&= \frac{{TP}}{{TP + FN}} \times 100\% \end{aligned}$$
(3)

where TP denotes the number of positive samples that were correctly detected, FP denotes the number of negative samples incorrectly detected as positive, and FN denotes the number of positive samples that were missed.

$$\begin{aligned} AP&= \int _0^1 {P(R)} dR \end{aligned}$$
(4)
$$\begin{aligned} mAP&= \frac{{\sum \nolimits _{i = 1}^n {A{P_i}} }}{n} \end{aligned}$$
(5)

where n is the number of disease species.

$$\begin{aligned} F1&= \frac{{2 \times Precision \times Recall}}{{Precision + Recall}} \end{aligned}$$
(6)
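Eqs. (2)–(6) translate directly into code. The sketch below approximates the AP integral of Eq. (4) by trapezoidal integration over sampled (recall, precision) points, which is one common discretization (detection benchmarks often use interpolated variants instead):

```python
def precision_recall_f1(tp, fp, fn):
    """Eqs. (2), (3), and (6) for a single class (as fractions, not %)."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    return p, r, f1

def average_precision(precisions, recalls):
    """Approximate Eq. (4) by trapezoidal integration of P over R;
    the points must be ordered by increasing recall."""
    ap = 0.0
    for i in range(1, len(recalls)):
        ap += (recalls[i] - recalls[i - 1]) * (precisions[i] + precisions[i - 1]) / 2.0
    return ap
```

For instance, 80 true positives with 20 false positives and 20 false negatives give precision, recall, and F1 all equal to 0.8.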

The complexity and detection speed of the model are also assessed. The computational complexity of the model was assessed by analyzing metrics including the parameter count, number of layers, and memory footprint. FLOPs quantify the complexity of a model as the aggregate count of addition and multiplication operations executed within it. A lower FLOPs count indicates reduced computational requirements for model inference and thus faster calculation, which is critical for diagnosing diseases in real time.

$$\begin{aligned} Parameters&= \sum {(K \times K \times {C_{in}} \times {C_{out}})} \end{aligned}$$
(7)
$$\begin{aligned} FLOPs(Conv)&= (2 \times {C_{in}} \times {K^2} - 1) \times {W_{out}} \times {H_{out}} \times {C_{out}} \end{aligned}$$
(8)
$$\begin{aligned} FLOPs(Liner)&= (2 \times {C_{in}} - 1) \times {C_{out}} \end{aligned}$$
(9)

where \(C_{\text {in}}\) denotes the number of input channels, \(C_{\text {out}}\) the number of output channels, K the size of the convolution kernel, and \(W_{\text {out}}\) and \(H_{\text {out}}\) the width and height, respectively, of the resulting output feature map.
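Eqs. (7)–(9) can be expressed as small helper functions (per-layer counts; bias terms are omitted, matching Eq. (7)):

```python
def conv_params(k, c_in, c_out):
    """Eq. (7): weights of one k x k convolutional layer."""
    return k * k * c_in * c_out

def conv_flops(k, c_in, c_out, w_out, h_out):
    """Eq. (8): operations per output element times the output volume."""
    return (2 * c_in * k * k - 1) * w_out * h_out * c_out

def linear_flops(c_in, c_out):
    """Eq. (9): FLOPs of a fully connected layer."""
    return (2 * c_in - 1) * c_out
```

For example, a 3x3 convolution mapping 16 to 32 channels has 3 * 3 * 16 * 32 = 4608 weights.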

Results

Results of ablation experiments

Feature fusion is a pivotal technology for enhancing the performance of machine learning models by integrating multisource feature information, thereby boosting their generalization ability, robustness, and accuracy. As summarized in Table 2, YOLOv8n was shown to have the fewest parameters and the smallest model size. Considering the need for lightweight devices, this study adopted the original YOLOv8n as the experimental baseline and conducted ablation experiments on the YOLOv8n benchmark, maintaining the default parameters and evaluating various improvements. We tested four mainstream lightweight feature fusion modules: BiFPN29, ASFYOLO30, SDI31, and CCFM. Four improved models were constructed in YOLOv8n by sequential addition, and the results were compared on the same dataset. Table 4 and Fig. 10 provide a comprehensive evaluation of the four mainstream lightweight DL architectures in terms of number of parameters, model size, FLOPs, and accuracy. The experiments demonstrate that adding the BiFPN, ASFYOLO, SDI, and CCFM feature fusion modules reduced the number of parameters by 12.86%, 4.39%, −54.86% (i.e., an increase), and 38.57%, respectively, compared with YOLOv8n. Compared with the baseline model, the model sizes were reduced by 32.79%, 25.86%, −17.01% (i.e., an increase), and 53.98%, respectively. Further analysis of the accuracy of each model revealed that adding the CCFM yielded the best performance. Its compact model size of only 3.98 MB and parameter count of only 1.96 M effectively demonstrate the success of the lightweight design and prove that cross-scale feature fusion can enhance the detection ability of the model. Although the CCFM variant exhibited the smallest resource demand, its accuracy decreased slightly. Balancing model size and accuracy, it was therefore used as the base model for the subsequent ablation experiments, to further explore its performance limits and optimization space.

Table 4 Performance comparison of YOLOv8n with the introduction of different feature fusion modules.
Fig. 10
figure 10

Performance comparison of YOLOv8n with the introduction of different feature fusion modules (The lightweight indices on the Y-axis are Parameters (M), Model Size (MB), and FLOPs (G); the accuracy indices on the Y-axis are Precision (%), mAP50 (%), and F1 (%)).

The incorporation of an attention mechanism can significantly enhance the feature learning capacity of the detection model, allowing the network to concentrate on the target region containing critical information while diminishing the influence of extraneous data. After introducing the CCFM into YOLOv8n, we integrated six attention mechanisms (CA32, ECA33, CBAM34, GAM35, SENetv123, and SENetv222) into the backbone network for the experiments, which enabled the model to focus more attention on prominent features and improve accuracy.

Table 5 summarizes the experimental results. Five of the six attention mechanisms introduced (CA, ECA, GAM, SENetV1, and SENetV2) improved performance, among which the combination with SENetV2 achieved the highest accuracy and mAP50 value. In terms of lightweight metrics (number of parameters, model size, and FLOPs), the combination with SENetV2 was also the best scheme. This shows that SENetV2, which performs squeeze and excitation operations through a multibranch FC layer followed by feature scaling, significantly improves the feature extraction and representation capabilities of the network, making it the first choice for our lightweight detection model. Therefore, the fusion of the CCFM and SENetV2 modules not only maintains the lightweight characteristics of the model but also improves the detection accuracy by optimizing feature extraction. As shown in Figs. 11 and 12, under the joint action of the two modules, the YOLO-Gum model achieves the best performance.

Table 5 Performance comparison of fusion modules with different attention mechanisms.

To analyze the structural basis of the parameter reduction in more detail, the parameters of the backbone and neck components of the improved model and the baseline YOLOv8 are quantified in Table 6. As shown in Table 6, the proposed model introduces a 1.06% increase in backbone parameters through the synergistic optimization of the SENetV2 attention mechanism and the CCFM feature fusion module, while keeping the head structure parameters constant at 751,507. Meanwhile, the neck module exhibits a substantial 23.66% reduction in parameters, resulting in an overall parameter reduction of 7.31% at the network level. This heterogeneous parameter allocation strategy reflects a structured lightweight design paradigm that preserves feature extraction capabilities while improving computational efficiency through optimized feature fusion pathways.

Table 6 Parameter counts comparison between YOLOv8n and our model.
Fig. 11

Precision comparison of fusion modules with different attention mechanisms.

Fig. 12

Lightweight comparison of fusion modules with different attention mechanisms.

We randomly selected several photos from the test subset to demonstrate the detection results of the proposed model. The results are shown in Fig. 13, where the highlighted areas represent the detections of the network. Each detection box is labeled above with the recognized class (peach gummosis) and the associated confidence score. The experimental results demonstrate that the standard YOLOv8n model misses some peach gummosis cases, an omission caused by partial occlusion from overlapping leaves. The introduction of SENetV2 effectively improves the ability of the model to exploit the relationships between feature channels, and the expanded receptive field improves the sensitivity and adaptability of the network to small-object detection. The introduction of the CCFM reduces the number of model parameters and improves computational efficiency. Compared with YOLOv8, the enhanced YOLOv8 exhibits superior detection proficiency and confidence.

Fig. 13

Comparison of detection before and after improvement.

To demonstrate the benefits of the improved model more intuitively, we drew heat maps of the two models under two common scenarios, as shown in Fig. 14, to provide visual insight into the focus of the model prediction. Gradient-weighted class-activation mapping (Grad-CAM)36 guides the gradients of an arbitrary target into the final convolutional layer to generate a coarse localization map that highlights the key regions used by the model for prediction, visualizing the prediction process as a heat map and revealing the feature extraction process and fine details of the image features. Warm-colored areas in the heat map represent regions where the model is confident that peach gummosis is present; cooler regions represent areas where the model is less confident. Notably, low-activation regions do not necessarily lack peach gummosis: lesions can still appear in these regions, depending on how well their features match the characteristics of peach gummosis learned by the model. According to the comparison in Fig. 14, the hotspots of YOLOv8n are concentrated on clear lesions, which may be slightly inadequate for lesions with blurred boundaries and indistinct features. By contrast, YOLO-Gum captures image details more effectively and provides more comprehensive and detailed detection results.
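The Grad-CAM computation described above can be sketched framework-independently: the channel weights are the globally averaged gradients of the target score, and a ReLU keeps only features with a positive influence on the prediction. This is a simplified illustration, not the exact visualization pipeline used here.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Minimal Grad-CAM sketch.

    activations: (C, H, W) feature maps of the final conv layer.
    gradients:   (C, H, W) gradients of the target score w.r.t. those maps.
    Returns an (H, W) heat map normalized to [0, 1].
    """
    # Channel weights: global-average-pool the gradients.
    weights = gradients.mean(axis=(1, 2))                        # (C,)
    # Weighted sum of activation maps, then ReLU to keep positive evidence.
    cam = np.maximum((weights[:, None, None] * activations).sum(axis=0), 0.0)
    # Normalize for display as a heat map.
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam
```

In practice the resulting map is upsampled to the input resolution and overlaid on the image to produce the warm/cool visualizations in Fig. 14.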

Fig. 14

Visualization of heat maps for the original and improved models.

Comparison of different models

To assess the efficacy of the proposed model, identical training and validation sets were used for all models, and the final network weights obtained from training were evaluated on the same validation set. For comparison, the two-stage detection model Faster R-CNN and the one-stage detection models YOLOv3, YOLOv5, YOLOv6, YOLOv8, and YOLOv10 were used. The experimental results are listed in Table 7. As summarized in Table 7, Faster R-CNN achieved an mAP50 of 93.1%, but its model size was approximately 20 times larger and its parameter count 14.9 times higher than those of the improved YOLO-Gum model; it is therefore not ideal for real-time detection of peach gummosis. Among the single-stage YOLO-series models, the improved YOLOv8 proposed in this study performed strongly, achieving a precision of 92.5% and an mAP50 of 76.5% under the same experimental settings. Specifically, compared with YOLOv3, YOLOv5, YOLOv6, and YOLOv8, the improved model achieved precision improvements of 12.37%, 9.86%, 17.24%, and 5.73%, and mAP50 improvements of 12.32%, 0.92%, 3.11%, and 33.77%, respectively. The precision-recall curves show the relationship between precision and recall, providing a more comprehensive performance comparison; the curves of the different models are compared in Fig. 15. In terms of efficiency, as summarized in Table 7, the improved model occupies only 5.57 MB with 2.79 M parameters. Compared with YOLOv3, YOLOv5, YOLOv6, and YOLOv8, the model size is reduced by 97.17%, 7.63%, 32.81%, and 35.46%, and the number of parameters by 73.19%, 20.26%, 34.03%, and 12.54%, respectively. Compared with the latest YOLOv10 model, the improved model achieved a notable 2.7% increase in the crucial mAP50 metric, while its size and parameter count were reduced by 64.65% and 65.30%, respectively. Although the precision of the optimized model is 2.5% lower than that of YOLOv10, its significant advantages in model size, parameter count, and mAP50 make it better aligned with the requirements of the peach gummosis detection field for lightweight, high-precision detection tools.
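For reference, the mAP50 values compared above summarize, per class, the area under the precision-recall curve at an IoU threshold of 0.5. A minimal sketch of the all-point-interpolation AP computation (one common convention; individual frameworks may differ in detail) is:

```python
import numpy as np

def average_precision(recalls, precisions):
    """Area under the precision-recall curve (all-point interpolation),
    i.e., the per-class quantity averaged to obtain mAP50 at IoU 0.5.

    recalls, precisions: arrays of matched (recall, precision) points,
    with recalls in increasing order.
    """
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([1.0], precisions, [0.0]))
    # Precision envelope: make precision monotonically non-increasing.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Integrate precision over the recall steps where recall increases.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```

A perfect detector (precision 1.0 at recall 1.0) yields an AP of 1.0, while a curve that reaches only 50% recall at 50% precision integrates to a much smaller area, which is why the PR curves in Fig. 15 give a fuller picture than precision alone.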

Table 7 Performance comparison of mainstream detection models.
Fig. 15

Comparative analysis of precision-recall curves for different models.

Discussion

DL algorithms offer notable advantages in accuracy and speed when recognizing crop pests and diseases. However, few studies have investigated fruit tree branch and trunk diseases in natural environments. Peach gummosis is a prevalent affliction of peach tree branches and trunks that is pervasive and detrimental to the growth and productivity of peach trees and the associated industry; large-scale detection of peach gummosis therefore represents a significant challenge. This study addresses this issue by employing an attention mechanism and a feature fusion approach to enhance the YOLOv8n algorithm.

The attention mechanism assigns differential weights to the visual features extracted by the model, thereby emphasizing the salient aspects of the image. This allows the network to concentrate on the target area, which holds significant information, while diminishing the influence of extraneous data and reducing the interference of irrelevant backgrounds on detection results. Attention mechanisms can significantly improve the feature learning ability of a detection model, and numerous researchers have incorporated them into their models to improve overall performance. For instance, Bin et al.37 incorporated the CBAM attention mechanism into YOLOv5 to reduce the impediments posed by intricate ambient details. Bao et al.38 proposed a parallel channel-spatial attention module that enhances the representation of the output feature maps and eliminates the interference of weight coefficients under a serial structure. In this study, the SENetV2 attention mechanism was introduced into YOLOv8n to replace part of the original convolutional layers. Through its multibranch structure, it effectively enhances the model's use of the relationships between feature channels, improving the fineness of feature expression, the integration of global information, and, in turn, the representational ability of the model.

The cross-scale feature fusion process poses a significant challenge for peach gummosis detection, largely because of the inherent variability in lesion size. In this study, a lightweight CCFM was introduced into YOLOv8n to address the diverse morphology, intricate color features, and susceptibility to occlusion of peach gummosis in complex natural environments. The objective was to integrate features at different scales through fusion operations, enhancing the adaptability of the model to scale changes and its ability to detect small objects, while effectively combining detailed features and contextual information across feature maps of different resolutions. Feature fusion has emerged as a prominent research topic in agricultural inspection. For instance, Zhang et al.39 effectively filtered out irrelevant features by adding a new multiscale feature adaptive fusion module, Res-Attn, to capture citrus feature information more comprehensively. Similarly, Sun et al.40 proposed a cross-spatial feature fusion mechanism to better fuse the color and texture features of maize leaves. However, current research still faces a challenge: integrating detailed features and contextual information remains difficult for objects of extremely large or small scale, leading to a substantial decrease in detection accuracy for such objects. Future research will investigate the application of the model to lesions of widely varying scales.
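The core cross-scale fusion idea behind the CCFM can be sketched in a few lines: a coarse, semantically rich map is upsampled to the resolution of a finer, detail-rich map and merged with it along the channel axis. This is a conceptual sketch under simplifying assumptions (nearest-neighbor upsampling, channel concatenation), not the module's actual implementation, which additionally applies lightweight convolutions after the merge.

```python
import numpy as np

def upsample_nearest(x, factor):
    """Nearest-neighbor upsampling of a (C, H, W) feature map."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def cross_scale_fuse(high_res, low_res):
    """One cross-scale fusion step: bring the coarse map (low_res) to the
    spatial resolution of the fine map (high_res), then merge the two
    along the channel axis so detail and context coexist in one tensor.
    """
    factor = high_res.shape[1] // low_res.shape[1]
    return np.concatenate([high_res, upsample_nearest(low_res, factor)],
                          axis=0)
```

Fusing, say, a (16, 8, 8) detail map with a (32, 4, 4) context map produces a (48, 8, 8) tensor, which downstream layers can then compress back to the desired channel count.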

Although the model demonstrated proficiency in accurately recognizing peach gummosis, a few concerns remain that require further attention and inquiry. First, future research should continue to expand the peach gummosis dataset and exploit richer data features to ensure that the model maintains high recognition accuracy in complex scenarios featuring irregular lesion shapes and inconspicuous boundaries. Second, although this study achieved a lightweight model that meets the requirements of peach gummosis recognition in complex environments, there remains scope for further lightweight optimization. Finally, the lightweight YOLO-Gum model developed in this study was employed solely for validation and testing on a computer. In future research, the model will be incorporated into edge devices for applications such as disease patrol robots and visual inspection systems.

Conclusions

Although numerous studies have focused on identifying pests and diseases, few have specifically targeted tree branches and trunks. This study considers detection methods for peach gummosis in natural environments, aiming to fill the gap in trunk disease detection and to address the missed and erroneous detections caused by the varied morphology, multiscale nature, and dense distribution of peach gummosis lesions. An optimized YOLOv8 peach gummosis detection model, YOLO-Gum, was successfully constructed. This model introduces two enhanced modules, the CCFM and SENetV2, which significantly improve detection performance. Specifically, key metrics such as precision, mAP50, and recall all improved over the baseline model. Notably, this improvement was achieved without increasing the number of parameters or the size of the model, yielding a lightweight model. The results demonstrate that the YOLO-Gum model achieves a precision of 92.5%, an mAP50 of 76.5%, and an F1 score of 74.3% on the test set. The model size on CPU devices is 5.57 MB, with a parameter count of 2.79 M. The model also maintains excellent performance and robustness under extreme conditions. With high detection accuracy and a compact size, the enhanced model is appropriate for deployment on mobile devices for real-time surveillance and intelligent management of peach gummosis. In smart agriculture, the improved YOLOv8 can be integrated with intelligent disease-patrolling robots using an artificial-intelligence-based disease detection system to achieve efficient, high-quality disease prevention and control.