Introduction

Safflower is a valuable cash crop with medicinal and oilseed properties1. Its market demand is increasing with the growth of research and development in safflower-related products. Safflower plants produce multiple flower clusters that open at different times, so selective harvesting in batches is necessary. At present, safflower harvesting relies mainly on manual multi-batch picking, which is both labour-intensive and costly, making the development of robotic safflower harvesting technology crucial. Current intelligent harvesting equipment relies on GPS/GNSS when travelling autonomously between rows; however, satellite positioning alone cannot compensate for rows that bend during field operations, which can lead to plant damage. Therefore, deep learning methods were introduced to detect safflower clusters in complex farmland environments and to achieve the accurate identification and positioning of clusters within rows. Using safflower clusters as navigation feature points establishes a foundation for the visual navigation and positioning required by safflower harvesting robots.

Two primary approaches are used to detect and recognise flowers2. One relies on image processing techniques that differentiate flowers from the background based on colour differences between flowers and their surroundings; the other is deep learning. Jason et al. employed dual cameras, for near and far recognition, to locate and identify flowers. To reduce image processing time, they applied a basic Bayesian classifier to reduced colour-segmented regions for distant flower recognition; for near recognition, they used an RGB-D camera to reconstruct the plant in three dimensions, identifying flowers with 78.63% accuracy3. Oppenheim et al. (2017) proposed an algorithm for detecting and counting yellow tomato flowers in greenhouse images with 74% accuracy, using adaptive global thresholding, segmentation in the HSV colour space, and morphological cues4. Image processing-based detection techniques rely on analysing and extracting the colour, texture, and shape features of the safflower target. Despite their low computational cost, their stability and accuracy are limited when faced with sudden changes in lighting, small target sizes, and complex backgrounds.

In comparison to conventional image processing techniques, deep learning-based object detection methods, such as convolutional neural networks (CNNs), automatically learn multi-level features of targets without manually specified colour, shape, or other feature parameters. This gives them greater accuracy and resilience, even in challenging environments. The application of deep learning in agriculture has grown rapidly5,6; widely used target detection and recognition techniques include RCNN7, Fast-RCNN8, Mask-RCNN9, SSD10, and You Only Look Once (YOLO)11. Jia et al. (2020) proposed an optimised Mask R-CNN algorithm for recognising apples on fruit trees, achieving a test accuracy of 97.3%, although its computational speed was low12. Tian et al. (2019) employed an enhanced SSD algorithm to detect flowers in VOC2007 and VOC2012, with accuracies of 83.64% and 87.4%, respectively13. Zhang et al. (2022) developed a tomato detection algorithm using an improved RC-YOLOv4, with a mean accuracy of 94.44% and a detection speed of 10.71 frames/s14. The YOLO family strikes a better balance between detection accuracy and speed, making it more suitable for deployment on mobile devices. However, detection performance still degrades when coping with the small size and high density of safflower clusters in the field.

To address these issues, this study presents SF-YOLO, an enhanced lightweight safflower detection algorithm based on YOLOv5 that tackles the inaccurate detection and poor robustness encountered in complex safflower field environments, both of which impede precise automated harvesting and visual navigation. The main contributions are as follows: (1) YOLOv5 was employed as the base model, with the lightweight Ghost_conv convolution replacing blocks in the original backbone network; (2) the CBAM attention mechanism was introduced to suppress unimportant features and enhance the model's adaptive feature fusion capability; (3) a combined loss function integrating CIOU and NWD was proposed to accelerate the convergence of the model loss; and (4) the original COCO-derived anchors were updated using the K-means algorithm on field safflower clusters, enhancing the model's multi-scale safflower detection capability. This offers a viable path for improving safflower harvesting machinery.

Materials and methods

Data acquisition

The images used in this research were acquired at Hongqi Farm in Jimusar County, Changji Prefecture, Xinjiang Uygur Autonomous Region, China (89°12ʹE, 44°24ʹN). The safflower cultivar was Jihong 1. An Intel RealSense D455 depth camera and an HONOR 20 smartphone, mounted directly in front of a self-built safflower picking robot, were used for image acquisition, as shown in Fig. 1. The viewing angle ranged from 20 to 45°, and images were captured at resolutions of 1080 × 720 and 1920 × 1080. To enhance the robustness and generalisation of the model, images were collected at different times and under different weather conditions: early morning, noon, and afternoon on sunny days (Fig. 2a–c), overcast conditions (Fig. 2d), dusk (Fig. 2e), and night with LED fill-in lighting (Fig. 2f). The dataset was further augmented through rotation, noise perturbation, blurring, and colour transformation.
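For reference, the augmentation operations listed above (rotation, noise perturbation, blurring, and colour transformation) can be sketched as follows, assuming OpenCV and NumPy; the parameter values are illustrative and not those used in this study.

```python
import cv2
import numpy as np

def augment(image: np.ndarray) -> dict:
    """Illustrative versions of the four augmentations used to expand the dataset."""
    h, w = image.shape[:2]

    # Rotation about the image centre (15 degrees here, chosen arbitrarily).
    M = cv2.getRotationMatrix2D((w / 2, h / 2), 15, 1.0)
    rotated = cv2.warpAffine(image, M, (w, h))

    # Additive Gaussian noise perturbation.
    noise = np.random.normal(0, 10, image.shape).astype(np.float32)
    noisy = np.clip(image.astype(np.float32) + noise, 0, 255).astype(np.uint8)

    # Gaussian blurring.
    blurred = cv2.GaussianBlur(image, (5, 5), 1.5)

    # Colour transformation: shift hue and saturation in HSV space.
    hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV).astype(np.int16)
    hsv[..., 0] = (hsv[..., 0] + 10) % 180           # hue shift (OpenCV hue range 0-179)
    hsv[..., 1] = np.clip(hsv[..., 1] + 20, 0, 255)  # saturation shift
    recoloured = cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)

    return {"rotated": rotated, "noisy": noisy, "blurred": blurred, "recoloured": recoloured}
```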

Figure 1. Safflower picking robot in the field.

Figure 2. Safflower images captured under diverse weather and light conditions: (a) sunny morning; (b) sunny noon; (c) sunny afternoon; (d) overcast; (e) dusk; (f) nighttime fill light.

Dataset production

A varied dataset was created by combining images captured from different angles and under different lighting and weather conditions, so that automated harvesting equipment can adapt to a range of working conditions for both identification and harvesting. In the field, safflower filaments display shades of reddish-orange or, when immature, yellowish-orange, depending on maturity. Given current market demand for safflower filaments, the two colours are not differentiated and are graded alike. Because safflower plants grow unevenly, may be flattened by weather, and can shade one another or suffer significant leaf damage, safflower targets occluded by more than 75% were excluded during annotation. Bounding boxes were drawn to exclude flower buds, stalks, and leaves. Figure 3 depicts the Jihong 1 safflower variety. The data were annotated with the open-source tool LabelImg in PASCAL VOC format, using the label name "safflower." An example annotation is shown in Fig. 4.
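As a small worked example, the LabelImg PASCAL VOC annotations can be read back as follows; the annotation folder name is hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

def load_voc_boxes(xml_path, label="safflower"):
    """Return (xmin, ymin, xmax, ymax) boxes for the given label in one VOC XML file."""
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.iter("object"):
        if obj.findtext("name") != label:
            continue
        bb = obj.find("bndbox")
        boxes.append(tuple(int(float(bb.findtext(t))) for t in ("xmin", "ymin", "xmax", "ymax")))
    return boxes

# Collect every annotated safflower box in a (hypothetical) annotations folder.
all_boxes = [b for f in Path("annotations").glob("*.xml") for b in load_voc_boxes(f)]
```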

Figure 3. Schematic diagram of Jihong 1.

Figure 4. Example of annotated safflower data.

Safflower cluster detection

YOLOv5 network architecture

The YOLOv5 network comprises an input layer, a backbone network, a neck network, and a prediction output layer. The input image passes through the convolutional layers of the backbone to produce feature maps, which are forwarded to the neck for multi-scale feature fusion via upsampling and downsampling. The fused features output from layers 17, 20, and 23 are passed to the prediction layer, where confidence scores, predicted categories, and bounding-box coordinates are obtained after non-maximum suppression and other post-processing15. The YOLOv5 network structure is presented in Fig. 5.
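For illustration, the final post-processing step (confidence filtering followed by non-maximum suppression) can be sketched with torchvision's NMS operator; the thresholds shown are common defaults, not necessarily the values used in this study.

```python
import torch
from torchvision.ops import nms

def postprocess(boxes: torch.Tensor, scores: torch.Tensor,
                conf_thres: float = 0.25, iou_thres: float = 0.45):
    """boxes: (N, 4) in (x1, y1, x2, y2); scores: (N,) objectness * class confidence."""
    keep = scores > conf_thres                  # confidence filtering
    boxes, scores = boxes[keep], scores[keep]
    idx = nms(boxes, scores, iou_thres)         # non-maximum suppression
    return boxes[idx], scores[idx]
```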

Figure 5. YOLOv5 network structure.

YOLOv5 model version determination

YOLOv5 has five versions that differ in network depth and width (the number of residual modules and channels). We evaluated YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x at detecting safflower, training each on the self-constructed safflower dataset to identify the most appropriate version. The findings are displayed in Table 1. Each model was assessed by its precision, F1-score, mean average precision (mAP), GFlops (computational cost), and number of parameters (Params), where

$$\text{Precision} = \frac{TP}{TP + FP}$$
(1)
$$F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
(2)
$$mAP = \frac{1}{N}\sum_{i = 1}^{N} AP_{i}$$
(3)
$$GFlops = L\left( \sum_{i = 1}^{N} K_{i}^{2} \times C_{i-1}^{2} + \sum_{i = 1}^{N} M^{2} \times C_{i} \right)$$
(4)
$$Params = L\left( \sum_{i = 1}^{N} M_{i}^{2} \times K_{i}^{2} \times C_{i-1} \times C_{i} \right).$$
(5)
Table 1 Training results for each version of YOLOv5.

True positives (TPs) are cases where the model correctly identifies a safflower, false positives (FPs) are cases where the model predicts a safflower where there is none, and false negatives (FNs) are safflowers the model fails to detect; Recall is defined as TP/(TP + FN). The average precision (AP) is the area under the precision-recall curve. In Eqs. (4) and (5), K denotes the kernel size, C the number of channels, M the size of the input image, and i the summation index over the N layers.
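A minimal sketch of Eqs. (1)-(3), computed from confusion counts and per-class AP values; the numbers in the example are illustrative only.

```python
def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def f1_score(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0

def mean_ap(ap_per_class):
    """mAP (Eq. 3): mean of the per-class average precisions."""
    return sum(ap_per_class) / len(ap_per_class)

# Illustrative values only, not results from this study.
p, r = precision(tp=90, fp=8), recall(tp=90, fn=10)
print(f1_score(p, r), mean_ap([0.95]))
```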

YOLOv5x, YOLOv5l, and YOLOv5m have greater detection accuracy than YOLOv5n, but their more intensive computation makes them unsuitable for mobile use. YOLOv5n is more lightweight, but has poor accuracy. Balancing computation and accuracy, we selected YOLOv5s, and enhanced it to reduce computation (params, GFlops) while maintaining detection accuracy.

Improvement strategies

We present SF-YOLO, a safflower detection model designed to reduce the computational burden while coping with the complex backgrounds and variable conditions of safflower farmland. To this end, we replace the standard convolutional blocks in the backbone network with lighter Ghost_conv blocks. We embed the CBAM attention module after the SPPF module in the backbone, enabling the model to focus on relevant information and enhancing its adaptive fusion ability, which ultimately improves the recognition rate. The original LGIOU loss function is replaced by the fused LCIOU+NWD loss function, and the initial YOLOv5 anchors are refined with K-means clustering to better suit small and medium-sized safflower targets. Figure 6 depicts the structure of the improved SF-YOLO network.

Figure 6. SF-YOLO network structure.

Model lightweighting

Given the limited computational resources of mobile platforms, the original model is too large to deploy. The numbers of parameters and computations are reduced by replacing the CBS modules in the backbone network with the Ghost_conv module from GhostNet16. Its first phase applies a conventional convolution to produce feature maps with fewer channels, lowering computational demands; inexpensive (cheap) operations are then applied to these feature maps to generate additional ones, further reducing computation. The two groups of feature maps are concatenated to form the output, as shown in Fig. 7.

Figure 7. Ghost_conv schematic diagram.
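A minimal PyTorch sketch of the Ghost convolution idea described above, following the GhostNet formulation16: an ordinary convolution produces half of the output channels, a cheap depthwise operation produces the rest, and the two groups are concatenated. It is an illustration under these assumptions, not the exact implementation used in this study.

```python
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    """Ghost convolution: intrinsic channels from a standard convolution plus
    'ghost' channels from a cheap depthwise operation, concatenated."""
    def __init__(self, c_in: int, c_out: int, k: int = 1, stride: int = 1):
        super().__init__()
        c_hidden = c_out // 2
        # Primary (ordinary) convolution with a reduced number of output channels.
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, c_hidden, k, stride, k // 2, bias=False),
            nn.BatchNorm2d(c_hidden), nn.SiLU())
        # Cheap operation: 5x5 depthwise convolution on the primary features.
        self.cheap = nn.Sequential(
            nn.Conv2d(c_hidden, c_hidden, 5, 1, 2, groups=c_hidden, bias=False),
            nn.BatchNorm2d(c_hidden), nn.SiLU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)

# Example: a drop-in replacement for a 3x3, stride-2 CBS block.
layer = GhostConv(64, 128, k=3, stride=2)
```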

CBAM attention mechanism

Woo et al. (2018) proposed CBAM, a mechanism that emphasises important features in both the channel and spatial dimensions and suppresses unimportant ones, improving the accuracy of safflower detection17. The process is as follows. First, for channel attention, the input feature map F (H × W × C) is subjected to global max pooling and global average pooling; the two resulting descriptors are fed into a shared multilayer perceptron (MLP), their outputs are summed, and sigmoid activation produces the channel attention map M_c, as shown in Eq. (6). Second, the channel-refined feature F1 = M_c × F (Eq. 7) undergoes max pooling and average pooling along the channel dimension; the resulting maps are concatenated and passed through a 7 × 7 convolution followed by sigmoid activation to generate the spatial attention map M_s, as shown in Eq. (8). The channel and spatial attention mechanisms are illustrated in Fig. 8.

$$M_{c} = Sigmoid\left( MLP\left( Avgpooling\left( F \right) \right) + MLP\left( Maxpooling\left( F \right) \right) \right)$$
(6)
$$F_{1} = M_{c} \times F$$
(7)
$$M_{s} = Sigmoid\left( conv^{7 \times 7}\left( \left[ Avgpooling\left( F_{1} \right); Maxpooling\left( F_{1} \right) \right] \right) \right)$$
(8)
$$F_{2} = M_{s} \times F_{1},$$
(9)

where F is the input feature map; F1 is the feature map refined by channel attention; and F2 is the output feature of the CBAM module.

Figure 8. Schematic diagram of the CBAM attention mechanism.
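A minimal PyTorch sketch of CBAM as described by Eqs. (6)-(9); the channel-reduction ratio of 16 is a common default and an assumption here.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Shared MLP for channel attention (Eq. 6), implemented with 1x1 convolutions.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1, bias=False))
        # 7x7 convolution for spatial attention (Eq. 8).
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # Channel attention: M_c = sigmoid(MLP(avgpool(F)) + MLP(maxpool(F))).
        avg = f.mean(dim=(2, 3), keepdim=True)
        mx = f.amax(dim=(2, 3), keepdim=True)
        m_c = torch.sigmoid(self.mlp(avg) + self.mlp(mx))
        f1 = m_c * f                                          # Eq. (7)
        # Spatial attention: 7x7 conv over concatenated channel-pooled maps.
        pooled = torch.cat([f1.mean(dim=1, keepdim=True),
                            f1.amax(dim=1, keepdim=True)], dim=1)
        m_s = torch.sigmoid(self.spatial(pooled))
        return m_s * f1                                       # Eq. (9)
```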

Improvement of K-means-based anchor frame mechanism

The initial YOLOv5 anchors are obtained by K-means clustering on the COCO dataset and are further adjusted by a genetic algorithm during training. Anchor size influences the convergence speed and accuracy of the model. The safflower dataset produced in this study contains mainly small and medium-sized targets, whereas the 80 COCO categories span a much wider range of sizes, so the default YOLOv5 anchors are not well suited to the constructed dataset. Therefore, K-means clustering was applied to the labelled safflower boxes to obtain new anchors; the clustering results are shown in Table 2.
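A minimal sketch of clustering the labelled box widths and heights into nine anchors with plain K-means; YOLOv5's own autoanchor additionally applies an IoU-based fitness check and a genetic refinement, which are omitted here.

```python
import numpy as np

def kmeans_anchors(wh: np.ndarray, k: int = 9, iters: int = 100, seed: int = 0):
    """Cluster (width, height) pairs into k anchors with standard K-means."""
    rng = np.random.default_rng(seed)
    centres = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        # Assign each labelled box to the nearest centre in width-height space.
        dists = np.linalg.norm(wh[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centres; keep the old centre if a cluster becomes empty.
        centres = np.array([wh[labels == i].mean(axis=0) if np.any(labels == i)
                            else centres[i] for i in range(k)])
    return centres[np.argsort(centres.prod(axis=1))]  # sorted by anchor area

# wh: an N x 2 array of labelled safflower box sizes in pixels (e.g. from the VOC files).
```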

Table 2 Safflower anchor frame update results.

Loss function improvement

The loss function measures the difference between the predicted and true bounding boxes of a model. YOLOv5 can use LGIOU, LDIOU, or LCIOU as the bounding-box regression loss, where

$$L_{DIOU} = 1 - IOU + \frac{D_{2}^{2}}{D_{C}^{2}}$$
(10)
$$L_{CIOU} = 1 - IOU + \frac{D_{2}^{2}}{D_{C}^{2}} + \alpha v$$
(11)
$$\alpha = \frac{v}{\left( 1 - IOU \right) + v}$$
(12)
$$v = \frac{4}{\pi^{2}}\left( \arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h} \right)^{2},$$
(13)

where IOU is the intersection over union; \(D_{2}\) is the Euclidean distance between the centres of the predicted and ground-truth boxes; \(D_{C}\) is the diagonal length of the smallest enclosing region containing both boxes; v measures the consistency of the aspect ratios; \(w^{gt}\) and \(h^{gt}\) are the width and height of the ground-truth box; and w and h are the width and height of the predicted box.
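Equations (10)-(13) translate directly into code; the following sketch computes LCIOU for boxes in (x1, y1, x2, y2) format and omits the numerical-stability details of the official YOLOv5 implementation.

```python
import math
import torch

def ciou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """L_CIOU = 1 - IOU + d^2/c^2 + alpha*v for boxes given as (x1, y1, x2, y2)."""
    # Intersection and union.
    x1 = torch.max(pred[..., 0], target[..., 0])
    y1 = torch.max(pred[..., 1], target[..., 1])
    x2 = torch.min(pred[..., 2], target[..., 2])
    y2 = torch.min(pred[..., 3], target[..., 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    wp, hp = pred[..., 2] - pred[..., 0], pred[..., 3] - pred[..., 1]
    wg, hg = target[..., 2] - target[..., 0], target[..., 3] - target[..., 1]
    iou = inter / (wp * hp + wg * hg - inter + eps)
    # Squared centre distance d^2 and squared enclosing-box diagonal c^2.
    d2 = ((pred[..., 0] + pred[..., 2] - target[..., 0] - target[..., 2]) ** 2 +
          (pred[..., 1] + pred[..., 3] - target[..., 1] - target[..., 3]) ** 2) / 4
    cw = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])
    ch = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])
    c2 = cw ** 2 + ch ** 2 + eps
    # Aspect-ratio consistency term v and balancing weight alpha (Eqs. 12-13).
    v = (4 / math.pi ** 2) * (torch.atan(wg / (hg + eps)) - torch.atan(wp / (hp + eps))) ** 2
    alpha = v / ((1 - iou) + v + eps)
    return 1 - iou + d2 / c2 + alpha * v
```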

The safflower field environment is complex, and safflower targets are not only small and medium-sized but also numerous. LCIOU provides more comprehensive loss information than LGIOU and LDIOU by considering the shape, size, centre position, and aspect-ratio error of the bounding box; by penalising the distance between box centres, it yields more accurately positioned boxes. On this basis, the normalised Wasserstein distance (NWD) is integrated and a balancing coefficient β is introduced to accelerate the convergence of the model loss. We set β to 0.8; the NWD is calculated as

$$W_{2}^{2}\left( N_{a}, N_{b} \right) = \left\| \left[ x_{A}, y_{A}, \frac{w_{A}}{2}, \frac{h_{A}}{2} \right]^{T} - \left[ x_{B}, y_{B}, \frac{w_{B}}{2}, \frac{h_{B}}{2} \right]^{T} \right\|_{2}^{2}$$
(14)
$$NWD\left( N_{a}, N_{b} \right) = \exp\left( - \frac{\sqrt{W_{2}^{2}\left( N_{a}, N_{b} \right)}}{C} \right)$$
(15)
$$L_{NWD} = 1 - NWD\left( N_{a}, N_{b} \right),$$
(16)

where \(W_{2}^{2}\left( N_{a}, N_{b} \right)\) is the squared Wasserstein distance between the Gaussian distributions \(N_{a}\) and \(N_{b}\) modelling bounding boxes A and B; \(x_{A}\), \(y_{A}\), \(x_{B}\), \(y_{B}\) are the centre coordinates of boxes A and B; \(w_{A}\), \(h_{A}\), \(w_{B}\), \(h_{B}\) are the widths and heights of boxes A and B; C is a normalisation constant; and \(NWD\left( N_{a}, N_{b} \right)\) is the resulting similarity measure between boxes A and B.

In this study, the original LGIOU is replaced by the fused LCIOU+NWD as the loss function of SF-YOLO. LCIOU builds on LDIOU by introducing the term αv, where α is a balancing parameter that is not involved in the gradient calculation18,19. The total loss LCIOU+NWD after the improvement is

$$L_{CIOU + NWD} = \left( 1 - \beta \right) \cdot L_{NWD} + \beta \cdot L_{CIOU}.$$
(17)
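A minimal sketch of Eqs. (14)-(17), reusing the ciou_loss sketch above; the normalisation constant C is dataset-dependent, and the value shown is only illustrative.

```python
import torch

def nwd_loss(pred: torch.Tensor, target: torch.Tensor, c: float = 12.8) -> torch.Tensor:
    """L_NWD = 1 - exp(-sqrt(W2^2)/C) for boxes given as (x1, y1, x2, y2).
    The constant C is dataset-dependent; 12.8 is only an illustrative value."""
    def to_gauss(b):
        cx, cy = (b[..., 0] + b[..., 2]) / 2, (b[..., 1] + b[..., 3]) / 2
        w, h = b[..., 2] - b[..., 0], b[..., 3] - b[..., 1]
        return torch.stack([cx, cy, w / 2, h / 2], dim=-1)           # [x, y, w/2, h/2]
    w2_sq = ((to_gauss(pred) - to_gauss(target)) ** 2).sum(dim=-1)   # Eq. (14)
    return 1 - torch.exp(-torch.sqrt(w2_sq) / c)                     # Eqs. (15)-(16)

def fused_box_loss(pred, target, beta: float = 0.8):
    """Eq. (17): L = (1 - beta) * L_NWD + beta * L_CIOU, with beta = 0.8."""
    return (1 - beta) * nwd_loss(pred, target) + beta * ciou_loss(pred, target)
```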

Figure 9 shows the loss curves before and after the improvement; the improved loss converges more quickly over the first 100 training epochs.

Figure 9. Loss curves before and after the loss function improvement.

Results and discussion

Table 3 shows the experimental environment configuration. The dataset comprised 6554 safflower images, divided into training, validation, and test sets in a 7:2:1 ratio, giving 4970 images for training, 1420 for validation, and 710 for testing. The initial learning rate was set to 0.01 and was adjusted during training to accelerate convergence. Considering memory limitations, the batch size was set to 18, training ran for 300 epochs, and an input resolution of 640 × 640 was used.
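A minimal sketch of the 7:2:1 split described above, assuming all images sit in a single folder; the folder name is hypothetical.

```python
import random
from pathlib import Path

def split_dataset(image_dir: str, seed: int = 0):
    """Shuffle the image list and split it 7:2:1 into train/validation/test subsets."""
    paths = sorted(Path(image_dir).glob("*.jpg"))
    random.Random(seed).shuffle(paths)
    n_train, n_val = int(0.7 * len(paths)), int(0.2 * len(paths))
    return (paths[:n_train],                    # training set
            paths[n_train:n_train + n_val],     # validation set
            paths[n_train + n_val:])            # test set

train, val, test = split_dataset("safflower_images")  # hypothetical folder name
```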

Table 3 Experimental environment configuration and hardware parameters.

Ablation experiments

Ablation experiments were conducted on a test set of 710 images captured at various times of day. They assessed the effect of introducing the Ghost_conv module, updating the anchor boxes via K-means clustering, incorporating the CBAM attention mechanism, and fusing the loss function on SF-YOLO's ability to detect safflower clusters in the field. The results are presented in Table 4. Improving the backbone network structure reduced model complexity and the number of parameters while enhancing overall performance and accuracy. Compared with the initial YOLOv5s structure, introducing Ghost_conv decreased GFlops and Params by 15.2% and 24.4%, respectively, at the cost of a 0.8% drop in mAP. Adding CBAM after the SPPF layer of the backbone, replacing the loss function with the fused LCIOU+NWD, and updating the anchors changed GFlops and Params only slightly but increased mAP by 2.1%. After all improvements, Precision rose from 91.9% in the original model to 94.1%, while Recall declined slightly, from 92.6% to 92.3%. Because both mAP0.5 and Precision increased, the model detects true positives more reliably while producing fewer false detections, making it suitable for practical detection and automated harvesting tasks in safflower fields.

Table 4 Results of ablation experiments.

Experiments comparing performance of different models

To verify the comprehensive performance of the proposed SF-YOLO model for inter-row safflower detection, it was compared with the Faster R-CNN, SSD, YOLOv3-tiny, YOLOv4, YOLOv8, YOLOv5, and YOLOv9-c models. The results in Table 5 show that SF-YOLO achieved mAP values 31.4%, 17.9%, 7.1%, 10.65%, 1.5%, and 0.2% higher than those of Faster R-CNN, SSD, YOLOv3-tiny, YOLOv4, YOLOv5, and YOLOv9-c, respectively. Owing largely to the replacement of the original convolutional blocks in the backbone network, the GFlops of SF-YOLO were lower than those of Faster R-CNN, SSD, YOLOv3-tiny, YOLOv4, YOLOv8, YOLOv5, and YOLOv9-c by 927.17 G, 259.17 G, −1.1 G, 127.9 G, 14.4 G, 1.8 G, and 222.6 G, respectively, and its Params were lower by 22.49 M, 17.62 M, 2.68 M, 57.95 M, 105.27 M, 1.02 M, and −8.9 M, respectively. YOLOv8s matches the mAP of SF-YOLO but has GFlops and Params that are higher by 14.4 G and 105.27 M, respectively. Furthermore, although YOLOv9 has demonstrated considerable promise on a number of publicly available datasets, it has not yet been fully adapted to the specific challenges of this task, such as small safflower sizes and dense distributions, and its mAP0.5 and F1 scores were both slightly inferior to those of SF-YOLO. The F1 score of SF-YOLO is markedly higher than those of the other models, indicating greater precision in recognising small safflower targets and robust overall performance. Considering both accuracy and model complexity, SF-YOLO is better suited to deployment for safflower detection on mobile devices.

Table 5 Comparative experimental results of different models.

SF-YOLO model detection effect

The safflower detection results of the improved SF-YOLO under different light and weather conditions are shown in Fig. 10. To ensure comprehensive and objective results, six scenarios were tested: sunny morning, sunny noon, sunny afternoon, dusk, nighttime fill light, and overcast. Figure 11 shows the detection results of the original YOLOv5 model, where red boxes mark successfully detected safflowers and blue boxes mark safflower targets that were missed. SF-YOLO maintains stable and accurate detection under the different light and weather conditions, whereas under strong lighting changes the initial YOLOv5s model can fail to detect some safflower targets during inference, as shown in Fig. 11a–f. These results verify the superiority of SF-YOLO in detecting safflower across farmland scenarios.

Figure 10. SF-YOLO detection results: (a) sunny morning; (b) sunny noon; (c) sunny afternoon; (d) dusk; (e) nighttime fill light; (f) overcast.

Figure 11. Detection results of the initial YOLOv5s model: (a) sunny morning; (b) sunny noon; (c) sunny afternoon; (d) dusk; (e) nighttime fill light; (f) overcast.

Model visualisation and analysis

We used heat maps to visualise, compare, and analyse the detection behaviour of the YOLOv5s and SF-YOLO models; the colour intensity around each target centre indicates the attention weight the model assigns to different regions and features. To further examine the models' internal attention to regional features, the features output by the penultimate layer were visualised with gradient-weighted class activation mapping (Grad-CAM). Figure 12 shows the attention of SF-YOLO under different light and weather conditions, and Fig. 13 shows that of the original model. SF-YOLO exhibits better robustness and adaptability in complex farmland environments.
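A minimal sketch of gradient-weighted activation mapping using forward and backward hooks on a chosen layer; it illustrates the Grad-CAM idea with a summed-output score as a simplification, not the exact tooling used in this study.

```python
import torch
import torch.nn.functional as F

def grad_cam(model: torch.nn.Module, layer: torch.nn.Module, image: torch.Tensor) -> torch.Tensor:
    """Return an (H, W) heat map of the layer's gradient-weighted activations.
    `image` is a (1, 3, H, W) tensor; the scalar backpropagated here is simply
    the summed output, a simplification of using a specific detection score."""
    feats, grads = [], []
    h1 = layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    try:
        out = model(image)
        score = out.sum() if isinstance(out, torch.Tensor) else out[0].sum()
        model.zero_grad()
        score.backward()
    finally:
        h1.remove()
        h2.remove()
    weights = grads[0].mean(dim=(2, 3), keepdim=True)          # channel-wise weights
    cam = F.relu((weights * feats[0]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
    return (cam / (cam.max() + 1e-7)).squeeze()                # normalised heat map
```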

Figure 12. SF-YOLO heat maps: (a) sunny morning; (b) sunny noon; (c) sunny afternoon; (d) dusk; (e) nighttime fill light; (f) overcast.

Figure 13. YOLOv5s heat maps: (a) sunny morning; (b) sunny noon; (c) sunny afternoon; (d) dusk; (e) nighttime fill light; (f) overcast.

Conclusion

To achieve accurate and fast detection of safflower in a complex environment with limited computational capacity, we proposed an improved target detection model, SF-YOLO. It adopts Ghost_conv in place of the original convolutional blocks in the backbone network to improve computational efficiency, embeds the CBAM attention mechanism in the backbone, and introduces the fused LCIOU+NWD loss function, combining LCIOU and LNWD, to extract features more accurately, enhance the model's adaptive fusion ability, and accelerate loss convergence. Anchors obtained by K-means clustering replace the original anchors to better adapt to multi-scale safflower targets in farmland. Data augmentation techniques, including Gaussian blurring, Gaussian noise, sharpening, and channel disruption, further improved the generalisation ability of the model, making it robust to changes in lighting, noise, and viewing angle. In experiments, SF-YOLO outperformed the original YOLOv5s model: GFlops and Params decreased from 15.8 G to 13.2 G and from 7.013 M to 5.34 M, reductions of 16.6% and 23.9%, respectively, while mAP0.5 improved by 1.3%, to 95.3%. These results indicate that SF-YOLO can accurately and efficiently detect safflower in complex environments on devices with limited computational power.

In summary, SF-YOLO achieved the efficient and accurate detection of safflower in a complex environment with limited computational capacity, which is of great significance for the practical application and development of agricultural automation technology.

Although SF-YOLO achieves accurate detection of safflower in diverse farmland environments, shortcomings remain: (1) safflower varieties are diverse, but the dataset covered only the 'Jihong 1' variety; and (2) the possibility of multiple overlapping safflowers was not considered, so the model may treat overlapping flowers as a single safflower. In future work, instance segmentation will be explored to better address misrecognition caused by overlapping flowers.