Introduction

With improvements in socioeconomic levels, edible fungi have become widely recognized by consumers in the Chinese market because of their unique nutritional value and taste1. In recent years, Oudemansiella raphanipes has become an important pillar industry for increasing the income of farmers in many regions. As consumer demand for Oudemansiella raphanipes continues to increase, factory-based cultivation has gradually become mainstream. The growth cycle of Oudemansiella raphanipes is relatively short (three to four months)2, necessitating prompt harvesting upon reaching maturity. However, traditional methods for identifying the growth stage of Oudemansiella raphanipes rely mainly on manual monitoring, and this approach is inadequate in terms of accuracy and efficiency, thereby failing to meet the needs of modern large-scale agricultural production. Moreover, deep learning techniques have not yet been applied to identify the growth stages of Oudemansiella raphanipes. Therefore, the goal of this study was to develop an efficient model for identifying the growth stage of Oudemansiella raphanipes.

The identification of the growth stage of Oudemansiella raphanipes can be considered an object detection task. With the advancement of deep learning, anchor-based detection algorithms have been extensively applied to a variety of crops, including grains, fruits, and vegetables3,4,5,6. Zhang et al. (2022) utilized an improved Faster region-based convolutional neural network (R-CNN) to detect the developmental stages of rice spikes3. Yadav et al. (2022) deployed the YOLOv5 algorithm to detect volunteer cotton plants within cornfields across critical growth phases4. Zha et al. implemented a RetinaNet-based adaptive training sample selection (ATSS) model to recognize different growth stages of grass mushrooms5. Almalky and Ahmed (2023) employed deep learning architectures, such as RetinaNet with ResNet-101-FPN, to detect and classify the growth stages of Consolida regalis weed6. However, anchor-based algorithms such as Faster R-CNN, YOLOv5 and RetinaNet have notable limitations. These methods depend on predefined anchor boxes, which complicate the model and accommodate diverse object sizes and shapes poorly. They also produce a disparity between positive and negative samples, with most anchors covering only the background, which skews the training process7. Additionally, manually adjusting anchor parameters requires substantial expertise, prolonging the tuning process8.

In recent years, anchor-free object detection methods have been widely applied in the field of object detection. These methods, which do not require predefined anchor boxes, offer reduced model complexity and enhanced detection stability8. For example, Liu et al. (2022) developed TomatoDet for the robust detection of tomatoes at different growth stages in greenhouses by integrating an attention mechanism and a novel circle representation into the CenterNet backbone9. Xie et al. (2023) developed FCOS-FL, a model for detecting various categories of litchi leaf diseases and pests. This model is particularly effective on hard-to-detect targets such as Mayetiola sp. and litchi algal spot10. Among the anchor-free object detection methods, the fully convolutional one-stage object detector (FCOS) is an innovative algorithm that performs object classification and location regression directly on feature maps11. This approach enhances both efficiency and accuracy. In this model, ResNet-50 is deployed as the backbone, and three parallel subnetworks are incorporated as heads for classification, centerness prediction, and bounding box regression, streamlining the structure. In addition, a multi-level FPN structure is implemented in FCOS. The FPN fuses high-level semantic information with low-level spatial detail, and its multi-level assignment of targets helps resolve the ambiguity that arises when a location falls within overlapping ground truth boxes11. The introduction of a centerness prediction subnetwork improves localization accuracy by focusing on the central areas of targets11. The overall loss combines three terms: a classification loss that uses the focal loss to handle class imbalance, a centerness loss that uses binary cross-entropy to improve the accuracy of the predicted centerness, and a regression loss that uses the intersection over union (IoU) to increase the precision of the predicted bounding boxes11. This comprehensive approach optimizes performance and precision in predicting object locations. FCOS stands out among anchor-free models for its simplified design, reduced number of hyperparameters, and ease of tuning. Moreover, FCOS demonstrates superior flexibility and generalizability and effectively handles targets of various sizes and shapes, with notable performance in detecting small objects. Overall, FCOS shows outstanding performance and efficiency in object detection owing to its unique design and optimization strategies.
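For reference, the centerness target defined in the original FCOS work11 for a location with regression targets \(l^*\), \(t^*\), \(r^*\) and \(b^*\) (its distances to the left, top, right and bottom sides of the ground truth box) is

$$\begin{aligned} \text {centerness}^* = \sqrt{\frac{\min (l^*, r^*)}{\max (l^*, r^*)} \times \frac{\min (t^*, b^*)}{\max (t^*, b^*)}} \end{aligned}$$

and at inference the predicted centerness is multiplied with the classification score, down-weighting low-quality boxes produced by locations far from object centres before non-maximum suppression11.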

However, despite the effectiveness of FCOS and other deep learning-based object detection methods, these models often require significant computational resources, making them challenging to deploy on devices with limited processing power. This limitation is particularly relevant in agricultural applications, as onsite real-time analysis is crucial. To address this issue, model compression techniques such as pruning have become indispensable tools in optimizing deep learning models for practical deployment12,13. For example, Liu et al. (2022) explored discrimination-aware network pruning, focusing on retaining discriminative features during compression to ensure minimal performance loss with deep models14. Furthermore, Guo et al. (2021) developed a progressive channel pruning approach, enabling gradual model compression to achieve the desired compression ratios with minimal impact on model performance15.

To address these challenges, it is necessary to accurately identify the growth stages of Oudemansiella raphanipes under resource-constrained conditions. In this study, we developed the ORaph8K image dataset with comprehensive annotations and proposed the OR-FCOS model, a lightweight framework for precise growth stage identification that is optimized for deployment on edge devices such as the Jetson Nano. The main contributions of this article are as follows. Firstly, the OR-FCOS model incorporates the MobileNetV3-Large backbone with the EMA module, effectively replacing the more computationally demanding ResNet-50 in the FCOS framework. Secondly, the model replaces the conventional FPN with NAS-FPN and introduces NAS-FCOS-Head, refining the processes of feature fusion and prediction. Thirdly, by integrating the CIoU loss function, the model takes into account aspect ratio consistency and the distance between bounding box centers. Fourthly, a channel pruning technique is utilized to remove redundant parameters from the NAS-FPN and NAS-FCOS-Head components, reducing model size and computational requirements without sacrificing accuracy.

Materials and methods

Data collection

The image data were sourced from Oudemansiella raphanipes grown under real environmental conditions. Smartphones with cameras (iPhone 1216, OPPO A317, Honor 9i18 and Huawei Nova 5 Pro19) were used for sampling. The sampled images of Oudemansiella raphanipes were taken from random angles (images were manually captured from various positions around the plant to ensure random shooting angles), against diverse backgrounds, and under various lighting conditions (we took photos of the plants at different times of the day to create a diverse range of illumination scenarios) at different growth stages to increase the diversity of the collected data. Every image contains multiple Oudemansiella raphanipes samples at different stages of growth. After eliminating similar images, the dataset includes a total of 8,000 images. We then used a Python program to randomly split the images into training, validation, and test sets with an 8:1:1 ratio. We present some representative images of Oudemansiella raphanipes at different growth stages within the dataset (Fig. 1). The dataset was annotated via the ISAT_with_segment_anything tool available on GitHub20, and the images were classified into four stages of growth on the basis of their developmental progression: stage 1 (seedling stage), stage 2 (growing stage), stage 3 (maturity stage), and stage 4 (flowering stage). In the agricultural production process, Oudemansiella raphanipes is typically harvested during the third stage (maturity stage). Harvesting too early (e.g., in the first and second stages) can result in lower quality and reduced yield, while harvesting too late (e.g., in the fourth stage) may lead to degradation of the product. Understanding and accurately determining these growth stages are crucial for optimizing both quality and productivity. The resolution of the images in the collected dataset varied. Table 1 presents the properties of the collected dataset.
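The 8:1:1 split can be reproduced with a short script. The following is a minimal sketch, not the authors’ actual program: it assumes the images sit in a single directory, and the directory names and random seed are illustrative.

import random
import shutil
from pathlib import Path

def split_dataset(src_dir, dst_dir, ratios=(0.8, 0.1, 0.1), seed=42):
    """Randomly copy images into train/val/test subdirectories (8:1:1 by default)."""
    images = sorted(Path(src_dir).glob("*.jpg"))
    random.seed(seed)  # fixed seed so the split is reproducible
    random.shuffle(images)
    n_train = int(len(images) * ratios[0])
    n_val = int(len(images) * ratios[1])
    subsets = {
        "train": images[:n_train],
        "val": images[n_train:n_train + n_val],
        "test": images[n_train + n_val:],
    }
    for name, files in subsets.items():
        out = Path(dst_dir) / name
        out.mkdir(parents=True, exist_ok=True)
        for f in files:
            shutil.copy(f, out / f.name)

# e.g. split_dataset("ORaph8K/images", "ORaph8K/splits")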

Fig. 1
figure 1

Representative images depicting the different growth stages of Oudemansiella raphanipes. (a) Seedling stage, (b) growing stage, (c) maturity stage, and (d) flowering stage.

Table 1 Collected dataset properties.

In real cultivation environments, Oudemansiella raphanipes is typically planted densely, which often results in the occlusion of the mushroom caps or stipes. To address this issue, we distinguished between the caps and stipes during the dataset annotation process. This differentiation enhances the model’s performance in scenarios where occlusion occurs, allowing for more accurate monitoring and analysis of the growth stages of Oudemansiella raphanipes.

Data augmentation

In the data augmentation phase of our preprocessing pipeline, we enhanced the robustness and generalizability of our model by employing three specific techniques tailored for Oudemansiella raphanipes: HSV random augmentation, random choice resizing, and random flipping. HSV random augmentation was applied with the parameters H_delta=5, S_delta=30, and V_delta=30, introducing subtle variations in colour that mimic natural differences. Random choice resize was utilized with scales set to [(1600, 960), (1333, 640), (1333, 800), (800, 800), (640, 640)], allowing for the simulation of various distances and perspectives. Finally, random flips were implemented with a probability of 0.5 to reflect the natural randomness of orientations in the wild. These methods, selected for their effectiveness in preliminary experiments, enabled the creation of a robust and versatile dataset conducive to developing a generalized model that is less prone to overfitting.
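In MMDetection 3.x, this pipeline can be expressed roughly as follows. The transform names are the library’s standard modules (YOLOXHSVRandomAug, RandomChoiceResize, RandomFlip); the exact pipeline used in this work may differ in ordering and in additional preprocessing steps.

train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    # HSV jitter: hue +/-5, saturation +/-30, value +/-30
    dict(type='YOLOXHSVRandomAug', hue_delta=5, saturation_delta=30, value_delta=30),
    # randomly pick one of several target scales to mimic distance/perspective changes
    dict(type='RandomChoiceResize',
         scales=[(1600, 960), (1333, 640), (1333, 800), (800, 800), (640, 640)],
         keep_ratio=True),
    # horizontal flip with probability 0.5
    dict(type='RandomFlip', prob=0.5),
    dict(type='PackDetInputs'),
]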

OR-FCOS model

The network architecture of the OR-FCOS model is illustrated (Fig. 2).

Fig. 2
figure 2

Network architecture of OR-FCOS model.

In complex agricultural environments, traditional object detection algorithms such as FCOS often encounter challenges including high computational demands, limited adaptability to varying object scales, and reduced detection accuracy. These limitations hinder the efficient detection of the growth stage of Oudemansiella raphanipes. To address these issues, we propose the OR-FCOS algorithm, tailored for accurately identifying the growth stage of Oudemansiella raphanipes. The OR-FCOS model introduces several key innovations: (1) It integrates the MobileNetV3-Large backbone with the EMA module, replacing the computationally intensive ResNet-50 within the FCOS framework, thereby enhancing feature extraction efficiency while maintaining high accuracy. (2) The traditional FPN is supplanted by NAS-FPN, and NAS-FCOS-Head is introduced, optimizing the feature fusion process and prediction heads. (3) The CIoU loss function is incorporated within the NAS-FCOS-Head structure, addressing the shortcomings of the standard IoU loss by considering aspect ratio consistency and the distance between bounding box centers. (4) A channel pruning strategy is employed to eliminate redundant parameters in the NAS-FPN and NAS-FCOS-Head components, reducing the model size and computational demands without compromising detection accuracy. These modifications enable the OR-FCOS model to achieve a balanced performance, ensuring efficient and accurate detection of the growth stage of Oudemansiella raphanipes in complex environments.

MobileNetV3 backbone with EMA module

In the practical production process for the efficient detection of the growth stage of Oudemansiella raphanipes, computational resources are often limited. Therefore, algorithms with fewer parameters and lower computational overhead are more suitable for such applications. To address the challenges associated with the high computational demand of the FCOS algorithm and its limitations in real-time execution on edge computing devices, we selected MobileNetV3-Large to replace ResNet-5022 within the FCOS framework. As a lightweight convolutional neural network architecture, MobileNetV3 is designed to achieve faster and more efficient feature extraction on mobile devices23. It offers a good trade-off between computational efficiency and accuracy, enhancing the performance of its predecessors, MobileNet24 and MobileNetV225. The adoption of MobileNetV3-Large in the FCOS framework improves the efficiency of the model, reducing the need for high computational resources.

However, simply replacing ResNet-50 with MobileNetV3 may lead to a decrease in accuracy. To maintain detection performance, we introduce the EMA module26 into the original MobileNetV3’s inverted residual block, replacing the existing squeeze-and-excitation (SE) attention module. The selection of the EMA module over the SE module is underpinned by several theoretical considerations. Firstly, the EMA module is specifically designed to capture both local and global feature interactions without dimensionality reduction. This enhances the model’s ability to extract comprehensive features from the data26. In contrast, the SE module primarily focuses on channel-wise attention, which may limit its capacity to model the complex spatial dependencies essential for accurately identifying the diverse growth stages of Oudemansiella raphanipes. Secondly, the multi-scale architecture of the EMA module allows for efficient processing of features at various spatial resolutions26, which is crucial for recognizing subtle morphological variations across different growth stages. This multi-scale processing capability ensures that the model remains robust and sensitive to nuanced patterns that may be overlooked by single-scale attention mechanisms such as SE. Additionally, the EMA module achieves this enhanced feature extraction without a substantial increase in parameter count or computational overhead. This modification aligns with our objective of constructing a lightweight, efficient, and accurate model for identifying the growth stage of Oudemansiella raphanipes.
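To make the placement of the attention module concrete, the following is a minimal PyTorch sketch of a MobileNetV3-style inverted residual block with a pluggable attention stage. It is a structural illustration only: activation choices and channel numbers are simplified, the attention argument stands in for the EMA module in OR-FCOS (or the original SE block in MobileNetV3), and the EMA internals are omitted.

import torch
import torch.nn as nn

class InvertedResidualWithAttention(nn.Module):
    """Simplified inverted residual block in which the attention module is pluggable."""
    def __init__(self, in_ch, exp_ch, out_ch, stride, attention: nn.Module):
        super().__init__()
        self.use_residual = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            # 1x1 pointwise expansion
            nn.Conv2d(in_ch, exp_ch, 1, bias=False),
            nn.BatchNorm2d(exp_ch),
            nn.Hardswish(),
            # 3x3 depthwise convolution
            nn.Conv2d(exp_ch, exp_ch, 3, stride, 1, groups=exp_ch, bias=False),
            nn.BatchNorm2d(exp_ch),
            nn.Hardswish(),
            # attention on the expanded features (SE in MobileNetV3, EMA in OR-FCOS)
            attention,
            # 1x1 pointwise projection (linear bottleneck)
            nn.Conv2d(exp_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out

# usage: nn.Identity() keeps the plain structure; an EMA module would be passed instead
block = InvertedResidualWithAttention(40, 120, 40, stride=1, attention=nn.Identity())
y = block(torch.randn(1, 40, 56, 56))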

NAS-FPN and NAS-FCOS-Head

To accurately identify the growth stages of Oudemansiella raphanipes, it is crucial to precisely predict mushroom instances across different scales. We utilize the architecture of NAS-FPN and NAS-FCOS-Head to increase the detection accuracy and robustness of our model across different environmental conditions and mushroom sizes.

Traditional FPNs enhance detection performance by merging high-level, semantically rich features with low-level, detailed features. NAS-FPN improves object detection by optimizing the integration of multi-scale features27. It uses deformable convnets v2 (DCNv2)28, which adjusts convolutional kernels to better fit the geometric variations of objects, enhancing the network’s ability to capture complex shapes and sizes. Additionally, NAS-FPN employs skip connections to preserve information across feature levels, which is crucial for maintaining details. NAS-FPN is thus particularly effective in capturing subtle changes in mushroom growth.

In FCOS, the prediction head maps each feature in the pyramid to the output using four 3\(\times\)3 convolutions. The NAS-FCOS-Head offers structural improvements over the traditional FCOS-Head27. The inclusion of DCNv2 allows the NAS-FCOS-Head to dynamically adapt its convolutional filters to the shape and orientation of target objects. Additionally, the integration of 1\(\times\)1 convolution adjusts channel capacities and adds non-linearity, enriching feature representation without substantial computational overhead. This architecture can provide more accurate and robust performance.
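A simplified sketch of such a prediction head is shown below, using a single shared tower for brevity. The real FCOS head uses separate classification and regression towers plus per-level scale factors on the regression outputs, and the NAS-FCOS-Head additionally replaces some of these layers with deformable and 1×1 convolutions found by architecture search; those details are omitted here.

import torch
import torch.nn as nn

class SimpleFCOSHead(nn.Module):
    """Simplified FCOS-style head: four 3x3 convolutions feeding classification,
    centerness, and bounding-box regression outputs."""
    def __init__(self, in_channels=256, num_classes=8):
        super().__init__()
        tower = []
        for _ in range(4):  # four 3x3 convolutions, as in the FCOS head
            tower += [nn.Conv2d(in_channels, in_channels, 3, padding=1),
                      nn.GroupNorm(32, in_channels),
                      nn.ReLU(inplace=True)]
        self.tower = nn.Sequential(*tower)
        self.cls_out = nn.Conv2d(in_channels, num_classes, 3, padding=1)
        self.ctr_out = nn.Conv2d(in_channels, 1, 3, padding=1)
        self.reg_out = nn.Conv2d(in_channels, 4, 3, padding=1)  # l, t, r, b distances

    def forward(self, feat):
        x = self.tower(feat)
        return self.cls_out(x), self.ctr_out(x), self.reg_out(x)

# one head is shared across all pyramid levels; num_classes=8 reflects the four
# growth stages, each annotated as cap and stipe
head = SimpleFCOSHead(in_channels=256, num_classes=8)
cls_score, centerness, bbox_reg = head(torch.randn(1, 256, 100, 100))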

Improved loss function

For the precise detection of the growth stage of Oudemansiella raphanipes, accurate prediction of the bounding box is essential. In FCOS, the IoU loss function is adopted in the head structure for bounding box regression. The IoU refers to the ratio of the intersection to the union between the predicted bounding box and the ground truth bounding box. The IoU value ranges from 0 to 1, where 1 indicates perfect overlap and 0 indicates no overlap. The IoU loss is 1 minus the IoU value, that is:

$$\begin{aligned} IoU \, Loss = 1 - IoU \end{aligned}$$
(1)

However, the IoU loss has several limitations. For example, when there is no overlap between the predicted and ground truth boxes, the IoU value is 0, leading to a vanishing gradient, which prevents the model from learning from such incorrect predictions. Additionally, the IoU loss is not sufficiently sensitive to changes in bounding box size and does not account for the distance between the centres of the bounding boxes, which is crucial for localization accuracy.

To address these issues, the CIoU loss was introduced in the NAS-FCOS-Head structure. In the CIoU loss function, two terms are added to the IoU loss function, the aspect ratio consistency and the distance between centre points, with the formula29:

$$\begin{aligned} CIoU \, Loss = 1 - IoU + \frac{\rho ^2(b, b^{gt})}{c^2} + \alpha v \end{aligned}$$
(2)

where \(\rho (b, b^{gt})\) represents the Euclidean distance between the centroids of the predicted bounding box \(b\) and the ground truth bounding box \(b^{gt}\), \(c\) denotes the diagonal length of the smallest box that encloses both bounding boxes, \(v\) represents the consistency of the aspect ratios between the predicted and actual bounding boxes, and \(\alpha\) serves as a weighting factor to equilibrate the consistency of the aspect ratios with the central point distance. Moreover, v and \(\alpha\) are calculated as follows:

$$\begin{aligned} v= & \frac{4}{\pi ^2}\left( \arctan \frac{w^{gt}}{h^{gt}}-\arctan \frac{w}{h}\right) ^2 \end{aligned}$$
(3)
$$\begin{aligned} \alpha= & \frac{v}{(1-IoU)+v} \end{aligned}$$
(4)

where w and h represent the width and height of the predicted bounding box, respectively, whereas \(w^{gt}\) and \(h^{gt}\) represent those of the ground truth bounding box. The CIoU loss function is more comprehensive in its design than the IoU loss function, focusing not only on the degree of overlap between the bounding boxes but also on ensuring that the centre of the predicted box is as close as possible to the centre of the ground truth box and that its aspect ratio remains similar to that of the ground truth box. In this way, the CIoU loss can still provide effective gradients when there is no overlap, helping the model correct the position and shape of the predicted bounding box and enabling more precise bounding box regression even in complex agricultural environments.
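Eqs. (2)–(4) translate directly into code. The following PyTorch sketch operates on boxes in (x1, y1, x2, y2) format; the function name and epsilon handling are illustrative rather than the exact implementation used in OR-FCOS.

import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """CIoU loss for boxes in (x1, y1, x2, y2) format, following Eqs. (2)-(4)."""
    # intersection and union for the IoU term
    ix1 = torch.max(pred[:, 0], target[:, 0])
    iy1 = torch.max(pred[:, 1], target[:, 1])
    ix2 = torch.min(pred[:, 2], target[:, 2])
    iy2 = torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # squared distance between box centres (rho^2 in Eq. (2))
    cx_p, cy_p = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cx_t, cy_t = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    rho2 = (cx_p - cx_t) ** 2 + (cy_p - cy_t) ** 2

    # squared diagonal of the smallest enclosing box (c^2 in Eq. (2))
    ex1 = torch.min(pred[:, 0], target[:, 0])
    ey1 = torch.min(pred[:, 1], target[:, 1])
    ex2 = torch.max(pred[:, 2], target[:, 2])
    ey2 = torch.max(pred[:, 3], target[:, 3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + eps

    # aspect ratio consistency v (Eq. (3)) and its weight alpha (Eq. (4))
    w_p, h_p = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w_t, h_t = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(w_t / (h_t + eps)) - torch.atan(w_p / (h_p + eps))) ** 2
    with torch.no_grad():  # alpha is treated as a constant during backpropagation
        alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v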

Channel pruning strategy

In object detection, optimizing the balance among model size, computational efficiency, and accuracy is crucial, especially for deployment on resource-constrained devices. In traditional model designs, either efficiency or accuracy is often compromised. MobileNet24 introduced a width multiplier parameter that uniformly scales the network’s width (i.e., feature channels) to obtain a trade-off between accuracy and computational cost. By carefully adjusting the width multiplier, MobileNet achieves efficient inference with markedly fewer parameters and lower computational requirements compared to standard architectures.

Inspired by the design of MobileNet, we propose a targeted compression methodology that compresses the model by specifically reducing the feature channel count in NAS-FPN and NAS-FCOS-Head. By decreasing the number of feature channels in these components, which are responsible for a large portion of the model’s parameters and computational complexity, we can effectively reduce the overall resource demands. This approach ensures that the model maintains its operational performance with minimal degradation in accuracy while achieving greater efficiency suitable for deployment on resource-constrained devices.
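In an MMDetection-style configuration, this pruning amounts to lowering the channel width of the neck and head. The excerpt below is illustrative only: the module type names follow MMDetection’s public NAS-FCOS implementation as stand-ins for the components described above, and all channel numbers except the 256-to-192 decoder width are placeholder assumptions.

# illustrative excerpt: shrinking the decoder (neck + head) from 256 to 192 channels
model = dict(
    neck=dict(
        type='NASFCOS_FPN',           # NAS-searched FPN (stand-in for the NAS-FPN used here)
        in_channels=[40, 112, 960],   # backbone stage widths -- placeholder values
        out_channels=192,             # pruned from the default 256
        num_outs=5),
    bbox_head=dict(
        type='NASFCOSHead',
        num_classes=8,                # four growth stages x {cap, stipe}
        in_channels=192,              # must match the neck output width
        feat_channels=192))           # head width pruned together with the neck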

Evaluation metrics

To accurately evaluate the performance of the model, this study adopts commonly used evaluation metrics in object detection algorithms: \(\text {mAP}_{50}\) and \(\text {mAP}_{50:95}\). Their definitions are as follows:

$$\begin{aligned} \text {mAP}_{50}= & \frac{1}{C} \sum _{c=1}^{C} \text {AP}_{50}^{(c)} \end{aligned}$$
(5)
$$\begin{aligned} \text {mAP}_{50:95}= & \frac{1}{10} \sum _{i=0}^{9} \left( \frac{1}{C} \sum _{c=1}^{C} \text {AP}_{50 + 5i}^{(c)} \right) \end{aligned}$$
(6)

where C is the number of categories, \(\text {AP}_{50}^{(c)}\) represents the average precision for category c at an IoU threshold of 0.50, and \(\text {mAP}_{50:95}\) is the mean average precision averaged over IoU thresholds from 0.50 to 0.95 in increments of 0.05. A higher mAP value indicates better model performance, as it reflects the model’s ability to accurately detect and classify objects in various conditions.

Additionally, the complexity of the model is evaluated using the following metrics:

$$\begin{aligned} \text {FLOPs}= & 2 \sum _{l=1}^{L} \left( C_{\text {in}, l} \times C_{\text {out}, l} \times K_l^2 \times H_l \times W_l \right) \end{aligned}$$
(7)
$$\begin{aligned} \text {Number of Parameters}= & \sum _{l=1}^{L} \left( C_{\text {in}, l} \cdot C_{\text {out}, l} \cdot K_l^2 + C_{\text {out}, l} \right) \end{aligned}$$
(8)

where L is the number of network layers, \(K_l\) is the size of the convolution kernel, \(C_{\text {in}, l}\) and \(C_{\text {out}, l}\) represent the number of input and output channels of the l-th layer, respectively, and \(H_l\), \(W_l\) are the height and width of the l-th layer’s output feature map, respectively. The term \(C_{\text {out}, l}\) accounts for bias parameters. In terms of model complexity, fewer FLOPs and a smaller number of parameters generally indicate a more efficient model, which can lead to faster inference times and lower computational costs.
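As a concrete illustration, both complexity metrics can be computed per convolutional layer as follows; the 256-channel, 100×100 feature map in the example is arbitrary.

def conv_flops(c_in, c_out, k, h_out, w_out):
    """FLOPs of one convolutional layer, per Eq. (7) (a multiply-accumulate counts as 2 ops)."""
    return 2 * c_in * c_out * k * k * h_out * w_out

def conv_params(c_in, c_out, k):
    """Parameter count of one convolutional layer with bias, per Eq. (8)."""
    return c_in * c_out * k * k + c_out

# a single 3x3 convolution in a 256-channel decoder on a 100x100 feature map:
print(conv_flops(256, 256, 3, 100, 100))  # 11,796,480,000 FLOPs (~11.8 GFLOPs)
print(conv_params(256, 256, 3))           # 590,080 parameters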

Furthermore, the model’s deployment efficiency is assessed by measuring the size of the trained model weights and the GPU memory consumed during the inference phase to ensure compatibility with hardware constraints. Detection speed is evaluated in frames per second (FPS). A higher FPS means faster processing, which is beneficial for real-time applications.

Training details

In our experiment, we employed the AdamW30 optimizer with a learning rate of 0.0001 and a weight decay of 0.05, incorporating gradient clipping31 with a maximum norm of 0.01 for stability. Compared to the traditional Adam32 optimizer, AdamW more effectively handles weight decay, reduces overfitting, and enhances the model’s generalization capabilities. We utilized a cosine annealing strategy33 to schedule both the learning rate and momentum.

During the first 10 epochs, the learning rate increased from 0.0001 to 0.001, while momentum rose from 0 to a range between 0.85 and 0.95. This process enables the model to rapidly learn features during the initial training phase. Over the next 14 epochs, the learning rate was reduced to \(5 \times 10^{-7}\), and the momentum was adjusted to 1. This phase aims to fine-tune the model parameters, reduce fluctuations in the loss function and enhance the model’s stability. From epoch 24 onwards until epoch 100, we kept the learning rate constant. This decision is based on experimental observations of the model’s performance stability. It helps prevent potential training instability or overfitting.
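In MMEngine terms, this setup corresponds roughly to the configuration below. It is a sketch rather than the exact training config: the peak learning rate of 1e-3 is our reading of the warm-up described above, and the accompanying momentum schedule is omitted.

# optimizer: AdamW with weight decay 0.05 and gradient clipping at max norm 0.01
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(type='AdamW', lr=1e-3, weight_decay=0.05),
    clip_grad=dict(max_norm=0.01))

# learning rate: linear warm-up from 1e-4 to 1e-3 over the first 10 epochs, then cosine
# annealing down to 5e-7 by epoch 24; the rate is then left unchanged until epoch 100
param_scheduler = [
    dict(type='LinearLR', start_factor=0.1, by_epoch=True, begin=0, end=10),
    dict(type='CosineAnnealingLR', eta_min=5e-7, by_epoch=True, begin=10, end=24),
]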

The computing resources for this experiment are described in Table 2. MMDetection34 and MMPretrain35 were employed to train and test our model. The metrics for the experiments in this section are calculated on the test set.

Table 2 Computational resources.

Comparison with other methods

We selected several leading models to compare their performance with our improved model. These models have been chosen due to their proven effectiveness and widespread adoption. The selected models include Faster R-CNN7, RetinaNet36, YOLOv11n37, RetinaNet-based ATSS38, CenterNet39, FCOS-based AutoAssign40, FCOS and YOLOv8n41.

Faster R-CNN is a widely adopted two-stage detector known for its high accuracy in object detection tasks7. RetinaNet introduced focal loss to effectively address class imbalance in single-stage detectors36. YOLOv11n offers real-time detection capabilities with a lightweight architecture suitable for speed-critical applications37. ATSS builds on RetinaNet by incorporating adaptive training sample selection to enhance bounding box regression38. CenterNet utilizes keypoint estimation for precise object localization, improving detection accuracy39. AutoAssign enhances the FCOS framework with dynamic assignment mechanisms for better training adaptability40. FCOS represents anchor-free approaches, providing simplicity and computational efficiency in object detection. YOLOv8n advances the YOLO series with improved performance and flexibility for various detection scenarios41.

Our model is expected to outperform these models in the task of identifying the growth stages of Oudemansiella raphanipes for several reasons. Firstly, our model is meticulously designed, integrating NAS-FPN and NAS-FCOS-Head, and enhances detection accuracy by employing techniques such as EMA attention and CIoU Loss. Additionally, our model utilizes a more lightweight backbone and further reduces the number of parameters and overall size through channel pruning, making it more lightweight compared to other models. These advantages of our model will be validated in the subsequent Results section.

Results

Ablation test

To enhance the performance of identifying the growth stage of Oudemansiella raphanipes, three improvements were applied to the original FCOS algorithm (A: introduce MobileNetV3 with the EMA module as the backbone; B: incorporate NAS-FPN and NAS-FCOS-Head; C: replace the IoU loss function with the CIoU loss function). However, the specific contributions of these improvements to the detection performance of the OR-FCOS model remain unclear. To assess the distinct influence of each element on the model’s performance, a series of ablation tests was conducted, and the findings are detailed in Table 3. We present the models’ performance during training for the different tests (Fig. 3).

Table 3 Ablation tests.
Fig. 3
figure 3

Training dynamics for ablation studies showing the \(\hbox {mAP}_{50:95}\) and loss across epochs. (a) \(\hbox {mAP}_{50:95}\) by epoch, (b) loss by epoch.

Without MobileNetV3 and using ResNet-50 instead, the \(\hbox {mAP}_{50}\) and \(\hbox {mAP}_{50:95}\) achieve 89.4% and 78.3%, respectively, but the parameter count increases to 32.4M and the FLOPs to 94.8G. The inclusion of the lightweight EMA module slightly improves precision, as seen in the full model’s \(\hbox {mAP}_{50}\) of 89.4% and \(\hbox {mAP}_{50:95}\) of 78.3% compared to 89.2% and 78.1% when this component is removed. The removal of NAS-FPN and NAS-FCOS-Head results in a noticeable decline in performance, with \(\hbox {mAP}_{50}\) dropping to 87.3% and \(\hbox {mAP}_{50:95}\) to 74.9%. The CIoU loss function slightly boosts performance without adding extra Params and FLOPs, with the full OR-FCOS model achieving a \(\hbox {mAP}_{50}\) of 89.4% and \(\hbox {mAP}_{50:95}\) of 78.3% compared to 88.8% and 77.5% when CIoU is excluded.

Comparison of channel pruning strategies

To verify the effectiveness of the pruning strategy, comparative experiments on decoders with different channel counts were conducted, where the decoder specifically comprises the neck network and the head network. By implementing different pruning ratios in our improved FCOS network, we aimed to find an optimal pruning rate that achieves a better balance between reducing computational resource consumption and maintaining model performance. We present the comparative results of the distinct pruning strategies (Table 4). The model’s \(\hbox {mAP}_{50}\) and number of parameters for different feature channel counts are shown (Fig. 4).

Table 4 Effect of decoder channel count on model precision and efficiency.
Fig. 4
figure 4

Impact of feature channel count on model \(\hbox {mAP}_{50}\) and number of parameters.

As the decoder feature channel count decreases, the model’s mAP value slightly decreases, whereas the number of parameters, model weight size, and number of FLOPs substantially decrease. The model achieves the highest mAP value without pruning but also has the highest resource consumption.

When the feature channel count reaches 192, the model’s mAP value remains close to that of the model with a 256-channel width, but the number of parameters, model weight size, and number of FLOPs decrease, indicating that with a 192-channel width, the model achieves a better balance between performance and resource consumption. However, when the feature channel count is less than 160, the model’s performance begins to decline dramatically. In particular, when the feature channel count reaches 64, the decline in model performance becomes very apparent, and the advantage of saving resources is not sufficient to compensate for the performance loss. In summary, under the conditions of this experiment, the results suggest that setting the decoder feature channel count to 192 can enable a better trade-off between substantially reducing the model’s computational resource consumption and maintaining high detection accuracy. Through this method, we successfully achieve effective model compression while minimizing the impact on performance.

Comparison with other methods

We compared the performance of various leading models with the performance of our improved model. The results are concisely presented in Table 5. The inference speed and CUDA memory usage of each model were measured during inference on a single NVIDIA RTX A4000 GPU.

Table 5 Comparison of Oudemansiella raphanipes identification results of different models.

Our model achieved an \(\hbox {mAP}_{50}\) of 89.4% and an \(\hbox {mAP}_{50:95}\) of 78.3%, outperforming alternative models such as Faster R-CNN, RetinaNet, RetinaNet-based ATSS, CenterNet, FCOS-based AutoAssign, and FCOS in accuracy. In addition, our model has an efficient design, with only 9.9 M parameters and a size of 40.1 MB, demonstrating notable efficiency. In terms of processing speed, our model reached 24.0 FPS, with only YOLOv8n and YOLOv11n showing a higher speed. However, it is crucial to note that YOLOv8n and YOLOv11n, despite their higher speeds of 102.0 FPS and 75.8 FPS, respectively, exhibited much lower accuracy than our model. Faster R-CNN and CenterNet also demonstrated good accuracy, with \(\hbox {mAP}_{50}\) scores of 85.0% and 84.5%, respectively, and \(\hbox {mAP}_{50:95}\) scores of 72.0% and 73.7%. However, they require more computational resources than our model, as highlighted by Faster R-CNN’s 41.8 million parameters and a 167.3 MB weight file. This comparison underscores our model’s efficiency, offering a superior trade-off of high accuracy and rapid processing with substantially lower resource consumption.

We illustrate the performance of various anchor-free detection models (Fig. 5). Regarding the results obtained from the CenterNet algorithm, incorrect identifications occur at the locations marked by the blue arrow, diverging from the expected empirical outcomes (Fig. 5a). Notably, for the FCOS-based AutoAssign algorithm, a blue arrow in the top left corner indicates a false detection result (Fig. 5b). In contrast, the FCOS algorithm and our proposed algorithm successfully detect all instances of Oudemansiella raphanipes in the image without any false detections (Fig. 5c and Fig. 5d), demonstrating the efficacy of our improved detection method. Additional visual comparisons are shown in Fig. 6.

Fig. 5
figure 5

Visual comparison of the detection results of different anchor-free object detection models. (a) CenterNet, (b) FCOS-based AutoAssign, (c) FCOS, and (d) OR-FCOS (Ours).

Fig. 6
figure 6

Visual comparison of the detection results of different anchor-free object detection models. left: FCOS, right: OR-FCOS (Ours).

The OR-FCOS model exhibits lower confidence scores than the FCOS model in Fig. 5. The main reason for this difference lies in the optimization of the model architecture. Our OR-FCOS builds upon the original FCOS architecture by incorporating optimized feature extraction modules and enhanced loss functions. These improvements increase the model’s localization precision and classification accuracy while making the distribution of confidence scores more conservative to reduce the likelihood of false positives. Through this architectural optimization, OR-FCOS is able to maintain high detection accuracy while effectively controlling the false positive rate, thereby outperforming the traditional FCOS detector in overall performance.

Discussion

In this study, the dataset exhibits a certain degree of class imbalance, with some classes having fewer instances than others. Such imbalance can lead the model to be biased towards the majority classes, diminishing its ability to accurately identify minority classes. Consequently, it is essential to compare the precision metrics across the different categories.

The precision of each category at a 95% IoU threshold is shown in Table 6. The identification performance varies across different growth stages and categories. Specifically, within each growth stage, category a (caps) consistently exhibits higher precision than category b (stipes). For example, in the first growth stage, the precision for the cap (1a) reaches 0.851, whereas that for the stipe (1b) is only 0.667. This pattern remains consistent in subsequent growth stages, with the cap precision in the second stage at 0.905 and the stipe at 0.703; in the third stage, the figures are 0.884 versus 0.705; and in the fourth stage, 0.871 versus 0.674. The persistently low precision in stipe identification indicates inherent challenges in accurately identifying stipes across all growth stages.

Table 6 Model precision for different categories.

The confusion matrix for our proposed OR-FCOS is shown in Fig. 7. From the matrix, we observe that the misclassification rate is high when classifying the stipes. For example, there is a 10% misclassification rate of class 1b being identified as 2b, and a 9% rate of class 3b being misclassified as 2b. This difficulty arises from the visual similarity of stems at different growth stages, which increases the complexity of the feature extraction process.

Fig. 7
figure 7

Confusion matrix of OR-FCOS.

Additionally, the imbalance in the number of stipe samples across various growth stages in the training dataset may lead to decreased model performance in distinguishing between different growth stages.

The practical impact of these misclassifications is notable. Erroneously identifying the stipe of a particular growth stage as belonging to another stage (for example, classifying 1b as 2b, 3b, or 4b) may result in incorrect assessments of the growth process, thereby affecting critical decisions such as determining the harvest time. Our error analysis highlights that the primary causes of misclassification are the visual similarities between stems at different growth stages and the imbalance in sample numbers across growth stages in the training data.

To mitigate these challenges, we plan in future work to adopt multiple strategies, including increasing the number of stipe samples in each growth stage to address the class imbalance, employing advanced feature extraction techniques to better distinguish stipes across different growth stages, and implementing class balancing methods such as oversampling or weighted loss functions.

Conclusion

Identifying the growth stage of Oudemansiella raphanipes in complex agricultural environments is crucial for optimizing cultivation practices and enhancing yield management. In this study, we constructed an Oudemansiella raphanipes image dataset named ORaph8K with annotations for the different growth stages. We developed the OR-FCOS model to address issues such as the high computational demands and reduced accuracy faced by traditional detection algorithms. By integrating MobileNetV3 with the EMA module as the backbone, incorporating NAS-FPN and NAS-FCOS-Head, adding the CIoU loss function, and utilizing a channel pruning strategy, the OR-FCOS model substantially improved the accuracy and efficiency of the identification of Oudemansiella raphanipes growth stages. The experimental results demonstrate that the OR-FCOS model not only achieves higher accuracy, with mAP scores of 89.4% for \(\hbox {mAP}_{50}\) and 78.3% for \(\hbox {mAP}_{50:95}\), but also enhances operational efficiency. This is evidenced by a reduction in the number of model parameters to 9.9 M, model size to 40.1 MB, and CUDA memory usage to 255 MB. Moreover, the number of FLOPs was decreased to 31.2 G, and the inference speed was improved to 24.0 FPS. These enhancements make the model highly suitable for resource-constrained environments, aligning with production practices.