Abstract
Crayfish sub-part processing requires high precision and efficiency in complex environments; however, current methods often fail to effectively handle small anatomical structures and are unsuitable for edge-device deployment. We propose YOLOv11n-DHBC, a lightweight detection and segmentation framework that integrates three technical innovations. First, the proposed D-HGNetV2 backbone integrates DynamicConv modules into the HGNetV2 architecture, reducing parameters by 31% (from 2.9 M to 2.0 M) and computational cost by 9.8% (10.2 GFLOPs to 9.2 GFLOPs) through input-dependent kernel aggregation, which enhances multiscale feature extraction for occluded or overlapping parts. Second, a bidirectional feature pyramid network (BiFPN) with learnable weights adaptively fuses cross-scale representations, strengthening fine-grained detail capture in cluttered environments while controlling computational overhead. Third, the CARAFE content-aware upsampling module replaces nearest-neighbor interpolation to preserve high-resolution information, boosting small-target segmentation (e.g., claws and legs) without significant model size growth. On our self-constructed dataset, YOLOv11n-DHBC achieves 96.8% detection mAP@0.5 and 96.0% segmentation mAP@0.5, surpassing the YOLOv11n-Seg baseline, while maintaining real-time inference at 65.8 FPS and a model size of 4.2 MB (26.3% smaller). This balanced design offers robust performance for automated segmentation of crayfish sub-parts, thereby facilitating deployment in aquatic processing systems.
Introduction
Crayfish (Procambarus clarkii) has become popular for its tender meat, high protein content, low fat, and rich nutritional value, as well as its notable health, economic, and environmental benefits1. In 2023, China reported a crayfish farming area of 29.5 million mu and a production of 3.161 million tonnes. In industrial practice, deep processing of crayfish primarily involves part-by-part segmentation, requiring precise shelling and separation of anatomical regions such as the claws, tail, and head. Although computer vision (CV) technologies have advanced significantly and are widely used in aquatic product processing—particularly for automated fish sorting, quality grading, and production monitoring—dedicated research on fine-grained segmentation of multiple crayfish anatomical parts remains scarce2. Existing approaches mainly focus on whole-body identification or classification of individual specimens, failing to meet the precise segmentation requirements at the subpart level. More critically, in complex scenarios involving dense stacking, variable postures, and mutual occlusion of anatomical parts, generic segmentation models show fundamental limitations3. These include insufficient dynamic feature adaptability, which reduces robustness against deformation and occlusion-induced ambiguity; inefficient multi-scale feature fusion, which prevents optimal balancing of accuracy and computational cost; and poor reconstruction of fine-scale structures (such as walking legs), which limits recognition precision for critical subparts. In addition, models that prioritize segmentation accuracy typically incur prohibitive computational overhead due to excessive parameters and complexity. This severely limits their deployment potential for real-time, resource-constrained edge devices in industrial processing lines.
Most crayfish processing enterprises still rely on manual shelling methods4. While these methods maintain high product integrity, they are labor-intensive, time-consuming, costly, and pose potential risks of microbial contamination. Zhang et al.5 developed a multifunctional automated shelling system in which crayfish heads are fixed onto a rotating disc. The system integrates multiple processes, including guiding, compression, tail-breaking, cutting, back-opening, intestinal removal, and shell–meat separation. Although the system meets performance standards and supports various crayfish meat products, its processing throughput remains limited. Ma et al.6 designed a roller-shaft shelling machine that uses a grinding mechanism with tappet compression springs and a kneading–extrusion roller-shaft assembly. In this setup, de-headed crayfish are first crushed by flexible crushing feet, then kneaded by five rollers and extruded by three rollers for shell removal. While this design offers high capacity and reduced labor, it lacks precise positioning, resulting in suboptimal outcomes and limited automation. With the rapid development of deep learning in image recognition, machine vision has been widely applied in aquatic product processing, achieving notable results. Wang et al.7 optimized a network to reduce missed and false detections of sheep in complex farming environments. Their model achieved a segmentation mAP@0.5 of 92.08%, providing technical support for instance detection in complex backgrounds. Wang et al.8 proposed a YOLOv4-based method for crayfish quality detection. They used industrial cameras for image acquisition and trained the model on an industrial computer for real-time detection. Crayfish quality was assessed based on curvature features, and defective specimens were sorted by a picking device. This method, optimized in both network architecture and data preprocessing, achieved 97.8% detection accuracy with an average inference time of 37 ms. While it outperformed conventional models, it still faces challenges in complex backgrounds. Chen et al.9 developed an adaptive cropping algorithm for data preprocessing, which enables accurate identification and counting of crayfish to support scientific feeding and precise baiting. Chen et al. proposed an improved SSD model for detecting crayfish body parts by replacing the backbone with MobileNetV3 and incorporating Soft-NMS. The model achieved an mAP@0.5 of 95.50% and a 30 ms inference time per image, enabling fast and accurate part detection for automated processing10.
Given the challenges of inaccurate part identification and low processing efficiency in current crayfish deep processing, this study constructs a segmentation dataset covering the cephalothorax, abdomen, claws, and walking legs to address the practical needs of sub-part processing. Building on this, we propose an improved target detection and instance segmentation model, YOLOv11n-DHBC, based on YOLOv11n-Seg. The model achieves efficient multi-scale feature extraction and fusion through three key components: the self-developed D-HGNetV2 backbone, the integration of a bidirectional feature pyramid network (BiFPN), and the use of a CARAFE upsampling module. The improved architecture enhances feature representation while reducing model complexity, ensuring high detection accuracy and meeting the demands for efficient, precise sub-part detection in industrial deep processing workflows. This study bridges the gap between high-precision but computationally expensive instance segmentation models and lightweight real-time detection methods by proposing a solution that achieves both competitive accuracy and superior efficiency for practical industrial applications.
Dataset construction
Image data acquisition
The crayfish images were captured via an iPhone 13 Pro Max (resolution: 4032 × 3024 pixels) at Zhen Zhiyong’s wholesale crayfish department in Jiangsu Province, during August–October 2024 (late summer to autumn). To represent real-world conditions comprehensively, we captured images under varied lighting, including indoor artificial light and outdoor overcast or rainy settings. During image acquisition, challenging visual conditions were deliberately included, such as mutual occlusion between crayfish, overlapping individuals, and complex postures (e.g., curled abdomens and flexed claws). Images were taken from multiple angles to enhance the dataset’s generalizability. Blurry or highly similar images were removed, resulting in a final dataset of 585 high-quality crayfish images. Representative samples are shown in Fig. 1.
Image pre-processing and labelling
Using the annotation functionality of the open-source tool LabelMe, expert annotators meticulously delineated the anatomical boundaries of the crayfish cephalothorax, abdomen, claws, and walking legs. This annotation strategy ensured that labels were strictly aligned with the morphological boundaries of each sub-part. To address the challenge of limited initial data volume, which predisposes models to overfitting, a comprehensive data augmentation pipeline was implemented. Specifically, the pipeline included geometric transformations (horizontal and vertical flipping, random rotation to simulate viewpoint changes), photometric distortions (brightness adjustment to mimic lighting variations), and noise injection (Gaussian blurring for defocus, additive Gaussian noise, and salt-and-pepper noise to emulate sensor artifacts)11. These augmentation techniques artificially diversified the training data and expanded the final dataset to 3,510 unique images. This robust dataset, specifically tailored for crayfish part segmentation, was then partitioned into training, testing, and validation subsets following the established 7:2:1 ratio. This partitioning scheme ensures unbiased training and rigorous performance evaluation. The LabelMe annotation interface is shown in Fig. 2.
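To make the augmentation pipeline concrete, the sketch below applies the same families of transforms (flips, random rotation, brightness scaling, Gaussian blur, additive Gaussian noise, and salt-and-pepper noise) to a single image with OpenCV and NumPy. The probabilities, rotation range, and noise magnitudes are illustrative assumptions rather than the exact values used in this study, and in practice the polygon annotations must be transformed together with each image.

```python
# A minimal augmentation sketch; parameter values are assumed, not the study's exact settings.
import cv2
import numpy as np

rng = np.random.default_rng(0)

def augment(img):
    """Apply random geometric, photometric, and noise transforms to a BGR image."""
    # Geometric: random flips and rotation (simulate viewpoint changes)
    if rng.random() < 0.5:
        img = cv2.flip(img, 1)                                  # horizontal flip
    if rng.random() < 0.5:
        img = cv2.flip(img, 0)                                  # vertical flip
    h, w = img.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2, h / 2), rng.uniform(-30, 30), 1.0)
    img = cv2.warpAffine(img, M, (w, h))
    # Photometric: brightness scaling (mimic lighting variations)
    img = np.clip(img.astype(np.float32) * rng.uniform(0.7, 1.3), 0, 255).astype(np.uint8)
    # Noise: defocus blur, additive Gaussian noise, salt-and-pepper noise
    img = cv2.GaussianBlur(img, (5, 5), 1.0)
    img = np.clip(img + rng.normal(0, 8, img.shape), 0, 255).astype(np.uint8)
    mask = rng.random(img.shape[:2])
    img[mask < 0.002] = 0                                       # pepper
    img[mask > 0.998] = 255                                     # salt
    return img
```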
Methods of splitting different parts of crayfish
YOLOv11n-Seg base model
In this study, the YOLOv11 algorithm from the YOLO family was selected as the base model. YOLOv11 includes five variants with different parameter sizes: n, s, m, l, and x12. Among these, YOLOv11n has the fewest parameters and the fastest detection speed. Given the requirement for deployment on edge devices with limited computational power, YOLOv11n was selected as the base model.
YOLOv11n is an efficient model that integrates target detection and instance segmentation. Its overall structure comprises an input layer, a backbone network13, a neck network, and a segmentation layer, as shown in Fig. 3. The input layer receives image samples and provides the input data required for training and inference. The backbone network extracts multi-scale, high-quality features through the improved C3k2 module, C2PSA module, and retained SPPF layer14. The C3k2 module improves computational efficiency by using small convolution kernels, while the C2PSA module integrates a spatial attention mechanism to enhance the model's focus on important regions15. The neck network adopts the design of a Feature Pyramid Network (FPN) and a Path Aggregation Network (PAN)16, while introducing a Dynamic Feature Aggregation (DFA) module to enable effective multi-scale feature fusion17, enhancing adaptability to small targets and complex scenes. The segmentation layer generates high-quality masks by classifying and predicting feature maps pixel by pixel using an optimized fully convolutional network (FCN) architecture combined with the improved C3k2 module. The model supports instance segmentation by first identifying each target's bounding box and category label during detection, then accurately classifying each pixel in the segmentation stage to produce pixel-level results18. Through this efficient design for feature extraction, fusion, and segmentation, YOLOv11n-Seg achieves seamless detection and segmentation in complex multi-target scenarios while balancing real-time performance and accuracy.
D-HGNetV2 backbone network
This section provides a comprehensive overview of the design of the improved YOLOv11n-DHBC algorithm, as shown in the network diagram. Figure 4 illustrates the network architecture of YOLOv11n-DHBC. To enhance feature extraction and multi-scale information capture, the original YOLOv11n backbone was replaced with HGNetV219. The DynamicConv module was embedded into the HGBlock of HGNetV2 to achieve more efficient feature extraction by dynamically adjusting convolution kernel weights20. The BiFPN module was incorporated into the neck network to further enhance target feature representation through bidirectional multi-scale feature fusion21. To reduce detail loss caused by traditional upsampling, the CARAFE module was introduced to the neck network22, significantly improving the detection accuracy of small target parts (e.g., crayfish pincers and walking legs). These improvements enhance the overall performance of the model in crayfish detection and segmentation tasks. Experimental results show that YOLOv11n-DHBC improves mAP50 accuracy compared to the original YOLOv11n, while maintaining low computational overhead, making it more suitable for real-time detection in industrial scenarios.
The YOLOv11n backbone, which relies on standard convolutions and C3 modules for its lightweight design, exhibits limitations in feature extraction and dynamic feature adaptability. To address these limitations, this study proposes D-HGNetV2, a novel backbone that replaces the original architecture with HGNetV2 and integrates DynamicConv modules into its HGBlock units.
D-HGNetV2 comprises core modules including DWConv, HGStem, Dynamic_HGBlock, C2PSA, and SPPF, which are designed to extract multi-scale and multi-level image features, thereby enhancing model expressiveness and performance. Following HGNetV2’s lightweight, modular, and optimized design principles—including efficient initial feature extraction—D-HGNetV2 introduces dynamic convolution to further enhance adaptability. The optimized backbone reduces computational overhead while preserving its lightweight characteristics.
HGStem, the initial preprocessing layer, plays a crucial role in the model. Its primary function is to preprocess input data, extract key features, and provide suitable inputs for subsequent layers. During processing, HGStem first uses convolutional layers to extract features from the input data. By applying different convolutional kernels, the convolutional layers capture feature information from various local regions and generate rich feature representations. After feature extraction, HGStem applies max pooling for downsampling. This operation reduces data dimensionality while preserving key information. By applying max pooling at different scales, HGStem obtains multi-scale feature information from the input data. The structure of HGStem is shown in Fig. 5.
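The following is a simplified PyTorch sketch of a stem of this kind: stacked convolutions extract local features, and a max-pooling branch downsamples alongside a strided convolution before the two views are fused. The layer widths and branch layout are illustrative assumptions, not the exact HGNetV2 HGStem configuration.

```python
# A simplified stem sketch consistent with the description above; widths are illustrative.
import torch
import torch.nn as nn

class ConvBNAct(nn.Module):
    def __init__(self, c1, c2, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class SimpleStem(nn.Module):
    """Convolutions capture local features; max pooling downsamples while preserving
    key responses, and a parallel strided conv keeps a second view of the input."""
    def __init__(self, c_in=3, c_mid=32, c_out=64):
        super().__init__()
        self.stem1 = ConvBNAct(c_in, c_mid, 3, 2)
        self.branch_conv = ConvBNAct(c_mid, c_mid, 3, 2)
        self.branch_pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.fuse = ConvBNAct(2 * c_mid, c_out, 1, 1)

    def forward(self, x):
        x = self.stem1(x)
        x = torch.cat([self.branch_conv(x), self.branch_pool(x)], dim=1)
        return self.fuse(x)

feat = SimpleStem()(torch.randn(1, 3, 640, 640))   # -> (1, 64, 160, 160)
```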
HGBlock is a hierarchical feature extraction module designed to efficiently capture multi-scale information through convolutional operations and gating mechanisms23. Its basic structure comprises multiple convolutional layers and channel attention modules to ensure efficient information transfer as network depth increases. However, HGBlock relies on fixed convolutional kernels for feature extraction. The kernel parameters remain static during training, and their weights do not adapt to the characteristics of input images once training is complete. This static nature limits feature representation when handling targets of varying scales or classes. In particular, fixed kernels struggle to adapt to complex backgrounds or small target detection tasks, leading to reduced detection accuracy. In addition, HGBlock employs a standard channel attention mechanism with a fixed inter-channel interaction pattern. It cannot dynamically adjust the importance of each channel based on input content, potentially causing information loss during multi-scale feature fusion and impairing final detection performance. To address the limitations of HGBlock in handling multi-scale and complex features, we redesign the HGBlock by integrating DynamicConv, forming the Dynamic_HGBlock (Fig. 5b). In this design, the static 3 × 3 convolution layers in the residual branch of the original HGBlock are replaced by DynamicConv layers, while the skip connections and downsampling paths retain their static convolutional form. This hybrid structure allows the block to combine static feature reuse with input-adaptive filtering, enhancing the model's expressiveness without excessive complexity. Specifically, the DynamicConv layer dynamically aggregates multiple parallel convolution kernels through an attention-based mechanism that generates input-dependent kernel weights. During inference, the input feature map is first processed by global average pooling and fully connected layers to compute the attention weights, which are then applied to aggregate K parallel 3 × 3 kernels24. The resulting composite kernel is convolved with the input, enabling spatially adaptive feature extraction. Compared to the original HGNetV2, Dynamic_HGBlock introduces dynamic convolution in place of 50% of the fixed convolution layers in each block, reducing reliance on static filters and enhancing adaptability to input variation. The overall architecture preserves HGNetV2's lightweight properties while improving multi-scale detail extraction, particularly for small parts and occluded regions. Figure 5b shows the modified block structure, highlighting the interaction between the static skip paths and the dynamic convolution branches.
The core idea of dynamic convolution (Fig. 6) is to move beyond the fixed-kernel paradigm of traditional convolution: guided by the characteristics of the input data25, an attention mechanism dynamically aggregates multiple parallel convolution kernels, so that the convolution operation adapts to each input. The calculation proceeds as follows. Assume the input feature map has dimensions \(H \times W \times C_{in}\) (H is the height, W the width, and \(C_{in}\) the number of input channels). Global average pooling first compresses the feature map along the spatial dimensions to a \(1 \times 1 \times C_{in}\) descriptor26, computed as

\(z_c = \dfrac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} x_c(i, j), \quad c = 1, \ldots, C_{in}.\)

Next, the pooled descriptor \(z\) is fed into the first fully connected layer \(F_1\), which changes the number of channels to \(C_{mid}\): \(t = F_1(z) = W_1 z + b_1\), with \(W_1 \in \mathbb{R}^{C_{mid} \times C_{in}}\) and \(b_1 \in \mathbb{R}^{C_{mid}}\). The ReLU activation gives \(\hat{t} = \mathrm{ReLU}(t)\), which is then fed into the second fully connected layer \(F_2\), converting the number of channels to \(K\) (the number of parallel convolution kernels): \(s = F_2(\hat{t}) = W_2 \hat{t} + b_2\), with \(W_2 \in \mathbb{R}^{K \times C_{mid}}\) and \(b_2 \in \mathbb{R}^{K}\). A softmax over \(s\) then generates the attention weights:

\(\pi_k = \dfrac{\exp(s_k)}{\sum_{j=1}^{K} \exp(s_j)}, \quad k = 1, \ldots, K.\)

Given the \(K\) parallel convolution kernels \(\tilde{W}_k\) (\(C_{out}\) is the number of output channels and \(D_k\) the kernel size) and their corresponding bias terms \(\tilde{b}_k\), the attention weights \(\pi_k\) aggregate them into a single kernel \(\tilde{W} = \sum_{k=1}^{K} \pi_k \tilde{W}_k\) and bias \(\tilde{b} = \sum_{k=1}^{K} \pi_k \tilde{b}_k\). The aggregated kernel \(\tilde{W}\) is convolved with the input feature map \(X\) (the convolution operation is denoted by \(*\)), giving \(\tilde{W} * X + \tilde{b}\) after the bias is added. Finally, a nonlinear activation function yields the output \(y = g\left(\tilde{W} * X + \tilde{b}\right)\).
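A minimal PyTorch sketch of this attention-over-kernels computation is given below, assuming K parallel 3 × 3 kernels and a grouped-convolution trick to apply a different aggregated kernel to each sample in the batch; the hidden width of the attention branch and the value of K are illustrative choices, not the exact configuration used in D-HGNetV2.

```python
# Sketch of dynamic convolution: GAP -> FC -> ReLU -> FC -> softmax -> kernel aggregation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    """Aggregate K parallel kernels with input-dependent attention weights."""
    def __init__(self, c_in, c_out, k=3, K=4, reduction=4):
        super().__init__()
        self.K, self.c_out, self.k = K, c_out, k
        c_mid = max(c_in // reduction, 4)
        self.fc1 = nn.Linear(c_in, c_mid)      # F1: C_in -> C_mid
        self.fc2 = nn.Linear(c_mid, K)         # F2: C_mid -> K
        self.weight = nn.Parameter(torch.randn(K, c_out, c_in, k, k) * 0.02)
        self.bias = nn.Parameter(torch.zeros(K, c_out))

    def forward(self, x):
        b, c_in, h, w = x.shape
        z = x.mean(dim=(2, 3))                                    # global average pooling
        pi = F.softmax(self.fc2(F.relu(self.fc1(z))), dim=1)      # attention weights (B, K)
        w_agg = torch.einsum('bn,noihw->boihw', pi, self.weight)  # (B, C_out, C_in, k, k)
        b_agg = torch.einsum('bn,no->bo', pi, self.bias)          # (B, C_out)
        # Grouped conv applies a different aggregated kernel to each sample in the batch.
        y = F.conv2d(x.reshape(1, b * c_in, h, w),
                     w_agg.reshape(b * self.c_out, c_in, self.k, self.k),
                     bias=b_agg.reshape(-1), padding=self.k // 2, groups=b)
        return y.reshape(b, self.c_out, h, w)

out = DynamicConv2d(c_in=64, c_out=64)(torch.randn(2, 64, 80, 80))   # -> (2, 64, 80, 80)
```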
In the specific context of crayfish sub-part detection, the introduction of DynamicConv within HGBlock is particularly beneficial for addressing the anatomical variability and occlusion challenges inherent to this task. Crayfish claws and legs often exhibit large pose variations and are prone to mutual occlusion in densely packed scenes. The dynamic mechanism enables adaptive convolutional kernel aggregation based on input features, allowing the model to better distinguish these overlapping parts by adjusting focus according to the local context. This improves detection precision of fine-scale structures without introducing excessive computational burden, aligning with the requirements for both accuracy and efficiency in industrial processing lines.
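Building on the DynamicConv2d sketch above, the following hedged sketch wires dynamic convolutions into an HGBlock-style structure: the 3 × 3 convolutions of the main branch become dynamic, the 1 × 1 aggregation convolutions and the identity shortcut remain static, and intermediate outputs are concatenated before aggregation. Depth, widths, and activation choices are illustrative rather than the exact Dynamic_HGBlock configuration.

```python
# Hedged wiring sketch of a Dynamic_HGBlock-like module; DynamicConv2d is from the previous sketch.
import torch
import torch.nn as nn

class DynamicHGBlock(nn.Module):
    def __init__(self, c_in, c_mid, c_out, n=4, K=4, shortcut=True):
        super().__init__()
        # Main branch: n dynamic 3x3 convolutions replace the static ones
        self.m = nn.ModuleList(
            DynamicConv2d(c_in if i == 0 else c_mid, c_mid, k=3, K=K) for i in range(n)
        )
        # Static 1x1 squeeze/excitation-style convs aggregate the concatenated features
        self.sc = nn.Conv2d(c_in + n * c_mid, c_out // 2, 1)
        self.ec = nn.Conv2d(c_out // 2, c_out, 1)
        self.add = shortcut and c_in == c_out   # static identity shortcut

    def forward(self, x):
        y = [x]
        for conv in self.m:
            y.append(torch.relu(conv(y[-1])))
        out = self.ec(torch.relu(self.sc(torch.cat(y, dim=1))))
        return out + x if self.add else out

block = DynamicHGBlock(c_in=64, c_mid=32, c_out=64)
out = block(torch.randn(2, 64, 80, 80))          # -> (2, 64, 80, 80)
```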
Bidirectional feature pyramid network
In YOLOv11n, the traditional neck network employs a Feature Pyramid Network (FPN) or Path Aggregation Network (PANet) for multi-scale feature fusion, as shown in Fig. 7a,b. However, these approaches often lead to information loss and inefficiency when fusing features of different scales, particularly between low-level spatial details and high-level semantic features. To address this, the YOLOv11n neck is replaced with a Bidirectional Feature Pyramid Network (BiFPN), which incorporates learnable weights that enable the network to autonomously learn the importance of input features27. This enables BiFPN to effectively fuse multi-resolution features, improving detection performance for targets of diverse scales. By iteratively applying bidirectional feature paths, BiFPN facilitates cross-level feature interactions, further enhancing the model’s adaptability to complex scenes.
The specific flow of BiFPN is shown in Fig. 7c. The operation of BiFPN consists of three stages: feature input, feature fusion, and output. In the feature input stage, BiFPN receives features at different levels from the backbone network, such as P3-P7. These features have different resolutions and semantic information, providing a rich basis for subsequent fusion28.
In the feature fusion stage, input features are first preprocessed to simplify the network structure by removing single-input nodes and adding extra edges to fuse additional features. Next, input features at different resolutions are fused using a fast normalized fusion method. For example, the intermediate feature along the top-down path at level \(P_6\) is first computed as

\(P_6^{td} = Conv\left(\dfrac{w_1 \cdot P_6^{in} + w_2 \cdot Resize\left(P_7^{in}\right)}{w_1 + w_2 + \varepsilon}\right)\)

where \(P_6^{td}\) is the intermediate feature of level \(P_6\) obtained on the top-down path; \(Conv\) denotes the convolution operation used to further process the fused features and extract more effective representations; \(P_6^{in}\) and \(P_7^{in}\) are the original input features of levels \(P_6\) and \(P_7\) in BiFPN; \(w_1\) and \(w_2\) are learnable weight parameters corresponding to \(P_6^{in}\) and \(Resize\left(P_7^{in}\right)\), respectively, which the network learns in order to quantify the importance of each input feature; \(Resize\) adjusts the feature resolution, here upsampling \(P_7^{in}\) to match the resolution of \(P_6^{in}\) for fusion; and \(\varepsilon\) is a very small constant that prevents the denominator from being zero.
The \(Resize\) operation aligns feature resolutions so that different levels can be fused, and the \(Conv\) operation performs further processing on the fused result. Next, the output feature of the bottom-up path at level \(P_6\) is computed as

\(P_6^{out} = Conv\left(\dfrac{w_1' \cdot P_6^{in} + w_2' \cdot P_6^{td} + w_3' \cdot Resize\left(P_5^{out}\right)}{w_1' + w_2' + w_3' + \varepsilon}\right)\)

where \(P_6^{out}\) denotes the final output feature of level \(P_6\) on the bottom-up path, \(P_5^{out}\) represents the output feature of level \(P_5\) from the preceding computation, and \(w_1'\), \(w_2'\), \(w_3'\) are learnable weight parameters corresponding to \(P_6^{in}\), \(P_6^{td}\), and \(Resize\left(P_5^{out}\right)\), respectively, which measure the contribution of each input feature during fusion.
The fusion process for the remaining feature layers is similar, with bidirectional fusion operations repeated multiple times to fully integrate feature information across different levels. After multiple iterations, BiFPN outputs fused multi-scale features. These features are then fed into the subsequent classification and regression networks for target classification and localization. The classification network determines the target class based on fused features, while the regression network predicts target location, together achieving accurate detection. The BiFPN structure is designed to enhance the network’s ability to capture multi-scale features while reducing computational cost.
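The fast normalized fusion described above can be sketched as follows, assuming two or three same-shaped inputs per fusion node; the ReLU-based weight clamping, the epsilon value, and the depthwise-separable post-fusion convolution follow the EfficientDet formulation, while the channel width in the example is an illustrative choice.

```python
# Sketch of one BiFPN fusion node with learnable, fast-normalized weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FastNormalizedFusion(nn.Module):
    """Weighted fusion of N same-shaped feature maps with learnable non-negative weights."""
    def __init__(self, n_inputs, channels, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))
        self.eps = eps
        self.conv = nn.Sequential(              # post-fusion depthwise-separable conv
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False),
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
        )

    def forward(self, feats):
        w = F.relu(self.w)                       # keep weights non-negative
        w = w / (w.sum() + self.eps)             # fast normalized fusion
        fused = sum(wi * fi for wi, fi in zip(w, feats))
        return self.conv(fused)

# Example: P6_td = Conv((w1*P6_in + w2*Resize(P7_in)) / (w1 + w2 + eps))
p6_in = torch.randn(1, 64, 40, 40)
p7_in = torch.randn(1, 64, 20, 20)
fuse = FastNormalizedFusion(n_inputs=2, channels=64)
p6_td = fuse([p6_in, F.interpolate(p7_in, scale_factor=2, mode='nearest')])
```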
CARAFE upsampling
In this study, to improve feature recovery in the crayfish detection and segmentation task, we introduce the CARAFE operator into the neck of YOLOv11n, replacing the traditional nearest-neighbour interpolation upsampling module. Traditional upsampling methods spatially expand feature maps using simple interpolation; however, in complex detection tasks, and especially for small targets such as crayfish parts, they often fail to recover detailed features, leading to information loss and reduced detection accuracy. The CARAFE operator, through content-aware feature reassembly, dynamically adjusts the reconstruction process and significantly improves small-target detection29. In crayfish detection, where targets are small and may resemble background clutter, traditional upsampling often fails to separate the target from the background. By introducing CARAFE, the model preserves the integrity of fine-grained features and better identifies small targets such as crayfish pincers and walking legs, thereby improving detection accuracy. The structure of CARAFE is shown in Fig. 8.
The architecture of CARAFE consists of two core components: a content-aware kernel prediction module and a feature reassembly module. Working in tandem, they achieve fine-grained upsampling by performing specific convolution and reassembly operations on the input feature maps. The core innovation of CARAFE is its content-aware convolution30. Whereas traditional upsampling methods use a fixed kernel to expand the feature map, CARAFE dynamically generates reassembly kernels by analysing the content of the input feature map: the input is first processed by a convolutional layer that produces a set of adaptive kernels conditioned on local contextual information. Because CARAFE does not rely on fixed weights but adapts the kernels to the specifics of the input features, the convolution can be tailored to different feature regions of the target, leading to better recovery of details and small targets. After kernel prediction, the feature reassembly module applies the generated kernels to the low-resolution feature map, propagating information into the high-resolution space: each output pixel is reconstructed as a content-aware weighted combination of its corresponding input neighbourhood. Compared with traditional interpolation methods, CARAFE reconstructs details more faithfully, particularly at target edges and in small-target regions.
CARAFE upsampling comprises a kernel prediction module and a feature reassembly module. In the kernel prediction module, the channels of the input \(H \times W \times C\) feature map are first compressed to \(C_m\), yielding a \(C_m \times H \times W\) feature map and reducing subsequent computation. The content is then encoded by a convolutional layer to generate the reassembly kernels, producing a feature map with \(\sigma^2 \times k_{up}^2\) channels, as shown in Eq. 5. Here \(C_m\) denotes the number of channels after dimensionality reduction, \(\sigma\) is the upsampling factor (usually 2), and \(k_{up}\) is the size of the predicted upsampling kernel. The channels are then expanded into the spatial dimension and rearranged to obtain an upsampling kernel of size \(\sigma H \times \sigma W \times k_{up}^2\), which is normalised with Softmax so that its weights sum to 1. In the content-aware reassembly module, feature reassembly is carried out with the obtained upsampling kernels to extract target features: for each location of the output feature map, a region of size \(k_{up} \times k_{up}\), centred on the corresponding location in the input feature map, is taken and a dot product is computed with the upsampling kernel predicted for that location, producing a feature map of size \(\sigma H \times \sigma W \times C\). As a content-aware upsampling method, CARAFE, through its kernel prediction and feature reassembly modules, markedly improves feature recovery accuracy and small-target detection. Compared with traditional upsampling methods, it not only better preserves image details but also retains clear advantages in computational efficiency.
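A simplified PyTorch sketch of the two CARAFE stages is shown below: the kernel prediction branch compresses channels and predicts σ²·k_up² kernel weights per location, and the reassembly stage applies the softmax-normalised kernels to k_up × k_up input neighbourhoods. The use of pixel shuffle, unfold, and nearest-neighbour replication to index source neighbourhoods is an implementation convenience of this sketch, not necessarily the reference CARAFE code; channel widths are illustrative.

```python
# Simplified CARAFE sketch: kernel prediction + content-aware reassembly.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CARAFE(nn.Module):
    """Predict per-location reassembly kernels, then rebuild each output pixel
    from a k_up x k_up neighbourhood of the input feature map."""
    def __init__(self, c, c_mid=64, sigma=2, k_up=5, k_enc=3):
        super().__init__()
        self.sigma, self.k_up = sigma, k_up
        self.compress = nn.Conv2d(c, c_mid, 1)                       # channel compression
        self.encoder = nn.Conv2d(c_mid, sigma**2 * k_up**2, k_enc,   # kernel prediction
                                 padding=k_enc // 2)

    def forward(self, x):
        b, c, h, w = x.shape
        # 1) Kernel prediction: (B, sigma^2*k_up^2, H, W) -> (B, k_up^2, sigma*H, sigma*W)
        kernels = self.encoder(self.compress(x))
        kernels = F.pixel_shuffle(kernels, self.sigma)
        kernels = F.softmax(kernels, dim=1)                          # weights sum to 1
        # 2) Reassembly: gather k_up x k_up neighbourhoods of the input feature map
        patches = F.unfold(x, self.k_up, padding=self.k_up // 2)     # (B, C*k_up^2, H*W)
        patches = patches.view(b, c, self.k_up**2, h, w)
        patches = F.interpolate(                                     # map each output pixel
            patches.view(b, c * self.k_up**2, h, w),                 # to its source location
            scale_factor=self.sigma, mode='nearest'
        ).view(b, c, self.k_up**2, self.sigma * h, self.sigma * w)
        return (patches * kernels.unsqueeze(1)).sum(dim=2)           # weighted reassembly

y = CARAFE(c=128)(torch.randn(1, 128, 40, 40))    # -> (1, 128, 80, 80)
```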
Specifically for crayfish sub-part segmentation, CARAFE offers a decisive advantage in recovering the fine boundaries of small anatomical components, such as walking legs and claws, which often blend with background clutter or overlap with other parts. The content-aware kernel prediction ensures that the upsampling process is sensitive to subtle local feature variations, improving the clarity of part boundaries and segmentation masks. This directly addresses the need for precise segmentation of small, irregularly shaped parts under complex visual conditions.
Testing environment and parameters
Test environment
Training parameter configuration is a critical step, as its appropriateness directly affects final detection accuracy. To ensure scientific rigor and fair comparison, identical parameter settings were applied across all models in the comparative experiments.
The input image size is a key factor influencing feature extraction performance and computational complexity. After extensive evaluation and validation, an input resolution of 640 × 640 pixels was selected to balance computational cost and detection accuracy; this setting provides sufficient image information for precise feature extraction. The batch size, set to 16, plays a pivotal role by ensuring diverse parameter updates and fully utilizing hardware parallelism compared to smaller batch sizes. The number of training epochs, set to 200, represents a crucial hyperparameter: this configuration allows 200 complete passes through the training dataset, enabling effective learning while conserving computational resources, whereas excessive epochs may lead to overfitting and degrade generalization ability. The learning rate (0.001) and momentum (0.937) regulate convergence speed and training stability; this combination ensures rapid convergence, stable training, and helps avoid local optima during optimization. Details of the hardware and software environments are provided in Table 1.
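For reference, the sketch below shows how such a configuration could be expressed with the Ultralytics training API; the dataset YAML path and the model configuration name are assumptions, and any hyperparameters not listed are left at their defaults.

```python
# Minimal training sketch (Ultralytics API); 'crayfish-seg.yaml' is a hypothetical dataset config.
from ultralytics import YOLO

model = YOLO('yolo11n-seg.yaml')        # segmentation variant used as the baseline
model.train(
    data='crayfish-seg.yaml',           # hypothetical path to the annotated splits
    imgsz=640,                          # input resolution selected in this study
    batch=16,
    epochs=200,
    lr0=0.001,                          # initial learning rate
    momentum=0.937,
)
```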
Evaluation metrics
To comprehensively assess the lightweight performance, accuracy, and real-time capability of the proposed model for crayfish sub-part detection and segmentation, multiple key metrics were employed. For lightweight evaluation, metrics including trainable parameter count, floating-point operations (FLOPs), training time, model size, and GPU utilization are used to quantify computational overhead and hardware adaptability. For accuracy, precision (P), recall (R), F1 score, and mean average precision (mAP) serve as primary metrics. Specifically, mAP@0.5 denotes the mean average precision calculated at an IoU threshold of 0.5, while mAP@0.5:0.95 averages AP over IoU thresholds from 0.5 to 0.95 in steps of 0.05. For real-time performance, frames per second (FPS) during image or video processing reflects inference efficiency. The specific calculation formulas are as follows:

\(P = \dfrac{TP}{TP + FP}\), \(R = \dfrac{TP}{TP + FN}\), \(F1 = \dfrac{2 \times P \times R}{P + R}\), \(AP = \int_{0}^{1} P(R)\,dR\), \(mAP = \dfrac{1}{N}\sum_{i=1}^{N} AP_i\)
Here, \(TP\) (true positives) denotes the number of correctly predicted positive samples, \(FP\) (false positives) the number of negative samples incorrectly predicted as positive, and \(FN\) (false negatives) the number of positive samples incorrectly predicted as negative. Precision, defined as the ratio of true positives to the total number of detected samples, and recall, defined as the ratio of true positives to the total number of ground-truth positive samples, are expressed in Eqs. (6) and (7), respectively. In contrast to precision and recall, mean average precision (mAP) more comprehensively reflects the model's overall detection performance. Average precision (AP) is defined as the area under the precision-recall (PR) curve, and mAP represents the mean value of AP across all categories. According to the COCO benchmark, AP is averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05, while AP50 denotes an IoU threshold of 0.5.
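A minimal sketch of these accuracy metrics, computed from per-class TP/FP/FN counts and a sampled precision-recall curve, is given below; the epsilon guard and the trapezoidal AP integration are simplifications of the interpolated AP used by COCO-style evaluators.

```python
# Simplified accuracy metrics from counts and a sampled PR curve.
import numpy as np

def detection_metrics(tp, fp, fn, eps=1e-9):
    """Precision (Eq. 6), recall (Eq. 7), and F1 from per-class TP/FP/FN counts."""
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return precision, recall, f1

def average_precision(recalls, precisions):
    """AP as the area under the PR curve (trapezoidal, simplified);
    mAP is the mean of AP over all categories."""
    r = np.asarray(recalls, dtype=float)
    p = np.asarray(precisions, dtype=float)
    order = np.argsort(r)
    return float(np.trapz(p[order], r[order]))

p, r, f1 = detection_metrics(tp=90, fp=5, fn=10)   # ~0.947, 0.900, 0.923
ap = average_precision([0.0, 0.5, 0.9, 1.0], [1.0, 0.95, 0.90, 0.60])
```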
Additionally, frames per second (FPS) measures the model's inference speed; a model meets real-time detection requirements if its FPS exceeds 30. Giga floating-point operations (GFLOPs) and trainable parameter count quantify model complexity: lower GFLOPs and fewer parameters indicate reduced computational requirements.
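FPS can be estimated with a simple synchronized timing loop such as the sketch below; the warm-up and iteration counts are illustrative, and batch size 1 at 640 × 640 mirrors the single-image inference setting assumed in this study.

```python
# Rough FPS estimate for any PyTorch model on a single device.
import time
import torch

@torch.no_grad()
def measure_fps(model, imgsz=640, warmup=10, iters=100, device='cuda'):
    """Return frames per second for batch-1 inference with synchronized timing."""
    model.eval().to(device)
    x = torch.randn(1, 3, imgsz, imgsz, device=device)
    for _ in range(warmup):                 # warm up kernels and caches
        model(x)
    if device == 'cuda':
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device == 'cuda':
        torch.cuda.synchronize()
    return iters / (time.perf_counter() - t0)
```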
Results and analysis
Comparison of trunk lightweighting results
As shown in Table 2, D-HGNetV2 demonstrates unique advantages in crayfish sub-part detection and segmentation tasks. Although its parameter count (2.6 M), FLOPs (9.4 GFLOPs), and model size (5.4 MB) are not optimal among lightweight indicators, its detection mAP@0.5 (96.6%) and segmentation mAP@0.5 (95.8%) outperform those of lightweight networks such as HGNetV2 (96.4% / 95.5%) and MobileNetV4 (96.7% / 95.6%). Its segmentation accuracy is only 0.2% lower than that of GhostNet. The core advantage of D-HGNetV2 lies in the integration of the DynamicConv module. Using an input-dependent attention mechanism, it adaptively aggregates multi-scale convolutional kernels, enhancing feature capture for overlapping regions and small targets (e.g., claws and legs). Specifically, DynamicConv generates attention weights via global average pooling and dynamically adjusts convolutional parameters, enabling precise focus on small target features. Meanwhile, its channel selection strategy reduces redundant computations, significantly lowering model complexity while maintaining detection accuracy comparable to YOLOv11n-Seg (only a 0.2% decrease in detection mAP@0.5), and offering greater potential for edge deployment than lightweight networks such as RepViT. The training process curves of mAP@0.5 for different backbone networks are shown in Fig. 9. Experimental data indicate that D-HGNetV2 offers superior feature extraction robustness in complex scenarios compared to similar lightweight solutions, making it an ideal backbone for real-time vision systems in automated crayfish processing.
Ablation experiment
To confirm the performance advantages of the YOLOv11n-DHBC proposed in this study and validate the effectiveness of the improved strategy, four groups of ablation experiments were designed and conducted, with the results presented in Table 3. Among these, the original YOLOv11n model was used as the baseline in the first group of experiments, serving as a reference for comparison. The ‘√’ in the table indicates the improvement points introduced to the baseline model, which were gradually added in the second, third, and fourth groups of experiments to assess the impact of each optimization on model performance and to analyze the overall effectiveness of the proposed method.
As shown in Table 3, after replacing the backbone of YOLOv11n-Seg with the D-HGNetV2 network, detection recall, mAP@0.5, and F1 decrease by 0.1%, 0.2%, and 0.1%, respectively, and segmentation precision and mAP@0.5 each decrease by 0.1%. This minor accuracy drop is an acceptable trade-off given the 10.3% reduction in parameters, 7.8% reduction in computation, and 5.2% reduction in model size, which directly serves the lightweight design objective and helps the model meet edge-device deployment requirements. D-HGNetV2 introduces a dynamic convolution module, which reduces computational complexity and parameter count by eliminating redundant operations and optimising the feature extraction process while maintaining high detection accuracy. With the introduction of the bidirectional feature pyramid network (BiFPN), weighted fusion of multi-scale features is achieved through learnable weights, improving feature utilisation efficiency and further reducing redundant information. Here, with detection and segmentation performance nearly unchanged, the model achieves a substantial 26.9% parameter reduction, 4.3% computation reduction, and 25.9% model size reduction. This reflects a deliberate balance: while detection mAP@0.5 remains stable, the resource savings enhance feasibility for real-time industrial applications. Finally, replacing the traditional upsampling in the neck with the CARAFE module increases detection recall, mAP@0.5, and F1 by 0.3%, 0.2%, and 0.2%, respectively, and segmentation recall, mAP@0.5, and F1 by 0.2%, 0.1%, and 0.1%, respectively, while the parameter count, computation, and model size increase by 5.2%, 2.2%, and 5%, respectively. These additional resource costs are justified by CARAFE's ability to recover finer details through content-aware upsampling, which significantly enhances detection of small targets (e.g., claws and walking legs) critical to crayfish sub-part segmentation. The trade-off reflects the study's aim: marginally increased complexity for meaningful gains in the precision crucial to industrial-grade segmentation tasks.
The comprehensive ablation experiments demonstrate that the improved YOLOv11n-DHBC model achieves a 31% reduction in the number of parameters, a 9.8% reduction in computation, and a 26.3% reduction in model size compared to the original YOLOv11n-Seg model. These reductions align with the goal of lightweight design, making the model better suited for deployment on edge or resource-constrained devices. At the same time, the improved model maintains high precision, with recall, mAP@0.5, and F1 all showing slight improvements over the original model. This reflects a well-balanced trade-off: the enhancements not only reduce resource requirements but also improve detection and segmentation performance where fine-grained accuracy is essential for practical industrial applications.
Comparison of results from different models
To further evaluate the effectiveness of the YOLOv11n-DHBC model, segmentation models including Mask R-CNN, YOLOv5s, YOLOv7, YOLOv7-tiny, YOLOv8n, YOLOv11n, and YOLOv12n were selected for comparison. To ensure fairness in the experimental results, all networks were trained on the same dataset without loading official default weights. After training, the network weights with the highest accuracy were selected, and the test environment was kept unchanged. All models used an image input size of 640 × 640 and were evaluated on the same test set comprising 702 images. The results of each detection model are presented in Table 4.
Mask R-CNN, as a state-of-the-art two-stage instance segmentation framework, achieves respectable accuracy (90.3% Box_mAP@0.5, 89.5% Seg_mAP@0.5), but its massive parameter count (44 M), computational load (255 GFLOPs), and model size (83.9 MB) result in a markedly lower inference speed (11.5 FPS), making it impractical for real-time industrial deployment on edge devices. In contrast, the proposed YOLOv11n-DHBC model introduces dynamic convolution, a bidirectional feature pyramid network (BiFPN), and CARAFE, achieving notable advancements in lightweight design. With 2.0 million parameters (72.9% fewer than YOLOv5s' 7.4 million), it reduces computational load to 9.2 GFLOPs (a 64.2% decrease from YOLOv5s' 25.7 GFLOPs) and model size to 4.2 MB (a 71.0% reduction from YOLOv5s' 14.5 MB). While its inference speed of 65.8 FPS is slightly lower than that of YOLOv8n and YOLOv12n, YOLOv11n-DHBC excels in detection and segmentation accuracy, achieving 96.8% Box_mAP@0.5 and 96.0% Seg_mAP@0.5, surpassing most lightweight counterparts and providing a far more efficient alternative to heavyweight models such as Mask R-CNN without sacrificing performance in complex scenes. Its robustness is particularly evident in scenarios where crayfish overlap or occlude one another. This balanced design across lightweight architecture, accuracy, and adaptability to complex environments positions YOLOv11n-DHBC as an ideal choice for edge-device deployment, offering a high-precision, high-efficiency visual solution for automated crayfish processing.
To intuitively demonstrate the effectiveness of YOLOv11n-DHBC, detection and segmentation experiments were conducted using both the YOLOv11n base model and the improved YOLOv11n-DHBC model on the test set. The results are illustrated in Fig. 10. As shown in the figure, the original YOLOv11n model is prone to missed and false detections, particularly when numerous or complex targets are present. This results in poor detection and segmentation performance due to inadequate feature extraction. Although the improved YOLOv11n-DHBC model still shows occasional incomplete segmentation of crayfish walking legs when their number is large, it eliminates complete missed detections and large segmentation gaps seen in the base model, representing a substantial improvement. Therefore, the YOLOv11n-DHBC model meets the real-time and high-efficiency requirements for sub-part segmentation of river crayfish.
Visualisation and analysis
To illustrate differences in how the model attends to crayfish body parts before and after improvement, we visualized and compared the feature responses of YOLOv11n and YOLOv11n-DHBC using Grad-CAM heatmaps. These heatmaps intuitively highlight pixel regions of interest and their attention levels (with redder colors indicating stronger attention). The heatmaps are shown in Fig. 11.
Analysis of the heatmaps shows that in single-target or low-density scenarios, both the original and improved models effectively capture key features of each anatomical part. However, as the number of crayfish increases or image complexity rises, the original model’s ability to capture features declines markedly, resulting in incomplete information acquisition across large areas. In particular, when multiple crayfish overlap, its ability to distinguish targets from the background deteriorates significantly. By contrast, YOLOv11n-DHBC reduces background noise interference by concentrating attention (red-highlighted regions) on critical features, ensuring precise focus on target parts. The heatmaps further confirm that YOLOv11n-DHBC accurately captures key features across different body regions, demonstrating superior detection and segmentation performance even in complex scenes.
Model generalization experiment
To validate the robustness of the YOLOv11n-DHBC model, a generalization performance comparison test was conducted using the SHOUCRAB dataset. Collected by the SUBAQUEOUS team at Shanghai Ocean University, this dataset is specifically designed for detection and segmentation of different parts of Chinese mitten crabs, encompassing three distinct body part categories. The dataset comprises 2,335 training images, 333 validation images, and 417 test images. The results are presented in Table 5.
The generalization results further confirm that the proposed lightweight architecture of YOLOv11n-DHBC not only preserves competitive accuracy across different datasets and target species but also ensures that its efficiency benefits, in terms of reduced computational load and model size, translate into robust performance in new scenarios. This demonstrates that YOLOv11n-DHBC achieves a superior balance of accuracy and lightweight properties compared with more complex architectures, facilitating deployment in real-world industrial applications that demand both precision and efficiency.
Conclusion
To address the challenges of low detection accuracy in complex scenes, high resource consumption, and substantial response latency in crayfish sub-part processing, this study proposes YOLOv11n-DHBC, a lightweight detection and segmentation model built on the improved YOLOv11n-Seg framework. Through dynamic feature extraction, optimized multi-scale fusion using BiFPN, and detail-preserving CARAFE upsampling, the model achieves concurrent improvements in detection accuracy and computational efficiency on a custom-built dataset. Experimental results show that YOLOv11n-DHBC achieves a detection mAP@0.5 of 96.8% and a segmentation mAP@0.5 of 96.0%, outperforming traditional segmentation methods such as Mask R-CNN (90.3% and 89.5% respectively) by a significant margin, while offering dramatically lower complexity (2.0 M parameters and 9.2 GFLOPs versus 44 M parameters and 255 GFLOPs). Moreover, the model achieves a real-time speed of 65.8 FPS, far exceeding Mask R-CNN’s 11.5 FPS, thus meeting edge device deployment requirements. Ablation studies validate each architectural innovation: D-HGNetV2 enhances feature extraction through dynamic convolution, BiFPN strengthens cross-scale feature fusion, and CARAFE improves small-target detail recovery. Compared to mainstream YOLO variants, YOLOv11n-DHBC excels in lightweight metrics, reducing parameters by 72.9% relative to YOLOv5s while maintaining detection accuracy comparable to YOLOv7-tiny. In challenging scenarios with overlapping or occluded crayfish, detection accuracy exceeds traditional methods by over 30%, offering a precise and efficient visual solution for automated processing. Future work will focus on enhancing model generalization under extreme lighting conditions and integrating the model into industrial real-time systems to further advance aquatic processing automation. These findings position YOLOv11n-DHBC as a promising alternative that balances accuracy and computational efficiency, offering a practical solution for tasks requiring both fine-grained segmentation and real-time processing on edge or embedded devices.
Data availability
The data provided in this study can be obtained from the corresponding author W.S.
References
Tian, J., Zhang, J., Liu, Y., Zhou, D. & Shu, N. Nutritional value and food safety of crayfish. China Food Saf. 89–91 (2024).
Li, J. et al. Exploration on the quality changes and flavour characteristics of freshwater crayfish (Procambarus clarkia) during steaming and boiling. LWT 190, 115582 (2023).
Shen, J., Liu, N., Sun, H., Li, D. & Zhang, Y. An instrument indication acquisition algorithm based on lightweight deep convolutional neural network and hybrid attention Fine-Grained features. IEEE Trans. Instrum. Meas. 73, 1–16 (2024).
Su, Y., Wu, W. W. L., Yu, W., Sun, X. P. Y. & Kang, J. W. Q. Progress of crayfish shelling technology and comprehensive utilisation of shrimp shells. Meat Res. 37, 41–45 (2023).
Zhang, L., Zhang, J., Peng, D., Tang, S. & Li, T. Research on key technologies and equipment for mechanised and efficient shrimp shelling. Guangdong Modern Agricultural Equipment Research Institute. https://kns.cnki.net/KCMS/detail/detail.aspx?dbcode=SNAD&dbname=SNAD&filename=SNAD000001962895. (2021).
Ma, D., Gong, L., Gan, L., Zhao, X. & Zhang, L. Force analysis of roller-type shrimp shelling device during shelling. Mod. Agric. Equip. 41, 64–68 (2020).
Wang, F. & Wang, W. S. X. Y. W. A sheep instance segmentation method based on improved YOLO v8n-seg. Trans. Chin. Soc. Agricultural Mach. 55, 322–332 (2024).
Wang, S., Huang, J., Zhang, P. & Wang, J. A YOLOv4 neural network based quality detection method for crayfish. Food Mach. 37, 120–124 (2021).
Chen, Z., Li, Z. & Yang, Z. Research on target detection method for factory-farmed shrimp based on YOLOv5. Mar. Fish. 44, 610–620 (2022).
Chen, Y. et al. Study on positioning and detection of crayfish body parts based on machine vision. Food Measure. 18, 4375–4387 (2024).
Shen, J. et al. An algorithm based on lightweight semantic features for ancient mural element object detection. Npj Herit. Sci. 13, 1–13 (2025).
Jiang, P., Ergu, D., Liu, F., Cai, Y. & Ma, B. A review of Yolo algorithm developments. Procedia Comput. Sci. 199, 1066–1073 (2022).
He, Z. et al. Comprehensive Performance Evaluation of YOLOv11, YOLOv10, YOLOv9, YOLOv8 and YOLOv5 on Object Detection of Power Equipment. https://doi.org/10.48550/arXiv.2411.18871 (2024).
Terven, J., Córdova-Esparza, D. M. & Romero-González, J. A. A comprehensive review of YOLO architectures in computer vision: from YOLOv1 to YOLOv8 and YOLO-NAS. Mach. Learn. Knowl. Extr. 5, 1680–1716 (2023).
Gai, R., Chen, N. & Yuan, H. A detection algorithm for Cherry fruits based on the improved YOLO-v4 model. Neural Comput. Applic. 35, 13895–13906 (2023).
Feng, J. & Jin, T. CEH-YOLO: A composite enhanced YOLO-based model for underwater object detection. Ecol. Inf. 82, 102758 (2024).
Khanam, R. & Hussain, M. YOLOv11: An Overview of the Key Architectural Enhancements. https://doi.org/10.48550/arXiv.2410.17725 (2024).
Rasheed, A. F. & Zarkoosh, M. YOLOv11 Optimization for Efficient Resource Utilization. https://doi.org/10.48550/arXiv.2412.14790 (2024).
Zhao, Y. et al. DETRs beat YOLOs on real-time object detection. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 16965–16974 (2024).
Chen, Y. et al. Dynamic convolution: Attention over convolution kernels. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 11030–11039 (2020).
Tan, M., Pang, R. & Le, Q. V. EfficientDet: Scalable and efficient object detection. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 10781–10790 (2020).
Wang, J. et al. CARAFE: Content-aware reassembly of features. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV) 3007–3016 (2019).
Zheng, Q. et al. HGO-YOLO: Advancing Anomaly Behavior Detection with Hierarchical Features and Lightweight Optimized Detection. https://doi.org/10.48550/arXiv.2503.07371 (2025).
Wang, R., Liang, F., Wang, B. & Mou, X. ODCA-YOLO: an Omni-Dynamic Convolution coordinate Attention-Based YOLO for wood defect detection. Forests 14, 1885 (2023).
Chen, J. & Er, M. J. Dynamic YOLO for small underwater object detection. Artif. Intell. Rev. 57, 165 (2024).
Shen, J. et al. Finger vein recognition algorithm based on lightweight deep convolutional neural network. IEEE Trans. Instrum. Meas. 71, 1–13 (2022).
Wang, S. et al. Measurement of asphalt pavement crack length using YOLO V5-BiFPN. J. Infrastruct. Syst. 30, 04024005 (2024).
Doherty, J., Gardiner, B., Kerr, E. & Siddique, N. BiFPN-YOLO: One-stage object detection integrating Bi-Directional feature pyramid networks. Pattern Recogn. 160, 111209 (2025).
Lv, K. CCi-YOLOv8n: Enhanced Fire Detection with CARAFE and Context-Guided Modules. https://doi.org/10.48550/arXiv.2411.11011 (2024).
Xu, Y., Lu, J. & Wang, C. YOLO-SOD: Improved YOLO small object detection. In PRICAI 2024: Trends in Artificial Intelligence (eds Hadfi, R., Anthony, P., Sharma, A., Ito, T. & Bai, Q.) 164–176 (Springer Nature, Singapore, 2025). https://doi.org/10.1007/978-981-96-0125-7_14
Author information
Authors and Affiliations
Contributions
All authors made significant contributions to the manuscript. W.S. and C.F.L. conceived and designed; J.Z. and F.Y.F. conceived and designed the experiments; J.Z., W.S., and D.W.C. presented tools and carried out the data analysis; W.S. and C.F.L. wrote the paper. J.P.Z, F.Y.F., and D.W.C. guided and revised the paper; All authors have read and agreed to the published version of the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Shi, W., Zhang, J., Fu, Y. et al. Lightweight detection and segmentation of crayfish parts using an improved YOLOv11n segmentation model. Sci Rep 15, 25634 (2025). https://doi.org/10.1038/s41598-025-11201-9