Abstract
Timely and accurate detection of ear tag dropout is crucial for standardized precision breeding, health monitoring, and breeding evaluation. Reserve breeding pigs exhibit high activity levels and frequent interactions, leading to a higher prevalence of ear tag dropout. However, detection is challenging due to motion blur, small tag size, and significant target scale variations. To address this, we propose a motion blur-aware multi-scale framework, Adapt-Cascade. First, a Weight-Adaptive Attention Module (WAAM) enhances the extraction of motion blur features. Second, Density-Aware Dilated Convolution (DA-DC) dynamically adjusts the convolutional receptive field to improve small ear tag detection. Third, a Feature-Guided Multi-Scale Region Proposal strategy (FGMS-RP) strengthens multi-scale target detection. Integrated into the Cascade Mask R-CNN framework with Focal Loss, Adapt-Cascade achieves 93.46% accuracy at 19.2 frames per second in detecting ear tag dropout in reserve breeding pigs. This model provides a high-accuracy solution for intelligent pig farm management.
Introduction
Ear tags remain widely used in breeding stock management due to their low cost and ready availability. However, incidents of ear tag loss are not uncommon, caused by factors such as biting, abrasion against facilities, and the aging of tag materials. In large group pens where replacement breeding pigs are reared with frequent interactions and high activity levels, ear tag loss is particularly prevalent. Moreover, if more than one breeding pig loses its ear tag, individual identities can become confused, leading to potentially catastrophic consequences for the breeding program. Currently, the detection of ear tag loss in breeding pigs relies primarily on manual observation. This approach is labor-intensive and prone to frequent missed detections, and it often fails to identify tag loss promptly. Such delays further exacerbate the risk of identity confusion. Therefore, promptly and accurately detecting pigs that have lost their tags and alerting farm personnel to reapply ear tags is of critical importance for safeguarding the accuracy of genetic breeding programs for pigs1.
Machine vision-based detection methods provide a promising alternative but face several critical challenges in practical environments: (1) rapid pig movements generate motion blur, significantly degrading image clarity and complicating the accurate detection of small-sized ear tags2; (2) ear tags inherently possess limited visual features, which can become further diminished during feature extraction and downsampling stages of detection models; (3) the substantial scale disparities between pigs and ear tags negatively impact detection performance, demanding specialized multi-scale detection strategies. Addressing these challenges necessitates innovations within object detection frameworks to enhance the robustness and precision of ear tag dropout detection under real-world conditions.
Traditional object detection methods have demonstrated effectiveness primarily in controlled single-class environments. For instance, Yu et al. introduced a multi-feature fusion approach combining geometric, texture, and morphological features, subsequently utilizing an improved Support Vector Machine (SVM) classifier to detect small targets in aerial imagery3. Similarly, Fang et al. leveraged Haar features coupled with the AdaBoost algorithm to improve object detection accuracy4. Nevertheless, the robustness and generalization of these traditional methods are inherently limited by environmental variations and fixed hyperparameter configurations.
Deep learning-based object detection algorithms, including one-stage frameworks such as YOLO series5,6,7,8 and RetinaNet9, and two-stage frameworks represented by Faster R-CNN10, Cascade R-CNN11, and Libra R-CNN12, have exhibited significant advantages in robustness and accuracy. Recent advancements have targeted specific issues like motion blur, small-scale objects, and multi-scale detection challenges. Xiao et al. introduced the EFC and MFR modules within a one-stage detection architecture to boost the detection performance of small targets13. Zhao et al. developed MS-YOLOv7, leveraging adaptive anchor adjustments and multi-scale feature fusion to enhance detection accuracy14. Furthermore, Aakanksha et al. proposed class-centered enhancement strategies to robustly handle motion blur in semantic segmentation tasks15. However, these one-stage models often compromise accuracy to achieve faster detection speeds, whereas two-stage models typically maintain superior accuracy. Zheng et al. employed a label distribution weighting method to enhance the training performance for small-scale objects16, and Cai et al. proposed a reinforcement learning-based region proposal network that dynamically adjusts anchor boxes for accurate localization across multiple scales17. Nonetheless, our previous work using an improved Cascade Mask R-CNN found limitations in reliably detecting ear tag dropouts, particularly under rapid pig movements18.
In response to these existing limitations, this paper proposes an adaptive feature-based detection algorithm specifically designed to address ear tag dropout challenges under motion blur conditions in reserve breeding pigs. Firstly, we introduce a Weight-Adaptive Attention Module (WAAM), based on the Convolutional Block Attention Module (CBAM), to enhance the adaptive extraction of critical motion-blurred features. Secondly, we propose the Density-Aware Dilated Convolution (DA-DC) technique, which adaptively adjusts the receptive fields of convolution kernels, thereby improving the detection capability for small-scale ear tags. Furthermore, we design a Feature-Guided Multi-Scale Region Proposal (FGMS-RP) strategy to optimize detection performance across various target scales. Finally, by integrating WAAM, DA-DC, FGMS-RP, and Focal Loss into the Cascade Mask R-CNN framework, we develop the Adapt-Cascade detection model, significantly advancing the precision and robustness of ear tag dropout detection under realistic farming scenarios.
Materials and methods
Data acquisition and data preprocessing
The experiment selected a reserve breeding pigsty from a large-scale breeding farm located around Hohhot, Inner Mongolia Autonomous Region, China, for data collection. The pigsty measures 5.3 × 2.7 m and is equipped with slotted flooring, housing 28 breeding pigs aged 2 to 3 months; 2 pigs had lost ear tags, while the remaining 26 had intact ear tags. A hemispherical camera (DS-2PT7D20IW-DE, Hikvision, Hangzhou, China) was installed 3.4 m above the feeding area of the pigsty to continuously monitor the pigs 24 h a day from a vertical top-down perspective (Fig. 1). Color images were collected under bright lighting conditions, while in low-light conditions, the camera automatically activates infrared illumination to capture grayscale images.
During data collection, motion disturbance factors were introduced to verify the model's detection performance for moving pigs. An external speaker (ES-626, AILIFU, Shenzhen, China) was installed on the camera to emit a 120 dB alarm sound every 3 min, triggering a disturbance in the pig group and increasing the sampling proportion of motion data. The data was stored in the farm's NVR and downloaded using the iVMS-4200 software (version 3.6.1.6; download at https://partners.hikvision.com/support/hikTools/detail?toolType=iVms4200&id=680371184935768064). The video format was MP4, encoded with H.264, with a resolution of 1920 × 1080 and a frame rate of 25 frames per second.
After data collection, a preprocessing step was implemented to ensure data quality and enhance the effectiveness of model training. This process involved video filtering, frame extraction, image cropping, and similarity-based selection to optimize the dataset’s representativeness and diversity. The specific steps are as follows:
(1) Remove video files where no pigs are visible, the lens is obstructed, or the frames are severely damaged, ensuring the images are clear and usable.
(2) FFmpeg (version 7.1; download at https://ffmpeg.org/download.html#build-windows) is used for frame extraction, with one frame extracted per second. Since the camera video captures both the feeding area and the surrounding regions, the feeding area is cropped and retained as the region of interest to further improve model training efficiency. The cropped image size is 1438 × 973 pixels.
(3) The Structural Similarity Index (SSIM) algorithm19 is used to filter the images, with the similarity threshold set to 0.78 based on repeated experimental verification. Specifically, when the similarity between two images is greater than 0.78, one of the images is discarded; when the similarity is less than or equal to 0.78, both images are retained (a minimal filtering sketch is given after this list). This ensures the diversity and representativeness of the images while avoiding excessive redundancy in the dataset.
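The SSIM-based filter can be summarized with the sketch below. It assumes scikit-image is available and that each extracted frame is compared against the most recently retained frame, which is one reasonable reading of the pairwise rule described above; the function and variable names are illustrative rather than taken from the study's code.

```python
# Minimal sketch of the SSIM redundancy filter (threshold 0.78), assuming
# sequential comparison against the last retained frame.
from skimage.io import imread
from skimage.color import rgb2gray
from skimage.metrics import structural_similarity

def filter_frames(frame_paths, threshold: float = 0.78):
    kept, last = [], None
    for path in frame_paths:
        gray = rgb2gray(imread(path))   # grayscale float image in [0, 1]
        if last is None or structural_similarity(last, gray, data_range=1.0) <= threshold:
            kept.append(path)           # sufficiently different: retain for the dataset
            last = gray
        # similarity > 0.78: discard the new frame as redundant
    return kept
```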
Image annotation and dataset division
A self-developed annotation tool, which is based on the SAM framework, is used for image labeling, resulting in a total of 6,544 images. Among these, 14,894 pigs and 8,348 ear tags have been annotated. The annotation results are exported in the COCO dataset format20, as shown in Fig. 2.
The dataset is randomly divided into training sets and validation sets in an 8:2 ratio. Based on the lighting conditions and the movement status of the pigs, each dataset is further classified into six groups: daytime stationary group, daytime moving group, daytime mixed group, nighttime stationary group, nighttime moving group, and nighttime mixed group. The data distribution is shown in Table 1. In the mixed groups, stationary and moving pigs are in the same image. It is noticeable that moving pigs exhibit clear motion blur, and both the pigs and ear tags are relatively indistinct, as shown in Fig. 3.
Adapt-Cascade architecture
Cascade Mask R-CNN optimizes the quality of candidate boxes through a cascading structure, making it particularly suitable for small object detection and instance segmentation tasks in complex scenes. The model consists of three components: the backbone network, the Region Proposal Network (RPN), and the cascading detection network21. The backbone network extracts semantic features from the input image, while the RPN generates bounding boxes for candidate target regions. Finally, the cascading detection network refines these bounding boxes through a series of detections and corrections, enhancing detection accuracy. However, in actual farming environments, factors such as motion blur, small ear tag size, and large differences in target scales lead to poor ear tag detection performance, and these challenges have not been fully addressed by Cascade Mask R-CNN. Consequently, this paper proposes a motion blur-aware multi-scale framework, Adapt-Cascade, for ear tag dropout detection in reserve breeding pigs, as illustrated in Fig. 4. The model employs ResNeXt-101 as the backbone network22. At the Conv2 to Conv5 stages of ResNeXt-101, the WAAM is integrated to capture key features of the target effectively. At the Conv4 and Conv5 stages, the DA-DC dynamically adjusts the dilated convolution rate to enhance feature extraction for small targets. Additionally, the FGMS-RP is constructed to weight and fuse candidate region features from different levels of the RPN output, enhancing multi-scale feature extraction. Finally, Focal Loss is employed to optimize the classification and regression losses of the cascading detection network, improving the classification accuracy for rare categories.
Weight-adaptive attention module
Images of breeding pigs contain multi-level semantic information. Existing research typically relies on techniques such as Sigmoid to generate attention weights for extracting low-level and high-level semantic features without adapting these weights to the dynamic changes in the input features. Consequently, strong feature information may not be effectively captured, while weak feature information is easily suppressed. This paper presents the WAAM, as illustrated in Fig. 5, which calculates a weight adaptation factor based on the response intensity distribution of the input feature map. This factor dynamically adjusts the attention weights, enabling the model to concentrate more on discriminative and key features, thereby enhancing feature extraction and detection under motion blur conditions.
Channel attention and spatial attention are computed on the input feature map to integrate features along the channel and spatial dimensions, producing the output feature map \(F\). The mean \(\mu_{ij}\) and standard deviation \(\sigma_{ij}\) of \(F\) at position \((i,j)\) are computed using Eq. (1) and Eq. (2), respectively.
where \(C\) represents the number of channels.
The threshold \(T_{ij}\) at position \((i,j)\) is calculated from the mean and standard deviation of the feature distribution at that position, as shown in Eq. (3),
where \(k_{ij}\) is a learnable parameter optimized during network training.
The binary cross-entropy loss is used to compute the classification loss \(L\) during training. The loss function is differentiated via backpropagation to update the parameter \(k_{ij}\), as shown in Eq. (4), where \(\eta\) represents the learning rate.
To prevent \(k_{ij}\) from becoming too large or too small, regularization is applied using Eq. (5), where \(\lambda\) represents the regularization strength coefficient.
The adaptive weight adjustment factor \(\alpha_{ij}\) is calculated from the threshold and the response value of \(F\) at position \((i,j)\), as shown in Eq. (6). When \(R_{ij}\) is less than or equal to \(T_{ij}\), \(\alpha_{ij}\) is used to adjust the attention weights; conversely, when \(R_{ij}\) is greater than \(T_{ij}\), the attention weights are adjusted according to \(R_{ij}\).
The enhancement coefficient matrix \(A \in \mathbb{R}^{H\times W}\) is computed, expanded to \(A \in \mathbb{R}^{1\times H\times W}\), and element-wise multiplied with the feature map \(F \in \mathbb{R}^{C\times H\times W}\) to obtain the enhanced feature map \(F'\), as shown in Eq. (7). This enhances the feature map according to its response values.
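Because Eqs. (1)–(7) appear as display images in the source, the sketch below is only a minimal PyTorch interpretation of WAAM: it assumes a CBAM-style channel/spatial attention front end, a threshold of the form \(T_{ij} = \mu_{ij} + k\,\sigma_{ij}\), a channel-max response \(R_{ij}\), and a sigmoid-smoothed version of the piecewise enhancement rule. The class and parameter names are illustrative.

```python
# Minimal sketch of the WAAM idea under the assumptions stated above.
import torch
import torch.nn as nn

class WAAM(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # CBAM-style channel attention: shared MLP over avg- and max-pooled descriptors
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        # CBAM-style spatial attention: 7x7 conv over channel-wise avg/max maps
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)
        # Learnable scaling for the threshold T = mu + k * sigma
        # (the paper learns a per-position k_ij; a single scalar is used here for brevity)
        self.k = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        # Channel attention, then spatial attention (CBAM ordering)
        ca = torch.sigmoid(self.mlp(x.mean(dim=(2, 3), keepdim=True)) +
                           self.mlp(x.amax(dim=(2, 3), keepdim=True)))
        f = x * ca
        sa = torch.sigmoid(self.spatial(torch.cat(
            [f.mean(dim=1, keepdim=True), f.amax(dim=1, keepdim=True)], dim=1)))
        f = f * sa
        # Per-position statistics across channels: mu_ij, sigma_ij (Eqs. (1)-(2))
        mu = f.mean(dim=1, keepdim=True)
        sigma = f.std(dim=1, keepdim=True)
        t = mu + self.k * sigma          # adaptive threshold T_ij (assumed form)
        r = f.amax(dim=1, keepdim=True)  # response value R_ij (assumed: channel max)
        # Soft version of the piecewise rule: positions with R_ij above T_ij are boosted more
        alpha = torch.sigmoid(r - t)
        return f * (1.0 + alpha)         # enhanced feature map F'
```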
Density-aware dilated convolution
Small targets can be defined based on relative or absolute scale. The relative scale is defined by the ratio of the target size to the image size. The Society of Photo-Optical Instrumentation Engineers defines small targets as those whose pixel area is less than 0.12% of the total image pixels23. In this study, the ear tag occupies approximately 0.11% of the total image pixels, making it a typical small object detection problem. In addition, ResNeXt-101 performs convolution operations using fixed-size convolution kernels, which results in a fixed receptive field and local information loss. This is particularly detrimental to the feature extraction of small targets.
In this study, DA-DC is designed to adaptively adjust the receptive field of the convolution kernel based on the feature density in different regions of the feature map, as shown in Fig. 6. Specifically, DA-DC replaces the \(3 \times 3\) convolutions in the Conv4 and Conv5 stages of ResNeXt-101. The dilation rate is dynamically calculated from the feature values of the previous layer's feature map. Smaller dilation rates are applied in densely featured regions to preserve more local detail, whereas larger dilation rates are used in sparsely featured regions to expand the receptive field of the convolution kernel and extract additional contextual features. This approach enhances the model's ability to extract semantic features, improving the detection performance for small targets.
The feature map from the previous layer is denoted as \(F \in \mathbb{R}^{C\times H\times W}\), where \(C\) represents the number of channels, and \(H\) and \(W\) denote the height and width of the feature map, respectively.
(1) Perform feature aggregation along the channel dimension of \(F\), resulting in the feature value \(D_{ij}\) for each spatial position \((i,j)\).
In Eq. (8), \(D_{ij}\) represents the density value at position \((i,j)\), and \(F^{(k,i,j)}\) is the feature value of the \(k\)-th channel at that position. The density map \(D \in \mathbb{R}^{H\times W}\) is then obtained, representing the feature density at each spatial position.
(2) To keep the density values within a reasonable range, the density map is normalized using min-max normalization, which scales \(D_{ij}\) to the range \([0,1]\); the normalized value is denoted as \(D'_{ij}\), as shown in Eq. (9).
(3) Based on \(D'_{ij}\), the dilation rate of the dilated convolution is dynamically adjusted, as shown in Eq. (10), where \(r_{\min}\) and \(r_{\max}\) represent the minimum and maximum dilation rates, respectively.
(4) The calculated \(r_{ij}\) is used as the dilation rate, and dilated convolution operations are performed position by position.
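Since Eqs. (8)–(10) are shown as images, the sketch below encodes one plausible reading of DA-DC: the density \(D_{ij}\) is taken as the channel-wise mean of absolute activations, the rate mapping is assumed to be \(r_{ij} = r_{\max} - D'_{ij}(r_{\max} - r_{\min})\), and the continuous, position-wise rate is approximated by rounding to a small set of discrete rates and selecting among convolutions run at each rate. All of these choices, and the module/parameter names, are assumptions.

```python
# Minimal sketch of a density-aware dilated 3x3 convolution under the stated assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DADConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, r_min: int = 1, r_max: int = 3):
        super().__init__()
        self.rates = list(range(r_min, r_max + 1))      # candidate dilation rates
        self.weight = nn.Parameter(torch.empty(out_ch, in_ch, 3, 3))
        nn.init.kaiming_normal_(self.weight)

    def forward(self, x):
        # Density map D: channel-wise aggregation (Eq. (8), assumed as mean of |F|),
        # then min-max normalization to [0, 1] (Eq. (9))
        d = x.abs().mean(dim=1, keepdim=True)
        d_min = d.amin(dim=(2, 3), keepdim=True)
        d_max = d.amax(dim=(2, 3), keepdim=True)
        d = (d - d_min) / (d_max - d_min + 1e-6)
        # Eq. (10), assumed form: dense regions -> small rate, sparse regions -> large rate
        r_min, r_max = self.rates[0], self.rates[-1]
        r_map = r_max - d * (r_max - r_min)
        # Approximate position-wise dilation: run the shared kernel at every candidate
        # rate (padding keeps the spatial size) and pick the nearest rate per position
        outs = torch.stack([F.conv2d(x, self.weight, padding=r, dilation=r)
                            for r in self.rates], dim=0)         # (R, N, C_out, H, W)
        idx = (r_map - r_min).round().long().clamp(0, len(self.rates) - 1)
        idx = idx.expand(-1, outs.shape[2], -1, -1).unsqueeze(0)  # (1, N, C_out, H, W)
        return outs.gather(0, idx).squeeze(0)
```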
Feature-guided multi-scale region proposal strategy
Based on the basic principle of the GRoIE region-of-interest extraction algorithm24, it is recognized that all layers of the Feature Pyramid Network (FPN) retain useful information. However, applying uniform RoI Align operations to all layers can introduce irrelevant feature noise. Therefore, this study proposes the FGMS-RP, as shown in Fig. 7. This strategy dynamically adjusts the RoI weights based on the target feature size and the resolution of the feature layers, ensuring that RoI pooling at each layer can adaptively extract features according to the target size. Consequently, it provides a more appropriate feature fusion strategy for targets of different scales.
As shown in Eq. (11), the RPN selects the appropriate FPN feature layer based on the area of the RoI.
In Eq. (11), \(w\) and \(h\) represent the width and height of the RoI, respectively, while 224 is the standard reference size. \(k_0\) is the baseline feature layer, adjusted according to the FPN network; here it is set to 4, representing the 4th layer of the FPN. \(k\) is the level calculated from the RoI, indicating the feature layer to which the RoI should be assigned.
The weight factor \(w_k\) of the RoI for the other feature layers is calculated based on the area \(A\) of the RoI.
In Eq. (12), \(w_k\) represents the weight of the RoI for the \(k\)-th layer of the FPN, \(A_k\) is the standard RoI area suitable for the \(k\)-th feature map, and \(\alpha\) is the coefficient for adjusting the weights across different levels. For each RoI, the weights \(w_2, w_3, w_4\), and \(w_5\) are first calculated from the RoI area for layers \(\mathrm{P}_2, \mathrm{P}_3, \mathrm{P}_4\), and \(\mathrm{P}_5\), respectively. Features are then extracted from each corresponding feature layer as that layer's RoI representation.
The generated candidate regions are input into RoI Align, where feature alignment is performed for each RoI across the \(\mathrm{P}_2, \mathrm{P}_3, \mathrm{P}_4\), and \(\mathrm{P}_5\) feature layers of the FPN. The candidate regions of varying sizes are mapped to a fixed-size \(7 \times 7\) feature map, followed by a convolution operation using a \(5 \times 5\) kernel with padding set to 2. This further extracts and integrates features from the feature map, enhancing the representation of local details. After the convolution, the per-layer feature maps are combined by weighted summation, integrating information from the different FPN layers into a unified multi-scale feature representation.
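The fusion step can be sketched as follows. Since Eq. (12) is not reproduced in the text, the per-level weight here is assumed to decay with the mismatch between the RoI area \(A\) and a level-specific reference area \(A_k\) (a softmax over \(-\alpha\,|\log_2(A/A_k)|\), with \(A_4 = 224^2\) following \(k_0 = 4\)). The class name, the dictionary of FPN levels, and the reference areas are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of weighted multi-level RoI feature fusion under the stated assumptions.
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class FGMSRoIExtractor(nn.Module):
    def __init__(self, channels: int = 256, alpha: float = 1.0):
        super().__init__()
        self.alpha = alpha
        # 5x5 conv with padding 2 applied to every aligned 7x7 map, as described above
        self.post_conv = nn.Conv2d(channels, channels, kernel_size=5, padding=2)
        # Reference RoI areas A_k per level, anchored at 224^2 for k0 = 4 (Eq. (11))
        self.ref_areas = {2: 56.0 ** 2, 3: 112.0 ** 2, 4: 224.0 ** 2, 5: 448.0 ** 2}

    def forward(self, fpn_feats: dict, rois: torch.Tensor):
        # fpn_feats: {level: tensor of shape (N, C, H_l, W_l)} for levels P2-P5
        # rois: (R, 5) rows of (batch_index, x1, y1, x2, y2) in image coordinates
        areas = (rois[:, 3] - rois[:, 1]) * (rois[:, 4] - rois[:, 2])
        aligned, logits = [], []
        for level in (2, 3, 4, 5):
            stride = 2 ** level
            feat = roi_align(fpn_feats[level], rois, output_size=(7, 7),
                             spatial_scale=1.0 / stride)
            aligned.append(self.post_conv(feat))
            # Weight logit: penalize mismatch between RoI area A and reference A_k
            ref = torch.tensor(self.ref_areas[level], device=rois.device)
            logits.append(-self.alpha * (areas.clamp(min=1).log2() - ref.log2()).abs())
        weights = torch.softmax(torch.stack(logits, dim=0), dim=0)   # (4, R)
        feats = torch.stack(aligned, dim=0)                          # (4, R, C, 7, 7)
        return (weights[..., None, None, None] * feats).sum(dim=0)   # (R, C, 7, 7)
```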
The loss of the model
The total loss comprises two parts: the RPN loss and the cascade detection loss, as defined in Eq. (13):
Here, \(\lambda_i\) for each cascade stage is set to 1, 0.5, and 0.25, respectively. \(L_{rpn\_cls}\) and \(L_{rpn\_reg}\) are computed using binary cross-entropy and Smooth L1 loss, while \(L_{cls}\), \(L_{reg}\), and \(L_{mask}\) denote the classification, regression, and mask losses of the cascade network (with \(L_{mask}\) computed via binary cross-entropy).
To address the challenge of imbalanced sample distribution, Focal Loss25 is applied to the classification and regression losses of the cascade detection network, as shown in Eq. (14):
In this formulation, \(p_t\) is the predicted probability, \(\alpha_t\) is the balancing parameter, and \(\gamma\) modulates the loss contribution of easy versus hard samples. This modification effectively reduces the loss weight of easy samples and emphasizes hard ones, thereby mitigating the negative impact of sample imbalance on detection accuracy.
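For reference, the binary form of Eq. (14), \(FL(p_t) = -\alpha_t (1-p_t)^{\gamma}\log(p_t)\), can be written as the short function below. MMDetection provides an equivalent FocalLoss module, so this sketch only makes the easy/hard weighting explicit; the default values \(\alpha_t = 0.25\) and \(\gamma = 2\) are the commonly used ones rather than values reported in the paper.

```python
# Minimal sketch of binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    p = torch.sigmoid(logits)
    p_t = torch.where(targets == 1, p, 1 - p)  # probability of the true class
    alpha_t = torch.where(targets == 1,
                          torch.full_like(p, alpha), torch.full_like(p, 1 - alpha))
    ce = F.binary_cross_entropy_with_logits(logits, targets.float(), reduction="none")
    # (1 - p_t)^gamma down-weights easy samples so hard samples dominate the loss
    return (alpha_t * (1 - p_t).pow(gamma) * ce).mean()
```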
Experimental setup and parameters
The experimental hardware configuration includes two Intel(R) Xeon(R) Gold 6137 processors with 256 GB of memory and eight NVIDIA GeForce RTX 3090 GPUs. The software environment is built on the Ubuntu 20.04 operating system, and the deep learning framework was established using Miniconda3, Python 3.8.5, CUDA 11.7, PyTorch 2.0.0, and MMDetection 2.28.2. During training, the stochastic gradient descent (SGD) method is used to optimize the training loss, with a momentum coefficient of 0.9, an initial learning rate of 0.02, and a weight decay coefficient of 0.0001. The model is trained for 60 epochs with a batch size of 160. For the three-stage Cascade R-CNN, the IoU thresholds are set to 0.5, 0.6, and 0.7, respectively.
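These hyperparameters map onto an MMDetection 2.x configuration roughly as follows. The split of the batch of 160 into 20 samples per GPU across the eight GPUs is an assumption, as the exact per-GPU setting is not stated.

```python
# Sketch of an MMDetection 2.x config fragment matching the reported settings.
optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=None)
runner = dict(type='EpochBasedRunner', max_epochs=60)
data = dict(samples_per_gpu=20)  # assumed: 20 samples x 8 GPUs = batch size 160
# IoU thresholds of the three cascade detection stages
cascade_iou_thrs = [0.5, 0.6, 0.7]
```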
Evaluation indicators
This study uses precision, recall, average precision, mean average precision for bounding box detection (bbox mAP), and mean average precision for instance segmentation (segm mAP) as performance metrics to evaluate the accuracy of the model. Detection speed (FPS) is used to measure the temporal efficiency of the model.
The experiment determines whether a pig has lost its ear tag by calculating the intersection area between the pig and ear tag masks. Individuals annotated at a distance of at least 1 px from the image edge are considered fully within the detection field. Suppose there are \(n\) pigs and \(m\) ear tags detected in the image, denoted by the sets \(B = \{b_1, b_2, \dots, b_n\}\) and \(E = \{e_1, e_2, \dots, e_m\}\), respectively. For any \(b_i\), the intersection area \(A_{ij}\) between \(b_i\) and \(e_j\) from set \(E\) is calculated as shown in Eq. (15).
If \(A_{ij} > 0\), an intersection exists between the ear tag \(e_j\) and the pig \(b_i\), indicating that pig \(b_i\) has not lost its ear tag. Conversely, if \(A_{ij} = 0\) for all ear tags, the ear tag has been lost. To evaluate the model's accuracy in detecting ear tag dropout in breeding pigs, accuracy is redefined as Accuracy, as shown in Eq. (16).
In Eq. (16), \(ET\_Drop\) and \(ET\_NotDrop\) represent the numbers of pigs with and without lost ear tags, respectively. \(TP_{ET\_Drop}\) refers to the number of correctly identified pigs with lost ear tags, and \(TP_{ET\_NotDrop}\) refers to the number of correctly identified pigs without lost ear tags.
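A minimal sketch of this decision rule and of Eq. (16) is given below, assuming binary instance masks stored as boolean NumPy arrays and assuming Accuracy is the share of correctly classified pigs among all pigs in both groups; the function names are illustrative.

```python
# Minimal sketch of the mask-intersection rule (Eq. (15)) and Accuracy (Eq. (16)),
# under the assumptions stated above.
import numpy as np

def detect_dropout(pig_masks, tag_masks):
    """Return True for each pig whose mask intersects no detected ear-tag mask (A_ij = 0)."""
    dropped = []
    for pig in pig_masks:                       # each mask: boolean array of shape (H, W)
        has_tag = any(np.logical_and(pig, tag).sum() > 0 for tag in tag_masks)
        dropped.append(not has_tag)             # no intersection with any tag -> tag lost
    return dropped

def dropout_accuracy(tp_drop, tp_not_drop, n_drop, n_not_drop):
    # Assumed reading of Eq. (16): correctly classified pigs over all pigs in both groups
    return (tp_drop + tp_not_drop) / (n_drop + n_not_drop)
```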
Results
The results of training
After each training iteration, the model's detection accuracy for reserve breeding pigs and ear tags was evaluated on the validation set. To assess the performance of Adapt-Cascade, five representative instance segmentation models (Cascade Mask R-CNN, YOLACT26, SOLOv227, DetectoRS28, and Mask Deformable DETR29), all using ResNeXt-101 as the backbone network, were selected for comparison. The changes in the models' bbox mAP, segm mAP, and loss during training are shown in Fig. 8.
Analysis of Fig. 8 indicates that Cascade Mask R-CNN, YOLACT, SOLOv2, and Mask Deformable DETR exhibit slow convergence and relatively lower detection accuracy. After training for 25 epochs, DetectoRS achieves stable bbox mAP and segm mAP at approximately 78% and 79%, respectively. In contrast, after 20 training epochs, Adapt-Cascade’s bbox mAP and segm mAP gradually stabilize at around 91.0% and 89.6%, respectively. The detection accuracy of Adapt-Cascade converges earlier and reaches the highest values, demonstrating a significant improvement in feature adaptation and extraction capabilities with the introduction of WAAM and DA-DC in the backbone network. Regarding model loss, Mask Deformable DETR and SOLOv2 show slower loss convergence and higher loss values. Adapt-Cascade, optimized using Focal Loss, experiences the largest decrease in loss during the early stages of training, converging more quickly and achieving the lowest loss value. These results fully demonstrate the superior performance of the proposed model in terms of feature perception and extraction capabilities.
The optimal results of Adapt-Cascade and the five comparison models are shown in Table 2. Analysis reveals that Adapt-Cascade excels across all evaluation metrics. In terms of precision, Adapt-Cascade achieves bbox mAP and segm mAP of 91.06% and 89.68%, respectively, significantly outperforming the other models. The precision advantage is likely attributable to the WAAM, DA-DC, and FGMS-RP modules, which enhance the model's ability to capture features of small targets and targets at different scales under motion blur conditions. Regarding recall, Adapt-Cascade attains 96.32%, a 6.63 percentage point improvement over Cascade Mask R-CNN, indicating a significant enhancement in the model's ability to detect targets comprehensively and reduce false negatives. In terms of detection speed, the FPS of the proposed model is 19.20, which is 3.98 and 2.44 FPS lower than YOLACT and Cascade Mask R-CNN, respectively. This difference is primarily due to the model's computational complexity and multi-stage detection strategy. Despite the slower computation, a detection rate of 19.20 FPS still meets real-time detection requirements, and the improvements in accuracy and recall compensate for the reduced detection speed.
Detection performance under varied lighting and motion conditions
To validate the proposed model's robustness, the bbox mAP and segm mAP of the six models were compared under different lighting conditions and motion states on the validation set, with the results shown in Table 3.
As shown in Table 3, Adapt-Cascade achieves bbox mAP and segm mAP of 91.06% and 89.68% on the validation set, respectively, improvements of approximately 7.01 and 6.96 percentage points over the second-best model, DetectoRS, demonstrating stronger detection accuracy.
From the perspective of lighting, Adapt-Cascade achieves bbox mAP values ranging from 90.87% to 92.45% and segm mAP values ranging from 88.46% to 91.02% across the daytime data groups, performing the best among all models. This indicates that the model has a significant advantage in well-lit environments. Although detection performance on the nighttime data groups is slightly lower, it still ranges from 90.42% to 92.17% for bbox mAP and from 88.96% to 90.82% for segm mAP, demonstrating good robustness under complex lighting conditions.
From the perspective of motion state, Adapt-Cascade demonstrates the best detection performance on the stationary data groups, indicating excellent detection of static targets. For the mixed and motion data groups, the model's performance is significantly better than that of the other five models, showing strong adaptability to fast-moving targets. Compared to Cascade Mask R-CNN, the proposed model improves bbox mAP and segm mAP on the validation set by 7.71 and 8.50 percentage points, respectively. For the daytime motion group, bbox mAP and segm mAP improve by 10.70 and 9.81 percentage points, and for the nighttime motion group by 12.13 and 12.43 percentage points, the largest gains observed. This demonstrates that the proposed method effectively enhances the model's detection performance on motion-blurred data.
Ablation experiment
This paper is based on Cascade Mask R-CNN, with the incorporation of WAAM and DA-DC in the backbone network, integration of FGMS-RP in the RPN, and optimization of the loss function to construct an optimal breeding pig ear tag dropout detection model. Ablation experiments were designed using the controlled variable method, with the results shown in Table 4.
As shown in Table 4, using Focal Loss to calculate the model's loss improves its detection accuracy and recall for breeding pigs and ear tags without affecting detection speed. This demonstrates that Focal Loss offers advantages in classification accuracy for imbalanced tasks, helping to enhance the model's detection ability. After incorporating the WAAM module, the model's bbox mAP and segm mAP increased from 83.47% and 81.83% to 85.55% and 82.78%, respectively. This improvement is primarily due to the model's increased focus on discriminative features, which enhances its adaptability to features and, in turn, boosts detection accuracy.
After adding DA-DC, the model performs dilated convolutions based on the feature distribution, with bbox mAP and segm mAP reaching 88.45% and 85.88%, respectively, and a notable 2.9 percentage point increase in recall. This indicates that DA-DC is crucial in improving detection accuracy and the recall rate for low-probability targets.
Introducing the FGMS-RP module enables the model to dynamically adjust weights of RoIs, providing more suitable feature fusion strategies for targets of different scales. This results in a bbox mAP of 91.06%, segm mAP of 89.68%, and a recall rate of 96.32%, achieving the best performance.
Furthermore, the inference speed remains relatively stable across all experimental groups. Overall, the collaborative work of the improved modules significantly enhances the model’s feature extraction and target detection capabilities. The model substantially improves detection accuracy while maintaining a relatively high inference speed.
Analysis of ear tag dropout detection performance
The detection performance of the proposed model and comparison model for breeding pigs and ear tags is shown in Table 5.
Analyzing Table 5, Adapt-Cascade demonstrates excellent performance in terms of bbox mAP, segm mAP, and recall for both breeding pigs and ear tags. Cascade Mask R-CNN and DetectoRS show relatively stable detection accuracy, but their recall rates are significantly lower than Adapt-Cascade's. This suggests that the improvements in feature adaptation within Adapt-Cascade play a crucial role in enhancing detection accuracy and recall. The performance of all models for breeding pig detection is relatively accurate and stable. However, there are considerable differences in ear tag detection performance. Adapt-Cascade achieves bbox mAP and segm mAP for ear tag detection of 88.06% and 87.98%, respectively, demonstrating strong small target detection capability. SOLOv2 and Mask Deformable DETR perform slightly worse than DetectoRS and Cascade Mask R-CNN, primarily because the latter two reinforce small target detection through feature resampling. YOLACT, as a single-stage detection algorithm, has a speed advantage but lags significantly behind in recall and detection accuracy. Overall, Adapt-Cascade surpasses the other models in detection accuracy and recall, making it well-suited for complex tasks such as detecting ear tag dropout in reserve breeding pigs.
Analyzing Table 6, the proposed model achieves an accuracy of 93.46% for ear tag dropout detection in breeding pigs, which is 7.08 percentage points higher than Cascade Mask R-CNN. This validates that the improvement strategies have enhanced the model's ability to detect small ear tags and multi-scale targets in motion-blurred data, significantly improving its accuracy for ear tag dropout detection in reserve breeding pigs under production conditions.
Comparison with the latest SOTA methods
Table 7 presents a direct comparison between Adapt-Cascade and three recent state-of-the-art (SOTA) instance segmentation methods under identical training and evaluation protocols on our dataset. EVA30 achieves bbox mAP of 89.72%, segm mAP of 87.45%, and recall of 94.10% at a speed of 14.50 frames per second (FPS). CBNetV231 records 90.15% bbox mAP, 88.20% segm mAP, and 96.43% recall at 16.80 FPS, achieving the highest recall among all methods. Co-DETR32 attains 89.98% bbox mAP, 87.98% segm mAP, and 94.30% recall at 22.15 FPS, demonstrating the fastest inference speed. In comparison, Adapt-Cascade delivers the best overall accuracy with bbox mAP of 91.06% and segm mAP of 89.68%, achieves a competitive recall of 96.32%, and maintains a speed of 19.20 FPS. These results indicate that while CBNetV2’s dual-backbone design yields the highest target coverage (recall), Adapt-Cascade strikes a superior balance between precision (mAP) and efficiency (FPS), making it particularly well-suited for high-accuracy applications.
Discussion
In this study, Adapt-Cascade outperformed several state-of-the-art models in terms of bbox mAP and segm mAP under various lighting and motion conditions. Notably, the model demonstrated significant improvements in scenarios with motion blur and background interference, indicating its strong robustness and adaptability in complex production environments.
Grad-CAM was used for heatmap analysis of the model33, and the results are shown in Fig. 9. The first row presents six representative sets of data images, the second row shows the heatmaps for the Cascade Mask R-CNN, and the third row displays the heatmap results for Adapt-Cascade.
Fig. 10. Comparison of detection effect diagrams of different models. The figure compares the performance of various models under different challenging conditions: (a) detecting smaller ear tags during the daytime; (b) distortion during the daytime; (c) the impact of motion blur during the daytime; (d) the effect of background interference during the nighttime; (e) gathering occlusion during the nighttime; (f) motion blur during the nighttime.
As observed, the original images include breeding pigs under different lighting conditions (daytime and nighttime) and various motion states. For Cascade Mask R-CNN, the hotspot regions concentrate mainly on the pigs' bodies, but they do not align well with the body boundaries, showing some degree of diffusion and blurring. The differentiation between different targets is insufficient, and some background areas are notably activated, indicating the algorithm's limited ability to suppress background interference, which impacts detection performance. In contrast, the hotspots of Adapt-Cascade are highly concentrated on the target areas and align well with the contours of the pigs' bodies. The differentiation of hotspots between different pigs is more distinct, and the activation of background areas is significantly reduced. This suggests that Adapt-Cascade can activate features according to the distribution of target characteristics, which contributes to improved detection capability.
The heatmap results indicate that Adapt-Cascade achieves superior feature localization. This is primarily due to the incorporation of motion blur-aware modules and multi-scale feature fusion strategies, which enhance the model's ability to discriminate target boundaries and suppress background interference. This behavior is consistent with the mechanisms of the Weight-Adaptive Attention Module and the Density-Aware Dilated Convolution, which have been shown in similar previous studies to improve feature representation34,35. The quantitative results in Table 4 further corroborate this enhancement: with Focal Loss applied, introducing WAAM alone raises bbox mAP from 83.47% to 85.55% and segm mAP from 81.83% to 82.78%; adding DA-DC or FGMS-RP yields further notable gains; and when all three modules are combined, the model achieves 91.06% bbox mAP and 96.32% recall, consistent with the boundary-focused and background-suppression patterns observed in the heatmaps.
The comparison of the model’s detection performance is shown in Fig. 10. The first row represents representative original data images, the second row shows the detection results of Cascade Mask R-CNN, and the third row presents the detection results of Adapt-Cascade. The results reveal that, in the breeding pig images from the production environment, particularly for targets with motion blur due to the rapid movement of breeding pigs during the day and night (see (c) and (f)), Adapt-Cascade effectively detects the ear tags, improves recall rate, and reduces false negatives. Under conditions of background interference, image corruption, and large differences in target scales (see (b), (d), and (e)), the model can accurately detect both the breeding pigs and ear tags, improving detection accuracy for the body region of the breeding pigs and avoiding false detection of the ear tags. In particular, in Fig. 10 (a), where the ear tag is small due to the side position of the ear, Adapt-Cascade achieves a detection accuracy of 0.99 for the ear tag at this location, which is 0.02 higher than Cascade Mask R-CNN. This demonstrates that the proposed algorithm exhibits excellent detection performance and robustness.
Fig. 9. Heat map of the model. This figure compares the heat maps from different models under various conditions: (a) Daytime Static, (b) Daytime Motion, (c) Daytime Mixed, (d) Nighttime Static, (e) Nighttime Motion, and (f) Nighttime Mixed. The heat maps illustrate the distribution of attention across different areas of the images.
In Table 3, Adapt-Cascade consistently demonstrates high performance across various illumination and motion conditions. In the daytime static group, the model achieves peak scores, with bbox mAP of 92.45% and segm mAP of 91.02%. However, when motion is introduced (daytime moving) or a combination of motion and static frames is present (daytime mixed), there is a modest decline of approximately 1.5 to 1.6 percentage points, reflecting the impact of motion blur on boundary localization. Under nighttime illumination, the static group still attains scores of 92.17% and 90.82%, indicating that spatial detail is largely preserved. Nevertheless, the combined challenges of reduced contrast, absence of chromatic cues, and motion blur cause the nighttime mixed and moving groups to drop to 90.76%/89.26% and 90.42%/88.96%, respectively. The most significant performance gap is observed in the nighttime moving group, where accuracy decreases by approximately 2 percentage points compared to the daytime static group. These findings validate the robustness of Adapt-Cascade while also highlighting opportunities to further narrow the residual gaps through targeted low-light enhancement and motion-deblurring strategies.
Despite the promising results, certain limitations persist. For example, the detection performance in extremely low-light or high-occlusion scenarios still shows potential for improvement. Figure 11, columns (a)–(c), illustrates failures caused by severe occlusion: when viewed from a top-down perspective, the erect ear orientation only exposes a single tag edge or a minimal fragment, which prevents the model from capturing a complete tag silhouette and results in missed detections. Columns (d)–(f) depict failure cases in low-light conditions: the circular arrangement of the infrared fill light produces a bright central region, with rapidly diminishing illumination toward the upper-right and lower-right corners. This extreme reduction in contrast obscures tag features and amplifies background noise, likewise leading to detection omissions. These examples underscore the necessity for enhanced occlusion-robust feature integration and dedicated low-light enhancement or denoising modules to further strengthen Adapt-Cascade’s performance under such adverse conditions.
Future work could integrate occlusion-aware augmentation and low-light enhancement, alongside advanced attention mechanisms, to mitigate the current failure modes observed under severe occlusion and extreme dimness. Furthermore, evaluating the model on larger and more diverse datasets will further validate its generalizability.
Conclusion
To address the challenges posed by motion blur, small ear tags, and large differences in target scales in ear tag dropout detection in production environments, this paper presents Adapt-Cascade: a motion blur-aware multi-scale framework for ear tag dropout detection in reserve breeding pigs. The algorithm is based on Cascade Mask R-CNN, incorporating WAAM and DA-DC in the backbone network, using FGMS-RP to optimize the RoI selection process, and employing Focal Loss to optimize the model's loss function. Ablation experiments validate the specific contributions of the improvement strategies to the model's performance. Additionally, Grad-CAM was used to generate heatmaps for target detection, providing a more intuitive demonstration of the model's working mechanism. On the validation set, Adapt-Cascade achieves a bbox mAP of 91.06% and a segm mAP of 89.68%, improvements of 7.71 and 8.5 percentage points, respectively, over the pre-improved model. The improvement in detection performance is especially notable for motion-state data and ear tags, validating the proposed model's advantages in motion blur and small target detection tasks. However, detection performance still lags in extremely low-light or high-occlusion scenarios. Future research will continue to optimize the algorithm structure and incorporate advanced image preprocessing techniques to enhance the model's robustness and make it more suitable for complex farming environments.
Data availability
The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy.
References
Wang, R., Gao, R., Li, Q. & Dong, J. Pig face recognition based on metric learning by combining a residual network and attention mechanism. Agric 13, 144 (2023).
Barbedo, J. G. A., Koenigkan, L. V., Santos, T. T. & Santos, P. M. A study on the detection of cattle in UAV images using deep learning. Sensors 19, 5436 (2019).
Yu, H. et al. A classification method for marine surface floating small targets and ship targets. IEEE J. Miniaturization Air Space Syst. 5, 94–99 (2024).
Fang, H., Liu, Y. & Zhang, Z. A hybrid method for small object detection in aerial images. J. Image Graph. 23, 221–229 (2018).
Bochkovskiy, A., Wang, C. Y. & Liao, H. Y. M. YOLOv4: optimal speed and accuracy of object detection. Preprint at arXiv:2004.10934 (2020).
Li, C., Li, L., Jiang, H., Zhao, Y. & Zhou, T. YOLOv6: a single-stage object detection framework for industrial applications. Preprint at arXiv:2209.02976 (2022).
Wang, C. Y., Bochkovskiy, A. & Liao, H. Y. M. YOLOv7: trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. 7464–7475 (2023).
Varghese, R. & Sambath, M. YOLOv8: a novel object detection algorithm with enhanced performance and robustness. Proc. Int. Conf. Adv. Data Eng. Intell. Comput. Syst. (ADICS) 1–6 (2024).
Lin, T. Y., Goyal, P., Girshick, R., He, K. & Dollár, P. Focal loss for dense object detection. Proc. IEEE Int. Conf. Comput. Vis. 2980–2988 (2017).
Ren, S., He, K., Girshick, R. & Sun, J. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39, 1137–1149 (2016).
Cai, Z. & Vasconcelos, N. Cascade R-CNN: delving into high quality object detection. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. 6154–6162 (2018).
Pang, J. et al. Libra R-CNN: towards balanced learning for object detection. Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. 821–830 (2019).
Xiao, Y., Xu, T., Yu, X., Fang, Y. & Li, J. A lightweight fusion strategy with enhanced inter-layer feature correlation for small object detection. IEEE Trans. Geosci. Remote Sens. 62, Article 4708011 (2024).
Zhao, L. & Zhu, M. MS-YOLOv7: YOLOv7 based on multi-scale for object detection on UAV aerial photography. Drones 7, 188 (2023).
Rajagopalan, A. N. Improving robustness of semantic segmentation to motion-blur using class-centric augmentation. Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (2023).
Zheng, Z., Wang, P., Liu, W., Zhang, L. & Li, X. Enhancing geometric factors in model learning and inference for object detection and instance segmentation. IEEE Trans. Cybern. 50, 3517–3530 (2020).
Cai, Z. & Vasconcelos, N. Cascade R-CNN: high quality object detection and instance segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 43, 1483–1498 (2021).
Wang, F., Fu, X., Duan, W., Wang, B. & Li, H. Visual detection of lost ear tags in breeding pigs in a production environment using the enhanced cascade mask R-CNN. Agric 13, 2011 (2023).
Wang, Z., Bovik, A. C., Sheikh, H. R. & Simoncelli, E. P. Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13, 600–612 (2004).
Lin, T. Y. et al. Microsoft COCO: common objects in context. Proc. Eur. Conf. Comput. Vis. 740–755 (2014).
Cai, Z. & Vasconcelos, N. Cascade R-CNN: delving into high quality object detection. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. 6154–6162 (2018).
Xie, S., Girshick, R., Dollár, P. & He, K. Aggregated residual transformations for deep neural networks. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (2017).
Tong, K. & Wu, Y. Deep learning-based detection from the perspective of small or tiny objects: a survey. Image Vis. Comput. 123, 104471 (2022).
Mancini, M., Bulo, S. R., Caputo, B. & Ricci, E. Region-based semantic segmentation with end-to-end training. Proc. IEEE Int. Conf. Image Process. (ICIP) 2454–2458 (2019).
Lin, T. Y., Goyal, P., Girshick, R., He, K. & Dollár, P. Focal loss for dense object detection. Proc. IEEE Int. Conf. Comput. Vis. 2980–2988 (2017).
Bolya, D., Zhou, C., Xiao, F. & Yang, M. YOLACT: real-time instance segmentation. Proc. IEEE/CVF Int. Conf. Comput. Vis. 9157–9166 (2019).
Wang, X., Li, Y., Chen, Z. & Li, X. SOLOv2: dynamic and fast instance segmentation. Adv. Neural Inf. Process. Syst. 33, 17721–17732 (2020).
Qiao, S., Chen, L. C. & Yuille, A. DetectoRS: detecting objects with recursive feature pyramid and switchable atrous convolution. Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (2021).
Dai, X. et al. Dynamic DETR: end-to-end object detection with dynamic attention. Proc. IEEE/CVF Int. Conf. Comput. Vis. 2988–2997 (2021).
Fang, Y. et al. EVA: exploring the limits of masked visual representation learning at scale. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. 19358–19369 (2023).
Liang, T. et al. CBNet: a composite backbone network architecture for object detection. IEEE Trans. Image Process. 31, 6893–6906 (2022).
Zong, Z. et al. DETRs with collaborative hybrid assignments training. In Proc. IEEE/CVF Int. Conf. Comput. Vis. 6748–6758 (2023).
Selvaraju, R. R. et al. Grad-CAM: visual explanations from deep networks via gradient-based localization. Proc. IEEE Int. Conf. Comput. Vis. 618–626 (2017).
Liu, H. et al. A skin disease classification model based on multi scale combined efficient channel attention module. Sci. Rep. 15, 6116 (2025).
Zhao, X. et al. A quality grade classification method for fresh tea leaves based on an improved YOLOv8x-SPPCSPC-CBAM model. Sci. Rep. 14, 4166 (2024).
Funding
This research was funded by the Major Science and Technology Project of Inner Mongolia Autonomous Region (2021ZD0005), Science and Technology Innovation Team Construction Special Project for Universities of Inner Mongolia Autonomous Region (BR231302), Natural Science Foundation of Inner Mongolia Autonomous Region (2024MS06002), Innovation Research Team Program for Higher Education Institutions of Inner Mongolia Autonomous Region (NMGIRT2313), the Natural Science Foundation of Inner Mongolia Autonomous Region (2025ZD012) and the Fundamental Research Funds for Universities Directly Under Inner Mongolia Autonomous Region (BR22-14-05).
Author information
Authors and Affiliations
Contributions
Weijun Duan and Fang Wang conceived the experiments; Weijun Duan, Fang Wang and Buyu Wang conducted the experiments; Weijun Duan and Buyu Wang analysed the results; Xueliang Fu and Honghui Li provided resources and supervision; Weijun Duan and Fang Wang prepared the manuscript. All authors reviewed the manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Duan, W., Wang, F., Fu, X. et al. Motion blur aware multiscale adaptive cascade framework for ear tag dropout detection in reserve breeding pigs. Sci Rep 15, 24188 (2025). https://doi.org/10.1038/s41598-025-09679-4