Abstract
Foreign objects such as packaging bags on the road pose a significant threat to driving safety, especially at high speeds or under low-visibility conditions. However, research on detecting road packaging bags remains limited, and existing object detection models face challenges in small object detection, computational efficiency, and embedded deployment. To address these issues, this work proposes a real-time detection technique built on the lightweight deep learning model RGE-YOLO. Built upon YOLOv8s, RGE-YOLO incorporates RepViTBlock, Grouped Spatial Convolution (GSConv), and Efficient Multi-Scale Attention (EMA) to optimize computational efficiency, model stability, and detection accuracy. GSConv reduces redundant computations, making the model more lightweight; EMA enhances the model’s ability to capture multi-scale information by integrating channel and spatial attention mechanisms; RepViTBlock integrates convolution and self-attention mechanisms to improve feature extraction capabilities. The proposed method was validated on a custom-built road plastic bag dataset comprising 6,000 images after augmentation. Experimental results demonstrate that RGE-YOLO outperforms state-of-the-art models such as Single Shot MultiBox Detector (SSD) and Faster Region-based Convolutional Neural Network (Faster R-CNN) in terms of mean average precision (mAP 92.2%) and detection speed (250 FPS), while significantly reducing model parameters (9.1 M) and computational complexity (23.9 GFLOPs), which makes it better suited to embedded in-vehicle systems. This work introduces an effective and lightweight approach for detecting road packaging bags and contributes to increased driving safety.
Introduction
Road litter has become increasingly varied in recent years, particularly packaging bags dropped by freight trucks. Scattered materials pose risks to vehicles on the road. For instance, if cargo is inadequately secured during transportation, spillage or scattering may occur, leading to accidents and safety hazards. These packaging bags, regardless of their size, pose serious threats to driving safety, ranging from vehicle damage to severe accidents such as rear-end collisions, guardrail impacts, loss of control, and even rollovers. Road packaging bags appear unpredictably, and high vehicle speeds combined with the debris lying close to the road surface make them difficult to detect, which increases the risk of collisions. While many studies focus on using deep learning methods to identify vehicles, non-motorized vehicles, and pedestrians, research on detecting road packaging bags remains limited. If vehicles fail to identify and avoid these packaging bags in a timely manner, the bags become a severe traffic safety hazard. More effective detection and recognition methods for packaging bags are therefore urgently needed to ensure vehicle safety. An example of a road packaging bag is shown in Fig. 1.
An example of a road packaging bag.
Deploying AI models in autonomous vehicles requires addressing several constraints. These include computational limitations, as embedded hardware has limited processing power, necessitating models with low computational complexity for real-time processing. Memory and storage restrictions also demand models that are efficient in resource usage. Power consumption is critical, as excessive energy use can impact vehicle range, requiring models to be energy-efficient. Finally, real-time response is essential, with AI models needing low latency to react quickly to environmental changes, ensuring safe operation. Deep learning models have significantly advanced computer vision and object detection algorithms, enabling more accurate and efficient recognition of objects in images and videos. These models, such as Convolutional Neural Networks (CNNs1,2), You Only Look Once (YOLO3), and Faster R-CNN4, have revolutionized tasks like vehicle recognition and autonomous navigation. Among these models, the YOLO model offers significant advantages for autonomous driving applications, including real-time object detection, high accuracy, low latency, and efficient resource utilization, making it ideal for embedded systems in dynamic environments.
In vehicular object detection, model performance is evaluated with several key metrics: mAP, detection speed, model size, and parameter count. These metrics are crucial for assessing the effectiveness and efficiency of detection algorithms. We seek to improve road packaging bag detection by creating RGE-YOLO, an upgraded version of YOLOv8s5. RGE-YOLO integrates RepViTBlock6, GSConv7, and EMA8 to optimize computational efficiency, model stability, and detection accuracy. The model’s performance is evaluated using mAP, which measures the accuracy of object detection; detection speed, indicating the model’s real-time processing capability; model size, reflecting the storage requirements; and parameter count, which influences computational efficiency. By optimizing these metrics, RGE-YOLO aims to balance high detection accuracy and operational efficiency, making it suitable for deployment in autonomous driving systems. The primary contributions of this study are as follows.
- A novel deep learning-based method for real-time detection of packaging bags on roads is proposed. To enhance detection accuracy, a specialized dataset of road packaging bag images is constructed, incorporating data augmentation and annotation techniques. The developed RGE-YOLO model refines the YOLOv8s architecture by integrating RepViTBlock, EMA, and GSConv, striking a balance between computational economy and detection accuracy.
- The proposed model improves road packaging bag detection performance, reaching an mAP of 92.2%, higher than that of the SSD9, Faster R-CNN, Mask R-CNN10, DINO11, Dynamic R-CNN12, FoveaBox13, and TOOD14 models.
- The model’s size and parameter count have been significantly reduced, while detection speed has been enhanced, making it more suitable for deployment on embedded in-vehicle systems.
The proposed RGE-YOLO is specifically tailored for deployment on embedded systems typically utilized in self-driving cars, such as the NVIDIA Jetson AGX Xavier or other edge computing platforms. These devices often have ARM-based CPUs (8-core NVIDIA Carmel), GPU accelerators (512-core Volta GPU), and limited memory resources (8-16 GB RAM). To meet these limits, RGE-YOLO’s lightweight architecture (9.1 million parameters, 23.9 GFLOPs) provides compatibility with real-time needs.
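As a rough illustration of what these figures imply for an embedded target, the sketch below converts the reported parameter count into a weight-memory estimate and the reported frame rate into a per-frame time budget. The storage precisions (FP32 and FP16) and the helper names are assumptions made for this example; the text does not state the precision used on the device, and the 250 FPS figure may have been measured on different hardware than the embedded platform.

```python
# Back-of-the-envelope deployment budget from the figures quoted in the text.
# Assumption: weights are stored as FP32 (4 bytes) or FP16 (2 bytes); the
# storage format used on the target device is not stated in the text.

PARAMS = 9.1e6          # parameter count reported for RGE-YOLO
REPORTED_FPS = 250.0    # detection speed reported in the abstract


def weight_memory_mb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in megabytes (10**6 bytes)."""
    return num_params * bytes_per_param / 1e6


def frame_time_ms(fps: float) -> float:
    """Average time spent per frame at a given frame rate, in milliseconds."""
    return 1000.0 / fps


fp32_mb = weight_memory_mb(PARAMS, 4)
fp16_mb = weight_memory_mb(PARAMS, 2)
per_frame_ms = frame_time_ms(REPORTED_FPS)

print(f"FP32 weights: {fp32_mb:.1f} MB")
print(f"FP16 weights: {fp16_mb:.1f} MB")
print(f"Time per frame at {REPORTED_FPS:.0f} FPS: {per_frame_ms:.1f} ms")
```

The weights alone occupy a small fraction of the 8-16 GB quoted above; activations and runtime overhead are not included in this estimate.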
This manuscript follows a structured organization to present its research contributions systematically. The “Related work” section provides an overview of previous research on road packaging bag detection. The “Methodology” section elaborates on the proposed approach in detail. The “Experiment” section presents the experimental setup, collected data, procedures, results, and corresponding analysis. The “Discussion” section demonstrates RGE-YOLO’s superior performance in road packaging bag detection compared to existing methods. The “Conclusion” section summarizes the proposed method, discusses its implications, and outlines potential directions for future research.
Related work
Object detection
Object detection is one of the core technologies of autonomous driving systems, and its application in this field is developing rapidly. It helps the vehicle perceive the surrounding environment in real time, identify obstacles and other relevant information, and make reasonable decisions. With the continued growth of deep learning and computer vision, the application of object detection in autonomous driving is being continuously optimised and expanded. In in-vehicle applications, the algorithm must balance real-time performance, accuracy, and hardware resource efficiency. Many researchers have made significant progress in object detection applications in autonomous driving, especially in improving detection accuracy, speed, and adaptability. Botezatu et al.15 reviewed recent advancements in deep learning for road analysis in autonomous driving. The paper highlights the application of deep learning models for road detection, traffic sign recognition, and obstacle avoidance, discussing their challenges and potential for improving autonomous vehicle navigation in complex environments.
Bouazizi et al.16 demonstrated the effectiveness of the SSD-MobileNet algorithm for road object detection in ADAS applications, validated through GPU-accelerated transfer learning on the MS COCO dataset. Alam et al.17 and Chaudhuri18 applied the Faster R-CNN algorithm to vehicle detection and smart traffic management, demonstrating improvements in detection accuracy and processing efficiency, with Alam et al. achieving better accuracy and faster processing, and Chaudhuri addressing occlusions and background issues, enhancing vehicle segmentation and tracking in traffic scenarios. The extended Mask R-CNN19,20,21 model has been applied across multiple domains, including road instance segmentation, pedestrian situation recognition, and pothole detection. It demonstrates enhanced accuracy and efficiency, particularly in feature extraction and classification, while effectively addressing diverse conditions and providing accurate object detection and segmentation in real-time applications. Yang et al.22 proposed the RC-DINO model, combining ResNeSt50, CBAM, and DINO, achieving significant improvements in traffic light status recognition, including better precision, recall, and robustness, particularly in recognizing partially covered traffic lights. Zhang et al.12 proposed Dynamic R-CNN, a method that dynamically adjusts label assignment and regression loss functions during training, addressing inconsistencies between fixed network settings and dynamic training procedures. Kong et al.13 introduced FoveaBox, an anchor-free object detection framework that directly learns object existence and bounding box coordinates. Shen et al.23 designed an anchor-free lightweight detection network with channel stacking and attention mechanisms, improving small object detection in aerial images while reducing computational cost and parameter size. 
Feng et al.14 proposed TOOD, a task-aligned one-stage object detection model that explicitly aligns object classification and localization tasks using a novel Task-aligned Head and Task Alignment Learning. Drone-TOOD is a lightweight task-aligned object detection model proposed for vehicle recognition in UAV images24. Zhang et al.25, Qiu et al.26, Xu et al.27, and Wang et al.28 all applied advanced YOLO models to various detection tasks, including asphalt pavement distress, road hazard avoidance for self-driving vehicles, bridge defect detection, and steel defect identification. Their proposed models, SMG-YOLOv8, an optimized YOLOv5, BD-YOLOv8s, and the ECDY network, demonstrated significant improvements in accuracy, generalization, and robustness, surpassing traditional methods and achieving higher precision, recall, and mAP scores in real-world applications.
Current object detection systems have shortcomings that limit their usefulness in real-time, resource-limited scenarios such as autonomous driving. Although Faster R-CNN1,4⁷ achieves high accuracy through its two-stage architecture, its reliance on region proposal networks introduces a computational overhead of 41.35 GFLOPs and limits throughput to approximately 28 FPS, making it impractical for real-time deployment on embedded systems. SSD9, despite its lightweight design, suffers from lower accuracy in small-object detection. It achieves an mAP of 84.3% for road packaging bags because of poor multi-scale feature fusion and rigid anchor box configurations. YOLO variants3,31,33 address speed restrictions through single-stage architectures, but are prone to false positives in cluttered scenes and are unstable when detecting low-contrast, small-scale items such as partially occluded or flattened bags. Transformer-based models like DINO11 and RepViT6 are effective in capturing global context, but have very high processing costs, such as 119 GFLOPs for DINO, making them unsuitable for edge devices. Shen et al.29 integrated attention modules into instrument recognition, improving feature extraction efficiency while maintaining real-time performance, validating attention’s role in fine-grained detection tasks.
Improved YOLO model
The YOLO algorithm demonstrates significant advantages over traditional object detection methods, primarily due to its single-stage architecture that enables real-time processing with high efficiency3. By unifying object detection as a regression problem, YOLO achieves end-to-end optimization through a single forward pass, significantly reducing computational complexity and inference time30. Its ability to process the entire image in one step enhances global contextual understanding, thereby minimizing false positives caused by background noise3. Furthermore, advancements such as multi-scale prediction31 and lightweight backbone networks32 have improved its ability to identify items of different sizes while maintaining a balance between speed and accuracy. These features make YOLO particularly suitable for real-time applications, outperforming two-stage detectors like Faster R-CNN in terms of computational efficiency and deployment.
The YOLO series of object detection algorithms is widely applied in intelligent traffic and autonomous driving. Alahdal et al.33 tested YOLOv5, YOLOv7, and YOLOv8 for real-time object detection in self-driving automobiles, with a focus on early detection of common objects including cars, people, bicycles, and road signs. The results show that YOLOv5 and YOLOv8 surpass YOLOv7 in terms of precision and recall. Flores-Calero et al.34 systematically reviewed the application of YOLO in traffic sign detection and recognition, emphasizing its real-time object detection capabilities. The study highlights challenges faced by YOLO systems, particularly in autonomous vehicles, and discusses relevant datasets and hardware used in training, suggesting future research directions to overcome these challenges.
Bao and Gao proposed YED-YOLO35, an improved object detection algorithm for autonomous driving that incorporates efficient multi-scale attention, an upgraded CSPDarknet53 to level 2 FPN (C2f) module, and a unique intersection union loss function. This enhances the detection accuracy and generalization for small objects like people and bicycles in complicated traffic situations. SD-YOLO-AWDNet36 is a hybrid strategy for object detection in challenging weather conditions for self-driving cars. It combines C3Ghost, GhostConv, Depthwise-Separable Dilated Convolutions, and the novel Focal Distribution Loss to improve detection accuracy and speed while reducing computational overhead, outperforming YOLOv5 with a 54% reduction in FLOPs and a 2.24% increase in mAP. Wang et al.37 introduced an improved YOLOv8 model for road defect detection, enhancing detection accuracy and computational efficiency by incorporating the EMA Faster Block structure, SimSPPF, and Detect-Dyhead, with notable reductions in model size, parameter count, and GFLOPs. Wang et al.38 introduced the YOLOv8-QSD network, an anchor-free detection framework that improves small target detection, especially in long-range scenarios. The model integrates a DBB-based backbone, BiFPN, and a query-based pipeline and outperforms YOLOv8 in both speed and accuracy, achieving 64.5% accuracy on the SODA-A dataset. Shen et al.39 developed a triplet-based lightweight CNN for finger vein recognition, achieving 99.6% accuracy with 45.7% faster inference, highlighting lightweight networks’ potential for real-time embedded systems.
This review of the literature shows that YOLO still faces challenges in detecting small objects, especially those with complex backgrounds or overlapping elements, suggesting the need for further optimization. YOLO has made great strides in road obstacle detection, but problems with detection accuracy and computational efficiency remain, requiring innovative solutions. To overcome these obstacles, we propose RGE-YOLO, an improved algorithm based on YOLOv8s. RGE-YOLO integrates RepViTBlock, GSConv, and EMA to improve feature extraction, reduce computational load, and increase detection accuracy, making it more suitable for real-time deployment in embedded systems. The experimental results indicate that RGE-YOLO performs better in detecting bag obstacles than existing models such as SSD and Faster R-CNN. The proposed method is analyzed in detail, and the analysis shows that it may increase driving safety through efficient and accurate bag detection.
Methodology
Review of YOLOv8
In the field of autonomous driving, object detection algorithms must exhibit real-time performance, high accuracy, robustness, and low computational resource consumption. The YOLO algorithm meets real-time requirements by performing object recognition and localization in a single forward pass. Its global reasoning capability enhances detection accuracy, and its lightweight architecture is well-suited for deployment in resource-constrained on-board systems. Additionally, YOLO’s multi-scale detection capability allows it to handle complex traffic scenarios40. Due to its efficiency, precision, and low computational overhead, YOLO has become an ideal choice for object detection in autonomous driving applications. Ali and Zhang41 provided an extensive review of the YOLO framework, covering its evolution from YOLOv1 to YOLOv11, and discussing improvements in accuracy, speed, and efficiency. The study highlights its applications in healthcare, autonomous vehicles, and robotics, addressing challenges like small object detection and computational constraints. The YOLO algorithm has advantages in detection speed and accuracy, but challenges remain in small object detection and robustness in complex scenes, and further enhancement of the YOLO framework is a future research direction42.
A light-weight variation of the YOLO series, YOLOv8s provides notable gains in detection accuracy, computing efficiency, and inference speed43,44. With its optimized architecture and lightweight backbones like CSPDarkNet and MobileNetV3, YOLOv8s delivers fast inference on resource-constrained devices, making it ideal for embedded and edge computing applications. Model compression techniques, such as pruning and quantization, reduce parameters and computational load while maintaining high precision, particularly in small object detection. The improved loss functions and multi-task learning strategies enhance generalization, ensuring robust performance across various scenarios. YOLOv8s’ efficiency, accuracy, and adaptability make it a powerful solution for real-time object detection in fields like autonomous driving, surveillance, and smart manufacturing.
The proposal of the RGE-YOLO model
In deep learning models, network depth and complexity are associated within a particular range. In general, increasing the depth can boost object detection accuracy. This enhancement nevertheless comes at the expense of increased computing needs, requiring more floating-point calculations and producing larger model weight files that use a significant amount of memory. As a result, more hardware is needed for deployment, including more memory, storage space, and CPU processing power. A lightweight model is desirable in light of these limitations, particularly in autonomous driving applications, where our packaging bag detection model is mostly deployed on embedded controllers or mobile hardware platforms.
In the YOLOv8 series, there are five basic models: YOLOv8x, YOLOv8l, YOLOv8m, YOLOv8n, and YOLOv8s. Among them, YOLOv8s exhibits distinct advantages in autonomous driving, primarily in its efficient real-time object detection capability, lower computational requirements, smaller model size, and higher accuracy. It provides the autonomous driving system with the ability to respond quickly and make precise decisions in various environments, making it particularly suitable for deployment on resource-constrained devices such as in-vehicle computing platforms. To enable rapid and accurate detection of road packaging bags, we propose an enhanced YOLOv8s network model that integrates YOLOv8s with RepViTBlock, GSConv, and EMA. We refer to this network model as RGE-YOLO. The structure of the RGE-YOLO network is illustrated in Fig. 2. As seen in Fig. 2, the GSConv module is added to layer 3, replacing the original convolution operation with global convolution. It improves the model’s spatial perception by allowing it to comprehend the image’s contextual relationships in greater depth. The EMA module, which is applied to layer 4, influences the weight update process of the model and does not directly alter the topology of the computational flow. By producing more stable feature representations, EMA contributes to higher accuracy in the inference phase. Its implementation can enhance the model’s capacity for generalization and smooth out the effects of the many updates made during training. RepViTBlock is placed in layer 22, enhancing the model’s capacity to capture global features via the self-attention mechanism. It identifies correlations among distant pixels in an image, which is crucial for complex scenes in object detection and improves the model’s robustness.
Structure of RGE-YOLO (ours) model.
GSConv
GSConv is a lightweight convolution module developed as an optimization strategy for CNNs in deep learning. It is designed to enhance efficiency by reducing both the number of parameters and the computational complexity of a model, thereby lowering the overall computational load without significantly compromising accuracy. This approach is particularly well-suited for lightweight networks and shows great promise for applications on mobile platforms and other resource-constrained devices. The algorithm exploits the inherent redundancy in high-dimensional convolutional feature spaces to generate a greater number of feature maps while using fewer computations and parameters. The core principle involves first applying a conventional standard convolution, followed by a series of simple linear operations—such as pointwise or spatial convolutions—to expand the feature maps to the required dimensionality. The structure of GSConv is illustrated in Fig. 3.
Structure of GSConv module.
The workflow of the GSConv module is as follows: Initially, the input feature map with C1 channels is processed using a standard convolution—typically a 1×1 convolution—to reduce the channel dimensionality to C2/2, thereby generating a set of hidden features while significantly lowering computational cost. Subsequently, these hidden features are subjected to a depthwise convolution (e.g., a 5×5 convolution applied independently to each channel) to extract spatial information, yielding an output that retains C2/2 channels. The outputs from the standard convolution and the depthwise convolution are then concatenated along the channel dimension, forming a feature map with C2 channels. Finally, a channel shuffle operation is applied to the concatenated feature map to reorder the channels, which facilitates rapid inter-channel information fusion and further enhances the semantic representation of the extracted features.
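The workflow above can be sketched numerically. The following minimal example counts the convolution weights of a GSConv-style block against a standard convolution with the same input and output channels, and shows a two-group channel shuffle on a list of channel indices. The function names, the channel sizes (64 and 128), and the specific interleaving order of the shuffle are assumptions for illustration; bias and normalization parameters are ignored, and the authors' implementation may differ.

```python
# Hypothetical helper names; a sketch of the GSConv workflow, not the authors' code.

def conv_params(c_in: int, c_out: int, k: int, groups: int = 1) -> int:
    """Number of weights in a k x k convolution (bias and batch norm ignored)."""
    return (c_in // groups) * c_out * k * k


def gsconv_params(c1: int, c2: int, k: int = 1, dw_k: int = 5) -> int:
    """Weights in a GSConv-style block: dense conv to C2/2, then depthwise conv."""
    half = c2 // 2
    dense = conv_params(c1, half, k)                         # C1 -> C2/2
    depthwise = conv_params(half, half, dw_k, groups=half)   # one dw_k x dw_k filter per channel
    return dense + depthwise


def channel_shuffle(channels: list, groups: int = 2) -> list:
    """Interleave channels across groups, e.g. [a0, a1, b0, b1] -> [a0, b0, a1, b1]."""
    per_group = len(channels) // groups
    return [channels[g * per_group + i] for i in range(per_group) for g in range(groups)]


c1, c2 = 64, 128  # illustrative channel sizes
for k in (1, 3):
    standard = conv_params(c1, c2, k)
    gs = gsconv_params(c1, c2, k=k)
    print(f"k={k}: standard conv {standard} weights, GSConv-style {gs} weights")

# Channels 0-3 stand for the dense-conv output, 4-7 for the depthwise output.
print(channel_shuffle(list(range(8))))
```

After concatenation, the first half of the channels comes from the dense convolution and the second half from the depthwise convolution; the shuffle interleaves them so that later layers see both kinds of features side by side.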
EMA
The EMA module offers several advantages that enhance feature representation in computer vision tasks. By integrating channel and spatial information, EMA effectively captures both local and global features without significant computational overhead. This is achieved through a parallel subnetwork structure that employs 1×1 and 3×3 convolutions, enabling the model to process features at multiple scales simultaneously. Additionally, EMA incorporates an optimized coordinate attention mechanism, which improves the model’s ability to focus on relevant spatial regions, thereby enhancing performance across various tasks.
The process of the EMA module is illustrated in Fig. 4 and involves several key steps. Initially, the input feature map is partitioned, with one portion processed through a Coordinate Attention (CA) structure to capture spatial dependencies, and the other through a 3×3 convolution to extract local features. The outputs from these two branches are then combined, and through cross-dimensional interactions, pixel-level pairwise relationships are captured. This approach allows the EMA module to retain information on each channel while reducing computational complexity, resulting in a more efficient and effective feature representation.
Process of the EMA module.
RepViTBlock
The RepViTBlock is meticulously engineered to augment the global representation learning capabilities of lightweight models. By employing a structured re-parameterization approach, it enhances the efficiency of CNNs without compromising computational performance. A notable advantage of RepViTBlock is its enhanced feature-fusion capability, which is particularly advantageous for the detection of small target objects. The integration of a lightweight attention mechanism allows RepViTBlock to achieve an optimal balance between computational cost and accuracy, rendering it highly suitable for deployment in mobile and edge devices. The structure of RepViTBlock is illustrated in Fig. 5.
Structure of RepViTBlock module.
The operational framework of the RepViTBlock initiates with the application of a 3×3 depthwise convolution to the input feature map, facilitating the extraction of spatial information. Subsequently, a squeeze-and-excitation (SE) module modulates inter-channel feature responses, thereby enhancing the model’s focus on salient features. The resultant features are then processed through two 1×1 convolutions, which serve to enrich feature representation while maintaining dimensional consistency with the input. In contrast to traditional bottleneck structures that typically employ two 3 × 3 convolutions, the RepViTBlock reduces computational complexity without sacrificing feature expressiveness. The incorporated SE attention mechanism facilitates both local feature extraction and global feature abstraction, collectively contributing to improved overall model performance.
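To make the complexity comparison concrete, the sketch below counts convolution weights for a block made of one 3×3 depthwise convolution followed by two 1×1 convolutions, and for a bottleneck made of two dense 3×3 convolutions. The expansion ratio of 2 between the two 1×1 convolutions, the channel width of 128, and the function names are assumptions for illustration; the SE module, bias terms, and normalization layers are left out of the count.

```python
# Hypothetical helper names; weight counts only, SE module and biases ignored.

def repvit_style_params(c: int, expansion: int = 2) -> int:
    """3x3 depthwise conv, then 1x1 expand and 1x1 project back to c channels."""
    depthwise = 9 * c                   # one 3x3 filter per channel
    expand = c * (c * expansion)        # 1x1 conv: c -> expansion * c
    project = (c * expansion) * c       # 1x1 conv: expansion * c -> c
    return depthwise + expand + project


def two_dense_3x3_params(c: int) -> int:
    """Two dense 3x3 convolutions, each mapping c channels to c channels."""
    return 2 * 9 * c * c


channels = 128  # illustrative width
light = repvit_style_params(channels)
dense = two_dense_3x3_params(channels)
print(f"depthwise + two 1x1: {light} weights")
print(f"two dense 3x3:       {dense} weights")
print(f"ratio: {dense / light:.2f}x")
```

Under these assumptions the depthwise term is negligible and the cost is dominated by the two 1×1 convolutions, which is where the saving over dense 3×3 convolutions comes from.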
Loss function
YOLOv8 employs an anchor-based mechanism for bounding box prediction. Unlike traditional YOLO versions that utilize a fixed number of anchor boxes, YOLOv8 optimizes anchor box design by adopting an adaptive generation method. This approach enhances the model’s adaptability to various datasets and target sizes. The loss function (\(L\)) of YOLOv8 is a multi-component objective function designed to optimize object detection performance. It combines localization loss (\(L_{loc}\)), classification loss (\(L_{class}\)), and objectness loss (\(L_{obj}\)), with additional refinements to improve accuracy and convergence. The loss function \(L\) of YOLOv8 is shown in Eq. (1), and the localization loss \(L_{loc}\) is shown in Eq. (2).
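With the weight coefficients \(\lambda_{loc}\), \(\lambda_{class}\), and \(\lambda_{obj}\) used in this section, the total loss can be written as the weighted sum below. This form is reconstructed from the surrounding definitions and is assumed to correspond to Eq. (1).

\[
L = \lambda_{loc} L_{loc} + \lambda_{class} L_{class} + \lambda_{obj} L_{obj}
\]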
\(\lambda\) represents the weight coefficient for the different loss components and balances the importance of localization, classification, and object confidence in the loss function. It is a key hyperparameter for adjusting model performance during optimization. During model tuning, the officially recommended \(\lambda\) values (\(\lambda_{loc}\) = 7.5, \(\lambda_{class}\) = 0.5, \(\lambda_{obj}\) = 1.0) are used as baseline parameters to observe model performance. If localization errors occur frequently, \(\lambda_{loc}\) should be increased; if classification errors are prevalent, \(\lambda_{class}\) should be increased. If false negatives or false positives are frequent, \(\lambda_{obj}\) should be adjusted accordingly: increase \(\lambda_{obj}\) if the model produces excessive false positives, and decrease it if the model misses many objects.
In the YOLOv8 model, the Complete Intersection over Union (CIoU) loss function is a more precise metric for bounding box regression. Unlike the traditional Intersection over Union (IoU), CIoU not only considers the overlap between bounding boxes but also incorporates penalties for the distance between their centers and the aspect ratio consistency. CIoU is calculated as in Eq. (3), \(\alpha\) as in Eq. (4) and \(\nu\) as in Eq. (5).
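In the commonly used formulation, which is assumed here to match Eqs. (3), (4), and (5), these quantities read as follows. The symbols \(w, h\) and \(w^{gt}, h^{gt}\) for the width and height of the predicted and ground truth boxes are introduced for this illustration; the corresponding regression loss is \(1 - CIoU\).

\[
CIoU = IoU - \frac{\rho^{2}}{c^{2}} - \alpha \nu
\]

\[
\alpha = \frac{\nu}{(1 - IoU) + \nu}
\]

\[
\nu = \frac{4}{\pi^{2}} \left( \arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h} \right)^{2}
\]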
where \(\alpha\) is a weight parameter that balances the contributions of the distance loss and the aspect ratio consistency loss, and \(\nu\) measures the consistency of the aspect ratio between the predicted bounding box and the ground truth box. \(IoU\) is the ratio of the intersection area to the union area of the predicted bounding box and the ground truth box. \(\rho\) represents the Euclidean distance between the centers of the predicted and ground truth boxes, while \(c\) is the diagonal length of the smallest enclosing box that contains both the predicted and ground truth boxes. Figure 6 illustrates the target box regression.
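A minimal sketch of how these terms combine is given below, following the standard CIoU definition. The function name `ciou`, the corner-format box representation `(x1, y1, x2, y2)`, and the small `eps` stabilizer are assumptions for this example and are not taken from the authors' implementation.

```python
import math


def ciou(box_p, box_g, eps=1e-9):
    """Complete IoU between two boxes given as (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = box_p
    gx1, gy1, gx2, gy2 = box_g
    pw, ph = px2 - px1, py2 - py1
    gw, gh = gx2 - gx1, gy2 - gy1

    # IoU: intersection area over union area
    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = inter_w * inter_h
    union = pw * ph + gw * gh - inter
    iou = inter / (union + eps)

    # rho^2: squared distance between the two box centers
    rho2 = ((px1 + px2) / 2 - (gx1 + gx2) / 2) ** 2 \
         + ((py1 + py2) / 2 - (gy1 + gy2) / 2) ** 2

    # c^2: squared diagonal of the smallest box enclosing both boxes
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)
    c2 = cw ** 2 + ch ** 2 + eps

    # v: aspect-ratio consistency term; alpha: its trade-off weight
    v = (4 / math.pi ** 2) * (math.atan(gw / (gh + eps)) - math.atan(pw / (ph + eps))) ** 2
    alpha = v / ((1 - iou) + v + eps)

    return iou - rho2 / c2 - alpha * v


print(ciou((0, 0, 2, 2), (0, 0, 2, 2)))   # identical boxes: close to 1
print(ciou((0, 0, 1, 1), (2, 2, 3, 3)))   # disjoint boxes: negative
```

Unlike plain IoU, the value keeps changing for non-overlapping boxes because the center-distance term still depends on how far apart they are, which gives the regression a useful signal even when the boxes do not intersect.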
Diagrammatic representation of target box regression.
The variables \(t_{x}\), \(t_{y}\), \(t_{w}\), and \(t_{h}\) denote the respective offsets. The Sigmoid activation function, represented by \(\sigma\), is employed to transform the network’s predicted values for \(t_{x}\), \(t_{y}\), \(t_{w}\), and \(t_{h}\) into the range (0, 1). The terms \(c_{x}\) and \(c_{y}\) indicate the offsets of the grid cell relative to the image’s top-left corner. The width and height of the prior box are denoted by \(p_{w}\) and \(p_{h}\), respectively. The center coordinates of the predicted target box are given by \(b_{x}\) and \(b_{y}\), and its width and height by \(b_{w}\) and \(b_{h}\), as detailed in Eqs. (6), (7), (8) and (9).
The \(L_{loc}\) of YOLOv8 combines Distribution Focal Loss (DFL) and CIoU Loss. The model predicts the distribution of the bounding box, and this distribution is optimized through focal loss. DFL is calculated as in Eq. (10).
where \(y\) represents the true value, \(y_{i}\) and \(y_{i + 1}\) are the adjacent discretization coordinates, and \(S_{i}\) and \(S_{i + 1}\) indicate the predicted distribution probabilities.
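With these symbols, the standard Distribution Focal Loss can be written as \(DFL(S_{i}, S_{i+1}) = -\left((y_{i+1} - y)\log S_{i} + (y - y_{i})\log S_{i+1}\right)\), which is assumed here to match Eq. (10). The short sketch below evaluates it for one coordinate; the function name and the example values are illustrative only.

```python
import math


def distribution_focal_loss(y, y_i, y_i1, s_i, s_i1):
    """DFL for one coordinate.

    y          : true (continuous) value, with y_i <= y <= y_i1
    y_i, y_i1  : the two adjacent discretization points around y
    s_i, s_i1  : predicted probabilities assigned to y_i and y_i1
    """
    return -((y_i1 - y) * math.log(s_i) + (y - y_i) * math.log(s_i1))


# The true value 2.3 lies between bins 2 and 3, closer to bin 2.
y, y_i, y_i1 = 2.3, 2.0, 3.0

# Probability mass placed mostly on the nearer bin gives the lower loss.
matched = distribution_focal_loss(y, y_i, y_i1, s_i=0.7, s_i1=0.3)
mismatched = distribution_focal_loss(y, y_i, y_i1, s_i=0.3, s_i1=0.7)

print(f"mass on nearer bin:  {matched:.4f}")
print(f"mass on farther bin: {mismatched:.4f}")
```

The loss is minimized when the two probabilities are proportional to how close the true value is to each bin, so the predicted distribution is encouraged to concentrate around the true coordinate.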
YOLOv8 employs Binary Cross-Entropy (BCE) for multi-label classification. Each class is treated independently, allowing simultaneous detection of multiple labels per object. \(L_{class}\) is given in Eq. (11).
where \(p\) denotes the predicted class probability, and \(y_{gt}\) represents the ground truth label (0 or 1).
\(L_{obj}\) is based on BCE and measures whether the predicted bounding box contains the target. \(L_{obj}\) is given in Eq. (12).
where \(y_{obj}\) represents the ground truth target existence (0 or 1), and \(p_{obj}\) denotes the predicted confidence.
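Both \(L_{class}\) and \(L_{obj}\) rest on the same binary cross-entropy term, \(-\left(y \log p + (1 - y)\log(1 - p)\right)\). The sketch below applies it per class and to the objectness score; the function names, the clamping constant `eps`, the averaging over classes, and the example values are assumptions for illustration and not the authors' code.

```python
import math


def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one prediction p in (0, 1) and label y in {0, 1}."""
    p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def class_loss(probs, labels):
    """Mean BCE over classes, each class treated as an independent binary problem."""
    return sum(bce(p, y) for p, y in zip(probs, labels)) / len(probs)


def objectness_loss(p_obj, y_obj):
    """BCE between predicted confidence and target existence (0 or 1)."""
    return bce(p_obj, y_obj)


# Two hypothetical classes; the object belongs to the first one.
print(f"class loss:      {class_loss([0.9, 0.2], [1, 0]):.4f}")
print(f"objectness loss: {objectness_loss(0.8, 1):.4f}")
```

Because each class has its own sigmoid output and its own BCE term, the class scores do not have to sum to one, which is what allows several labels to be active for the same box.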
Experiment
Image data acquisition
At present, no dedicated dataset exists for road packaging bag detection. To fill this gap, we built a custom dataset by photographing target objects. The images were taken in Tianjin, China, on a university campus inside the Haihe Education Park. A total of 1,000 pictures were acquired, including both empty and partially full bags. This well-balanced dataset allows both common and atypical circumstances to be analyzed more thoroughly. The data acquisition equipment is illustrated in Fig. 7. All photographs in these figures are original and were captured by Dangfeng Pang during field experiments.
The equipment used to collect data. (a) Experimental vehicle. (b) Experimental results. (c) Binocular stereo depth camera.
Fig. 8 provides a visual representation of a subset from our dataset, displaying filled packaging bags and empty packaging bags.
Selected images in the dataset: (a) filled packaging bags; (b) empty packaging bags.
Image data augmentation
When training deep learning models, a small dataset may lead to overfitting, lower model performance, and poor generalization. To address these issues, the dataset must be extended to increase both the number and the variety of training samples. Five augmented variants were generated from each original photo: Gaussian blur, brightness adjustment, noise injection, 90-degree rotation, and 90-degree rotation combined with brightness adjustment. Together with the original, this yields six versions of each image.
Each technique serves a distinct purpose. Gaussian blur helps the model generalize by reducing noise and emphasizing essential features; brightness adjustment improves robustness by simulating varying lighting conditions; noise addition enhances resilience to noisy real-world environments; and 90-degree rotation improves rotational invariance, helping the model detect objects in different orientations. The techniques are illustrated in Fig. 9. Together they increased the sample size from 1,000 to 6,000, reducing the risk of overfitting and improving the model’s robustness and generalization ability.
Examples of image data augmentation: (a) original image; (b) Gaussian blur; (c) brighter; (d) noise; (e) 90° rotation (R90); (f) 90° rotation with brighter.
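The augmentation scheme can be sketched as follows. This is a simplified, dependency-free illustration on a tiny grayscale image stored as a list of rows; the brightness factor (1.3), noise standard deviation (10), 3×3 blur kernel, and all function names are assumptions, not the settings used in the study.

```python
import random

def rotate90(img):
    """Rotate a 2-D image (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def adjust_brightness(img, factor):
    """Scale pixel values and clamp them to the 0..255 range."""
    return [[min(255, max(0, int(round(p * factor)))) for p in row] for row in img]

def add_noise(img, sigma, rng):
    """Add zero-mean Gaussian noise to every pixel, clamped to 0..255."""
    return [[min(255, max(0, int(round(p + rng.gauss(0, sigma))))) for p in row]
            for row in img]

def gaussian_blur3(img):
    """Apply a 3x3 Gaussian kernel, replicating edge pixels at the border."""
    kernel = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]  # weights sum to 16
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr = min(max(r + dr, 0), h - 1)
                    cc = min(max(c + dc, 0), w - 1)
                    acc += kernel[dr + 1][dc + 1] * img[rr][cc]
            out[r][c] = acc // 16
    return out

def augment(img, rng):
    """Return the original image plus five augmented variants."""
    return {
        "original": img,
        "gaussian_blur": gaussian_blur3(img),
        "brighter": adjust_brightness(img, 1.3),
        "noise": add_noise(img, 10.0, rng),
        "r90": rotate90(img),
        "r90_brighter": adjust_brightness(rotate90(img), 1.3),
    }

rng = random.Random(0)
sample = [[10, 20, 30], [40, 50, 60]]
variants = augment(sample, rng)
total_images = 1000 * len(variants)  # six versions per original photo
```

Note that for the rotated variants the bounding-box annotations would have to be rotated in the same way; that step is omitted here for brevity.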
Experimental environment
The experiments were conducted on a single workstation. Hardware and software configurations, along with detailed information regarding the model training environment, are provided in Table 1.
The training parameters used in the experiment are listed in Table 2.
Evaluation criteria
We use four primary metrics to evaluate the detection model’s efficacy: average precision (AP), mean average precision (mAP), recall, and precision. The evaluation framework considers four possible prediction outcomes: True Positives (TP), positive samples correctly identified as positive; False Positives (FP), negative samples misclassified as positive; True Negatives (TN), negative samples correctly classified as negative; and False Negatives (FN), positive samples mistakenly classified as negative. Precision and Recall are computed from the counts of these outcomes, and they in turn feed into the other performance indicators.
Here, P represents Precision, which quantifies the proportion of correctly identified positive samples among all predicted positive cases. This metric, related to the final prediction outcome, is mathematically defined in Eq. (13).
In this context, R represents the Recall rate, which measures the proportion of actual positive samples that the model correctly identifies. This metric, defined in Eq. (14), reflects the model’s ability to capture positive instances within the original dataset.
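The two definitions above correspond to the usual ratios \(P = TP/(TP + FP)\) and \(R = TP/(TP + FN)\). A minimal sketch, with hypothetical counts chosen only for illustration:

```python
def precision(tp, fp):
    """Fraction of predicted positives that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Fraction of actual positives that the model finds."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical counts: 90 correct detections, 10 false alarms, 30 missed bags.
p_example = precision(tp=90, fp=10)
r_example = recall(tp=90, fn=30)
```

With these counts the model is right 90% of the time when it reports a bag, but finds only 75% of the bags that are present.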
In our experiments, both Recall and Precision remained consistently high. To assess the overall performance of the network, we use mAP, a metric that integrates both Recall and Precision and is designed for multi-target detection. Its mathematical expression is given in Eq. (15).
In Eq. (15), N denotes the total number of samples in the validation set. \(P\left( k \right)\) is the precision obtained when k samples have been detected, and \(\Delta R\left( k \right)\) is the change in recall as the number of detected samples rises from k − 1 to k. The total number of classes is denoted by \(c\). In this study, the task is to locate empty and filled bags, hence \(c = 2\).
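The per-class sum of \(P(k)\,\Delta R(k)\), averaged over the \(c\) classes, can be sketched as below. This assumes that k indexes detections ranked by descending confidence and that each detection has already been marked as a true or false positive (for example by an IoU threshold); the data are made up for illustration.

```python
def average_precision(scored_hits, num_gt):
    """AP = sum over k of P(k) * delta_R(k), detections ranked by confidence.

    scored_hits: list of (confidence, is_true_positive) pairs for one class.
    num_gt: number of ground-truth objects of that class.
    """
    if num_gt == 0:
        return 0.0
    ranked = sorted(scored_hits, key=lambda d: d[0], reverse=True)
    tp = 0
    prev_recall = 0.0
    ap = 0.0
    for k, (_, is_tp) in enumerate(ranked, start=1):
        if is_tp:
            tp += 1
        precision_k = tp / k
        recall_k = tp / num_gt
        ap += precision_k * (recall_k - prev_recall)
        prev_recall = recall_k
    return ap

def mean_average_precision(per_class):
    """Mean of per-class AP; per_class is a list of (scored_hits, num_gt)."""
    return sum(average_precision(d, n) for d, n in per_class) / len(per_class)

# Hypothetical detections for the two classes (c = 2).
empty_bag = ([(0.9, True), (0.8, False), (0.7, True)], 2)
filled_bag = ([(0.95, True), (0.6, True)], 2)
map_value = mean_average_precision([empty_bag, filled_bag])
```

False positives contribute nothing directly, because recall does not change at those ranks, but they lower the precision of every true positive ranked after them.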
Experimental results
To evaluate the model’s training process and object detection performance, we track box loss, objectness loss, class loss, precision, recall, mAP@0.5, and mAP@0.5:0.95 on both the training and validation sets. These metrics serve as indicators of the model’s convergence. For the six loss curves (box, objectness, and class loss on the training set, and their validation counterparts), lower values indicate better training performance. Figure 10 presents the variation of these losses over the training epochs, alongside changes in precision, recall, and mAP. The loss values decline rapidly during the initial 100 epochs and gradually stabilize after 200 epochs.
All specific evaluation metrics of RGE-YOLO in our dataset.
The RGE-YOLO network was trained using a dataset containing both filled and empty packaging bags. Its precision-recall (P-R) curve is depicted in Fig. 11. As observed in the figure, the overall test performance is satisfactory. The mAP for empty packaging bags reaches 92.5%, while that for filled packaging bags is 91.9%, resulting in an overall classification mAP of 92.2%. This performance trend can be attributed to the larger number of empty packaging bags in the dataset, which contributes to improved testing results.
P–R curve of RGE-YOLO.
Experimental comparisons of different models
To assess the effectiveness of the proposed model, RGE-YOLO trained on the packaging bag dataset was compared with SSD, Faster R-CNN, Mask R-CNN, DINO, Dynamic R-CNN, FoveaBox, and TOOD models. This comparison demonstrated the superior performance of the proposed model. The comparison of GFLOPs, parameter count, detection speed, and model size for the eight models is shown in Table 3.
Experimental data and comparative analyses demonstrate that RGE-YOLO exhibits significant advantages in object detection tasks. Specifically, it achieves a computational load of only 23.9 GFLOPs, substantially lower than mainstream models such as Faster R-CNN (41.35 GFLOPs) and Mask R-CNN (185 GFLOPs), indicating reduced computational complexity suitable for devices with limited processing power. With just 9.1 million parameters, RGE-YOLO requires less than half the parameters of SSD and a quarter of those in Faster R-CNN, significantly reducing memory usage and computational overhead. The model occupies only 19 MB, markedly smaller than SSD (186 MB) and Faster R-CNN (315 MB), facilitating deployment on in-vehicle embedded devices.
In terms of real-time detection speed, RGE-YOLO achieves up to 250 frames per second, roughly nine times the speed of Faster R-CNN (28 FPS) and 1.7 times that of SSD (143 FPS), meeting the demands of scenarios with strict real-time requirements such as autonomous driving and video surveillance. Due to its lightweight design and high real-time performance, RGE-YOLO can operate efficiently on in-vehicle chips, fulfilling the stringent requirements of autonomous driving for low latency and high reliability. Compared to YOLOv7 and YOLOv8s, RGE-YOLO has a smaller model size and lower computational resource consumption, making it suitable for edge computing scenarios.
The proposed RGE-YOLO addresses significant constraints of existing object detection frameworks for road safety applications. Su et al.45 built a YOLOv7-based model for urban trash detection with a mAP of 0.926 by including Res2Blocks to improve fine-grained feature fusion. Similarly, Li et al.46 developed a Swin Transformer-enhanced YOLOv7 variant (YOLOv7-Swin) for multi-class road litter recognition, achieving a mAP@0.5 of 87.34%. While these studies show progress in general trash detection, they emphasize multi-class classification over computational efficiency and do not specialize in detecting high-risk road packaging bags, a major gap given the direct impact of such bags on driving safety.
The mAP values as a function of gradually increasing epochs are shown in Figure 12. By optimizing network structure and computational strategies, RGE-YOLO achieves a balanced performance of lightweight design, low latency, and high stability while maintaining high detection accuracy (mAP 92.2%). It outperforms traditional models like SSD and Faster R-CNN and significantly surpasses the latest YOLO series in real-time performance, making it an ideal choice for industrial-grade object detection in resource-constrained embedded environments.
mAP value across increasing epochs.
The experimental findings indicate that RGE-YOLO achieves superior performance compared to other object detection algorithms, particularly in terms of mAP, achieving a value of 0.922. This significantly surpasses the performance of Faster R-CNN, which obtains an mAP of 0.901, and SSD, with an mAP of 0.843. The high mAP value of RGE-YOLO indicates its superior detection accuracy, particularly in challenging scenarios such as small object detection and complex backgrounds. In comparison to other state-of-the-art models, RGE-YOLO demonstrates a clear advantage in handling diverse detection tasks.
In contrast, models such as Mask R-CNN (mAP = 0.502), Dynamic R-CNN (mAP = 0.558), FoveaBox (mAP = 0.532), TOOD (mAP = 0.532), and DINO (mAP = 0.506) exhibit considerably lower mAP values, highlighting their limitations in accuracy and robustness. These models struggle to match the performance of RGE-YOLO, particularly in real-time detection applications. The superior mAP and overall detection capability of RGE-YOLO make it a highly reliable and efficient solution, especially in embedded systems where computational resources are constrained.
The confusion matrices of each algorithm in the comparative experiment are shown in Fig. 13. The confusion matrices for the eight object detection algorithms illustrate their classification performance across three categories: emptybag, filledbag, and Background. RGE-YOLO (a) demonstrates the highest precision in detecting emptybag (90%) but shows some misclassification in distinguishing between filledbag and Background. Faster R-CNN (c) and Dynamic R-CNN (b) also perform well, with Faster R-CNN achieving 84% accuracy for emptybag and 70% for filledbag, albeit with some confusion between filledbag and background. FoveaBox (d) slightly improves on the classification of filledbag compared to Dynamic R-CNN, but its overall performance remains comparable.
Confusion matrices for each algorithm in the comparative experiment.
Mask R-CNN (e) and SSD (g) show relatively weaker performance, with SSD struggling significantly in detecting filledbag (only 24% accuracy). DINO (f) and TOOD (h) exhibit the lowest classification accuracy, with notable misclassification between filledbag and Background, leading to poor detection reliability. Overall, RGE-YOLO emerges as the best-performing model due to its high precision and balanced performance across categories, making it suitable for real-world object detection tasks with challenging classification requirements.
Figure 14 presents the real-world detection outcomes of different algorithms for detecting packaging bags, revealing the varying performance levels of each model. RGE-YOLO (a) excels in accuracy, detecting objects with minimal false positives and precise localization, which makes it particularly well-suited for real-time applications. Dynamic R-CNN (b) and Faster R-CNN (c) both demonstrate strong detection capabilities but show some misclassifications and overlaps, particularly in challenging environments with complex backgrounds or smaller objects. FoveaBox (d) and Mask R-CNN (e) perform adequately, but they exhibit difficulties in accurate bounding box localization and misdetections in cluttered scenes, indicating a need for further improvements in complex scenarios. DINO (f) also provides good results but struggles with variations in object size and environmental complexity. In contrast, SSD (g), while fast, faces challenges in detecting small objects and maintaining accuracy under varying lighting and background conditions. TOOD (h) exhibits average detection performance, with considerable room for improvement, particularly in handling overlapping objects and background interference.
Detection results: (a) RGE-YOLO; (b) Dynamic R-CNN; (c) Faster R-CNN; (d) FoveaBox; (e) Mask R-CNN; (f) DINO; (g) SSD; (h) TOOD.
The ablation experiment for the RGE-YOLO model, shown in Table 4, demonstrates the contributions of individual components to its overall performance. RGE-YOLO, which integrates RepViT, GSConv, and EMA into YOLOv8s, exhibits improved detection accuracy as more components are incorporated. The mAP50 value increases from 0.913 with YOLOv8s alone to 0.922 with the final configuration (Layer21 with C2f Channel), indicating that the added components enhance the model’s performance. The computational cost decreases from 28.6 to 23.9 GFLOPs, so accuracy improves while computational efficiency is also optimized. The reduction in parameters from 11.17 million to 9.1 million further highlights the efficiency gains, reflecting a smaller, more effective model without sacrificing performance. This ablation study underscores the advantages of combining RepViT, GSConv, and EMA with the base YOLOv8s, resulting in a model that offers high detection accuracy, optimized computational performance, and reduced model complexity, making RGE-YOLO a robust and efficient solution for object detection tasks.
RGE-YOLO offers several advantages that make it an effective model for real-world detection tasks, particularly for detecting packaging bags on road surfaces. It demonstrates high accuracy and precision, with minimal false positives and accurate bounding box localization, especially in complex environments where other models may struggle. RGE-YOLO shows strong robustness in handling varying object sizes, lighting conditions, and background interference. Its efficiency allows for real-time processing, making it suitable for applications such as autonomous driving and road hazard detection. Compared to other algorithms, RGE-YOLO exhibits superior stability and generalization, consistently outperforming other models in practical detection scenarios. These qualities make RGE-YOLO a reliable solution for industrial object detection tasks, including packaging bag detection on road surfaces.
Despite promising results, the current approach may struggle under adverse weather conditions such as rain and fog, which can degrade detection. It also faces challenges with occluded or cluttered objects. Future research should focus on improving robustness under such conditions using weather-specific datasets and multi-modal sensor fusion. Better occlusion handling and detection of small, partially visible objects will further increase performance in practical applications.
Discussion
As demonstrated in Table 3 and Figure 12, RGE-YOLO achieves a better balance between detection accuracy and computational efficiency than the compared methods. It attains a 92.2% mAP, exceeding Faster R-CNN’s 90.1% and SSD’s 84.3% by 2.1 and 7.9 percentage points, respectively. This improvement is attributed to the integration of EMA and RepViTBlock, which enhance multi-scale feature fusion and global contextual modeling, addressing the limitations of conventional methods in detecting small, low-contrast objects such as flattened packaging bags. Furthermore, RGE-YOLO achieves a detection speed of 250 FPS, which is 8.9 times that of Faster R-CNN (28 FPS) and 1.7 times that of SSD (143 FPS), making it well suited for real-time deployment in autonomous driving systems. Compared with the transformer-based DINO, which achieves 29.2 FPS and 50.6% mAP at 119 GFLOPs, RGE-YOLO requires only 23.9 GFLOPs, a 79.9% reduction in computational complexity, while also improving accuracy. It thereby eases the trade-off between speed and precision that affects many state-of-the-art detectors.
The ablation study presented in Table 4 further validates the contribution of each RGE-YOLO component. Replacing standard convolutions with GSConv reduces parameters by 18.5%, from 11.17 million to 9.1 million, while maintaining competitive accuracy (92.2% mAP compared to YOLOv8s’ 91.3%), thereby addressing the redundancy issues inherent in SSD and Faster R-CNN. The EMA module improves precision on small objects, such as partially occluded bags, by 6.8% compared to models without attention mechanisms, as shown in Figure 14a versus 14g. Its cross-scale feature interaction mitigates the spatial information loss observed in FoveaBox and TOOD. The RepViTBlock enhances robustness to background clutter, as illustrated in Figure 14a, achieving a recall of 89.7%, 4.3 percentage points higher than Dynamic R-CNN’s 85.4%, which struggles with overlapping objects. These innovations collectively address key limitations of prior works, such as SSD’s poor small-object detection, Faster R-CNN’s high latency, and transformer-based models’ computational inefficiency, positioning RGE-YOLO as a versatile solution for road safety applications.
This study also highlights several issues that need further investigation. The model’s performance in extreme weather has not been tested because the current dataset contains no such conditions. Because occluded or severely deformed bags can occasionally result in false negatives, spatial inference in complex settings needs to be improved. Future research will concentrate on the following areas: creating adaptive attention mechanisms to deal with occlusions and irregular shapes; investigating model compression techniques, such as neural architecture search, for additional optimization on edge devices; and integrating multi-modal sensor data, such as LiDAR and thermal imaging, to improve robustness in low-visibility conditions. By closing the gap between laboratory performance and practical deployment, these efforts aim to improve the dependability of AI-powered safety systems in self-driving cars.
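The ratios and margins quoted in this paragraph follow directly from the reported figures. The short check below recomputes them; the helper name is illustrative, and the input numbers are those stated in the text.

```python
def relative_reduction(baseline, new):
    """Fractional reduction of `new` relative to `baseline`."""
    return (baseline - new) / baseline

# Detection speed ratios (FPS values as reported in the text).
speedup_vs_faster_rcnn = 250 / 28
speedup_vs_ssd = 250 / 143

# Computational cost reduction relative to DINO (GFLOPs as reported).
gflops_reduction_vs_dino = relative_reduction(119, 23.9)

# Accuracy margins in percentage points (mAP values as reported).
map_margin_vs_faster_rcnn = 92.2 - 90.1
map_margin_vs_ssd = 92.2 - 84.3
```

Rounded to one decimal place, these give 8.9×, 1.7×, 79.9%, 2.1 points, and 7.9 points, respectively.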
Conclusion
This research meets the aims described in the introduction by proposing RGE-YOLO, a lightweight model designed for real-time detection of road packaging bags to enhance driving safety. The integration of RepViTBlock, GSConv, and EMA tackles fundamental difficulties in prior approaches, including computational inefficiency, poor small-object recognition, and limited adaptability to dynamic road situations. The model achieves a mAP of 92.2% while maintaining a detection speed of 250 FPS thanks to systematic architectural refinements: replacing standard convolutions with GSConv to reduce parameters by 18.5%, embedding EMA for cross-scale feature fusion, and leveraging RepViTBlock’s hybrid CNN-Transformer design. These advances are supported by comprehensive experiments on a custom dataset of 6,000 augmented images, which show improvements over baselines such as Faster R-CNN (90.1% mAP) and YOLOv8s (11.17 million parameters, compared to 9.1 million for our model). The reduced computational complexity of 23.9 GFLOPs and small model size of 19 MB further indicate its feasibility for deployment on embedded systems, which is consistent with the goal of improving real-time safety in autonomous driving.
Data availability
The dataset used in this study is available upon request from the corresponding author, D.P. Email: pangdangfeng@tsguas.edu.cn.
References
Fukushima, K. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cybern. 36, 193–202 (1980).
Lecun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324 (1998).
Redmon, J., Divvala, S., Girshick, R. & Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 779–788 (IEEE, 2016).
Ren, S., He, K., Girshick, R. & Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39, 1137–1149 (2017).
Varghese, R. & M., S. YOLOv8: A Novel Object Detection Algorithm with Enhanced Performance and Robustness. in 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS) 1–6 (2024).
Wang, A., Chen, H., Lin, Z., Han, J. & Ding, G. RepViT: Revisiting Mobile CNN From ViT Perspective. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 15909–15920 (2024).
Li, H. et al. Slim-neck by GSConv: a lightweight-design for real-time detector architectures. J. Real-Time Image Proc. 21, 62 (2024).
Ouyang, D., He, S., Zhang, G., Luo, M., Guo, H., Zhan, J. & Huang, Z. Efficient Multi-Scale Attention Module with Cross-Spatial Learning. in 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 1–5 (IEEE, 2023).
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y. & Berg, A. C. SSD: Single Shot MultiBox Detector. in Computer Vision – ECCV 2016 (eds. Leibe, B., Matas, J., Sebe, N. & Welling, M.) 21–37 (Springer International Publishing, 2016).
He, K., Gkioxari, G., Dollar, P. & Girshick, R. Mask R-CNN. in Proceedings of the IEEE International Conference on Computer Vision (ICCV) 2961–2969 (2017).
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P. & Joulin, A. Emerging Properties in Self-Supervised Vision Transformers. in Proceedings of the IEEE/CVF international conference on computer vision 9650–9660 (2021).
Zhang, H., Chang, H., Ma, B., Wang, N. & Chen, X. Dynamic R-CNN: Towards High Quality Object Detection via Dynamic Training. in Computer Vision – ECCV 2020 (eds. Vedaldi, A., Bischof, H., Brox, T. & Frahm, J.-M.) 260–275 (Springer International Publishing, 2020).
Kong, T. et al. FoveaBox: Beyound anchor-based object detection. IEEE Trans. Image Process. 29, 7389–7398 (2020).
Feng, C., Zhong, Y., Gao, Y., Scott, M. R. & Huang, W. TOOD: Task-aligned One-stage Object Detection. in 2021 IEEE/CVF International Conference on Computer Vision (ICCV) 3490–3499 (IEEE Computer Society, 2021).
Botezatu, A.-P., Burlacu, A. & Orhei, C. A review of deep learning advancements in road analysis for autonomous driving. Appl. Sci. 14, 4705 (2024).
Bouazizi, O., Azroumahli, C., Mourabit, A. E. & Oussouaddi, M. Road object detection using SSD-MobileNet algorithm: Case study for real-time ADAS applications. Journal of Robotics and Control 5, 551–560 (2024).
Alam, M. K. et al. Faster RCNN based robust vehicle detection algorithm for identifying and classifying vehicles. J. Real-Time Image Proc. 20, 93 (2023).
Chaudhuri, A. Smart traffic management of vehicles using faster R-CNN based deep learning method. Sci. Rep. 14, 10357 (2024).
Lee, S., Hwang, J., Kim, J. & Han, J. CNN-Based Crosswalk Pedestrian Situation Recognition System Using Mask-R-CNN and CDA. Appl. Sci. 13, 4291 (2023).
Wan, C., Chang, X. & Zhang, Q. Improvement of road instance segmentation algorithm based on the modified mask R-CNN. Electronics 12, 4699 (2023).
Li, L. et al. Road pothole detection based on crowdsourced data and extended mask R-CNN. IEEE Trans. Intell. Transp. Syst. 25, 12504–12516 (2024).
Yang, L. et al. A deep learning method for traffic light status recognition. J. Intell. Connect. Veh. 6, 173–182 (2023).
Shen, J. et al. An anchor-free lightweight deep convolutional network for vehicle detection in aerial images. IEEE Trans. Intell. Transp. Syst. 23, 24330–24342 (2022).
Ou, K. et al. Drone-TOOD: A lightweight task-aligned object detection algorithm for vehicle detection in UAV images. IEEE Access 12, 41999–42016 (2024).
Zhang, S., Bei, Z., Ling, T., Chen, Q. & Zhang, L. Research on high-precision recognition model for multi-scene asphalt pavement distresses based on deep learning. Sci Rep 14, 25416 (2024).
Qiu, C. et al. Machine vision-based autonomous road hazard avoidance system for self-driving vehicles. Sci Rep 14, 12178 (2024).
Xu, W., Li, X., Ji, Y., Li, S. & Cui, C. BD-YOLOv8s: enhancing bridge defect detection with multidimensional attention and precision reconstruction. Sci Rep 14, 18673 (2024).
Wang, Y. et al. A steel defect detection method based on edge feature extraction via the Sobel operator. Sci Rep 14, 27694 (2024).
Shen, J., Liu, N., Sun, H., Li, D. & Zhang, Y. An instrument indication acquisition algorithm based on lightweight deep convolutional neural network and hybrid attention fine-grained features. IEEE Trans. Instrum. Meas. 73, 1–16 (2024).
Redmon, J. & Farhadi, A. YOLO9000: Better, Faster, Stronger. in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 6517–6525 (IEEE, 2017).
Redmon, J. & Farhadi, A. YOLOv3: An Incremental Improvement. Preprint at http://arxiv.org/abs/1804.02767 (2018).
Bochkovskiy, A., Wang, C.-Y. & Liao, H.-Y. M. YOLOv4: Optimal Speed and Accuracy of Object Detection. Preprint at http://arxiv.org/abs/2004.10934 (2020).
Alahdal, N. M., Abukhodair, F., Meftah, L. H. & Cherif, A. Real-time object detection in autonomous vehicles with YOLO. Proc. Comput. Sci. 246, 2792–2801 (2024).
Flores-Calero, M. et al. Traffic sign detection and recognition using YOLO object detection algorithm: A systematic review. Mathematics 12, 297 (2024).
Bao, D. & Gao, R. YED-YOLO: an object detection algorithm for automatic driving. SIViP 18, 7211–7219 (2024).
Chaudhry, R. SD-YOLO-AWDNet: A hybrid approach for smart object detection in challenging weather for self-driving cars. Expert Syst. Appl. 256, 124942 (2024).
Wang, J. et al. Road defect detection based on improved YOLOv8s model. Sci Rep 14, 16758 (2024).
Wang, H., Liu, C., Cai, Y., Chen, L. & Li, Y. YOLOv8-QSD: An improved small object detection algorithm for autonomous vehicles based on YOLOv8. IEEE Trans. Instrum. Meas. 73, 1–16 (2024).
Shen, J. et al. Finger vein recognition algorithm based on lightweight deep convolutional neural network. IEEE Trans. Instrum. Meas. 71, 1–13 (2022).
Rodríguez-Rangel, H. et al. Analysis of statistical and artificial intelligence algorithms for real-time speed estimation based on vehicle detection with YOLO. Appl. Sci. 12, 2907 (2022).
Ali, M. L. & Zhang, Z. The YOLO framework: A comprehensive review of evolution, applications, and benchmarks in object detection. Computers 13, 336 (2024).
Liang, J. A review of the development of YOLO object detection algorithm. Appl. Comput. Eng. 71, 39–46 (2024).
Liu, L., Sun, Y., Li, Y. & Liu, Y. A hybrid human fall detection method based on modified YOLOv8s and AlphaPose. Sci Rep 15, 2636 (2025).
Wang, H., Xu, S., Chen, Y. & Su, C. LFD-YOLO: a lightweight fall detection network with enhanced feature extraction and fusion. Sci Rep 15, 5069 (2025).
Su, Z., Lin, X., Sun, K., Mao, S. & Zhou, X. Urban Trash Detection and Cleanliness Assessment Based on YOLOv7 for Autonomous Sweeper. in 2024 International Conference on Image Processing, Computer Vision and Machine Learning (ICICML) 1744–1749 (IEEE, 2024).
Li, H., He, T., Wang, S., Luo, S. & Xu, C. YOLOv7-Swin: Urban road trash detection based on improved YOLOv7. in 2023 3rd International Conference on Electronic Information Engineering and Computer (EIECT) 414–418 (IEEE, 2023).
Acknowledgements
This work was funded by the Science & Technology Development Fund of Tianjin Education Commission for Higher Education (Grant Number 2024KJ118).
Author information
Contributions
Z.G. and D.P. conceived the experiments, R.D. and Y.L. conducted the literature survey and data collection, D.P. and T.L. conducted the experiments, Z.G. and D.P. analyzed the results. All authors reviewed the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Pang, D., Guan, Z., Luo, T. et al. RGE-YOLO enables lightweight road packaging bag detection for enhanced driving safety. Sci Rep 15, 18306 (2025). https://doi.org/10.1038/s41598-025-03240-z
DOI: https://doi.org/10.1038/s41598-025-03240-z