Introduction

Object detection represents a pivotal challenge in the field of computer vision, with advancements in this area driving progress across multiple domains. The Transformer model, currently leading the way in detection algorithms, has demonstrated remarkable proficiency in identifying objects in a variety of contexts1. Notably, the DETR model has surpassed traditional CNN approaches, particularly in the rapid and accurate detection of river floating objects, establishing itself as a promising solution for future environmental monitoring tasks. Rivers, in their diverse forms, serve as the lifeblood of ecosystems, meandering through both forests and urban landscapes2. They are not only critical components of natural circulation systems but also essential for sustaining human life. However, the increase in irresponsible waste disposal highlights the urgent need for monitoring and documenting river-borne objects3. Effective surveillance can prevent river pollution and blockages, and for significant waterways, it can play a crucial role in restoring ecological balance. The development of accurate detection algorithms for river waste is therefore of growing importance. However, research on traditional river floating object detection algorithms often encounters the following challenges:

  • The impact of the diversity of image acquisition devices on the detection of floating objects in rivers: The absence of high-definition cameras specifically designed for this purpose results in variable image quality, which can degrade the clarity of the visual data4. Variations in color configurations among the image acquisition devices used by diverse river monitoring channels can significantly impact the efficiency of detecting floating objects within rivers. To enhance detection consistency, it is essential to standardize images collected from various devices, thereby minimizing the effects of color configuration discrepancies on the detection algorithm.

  • Difficulties in identifying image features of floating objects in rivers: Images of floating objects in rivers frequently lack distinct and discernible features. This challenge is compounded by the complexity of the network model5, making effective identification of these floating objects difficult. To enhance recognition accuracy, researchers must develop more sophisticated algorithms that are capable of better extracting the characteristics of floating objects from intricate backgrounds.

  • The impact of the size of river floating objects and weather conditions on detection: The sizes of river floating objects captured by different equipment at varying magnifications can be inconsistent, and adverse weather conditions such as rain, snow6, fog, and darkness further complicate the detection of floating object types. Consequently, it is imperative to develop detection algorithms that can adapt to diverse environmental conditions to enhance detection performance under varying weather scenarios.

  • Precision challenge in locating river floating objects: Certain algorithms face challenges in precisely pinpointing the locations of floating objects within rivers. Incorrect identification and localization directly impact the classifier’s accuracy, leading to performance degradation. To address this issue, it is crucial to further optimize the algorithm to improve its positioning accuracy and robustness in complex environments.

The existing algorithms for detecting floating objects in rivers often discard low-resolution images directly and resort to enlarging the original images for small targets, which significantly affects the efficiency of real-time detection. These algorithms also fall short in comprehensively extracting features of floating objects and accurately localizing targets, lacking a backbone network specifically designed for feature extraction of river floating objects as well as a feature fusion module. To address the key issues in image target detection of river floating objects, especially the challenges of real-time detection under complex backgrounds and varying scales, the LR-DETR model is proposed. Its main contributions are as follows:

  • Development of an Innovative Feature Fusion Network: We designed and implemented a novel network architecture called the High-level Screening-feature Path Aggregation Network (HS-PAN). This network significantly enhances feature utilization efficiency by optimizing the multi-scale feature fusion pyramid module7 in MFDS-DETR. The introduction of HS-PAN provides a more effective approach to feature extraction and fusion, thereby establishing a solid foundation for subsequent garbage detection tasks.

  • Optimization of the image preprocessing process: We conducted a series of preprocessing steps on each image in the test set, including the automatic reorientation of pixel data and the uniform resizing of images to 640 × 640 pixels (a minimal sketch is given after this list). This standardization not only balances the sizes of floating objects across images but also improves detection performance during training in several respects.

  • Improved convolution structure enhances model performance: We employed partial convolution8 technology to enhance the BasicBlock in ResNet-189. By applying standard convolution to a portion of the input channels to extract spatial features while leaving other channels unchanged, we constructed a T-shaped convolution structure that further enhances the overall performance of the model.

  • Innovative improvements of the feature fusion module in CNN: We modified RepBlock in the CNN-based Cross-scale Feature Fusion (CCFF) and introduced the Conv3XCBlock. Through the application of a parameter-free attention mechanism and progressive convolution, this overall structure achieves optimal feature integration and enhances feature utilization efficiency. Detection results on public datasets indicate that our method demonstrates significant effectiveness and accuracy in specific scenarios.
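As referenced in the preprocessing contribution above, the following is a minimal, hedged sketch of the auto-reorientation and resizing step; the Pillow-based pipeline and the bilinear resampling filter are assumptions for illustration, not the exact tooling used in our experiments.

```python
from PIL import Image, ImageOps

def preprocess(path: str) -> Image.Image:
    """Auto-reorient an image from its EXIF tag and resize it to 640 x 640."""
    img = Image.open(path).convert("RGB")
    img = ImageOps.exif_transpose(img)   # undo camera-rotation metadata
    return img.resize((640, 640), Image.BILINEAR)
```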

Related work

Detection of floating objects in rivers

Convolutional Neural Networks (CNNs) have established themselves as the cornerstone of object detection technology, celebrated for their exceptional efficiency and accuracy. These complex and sophisticated networks possess the distinctive ability to detect and classify objects within images or videos in a single stage, highlighting their extensive potential across various applications10. In the critical domain of river management, CNNs stand out for their ability to accurately identify and monitor floating objects, thereby equipping authorities with improved tools for the planning and stewardship of river ecosystems. The pursuit of enhanced automation in monitoring floating objects in rivers has become a central focus for researchers, who are committed to refining detection algorithms to address their limitations, particularly in recognizing small targets. Chen et al.11 have made significant advancements with an enhanced YOLOv5 model that optimizes feature fusion by seamlessly integrating shallow and deep features. This innovation addresses the shortcomings of existing algorithms and lays the groundwork for more effective detection. In response to environmental factors that can hinder detection performance, Li et al.12 introduced the YOLO-Float model, which has demonstrated impressive results on the FloW-img dataset, showcasing its robustness against the challenges posed by real-world conditions. Building upon this, Zhang et al.13 developed the YOLOv5-FF detector, which incorporates an adaptive feature extraction module that further refines the detection process. Nakayama et al.14 adopted a different approach by advancing the Large Eddy Simulation (LES) method to explore the transport and sedimentation dynamics of floating objects in actual rivers, providing valuable insights into the physical aspects of river floating objects management. Complementing these efforts, Li et al.15 proposed an enhanced version of the YOLOv5s model, specifically designed for the detection of river pollutants, demonstrating the ongoing commitment to innovation in this field.

One-stage target detection

The pursuit of object detection models with reduced bias has spurred innovative developments in the field. Li et al.16 developed the RS-UNet, a model that incorporates a Reflection Suppression Block (RSB) utilizing Laplace convolution alongside a Lightweight Encoder Decoder (LED). This novel approach has achieved an impressive average accuracy of 89% at an Intersection over Union (IoU) threshold of 0.5, demonstrating its effectiveness in mitigating detection bias. Further advancements by Li et al.17 introduced TC-YOLOv5, a model aimed at enhancing detection accuracy through the integration of a convolutional block attention module and a vision transformer. This integration effectively addresses the critical challenge of balancing precision and speed in object detection tasks. Zhang et al.18 made strides towards precise object localization in video data with the introduction of EYOLOv3. Their experiments revealed that incorporating residual modules into the model’s backbone significantly enhances recognition accuracy, particularly for floating objects in dynamic video streams. Qiao et al.19 expanded the capabilities of the YOLOv5 model by improving its multi-scale feature fusion: they replaced the conventional feature pyramid network with a more efficient bi-directional alternative and integrated a coordinate attention module into the backbone network, collectively elevating detection accuracy. Chen et al.20 enhanced the SSD model for detecting and tracking floating objects on water surfaces by incorporating adaptive filtering and improving its detection capabilities. Zhang et al.21 enhanced the RefineDet model by refining its anchor refinement module, transfer connection block, and object detection module. Li et al.22 enhanced the YOLOv7 model for river floating garbage detection by refining its small-target detection capabilities through improved multi-scale feature fusion, incorporating a more efficient bi-directional feature pyramid network alternative and integrating a coordinate attention module into the backbone. He et al.23 enhanced the YOLOX framework for detecting floating objects in ground images of complex water environments by refining its feature integration approach. Zhang et al.24 enhanced YOLOv7 for small-target water-floating garbage detection by introducing a multi-scale feature adaptive weighted fusion mechanism. These advancements are crucial for improving the accuracy and efficiency of river floating object detection systems, thereby facilitating more effective river management and ecological protection efforts.

Detection transformer

The Transformer architecture, initially celebrated for its prowess in natural language processing25, owes much of its success to its sophisticated attention mechanism26. This success has inspired scholars to extend its application to the realm of image detection, where it offers distinct advantages over traditional methods. Unlike Recurrent Neural Networks (RNNs)27, the Transformer’s attention model excels in global computation and possesses robust storage capabilities. DETR exemplifies this paradigm shift by utilizing attention mechanisms and the Transformer architecture to directly detect objects and ascertain their positions within input images28. This innovative approach bypasses the need for conventional anchor boxes or Non-Maximum Suppression (NMS) techniques, thereby streamlining the detection process. Building upon the DETR foundation, Deformable DETR integrates the multi-scale deformable attention theory to address challenges associated with self-attention and cross-attention mechanisms29. This enhancement significantly improves the model’s performance in detecting small and occluded objects, a task that often eludes traditional detection models. Conditional DETR further refines the original DETR model by incorporating a conditional embedding mechanism30. This mechanism assimilates conditional information, such as category and target location, into the model’s representation. Consequently, the model can predict the target bounding box and detection category, facilitating adaptive object detection that responds effectively to varying conditions and constraints. Anchor DETR introduces an innovative anchor mechanism that aligns anchors with the DETR framework31. This integration facilitates the automatic generation of anchor sizes and ratios that are tailored to the characteristics of the dataset. To enhance detection performance, a multi-stage training approach is utilized, which improves the model’s adaptability by establishing distinct training objectives and strategies at various stages of the training process. DAB-DETR introduces a dynamic anchor box generation module based on the DETR framework32. This module generates a set of dynamic anchor boxes that are informed by the feature map of the input image. The size and shape of these anchor boxes are determined dynamically according to the content of the image, which facilitates improved capture of target diversity. Dynamic Anchor Boxes Training is employed during the training phase to optimize detection performance by fine-tuning the parameters of the anchor boxes. DN-DETR accelerates the training of DETR by implementing a Query DeNoising mechanism, which introduces random noise to the queries during training to enhance detection capabilities33. DINO builds upon the denoising training method of DN-DETR and innovatively designs the initialization of the decoder’s object query34. By statically embedding the decoder and enhancing the top-k position query, it preserves the learnability of the content query, thereby increasing the accuracy of the algorithm’s detections and improving its ability to manage issues related to object deformation and occlusion. The latest RT-DETR advances beyond traditional YOLO-based approaches in real-time target detection tasks, with the goal of developing a real-time, end-to-end target detector35.

Methods

Overall architecture

The LR-DETR architecture, illustrated in Fig. 1, comprises three primary components: the backbone network, the encoder, and the decoder. The backbone network is enhanced with partial convolution (PConv) to refine the BasicBlock of ResNet-18, resulting in a novel network, RPCN, which is crucial for improving the feature fusion process in subsequent stages. The core of the encoder consists of the Attention-based Intra-scale Feature Interaction (AIFI) module and the HS-PAN. The AIFI combines a multi-head self-attention mechanism with an FFN to better ingest image features and positional encoding information. The HS-PAN is composed of a feature selection module and a feature fusion module: the feature selection module, built around a CA module, intelligently filters low-level features according to high-level semantic information, while the feature fusion module introduces Conv3XCBlock fused with Selective Feature Fusion (SFF), which combines neighboring-scale features into a unified new representation. Stitching the two modules together forms the HS-PAN, which significantly improves the model’s ability to capture and elucidate multifaceted features. On the decoder side, bipartite graph matching is conducted between the model’s output and the Ground Truth (GT) values. This process is enhanced by self-attention and deformable cross-attention strategies that utilize the comprehensive features extracted by the encoder, thereby facilitating precise identification of the target object’s location and category. Through these components and processes, the LR-DETR architecture offers a clear and logical approach to the complex task of detecting river floating objects.

Fig. 1

The overall structure of our model.

Backbone based on improved FasterNet

In the context of network architecture, the backbone is essential for extracting data characteristics. Empirical evidence suggests that models with a higher parameter count tend to demonstrate superior performance, albeit at the expense of a rapidly increasing overall architecture size. To achieve a balance between efficiency and accuracy, it is crucial to reduce the size of the network architecture without compromising accuracy. Therefore, the design of a compact and efficient backbone for lightweight detectors is of utmost importance. In this study, we investigate the recently released FasterNet and its corresponding enhancements, designing the RPCN and employing it as the backbone of the detector.

FasterNet introduces a novel convolution operator known as partial convolution, which avoids shortcut-like operations. Consequently, it eliminates additional processes such as splicing or element-wise summation that could hinder the model’s inference speed. Partial convolution selectively processes the features of specific channels through conventional convolution while leaving the features of the remaining channels unchanged. This approach reduces computational complexity, facilitating rapid and efficient neural network processing. ResNet, a widely utilized deep convolutional neural network architecture in computer vision, addresses the common challenges of gradient vanishing or explosion that arise during the training of deep neural networks by incorporating residual blocks. This design enables the construction of deeper networks. The core concept of the ResNet architecture is the integration of residual connections, which allows the network to learn directly from the residuals between inputs and outputs, rather than solely from the original inputs. This residual learning mechanism enhances the network’s ability to capture long-term dependencies and improves the model’s generalization capability. Various variants of ResNet employ different basic modules, with ResNet-18, the simplest structure, being based on basic blocks. To enhance the expressiveness of FasterNet, improve its computational efficiency, and reduce the overall load on the network architecture, we have developed the backbone structure of the Residual Partial Convolutional Network (RPCN). The enhanced network begins with three sets of 3 × 3 convolutions. By adjusting the stride, the output shape aligns with the original 7 × 7 convolution of ResNet-18, thereby facilitating improved feature extraction. Additionally, in the pooling module of the ResNet block, the 1 × 1 convolution with a stride of 2 is replaced by global pooling with a stride of 2. This adjustment is made to mitigate information loss associated with traditional downsampling methods, allowing for greater information retention. Subsequently, the basic block is substituted with the PConv convolution module, resulting in final outputs designated as S3, S4, and S5, with respective shape values of 128 × 80 × 80, 256 × 40 × 40, and 512 × 20 × 20. This modification effectively reduces the overall parameter count while ensuring enhanced feature richness. Further structural details of the improved RPCN can be found in Fig. 2.

Fig. 2

The specific structure of our RPCN Block.
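A minimal PyTorch sketch of the two key ingredients, partial convolution and the residual PConv block, is given below. The channel ratio, pointwise mixing layer, normalization, and stride placement are assumptions made for illustration rather than the exact configuration of RPCN:

```python
import torch
import torch.nn as nn

class PConv(nn.Module):
    """Partial convolution: apply a regular 3x3 conv to the first c_p
    channels and pass the remaining channels through unchanged."""
    def __init__(self, channels: int, ratio: float = 0.25):
        super().__init__()
        self.cp = int(channels * ratio)
        self.conv = nn.Conv2d(self.cp, self.cp, kernel_size=3, padding=1)

    def forward(self, x):
        x1, x2 = x[:, :self.cp], x[:, self.cp:]
        return torch.cat([self.conv(x1), x2], dim=1)

def make_stem(in_ch: int = 3, out_ch: int = 64) -> nn.Sequential:
    """Three 3x3 convolutions replacing ResNet-18's 7x7 stem; the stride
    placement here is an assumption that reproduces the stride-2 output."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch // 2, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch // 2, out_ch // 2, 3, stride=1, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch // 2, out_ch, 3, stride=1, padding=1), nn.ReLU(inplace=True),
    )

class RPCNBlock(nn.Module):
    """Sketch of the modified BasicBlock: PConv extracts spatial features,
    a pointwise conv mixes channels, and a residual connection is kept."""
    def __init__(self, channels: int):
        super().__init__()
        self.pconv = PConv(channels)
        self.mix = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return x + self.mix(self.pconv(x))
```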

For the sake of consecutive memory accesses, we use only the first or last consecutive \(c_p\) channels as a representative of the entire feature map for computation, so the FLOPs of a PConv are only of the size shown in Eq. (1).

$$h\times w\times k^{2}\times c_{p}^{2}$$
(1)

With the commonly used partial ratio \(c_p = c/4\), the FLOPs of a PConv are only 1/16 of those of a regular Conv. In addition, a PConv has fewer memory accesses, as shown in Eq. (2).

$$h\times w\times 2c_{p}+k^{2}\times c_{p}^{2}\approx h\times w\times 2c_{p}$$
(2)
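To make these ratios explicit, dividing Eq. (1) by the FLOPs of a regular convolution, \(h \times w \times k^2 \times c^2\), and Eq. (2) by its approximate memory access count, \(h \times w \times 2c\), gives:

$$\frac{h\times w\times k^{2}\times c_{p}^{2}}{h\times w\times k^{2}\times c^{2}}=\left(\frac{c_{p}}{c}\right)^{2}=\frac{1}{16},\qquad \frac{h\times w\times 2c_{p}}{h\times w\times 2c}=\frac{c_{p}}{c}=\frac{1}{4}\quad \text{for }c_{p}=\frac{c}{4}$$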

Compared with the original FasterNet, the improved model retains the key PConv component and incorporates the residual network concept. This results in a smaller parameter size, increased computational efficiency, and richer data features. The output of each layer effectively captures diverse background and edge features, making the improved model a more suitable backbone for feature extraction.

Improved encoder

The AIFI module employs a multi-head self-attention and Feed-Forward Network (FFN) mechanism that specializes in receiving and processing the image features originating from S5 together with their positional encoding. The high-level features of S5 are chosen because they are more conducive to the self-attention mechanism capturing the association between semantic and entity concepts. Meanwhile, the lower-level features S3 and S4 are input directly into the HS-PAN, without special processing, alongside the already processed S5 features.
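A simplified sketch of this intra-scale interaction is shown below; it folds the positional encoding into the layer input for brevity (DETR-style implementations typically add it to the queries and keys only), so the details are illustrative assumptions:

```python
import torch
import torch.nn as nn

class AIFI(nn.Module):
    """Flatten S5 (projected to 256 channels) into a sequence, add the
    positional encoding, and apply one self-attention + FFN encoder layer;
    reshape back to a 2-D feature map afterwards."""
    def __init__(self, dim: int = 256, heads: int = 8, ffn_dim: int = 1024):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=ffn_dim, batch_first=True)

    def forward(self, s5: torch.Tensor, pos: torch.Tensor) -> torch.Tensor:
        b, c, h, w = s5.shape
        seq = s5.flatten(2).permute(0, 2, 1)      # (B, H*W, C)
        seq = self.layer(seq + pos)               # global intra-scale attention
        return seq.permute(0, 2, 1).reshape(b, c, h, w)
```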

The CCFF module in RT-DETR is similar to a feature pyramid structure, but the multi-scale nature of the floats in the river float dataset makes it difficult to accurately identify float types. The complexity of this problem stems from the differences in diameter between different types of floats: even when photographed with the same aerial camera equipment, images of the same type of float can show different sizes due to differences in the environment and shooting angle. To address these multi-scale challenges, we propose the HS-PAN as an alternative to the CCFF module. HS-PAN integrates a feature selection module and a feature fusion module. At the beginning of the structure, feature maps at different scales are selected in order to filter out relevant data. Following this, the SFF mechanism combines high-level and low-level information from these feature maps to generate features with enriched semantic content. This process helps in identifying nuanced features in images of river floats and strengthens the model’s detection abilities. In contrast, HS-FPN adopts a feature fusion method that begins at the top and works downwards, utilizing a fused feature map with enriched semantic details for making predictions. However, we have observed that this top-down FPN network is constrained by unidirectional information flow, resulting in limited improvement in detection accuracy. While the top layers of the feature map possess stronger semantic information, ideal for object classification, the lower layers contain the more precise location information crucial for object localization. Although the FPN structure enhances semantic information in the prediction feature map, there is a theoretical loss of location information. To address this, a new bottom-up path is introduced to transfer location information to the prediction feature map, ensuring that the prediction feature map combines both rich semantic and precise location information for improved detection of floating objects in rivers. Refer to Fig. 3 for a detailed illustration of this structure.

Fig. 3

The framework of high-level screening-feature path aggregation network comprises two parts: the feature selection module and the feature fusion module.

The HS-PAN is composed of two primary elements: the feature selection module and the feature fusion module. The feature selection module comprises the CA module and a dimension-matching module. The CA module handles the input feature map \(f_{in} \in \mathbb{R}^{C \times H \times W}\), where C denotes the number of channels, H the height, and W the width of the feature map. This feature map is processed through two pooling layers and the results are combined; the weight of each channel, \(f_{CA} \in \mathbb{R}^{C \times 1 \times 1}\), is then determined using the Sigmoid activation function. Finally, the dynamic dimension-matching module uses a 1 × 1 convolution to reduce the number of channels of each scale’s feature map to 256. The feature fusion module, in turn, consists of the SFF module and Conv3XCBlock. The SFF module filters essential semantic information from low-scale features using high-level features as weights, as shown in Fig. 4.

Fig. 4

The framework of the SFF module.

The high-level features \(f_{high} \in \mathbb{R}^{C \times H \times W}\) are first expanded using a transposed convolution (T-Conv) with a stride of 2 and a kernel size of 3 × 3, yielding \(f_{\widehat{high}} \in \mathbb{R}^{C \times 2H \times 2W}\). Bilinear interpolation is then applied to up-sample or down-sample this result to match the dimensions of the low-scale features \(f_{low} \in \mathbb{R}^{C \times H_1 \times W_1}\), giving \(f_{att} \in \mathbb{R}^{C \times H_1 \times W_1}\). The CA module subsequently transforms \(f_{att}\) into attention weights, which are used to filter the low-scale features. Finally, the filtered low-scale features are summed with \(f_{att}\) to enhance the feature representation of the model, producing the final features \(f_{out} \in \mathbb{R}^{C \times H_1 \times W_1}\). Equations (3) and (4) illustrate the process of feature selection fusion.

$$f_{att}=BL\left(\text{T-Conv}\left(f_{high}\right)\right)$$
(3)
$$f_{out}=f_{low}\ast CA\left(f_{att}\right)+f_{att}$$
(4)
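Under the stated definitions, Eqs. (3) and (4) can be sketched in PyTorch as follows; the shared projection inside the CA module and the exact transposed-convolution padding are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """CA sketch: average- and max-pooled descriptors are combined and
    squashed by a sigmoid into per-channel weights in R^{C x 1 x 1}."""
    def __init__(self, channels: int):
        super().__init__()
        self.fc = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        avg = F.adaptive_avg_pool2d(x, 1)
        mx = F.adaptive_max_pool2d(x, 1)
        return torch.sigmoid(self.fc(avg) + self.fc(mx))

class SFF(nn.Module):
    """Selective Feature Fusion following Eqs. (3)-(4): expand f_high with a
    stride-2 3x3 transposed convolution, bilinearly resize to the low-level
    resolution, then use CA weights from f_att to filter f_low."""
    def __init__(self, channels: int = 256):
        super().__init__()
        self.tconv = nn.ConvTranspose2d(channels, channels, 3, stride=2,
                                        padding=1, output_padding=1)
        self.ca = ChannelAttention(channels)

    def forward(self, f_high, f_low):
        f_att = self.tconv(f_high)                                   # T-Conv
        f_att = F.interpolate(f_att, size=f_low.shape[-2:],
                              mode="bilinear", align_corners=False)  # BL(.)
        return f_low * self.ca(f_att) + f_att                        # Eq. (4)
```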

HS-PAN is further enhanced through the integration of a cross-scale fusion module, in which multiple convolutional layers are incorporated into the fusion path. The fusion block plays a crucial role in merging features from two adjacent scales to generate a new feature. While the original CCFF used N × RepBlocks consisting of RepConv for feature fusion, this was found to be unsuitable for HS-PAN. To address this, Conv3XCBlock was employed to optimize the entire fusion path. The Swift Parameter-free Attention Network (SPAN) constructs its network from a parameter-free self-attention mechanism and SPAB modules36. This approach not only resolves the heavy computational burden caused by model complexity and redundancy, but also effectively enhances super-resolution tasks through its attention mechanism. SPAN consists of six consecutive SPAB modules, each of which extracts more advanced features via Conv3XCBlock.

Fig. 5

Components of the Conv3XCBlock.

As shown in Fig. 5, Conv3XCBlock first extracts higher-level features through three convolutional layers and then sums the extracted features with the residual links of the HS-PAN inputs; the first two convolutional layers are each followed by activation functions to ensure the efficiency of the whole network. The features extracted by the convolutional layers are passed through an activation function \(\sigma_a\), which is symmetric about the origin, to obtain the attention map \(V_i\). The feature map and the attention map are multiplied element-wise to give the final output. Equations (5)-(7) describe the fusion process.

$$O_{i}=F_{W_{i}}^{\left(i\right)}\left(O_{i-1}\right)=U_{i}\odot V_{i}$$
(5)
$$U_{i}=O_{i-1}\oplus H_{i},\qquad V_{i}=\sigma_{a}\left(H_{i}\right)$$
(6)
$$H_{i}=F_{c,W_{i}}^{\left(i\right)}\left(O_{i-1}\right)=W_{i}^{\left(3\right)}\otimes \sigma \left(W_{i}^{\left(2\right)}\otimes \sigma \left(W_{i}^{\left(1\right)}\otimes O_{i-1}\right)\right)$$
(7)

where \(W_i^{(j)} \in \mathbb{R}^{C' \times H' \times W'}\) represents the j-th convolutional kernel of the layer, and \(\sigma\) denotes the activation function after the convolutional layer, for which the ReLU37 function is generally used. \(\oplus\) and \(\otimes\) denote element-wise summation with the residual connection and the convolution operation, respectively. \(F_{W_i}^{(i)}\) and \(F_{c,W_i}^{(i)}\) represent the functions of the i-th Conv3XCBlock and its corresponding convolutional layers, respectively. By using Conv3XCBlock in place of RepBlock, the features produced by HS-PAN and RPCN are better integrated, and the subsequent experiments demonstrate the resulting accuracy in detecting river floats and the stability of the overall network structure.
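The following PyTorch sketch instantiates Eqs. (5)-(7) under stated assumptions: ReLU is used for \(\sigma\), a shifted sigmoid (odd-symmetric about the origin) stands in for \(\sigma_a\), and the class name SPABlock and its kernel sizes are illustrative rather than the exact Conv3XCBlock configuration:

```python
import torch
import torch.nn as nn

class SPABlock(nn.Module):
    """Eqs. (5)-(7): three stacked convolutions produce H_i, the residual
    sum U_i = O_{i-1} + H_i is modulated element-wise by a parameter-free
    attention map V_i = sigma_a(H_i)."""
    def __init__(self, channels: int):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),  # W1, sigma
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),  # W2, sigma
            nn.Conv2d(channels, channels, 3, padding=1),                         # W3
        )

    def forward(self, o_prev):
        h = self.convs(o_prev)          # Eq. (7): H_i
        u = o_prev + h                  # Eq. (6): U_i
        v = torch.sigmoid(h) - 0.5      # Eq. (6): V_i, odd about the origin
        return u * v                    # Eq. (5): O_i = U_i (element-wise) V_i
```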

Query selection and decoder

RT-DETR employs an IoU-aware query selection algorithm that improves the initialization of object queries. It uses an uncertainty-minimal query selection scheme that explicitly models and minimizes the uncertainty of the encoder features, selecting a certain number of image features as initial targets and thus providing the decoder with high-quality queries. Utilizing this foundation for target retrieval enables the model to concentrate on the entities in the scene that are closely associated with the target, thereby enhancing the precision of target identification. RT-DETR stands out as a highly effective method for object detection that integrates the strengths of the Transformer framework, particularly the decoder component, offering a novel approach to object recognition and achieving end-to-end solutions. The decoder module plays a pivotal role in the entire algorithm by employing self-attention and cross-attention mechanisms to process the output features of the encoder. This enables the transformation of the encoder’s output for image classification and the optimization of the target queries using information from the hybrid encoder, yielding a prediction box and an IoU score for the final detection of river floats.
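A hedged sketch of score-based top-k query selection is given below; the IoU-aware and uncertainty-minimal variants additionally shape these scores during training, which is omitted here:

```python
import torch

def select_initial_queries(enc_feats: torch.Tensor,
                           cls_logits: torch.Tensor,
                           num_queries: int = 300) -> torch.Tensor:
    """Keep the encoder tokens whose best class score is highest and use
    them to seed the decoder.
    enc_feats: (B, N, C) flattened encoder output; cls_logits: (B, N, K)."""
    scores = cls_logits.sigmoid().max(dim=-1).values        # (B, N)
    topk = scores.topk(num_queries, dim=1).indices          # (B, num_queries)
    idx = topk.unsqueeze(-1).expand(-1, -1, enc_feats.size(-1))
    return enc_feats.gather(1, idx)                         # initial object queries
```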

Results

Dataset and implementation details

The river float dataset from Roboflow encompasses various categories, including balls, bottles, branches, weeds, leaves, milk cartons, plastic bags, and other types of plastic waste. This dataset38 features images of river floats captured under diverse weather conditions and at different times, providing a comprehensive array of examples. It consists of 1,591 training images, 455 validation images, and 228 test images, adhering to a ratio of 7:2:1. The images have been resized to 640 × 640 pixels to optimize computing resources while effectively capturing details of smaller river floats, thereby satisfying the training requirements.

Experimental environment and parameter configuration

During the training phase, the dimensions of the input images were fixed at 640 × 640 pixels, with a batch size of 4 and 2 workers assigned. Additionally, the number of epochs was established at 150. Our detector underwent training utilizing the AdamW optimizer, configured with a base learning rate of 0.0001, a weight decay of 0.0001, a global gradient clip norm of 0.1, a linear warmup period of 2000 steps, and a minimum learning rate set at 0.00001. Table 1 provides an overview of the experimental setup.
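The configuration above can be sketched as follows; how the minimum learning rate interacts with the warmup schedule is an assumption made for illustration:

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Conv2d(3, 8, 3)   # stand-in for the LR-DETR network

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)

# Linear warmup over 2000 steps; the factor is floored at 0.1 so the
# learning rate never drops below 1e-5 (0.1 x the base rate of 1e-4).
scheduler = LambdaLR(optimizer,
                     lr_lambda=lambda step: max(min((step + 1) / 2000, 1.0), 0.1))

def training_step(loss: torch.Tensor) -> None:
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.1)  # global clip
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```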

Table 1 Experimental system configuration environment.

Comparison of other methods

To evaluate the efficiency of the river float detection model, a series of comparative experiments were conducted. The evaluation began with an analysis of the parameters and FLOPs of various models to assess their effectiveness. The models’ performance was then measured using multiple metrics, including precision, recall, mAP@0.5, and mAP@0.5:0.95. Recall measures the proportion of all labeled objects that are correctly identified, while precision measures the proportion of correct detections among all predicted positives. The mAP@0.5 represents the mean precision over all categories at an IoU threshold of 0.5, and mAP@0.5:0.95 averages the AP across IoU thresholds ranging from 0.5 to 0.95 in increments of 0.05. Our model’s performance was compared against established models such as RT-DETR, DETR, Deformable DETR, Faster R-CNN, and SSD, which are recognized as benchmarks in the field.
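Formally, with TP, FP, and FN denoting true positives, false positives, and false negatives:

$$P=\frac{TP}{TP+FP},\qquad R=\frac{TP}{TP+FN},\qquad \text{mAP@0.5:0.95}=\frac{1}{10}\sum_{t\in \{0.5,\,0.55,\,\ldots,\,0.95\}}\text{mAP@}t$$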

Table 2 Comparative experiments with prior works on our dataset.

The study compared our algorithm, which uses the RPCN network as its backbone, with existing models that employ ResNet-50 as their backbone. The weight file with the best training performance was preserved within the same experimental environment, and the specific experimental results are shown in Table 2. Our algorithm outperformed SSD, achieving a 42.9% reduction in parameters, an 87.4% reduction in model complexity, a 5.3% increase in mAP@0.5, and a 6.7% increase in mAP@0.5:0.95. In comparison to Faster R-CNN, our model exhibited a 65.8% reduction in parameters, an 80.3% reduction in computational complexity, a 14.8% increase in precision, a 5.4% increase in mAP@0.5, and a 6.3% increase in mAP@0.5:0.95. For the YOLO series, we selected several models with computational complexity similar to that of RT-DETR, namely YOLOv5m, YOLOv6s, YOLOv8m, and YOLOv10m. Our model outperforms these in recall, precision, mAP@0.5, and mAP@0.5:0.95, with improvements of 4.2%, 4.9%, 3.9%, and 5.9% in the mAP@0.5 scores, respectively. Assessed against DETR, our model demonstrated a 65.9% reduction in parameters, a 54.7% decrease in computational complexity, a 21.6% increase in precision, a 7.3% increase in mAP@0.5, and an 8.7% increase in mAP@0.5:0.95. Compared to Deformable DETR, our model achieved a 64.7% reduction in parameter size, a 78.1% reduction in computational complexity, a 2.1% increase in precision, a 3.0% increase in recall, a 13.8% increase in mAP@0.5, and a 14.4% increase in mAP@0.5:0.95. Finally, compared to the RT-DETR model, our model exhibited a 25.8% reduction in parameter size, a 22.8% reduction in computational complexity, a 5.7% increase in recall, a 5.0% increase in mAP@0.5, and a 0.6% increase in mAP@0.5:0.95. Overall, considering both complexity and performance metrics, our proposed method is superior at detecting floating objects in rivers.

Detection results

Table 3 displays the detection performance of the LR-DETR model across the object classes in our dataset. Overall, the model exhibits high precision and recall, reaching 71.9% and 73.7%, respectively. The mAP@0.5 and mAP@0.5:0.95 metrics also reach impressive values of 76.0% and 47.7%, confirming its effectiveness in high-precision detection of floating objects in the river. In particular, for the “bottle”, “milk carton”, and “ball” categories, the model’s mAP@0.5 scores reach 94.7%, 86.5%, and 92.3%, respectively. Despite the dense distribution and mutual occlusion of the “branch” and “leaf” targets in the dataset, the model still achieves 66.7% and 74.1% mAP@0.5, demonstrating its strength in handling dense targets. Its performance on the “plastic-bag” and “plastic-trash” categories, with mAP@0.5 scores of 85.4% and 72.3%, respectively, also highlights its potential for detecting plastic products, even though these targets are less distinctive and vary in shape. Furthermore, the LR-DETR model scored 36.3% mAP@0.5 on objects with high similarity to the background, such as “grass”, showing reasonable performance even on objects that are difficult to recognize. These results indicate that the model not only performs well in detecting various river floating objects but also shows robust performance across multi-scale target types, making it very promising for practical applications.

Table 3 Detection results of LR-DETR model on dataset for different categories.

To show more intuitively the comparison between the detection performance of LR-DETR and the original model, we visualize the mAP@0.5 scores over the training process in Fig. 6. By the 25th epoch, LR-DETR has already overtaken RT-DETR, and as training proceeds to convergence our model remains in the lead. This further evidences that LR-DETR exhibits good learning ability during training and high accuracy in its detection results.

Fig. 6

Visualization of the training process.

Comparison chart of detection effect

We conducted multiple sets of comparative experiments covering various scenarios, as depicted in Fig. 7, comparing the original individual models with our model on the same images. Through a series of comprehensive tests evaluating the performance of different object detection algorithms under diverse environmental conditions, our algorithm consistently outperformed established methods such as SSD, DETR, and their variants. These tests aimed to replicate real-world scenarios that present significant challenges to object detection systems, including variations in lighting, occlusion, and background complexity, and their results offer valuable insights. In Test 1, which focused on detecting floating objects in a river channel under occlusion and in a plant-shaded environment, our algorithm demonstrated a notable difference in confidence levels compared to algorithms such as SSD and DETR. It exhibited high confidence and accuracy in detecting occluded objects, showcasing its effectiveness in challenging occlusion scenarios and suggesting that our model’s features and mechanisms maintain high detection accuracy even under occlusion.

In Test 2, focusing on solar reflection, we observed varying levels of missed detection in each baseline model; such a critical flaw does not occur in our algorithm. The accuracy of river floats detected by each baseline was also notably lower than that of our method, indicating that our algorithm not only ensures comprehensive detection but also maintains a higher standard of accuracy, making it particularly reliable in applications where accuracy is critical. Test 3 combines the scenarios of Tests 1 and 2, with the water surface reflecting light and the floating object occluded, a common challenge in real-world applications. Unlike the other algorithms, our method maintains high detection accuracy and outperforms all comparison models. Our model addresses the challenges of these hybrid scenarios through the multi-scale feature fusion of HS-PAN and the RPCN feature network. These results underscore the effectiveness of our algorithm in detecting common river floats and further validate the feasibility of our approach.

Test 4 involves the presence of three or more floating objects in the same image at the same time, some of which are occluded. Our algorithm demonstrates its reliability by accurately detecting all objects, correctly predicting their categories, and displaying a higher average confidence level compared to other algorithms. Our model effectively identifies objects of various sizes and can also detect different types simultaneously, enabling comprehensive detection of all floating objects in a given scene. Test 5 evaluates the algorithm’s ability to detect floating objects in a river that closely resemble the background or bubbles in the water. Our detection algorithm consistently outperforms others in such scenarios, delivering superior detection accuracy. Despite the challenges posed by blurry and low-resolution photos of river floaters, as seen in Test 6, our algorithm excels in detecting these objects. While other algorithms struggle with accuracy or missed detections in such conditions, our algorithm effectively overcomes these issues, ensuring excellent detection performance and the identification of all objects.

The collective results of these tests demonstrate the excellent performance of our algorithm under various challenging conditions. Its ability to maintain high confidence and accuracy under sunlight reflection, along with its robustness to occlusion and surface anomalies, distinguishes it from existing algorithms. Moreover, its consistent accuracy across different environmental conditions highlights its broad potential in applications such as river pollution control and water quality monitoring. The superiority of our algorithm can be attributed to its architecture and optimization strategies, which encompass advanced feature extraction capabilities and efficient handling of environmental noise. Building upon this foundation, this study will pursue further research to enhance the algorithm’s adaptive capabilities and computing efficiency across a wider range of applications. The research findings aim to achieve optimal detection of floating objects on water surfaces with low complexity, reducing missed and false detections, enhancing detection accuracy, ensuring more reliable and precise results, and contributing to the advancement of river floating object detection technology.

Fig. 7

Comparison of detection results between different algorithms on river float dataset.

Heatmap comparison

In the realm of deep learning, a heatmap typically refers to a visual representation of a model’s output that illustrates the significance or intensity of activation across various sections of the input data. This visualization facilitates a better understanding of the model’s decision-making process for a specific task and highlights areas of interest to the model. In image processing tasks, a heatmap can identify the model’s points of interest within an image, such as object locations in object detection or key features in image classification. Generally, heatmaps are generated by passing input data through the model and subsequently visualizing the output or activation values of intermediate layers. These heatmaps serve to clarify the model’s decision-making process and can be crucial for debugging and improving model performance. Our study conducted a comprehensive comparison of the heatmaps produced by different models, including the SSD model, the DETR series models, and our proposed model. This comparison is illustrated in Fig. 8, which provides a visual representation of the performance of each model.
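A minimal, hedged sketch of one such visualization is given below; it averages the absolute activations of a chosen intermediate layer and upsamples them to image size. The exact visualization method used for Fig. 8 (e.g. a Grad-CAM variant) is not specified here, so this is illustrative only:

```python
import torch
import torch.nn.functional as F

def activation_heatmap(model, layer, image):
    """Capture one layer's activations via a forward hook, reduce over
    channels, and normalize to [0, 1] for overlay on the input image."""
    feats = {}

    def hook_fn(module, inputs, output):
        feats["out"] = output.detach()

    handle = layer.register_forward_hook(hook_fn)
    with torch.no_grad():
        model(image)                                   # image: (1, 3, H, W)
    handle.remove()

    heat = feats["out"].abs().mean(dim=1, keepdim=True)    # (1, 1, h, w)
    heat = F.interpolate(heat, size=image.shape[-2:], mode="bilinear",
                         align_corners=False)
    return (heat - heat.min()) / (heat.max() - heat.min() + 1e-6)
```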

Fig. 8

Comparison results of heatmap visualization.

Figure 8 shows that LR-DETR is still able to focus accurately on the detection target, marking the target area with distinctly darker coloring, even when faced with a variety of complex environmental conditions such as occlusion, a wide range of object types, strong reflections, and complex background interference. In contrast, we observe that most other models have difficulty focusing effectively on the detection target due to background noise and weaker feature extraction capability, resulting in lighter coloring or incomplete coverage of the target region. Comparing the heatmaps shows intuitively that the improvements in the LR-DETR model have indeed achieved significant results.

Ablation experiment

We conducted five sets of ablation experiments under consistent environmental and parameter settings to precisely evaluate the impact of each augmentation component on river float detection. Using the RT-DETR model as a baseline, we compare the experimental outcomes in Table 4. When RPCN was introduced independently, the recall, mAP@0.5, and mAP@0.5:0.95 improved by 5.9%, 2.1%, and 0.9%, respectively, compared to the original RT-DETR. RPCN, serving as the backbone network, enables the model to perform convolution operations on selected channels of the input feature map during feature extraction while keeping the other channels unchanged. This enhancement underscores the significance of a robust backbone network in optimizing the parameter count while still capturing the detailed features essential for precise detection of floating objects in rivers. Furthermore, integrating the HS-PAN module alone led to a 6.4% increase in recall, a 2.7% increase in mAP@0.5, and a 0.1% increase in mAP@0.5:0.95. This not only signifies a refinement of the network’s parameter scale but also demonstrates enhanced feature fusion, facilitating more accurate localization and classification of river floats by selectively merging high-level semantic information with low-level features.

The introduction of Conv3XCBlock alone resulted in a 5.7% increase in recall rate, a 2.6% increase in mAP@0.5, and a 0.5% increase in mAP@0.5:0.95. Conv3XCBlock optimizes the model’s focus, enhances feature fusion, and boosts detection accuracy. This highlights the significant impact of Conv3XCBlock on enhancing the model’s ability to identify and associate features. Additionally, it provides crucial features for detecting floating objects in rivers. When RPCN and HS-PAN were introduced simultaneously, recall increased by 7%, mAP@0.5 by 3.6%, and mAP@0.5:0.95 by 0.3%. The modules complement each other, with one focusing on extracting critical information and the other on deep feature integration. This dual approach enables the network to capture more accurate data patterns, offering a promising solution for handling complex data sets.

Table 4 Results of ablation experiments.
Fig. 9

The visual results of the ablation study.

We observe that our model performs slightly worse only in precision. This is because the original model weights the effect of precision on the results more heavily during training; our model instead strikes a more reasonable balance between precision and recall, achieving higher recall and adapting better to changes in complex scenes, while the other metrics improve substantially over the original model. Specifically, it boasts a 25.8% reduction in parameter count and a 22.8% decrease in computational complexity, as measured by GFLOPs. Moreover, the frames per second (FPS) metric has doubled, and there is a notable rise in recall of 5.7%, along with concomitant increases of 5.0% in mAP@0.5 and 0.6% in mAP@0.5:0.95, outperforming the original model in all these aspects. The integrated approach highlights the synergies between modules, showcasing improvements in training speed, inference accuracy, and overall performance. The method employed in this study, integrating feature enhancement into the modified RT-DETR model, marks a notable advancement in river float detection. Visual comparisons of the individual components were conducted to better understand their capabilities, as depicted in Fig. 9.

This study comprehensively improves the effectiveness of real-time DETR for efficient detection of floating objects in rivers. The implemented modifications and their optimization effects on the model are well documented, clearly showing a significant improvement in the accuracy and confidence of our model in identifying floating objects in rivers.

The original RT-DETR in Fig. 9a demonstrates significant missed detections and low average confidence for the detected objects, indicating room for improvement, particularly on complex river floats. In Fig. 9b, the introduction of the RPCN feature extraction module marks the first step toward enhancement, significantly increasing simultaneous object detections while playing a crucial role in minimizing feature loss and boosting overall model confidence. Some defects remain, however: RPCN’s ability to capture richer gradient information can contribute to misidentifying non-floating objects. Figure 9c shows further enhancement from substituting the HS-PAN module, resulting in no missed or false detections and showcasing HS-PAN’s efficiency in global feature learning while optimizing network parameter size and resource consumption. In Fig. 9d, incorporating Conv3XCBlock significantly enhances model performance, raising average confidence and improving the model’s attention to relevant features for better identification and classification of the various types of floating objects in rivers. The successive addition of these improvements not only gradually increases detection accuracy but also ensures consistently high accuracy in detecting floating objects in the river. The algorithm proposed in this study thus strikes a good balance between high-precision detection and efficiency: it maintains high accuracy while remaining lightweight and efficient. Following this carefully designed architectural adjustment, our model achieves a new level of precision, capturing details specific to floating objects in rivers and exhibiting excellent performance in this particular area. Through these model optimizations, we set a new benchmark for river floating object detection technology, significantly enhancing its accuracy and efficiency and providing a crucial foundation for future research.

Comparison on different datasets

To fully validate the generalization ability of the improved model, and to examine its behavior in edge cases and failure scenarios, we conducted extensive tests on the WaterPollution dataset39.

Table 5 Comparison of different methods on waterpollution dataset.
Fig. 10

Water pollution detection images.

The WaterPollution dataset targets the detection of polluted floating objects in highly complex river traffic; 3,000 aerial images of river water surfaces were collected at different times and in different regions. It mainly contains annotations for five categories: Water pollution, Floating debris, Derelict vessels, Fish farming, and Waste. We divided it into training, testing, and validation sets in a 7:2:1 ratio. The specific experimental results are shown in Table 5: the LR-DETR algorithm has the highest average detection accuracy, with 30.5% mAP@0.5 and 11.3% mAP@0.5:0.95, superior to the other algorithms tested.

In addition, we ran three tests to validate the effect, as shown in Fig. 10. Test 1 covers small river floating objects in the WaterPollution dataset, and Test 2 shows the detection effect in a complex river traffic network. We also simulated extreme weather such as snowstorms, heavy rainfall, and sandstorms by adding noise, visualized in Test 3. LR-DETR outperforms the original model in all three tests and produces no omissions in the detection results, which further proves the effectiveness of our model.

Discussion and conclusion

The morphology of floating objects in rivers is both diverse and complex, with shapes, sizes, and attachment positions varying significantly with distance and viewing angle. These variations not only complicate the detection task but also pose significant challenges to existing algorithms. River floating object detection algorithms must adapt to changing light conditions and weather while maintaining robust target detection capabilities. Additionally, occlusion between river drifting objects and other targets further complicates detection efforts. Factors such as equipment model, manufacturer, and usage can also affect imaging outcomes when capturing river images, and obtaining a representative set of image samples of river floating objects remains challenging due to issues related to image collection and labeling. This paper presents a real-time detection model for river floating object images, referred to as LR-DETR, which has been validated on a public dataset. The model employs the RPCN network structure, achieving a balance between efficiency and accuracy through partial convolution and residual connections. Moreover, the HS-PAN module integrates a multi-scale contextual information acquisition approach, fusing feature maps of different scales to enhance adaptability to variations in the morphology and appearance of river floating objects. The Conv3XCBlock module effectively directs the model’s focus toward the target objects in the river, minimizing background interference. Experimental results demonstrate the efficacy of this method in identifying floating objects in rivers, particularly in scenarios characterized by complex backgrounds and occlusions. The findings of this study hold significant theoretical and practical value for the intelligent identification of river floating objects, especially in applications such as water quality monitoring and reservoir management. Despite notable advancements in river floating object detection technology, challenges persist in practical implementation, including constraints on computing resources and the requirements of real-time detection. As environmental awareness increases and water quality treatment technologies become more prevalent, the pace of development in river floating object detection technology is accelerating. These technologies not only assist in effectively identifying and managing floating objects in rivers but also provide crucial information for the protection of the river’s ecological environment. As the technology advances and its application scope expands, river floating object detection is expected to present even greater application prospects and development opportunities.