Abstract
Object detection is a crucial component of remote sensing image processing, and with the maturation of deep learning technologies it has become one of the primary analysis methods. Nonetheless, detecting small objects in remote sensing images remains a significant challenge. To address this issue, this study proposes an enhanced network model based on You Only Look Once version 5 (YOLOv5), aimed at improving the detection of small objects in remote sensing images. The model employs a novel backbone network, ODCSP-Darknet53, to enhance feature extraction efficiency, and incorporates a small object enhancement bi-directional feature pyramid network (STEBIFPN) structure in the neck of the network to optimize the scaling of small object information. Additionally, we design two distinct weighted fusion strategies to further boost the model’s performance in detecting small objects. In the detection head portion of the model, a four-head detection network specialized for small objects is constructed, and adaptively spatial feature fusion (ASFF) technology is introduced to optimize small object recognition. Experiments conducted on the DOTA and DIOR datasets demonstrate that our model achieves a mean average precision (mAP) of 75.9% and 80.5%, respectively, with 13.4 M parameters and a computational cost of 30.2 GFLOPs. Compared to the original YOLOv5s model, our model exhibits significant performance improvements in detecting typical small objects such as bridges and ships. This research thus provides an effective solution for object detection in the field of remote sensing image processing.
Remote sensing technology plays a significant role in both military and civilian domains, including traffic monitoring1, maritime rescue2, and aviation control3, making the analysis and processing of remote sensing images critically important. Compared to traditional images, remote sensing images are captured from high altitudes, which, owing to factors such as viewing angle and distance, results in more complex backgrounds and many small objects. The objects in remote sensing images often exhibit arbitrary orientations, large scale variations, highly uneven distributions, and large aspect ratios, all of which add to the challenges of object detection in these images4,5,6. Object detection, as a key aspect of remote sensing image processing, is extensively used in fields such as oceanography, urban planning, and agricultural cultivation. Current object detection methods include traditional machine learning algorithms, which are relatively cumbersome because they require the manual selection and design of feature extraction algorithms to derive useful features from images. In contrast, convolutional neural network (CNN) based methods have rapidly evolved and can be categorized into two-stage and one-stage detection methods. Two-stage detection methods first generate a set of candidate boxes, then classify and refine the positioning of these boxes; common two-stage models include R-CNN7, Mask R-CNN8, and Faster R-CNN9. One-stage detection methods skip the generation and filtering of candidate boxes, directly predicting the class and location of objects within the image. By simplifying the detection pipeline, this approach significantly speeds up processing and is suitable for real-time applications. Popular one-stage algorithms include the YOLO series10,11,12,13,14,15,16,17 and SSD18.
YOLOv5s detection results on the DOTA dataset: figure (a) shows the YOLOv5s detection results and figure (b) shows the ground-truth annotation boxes.
Among them, the YOLO series has been particularly effective in balancing accuracy with detection speed. However, there is still substantial room for improvement on complex remote sensing images. For instance, as shown in Fig. 1, the objects to be detected are small and densely packed; comparing the YOLOv5s detection results in Fig. 1(a) with the ground-truth annotations in Fig. 1(b) clearly reveals missed detections, reflecting the insufficiency in extracting information from small objects.
In recent years, improvements to the YOLO series networks have achieved favourable results. Wu et al.19 eliminated unnecessary residual modules and introduced a refined residual coordinate attention module, replacing the average pooling operation, which enhanced the feature representation of densely packed small objects. They also used a differential evolution algorithm to generate anchor boxes of various scales, adapting to the diverse object sizes present in HRRSI. Xie et al.20 proposed a lightweight detection model named CSPPartial-YOLO. This model incorporates a partial hybrid dilated convolution (PHDC) module, combining hybrid dilated convolutions with partial convolutions to increase the receptive field at a lower computational cost. They also constructed the CSPPartialStage to enhance the detection of small, complexly distributed objects in remote sensing images. Yang et al.21 used IMNet as the backbone feature extraction network, significantly enhancing feature extraction capabilities while reducing the parameter count, and employed Slim-BiFPN for adaptive fusion of multi-scale features with fewer parameters. Liu et al.22 drew inspiration from residual networks to create YOLO-Extract, integrating coordinate attention into the network. They also combined hybrid dilated convolutions with a redesigned residual structure, enhancing the extraction of shallow-layer features and positional information, and optimizing feature extraction for objects of varying scales. Jiang et al.23 aimed to improve the detection accuracy for small objects by utilizing the C3D module to fuse deep and shallow semantic information, optimizing the multi-scale issues of remote sensing objects, and introduced a feature extraction method based on the region attention (RA) mechanism combined with the Swin Transformer backbone.
Li et al.24 introduced the 3D attention mechanism SimAM, which adaptively weights each channel and three-dimensional spatial features, reducing interference from irrelevant information in complex scenarios and enhancing the detection of small objects. Cheng et al.25 proposed using 1D convolution in the efficient channel attention bottleneck module, which enhances the backbone’s feature extraction capability for small and elongated defects. Zhang et al.26 employed a combination of the normalized weighted distance loss small object detection algorithm and the wise intersection over union loss function to replace the original loss function, thereby improving small object detection performance. Sharba et al.27 introduced an attention mechanism into YOLOv5, placed after the last layer of the backbone, slightly increasing the number of parameters to enhance the feature extraction capability of the backbone network. Guo et al.28 integrated SKAttention into the backbone layer of the network to address overlapping and redundant information within the model. Yu et al.29 enhanced the model’s ability to fit targets by replacing the sigmoid linear unit (SiLU) activation function with the exponential linear unit (ELU). In addition, S.O. Slim et al.30 improved the detection accuracy of the YOLOv5 model by employing data augmentation and transfer learning techniques, and Ma et al.31 proposed a scale decoupling module (SDM) to enhance small object features by suppressing large object features in the shallow layers.
In the aforementioned methods, the network has been optimized for small object detection. However, due to the unique characteristics of remote sensing images, merely incorporating attention mechanisms and modifying the loss function to enhance model nonlinearity is insufficient to achieve satisfactory performance in remote sensing object detection. Therefore, the model proposed in this paper is specifically optimized for small remote sensing objects, with the following key improvements:
-
In the backbone of the network, omni-dimensional dynamic convolution (ODConv)32 has been introduced at the shallow layers of the existing backbone network, forming the new ODCSP-Darknet53 backbone network. This improvement enhances the model’s capability to extract information from complex backgrounds and small objects in remote sensing images, facilitating deeper learning of small object features.
-
In the neck region of the network, an efficient small object enhancement bi-directional feature pyramid network (STEBIFPN) structure is designed to optimize the fusion paths for small object information. This improved feature pyramid network structure facilitates more effective extraction and transmission of small object features, thereby enhancing the precision and efficiency of small object detection. Moreover, this structure employs two different weighted fusion methods, further optimizing the scaling of small object information while effectively extracting these details.
-
To enhance the detection capabilities for extremely small objects, this study employs a four-head detection structure and specifically adds a detection head dedicated to extremely small objects. Simultaneously, the adaptively spatial feature fusion (ASFF)33 technology is introduced, which effectively integrates multi-scale information through detection heads of various sizes. This improves the model’s generalization ability and detection performance when handling objects of different sizes.
With the above improvements, the model proposed in this paper demonstrates significant performance improvement in small object detection in remote sensing images.
Methodology
Figure. 2 illustrates the improved network model. In our enhanced feature convergence network (EFCNet), the backbone employs the ODCSP-Darknet53 architecture.
This structure optimizes CSP-Darknet53 by replacing the standard convolutions with ODConv in the shallow layers, thereby enhancing the model’s ability to extract small object features while also reducing the parameter count and computational cost to some extent. In the neck, the addition of a dedicated detection head for small objects offers more possibilities for fusion paths. Moreover, the fusion strategy is improved by introducing a novel adaptive weighted fusion method. For the small object paths in the fusion process, a CBH structure was specifically designed, and the convolutional block attention module (CBAM)34 was incorporated to mitigate background interference. In the detection head, the ASFF technique was employed to further strengthen information fusion, leading to improved detection performance.
Network structure.
In the detection head section of the network, a four-head structure is adopted, and ASFF is used to replace the original YOLO detection head. The application of ASFF further promotes the fusion of multi-scale information, significantly enhancing detection outcomes. These improvements allow EFCNet to exhibit exceptional performance in processing complex images, especially in the detection of small objects.
Backbone network improvements
Omni-dimensional dynamic convolution structure.
The ODConv structure is shown in Fig. 3. The input feature map is initially processed through global average pooling (GAP) to compress it into a 1 × 1 × Cin vector, where Cin denotes the number of input channels, thereby reducing dimensionality and computational complexity. After being compressed through a fully connected (FC) layer, the features are subjected to a non-linear transformation via the ReLU activation function to eliminate negative values. The structure then divides into four branches: the first branch predicts the spatial position weights of each convolutional kernel, aiding in object localization; the second branch predicts the weights of the input channels, analysing the structural features of objects within these channels; the third branch measures the weights of the output channels to ensure the integrity of feature information during the convolution process; and the fourth branch predicts the weights of the convolutional kernels, refining the capture of local features specific to different object categories. These mechanisms optimize the network’s ability to extract detailed information and enhance its detection capabilities for small objects. The output formula for ODConv is as follows:
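The formula itself appears to have been lost during extraction. Reconstructed from the symbol definitions in the following paragraph (and consistent with the standard ODConv formulation), it reads:

```latex
y = \left( a_{w1} \odot a_{s1} \odot a_{c1} \odot a_{f1} \odot W_{1}
    + \cdots +
    a_{wn} \odot a_{sn} \odot a_{cn} \odot a_{fn} \odot W_{n} \right) * x
```

where \(*\) denotes the convolution operation and \(n\) is the number of candidate kernels.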
In the formula, x represents the input features and y represents the output features. The terms \(\:{a}_{wi}\), \(\:{a}_{si}\), \(\:{a}_{ci}\), and \(\:{a}_{fi}\) represent the attention scalars for the convolutional kernel \(\:{W}_{i}\) along the kernel, spatial, input channel, and output channel dimensions, respectively. The symbol ⊙ signifies element-wise multiplication along these various dimensions.
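The aggregation of the four attention scalars can be illustrated in a few lines of NumPy. This is a hypothetical sketch of the weighting step only (function names and shapes are ours, not the authors' code); the aggregated kernel would then be convolved with the input x:

```python
import numpy as np

def odconv_kernel(kernels, a_w, a_s, a_c, a_f):
    """Aggregate n candidate kernels into one, weighted along four dimensions.

    kernels: (n, C_out, C_in, k, k) candidate convolution kernels W_i
    a_w:     (n,)        kernel-wise attention scalars
    a_s:     (n, k, k)   spatial attention
    a_c:     (n, C_in)   input-channel attention
    a_f:     (n, C_out)  output-channel attention
    """
    w = (kernels
         * a_w[:, None, None, None, None]
         * a_s[:, None, None, :, :]
         * a_c[:, None, :, None, None]
         * a_f[:, :, None, None, None])
    return w.sum(axis=0)  # single kernel, to be convolved with the input x

# With all attentions set to 1 the result is simply the sum of the n kernels.
ks = np.ones((4, 8, 3, 3, 3))
agg = odconv_kernel(ks, np.ones(4), np.ones((4, 3, 3)),
                    np.ones((4, 3)), np.ones((4, 8)))
```

In practice the attention scalars come from the four FC branches described above; here they are fixed constants for illustration.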
Neck improvements
The original three-layer PAN-FPN in YOLOv5 offers limited integration and extraction of detail and semantic information. Adding an additional minuscule object detection layer to extract more feature information, along with introducing a bidirectional feature pyramid network (BiFPN), can enhance the fusion of features at various scales. Although using BiFPN fusion nodes to integrate shallow and deep features yields feature maps rich in both detail and semantic information, fusing deep features with shallow ones inevitably introduces background details that may interfere with object detection. This problem is particularly pronounced in remote sensing images with significant detection interference, especially where the object blends with similarly textured backgrounds or where the background blurs the object; such cases can diminish the advantages of BiFPN and four-layer detection and consequently reduce the model’s detection performance. To address this, this paper redesigns the feature fusion nodes, proposing an enhanced, small object detection oriented bidirectional feature pyramid.
Figure (a) shows the structure of PAN-FPN, and figure (b) shows the structure of STEBIFPN.
Figures 4(a) and 4(b) illustrate the frameworks of PAN-FPN and STEBIFPN, respectively. In the diagrams, C represents the use of the original concat structure, while F represents the use of a completely new fusion approach; red lines represent 2× downsampling fusion paths, blue lines represent 2× upsampling fusion paths, and black lines indicate conventional fusion paths. The more downsampling a feature map undergoes, the richer the semantic information it contains but the sparser the detail information it retains. Therefore, after multiple downsampling operations, small objects, which carry little spatial detail to begin with, may lose the features that represent them.
Compared to PAN-FPN, STEBIFPN introduces an ultra-small object detection layer and spans two new feature fusion paths between the neck output end and the backbone output end. The ultra-small object detection layer not only incorporates the output feature map from the most information-rich and shallowest C3 module of the backbone network but also extends the feature extraction path of the neck network, allowing the last three detection layers to gather more semantic information. The two newly added fusion paths enable the neck network to acquire more detailed information.
Moreover, all fusion nodes in PAN-FPN employ a method of direct concatenation of feature map channels for fusion, while the fusion nodes in STEBIFPN are divided into four different structures, as shown in Figure. 5.
Diagram of the new fusion method structure.
Figure. 5 displays fusion nodes with two and three input ports, where X1 to X3, serving as input feature maps to the fusion nodes, represent the output feature maps from shallow to deep layers. X1, containing the richest detail information, is processed through the CBH and CBAM attention modules to enhance object features and reduce background interference. Subsequently, X1 and other input feature maps are multiplied by the weighted coefficients \(\:{W}_{i}\) to adjust the importance of the feature maps before being concatenated for fusion. The initial value of \(\:{W}_{i}\) is set to 1, and it can be adaptively adjusted through training to accommodate learning.
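The adaptive weighting described above can be sketched concisely. The following NumPy snippet is an illustrative sketch (function and variable names are ours, not the authors' code) that omits the CBH/CBAM processing applied to X1:

```python
import numpy as np

def weighted_concat_fusion(features, weights):
    """Sketch of a STEBIFPN fusion node: each input feature map is scaled by a
    learnable scalar weight W_i (initialised to 1) before channel-wise
    concatenation. In the real network the weights are trained end to end."""
    scaled = [w * f for w, f in zip(weights, features)]
    return np.concatenate(scaled, axis=0)  # axis 0 = channel dim of (C, H, W)

# Three inputs with 4, 8 and 16 channels fuse into a 28-channel map.
x1, x2, x3 = (np.ones((c, 32, 32)) for c in (4, 8, 16))
fused = weighted_concat_fusion([x1, x2, x3], np.ones(3))
```

Because the weights multiply whole feature maps, training can suppress inputs that mostly carry background while amplifying those rich in small object detail.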
Figure (a) shows the structure of CBH and figure (b) shows the structure of CBAM.
Figure. 6(a) illustrates the CBH module, which is composed of a two-dimensional convolution layer, a batch normalization layer, and a Hardswish activation function. The CBH module significantly enhances the quality and expressive capability of the feature map by effectively extracting spatial features through convolution operations. The batch normalization layer standardizes the feature distribution, enhancing the model’s generalization ability. The Hardswish activation function introduces non-linearity, further boosting the network’s processing capabilities, enabling the overall network to more effectively learn and represent complex patterns in the data.
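The Hardswish activation in the CBH block has a simple closed form, x · ReLU6(x + 3) / 6; a minimal NumPy sketch (the function itself is standard, its use here simply mirrors the description above):

```python
import numpy as np

def hardswish(x):
    """Hardswish activation used in the CBH block: x * ReLU6(x + 3) / 6.
    Acts like the identity for large positive x and returns 0 for x <= -3."""
    return x * np.clip(x + 3.0, 0.0, 6.0) / 6.0
```

Compared with Swish, Hardswish replaces the sigmoid with a piecewise-linear gate, which is cheaper to compute while keeping a similar non-linearity.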
Figure. 6(b) displays the structure of the CBAM attention mechanism, which consists of a channel attention mechanism (CAM) and a spatial attention mechanism (SAM). Initially, the input feature map is processed through CAM to enhance the representation of objects at the channel level. Then, SAM enhances the spatial representation capability to reduce background interference. Specifically, CAM performs global max pooling and average pooling on the input feature map, followed by an MLP to compute two sets of weights. These weights are then added together and processed through a sigmoid activation function to obtain a channel-wise weight vector for the input feature map. This weight vector is multiplied by the input feature map channel-wise to produce a new set of feature maps. SAM then performs further spatial dimension analysis on the feature maps processed by CAM. Specifically, SAM uses a channel-wise max pooling and average pooling strategy on the feature maps output by CAM, generating two independent spatial attention maps. These spatial attention maps are stacked channel-wise and input into a small convolution layer, which contains only one convolution kernel to integrate spatial information from different pooling operations. The feature maps processed by this convolution layer generate the final spatial attention map through a sigmoid activation function. This map explicitly indicates which spatial areas in the feature map are key, thereby aiding the network in focusing on processing information from these areas.
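The CAM and SAM computations described above can be sketched as follows. This is a hypothetical NumPy illustration with explicitly passed-in toy weights; the real module learns its MLP and convolution parameters, and CBAM typically uses a 7 × 7 kernel in the spatial branch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w1, w2):
    """CAM: shared MLP (w1, w2) applied to global avg- and max-pooled
    descriptors; the two results are summed and passed through a sigmoid."""
    avg = x.mean(axis=(1, 2))                     # (C,)
    mx = x.max(axis=(1, 2))                       # (C,)
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)  # FC -> ReLU -> FC
    weights = sigmoid(mlp(avg) + mlp(mx))         # (C,)
    return x * weights[:, None, None]

def spatial_attention(x, kernel):
    """SAM: channel-wise avg/max pooling, a single-kernel conv, then sigmoid."""
    stacked = np.stack([x.mean(axis=0), x.max(axis=0)])  # (2, H, W)
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(stacked, ((0, 0), (pad, pad), (pad, pad)))
    h, w = x.shape[1:]
    attn = np.empty((h, w))
    for i in range(h):                # naive single-kernel 2D convolution
        for j in range(w):
            attn[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return x * sigmoid(attn)[None, :, :]

def cbam(x, w1, w2, kernel):
    """CBAM = channel attention followed by spatial attention."""
    return spatial_attention(channel_attention(x, w1, w2), kernel)

c, reduction = 8, 2
out = cbam(np.ones((c, 6, 6)),
           np.ones((c // reduction, c)), np.ones((c, c // reduction)),
           np.ones((2, 7, 7)))
```

The channel branch decides *which* feature maps matter; the spatial branch then decides *where* in those maps to look, which is what suppresses background clutter around small objects.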
Detection head improvements
Due to the complexity of remote sensing images and the presence of information across various scales, a four-headed ASFF structure has been integrated into the model to enhance its generalization ability through multi-scale adaptive fusion. The structure is illustrated in the Figure. 7.
ASFF detection head structure diagram.
In Fig. 7, Level1, Level2, Level3, and Level4 correspond to the feature maps output from the network’s neck. The output of ASFF1 is obtained by adding the appropriately weighted features from Level1 through Level4. The inputs and outputs for ASFF2, ASFF3, and ASFF4 are processed in the same manner as ASFF1. The computational formula is as follows:
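The equation referenced here appears to be missing from the extracted text. Based on the symbol definitions in the following paragraph (a four-level extension of the original three-level ASFF formulation), it reads:

```latex
y_{ij}^{l} = \alpha_{ij}^{l} \cdot x_{ij}^{1 \to l}
           + \beta_{ij}^{l}  \cdot x_{ij}^{2 \to l}
           + \gamma_{ij}^{l} \cdot x_{ij}^{3 \to l}
           + \kappa_{ij}^{l} \cdot x_{ij}^{4 \to l}
```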
In the formula, \(\:{y}_{ij}^{l}\) represents the output of the ASFF network, while \(\:{x}_{\text{i}j}^{1\to\:l}\), \(\:{x}_{\text{i}j}^{2\to\:l}\), \(\:{x}_{\text{i}j}^{3\to\:l}\), and \(\:{x}_{\text{i}j}^{4\to\:l}\) denote the feature vectors input at the corresponding positions. The terms \(\:{\alpha\:}_{\text{i}j}^{l}\), \(\:{\beta\:}_{\text{i}j}^{l}\), \(\:{\gamma\:}_{\text{i}j}^{l}\), and \(\:{\kappa\:}_{\text{i}j}^{l}\) are the learnable weights from Level1 through Level4 to layer \(\:l\).
Since the output of ASFF is a summation, the dimensions and the number of channels of the input features must be consistent. Therefore, the feature maps from Level1 to Level4 are processed through a 1 × 1 convolution to align the number of channels and derive the weights α, β, γ, and κ. Finally, these weight parameters are normalized through a softmax layer to ensure that the four weights sum to 1. The formula for \(\:{\alpha\:}_{\text{i}j}^{l}\) is as follows:
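The softmax normalization referenced above is missing from the extracted text; following the original ASFF formulation extended to four levels (the λ terms are the control parameters produced by the 1 × 1 convolutions), it would read:

```latex
\alpha_{ij}^{l} =
\frac{e^{\lambda_{\alpha_{ij}}^{l}}}
     {e^{\lambda_{\alpha_{ij}}^{l}} + e^{\lambda_{\beta_{ij}}^{l}}
    + e^{\lambda_{\gamma_{ij}}^{l}} + e^{\lambda_{\kappa_{ij}}^{l}}}
```

with \(\beta_{ij}^{l}\), \(\gamma_{ij}^{l}\), and \(\kappa_{ij}^{l}\) defined analogously, so that \(\alpha_{ij}^{l} + \beta_{ij}^{l} + \gamma_{ij}^{l} + \kappa_{ij}^{l} = 1\).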
Experiment results and analysis
Dataset
To assess the effectiveness of the improved model proposed in this paper, a series of experimental analyses were conducted using the challenging large-scale remote sensing object detection dataset DOTA V1.0. The DOTA V1.0 dataset comprises 2806 remote sensing images with resolutions ranging from 800 × 800 to 4000 × 4000, containing 188,282 object instances across 15 categories: small vehicle, large vehicle, plane, storage tank, ship, harbor, ground track field, soccer ball field, tennis court, swimming pool, baseball diamond, roundabout, basketball court, bridge, and helicopter. The total number of objects per category is illustrated in Fig. 8. To enhance the training effectiveness of the model, the original images were pre-processed through image segmentation and padding, expanding the dataset from 2806 images of varying resolutions to 21,046 images, with 15,749 used for training and 5,297 for testing.
In addition, to further validate the effectiveness of the proposed improvements, the DIOR dataset was used for evaluation. The DIOR dataset is characterized by its large-scale number of images and instances, as well as the diversity of object categories it covers. These categories include airplane, airport, baseball field, basketball court, bridge, chimney, dam, expressway service area, expressway toll station, harbor, golf course, ground track field, overpass, ship, stadium, storage tank, tennis court, train station, vehicle, and windmill.
DOTA dataset target category statistics chart.
Experimental environment and training setup
The primary experimental setup for this study consists of an Ubuntu 20.04 operating system, PyTorch 2.0.0, Python 3.8, CUDA 11.8, and a 24GB RTX 4090 GPU. During the model training process, the resolution of all images was standardized to 640 × 640. The training utilized the stochastic gradient descent (SGD) optimizer along with a cosine annealing algorithm for weight adjustment. The Mosaic data augmentation method was used to enrich the dataset. The model’s learning rate was set at 0.01, the batch size for training images was 32, the number of training epochs was set at 300, and the complete intersection over union (CIOU) was selected as the loss function.
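For reference, the cosine annealing schedule used with SGD can be sketched as follows; lrf = 0.01 (the final-to-initial learning-rate ratio) is YOLOv5's default and is our assumption here, not a value stated in the text:

```python
import math

def cosine_lr(epoch, total_epochs=300, lr0=0.01, lrf=0.01):
    """Cosine-annealed learning rate in the YOLOv5 'one cycle' style:
    starts at lr0 and decays smoothly to lr0 * lrf over total_epochs."""
    return lr0 * ((1 - math.cos(math.pi * epoch / total_epochs)) / 2
                  * (lrf - 1) + 1)
```

The smooth decay avoids the abrupt drops of step schedules, which tends to stabilise the later epochs of training.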
Experimental evaluation indexes
In object detection, three types of metrics are commonly employed to evaluate model performance: Precision (P), Recall (R), and Mean Average Precision (mAP) serve as indicators of the model’s detection capabilities; Inference Time (IT) is used to assess the model’s detection speed; and GFLOPs and Parameters (Params) are used to evaluate the model’s deployment capabilities, where lower GFLOPs and Params suggest less demand on the hardware platform for deployment. Precision, Recall, and mAP can be calculated using the following formulas:
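The formulas referenced above are missing from the extracted text; the standard definitions (TP, FP, and FN denote true positives, false positives, and false negatives, and N is the number of object categories) are:

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
AP = \int_{0}^{1} P(R)\, dR, \qquad
mAP = \frac{1}{N} \sum_{i=1}^{N} AP_{i}
```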
Results and analysis of ablation experiment
To explore the optimal performance of the model on the DOTA dataset, we conducted an ablation study focusing on the IOU-thres and Conf-thres. The experimental results are shown in Table 1. After training YOLOv5 through all epochs, the model automatically performs validation, with the default IOU-thres and Conf-thres set to 0.6 and 0.001, respectively. Under these settings, the model achieves a mAP of 72.1%. Upon further tuning of the IOU-thres and Conf-thres, we found that when the IOU-thres was reduced to 0.4 while maintaining the Conf-thres at 0.001, both precision and recall reached their highest values, and the mAP also achieved its peak performance. This result suggests that the proper adjustment of IOU-thres and Conf-thres helps strike an effective balance between precision and recall, thereby improving the overall detection performance of the model.
To verify the effectiveness of the improved model, ablation experiments were conducted based on the original YOLOv5s framework. The results of these experiments are shown in Table 2.
In Experiment A, the original YOLOv5s model was utilized for detection, with parameter size at 7.1 M, computational load at 15.9 GFLOPs, and inference time at 2.1ms. Compared to other models, it featured lower computational costs, but it also recorded the lowest mAP at 72.7%.
Experiment B implemented the ODCSP-Darknet53 as the backbone network, showing a slight decrease in computational load to 14.0 GFLOPs and an mAP increase of 0.8% over the original YOLOv5s, with an inference time of 2.8ms. This indicates that ODConv is more efficient at extracting shallow level information than standard convolution.
In Experiment C, the detection head was modified to use ASFF, resulting in an mAP increase to 74.2%, a significant improvement over YOLOv5s. However, the parameter size and computational load also increased to 12.5 M and 24.3 GFLOPs, respectively, with the inference time rising to 2.6ms. The results demonstrate a significant enhancement in object detection capabilities after modifying the detection head.
Experiment D introduced the STEBIFPN structure, achieving an mAP of 73.4%, a 0.7% increase over YOLOv5s, with parameters at 7.3 M and computational load at 19.6 GFLOPs. The inference time was 3.7ms. The results indicate that optimizing the fusion path and adopting an improved fusion method enhanced the model’s generalization and object detection capabilities.
Experiment E involved improvements to both the backbone network and the detection head, increasing the mAP to 75.0%, a 2.3% improvement over YOLOv5s. The parameter and computational loads were 12.5 M and 22.4 GFLOPs, respectively, with an inference time of 2.6ms. Compared to Experiment C, further enhancements in both modules resulted in improved information extraction capabilities and superior detection performance.
In Experiment F, ASFF was used as the detection head and the STEBIFPN module was added. The model achieved an accuracy of 75.4%, with a parameter count of 13.4 M. It had the highest computational cost among all models, reaching 32.1 GFLOPs, and an inference time of 3.5 ms. Compared to Experiments C and D, it showed a significant improvement in accuracy.
Experiment G showcased the model proposed in this paper, with parameters at 13.4 M, computational load at 30.2 GFLOPs, and an mAP of 75.9%, with an inference time of 3.8ms. Following optimizations to the backbone network, Neck fusion modules, and detection head, the model demonstrated substantial improvements over YOLOv5s. The interaction between various modules also enhanced the model’s capability for information extraction and feature map representation.
Table 3 displays the detection results of various models on the DOTA dataset, with the best results in each category highlighted in bold. Analysis of the data reveals that the model improved in this study achieved the highest mAP among all the refined models, reaching 75.9%. In particular, in the Bridge category, which is known for its detection difficulty, accuracy rose significantly to 55.6%, ahead of the other models. Additionally, this model achieved the highest mAP values in the plane, bridge, large-vehicle, ship, tennis-court, harbor, and helicopter categories. Although its detection accuracy did not reach the optimum in some categories, the differences in mAP values were not significant. Furthermore, the frames per second (FPS) of this model was relatively high at 39, ranking among the better levels of the compared models. Overall, the model proposed in this paper has improved both detection accuracy and speed, proving its applicability and effectiveness in detecting various small-sized object categories.
Comparative tests on different datasets
In addition, to demonstrate the effectiveness of the improved model, we conducted comparative experiments on the DIOR39 dataset, as shown in Table 4. The comparison includes models from the YOLO series, such as YOLOv7 and YOLOv8, as well as other recent improved models. The results in the table show that our model achieved the highest overall accuracy of 80.5%, which is 0.3% higher than YOLOv7. At the same time, our model has significantly fewer parameters and lower computational complexity than YOLOv7. Although our model has slightly more parameters and computational complexity compared to other models, its overall mAP performance is superior.
Visualisation of test results
To validate the improvements proposed in this study, we selected four distinct scenarios to demonstrate the detection performance of our model, as shown in Fig. 9. Figures 9(a), 9(c), 9(e), and 9(g) display the detection results produced by our model, while Figs. 9(b), 9(d), 9(f), and 9(h) present the actual annotation boxes. Figure 9(a) illustrates the model’s detection results for densely arranged small objects, where the model did not miss or misdetect any objects compared to the actual annotations. Notably, the model successfully identified an unannotated ship located in the upper right corner of the image. Figure 9(c) describes a scene from the DOTA dataset after object segmentation, where some object information was lost. Despite the airplane being truncated in the lower left corner, the model was still able to detect it accurately. Figure 9(e) demonstrates the detection of a bridge under low-light conditions, a particularly challenging task in the DOTA dataset, especially under adverse weather conditions. Nevertheless, the model accurately recognized the bridge within the image, showcasing the effectiveness of the improvements. Figure 9(g) illustrates the detection of various types of bridges against a more complex background, where the model did not miss any detections and was capable of identifying vehicles without external annotations, further validating its robustness.
Visual detection results on the DOTA dataset: panels (a), (c), (e), and (g) on the left are the visual detection results of EFCNet, and panels (b), (d), (f), and (h) on the right are the ground-truth annotation boxes.
Comparison of heatmaps between EFCNet and YOLOv5s on the DOTA dataset: figure (a) is the EFCNet heatmap, and figure (b) is the YOLOv5s heatmap.
To further demonstrate the differences between the improved model developed in this study and the original YOLOv5s model, we utilized heatmaps to visualize the differences in model attention between the two. As shown in Figure. 10(a), the heatmap of the improved model from this study is displayed, while Figure. 10(b) shows the heatmap results of the original YOLOv5s model. Comparing these two images, it is evident that the attention of the model proposed in this study is more focused on the detection objects. In contrast, the attention of the original YOLOv5s model is more dispersed, particularly within the red-boxed area in Figure. 10(b), where the original model failed to effectively focus on the object.
These results indicate that the modifications introduced in the backbone network and the CBAM attention mechanism in the neck of the improved model effectively reduce the impact of background noise on the model and enhance its focus on object information, which helps improve detection performance. The implementation of these improvement strategies not only optimizes the model’s focus but also enhances its ability to recognize objects in complex scenarios.
Comparison of DOTA dataset in different scenarios.
Furthermore, to more intuitively demonstrate the differences between EFCNet and YOLOv5, two sets of comparison images are presented in Fig. 11. The left side shows the detection results of the proposed model, while the right side displays those of YOLOv5. In the comparisons between subfigures (a) and (b), YOLOv5 exhibits missed detections, as indicated by the orange bounding box in subfigure (b). Similarly, in subfigures (c) and (d), YOLOv5 also fails to detect several vehicles. In contrast, EFCNet demonstrates superior detection capabilities, highlighting the effectiveness of the proposed improvements.
Conclusions
This paper builds upon the YOLOv5s framework to enhance the network’s ability to detect small objects in remote sensing imagery. The modified network uses ODCSP-Darknet53 as the backbone, which strengthens the model’s ability to extract detailed information from small objects. To optimize the fusion of multi-scale information, a new fusion pathway and weighting method were adopted, and a detection head specifically designed for small objects was added. These modifications achieve good detection results without significantly increasing the model’s parameter count or computational demands. Moreover, the original detection head was replaced with an ASFF detection head, which further enhances detection through adaptively weighted multi-scale feature fusion. Ablation studies on the DOTA dataset demonstrate the effectiveness of each module: EFCNet outperforms the other improved models, achieving an mAP of 75.9% with 13.4 M parameters and 30.2 GFLOPs, while maintaining a relatively high FPS. Thus, this work provides an effective method for detecting small objects in remote sensing imagery. However, the model still has limitations: although accuracy has improved, the additional parallel branches may reduce detection efficiency when the model is deployed on edge computing devices. In future work, we will focus on further optimizing the model to improve the detection accuracy of small objects and reduce the false detection rate, with the goal of achieving better detection performance.
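The core idea behind the ASFF-style fusion used in the detection head can be illustrated with a minimal sketch: per-pixel softmax weights are computed over the feature levels, and the levels are then summed with those weights. The snippet below is a simplified NumPy illustration, not the actual implementation; the function name, the shapes, and the use of precomputed weight logits are our own assumptions, whereas in the real network the logits are produced by learned convolutional layers and the feature maps are first rescaled to a common resolution.

```python
import numpy as np

def asff_fuse(features, weight_logits):
    """ASFF-style fusion sketch: per-pixel softmax over levels, then a weighted sum.

    features:      list of L arrays, each (C, H, W), already resized to one scale
    weight_logits: array (L, H, W), one (assumed precomputed) logit map per level
    """
    # Numerically stable softmax across the level axis, independently per pixel.
    w = np.exp(weight_logits - weight_logits.max(axis=0, keepdims=True))
    w = w / w.sum(axis=0, keepdims=True)  # shape (L, H, W), sums to 1 over levels
    # Weighted sum of levels; w[i][None] broadcasts (1, H, W) against (C, H, W).
    fused = sum(f * w[i][None] for i, f in enumerate(features))
    return fused
```

With all-equal logits the fusion degenerates to a plain average of the levels, which makes the role of the learned weights easy to see: training shifts the per-pixel weights toward whichever scale carries the most useful evidence at that location.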
Data availability
The data generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.
References
Patil, P. Applications of deep learning in traffic management: A review. Int. J. Bus. Intell. Big Data Analytics. 5 (1), 16–23 (2022).
Wang, S. et al. A deep-learning-based sea search and rescue algorithm by UAV remote sensing, Proc. IEEE CSAA Guid. Navigation Control Conf., pp. 1–5, (2018).
Xu, Y. et al. Rapid airplane detection in remote sensing images based on multilayer feature fusion in fully convolutional neural networks. Sensors 18(7), 2335. https://doi.org/10.3390/s18072335 (2018).
Guo, D. et al. A remote sensing target detection model based on lightweight feature enhancement and feature refinement extraction. IEEE J. Sel. Top. Appl. Earth Observations Remote Sens. 17, 9569–9581. https://doi.org/10.1109/JSTARS.2024.3394887 (2024).
Cheng, G. & Han, J. A survey on object detection in optical remote sensing images, ISPRS J. Photogramm. Remote Sens., vol. 117, pp. 11–28, Jul. (2016).
Xu, Y. et al. Gliding vertex on the horizontal bounding box for multi-oriented object detection, IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 4, pp. 1452–1459, (2020).
Girshick, R., Donahue, J., Darrell, T. & Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation, Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 580–587, (2014).
He, K., Gkioxari, G., Dollár, P. & Girshick, R. Mask R-CNN, Proc. IEEE Int. Conf. Comput. Vis., pp. 2961–2969, (2017).
Ren, S., He, K., Girshick, R. & Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks, Proc. Adv. Neural Inf. Process. Syst., vol. 28, (2015).
Redmon, J., Divvala, S., Girshick, R. & Farhadi, A. You only look once: Unified real-time object detection, Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 779–788, (2016).
Redmon, J. & Farhadi, A. YOLO9000: Better, faster, stronger, Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 7263–7271, (2017).
Redmon, J. & Farhadi, A. YOLOv3: An incremental improvement, arXiv:1804.02767, (2018).
Bochkovskiy, A., Wang, C. Y. & Liao, H. Y. M. YOLOv4: Optimal speed and accuracy of object detection, arXiv:2004.10934, (2020).
Aboah, A., Wang, B., Bagci, U. & Adu-Gyamfi, Y. Real-time multi-class helmet violation detection using few-shot data sampling technique and YOLOv8, Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), pp. 5349–5357, Jun. (2023).
Wang, C. Y., Bochkovskiy, A. & Liao, H. M. Scaled-YOLOv4: Scaling cross stage partial network, Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 13024–13033, Jun. (2021).
Jocher, G. et al. Ultralytics/yolov5: v4.0 – nn.SiLU() activations, Weights & Biases logging, PyTorch Hub integration (Jan. 2021).
Li, C. et al. YOLOv6 v3.0: A full-scale reloading, arXiv:2301.05586, (2023).
Liu, W. et al. SSD: Single shot MultiBox detector, Proc. Eur. Conf. Comput. Vis., pp. 21–37, (2016).
Wu, Q., Wu, Y., Li, Y. & Huang, W. Improved YOLOv5s with coordinate attention for small and dense object detection from optical remote sensing images. IEEE J. Sel. Top. Appl. Earth Observations Remote Sens. 17, 2543–2556. https://doi.org/10.1109/JSTARS.2023.3341628 (2024).
Xie, S., Zhou, M., Wang, C. & Huang, S. CSPPartial-YOLO: A lightweight YOLO-Based method for typical objects detection in remote sensing images. IEEE J. Sel. Top. Appl. Earth Observations Remote Sens. 17, 388–399. https://doi.org/10.1109/JSTARS.2023.3329235 (2024).
Yang, Y., Ju, Y. & Zhou, Z. A super lightweight and efficient SAR image ship detector. IEEE Geosci. Remote Sens. Lett. 20, 1–5, Art. no. 4006805. https://doi.org/10.1109/LGRS.2023.3284093 (2023).
Liu, Z., Gao, Y., Du, Q., Chen, M. & Lv, W. YOLO-Extract: Improved YOLOv5 for aircraft object detection in remote sensing images. IEEE Access 11, 1742–1751. https://doi.org/10.1109/ACCESS.2023.3233964 (2023).
Jiang, X. & Wu, Y. Remote sensing object detection based on Convolution and Swin transformer. IEEE Access. 11, 38643–38656. https://doi.org/10.1109/ACCESS.2023.3267435 (2023).
Li, J. et al. BP-YOLO: A real-time product detection and shopping behaviors recognition model for intelligent unmanned vending machine. IEEE Access 12, 21038–21051. https://doi.org/10.1109/ACCESS.2024.3361675 (2024).
Cheng, Z., Gao, L., Wang, Y., Deng, Z. & Tao, Y. EC-YOLO: Effectual detection model for steel strip surface defects based on YOLO-V5. IEEE Access 12, 62765–62778. https://doi.org/10.1109/ACCESS.2024.3391353 (2024).
Zhang, Q., Liu, L., Yang, Z., Yin, J. & Jing, Z. WLSD-YOLO: A model for detecting surface defects in wood lumber. IEEE Access 12, 65088–65098. https://doi.org/10.1109/ACCESS.2024.3395623 (2024).
Sharba, A. & Kanaan, H. Improving tiny object detection in aerial images with Yolov5. J. Eng. Sustainable Dev. 29 (1), 57–67. https://doi.org/10.31272/jeasd.2682 (2025).
Guo, J. et al. Research on night-time vehicle target detection based on improved KSC-YOLO V5. SIViP 19, 69. https://doi.org/10.1007/s11760-024-03576-5 (2025).
Yu, D. et al. Improved YOLOv5: efficient object detection for fire images. Fire 8, 38. https://doi.org/10.3390/fire8020038 (2025).
Slim, S. O. et al. Smart insect monitoring based on YOLOV5 case study: Mediterranean fruit fly Ceratitis capitata and peach fruit fly Bactrocera zonata. Egypt. J. Remote Sens. Space Sci. 26(4). https://doi.org/10.1016/j.ejrs.2023.10.001 (2023).
Ma, Y., Chai, L. & Jin, L. Scale decoupled pyramid for object detection in aerial images. IEEE Trans. Geosci. Remote Sens. 61, 1–14, Art. no. 4704314. https://doi.org/10.1109/TGRS.2023.3298852 (2023).
Li, C., Zhou, A. & Yao, A. Omni-dimensional dynamic convolution, arXiv:2209.07947, (2022).
Liu, S., Huang, D. & Wang, Y. Learning spatial fusion for single-shot object detection, arXiv:1911.09516, Nov. (2019).
Woo, S., Park, J., Lee, J. & Kweon, I. CBAM: Convolutional block attention module, Proc. Eur. Conf. Comput. Vis. (ECCV), pp. 3–19, (2018).
Ren, S., He, K., Girshick, R. & Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39(6), 1137–1149 (2017).
Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M. & Garnett, R. (eds) Advances in Neural Information Processing Systems 28, 91–99 (Curran, Red Hook, NY, USA, 2015).
Yang, Y. et al. Adaptive knowledge distillation for lightweight remote sensing object detectors optimizing. IEEE Trans. Geosci. Remote Sens. 60, Art. no. 5623715 (2022).
Liu, F. et al. R2YOLOX: A lightweight refined anchor-free rotated detector for object detection in aerial images. IEEE Trans. Geosci. Remote Sens. 60, 1–15, Art. no. 5632715. https://doi.org/10.1109/TGRS.2022.3215472 (2022).
Ming, Q., Zhou, Z., Miao, L., Yang, X. & Dong, Y. Optimization for oriented object detection via representation invariance loss. IEEE Geosci. Remote Sens. Lett. 19, Art. no. 8021505. https://doi.org/10.1109/LGRS.2021.3115110 (2022).
Xu, Y. et al. A High-Order feature association network for dense object detection in remote sensing. IEEE J. Sel. Top. Appl. Earth Observations Remote Sens. 17, 1513–1522. https://doi.org/10.1109/JSTARS.2023.3335288 (2024).
Ming, Q., Miao, L., Zhou, Z. & Dong, Y. CFC-Net: A critical feature capturing network for arbitrary-oriented object detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 61, 1–14 (2021).
Chen, Y., Jiang, W. & Wang, Y. FAMHE-Net: Multi-scale feature augmentation and mixture of heterogeneous experts for oriented object detection. Remote Sens. 17, 205. https://doi.org/10.3390/rs17020205 (2025).
Cheng, K. et al. Tiny object detection via regional cross self-attention network. IEEE Trans. Circuits Syst. Video Technol. 34(10), 8984–8996 (2024).
Yang, Y. et al. Statistical sample selection and multivariate knowledge mining for lightweight detectors in remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 60, Art. 3192013 (2022).
Zhang, J. et al. SuperYOLO: Super resolution assisted object detection in multimodal remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 61, Art. 5605415 (2023).
Acknowledgements
This work was supported in part by the Natural Science Foundation of China under Grant No. U2341223 and in part by the Beijing Municipal Natural Science Foundation under Grant No. 4232067.
Author information
Authors and Affiliations
Contributions
W.Y. was responsible for the overall content structure, experimental design, and analysis of the paper. W.X. was responsible for the overall layout of the paper, references, and language polishing. L.Z. and Z.S. were responsible for the overall direction and supervision of the paper. All authors reviewed the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Wang, Y., Li, Z., Zhu, S. et al. EFCNet for small object detection in remote sensing images. Sci Rep 15, 20393 (2025). https://doi.org/10.1038/s41598-025-09066-z
DOI: https://doi.org/10.1038/s41598-025-09066-z
This article is cited by
- Enhancing object detection in remote sensing images with improved YOLOv8 model. Scientific Reports (2025).
- Human-in-the-loop target location detection method based on the region of interest in multi-object scenes. Signal, Image and Video Processing (2025).