Abstract
With advancements in drone control technology, low-altitude remote sensing image processing holds significant potential for intelligent, real-time urban management. However, achieving high accuracy with deep learning algorithms remains challenging due to the stringent requirements for low computational cost, minimal parameters, and real-time performance. This study introduces the Global Object-Oriented Dynamic Network (GOOD-Net) algorithm, comprising three fundamental components: an object-oriented, dynamically adaptive backbone network; a neck network designed to optimize the utilization of global information; and a task-specific processing head augmented for detailed feature refinement. Novel module components, such as the ReSSD Block, GPSA, and DECBS, are integrated to enable fine-grained feature extraction while maintaining computational and parameter efficiency. The efficacy of individual components in the GOOD-Net algorithm, as well as their synergistic interaction, is assessed through ablation experiments. Evaluation conducted on the VisDrone dataset demonstrates substantial enhancements. Furthermore, experiments assessing robustness and deployment on edge devices validate the algorithm’s scalability and practical applicability. Visualization methods further highlight the algorithm’s performance advantages. This research presents a scalable object detection framework adaptable to various application scenarios and contributes a novel design paradigm for efficient deep learning-based object detection.
Similar content being viewed by others
Introduction
Remote sensing and object detection technologies have been instrumental in driving advancements across diverse domains, including urban traffic management, intelligent monitoring systems, vehicle tracking, and medical diagnostics1,2,3,4,5,6,7,8,9. By enabling real-time data acquisition and analysis, it has contributed to improved system efficiency, accuracy, and innovation in these fields. Although deep learning-based detection models10,11,12,13 have achieved notable success in detecting medium and large objects with high accuracy, the detection of small objects in images persists as a considerable challenge due to intrinsic factors such as low resolution, occlusion, and constrained feature representation. In addition, the images processed by target detection algorithms in low-altitude remote sensing equipment differ significantly from those in traditional datasets due to complex background scenes, diverse perspective changes, and variations in illumination. These factors significantly increase the probability of both false positive and false negative outcomes. Moreover, the limited hardware resources of low-altitude remote sensing equipment necessitate detection models that balance real-time processing speed with efficient parameter scaling. Consequently, the development of a small target detection algorithm tailored for remote sensing imagery, which ensures enhanced detection accuracy while optimizing storage efficiency, represents a critical area of research with extensive interdisciplinary applications.
Deep learning-driven object detection approaches, including single-stage and dual-stage detectors, are typically categorized as anchor-based14,15,16,17,18,19,20,21 and anchor-free methods22,23,24,25. Anchor-based methods rely on predefined anchor boxes to detect objects, while anchor-free methods eliminate the need for anchor box design by directly predicting object positions and sizes. This approach simplifies the detection process and enhances model flexibility. However, the absence of anchor prior information often results in reduced detection accuracy for anchor-free methods. Two-stage detectors typically achieve higher detection accuracy by refining candidate regions through additional computational steps. However, this refinement entails a sacrifice of slower inference speeds. By comparison, single-stage detectors employ dense grids or anchors during the detection process, performing detection in a single forward pass, thereby eliminating the need for candidate box generation. While this approach enables faster inference speeds, it sacrifices some degree of detection accuracy. As a pioneering single-stage detector, the You Only Look Once (YOLO)14 model partitions the input feature map into a grid, generating predictions for both the coordinates and category likelihoods of the bounding boxes within each grid cell. Since its introduction, YOLO has demonstrated significant advantages in detection speed and processing efficiency. However, its performance in detecting small targets remains inadequate, particularly in high-density small target scenarios common in remote sensing applications. Consequently, improving detection accuracy for small targets remains a critical challenge. Improving detection speed, reducing inference time, and enhancing accuracy under limited sample data to attain cutting-edge performance in single-stage detectors has become a pressing research challenge in this field.
Single-stage detectors frequently leverage image feature pyramids for feature extraction and multi-scale object detection26. The effectiveness of multi-scale feature extraction may be further improved by integrating gradient and context information modules alongside noise prediction techniques27. Each hierarchical layer within the feature pyramid encapsulates contextual information pertinent to a specific scale. When integrated with a well-designed and sufficiently complex network, this approach significantly enhances the detection performance for objects of varying sizes28. In practical applications, single-stage detectors are subject to greater constraints under limited computing resources. Reducing the number of model parameters can diminish their ability to adapt to variations in target scale, thereby restricting performance in such scenarios29. Within the scope of small-target detection in low-altitude remote sensing, primary challenges involve augmenting the feature information extraction efficacy of the backbone network, facilitating efficient multi-scale feature integration within the neck, and enhancing the detection head’s precision for small-target identification. Furthermore, to satisfy the real-time processing constraints of hardware systems, minimizing computational complexity through the optimization of the feature pyramid architecture is imperative.
In response to the described obstacles, this article puts forward a cutting-edge detection framework, GOOD-Net, designed for low-altitude remote sensing small target detection. The proposed framework achieves high-precision detection while optimizing storage resource utilization and fulfilling the real-time processing requirements of hardware systems. This study offers the following principal contributions:
-
A series of novel module components, including the ReSSD Block, GPSA, DECBS, etc., are introduced and integrated into the proposed GOOD-Net framework.
-
The GOOD-Net algorithm incorporates a newly designed model architecture, comprising an object-oriented, dynamically adaptive backbone network; a neck network designed to optimize the utilization of global information; and a task-specific processing head augmented for detailed feature refinement. This architecture enhances the utilization of global feature information and establishes a novel design paradigm for target detection networks.
-
Various implementations of the GOOD-Net algorithm, tailored to different model sizes, are presented. These variations enable the selection of an appropriate model based on specific computational resources or accuracy requirements.
The subsequent sections are presented in the following order: Section Related Work reviews prior research and recent advancements in object detection and small target detection within remote sensing. Section Methods offers a well-rounded presentation of the GOOD-Net detection framework. Section Results outlines the experimental setup and presents findings from comparative evaluations and ablation studies. Finally, Section Discussion and Conclusions analyzes the experimental findings and suggests directions for future work.
Related work
Target detection represents a cornerstone task in computer vision. Although YOLO has established itself as a standard in real-time detection, advancements in other frameworks, particularly the region-based two-stage detection methods such as R-CNNs30,31, have significantly influenced the development of object detection techniques.
For instance, the Cascade R-CNN32, developed by the Peking University research team based on Faster R-CNN, incorporates multiple detection stages with progressively higher IoU thresholds to screen high-quality object candidate boxes for precise filtering. Subsequently, the CenterNet detection model developed by Duan et al.33 employs key point triplet detection to evaluate whether key points near a candidate box align closely with the IoU of the ground truth box. Zhao et al.34 introduced the Real-Time Detection Transformer, which replaces the conventional transformer encoder35 with a hybrid encoder specifically optimized for processing high-dimensional features. Min et al.36 developed LHGNet, leveraging the HGNetv2 backbone to preserve channel information and enhance the extraction and integration of local details and channel features at each stage. Jiang et al.37 introduced a micro-target detection head, replacing the conventional large-target prediction head. This modification enhances fine-grained feature extraction of small targets through a multi-scale feature extraction module (MSFEM). Similarly, Wang et al.38 proposed a collaborative CNN-MLP architecture that integrates parallel token interaction mixer (PTIM) and contextual selection fusion module (CSFM).
Previous studies have primarily focused on target detection in general scenarios, achieving significant results. However, these methods encounter challenges when applied to low-altitude remote sensing image processing tasks. Such challenges include the small size of most targets relative to the overall image, significant variations in lighting and viewing angles, and reduced performance under resource constraints, particularly in edge computing deployments on resource-limited devices. To mitigate these challenges, researchers have prioritized optimizing the trade-off between computational efficiency and model efficacy. Their endeavors are directed toward improving the accurate detection of small-scale targets within low-altitude remote sensing imagery.
Zhang et al.39 proposed a two-stage training strategy for single-stage small object detection models. They introduced the Segmentation Assistance (SA) module, which helps the network focus on foreground objects. Additionally, they developed the Triplet Head with a dual distillation mechanism, which refines feature representation and enhances class discrimination. Lin et al.40 designed the Scale Selection Network (SSN), allowing the model to dynamically focus on the most relevant features by employing scale attention mechanisms and selective feature processing. Additionally, they introduced the lightweight Landmark-Guided Scale Attention Network module, which further enhances efficiency by focusing the model’s attention on the scale-specific features of selected regions. Yuan et al.41 introduced the IA-YOLOv8 model, which incorporates two innovative modules: the lightweight Intra-group Multi-scale Fusion Attention (IGMSFA) and Adaptive Weighted Feature Fusion (AWFF). The IGMSFA module facilitates the efficient capture and fusion of multi-scale semantic information from various groups within the input features. Conversely, the AWFF module adaptively assigns weights to individual feature channels, thereby optimizing the fusion of high-level features. Qu et al.42 introduced the Attention Mechanism and Multi-Scale Feature Fusion Network. This model enhances detection performance by incorporating an attention mechanism that enables the network to prioritize significant features, alongside cross-scale feature fusion, which optimizes the detection of tiny objects across different spatial resolutions. In contrast, Wang et al.43 proposed a novel Embedded Cross Framework with Dual-Path Transformer (ECF-DT), which addresses feature space discrepancies through multi-scale contextual aggregation. By integrating a dual-path transformer to fuse fine-grained visual contexts and a unit fusion module to enhance channel-wise positive information, their framework achieves robust performance in complex scenarios with cluttered backgrounds.
Despite significant advancements in the domain of remote sensing, widely utilized detection methodologies persist in encountering challenges in attaining an optimal balance between precision and computational efficiency.
Methods
To balance computational overhead and model performance, thereby achieving improved overall performance and higher detection accuracy in low-altitude remote sensing image processing tasks, this paper introduces the GOOD-Net algorithm. The proposed algorithm incorporates several novel components, including Ghost Position-Sensitive Attention (GPSA) and a Reparameterized Stacked Squeeze-and-Excitation Detail-Enhanced Block (ReSSD Block). Additionally, it introduces Detail Enhanced Convolution with BatchNorm and SiLu (DECBS), Modulated Deformable Convolution (MDConv), and Spatial-Channel Decoupled Downsampling (SCDown) to enhance feature representation. The proposed algorithm integrates a novel architecture comprising an optimized backbone network, a hierarchical neck network, and a task-specific processing head, which collectively represent its key distinction from conventional object detectors.
The proposed algorithm reconstructs a dynamic object-oriented backbone network, enhancing its capacity for discriminative feature learning across distinct targets and improving feature extraction efficacy. Additionally, the neck network is optimized to enable comprehensive cross-scale feature fusion, thereby improving global information utilization. The task-specific processing head is further refined to prioritize high-frequency feature refinement in targets, ensuring thorough feature optimization. Compared to traditional convolutional backbone networks and feature pyramid architectures, these advancements achieve finer-grained feature extraction, broader contextual awareness, and more efficient integration of global features. Consequently, the method addresses current limitations in detection efficiency by enhancing both precision and computational effectiveness.
Workflow of the GOOD-Net algorithm.
Algorithm 1 presents the workflow through pseudocode. Figure 1 illustrates the overall structure of GOOD-Net, comprising the backbone, neck network, and task processing head. It visualizes the architecture, highlighting the relationships among key components and the implemented enhancements.
Overview of the model architecture of the GOOD-Net algorithm.
The input image undergoes transformation into a multi-scale feature representation through the backbone network. As the network becomes deeper, the aspect ratio and spatial resolution of the feature maps progressively diminish, whereas the number of feature channels continues to expand. The hierarchical feature information derived from the backbone network are systematically amalgamated across multiple scales within the neck network, thereby optimizing and enhancing the comprehensive feature representation. In this stage, features from the P1, P2, and P3 layers undergo average pooling and are fused with features from the P4 layer to capture global information comprehensively. Subsequently, features extracted from the P2, P3, and P5 layers are input into the task-specific processing head for the final stages of feature extraction and detection. This structural design effectively balances computational efficiency and detection accuracy, leveraging global features for robust performance in low-altitude remote sensing image processing tasks.
The GOOD-Net algorithm is categorized into four scalable models: narrow (n), slender (s), medium (m), and leviathan (l). These models are primarily defined by adjusting three key parameters: Depth, Width, and Max Channels. The Depth parameter delineates the network’s architectural complexity by specifying the number of layers, thereby influencing the efficacy of feature extraction. The Width parameter determines the dimensionality of feature channels per layer, shaping the breadth of the feature representations. By modulating the channel count in each convolutional layer, this parameter governs the model’s representational capacity while concurrently dictating its computational complexity. Finally, the Max Channels parameter limits the maximum number of channels allowed in a specific layer or module. This restriction mitigates the risk of excessive channel growth when the Width parameter is increased, thereby preventing hardware limitations such as insufficient video memory or computational capacity. Furthermore, it helps reduce the likelihood of overfitting, which can result from an overly complex model architecture.
Table 1 summarizes the key differences in the configurations of depth, width, and maximum channels across varying model sizes within the GOOD-Net framework. For smaller models (e.g., GOOD-Net-n), stricter limitations on depth and width are imposed to preserve computational efficiency, while constraints on the maximum channel count are partially relaxed. This trade-off ensures sufficient feature extraction capability despite the reduced model complexity. Conversely, larger models (e.g., GOOD-Net-l) are assigned greater width and depth to enhance performance; however, the maximum channel count is subject to stricter constraints. This design balances robust feature learning with regularization, thereby preventing overfitting.
Detail enhanced convolution with BatchNorm and SiLu
Compared to traditional target detection tasks, low-altitude aerial image processing involves smaller target proportions, often compounded by challenges such as varying ambient light conditions, lens or subject motion, and resulting issues like blurring or smearing. In these scenarios, high-frequency target information, including edges and contours, is pivotal to the model’s feature learning and recognition mechanisms. Conventional convolutional layers operate over an unconstrained solution space, typically initialized randomly, which restricts their capacity to model these high-frequency features effectively. In order to overcome this limitation, the present study introduces the DECBS module as an integral component of the GOOD-Net algorithm, thereby augmenting the algorithm’s capacity to perceive and learn high-frequency feature.
Structure of the DECBS module.
The architectural configuration of DECBS, depicted in Fig. 2, comprises convolutional aggregation layers, BatchNorm layers, and SiLU activation function layers. The convolutional aggregation layer comprises five parallel convolutions: a standard Conv2d, a central difference (CD) convolution, an angular difference (AD) convolution, a horizontal difference (HD) convolution, and a vertical difference (VD) convolution. These parallel convolutions are strategically designed to enhance feature extraction by combining intensity-level and gradient-level information. While the standard convolution primarily captures intensity-based features, the differential convolutions are tailored to encode gradient-based priors by computing pixel pair differences explicitly. By embedding this prior information directly into the network, the differential convolutions significantly enhance its representational capacity and generalization performance. They leverage gradient-level features to achieve more robust feature representations.
The five parallel convolutions in the convolutional aggregation layer of DECBS utilize identical kernel size, stride, and padding parameters to compute the corresponding weights and biases. During the final calculation, these weights and biases are aggregated, reducing the convolution layer to a single standard convolution. This design ensures that the parameter size remains comparable to that of conventional convolution operations, minimizing additional computational overhead and memory requirements during the inference phase. Furthermore, a BatchNorm layer and an activation function follow the convolutional layer. This setup accelerates convergence, introduces nonlinearity, and enhances the network’s generalization capabilities.
Modulated Deformable convolution
Due to the variability in perspectives associated with low-altitude remote sensing images, targets often appear at different scales and with varying geometric structures across scenes. This variability poses significant challenges to the feature learning capabilities of algorithmic models. To address this issue, the GOOD-Net algorithm incorporates MDConv into its model structure, enhancing its ability to adapt to these variations.
Given that conventional convolution operations are constrained by a rigid planar geometry, their capacity to model irregular geometries is inherently limited. Deformable Convolution (DConv)44 mitigates this limitation by augmenting spatial sampling locations with learned offsets, which are derived from the target task without requiring additional supervision. This design allows DConv to adapt to geometric changes in objects. However, while its spatial support aligns more closely with object structures compared to conventional convolutional networks, it may still extend beyond the region of interest. This can cause the features to be influenced by irrelevant image content.
Workflow of the MDConv module.
To address these limitations, MDConv improves upon DConv by introducing a modulation mechanism. This mechanism adds a weight calculation branch, enhancing its ability to control the spatial support area. Through this mechanism, MDConv not only adjusts the offsets of perceived input features but also modulates their amplitudes across various spatial locations. The workflow of the MDConv module is portrayed in Fig. 3. In extreme cases, MDConv can effectively nullify the signal from specific locations by setting their feature amplitudes to zero, thereby minimizing or eliminating the influence of irrelevant content. This additional modulation mechanism provides MDConv with a greater flexibility to fine-tune its spatial coverage, ensuring greater precision in feature representation.
Equations (1) and (2) present the core formulas of DConv and MDConv, respectively.
Here, \(y\) denotes the output feature information, while \(x\) represents the input feature information. The parameter \(w_j\) efers to the shared projection weight of the \(j\)-th sampling point, and \(q_j\) indicates the \(j\)-th sampling point within the network. The term \(\Delta q_j\) represents the offset of the \(j\)-th sampling point, and \(\Delta m_i\) denotes the modulation scalar associated with the \(j\)-th sampling point.
Analyzing these formulas reveals that the primary distinction between MDConv and DConv lies in the modulation mechanism (\(\Delta m_i\)) and the optimized learning of the offset (\(\Delta q_j\)). These enhancements enable MDConv to more accurately capture relevant features, thereby improving accuracy, generalization, and adaptability across tasks. The comparison clarifies how each model addresses spatial transformations and offers valuable insights into the improvements introduced in MDConv relative to DConv.
Reparameterized stacked squeeze-and-excitation detail-enhanced block
This study presents the ReSSD Block, a novel approach aimed at enhancing the efficiency of feature extraction within the network. The ReSSD Block comprises four primary elements: Conv, RepConv, the stacked squeeze-and-excitation (SSE) Block, and DECBS, organized in both sequential and parallel configurations. The architectural framework is demonstrated in Fig. 4.
Overall design and functional workflow of the ReSSD Block.
RepConv is an improved version of the RepVGG Block45. It adopts the reparameterization design principle, removes the SE Block46 to improve processing speed, and incorporates an activation function to enable nonlinear expression. Similarly, the SSE Block is a redesigned SE Block. By introducing four stacked parallel branches, it decomposes features into multiple streams, processes them independently after global average pooling, and ultimately fuses them to generate channel weights. This multi-branch structure allows for fine-grained decomposition and fusion of input features, capturing richer feature patterns and enhancing the precision of channel weights.
When feature information enters the ReSSD Block, it first undergoes the convolution operation to adjust and split the channels. Subsequently, the features pass through RepConv, multiple stacked SSE Blocks in series, and DECBS along the downward path, eventually reaching the output. Additionally, each internal operation in the ReSSD Block outputs a secondary branch that contributes to the final output via concatenation. This design mimics a residual structure, improving feature preservation and integration. Finally, the fused and concatenated output undergoes a 1\(\times\)1 convolution operation. Cross-channel information fusion is achieved through a combination of linear transformation and nonlinear activation functions.
This process synthesizes features from diverse modules, thereby augmenting the network’s capacity to discern intricate channel patterns and encapsulate global contextual information, culminating in enhanced feature representation and improved performance.
Backbone network
The backbone architecture of the GOOD-Net algorithm is composed of several integral components: Conv, ReSSD Block, MDConv, SCDown, Spatial Pyramid Pooling-Fast (SPPF), and GPSA. The comprehensive structure and functional workflow of this network are illustrated in Fig. 5.
Among these, SCDown25 is specifically designed to mitigate the computational overhead of the model. SCDown operates through two sequential convolutional (Conv) steps. Initially, the input feature information is processed through a 1\(\times\)1 convolution to modify the channel dimensions and facilitate information fusion. Subsequently, a second convolutional layer processes the features to generate the output. This structural design significantly reduces memory usage and computational resource requirements compared to traditional convolution operations, while optimizing the model’s performance.
The GPSA module is introduced to enhance feature processing and extraction capabilities within the backbone network. It integrates GhostConv47 and GPSA Block components, leveraging their combined efficiency and feature representation strengths. GhostConv operates using a two-step convolution (Conv) process. First, the input feature information is processed using an initial Conv operation. This is followed by a secondary convolution utilizing a large-kernel Conv operation. The outputs generated from these two steps are concatenated to construct the final result. Compared to traditional convolution operations, the design of GhostConv reduces computational overhead and minimizes feature redundancy, thereby enhancing efficiency. The GPSA Block, by contrast, represents a Position-Sensitive Attention mechanism that integrates multi-head attention with a feedforward neural network layer, leveraging GhostConv for its construction. This architectural integration, implemented in a sequential manner, enhances the algorithm’s capacity to discern position-specific features while simultaneously optimizing computational efficiency.
Overall design and functional workflow of the Backbone Network within the GOOD-Net framework.
In the backbone network of the GOOD-Net algorithm, the feature stream is hierarchically divided into multiple levels for processing, with each level employing distinct downsampling operations such as Conv, MDConv, and SCDown. For enhanced feature extraction, the ReSSD Block is applied uniformly across all levels. At the terminal stage of the backbone network, the SPPF18 and GPSA modules are utilized to fuse feature information across multiple scales, enabling the capture of contextual information at varying resolutions.
Compared to traditional convolutional backbone networks, this design enables the learning of irregularly shaped features, improving the granularity of feature extraction. It effectively balances resource overhead with model performance, efficiently utilizes global context information, and excels in low-altitude aerial image processing and related tasks.
Task processing header
Given that most targets in low-altitude remote sensing images are relatively small, with only a minority being of medium or large size, the model incorporates features from levels P2, P3, and P5 as inputs to the task processing head. This methodology seeks to augment the algorithm’s comprehensive performance, particularly in terms of improving detection accuracy across diverse target scales.
The task processing head is implemented in a decoupled design, comprising two independent branches, each optimized with distinct loss functions to perform target classification and bounding box regression tasks separately. This decoupling aims to enhance model convergence efficiency. Both branches operate in parallel and utilize a combination of Conv and DECBS modules to extract features related to category and position. Subsequently, a 1\(\times\)1 convolutional layer utilizing Conv2d is implemented to address both classification and regression objectives. The use of DECBS, in conjunction with convolutional operations, facilitates robust feature extraction and task-specific optimization. Figure 6 offers a comprehensive depiction of the overall structure and workflow of the task processing header.
Traditional methods for calculating (x, y, w, h) coordinates using L1/L2 loss or IoU direct regression often fail to account for center point offsets and shape mismatches, resulting in unstable target box regression and significant numerical errors. To address these issues, this study employs a combination of distribution focal loss (DFL) and complete intersection over union (CIoU) for bounding box regression. This approach leverages the complementary strengths of DFL and CIoU to enhance bounding box localization accuracy, mitigate coordinate regression errors and center point deviations, and improve the detection of small objects and targets with mismatched aspect ratios. The fundamental equations defining these loss functions are presented in Eqs. (3) and (4), respectively.
In this context, y denotes the original true label, while S represents the probability distribution P(y), obtained after processing through the Softmax layer. The Euclidean distance between these center points is represented by \(\rho\), whereas c signifies the diagonal length of the closure area encompassing the two rectangular boxes. The variables \(\bf{b}\) and \(\bf{b}^{gt}\) denote the center points of two rectangular boxes. The parameter v quantifies the consistency in the relative proportions of the two rectangular boxes, and \(\alpha\) serves as the weight coefficient.
Structure of the task processing head in the GOOD-Net model.
For the target classification task, binary cross-entropy loss (BCE) is employed as the Classification Loss. The loss function is presented in Eq. (5).
Here, \(\text {Loss}_{\text {cls}}\)denotes the classification loss, while \(\text {Loss}_{\text {box}}\) refers to the bounding box regression loss.
Results
Datasets
The VisDrone dataset48, developed by the AISKYEYE team at the Machine Learning and Data Mining Laboratory of Tianjin University, is specifically curated to evaluate drone performance in object detection and tracking tasks. This comprehensive dataset comprises 288 video clips, 10,209 static images, and 261,908 video frames, captured under diverse conditions, including varying altitudes, weather scenarios, and lighting environments, spanning a wide geographical range. The dataset is methodically partitioned into 6,471 images for training, 548 images for validation, and 1,610 images for testing. The resolution of the images spans from 480\(\times\)360 pixels to 2000\(\times\)1500 pixels. The label statistics for the target categories and the dataset partitioning in the VisDrone dataset are presented in Fig. 7.
The dataset includes a wide variety of scenes from 14 cities across China, covering 10 categories such as pedestrians, cars, tricycles, and motorcycles. Altogether, more than 2.6 million bounding boxes were manually annotated to delineate the target objects within the entire dataset. These annotations are further enriched with attributes such as object category, scene visibility, and object occlusion, capturing the diversity and complexity of urban environments. Given the variability in target sizes, image backgrounds, and capture angles characteristic of drone photography, this dataset functions as a key reference for assessing the performance of computer vision algorithms.
Label statistics and dataset division for target categories in the VisDrone dataset.
In addition, the Car Parking Lot Dataset (CARPK)49 was employed in the robustness experiments. Developed by Meng-Ru Hsieh et al. from National Taiwan University and GE Global Research, it comprises approximately 90,000 car instances captured from four parking lots using PHANTOM 3 PROFESSIONAL drones at an altitude of about 40 meters. This large-scale dataset facilitates car counting in diverse parking scenarios. Each image is annotated with bounding boxes indicating the top-left and bottom-right coordinates of individual vehicles. The annotations support object counting, localization, and further analyses. The CARPK dataset contains a single target category-Car-and the label distribution and dataset partition are illustrated in Fig. 8.
Label statistics and dataset division for target categories in the CARPK dataset.
Evaluation metrics
Evaluation metrics are fundamental in gauging the performance of object detection models, with the choice of these metrics contingent on the specific objectives and practical applications within the scope of neural network-based systems. When applied to object detection tasks, prediction outcomes are commonly dichotomized into two primary classifications: positive and negative instances. These instances are subsequently categorized into four essential types:
-
True Positives (TP): Instances that are accurately identified as positive by the model, indicating that the model has correctly classified all true positive cases.
-
False Positives (FP): Instances that are incorrectly classified as positive by the model, although they are, in fact, negative.
-
False Negatives (FN): Instances where the model fails to correctly identify positive cases, instead classifying them as negative.
-
True Negatives (TN): Instances that are accurately identified as negative, demonstrating the model’s ability to correctly classify all true negative cases.
This study leverages these categories to compute key evaluation metrics, including Model Parameters (Params), Giga Floating-point Operations Per Second (GFLOPs), Precision (P), Recall (R), Average Precision (AP) and mean Average Precision (\(\textrm{mAP}_{50}\), \(\textrm{mAP}_{50-95}\)). These metrics comprehensively assess the model’s effectiveness and computational efficiency, facilitating a nuanced understanding of its performance.
P quantifies the proportion of true positive predictions relative to the total number of instances classified as positive, thereby evaluating the algorithm’s efficacy in accurately identifying relevant cases. Conversely, R denotes the ratio of true positive instances to the total actual positive instances, serving as a measure of the algorithm’s comprehensiveness in detecting all relevant cases. The formal definitions and computational methodologies for these metrics are delineated in Equations (6) and (7).
AP quantifies the area under the precision-recall (PR) curve, offering a comprehensive measure of detection performance across varying recall thresholds. The mean average precision (mAP), a widely adopted metric, is computed as the mean of the AP values across all categories. The formal definitions and computational procedures for these metrics are delineated in Eqs. (8) and (9).
\(\textrm{mAP}_{50}\) denotes the mean average precision at a fixed Intersection over Union (IoU) threshold of 0.5 and is widely employed as a reference measure for gauging the detection accuracy of object detection models, particularly in scenarios involving small objects. In contrast, \(\textrm{mAP}_{50-95}\) represents the mean average precision computed across multiple IoU thresholds ranging from 0.5 to 0.95, providing a more rigorous and comprehensive evaluation of a model’s performance across varying object sizes and scene complexities.
The Params represents the total trainable weights within the model, which determine its capacity. GFLOPs denote the billions of floating-point operations required by a model to process inputs, serving as an indicator of computational complexity. A lower GFLOPs value is advantageous for resource-constrained or real-time scenarios, as it minimizes computational resource demands and enhances efficiency.
Experimental environment and hyperparameters
In the comparison and ablation experiments, the hardware environment comprised a server with an AMD EPYC 9654 processor and two NVIDIA RTX 4090 graphics cards. Table 2 presents the hardware and software configurations.
The edge device deployment experiment is conducted using the NVIDIA Jetson AGX Orin Developer Kit. This platform features a 12-core Arm\(\circledR\) Cortex\(\circledR\)-A78AE v8.2 64-bit CPU and a 2048-core NVIDIA Ampere architecture GPU with 64 Tensor Cores. Detailed specifications are provided in Table 3.
Table 4 summarizes the hyperparameters used during model training, which were chosen to maximize computational efficiency and ensure compatibility with the experimental setup. The hardware specifications, software configurations, and hyperparameter settings remained consistent throughout all experiments. To guarantee the reliability and reproducibility of results, the PyTorch setting “torch.use\(\_\)deterministic\(\_\)algorithms” was enabled.
Comparative experiments
In this section, YOLO1150 was selected as the baseline model for performance evaluation and comparison with GOOD-Net. YOLO11 was chosen due to its established effectiveness in target detection tasks, providing a robust benchmark for assessing the capabilities of the proposed approach.
The GOOD-Net-m model demonstrates significant improvements over the higher-capacity Baseline-x model, reducing the Params by 74.5\(\%\) and computational demands by 76.2\(\%\). Notably, it achieves an increase in precision, improving \(\textrm{mAP}_{50}\) and \(\textrm{mAP}_{50-95}\) by 8.0\(\%\) and 7.7\(\%\), respectively. Similarly, when compared to the lower-complexity Baseline-n model, the GOOD-Net-n model reduces the Params by 34.6\(\%\) and computational requirements by 20.6\(\%\), while achieving substantial gains in precision, with improvements of 18.6\(\%\) for \(\textrm{mAP}_{50}\) and 21.4\(\%\) for \(\textrm{mAP}_{50-95}\). These results underscore the efficiency and performance enhancements achieved by the GOOD-Net models across varying model capacities. The comparative experimental results are summarized in Table 5.
In addition, various advanced methods, such as MFFSODNet37, DDSC-YOLO51, OB-YOLO52, ITS-YOLO53, SOD-YOLO-n54, Sod-Uav55, DAID-YOLO56, HSP-YOLO57, CRL-YOLO58, LE-YOLO36, HRFNet59, EBC-YOLO60, PTCDet61, GFL+EFC62, DMTNet63, EFA-Net64, LMF-UAV-l65, CSFCANet66, GD-PAN-L67, Van6-DETR68, and LKR-DETR69, were selected as baseline models for comparative experiments. These models, representing state-of-the-art methodologies for object detection in low-altitude remote sensing imagery, were benchmarked against the proposed GOOD-Net to emphasize its performance superiority.
The results of the comparative experiments with other advanced models are summarized in Table 6. In comparison with the OB-YOLO model, the GOOD-Net-s model demonstrates a 3.2\(\%\) improvement in the \(\textrm{mAP}_{50}\) accuracy index while achieving a 22.2\(\%\) diminution in the Params and a 38.3\(\%\) decline in computational cost. Similarly, compared to the Sod-Uav model, the GOOD-Net-m model achieves a notable 9.0\(\%\) improvement in the \(\textrm{mAP}_{50}\) accuracy index, accompanied by a 55.0\(\%\) diminution in the Params and a 63.4\(\%\) decline in computational cost. Moreover, GOOD-Net models of varying sizes consistently exhibit competitive performance across the comparative experiments.
Comparative analysis of experimental outcomes between GOOD-Net and other models on the VisDrone dataset.
Figure 9 illustrates the results of the comparative test, offering a more direct and intuitive comparison of the performance differences between the GOOD-Net algorithm and other methods. The horizontal axis represents the number of parameters or computational complexity, while the vertical axis depicts algorithm accuracy. Within this coordinate system, points closer to the upper left corner indicate a better balance between complexity and accuracy, reflecting superior overall performance.
Ablation experiments
The smallest model is selected as a representative example to systematically evaluate the contribution of each component through ablation studies, using YOLO11n as the baseline.
Table 7 summarizes the experimental results. In the table, “A” denotes the integration of the proposed GPSA component into the baseline structure. “B” represents the incorporation of the novel neck network, while “C” indicates the addition of the ReSSD Block feature extraction module. “D” signifies the replacement of certain downsampling convolution operations with the SCDown component, and “E” indicates the deployment of the newly designed detection head. “F” corresponds to the introduction of MDConv to replace some downsampling layers. Finally, “G” reflects the optimization of model parameters, including depth, width, and maximum channel size. The final architecture derived from these enhancements is referred to as the GOOD-Net-n model.
Deploying each component separately reveals its individual contribution to the model’s performance. During the progressive integration of components, incorporating the proposed neck network achieves the largest reduction in model parameters-28.0% (from 2.5M to 1.8M)-while improving the \(\textrm{mAP}_{50}\) and \(\textrm{mAP}_{50-95}\) scores by 12.7% and 15.5%, respectively. Similarly, integrating the detection head yields the most substantial decrease in computational cost, reducing it by 29.6% (from 8.1 GFLOPs to 5.7 GFLOPs) and enhancing \(\textrm{mAP}_{50}\) and \(\textrm{mAP}_{50-95}\) by 4.1.
Furthermore, the results indicate that the cumulative deployment of components yields greater performance improvements than adding each component independently, highlighting their synergistic effects.
Robustness experiments
To evaluate the scalability of the GOOD-Net algorithm, a robustness experiment was conducted to compare its performance against other algorithms on the CARPK dataset. Specifically, YOLO11 was used as the baseline model, while WFFA-SSD70, FS-SSD71, and SF-SSD72 served as reference algorithms for comparison.
Table 8 presents the robustness experiment results. GOOD-Net-m outperforms all baseline models in detection accuracy while reducing the number of parameters by 74.5% and computational cost by 76.1% compared to baseline model-x. Among all methods, GOOD-Net-m achieves the highest \(\textrm{mAP}_{50}\) score, whereas GOOD-Net-l attains the highest \(\textrm{mAP}_{50-95}\) score. These findings demonstrate the scalability of the GOOD-Net algorithm, which consistently delivers superior overall performance across various datasets compared to the baseline method.
Edge device deployment experiments
To assess the practical applicability of the GOOD-Net algorithm, an experiment was conducted deploying the model on edge devices using the VisDrone dataset. YOLO11 served as the baseline for performance comparison. The evaluation considered key metrics, including accuracy, deployment latency on both edge devices and servers.
Table 9 presents the results of the edge device experiment. A comparison of the computational latency between the algorithm deployed on the server and on the edge device reveals that the baseline model operates 8–10 times slower on the edge device than on the server. In contrast, the GOOD-Net algorithm exhibits a latency approximately 6.5 times slower on the edge device than on the server. This finding indicates that the structural design of GOOD-Net is more compatible with edge computing. Although GOOD-Net has slightly higher latency than a baseline model of similar specifications, it meets real-time processing requirements while providing a 16–21% improvement in accuracy compared to the baseline model. These results demonstrate the practical applicability of the GOOD-Net algorithm.
Results visualization
Deep neural network models are inherently complex nonlinear systems, frequently characterized as “black boxes” owing to their limited transparency and interpretability. To demonstrate the performance differences between the proposed GOOD-Net algorithm and baseline models, this study employs various visualization techniques, including confusion matrices, precision-recall curves, feature heatmaps, and visual comparisons. These methods provide a thorough and interpretable evaluation of model performance. Given that low-altitude remote sensing image processing tasks frequently require deployment on edge devices, algorithms must meet stringent real-time and computational efficiency criteria. To ensure the compared models are representative of practical applications, the smaller-scale YOLO11n model is chosen as the baseline. The GOOD-Net-n model, characterized by reduced parameter complexity, lower computational demands, and higher detection accuracy, is compared against this baseline.
The confusion matrix is a fundamental framework for assessing the efficacy of classification models. It offers a structured tabular representation that juxtaposes predicted and actual outcomes, wherein rows generally denote the actual categories, and columns denote the predicted categories (or vice versa). Each cell in the matrix encapsulates the frequency of instances corresponding to a particular combination of actual and predicted classifications. The diagonal elements signify accurate predictions, whereas the off-diagonal elements delineate instances of misclassification.
Comparison of normalized confusion matrices: (a) Baseline model and (b) GOOD-Net.
Figure 10 presents the normalized confusion matrices of two models: (a) the baseline model and (b) the GOOD-Net-n model. Each column in the confusion matrix is normalized to clearly illustrate the proportion of different prediction outcomes within each category, thereby mitigating the bias caused by varying label quantities. The diagonal elements represent correct predictions (true positives and true negatives), while off-diagonal elements indicate misclassifications. Specifically, the first row of off-diagonal elements corresponds to instances where true labels are incorrectly predicted as background (false negatives), and the last column represents cases where the background is misclassified as a labeled category (false positives). Compared to the baseline model, the GOOD-Net-n matrix displays a more pronounced diagonal and reduced off-diagonal elements, indicating a significant improvement in overall prediction accuracy.
The PR curve serves as a critical evaluative tool for quantifying the performance of classification models, especially in scenarios involving imbalanced datasets. This curve encapsulates the equilibrium between Precision and Recall across diverse decision thresholds, thereby offering both a visual representation and a rigorous quantitative analysis of model performance. A PR curve that converges toward the upper-right corner, along with a substantial area under the curve (AUC), indicates enhanced model performance by effectively balancing Precision and Recall.
Comparison of precision-recall curves: (a) Baseline model and (b) GOOD-Net.
Figure 11 illustrates the PR curves for two models: (a) the baseline model and (b) the GOOD-Net-n model. The GOOD-Net-n model demonstrates a curve that is consistently closer to the ideal point (upper right corner) and achieves a larger AUC, indicating enhanced detection capability. These results reflect that GOOD-Net-n effectively optimizes the equilibrium between Precision and Recall, thereby outperforming the baseline model in overall detection performance.
Comparison of attention heatmaps: (a) Original Image, (b) Baseline model and (c) GOOD-Net.
The Grad-CAM73 visualization method is employed to generate heat maps that highlight the focus of different models on various features within the original image. Figure 12 presents the generated heat maps: (a) displays the unprocessed image, (b) shows the baseline model’s feature focus, and (c) illustrates the feature focus of the GOOD-Net-n model. Compared to the baseline model, the GOOD-Net-n model exhibits finer feature perception granularity, reduces attention to irrelevant background elements, and enhances the detection accuracy of densely overlapping small objects. Notably, although the GOOD-Net-n model demonstrates superior feature attention, some background areas still attract attention, possibly due to the similarity between background features and the annotated objects.
Comparison of detection results for samples in the VisDrone-test dataset: (a) Baseline model and (b) GOOD-Net.
In the VisDrone-test dataset, one sample from each image category is randomly selected under varying lighting conditions (slightly overexposed, low light, and normal illumination) for visual comparison. Figure 13 presents the predictions of both the baseline and the proposed GOOD-Net-n models across these scenarios. The results indicate that GOOD-Net-n demonstrates enhanced robustness to changes in ambient lighting and complex backgrounds. Notably, it outperforms the baseline in detecting distant small targets, objects in low-light environments, and densely packed or overlapping objects. Under these challenging conditions, it achieves a substantially higher number of correct detections than the baseline model.
Discussion and conclusions
In low-altitude remote sensing image processing, significant challenges arise due to the small proportions of most targets relative to the overall image, significant changes in lighting conditions and camera perspectives, and the computational constraints inherent to edge computing environments. To address these complexities, this study proposes the Global Object-Oriented Dynamic Network (GOOD-Net), a novel algorithmic framework designed to enhance feature extraction and processing efficiency. The architecture of GOOD-Net is underpinned by three meticulously engineered components: an object-oriented, dynamically adaptive backbone network; a neck network designed to optimize the utilization of global information; and a task-specific processing head augmented for detailed feature refinement. These frameworks synergistically optimize the exploitation of global feature representations while maintaining a harmony between resource utilization and performance outcomes.
Extensive ablation studies validate the effectiveness and necessity of the proposed components and structures, emphasizing their individual contributions to the model and their complementary benefits when used together. Comparative experiments further confirm the superior performance of the GOOD-Net algorithm on key evaluation metrics, demonstrating a significant improvement over the baseline model, particularly in the VisDrone dataset. Additionally, robustness and edge-device deployment experiments substantiate the scalability and practical applicability of GOOD-Net. The result visualization section enhances qualitative and quantitative analysis through various visualization techniques, providing a more intuitive demonstration of GOOD-Net’s advantages.
Notably, while GOOD-Net exhibits outstanding performance across all scales, some prediction errors and omissions persist, and computational delay remains an area for optimization. Addressing these limitations presents a promising direction for future research, aiming to enhance detection algorithm performance and broaden potential application scenarios.
Data availability
The VisDrone dataset utilized in this study is publicly accessible and can be retrieved from its official repository at https://github.com/VisDrone/VisDrone-Dataset. Likewise, the source code and experimental results of the GOOD-Net algorithm are available at https://github.com/Tdzdele/GOOD-Net. For additional data requests pertaining to this study, interested parties are encouraged to contact the corresponding authors, Shaoyun Guan or Yining Jin, to acquire the necessary resources.
References
Ke, R., Li, Z., Tang, J., Pan, Z. & Wang, Y. Real-time traffic flow parameter estimation from uav video based on ensemble classifier and optical flow. IEEE Trans. Intel. Transp. Syst. 20, 54–64. https://doi.org/10.1109/TITS.2018.2797697 (2019).
Khan, S. D., Alarabi, L. & Basalamah, S. A unified deep learning framework of multi-scale detectors for geo-spatial object detection in high-resolution satellite images. Arab. J. Sci. Eng. 47, 9489–9504. https://doi.org/10.1007/s13369-021-06288-x (2022).
Muksimova, S., Umirzakova, S., Shoraimov, K., Baltayev, J. & Cho, Y.-I. Novelty classification model use in reinforcement learning for cervical cancer. Cancers https://doi.org/10.3390/cancers16223782 (2024).
Khan, S. D. & Basalamah, S. Scale and density invariant head detection deep model for crowd counting in pedestrian crowds. Vis. Comput. 37, 2127–2137. https://doi.org/10.1007/s00371-020-01974-7 (2021).
Muksimova, S., Umirzakova, S., Baltayev, J. & Cho, Y.-I. Rl-cervix.net: A hybrid lightweight model integrating reinforcement learning for cervical cell classification. Diagnostics https://doi.org/10.3390/diagnostics15030364 (2025).
Basalamah, S., Khan, S. D., Felemban, E., Naseer, A. & Rehman, F. U. Deep learning framework for congestion detection at public places via learning from synthetic data. J. King Saud Univ. Comput. Inf. Sci. 35, 102–114. https://doi.org/10.1016/j.jksuci.2022.11.005 (2023).
Muksimova, S., Mardieva, S. & Cho, Y.-I. Deep encoder-decoder network-based wildfire segmentation using drone images in real-time. Remote Sens. https://doi.org/10.3390/rs14246302 (2022).
Khan, S. D., Bandini, S., Basalamah, S. & Vizzari, G. Analyzing crowd behavior in naturalistic conditions: Identifying sources and sinks and characterizing main flows. Neurocomputing 177, 543–563. https://doi.org/10.1016/j.neucom.2015.11.049 (2016).
Muksimova, S., Umirzakova, S., Sultanov, M. & Cho, Y. I. Cross-modal transformer-based streaming dense video captioning with neural ode temporal localization. Sensors https://doi.org/10.3390/s25030707 (2025).
Liu, W. et al. Ssd: Single shot multibox detector. In Leibe, B., Matas, J., Sebe, N. & Welling, M. (eds.) Computer Vision – ECCV 2016, 21–37 (Springer International Publishing, Cham, 2016).
Lin, T.-Y., Goyal, P., Girshick, R., He, K. & Dollár, P. Focal loss for dense object detection. IEEE Trans. Pattern Anal. Mach. Intel. 42, 318–327. https://doi.org/10.1109/TPAMI.2018.2858826 (2020).
Purkait, P., Zhao, C. & Zach, C. Spp-net: Deep absolute pose regression with synthetic views (2017). arXiv: 1712.03452.
Han, J., Ding, J., Xue, N. & Xia, G.-S. Redet: A rotation-equivariant detector for aerial object detection. in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2785–2794, https://doi.org/10.1109/CVPR46437.2021.00281 (2021).
Redmon, J., Divvala, S., Girshick, R. & Farhadi, A. You only look once: Unified, real-time object detection. in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 779–788, https://doi.org/10.1109/CVPR.2016.91 (2016).
Redmon, J. & Farhadi, A. Yolo9000: Better, faster, stronger. in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 6517–6525, https://doi.org/10.1109/CVPR.2017.690 (2017).
Redmon, J. & Farhadi, A. Yolov3: An incremental improvement (2018). arXiv: 1804.02767.
Bochkovskiy, A., Wang, C.-Y. & Liao, H.-Y. M. Yolov4: Optimal speed and accuracy of object detection (2020). arXiv: 2004.10934.
Jocher, G. Ultralytics yolov5, https://doi.org/10.5281/zenodo.3908559 (2020).
Wang, C.-Y., Bochkovskiy, A. & Liao, H.-Y. M. Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 7464–7475, https://doi.org/10.1109/CVPR52729.2023.00721 (2023).
Zhang, J., Lv, Y., Tao, J., Huang, F. & Zhang, J. A robust real-time anchor-free traffic sign detector with one-level feature. IEEE Trans. Emerg. Top. Comput. Intel. 8, 1437–1451. https://doi.org/10.1109/TETCI.2024.3349464 (2024).
Zhan, J. et al. Yolopx: Anchor-free multi-task learning network for panoptic driving perception. Pattern Recogn. 148, 110152. https://doi.org/10.1016/j.patcog.2023.110152 (2024).
Li, C. et al. Yolov6: A single-stage object detection framework for industrial applications (2022). arXiv: 2209.02976.
Jocher, G., Chaurasia, A. & Qiu, J. Ultralytics yolov8 (2023).
Wang, C.-Y., Yeh, I.-H. & Liao, H.-Y. M. Yolov9: Learning what you want to learn using programmable gradient information (2024). arXiv: 2402.13616.
Wang, A. et al. Yolov10: Real-time end-to-end object detection (2024). arXiv: 2405.14458.
Lou, H. et al. Dc-yolov8: Small-size object detection algorithm based on camera sensor. Electronics https://doi.org/10.3390/electronics12102323 (2023).
Meng, S. et al. A robust infrared small target detection method jointing multiple information and noise prediction: Algorithm and benchmark. IEEE Trans. Geosci. Remote Sens. 61, 1–17. https://doi.org/10.1109/TGRS.2023.3295932 (2023).
Dai, X. et al. Dynamic detr: End-to-end object detection with dynamic attention. in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2968–2977, https://doi.org/10.1109/ICCV48922.2021.00298 (2021).
Chen, Z., Ji, H., Zhang, Y., Zhu, Z. & Li, Y. High-resolution feature pyramid network for small object detection on drone view. IEEE Trans. Circ. Syst. Video Technol. 34, 475–489. https://doi.org/10.1109/TCSVT.2023.3286896 (2024).
Ren, S., He, K., Girshick, R. & Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intel. 39, 1137–1149. https://doi.org/10.1109/TPAMI.2016.2577031 (2017).
Girshick, R. Fast R-CNN. in 2015 IEEE International Conference on Computer Vision (ICCV), 1440–1448, https://doi.org/10.1109/ICCV.2015.169 (2015).
Cai, Z. & Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6154–6162, https://doi.org/10.1109/CVPR.2018.00644 (2018).
Duan, K. et al. Centernet: Keypoint triplets for object detection. in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 6568–6577, https://doi.org/10.1109/ICCV.2019.00667 (2019).
Zhao, Y. et al. Detrs beat yolos on real-time object detection. in 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 16965–16974, https://doi.org/10.1109/CVPR52733.2024.01605 (2024).
Vaswani, A. et al. Attention is all you need. in Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, 6000-6010 (Curran Associates Inc., Red Hook, NY, USA, 2017).
Yue, M., Zhang, L., Huang, J. & Zhang, H. Lightweight and efficient tiny-object detection based on improved yolov8n for uav aerial images. Drones https://doi.org/10.3390/drones8070276 (2024).
Jiang, L. et al. Mffsodnet: Multiscale feature fusion small object detection network for uav aerial images. IEEE Trans. Instrum. Meas. 73, 1–14. https://doi.org/10.1109/TIM.2024.3381272 (2024).
Wang, Z., Wang, C., Li, X., Xia, C. & Xu, J. MLP-NET: Multilayer perceptron fusion network for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 63, 1–13. https://doi.org/10.1109/TGRS.2024.3515648 (2025).
Zhang, J., Hong, Z., Chen, X. & Li, Y. Few-shot object detection for remote sensing imagery using segmentation assistance and triplet head. Remote Sens. https://doi.org/10.3390/rs16193630 (2024).
Lin, Z. & Leng, B. Ssn: Scale selection network for multi-scale object detection in remote sensing images. Remote Sens. https://doi.org/10.3390/rs16193697 (2024).
Yuan, Z. et al. Small object detection in uav remote sensing images based on intra-group multi-scale fusion attention and adaptive weighted feature fusion mechanism. Remote Sens. https://doi.org/10.3390/rs16224265 (2024).
Qu, J., Tang, Z., Zhang, L., Zhang, Y. & Zhang, Z. Remote sensing small object detection network based on attention mechanism and multi-scale feature fusion. Remote Sens. https://doi.org/10.3390/rs15112728 (2023).
Wang, B., Yang, M., Cao, P. & Liu, Y. A novel embedded cross framework for high-resolution salient object detection. Appl. Intell. 55, 277. https://doi.org/10.1007/s10489-024-06073-x (2025).
Dai, J. et al. Deformable convolutional networks. in 2017 IEEE International Conference on Computer Vision (ICCV), 764–773, https://doi.org/10.1109/ICCV.2017.89 (2017).
Ding, X. et al. Repvgg: Making vgg-style convnets great again. in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 13728–13737, https://doi.org/10.1109/CVPR46437.2021.01352 (2021).
Hu, J., Shen, L. & Sun, G. Squeeze-and-excitation networks. in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018).
Han, K. et al. Ghostnet: More features from cheap operations. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020).
Zhu, P. et al. Detection and tracking meet drones challenge. IEEE Trans. Pattern Anal. Mach. Intel. 44, 7380–7399. https://doi.org/10.1109/TPAMI.2021.3119563 (2021).
Hsieh, M.-R., Lin, Y.-L. & Hsu, W. H. Drone-based object counting by spatially regularized regional proposal networks. in The IEEE International Conference on Computer Vision (ICCV) (IEEE, 2017).
Jocher, G. & Qiu, J. Ultralytics yolo11 (2024).
Zhou, S. & Zhou, H. Detection based on semantics and a detail infusion feature pyramid network and a coordinate adaptive spatial feature fusion mechanism remote sensing small object detector. Remote Sens. 16, https://doi.org/10.3390/rs16132416 (2024).
Liu, R., Zhu, Y., Wang, Y. & Xing, Z. Ob-yolo: A UAV image detection model for reducing computational resource consumption. in 2024 International Joint Conference on Neural Networks (IJCNN), 1–8, https://doi.org/10.1109/IJCNN60899.2024.10649954 (2024).
Li, T., Chen, X., Xie, X., Liu, F. & Pan, Y. Small-object detection based on its-yolov5s-p2. in 2024 5th International Conference on Computer Engineering and Intelligent Control (ICCEIC), 17–20, https://doi.org/10.1109/ICCEIC64099.2024.10775919 (2024).
Li, Y. et al. Sod-yolo: Small-object-detection algorithm based on improved yolov8 for UAV images. Remote Sens. https://doi.org/10.3390/rs16163057 (2024).
Li, Y., Wang, Y., Ma, Z., Wang, X. & Tang, Y. Sod-UAV: Small object detection for unmanned aerial vehicle images via improved yolov7. in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7610–7614, https://doi.org/10.1109/ICASSP48485.2024.10448458 (2024).
Ping, H., Jie, L. & Huahong, Z. Daid-yolo: Small object detection algorithm for drone aerial images. in 2024 9th International Conference on Signal and Image Processing (ICSIP), 591–595, https://doi.org/10.1109/ICSIP61881.2024.10671547 (2024).
Zhang, H., Sun, W., Sun, C., He, R. & Zhang, Y. Hsp-yolov8: UAV aerial photography small target detection algorithm. Drones https://doi.org/10.3390/drones8090453 (2024).
Wang, Z. et al. Improved small object detection algorithm CRL-YOLOV5. Sensors https://doi.org/10.3390/s24196437 (2024).
Chen, Z., Ji, H., Zhang, Y., Liu, W. & Zhu, Z. Hybrid receptive field network for small object detection on drone view. Chin. J. Aeronaut. 38, 103127. https://doi.org/10.1016/j.cja.2024.06.036 (2025).
Luo, H. et al. Ebc-yolo: A remote sensing target recognition model adapted for complex environments. Earth Sci. Informatics 18, 282. https://doi.org/10.1007/s12145-025-01808-x (2025).
Su, J., Qin, Y., Jia, Z. & Hou, Y. PTCDet: Advanced uav imagery target detection. Sci. Rep. 14, 27403. https://doi.org/10.1038/s41598-024-78749-w (2024).
Xiao, Y., Xu, T., Yu, X., Fang, Y. & Li, J. A lightweight fusion strategy with enhanced interlayer feature correlation for small object detection. IEEE Trans. Geosci. Remote Sens. 62, 1–11. https://doi.org/10.1109/TGRS.2024.3457155 (2024).
Zhang, Y., Sang, X., Sun, Y., Liu, S. & Zhou, S. DMTNet: Dual-domain adaptive multi-scale feature fusion network with transformer for small target detection. Vis. Comput. https://doi.org/10.1007/s00371-025-03825-9 (2025).
Liu, X., Zhang, G. & Zhou, B. An efficient feature aggregation network for small object detection in UAV aerial images. J. Supercomput. 81, 548. https://doi.org/10.1007/s11227-025-06987-4 (2025).
Yang, W., He, Q. & Li, Z. A lightweight multidimensional feature network for small object detection on UAVs. Pattern Anal. Appl. 28, 29. https://doi.org/10.1007/s10044-024-01389-3 (2025).
Li, J., Zheng, C., Chen, P., Zhang, J. & Wang, B. Small object detection in UAV imagery based on channel-spatial fusion cross attention. Signal Image Video Process. 19, 302. https://doi.org/10.1007/s11760-025-03850-0 (2025).
Sun, F., He, N., Li, R., Wang, X. & Xu, S. Gd-pan: A multiscale fusion architecture applied to object detection in UAV aerial images. Multimed. Syst. 30, 143. https://doi.org/10.1007/s00530-024-01342-8 (2024).
Lu, X. et al. Van-DETR: Enhanced real-time object detection with vanillanet and advanced feature fusion. Vis. Comput. 41, 4221–4238. https://doi.org/10.1007/s00371-024-03656-0 (2025).
Dong, Y., Xu, F. & Guo, J. LKR-DETR: Small object detection in remote sensing images based on multi-large kernel convolution. J. Real-Time Image Process. 22, 46. https://doi.org/10.1007/s11554-025-01622-0 (2025).
Li, M., Pi, D. & Qin, S. An efficient single shot detector with weight-based feature fusion for small object detection. Sci. Rep. 13, 9883. https://doi.org/10.1038/s41598-023-36972-x (2023).
Liang, X., Zhang, J., Zhuo, L., Li, Y. & Tian, Q. Small object detection in unmanned aerial vehicle images using feature fusion and scaling-based single shot detector with spatial context analysis. IEEE Trans. Circ. Syst. Video Technol. 30, 1758–1770. https://doi.org/10.1109/TCSVT.2019.2905881 (2020).
Yu, J., Gao, H., Sun, J., Zhou, D. & Ju, Z. Spatial cognition-driven deep learning for car detection in unmanned aerial vehicle imagery. IEEE Trans. Cogn. Dev. Syst. 14, 1574–1583. https://doi.org/10.1109/TCDS.2021.3124764 (2022).
Selvaraju, R. R. et al. Grad-cam: Visual explanations from deep networks via gradient-based localization. Int. J. Comput. Vis. 128, 336–359. https://doi.org/10.1007/s11263-019-01228-7 (2020).
Acknowledgements
This study was supported by the Heilongjiang Province Philosophy and Social Sciences Research Planning Annual Project (No. 23KSD105), the National College Students’ Innovation and Entrepreneurship Training Program of China (Nos. 202410240065 and 202310240082), the Heilongjiang Provincial College Students’ Innovation and Entrepreneurship Training Program (Nos. S202410240020 and S202410240033), and the Harbin University of Commerce College Students’ Innovation and Entrepreneurship Training Program (No. 202410240191).
Author information
Authors and Affiliations
Contributions
Conceptualization, D.T.; methodology, D.T.; software, D.T.; validation, D.T. and S.T.; formal analysis, D.T. and S.T.; investigation, D.T. and S.T.; resources, D.T., S.T., Y.W.; data curation, D.T. and S.T.; writing—original draft preparation, D.T., S.T. and Y.W.; writing—review and editing, D.T., S.T., Y.W., S.G. and Y.J.; visualization, D.T. and S.T.; supervision, D.T.; project administration, D.T.; funding acquisition, D.T., S.T., Y.W., S.G. and Y.J. All authors have read and agreed to the published version of the manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Tang, D., Tang, S., Wang, Y. et al. A global object-oriented dynamic network for low-altitude remote sensing object detection. Sci Rep 15, 19071 (2025). https://doi.org/10.1038/s41598-025-02194-6
Received:
Accepted:
Published:
Version of record:
DOI: https://doi.org/10.1038/s41598-025-02194-6
















