Introduction

Remote sensing and object detection technologies have been instrumental in driving advancements across diverse domains, including urban traffic management, intelligent monitoring systems, vehicle tracking, and medical diagnostics1,2,3,4,5,6,7,8,9. By enabling real-time data acquisition and analysis, these technologies have contributed to improved system efficiency, accuracy, and innovation in these fields. Although deep learning-based detection models10,11,12,13 have achieved notable success in detecting medium and large objects with high accuracy, the detection of small objects in images remains a considerable challenge due to intrinsic factors such as low resolution, occlusion, and constrained feature representation. In addition, the images processed by target detection algorithms in low-altitude remote sensing equipment differ significantly from those in traditional datasets due to complex background scenes, diverse perspective changes, and variations in illumination. These factors significantly increase the probability of both false positive and false negative outcomes. Moreover, the limited hardware resources of low-altitude remote sensing equipment necessitate detection models that balance real-time processing speed with efficient parameter scaling. Consequently, the development of a small target detection algorithm tailored for remote sensing imagery, which ensures enhanced detection accuracy while optimizing storage efficiency, represents a critical area of research with extensive interdisciplinary applications.

Deep learning-driven object detection approaches, including single-stage and dual-stage detectors, are typically categorized as anchor-based14,15,16,17,18,19,20,21 and anchor-free methods22,23,24,25. Anchor-based methods rely on predefined anchor boxes to detect objects, while anchor-free methods eliminate the need for anchor box design by directly predicting object positions and sizes. The anchor-free approach simplifies the detection process and enhances model flexibility. However, the absence of anchor prior information often results in reduced detection accuracy for anchor-free methods. Two-stage detectors typically achieve higher detection accuracy by refining candidate regions through additional computational steps. However, this refinement comes at the cost of slower inference. By comparison, single-stage detectors employ dense grids or anchors during the detection process, performing detection in a single forward pass, thereby eliminating the need for candidate box generation. While this approach enables faster inference, it sacrifices some degree of detection accuracy. As a pioneering single-stage detector, the You Only Look Once (YOLO)14 model partitions the input feature map into a grid, generating predictions for both the coordinates and category likelihoods of the bounding boxes within each grid cell. Since its introduction, YOLO has demonstrated significant advantages in detection speed and processing efficiency. However, its performance in detecting small targets remains inadequate, particularly in high-density small target scenarios common in remote sensing applications. Consequently, improving detection accuracy for small targets remains a critical challenge. More broadly, improving detection speed, reducing inference time, and enhancing accuracy under limited sample data to attain cutting-edge performance in single-stage detectors has become a pressing research challenge in this field.

Single-stage detectors frequently leverage image feature pyramids for feature extraction and multi-scale object detection26. The effectiveness of multi-scale feature extraction may be further improved by integrating gradient and context information modules alongside noise prediction techniques27. Each hierarchical layer within the feature pyramid encapsulates contextual information pertinent to a specific scale. When integrated with a well-designed and sufficiently complex network, this approach significantly enhances the detection performance for objects of varying sizes28. In practical applications, single-stage detectors are subject to greater constraints under limited computing resources. Reducing the number of model parameters can diminish their ability to adapt to variations in target scale, thereby restricting performance in such scenarios29. Within the scope of small-target detection in low-altitude remote sensing, primary challenges involve augmenting the feature information extraction efficacy of the backbone network, facilitating efficient multi-scale feature integration within the neck, and enhancing the detection head’s precision for small-target identification. Furthermore, to satisfy the real-time processing constraints of hardware systems, minimizing computational complexity through the optimization of the feature pyramid architecture is imperative.

In response to the described obstacles, this article puts forward a cutting-edge detection framework, GOOD-Net, designed for low-altitude remote sensing small target detection. The proposed framework achieves high-precision detection while optimizing storage resource utilization and fulfilling the real-time processing requirements of hardware systems. This study offers the following principal contributions:

  • A series of novel module components, including the ReSSD Block, GPSA, DECBS, etc., are introduced and integrated into the proposed GOOD-Net framework.

  • The GOOD-Net algorithm incorporates a newly designed model architecture, comprising an object-oriented, dynamically adaptive backbone network; a neck network designed to optimize the utilization of global information; and a task-specific processing head augmented for detailed feature refinement. This architecture enhances the utilization of global feature information and establishes a novel design paradigm for target detection networks.

  • Various implementations of the GOOD-Net algorithm, tailored to different model sizes, are presented. These variations enable the selection of an appropriate model based on specific computational resources or accuracy requirements.

The subsequent sections are presented in the following order: Section Related Work reviews prior research and recent advancements in object detection and small target detection within remote sensing. Section Methods offers a well-rounded presentation of the GOOD-Net detection framework. Section Results outlines the experimental setup and presents findings from comparative evaluations and ablation studies. Finally, Section Discussion and Conclusions analyzes the experimental findings and suggests directions for future work.

Related work

Target detection represents a cornerstone task in computer vision. Although YOLO has established itself as a standard in real-time detection, advancements in other frameworks, particularly the region-based two-stage detection methods such as R-CNNs30,31, have significantly influenced the development of object detection techniques.

For instance, the Cascade R-CNN32, developed by the Peking University research team based on Faster R-CNN, incorporates multiple detection stages with progressively higher IoU thresholds to screen high-quality object candidate boxes for precise filtering. Subsequently, the CenterNet detection model developed by Duan et al.33 employs key point triplet detection to evaluate whether key points near a candidate box align closely with the IoU of the ground truth box. Zhao et al.34 introduced the Real-Time Detection Transformer, which replaces the conventional transformer encoder35 with a hybrid encoder specifically optimized for processing high-dimensional features. Min et al.36 developed LHGNet, leveraging the HGNetv2 backbone to preserve channel information and enhance the extraction and integration of local details and channel features at each stage. Jiang et al.37 introduced a micro-target detection head, replacing the conventional large-target prediction head. This modification enhances fine-grained feature extraction of small targets through a multi-scale feature extraction module (MSFEM). Similarly, Wang et al.38 proposed a collaborative CNN-MLP architecture that integrates parallel token interaction mixer (PTIM) and contextual selection fusion module (CSFM).

Previous studies have primarily focused on target detection in general scenarios, achieving significant results. However, these methods encounter challenges when applied to low-altitude remote sensing image processing tasks. Such challenges include the small size of most targets relative to the overall image, significant variations in lighting and viewing angles, and reduced performance under resource constraints, particularly in edge computing deployments on resource-limited devices. To mitigate these challenges, researchers have prioritized optimizing the trade-off between computational efficiency and model efficacy. Their endeavors are directed toward improving the accurate detection of small-scale targets within low-altitude remote sensing imagery.

Zhang et al.39 proposed a two-stage training strategy for single-stage small object detection models. They introduced the Segmentation Assistance (SA) module, which helps the network focus on foreground objects. Additionally, they developed the Triplet Head with a dual distillation mechanism, which refines feature representation and enhances class discrimination. Lin et al.40 designed the Scale Selection Network (SSN), allowing the model to dynamically focus on the most relevant features by employing scale attention mechanisms and selective feature processing. Additionally, they introduced the lightweight Landmark-Guided Scale Attention Network module, which further enhances efficiency by focusing the model’s attention on the scale-specific features of selected regions. Yuan et al.41 introduced the IA-YOLOv8 model, which incorporates two innovative modules: the lightweight Intra-group Multi-scale Fusion Attention (IGMSFA) and Adaptive Weighted Feature Fusion (AWFF). The IGMSFA module facilitates the efficient capture and fusion of multi-scale semantic information from various groups within the input features. Conversely, the AWFF module adaptively assigns weights to individual feature channels, thereby optimizing the fusion of high-level features. Qu et al.42 introduced the Attention Mechanism and Multi-Scale Feature Fusion Network. This model enhances detection performance by incorporating an attention mechanism that enables the network to prioritize significant features, alongside cross-scale feature fusion, which optimizes the detection of tiny objects across different spatial resolutions. In contrast, Wang et al.43 proposed a novel Embedded Cross Framework with Dual-Path Transformer (ECF-DT), which addresses feature space discrepancies through multi-scale contextual aggregation. By integrating a dual-path transformer to fuse fine-grained visual contexts and a unit fusion module to enhance channel-wise positive information, their framework achieves robust performance in complex scenarios with cluttered backgrounds.

Despite significant advancements in the domain of remote sensing, widely utilized detection methodologies persist in encountering challenges in attaining an optimal balance between precision and computational efficiency.

Methods

To balance computational overhead and model performance, and thereby achieve higher detection accuracy in low-altitude remote sensing image processing tasks, this paper introduces the GOOD-Net algorithm. The proposed algorithm incorporates several novel components, including Ghost Position-Sensitive Attention (GPSA) and a Reparameterized Stacked Squeeze-and-Excitation Detail-Enhanced Block (ReSSD Block). Additionally, it introduces Detail Enhanced Convolution with BatchNorm and SiLU (DECBS), Modulated Deformable Convolution (MDConv), and Spatial-Channel Decoupled Downsampling (SCDown) to enhance feature representation. The proposed algorithm integrates a novel architecture comprising an optimized backbone network, a hierarchical neck network, and a task-specific processing head, which collectively represent its key distinction from conventional object detectors.

The proposed algorithm reconstructs a dynamic object-oriented backbone network, enhancing its capacity for discriminative feature learning across distinct targets and improving feature extraction efficacy. Additionally, the neck network is optimized to enable comprehensive cross-scale feature fusion, thereby improving global information utilization. The task-specific processing head is further refined to prioritize high-frequency feature refinement in targets, ensuring thorough feature optimization. Compared to traditional convolutional backbone networks and feature pyramid architectures, these advancements achieve finer-grained feature extraction, broader contextual awareness, and more efficient integration of global features. Consequently, the method addresses current limitations in detection efficiency by enhancing both precision and computational effectiveness.

Algorithm 1 Workflow of the GOOD-Net algorithm.

Algorithm 1 presents the workflow through pseudocode. Figure 1 illustrates the overall structure of GOOD-Net, comprising the backbone, neck network, and task processing head. It visualizes the architecture, highlighting the relationships among key components and the implemented enhancements.

Fig. 1 Overview of the model architecture of the GOOD-Net algorithm.

The backbone network transforms the input image into a multi-scale feature representation. As the network becomes deeper, the spatial resolution (height and width) of the feature maps progressively diminishes, whereas the number of feature channels continues to expand. The hierarchical feature information derived from the backbone network is fused across multiple scales within the neck network, thereby enhancing the overall feature representation. In this stage, features from the P1, P2, and P3 layers undergo average pooling and are fused with features from the P4 layer to capture global information comprehensively. Subsequently, features extracted from the P2, P3, and P5 layers are input into the task-specific processing head for the final stages of feature extraction and detection. This structural design balances computational efficiency and detection accuracy, leveraging global features for robust performance in low-altitude remote sensing image processing tasks.
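The shrinking resolution along the backbone can be made concrete with a small sketch. It assumes that level P_i downsamples the input by a factor of 2^i, a common convention in YOLO-style detectors that the text does not state explicitly; the 640×640 input size and the function names are likewise illustrative rather than taken from the paper.

```python
def pyramid_sizes(height, width, levels=(1, 2, 3, 4, 5)):
    """Return {level name: (h, w)}, assuming level P_i downsamples by 2**i."""
    return {f"P{i}": (height // 2**i, width // 2**i) for i in levels}


def pooling_factor(src_level, dst_level):
    """Average-pooling window (and stride) that brings P_src to P_dst resolution."""
    if src_level > dst_level:
        raise ValueError("pooling can only reduce resolution")
    return 2 ** (dst_level - src_level)


# Assumed 640x640 input; the paper does not fix the input size here.
sizes = pyramid_sizes(640, 640)
for name, (h, w) in sizes.items():
    print(name, h, w)

# Pooling factors that align P1, P2, and P3 with P4 before fusion in the neck.
factors = {f"P{i}": pooling_factor(i, 4) for i in (1, 2, 3)}
print(factors)
```

Under this assumption, the P2 and P3 maps passed to the head keep a much finer grid than P5, which is consistent with the emphasis on small targets.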

The GOOD-Net algorithm is provided in four scalable models: narrow (n), slender (s), medium (m), and leviathan (l). These models are primarily defined by adjusting three key parameters: Depth, Width, and Max Channels. The Depth parameter delineates the network’s architectural complexity by specifying the number of layers, thereby influencing the efficacy of feature extraction. The Width parameter determines the dimensionality of feature channels per layer, shaping the breadth of the feature representations. By modulating the channel count in each convolutional layer, this parameter governs the model’s representational capacity while concurrently dictating its computational complexity. Finally, the Max Channels parameter limits the maximum number of channels allowed in a specific layer or module. This restriction mitigates the risk of excessive channel growth when the Width parameter is increased, thereby avoiding hardware bottlenecks such as insufficient video memory or computational capacity. Furthermore, it helps reduce the likelihood of overfitting, which can result from an overly complex model architecture.

Table 1 summarizes the key differences in the configurations of depth, width, and maximum channels across varying model sizes within the GOOD-Net framework. For smaller models (e.g., GOOD-Net-n), stricter limitations on depth and width are imposed to preserve computational efficiency, while constraints on the maximum channel count are partially relaxed. This trade-off ensures sufficient feature extraction capability despite the reduced model complexity. Conversely, larger models (e.g., GOOD-Net-l) are assigned greater width and depth to enhance performance; however, the maximum channel count is subject to stricter constraints. This design balances robust feature learning with regularization, thereby preventing overfitting.
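How the three parameters interact can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the rounding of channel counts to a multiple of 8 is borrowed from common YOLO-style code, and the variant settings below are hypothetical placeholders, not the values reported in Table 1.

```python
import math


def scaled_channels(base_channels, width, max_channels, divisor=8):
    """Cap the nominal channel count, scale by width, round up to a multiple of divisor."""
    capped = min(base_channels, max_channels)
    return int(math.ceil(capped * width / divisor) * divisor)


def scaled_repeats(base_repeats, depth):
    """Scale the number of stacked blocks, keeping at least one."""
    return max(round(base_repeats * depth), 1)


# Hypothetical settings for illustration only; these are NOT the Table 1 values.
variants = {
    "small-example": {"depth": 0.5, "width": 0.25, "max_channels": 1024},
    "large-example": {"depth": 1.0, "width": 1.0, "max_channels": 512},
}

for name, cfg in variants.items():
    channels = [
        scaled_channels(c, cfg["width"], cfg["max_channels"]) for c in (64, 256, 1024)
    ]
    repeats = scaled_repeats(3, cfg["depth"])
    print(name, channels, repeats)
```

In the large example the cap takes effect on the deepest layer (1024 nominal channels are limited to 512), whereas in the small example the width multiplier alone keeps every layer far below the cap.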

Table 1 Comparison of differences across various scale models of the GOOD-Net algorithm.

Detail enhanced convolution with BatchNorm and SiLU

Compared to traditional target detection tasks, low-altitude aerial image processing involves smaller target proportions, often compounded by challenges such as varying ambient light conditions, lens or subject motion, and resulting issues like blurring or smearing. In these scenarios, high-frequency target information, including edges and contours, is pivotal to the model’s feature learning and recognition mechanisms. Conventional convolutional layers operate over an unconstrained solution space, typically initialized randomly, which restricts their capacity to model these high-frequency features effectively. To overcome this limitation, the present study introduces the DECBS module as an integral component of the GOOD-Net algorithm, thereby augmenting the algorithm’s capacity to perceive and learn high-frequency features.

Fig. 2 Structure of the DECBS module.

The architectural configuration of DECBS, depicted in Fig. 2, comprises convolutional aggregation layers, BatchNorm layers, and SiLU activation function layers. The convolutional aggregation layer comprises five parallel convolutions: a standard Conv2d, a central difference (CD) convolution, an angular difference (AD) convolution, a horizontal difference (HD) convolution, and a vertical difference (VD) convolution. These parallel convolutions are strategically designed to enhance feature extraction by combining intensity-level and gradient-level information. While the standard convolution primarily captures intensity-based features, the differential convolutions are tailored to encode gradient-based priors by computing pixel pair differences explicitly. By embedding this prior information directly into the network, the differential convolutions significantly enhance its representational capacity and generalization performance. They leverage gradient-level features to achieve more robust feature representations.

The five parallel convolutions in the convolutional aggregation layer of DECBS utilize identical kernel size, stride, and padding parameters to compute the corresponding weights and biases. During the final calculation, these weights and biases are aggregated, reducing the convolution layer to a single standard convolution. This design ensures that the parameter size remains comparable to that of conventional convolution operations, minimizing additional computational overhead and memory requirements during the inference phase. Furthermore, a BatchNorm layer and an activation function follow the convolutional layer. This setup accelerates convergence, introduces nonlinearity, and enhances the network’s generalization capabilities.
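The aggregation step relies on the linearity of convolution. The sketch below illustrates it for a single 3×3 output location with only two of the five branches, a standard convolution and a central difference convolution; the angular, horizontal, and vertical difference branches are omitted. The central difference branch is assumed to apply its weights to the difference between each pixel and the patch center, and all function names are hypothetical.

```python
import random


def conv_at(patch, kernel):
    """Response of a 3x3 kernel on a 3x3 patch (one output pixel)."""
    return sum(patch[i][j] * kernel[i][j] for i in range(3) for j in range(3))


def cd_conv_at(patch, kernel):
    """Central-difference response: weights act on (pixel - center pixel)."""
    center = patch[1][1]
    return sum(
        (patch[i][j] - center) * kernel[i][j] for i in range(3) for j in range(3)
    )


def cd_to_standard(kernel):
    """Rewrite a central-difference kernel as an equivalent standard kernel."""
    total = sum(sum(row) for row in kernel)
    merged = [row[:] for row in kernel]
    merged[1][1] -= total  # subtracting the weight sum at the center encodes "- center"
    return merged


def merge_kernels(k_std, k_cd):
    """Collapse both branches into one standard 3x3 kernel."""
    k_cd_eq = cd_to_standard(k_cd)
    return [[k_std[i][j] + k_cd_eq[i][j] for j in range(3)] for i in range(3)]


random.seed(0)
patch = [[random.random() for _ in range(3)] for _ in range(3)]
k_std = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
k_cd = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)]

two_branch = conv_at(patch, k_std) + cd_conv_at(patch, k_cd)
single = conv_at(patch, merge_kernels(k_std, k_cd))
print(abs(two_branch - single) < 1e-9)
```

Because the merged kernel reproduces the two-branch response, the extra branches add cost only during training; at inference a single convolution with the merged weights suffices.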

Modulated deformable convolution

Due to the variability in perspectives associated with low-altitude remote sensing images, targets often appear at different scales and with varying geometric structures across scenes. This variability poses significant challenges to the feature learning capabilities of algorithmic models. To address this issue, the GOOD-Net algorithm incorporates MDConv into its model structure, enhancing its ability to adapt to these variations.

Given that conventional convolution operations are constrained by a rigid planar geometry, their capacity to model irregular geometries is inherently limited. Deformable Convolution (DConv)44 mitigates this limitation by augmenting spatial sampling locations with learned offsets, which are derived from the target task without requiring additional supervision. This design allows DConv to adapt to geometric changes in objects. However, while its spatial support aligns more closely with object structures compared to conventional convolutional networks, it may still extend beyond the region of interest. This can cause the features to be influenced by irrelevant image content.

Fig. 3 Workflow of the MDConv module.

To address these limitations, MDConv improves upon DConv by introducing a modulation mechanism. This mechanism adds a weight calculation branch, enhancing its ability to control the spatial support area. Through this mechanism, MDConv not only adjusts the offsets of perceived input features but also modulates their amplitudes across various spatial locations. The workflow of the MDConv module is shown in Fig. 3. In extreme cases, MDConv can effectively nullify the signal from specific locations by setting their feature amplitudes to zero, thereby minimizing or eliminating the influence of irrelevant content. This additional modulation mechanism gives MDConv greater flexibility to fine-tune its spatial coverage, resulting in more precise feature representation.

Equations (1) and (2) present the core formulas of DConv and MDConv, respectively.

$$\begin{aligned} y(q)&= \sum _{j=1}^{J} w_j \cdot x \left( q + q_j \right) \end{aligned}$$
(1)
$$\begin{aligned} y(q)&= \sum _{j=1}^{J} w_j \cdot x \left( q + q_j + \Delta q_j \right) \cdot \Delta m_j \end{aligned}$$
(2)

Here, \(y\) denotes the output feature information, while \(x\) represents the input feature information; \(q\) is the output location and \(J\) is the number of sampling points. The parameter \(w_j\) refers to the shared projection weight of the \(j\)-th sampling point, and \(q_j\) indicates the position of the \(j\)-th sampling point. The term \(\Delta q_j\) represents the offset of the \(j\)-th sampling point, and \(\Delta m_j\) denotes the modulation scalar associated with the \(j\)-th sampling point.

Analyzing these formulas reveals that the primary distinction between MDConv and DConv lies in the modulation mechanism (\(\Delta m_j\)) and the optimized learning of the offset (\(\Delta q_j\)). These enhancements enable MDConv to more accurately capture relevant features, thereby improving accuracy, generalization, and adaptability across tasks. The comparison clarifies how each operator handles spatial transformations and highlights the improvements introduced in MDConv relative to DConv.
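The effect of the offset and modulation terms in Eq. (2) can be illustrated with a one-dimensional sketch. Several simplifications are assumed here: the signal is 1-D, so linear interpolation stands in for the bilinear interpolation used on images; the offsets and modulation scalars are supplied by hand instead of being predicted by a learned branch; and all names are hypothetical.

```python
import math


def sample_linear(x, pos):
    """Linearly interpolate the 1-D signal x at a fractional position (zero outside)."""
    lo = math.floor(pos)
    frac = pos - lo

    def at(i):
        return x[i] if 0 <= i < len(x) else 0.0

    return (1.0 - frac) * at(lo) + frac * at(lo + 1)


def mdconv_1d(x, q, weights, grid, offsets, modulation):
    """Eq. (2) at output location q: sum_j w_j * x(q + q_j + dq_j) * dm_j."""
    return sum(
        w * sample_linear(x, q + g + dq) * dm
        for w, g, dq, dm in zip(weights, grid, offsets, modulation)
    )


x = [1.0, 2.0, 4.0, 8.0, 16.0]
weights = [0.5, 1.0, 0.5]
grid = [-1, 0, 1]  # regular sampling positions q_j of a 3-tap kernel

# Zero offsets and unit modulation reduce to a regular convolution.
plain = mdconv_1d(x, 2, weights, grid, [0.0, 0.0, 0.0], [1.0, 1.0, 1.0])

# Shift the first tap by half a sample and switch the third tap off entirely.
modulated = mdconv_1d(x, 2, weights, grid, [0.5, 0.0, -0.5], [1.0, 1.0, 0.0])
print(plain, modulated)
```

Setting a modulation scalar to zero removes the corresponding sampling point from the sum, which is the "nullify" behavior described for MDConv.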

Reparameterized stacked squeeze-and-excitation detail-enhanced block

This study presents the ReSSD Block, a novel approach aimed at enhancing the efficiency of feature extraction within the network. The ReSSD Block comprises four primary elements: Conv, RepConv, the stacked squeeze-and-excitation (SSE) Block, and DECBS, organized in both sequential and parallel configurations. The architectural framework is demonstrated in Fig. 4.

Fig. 4 Overall design and functional workflow of the ReSSD Block.

RepConv is an improved version of the RepVGG Block45. It adopts the reparameterization design principle, removes the SE Block46 to improve processing speed, and incorporates an activation function to enable nonlinear expression. Similarly, the SSE Block is a redesigned SE Block. By introducing four stacked parallel branches, it decomposes features into multiple streams, processes them independently after global average pooling, and ultimately fuses them to generate channel weights. This multi-branch structure allows for fine-grained decomposition and fusion of input features, capturing richer feature patterns and enhancing the precision of channel weights.

When feature information enters the ReSSD Block, it first undergoes the convolution operation to adjust and split the channels. Subsequently, the features pass through RepConv, multiple stacked SSE Blocks in series, and DECBS along the downward path, eventually reaching the output. Additionally, each internal operation in the ReSSD Block outputs a secondary branch that contributes to the final output via concatenation. This design mimics a residual structure, improving feature preservation and integration. Finally, the fused and concatenated output undergoes a 1\(\times\)1 convolution operation. Cross-channel information fusion is achieved through a combination of linear transformation and nonlinear activation functions.

This process synthesizes features from diverse modules, thereby augmenting the network’s capacity to discern intricate channel patterns and encapsulate global contextual information, culminating in enhanced feature representation and improved performance.

Backbone network

The backbone architecture of the GOOD-Net algorithm is composed of several integral components: Conv, ReSSD Block, MDConv, SCDown, Spatial Pyramid Pooling-Fast (SPPF), and GPSA. The comprehensive structure and functional workflow of this network are illustrated in Fig. 5.

Among these, SCDown25 is specifically designed to mitigate the computational overhead of the model. SCDown operates through two sequential convolutional (Conv) steps. Initially, the input feature information is processed through a 1\(\times\)1 convolution to modify the channel dimensions and facilitate information fusion. Subsequently, a second convolutional layer processes the features to generate the output. This structural design significantly reduces memory usage and computational resource requirements compared to traditional convolution operations, while optimizing the model’s performance.
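The saving can be estimated with simple parameter arithmetic. The sketch below assumes that the second convolution is a depthwise k×k convolution, as in the spatial-channel decoupled downsampling design that SCDown is cited from; the text itself only says "a second convolutional layer". Biases and normalization parameters are ignored, and the channel counts are illustrative.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights of a dense k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def scdown_params(c_in, c_out, k):
    """Weights of a 1x1 pointwise conv followed by an assumed depthwise k x k conv."""
    pointwise = c_in * c_out   # channel change and information fusion
    depthwise = k * k * c_out  # one k x k filter per output channel
    return pointwise + depthwise


# Illustrative channel counts, not taken from the paper.
c_in, c_out, k = 256, 512, 3
dense = standard_conv_params(c_in, c_out, k)
decoupled = scdown_params(c_in, c_out, k)
print(dense, decoupled, round(dense / decoupled, 2))
```

For these illustrative sizes the decoupled layout needs roughly one ninth of the weights of a dense 3×3 downsampling convolution, since the spatial filtering no longer mixes channels.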

The GPSA module is introduced to enhance feature processing and extraction capabilities within the backbone network. It integrates GhostConv47 and GPSA Block components, leveraging their combined efficiency and feature representation strengths. GhostConv operates using a two-step convolution (Conv) process. First, the input feature information is processed using an initial Conv operation. This is followed by a secondary convolution utilizing a large-kernel Conv operation. The outputs generated from these two steps are concatenated to construct the final result. Compared to traditional convolution operations, the design of GhostConv reduces computational overhead and minimizes feature redundancy, thereby enhancing efficiency. The GPSA Block, by contrast, represents a Position-Sensitive Attention mechanism that integrates multi-head attention with a feedforward neural network layer, leveraging GhostConv for its construction. This architectural integration, implemented in a sequential manner, enhances the algorithm’s capacity to discern position-specific features while simultaneously optimizing computational efficiency.

Fig. 5 Overall design and functional workflow of the Backbone Network within the GOOD-Net framework.

In the backbone network of the GOOD-Net algorithm, the feature stream is hierarchically divided into multiple levels for processing, with each level employing distinct downsampling operations such as Conv, MDConv, and SCDown. For enhanced feature extraction, the ReSSD Block is applied uniformly across all levels. At the terminal stage of the backbone network, the SPPF18 and GPSA modules are utilized to fuse feature information across multiple scales, enabling the capture of contextual information at varying resolutions.

Compared to traditional convolutional backbone networks, this design enables the learning of irregularly shaped features, improving the granularity of feature extraction. It effectively balances resource overhead with model performance, efficiently utilizes global context information, and excels in low-altitude aerial image processing and related tasks.

Task processing head

Given that most targets in low-altitude remote sensing images are relatively small, with only a minority being of medium or large size, the model incorporates features from levels P2, P3, and P5 as inputs to the task processing head. This methodology seeks to augment the algorithm’s comprehensive performance, particularly in terms of improving detection accuracy across diverse target scales.

The task processing head is implemented in a decoupled design, comprising two independent branches, each optimized with distinct loss functions to perform target classification and bounding box regression tasks separately. This decoupling aims to enhance model convergence efficiency. Both branches operate in parallel and utilize a combination of Conv and DECBS modules to extract features related to category and position. Each branch then ends with a 1\(\times\)1 Conv2d layer that produces the classification or regression output. The use of DECBS, in conjunction with convolutional operations, facilitates robust feature extraction and task-specific optimization. Figure 6 depicts the overall structure and workflow of the task processing head.

Traditional methods for calculating (x, y, w, h) coordinates using L1/L2 loss or IoU direct regression often fail to account for center point offsets and shape mismatches, resulting in unstable target box regression and significant numerical errors. To address these issues, this study employs a combination of distribution focal loss (DFL) and complete intersection over union (CIoU) for bounding box regression. This approach leverages the complementary strengths of DFL and CIoU to enhance bounding box localization accuracy, mitigate coordinate regression errors and center point deviations, and improve the detection of small objects and targets with mismatched aspect ratios. The fundamental equations defining these loss functions are presented in Eqs. (3) and (4), respectively.

$$\begin{aligned} \text {DFL}(S_i, S_{i+1})&= - \left[ (y - y_i) \log (S_{i+1}) + (y_{i+1} - y) \log (S_i) \right] \end{aligned}$$
(3)
$$\begin{aligned} \text {Loss}_{\text {CIoU}}&= 1 - \text {IoU} + \frac{\rho ^2(\mathbf {b},\mathbf {b}^{gt})}{c^2} + \alpha v \end{aligned}$$
(4)

In Eq. (3), \(y\) denotes the original true label, \(y_i\) and \(y_{i+1}\) are the two discrete values adjacent to it (\(y_i \le y \le y_{i+1}\)), and \(S_i\) and \(S_{i+1}\) are the corresponding probabilities of the distribution \(P(y)\), obtained after processing through the Softmax layer. In Eq. (4), \(\mathbf {b}\) and \(\mathbf {b}^{gt}\) denote the center points of the predicted and ground-truth boxes, \(\rho\) represents the Euclidean distance between these center points, and \(c\) signifies the diagonal length of the smallest enclosing area covering the two boxes. The parameter \(v\) quantifies the consistency of the aspect ratios of the two boxes, and \(\alpha\) serves as the weight coefficient.
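Both regression terms can be evaluated with a short sketch of Eqs. (3) and (4) for a single coordinate and a single box pair. The text does not spell out \(v\) and \(\alpha\); the code assumes the definitions commonly used for CIoU, \(v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2\) and \(\alpha = \frac{v}{1 - \text{IoU} + v}\). Boxes are assumed to be given as (x1, y1, x2, y2) corners with positive width and height, and the function names are hypothetical.

```python
import math


def dfl(y, y_i, y_next, s_i, s_next):
    """Distribution focal loss of Eq. (3) for one box coordinate."""
    return -((y - y_i) * math.log(s_next) + (y_next - y) * math.log(s_i))


def ciou_loss(box, box_gt):
    """CIoU loss of Eq. (4) for two boxes in (x1, y1, x2, y2) format."""
    x1, y1, x2, y2 = box
    g1, h1, g2, h2 = box_gt
    w, h = x2 - x1, y2 - y1
    wg, hg = g2 - g1, h2 - h1

    # Intersection over union.
    iw = max(0.0, min(x2, g2) - max(x1, g1))
    ih = max(0.0, min(y2, h2) - max(y1, h1))
    inter = iw * ih
    iou = inter / (w * h + wg * hg - inter)

    # Squared center distance (rho^2) and squared enclosing-box diagonal (c^2).
    rho2 = ((x1 + x2) / 2 - (g1 + g2) / 2) ** 2 + ((y1 + y2) / 2 - (h1 + h2) / 2) ** 2
    c2 = (max(x2, g2) - min(x1, g1)) ** 2 + (max(y2, h2) - min(y1, h1)) ** 2

    # Aspect-ratio consistency v and its weight alpha (assumed standard definitions).
    v = (4 / math.pi**2) * (math.atan(wg / hg) - math.atan(w / h)) ** 2
    alpha = v / (1 - iou + v) if v > 0 else 0.0
    return 1 - iou + rho2 / c2 + alpha * v


# A target of 2.3 lies between bins 2 and 3 with predicted probabilities 0.7 and 0.3.
print(dfl(2.3, 2, 3, 0.7, 0.3))
# Overlapping boxes with different aspect ratios.
print(ciou_loss((0.0, 0.0, 4.0, 2.0), (1.0, 0.0, 3.0, 3.0)))
```

The CIoU loss is zero only when the two boxes coincide and exceeds 1 for disjoint boxes, because the center-distance term keeps providing a gradient after the IoU has dropped to zero.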

Fig. 6 Structure of the task processing head in the GOOD-Net model.

For the target classification task, binary cross-entropy loss (BCE) is employed as the classification loss. The total loss, which combines the classification and bounding box regression terms, is presented in Eq. (5).

$$\begin{aligned} \text {Loss}_{\text {total}} = \text {Loss}_{\text {cls}} + \text {Loss}_{\text {box}} \end{aligned}$$
(5)

Here, \(\text {Loss}_{\text {cls}}\) denotes the classification loss, while \(\text {Loss}_{\text {box}}\) refers to the bounding box regression loss.
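A minimal sketch of the classification term and the sum in Eq. (5) follows. It assumes per-class sigmoid probabilities averaged over classes and an unweighted sum of the two terms, as Eq. (5) is written; any weighting factors used in the actual training setup are not given in the text, and the numeric inputs are made up for illustration.

```python
import math


def bce(probs, targets, eps=1e-12):
    """Mean binary cross-entropy over per-class probabilities."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(probs)


def total_loss(loss_cls, loss_box):
    """Eq. (5): unweighted sum of classification and box regression losses."""
    return loss_cls + loss_box


# Made-up predictions for three classes; the second class is the true one.
loss_cls = bce([0.1, 0.8, 0.2], [0, 1, 0])
# Made-up box regression loss value (e.g. a CIoU term plus a DFL term).
loss_box = 0.45
print(total_loss(loss_cls, loss_box))
```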

Results

Datasets

The VisDrone dataset48, developed by the AISKYEYE team at the Machine Learning and Data Mining Laboratory of Tianjin University, is specifically curated to evaluate drone performance in object detection and tracking tasks. This comprehensive dataset comprises 288 video clips, 10,209 static images, and 261,908 video frames, captured under diverse conditions, including varying altitudes, weather scenarios, and lighting environments, spanning a wide geographical range. The dataset is methodically partitioned into 6,471 images for training, 548 images for validation, and 1,610 images for testing. The resolution of the images spans from 480\(\times\)360 pixels to 2000\(\times\)1500 pixels. The label statistics for the target categories and the dataset partitioning in the VisDrone dataset are presented in Fig. 7.

The dataset includes a wide variety of scenes from 14 cities across China, covering 10 categories such as pedestrians, cars, tricycles, and motorcycles. Altogether, more than 2.6 million bounding boxes were manually annotated to delineate the target objects within the entire dataset. These annotations are further enriched with attributes such as object category, scene visibility, and object occlusion, capturing the diversity and complexity of urban environments. Given the variability in target sizes, image backgrounds, and capture angles characteristic of drone photography, this dataset functions as a key reference for assessing the performance of computer vision algorithms.

Fig. 7

Label statistics and dataset division for target categories in the VisDrone dataset.

In addition, the Car Parking Lot Dataset (CARPK)49 was employed in the robustness experiments. Developed by Meng-Ru Hsieh et al. from National Taiwan University and GE Global Research, it comprises approximately 90,000 car instances captured from four parking lots using PHANTOM 3 PROFESSIONAL drones at an altitude of about 40 meters. This large-scale dataset facilitates car counting in diverse parking scenarios. Each image is annotated with bounding boxes indicating the top-left and bottom-right coordinates of individual vehicles. The annotations support object counting, localization, and further analyses. The CARPK dataset contains a single target category (Car); the label distribution and dataset partition are illustrated in Fig. 8.

Fig. 8

Label statistics and dataset division for target categories in the CARPK dataset.

Evaluation metrics

Evaluation metrics are fundamental to gauging the performance of object detection models, and the choice of metrics depends on the objectives and practical application of the system. In object detection tasks, prediction outcomes are commonly divided into positive and negative instances, which are then categorized into four types:

  • True Positives (TP): Instances that the model correctly identifies as positive.

  • False Positives (FP): Instances that are incorrectly classified as positive by the model, although they are, in fact, negative.

  • False Negatives (FN): Instances where the model fails to correctly identify positive cases, instead classifying them as negative.

  • True Negatives (TN): Instances that the model correctly identifies as negative.

This study uses these categories to compute the accuracy metrics Precision (P), Recall (R), Average Precision (AP), and mean Average Precision (\(\textrm{mAP}_{50}\), \(\textrm{mAP}_{50-95}\)), and additionally reports Model Parameters (Params) and Giga Floating-point Operations (GFLOPs) as measures of model size and computational cost. Together, these metrics assess the model’s effectiveness and computational efficiency.

$$\begin{aligned} P=\frac{TP}{(TP+FP)} \end{aligned}$$
(6)

P quantifies the proportion of true positive predictions relative to the total number of instances classified as positive, thereby evaluating how reliably the algorithm's positive predictions are correct. Conversely, R denotes the ratio of true positive instances to the total number of actual positive instances, measuring how completely the algorithm detects all relevant cases. The formal definitions of these metrics are given in Eqs. (6) and (7).

$$\begin{aligned} R=\frac{TP}{(TP+FN)} \end{aligned}$$
(7)
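Eqs. (6) and (7) translate directly into code. The sketch below is illustrative only; the function name `precision_recall` and the choice to return 0.0 when a denominator is zero are assumptions for this example.

```python
def precision_recall(tp, fp, fn):
    """Precision (Eq. 6) and recall (Eq. 7) from detection counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

With 80 true positives, 20 false positives, and 20 false negatives, both precision and recall equal 0.8.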

AP quantifies the area under the precision-recall (PR) curve, offering a comprehensive measure of detection performance across varying recall thresholds. The mean average precision (mAP), a widely adopted metric, is computed as the mean of the AP values across all categories. The formal definitions and computational procedures for these metrics are delineated in Eqs. (8) and (9).

$$\begin{aligned} AP=\int _{0}^{1} {P(R)}{dR} \end{aligned}$$
(8)

\(\textrm{mAP}_{50}\) denotes the mean average precision at a fixed Intersection over Union (IoU) threshold of 0.5 and is widely employed as a reference measure for gauging the detection accuracy of object detection models, particularly in scenarios involving small objects. In contrast, \(\textrm{mAP}_{50-95}\) represents the mean average precision computed across multiple IoU thresholds ranging from 0.5 to 0.95, providing a more rigorous and comprehensive evaluation of a model’s performance across varying object sizes and scene complexities.

$$\begin{aligned} mAP=\frac{1}{k}\sum _{i=1}^kAP_i \end{aligned}$$
(9)
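To make Eqs. (8) and (9) concrete, the following sketch approximates the integral in Eq. (8) numerically from a ranked list of detections and then averages AP over classes. It assumes the common all-point interpolation with a monotone precision envelope, which the text does not specify. The function names and the 0.05 step for the \(\textrm{mAP}_{50-95}\) thresholds (the COCO-style convention) are likewise assumptions.

```python
def average_precision(is_tp, num_gt):
    """AP (Eq. 8) for one class.

    is_tp: list of booleans for detections sorted by descending confidence;
           True means the detection matched a ground-truth box.
    num_gt: number of ground-truth boxes of this class.
    """
    tp = fp = 0
    recalls, precisions = [0.0], [1.0]
    for hit in is_tp:
        tp += hit
        fp += not hit
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))

    # Precision envelope: make precision non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum rectangle areas under the enveloped precision-recall curve.
    return sum(
        (recalls[i] - recalls[i - 1]) * precisions[i]
        for i in range(1, len(recalls))
    )


def mean_average_precision(ap_per_class):
    """mAP (Eq. 9): mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


# Assumed COCO-style IoU thresholds for mAP50-95: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]
```

Under this convention, \(\textrm{mAP}_{50}\) uses only the first threshold to decide which detections count as true positives, whereas \(\textrm{mAP}_{50-95}\) repeats the computation for every threshold in `IOU_THRESHOLDS` and averages the results.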

Params represents the total number of trainable weights in the model, which determines its capacity. GFLOPs denotes the billions of floating-point operations a model requires to process an input, serving as an indicator of computational complexity. A lower GFLOPs value is advantageous for resource-constrained or real-time scenarios, as it reduces computational resource demands and improves efficiency.
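As a simple illustration of how these two quantities arise, the sketch below counts the parameters and floating-point operations of a single standard convolution layer. The helper names are hypothetical, and counting each multiply-accumulate as two operations is an assumed convention; profiling tools differ on this point, so absolute GFLOPs values are only comparable when produced by the same tool.

```python
def conv2d_params(c_in, c_out, k, bias=True):
    """Trainable weights of a k x k convolution with c_in inputs and c_out outputs."""
    return c_in * c_out * k * k + (c_out if bias else 0)


def conv2d_flops(c_in, c_out, k, h_out, w_out):
    """Floating-point operations for one forward pass (1 multiply-add = 2 ops)."""
    return 2 * c_in * c_out * k * k * h_out * w_out


params = conv2d_params(3, 16, 3)                       # 448 weights
gflops = conv2d_flops(3, 16, 3, 320, 320) / 1e9        # about 0.088 GFLOPs
```

Summing such per-layer counts over the whole network gives the model-level Params and GFLOPs figures.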

Experimental environment and hyperparameters

In the comparison and ablation experiments, the hardware environment comprised a server with an AMD EPYC 9654 processor and two NVIDIA RTX 4090 graphics cards. Table 2 presents the hardware and software configurations.

The edge device deployment experiment is conducted using the NVIDIA Jetson AGX Orin Developer Kit. This platform features a 12-core Arm\(\circledR\) Cortex\(\circledR\)-A78AE v8.2 64-bit CPU and a 2048-core NVIDIA Ampere architecture GPU with 64 Tensor Cores. Detailed specifications are provided in Table 3.

Table 4 summarizes the hyperparameters used during model training, which were chosen to maximize computational efficiency and ensure compatibility with the experimental setup. The hardware specifications, software configurations, and hyperparameter settings remained consistent throughout all experiments. To guarantee the reliability and reproducibility of results, the PyTorch setting “torch.use_deterministic_algorithms” was enabled.
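A typical way to enable this setting is sketched below. Only the call to `torch.use_deterministic_algorithms` is taken from the text. The seed value, the helper name `enable_determinism`, and the additional cuBLAS and cuDNN settings are assumptions that commonly accompany it and may not reflect the authors' exact configuration.

```python
import os
import random


def enable_determinism(seed: int = 0) -> None:
    """Seed the RNGs and request deterministic PyTorch kernels (illustrative)."""
    # Deterministic cuBLAS behavior on CUDA requires this variable; it is set
    # before torch is imported so that it is in place when CUDA initializes.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
    random.seed(seed)

    import torch  # imported here so the environment variable is set first

    torch.manual_seed(seed)
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False  # disable non-deterministic autotuning
```

Calling `enable_determinism()` once at the start of a training script is usually sufficient. Operations without a deterministic implementation will then raise an error instead of silently producing run-to-run variation.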

Table 2 Hardware and software parameter configurations for the comparative and ablation experiments.
Table 3 Hardware and software parameter configurations for edge device deployment experiments.
Table 4 Hyperparameter settings for model training.

Comparative experiments

In this section, YOLO1150 was selected as the baseline model for performance evaluation and comparison with GOOD-Net. YOLO11 was chosen due to its established effectiveness in target detection tasks, providing a robust benchmark for assessing the capabilities of the proposed approach.

Table 5 Analysis of experimental outcomes between GOOD-Net and baseline models on the VisDrone dataset.

The GOOD-Net-m model demonstrates significant improvements over the higher-capacity Baseline-x model, reducing the Params by 74.5\(\%\) and computational demands by 76.2\(\%\). Notably, it also achieves higher detection accuracy, improving \(\textrm{mAP}_{50}\) and \(\textrm{mAP}_{50-95}\) by 8.0\(\%\) and 7.7\(\%\), respectively. Similarly, when compared to the lower-complexity Baseline-n model, the GOOD-Net-n model reduces the Params by 34.6\(\%\) and computational requirements by 20.6\(\%\), while achieving substantial gains in accuracy, with improvements of 18.6\(\%\) for \(\textrm{mAP}_{50}\) and 21.4\(\%\) for \(\textrm{mAP}_{50-95}\). These results underscore the efficiency and performance enhancements achieved by the GOOD-Net models across varying model capacities. The comparative experimental results are summarized in Table 5.

Table 6 Evaluation of experimental outcomes between GOOD-Net and advanced methods on the VisDrone dataset.

In addition, various advanced methods, such as MFFSODNet37, DDSC-YOLO51, OB-YOLO52, ITS-YOLO53, SOD-YOLO-n54, Sod-Uav55, DAID-YOLO56, HSP-YOLO57, CRL-YOLO58, LE-YOLO36, HRFNet59, EBC-YOLO60, PTCDet61, GFL+EFC62, DMTNet63, EFA-Net64, LMF-UAV-l65, CSFCANet66, GD-PAN-L67, Van6-DETR68, and LKR-DETR69, were selected as baseline models for comparative experiments. These models, representing state-of-the-art methodologies for object detection in low-altitude remote sensing imagery, were benchmarked against the proposed GOOD-Net to emphasize its performance superiority.

The results of the comparative experiments with other advanced models are summarized in Table 6. In comparison with the OB-YOLO model, the GOOD-Net-s model demonstrates a 3.2\(\%\) improvement in \(\textrm{mAP}_{50}\) while achieving a 22.2\(\%\) reduction in the Params and a 38.3\(\%\) reduction in computational cost. Similarly, compared to the Sod-Uav model, the GOOD-Net-m model achieves a notable 9.0\(\%\) improvement in \(\textrm{mAP}_{50}\), accompanied by a 55.0\(\%\) reduction in the Params and a 63.4\(\%\) reduction in computational cost. Moreover, GOOD-Net models of varying sizes consistently exhibit competitive performance across the comparative experiments.

Fig. 9

Comparative analysis of experimental outcomes between GOOD-Net and other models on the VisDrone dataset.

Figure 9 illustrates the results of the comparative test, offering a more direct and intuitive comparison of the performance differences between the GOOD-Net algorithm and other methods. The horizontal axis represents the number of parameters or computational complexity, while the vertical axis depicts algorithm accuracy. Within this coordinate system, points closer to the upper left corner indicate a better balance between complexity and accuracy, reflecting superior overall performance.
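The "upper left is better" reading of such a plot can be made precise with Pareto dominance: a model is dominated if another model is at least as cheap and at least as accurate, and strictly better in one of the two. The sketch below is an illustrative aid and not part of the paper's evaluation; the function name and the example values are hypothetical and do not correspond to any entry in Table 6.

```python
def pareto_front(models):
    """Return names of models not dominated in (cost, accuracy).

    models: list of (name, cost, accuracy); lower cost and higher accuracy are better.
    """
    front = []
    for name, cost, acc in models:
        dominated = any(
            (c <= cost and a >= acc) and (c < cost or a > acc)
            for _, c, a in models
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical points: (name, parameters in millions, accuracy in percent).
example = [("A", 2.0, 30.0), ("B", 5.0, 40.0), ("C", 6.0, 35.0)]
front = pareto_front(example)  # "C" is dominated by "B"
```

Models on this front are the ones that appear toward the upper-left boundary of the scatter plot.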

Ablation experiments

The smallest model is selected as a representative example to systematically evaluate the contribution of each component through ablation studies, using YOLO11n as the baseline.

Table 7 Ablation experiment results for GOOD-Net on the VisDrone dataset.

Table 7 summarizes the experimental results. In the table, “A” denotes the integration of the proposed GPSA component into the baseline structure. “B” represents the incorporation of the novel neck network, while “C” indicates the addition of the ReSSD Block feature extraction module. “D” signifies the replacement of certain downsampling convolution operations with the SCDown component, and “E” indicates the deployment of the newly designed detection head. “F” corresponds to the introduction of MDConv to replace some downsampling layers. Finally, “G” reflects the optimization of model parameters, including depth, width, and maximum channel size. The final architecture derived from these enhancements is referred to as the GOOD-Net-n model.

Deploying each component separately reveals its individual contribution to the model’s performance. During the progressive integration of components, incorporating the proposed neck network achieves the largest reduction in model parameters, 28.0% (from 2.5M to 1.8M), while improving the \(\textrm{mAP}_{50}\) and \(\textrm{mAP}_{50-95}\) scores by 12.7% and 15.5%, respectively. Similarly, integrating the detection head yields the most substantial decrease in computational cost, reducing it by 29.6% (from 8.1 GFLOPs to 5.7 GFLOPs) and enhancing \(\textrm{mAP}_{50}\) and \(\textrm{mAP}_{50-95}\) by 4.1.
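The relative reductions quoted above follow from the before and after values given in parentheses. A short check, using a hypothetical helper named `relative_reduction`, reproduces both figures:

```python
def relative_reduction(before, after):
    """Percentage decrease from `before` to `after`."""
    return (before - after) / before * 100.0


print(round(relative_reduction(2.5, 1.8), 1))  # 28.0 (parameters, in millions)
print(round(relative_reduction(8.1, 5.7), 1))  # 29.6 (computational cost, in GFLOPs)
```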

Furthermore, the results indicate that the cumulative deployment of components yields greater performance improvements than adding each component independently, highlighting their synergistic effects.

Robustness experiments

To evaluate the scalability of the GOOD-Net algorithm, a robustness experiment was conducted to compare its performance against other algorithms on the CARPK dataset. Specifically, YOLO11 was used as the baseline model, while WFFA-SSD70, FS-SSD71, and SF-SSD72 served as reference algorithms for comparison.

Table 8 Comparative evaluation of experimental outcomes between GOOD-Net and other methods on the CARPK dataset.

Table 8 presents the robustness experiment results. GOOD-Net-m outperforms all baseline models in detection accuracy while reducing the number of parameters by 74.5% and computational cost by 76.1% compared to baseline model-x. Among all methods, GOOD-Net-m achieves the highest \(\textrm{mAP}_{50}\) score, whereas GOOD-Net-l attains the highest \(\textrm{mAP}_{50-95}\) score. These findings demonstrate the scalability of the GOOD-Net algorithm, which consistently delivers superior overall performance across various datasets compared to the baseline method.

Edge device deployment experiments

To assess the practical applicability of the GOOD-Net algorithm, an experiment was conducted deploying the model on edge devices using the VisDrone dataset. YOLO11 served as the baseline for performance comparison. The evaluation considered key metrics, including accuracy and deployment latency on both edge devices and servers.

Table 9 presents the results of the edge device experiment. A comparison of the computational latency between the algorithm deployed on the server and on the edge device reveals that the baseline model operates 8–10 times slower on the edge device than on the server. In contrast, the GOOD-Net algorithm is approximately 6.5 times slower on the edge device than on the server. This finding indicates that the structural design of GOOD-Net is more compatible with edge computing. Although GOOD-Net has slightly higher latency than a baseline model of similar specifications, it meets real-time processing requirements while providing a 16–21% improvement in accuracy compared to the baseline model. These results demonstrate the practical applicability of the GOOD-Net algorithm.
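The slowdown factors discussed here are ratios of per-image latency on the two platforms, and a latency can be converted to throughput to judge real-time suitability. The sketch below is illustrative; the function names are hypothetical, and the 30 frames-per-second target is an assumed threshold that the text does not state.

```python
def slowdown(edge_ms, server_ms):
    """How many times slower the edge device is than the server."""
    return edge_ms / server_ms


def frames_per_second(latency_ms):
    """Throughput implied by a per-image latency in milliseconds."""
    return 1000.0 / latency_ms


def meets_real_time(latency_ms, target_fps=30.0):
    """True if the latency supports the assumed real-time frame rate."""
    return frames_per_second(latency_ms) >= target_fps
```

For instance, a hypothetical model with 2.0 ms latency on the server and 13.0 ms on the edge device has a slowdown of 6.5 and still runs at roughly 77 frames per second.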

Table 9 Comparative evaluation of experimental outcomes between GOOD-Net and other methods on the VisDrone dataset in edge device deployment environment.

Results visualization

Deep neural network models are inherently complex nonlinear systems, frequently characterized as “black boxes” owing to their limited transparency and interpretability. To demonstrate the performance differences between the proposed GOOD-Net algorithm and baseline models, this study employs various visualization techniques, including confusion matrices, precision-recall curves, feature heatmaps, and visual comparisons. These methods provide a thorough and interpretable evaluation of model performance. Given that low-altitude remote sensing image processing tasks frequently require deployment on edge devices, algorithms must meet stringent real-time and computational efficiency criteria. To ensure the compared models are representative of practical applications, the smaller-scale YOLO11n model is chosen as the baseline. The GOOD-Net-n model, characterized by reduced parameter complexity, lower computational demands, and higher detection accuracy, is compared against this baseline.

The confusion matrix is a fundamental framework for assessing the efficacy of classification models. It offers a structured tabular representation that juxtaposes predicted and actual outcomes, wherein rows generally denote the actual categories, and columns denote the predicted categories (or vice versa). Each cell in the matrix encapsulates the frequency of instances corresponding to a particular combination of actual and predicted classifications. The diagonal elements signify accurate predictions, whereas the off-diagonal elements delineate instances of misclassification.

Fig. 10

Comparison of normalized confusion matrices: (a) Baseline model and (b) GOOD-Net.

Figure 10 presents the normalized confusion matrices of two models: (a) the baseline model and (b) the GOOD-Net-n model. Each column in the confusion matrix is normalized to clearly illustrate the proportion of different prediction outcomes within each category, thereby mitigating the bias caused by varying label quantities. The diagonal elements represent correct predictions (true positives and true negatives), while off-diagonal elements indicate misclassifications. Specifically, the first row of off-diagonal elements corresponds to instances where true labels are incorrectly predicted as background (false negatives), and the last column represents cases where the background is misclassified as a labeled category (false positives). Compared to the baseline model, the GOOD-Net-n matrix displays a more pronounced diagonal and reduced off-diagonal elements, indicating a significant improvement in overall prediction accuracy.
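Column normalization as described here divides each entry by the sum of its column, so that every column sums to one regardless of how many labels the corresponding category has. A minimal sketch follows, with a hypothetical function name and a made-up 2\(\times\)2 example; empty columns are left at zero by assumption.

```python
def normalize_columns(matrix):
    """Divide each entry of a confusion matrix by the sum of its column."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    col_sums = [sum(matrix[r][c] for r in range(n_rows)) for c in range(n_cols)]
    return [
        [matrix[r][c] / col_sums[c] if col_sums[c] else 0.0 for c in range(n_cols)]
        for r in range(n_rows)
    ]


# Made-up counts for illustration only.
normalized = normalize_columns([[8, 1], [2, 9]])  # [[0.8, 0.1], [0.2, 0.9]]
```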

The PR curve serves as a critical evaluative tool for quantifying the performance of classification models, especially in scenarios involving imbalanced datasets. This curve encapsulates the equilibrium between Precision and Recall across diverse decision thresholds, thereby offering both a visual representation and a rigorous quantitative analysis of model performance. A PR curve that converges toward the upper-right corner, along with a substantial area under the curve (AUC), indicates enhanced model performance by effectively balancing Precision and Recall.

Fig. 11

Comparison of precision-recall curves: (a) Baseline model and (b) GOOD-Net.

Figure 11 illustrates the PR curves for two models: (a) the baseline model and (b) the GOOD-Net-n model. The GOOD-Net-n model demonstrates a curve that is consistently closer to the ideal point (upper right corner) and achieves a larger AUC, indicating enhanced detection capability. These results reflect that GOOD-Net-n effectively optimizes the equilibrium between Precision and Recall, thereby outperforming the baseline model in overall detection performance.

Fig. 12

Comparison of attention heatmaps: (a) Original Image, (b) Baseline model and (c) GOOD-Net.

The Grad-CAM73 visualization method is employed to generate heat maps that highlight the focus of different models on various features within the original image. Figure 12 presents the generated heat maps: (a) displays the unprocessed image, (b) shows the baseline model’s feature focus, and (c) illustrates the feature focus of the GOOD-Net-n model. Compared to the baseline model, the GOOD-Net-n model exhibits finer feature perception granularity, reduces attention to irrelevant background elements, and enhances the detection accuracy of densely overlapping small objects. Notably, although the GOOD-Net-n model demonstrates superior feature attention, some background areas still attract attention, possibly due to the similarity between background features and the annotated objects.
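The core Grad-CAM computation can be sketched in a few lines: each channel of a chosen convolutional feature map is weighted by the spatial average of the gradient of the target score with respect to that channel, the weighted channels are summed, and a ReLU keeps only positive evidence. The example below uses plain nested lists and a hypothetical function name; how the activations and gradients are obtained from the network, and the final scaling to [0, 1], are assumptions outside what the text describes.

```python
def grad_cam(activations, gradients):
    """Grad-CAM heat map from K x H x W activations and their gradients."""
    h, w = len(activations[0]), len(activations[0][0])

    # Channel weights: global average of the gradients over spatial positions.
    weights = [sum(map(sum, g)) / (h * w) for g in gradients]

    # Weighted sum over channels followed by ReLU.
    cam = [
        [
            max(0.0, sum(wk * a[i][j] for wk, a in zip(weights, activations)))
            for j in range(w)
        ]
        for i in range(h)
    ]

    # Scale to [0, 1] for display (assumed post-processing step).
    peak = max(map(max, cam))
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam
```

The resulting map is typically upsampled to the input resolution and overlaid on the original image, which yields heat maps of the kind shown in Fig. 12.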

Fig. 13

Comparison of detection results for samples in the VisDrone-test dataset: (a) Baseline model and (b) GOOD-Net.

In the VisDrone-test dataset, one sample from each image category is randomly selected under varying lighting conditions (slightly overexposed, low light, and normal illumination) for visual comparison. Figure 13 presents the predictions of both the baseline and the proposed GOOD-Net-n models across these scenarios. The results indicate that GOOD-Net-n demonstrates enhanced robustness to changes in ambient lighting and complex backgrounds. Notably, it outperforms the baseline in detecting distant small targets, objects in low-light environments, and densely packed or overlapping objects. Under these challenging conditions, it achieves a substantially higher number of correct detections than the baseline model.

Discussion and conclusions

In low-altitude remote sensing image processing, significant challenges arise from the small size of most targets relative to the overall image, large variations in lighting conditions and camera perspectives, and the computational constraints inherent to edge computing environments. To address these complexities, this study proposes the Global Object-Oriented Dynamic Network (GOOD-Net), a novel algorithmic framework designed to enhance feature extraction and processing efficiency. The architecture of GOOD-Net is underpinned by three components: an object-oriented, dynamically adaptive backbone network; a neck network designed to optimize the utilization of global information; and a task-specific processing head augmented for detailed feature refinement. Together, these components improve the exploitation of global feature representations while balancing resource utilization against detection performance.

Extensive ablation studies validate the effectiveness and necessity of the proposed components and structures, emphasizing their individual contributions to the model and their complementary benefits when used together. Comparative experiments further confirm the superior performance of the GOOD-Net algorithm on key evaluation metrics, demonstrating a significant improvement over the baseline model, particularly on the VisDrone dataset. Additionally, robustness and edge-device deployment experiments substantiate the scalability and practical applicability of GOOD-Net. The results visualization section complements the quantitative analysis with qualitative evidence, providing a more intuitive demonstration of GOOD-Net’s advantages.

Notably, while GOOD-Net exhibits outstanding performance across all scales, some prediction errors and omissions persist, and computational delay remains an area for optimization. Addressing these limitations presents a promising direction for future research, aiming to enhance detection algorithm performance and broaden potential application scenarios.