Introduction

Object detection is one of the core tasks in remote sensing image analysis and has played a pivotal role in various practical applications, including military reconnaissance, geographic information systems, urban planning, environmental monitoring, and disaster assessment1. Unlike natural images, remote sensing imagery typically features high resolution, large-scale perspectives, and complex backgrounds. These characteristics result in a sparse distribution of foreground objects, significant variation in object sizes, and the prevalence of small targets that occupy only a few pixels2. Such small objects often exhibit limited texture and blurred boundaries, making them difficult to distinguish from the background. Due to these challenges, traditional detection methods based on candidate region generation and region-based feature extraction often suffer from low accuracy and high miss rates when applied to remote sensing images, failing to meet the demands of fine-grained analysis3,4.

In recent years, with the rapid advancement of deep learning, particularly convolutional neural networks (CNNs), object detection technologies have achieved remarkable progress. Conventional detection frameworks can be broadly categorized into two types: two-stage and one-stage methods. Two-stage methods, exemplified by Faster R-CNN5, first generate a set of region proposals and then perform refined classification and bounding box regression. These methods typically yield high detection accuracy but suffer from slower inference and greater structural complexity. In contrast, one-stage methods such as the YOLO series6,7,8,9,10,11 and SSD12 directly perform dense predictions on the image, offering faster inference but often underperforming in small object detection and high-precision tasks. To bridge the gap between performance and efficiency, researchers have explored innovations in network architectures, feature fusion strategies, and training techniques. With the introduction of the Transformer13 architecture into computer vision, self-attention-based object detectors such as DETR (Detection Transformer)14 have pioneered the formulation of object detection as a sequence-to-sequence prediction task. DETR eliminates the reliance on heuristic components like anchor generation and non-maximum suppression (NMS), enabling end-to-end training. However, it suffers from slow convergence and limited modeling capacity for small objects and dense scenes14. To address these issues, Deformable DETR15 was proposed, incorporating deformable attention mechanisms to significantly accelerate convergence and enhance local detail modeling. Subsequently, methods such as DAB-DETR16, DN-DETR17, and more recently RT-DETR18 have further optimized query initialization and feature interaction mechanisms, improving both detection accuracy and real-time performance. These advancements have established Transformer-based detectors as a prominent research direction. In the context of small object detection in remote sensing imagery, DETR-like methods face additional challenges, including severe scale imbalance, increased background clutter, and semantic ambiguity in features. To tackle these issues, recent methods such as D-FINE19 introduce deep feature-guided strategies for small object enhancement, using specialized structures to contract the receptive field and improve small object perception. DEIM20 adopts a dual-path interaction mechanism that integrates structural and semantic features across layers, facilitating high-level semantic fusion and mitigating background interference. In addition, these efforts21,22,23,24 signify a shift in remote sensing object detection frameworks toward finer-grained and more tightly coupled feature modeling.

Despite these advancements, remote sensing object detection still faces numerous challenges. Targets often occupy only a tiny fraction of high-resolution images, leading to feature sparsity and making small objects particularly difficult to detect. Moreover, their features are often overwhelmed or diluted by deep convolutional layers, resulting in low detection accuracy25. Existing multi-scale feature fusion strategies are typically coarse and struggle to effectively align high-level and low-level features, thereby limiting the model’s ability to perceive objects at varying scales26. In addition, background clutter further complicates the detection of small objects, making it harder to distinguish targets from surrounding noise in complex scenes. Most current methods also lack sufficient global contextual modeling, which exacerbates the challenges of detecting small-scale objects in large, cluttered environments. Furthermore, some high-accuracy models incur significant computational costs, making it difficult to achieve a balance between inference efficiency and detection performance, which hinders their practical deployment in real-world remote sensing applications.

Fig. 1. Comparison of different detection methods on the SIMD dataset in terms of accuracy and efficiency. The radius of each circle indicates the GFLOPs.

To address the core challenges of small target feature loss, rough multi-scale feature fusion, insufficient global context modeling, and difficulty in balancing detection efficiency and accuracy in remote sensing targets, this paper proposes a DEIM-based remote sensing object detection framework named HyperFusion-DEIM. The overall architecture comprises three key components: Multi-Path Attention Network (MAPNet), Scale-Aware Feature Enhancement (SAFE), and Multi-level Feature Concentration (MFC). Specifically, MAPNet enhances fine-grained feature representation and multi-scale modeling by incorporating the Shallow Robust Feature Downsampling (SRFD)27 and Multi-Path Attention Fusion (MPAF) modules. The SAFE module integrates a Transformer with the HyperACE28 structure to improve cross-layer semantic interaction and contextual awareness. The MFC module performs deep multi-level feature fusion, effectively enhancing object boundary perception and detection robustness.

Figure 1 presents a performance comparison of different models on the SIMD dataset. Although the proposed HyperFusion-DEIM exhibits relatively higher GFLOPs, it outperforms state-of-the-art detectors such as YOLO, RT-DETR, D-Fine, and DEIM in both FPS and AP, demonstrating superior detection accuracy and real-time performance.

As shown in Fig. 2, we compare DEIM with the proposed HyperFusion-DEIM in terms of the detection pipeline and output quality on remote sensing tasks. The baseline uses HGNetv229 as the backbone, relying mainly on spatial convolutions for feature extraction. However, it struggles with efficient multi-scale integration, leading to blurred textures and weak boundaries in intermediate feature maps. This is particularly problematic for small objects, resulting in missed detections and inaccurate localization.

Fig. 2. Comparison of the architecture and detection results between DEIM and the proposed HyperFusion-DEIM. By introducing MAPNet, SAFE, and MFC, the proposed method enhances feature representation and fusion capabilities, significantly improving detection performance in remote sensing imagery. The yellow circles in the bottom-right indicate targets that are falsely detected or missed by the baseline method.

The HyperFusion-DEIM introduces three key components: MAPNet, SAFE, and MFC, which significantly enhance modeling capability. Compared to DEIM, our approach retains sharper edge details and clearer textures, as shown in the generated feature maps. In the final detection results, the yellow-circled regions highlight that our method achieves more accurate detection and localization of small objects, especially in complex backgrounds. This results in fewer missed detections, particularly near road edges and occluded areas. The main contributions of this paper are as follows:

(1) We propose HyperFusion-DEIM, a unified framework that integrates small object enhancement, contextual semantic modeling, and multi-scale fusion optimization in a cascaded design, addressing key challenges in existing detection methods.

(2) To address small object sparsity, we introduce MAPNet, which improves low-level feature retention and multi-scale aggregation by combining SRFD and MPAF modules, enhancing boundary clarity and fine-grained texture modeling.

(3) We design the SAFE encoder to overcome the lack of contextual modeling and inefficient information flow. It incorporates Transformer and HyperACE mechanisms, and, together with the MFC module, enables effective fusion of global context and local semantics, improving semantic interactions and attention across hierarchical features.

(4) The MFC module mitigates multi-scale fusion challenges and semantic misalignment by establishing deep inter-layer connections and geometric alignment across feature levels, enhancing spatial coherence and saliency balance.

The remainder of this paper is organized as follows: Section “Related work” reviews related work in remote sensing object detection. Section “Method” details our proposed model and its components. Section “Experiments” presents ablation studies to validate each module’s effectiveness, followed by comparative experiments and visual analyses. Finally, Section “Conclusion” concludes the paper and outlines future research directions.

Related work

CNN-based remote sensing object detection

Traditional convolutional neural network (CNN)-based methods for object detection can be categorized into two types: two-stage and one-stage detectors.

Two-stage detectors, such as Faster R-CNN30, Mask R-CNN31, and Cascade R-CNN32, generate region proposals (RoIs) and use detection heads to classify object categories and regress bounding boxes for each RoI feature. While these methods are known for their high accuracy, especially in boundary refinement and semantic detail preservation, they suffer from slow inference speed and high training complexity, making them impractical for real-time remote sensing applications. Furthermore, cascade regressor architectures such as that of Cascade R-CNN, while improving detection for small objects, still struggle with feature misalignment between high- and low-level features, limiting their performance in high-resolution imagery where fine-grained details are critical for small object detection. Libra R-CNN33 addresses the imbalance problem in small object detection by refining the original features with non-local blocks. However, while this improves object representation, the non-local operations introduce computational overhead, limiting real-time applicability, particularly for large remote sensing images that require fast processing.

One-stage detectors, such as SSD12, RetinaNet34, and the YOLO series6,7,8,9,10,11,28,35, eliminate the need for region proposals, directly predicting bounding box coordinates and class labels using a single neural network. These methods excel in speed-critical applications due to their fast inference, but they often sacrifice detection accuracy due to feature downsampling and loss of fine-grained details, especially for small objects. The inability to retain high-resolution features makes it difficult for these models to detect small objects in remote sensing imagery, where targets often occupy only a tiny fraction of the image.

Drone-YOLO36 uses a three-layer PAFPN structure and large-resolution feature maps, specifically tailored for small target detection. While effective, its reliance on high-resolution feature maps leads to increased computational load, making it less efficient for large-scale remote sensing data. Similarly, ESOD37 introduces sparse detection heads to reduce redundant computations over background regions. However, it still struggles with multi-scale feature alignment, which is critical for detecting small targets across varying image resolutions. Feature fusion techniques like those in FFCA-YOLO38 and ACDF-YOLO39 improve local region perception and multi-scale fusion but fail to efficiently integrate global contextual information, often resulting in poor performance in cluttered and complex scenes, common in remote sensing data. In addition, there are several CNN-based approaches that leverage lightweight methods40,41 or techniques such as distillation42,43 to enhance both detection accuracy and speed.

Transformer-based remote sensing object detection

In recent years, Transformer-based object detection methods, such as DETR14 and its variants (e.g., Deformable DETR15, RT-DETR18, Drone-DETR44, MCG-RTDETR45), have gained significant attention due to their ability to model long-range dependencies using self-attention mechanisms. However, traditional Vision Transformers (ViTs) suffer from quadratic computational complexity with respect to image resolution, which becomes prohibitive for high-resolution remote sensing imagery. While optimized models like Deformable DETR and RT-DETR improve computational efficiency, they still face issues with feature misalignment across scales and ineffective handling of small objects. Despite various enhancements, the reliance on sparse attention and a limited number of queries in these models hinders their ability to capture the fine-grained features needed for detecting small objects in remote sensing imagery.

Drone-DETR44, built upon RT-DETR, incorporates a lightweight backbone and dual-path attention mechanism, improving small object detection. However, it still struggles with the complexity of large-scale background clutter in remote sensing images and scale variation, which require robust multi-scale feature fusion to handle effectively. Moreover, the introduction of Transformer-based models has increased the overall model complexity, limiting their real-time deployment capabilities.

Feature extraction and fusion

To enhance multi-scale feature representation, several architectures have been developed. The Feature Pyramid Network (FPN)46 uses a bottom-up pyramid structure to integrate features at multiple resolutions, improving small object detection. However, FPN and its extensions, such as PANet47 and BiFPN48, often suffer from inefficient feature fusion across layers, which hinders performance when small objects appear at varying scales or in complex backgrounds. While these methods improve detection accuracy by fusing multi-scale features, they still struggle to align low-level and high-level features in a way that is both computationally efficient and effective for small object detection.

CF2PN49 and LR-FPN50 introduce additional modules for cross-scale feature fusion and better shallow spatial/contextual representation, respectively. However, both still fall short in handling the global context necessary for remote sensing imagery, where background clutter and environmental variations often obscure small targets. LR-FPN, in particular, suffers from inadequate global context modeling, which is critical for distinguishing objects from background noise in remote sensing data.

Addressing the gaps with HyperFusion-DEIM

Despite the significant progress made by both CNN-based and Transformer-based detectors, small object detection in remote sensing remains a challenging problem due to issues such as feature sparsity, background clutter, and scale variation. Current methods, while improving on one aspect of detection (e.g., speed, accuracy, or feature fusion), often fail to provide a comprehensive solution that balances accuracy, efficiency, and scalability. For instance, two-stage detectors like Cascade R-CNN offer high accuracy but struggle with inference speed, while one-stage detectors like YOLO prioritize speed at the cost of fine-grained detection, and Transformer-based models like DETR face challenges with feature alignment and global context modeling.

HyperFusion-DEIM addresses these gaps by introducing a novel multi-path attention network that facilitates better feature fusion at multiple scales while also preserving fine-grained details. It enhances small object detection by employing an adaptive feature enhancement encoder that improves feature alignment and multi-scale fusion, making it more robust to background clutter and scale variations. Furthermore, the HyperACE module incorporated in HyperFusion-DEIM ensures that the global context is effectively modeled, thereby enabling the model to distinguish small objects from complex backgrounds in remote sensing imagery. This innovative approach allows HyperFusion-DEIM to balance accuracy, computational efficiency, and small object sensitivity, making it a highly effective solution for real-time remote sensing applications.

Method

In this section, we present the HyperFusion-DEIM framework. Subsection “Macroscopic architecture” introduces the overall architecture, Subsection “Multi-path attention network (MAPNet)” details the backbone MAPNet and its submodules MPAF and SRFD, and Subsection “Scale-aware feature enhancement (SAFE)” describes the proposed SAFE module and its submodule MFC.

Macroscopic architecture

Fig. 3. Architecture of HyperFusion-DEIM. The model employs a four-stage backbone to progressively extract multi-scale features through MPAF modules. The encoder integrates the HyperACE module, Transformer units, and the MFC structure, enabling efficient cross-scale feature fusion.

Our approach builds upon DEIM20 as the baseline, which introduces a novel strategy to improve both convergence speed and detection accuracy in the original DETR. DEIM fundamentally restructures the object detection pipeline by incorporating an innovative matching mechanism that accelerates training and enhances performance. The core structure of DEIM consists of three key components: the HGNetv2 backbone, an encoder integrating Transformer and PAFPN, and a decoder-based prediction head. Its primary innovation lies in the enhanced matching algorithm, which dynamically optimizes the correspondence between predicted boxes and ground truth annotations. This significantly reduces training complexity and computational cost by streamlining the alignment process, while maintaining high precision in object localization and classification. Additionally, DEIM utilizes the MAL loss, which incorporates mutual attention to model interactions between objects, further improving detection performance.

The HyperFusion-DEIM architecture consists of three key components: a feature extraction backbone (MAPNet), a feature enhancement encoder (SAFE), and a detection decoder. As shown in Fig. 3, MAPNet serves as the backbone feature extractor. The SRFD module enhances shallow semantic features, followed by a staged feature extraction pipeline (Stage 1–Stage 4) that captures multi-scale features, forming a five-level feature pyramid for fusion and semantic modeling. In the encoder, the SAFE module enables efficient semantic modeling and cross-layer interaction. It integrates a Transformer-based architecture with the HyperACE mechanism, using residual connections and up-/down-sampling paths for deep-to-shallow feature fusion. HyperACE enhances spatial context, improving accuracy for small objects. To address multi-scale misalignment and semantic inconsistency, the MFC module ensures better alignment and consistency across resolutions. In the decoder, a lightweight structure with a self-distillation mechanism optimizes supervised learning. Multi-scale features are decoded in parallel to predict bounding boxes (Box Pred) and category scores (Score Pred), enhancing inference efficiency. This design enables accurate detection and localization of multi-scale objects in remote sensing images.
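To make this data flow concrete, the following minimal PyTorch sketch shows how the three stages compose. The class and its interfaces are placeholders introduced purely for illustration; they are assumptions rather than the actual implementation, and MAPNet, SAFE, and the DEIM-style decoder are assumed to be defined elsewhere.

```python
import torch.nn as nn

class HyperFusionDEIMSketch(nn.Module):
    """Coarse data-flow sketch of the three-stage pipeline (backbone -> encoder -> decoder)."""
    def __init__(self, backbone: nn.Module, encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.backbone = backbone   # MAPNet: SRFD + Stage1..Stage4 -> multi-scale pyramid features
        self.encoder = encoder     # SAFE: Transformer + HyperACE + MFC cross-scale fusion
        self.decoder = decoder     # DEIM-style decoder with self-distillation during training

    def forward(self, images):
        feats = self.backbone(images)        # e.g., a pyramid of feature maps {P3, P4, P5, ...}
        fused = self.encoder(feats)          # context-enhanced, cross-scale fused features
        boxes, scores = self.decoder(fused)  # parallel Box Pred and Score Pred outputs
        return boxes, scores
```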

Multi-path attention network (MAPNet)

Overall structure

In remote sensing object detection, shallow feature maps often suffer from blurred edges and missing textures, while repeated downsampling during multi-level pyramid construction results in the loss of small object features. Although the HGNetv2 backbone in DEIM offers advantages in structural symmetry and computational efficiency, it remains limited in remote sensing contexts. Key limitations include insufficient multi-scale modeling, inadequate fusion of shallow and deep features, and simplistic attention mechanisms that fail to emphasize critical object regions effectively.

To address these challenges, we propose a novel backbone network, MAPNet, as shown in Fig. 4. The architecture consists of three main components: the SRFD (Fig. 4d)27, HG_Stage_MPAF blocks (Fig. 4b), and the MPAF_Block (Fig. 4c). The SRFD module enhances edge and texture representations in shallow layers, effectively mitigating the loss of small object features. Next, multi-level HG_Stage_MPAF modules process features through either depthwise separable convolutions or identity mappings, depending on whether downsampling is applied. Each stage uses parallel MPAF_Block modules to extract and fuse semantic information across scales. The MPAF_Block incorporates multi-path attention mechanisms and feature aggregation strategies to improve the representation of salient object regions while suppressing background noise.

MAPNet improves feature extraction accuracy and robustness through a staged, multi-path, and attention-guided fusion strategy. It demonstrates superior performance in multi-scale and dense object detection tasks, proving its effectiveness in remote sensing applications.

Fig. 4. Architecture of MAPNet. The architecture includes the SRFD feature extraction module, HG_Stage_MPAF modules, MPAF_Block, and the multi-stage, multi-path attention fusion mechanism.

The backbone configuration is shown in Table 1.

Table 1 Model structure and layer configuration for backbone.

The shallow robust feature downsampling (SRFD)

SRFD, shown in Fig. 4d, enhances low-level feature representation by introducing multiple parallel convolutional and pooling pathways, including operations such as GConv, DWConvD, and MaxD. By employing different downsampling techniques to extract multiple complementary feature maps, SRFD constructs a more robust feature representation capable of capturing detailed textures and structural information across multiple scales. This design significantly improves the detection accuracy of small objects.

In addition to enhancing multi-scale perceptual capability, SRFD improves the preservation of fine-grained details through effective feature fusion using concatenation followed by convolution (Cat + Conv). Furthermore, SRFD maintains low computational overhead while suppressing background noise and improving the model’s focus on salient object regions.
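As a loose illustration of this design (not the published SRFD implementation27), the sketch below arranges three complementary downsampling paths, labeled GConv, DWConvD, and MaxD after the figure, and fuses them with concatenation followed by a 1×1 convolution. The channel widths, group counts, and exact path compositions are assumptions.

```python
import math
import torch
import torch.nn as nn

class SRFDSketch(nn.Module):
    """Illustrative SRFD-style robust 2x downsampling: three complementary paths fused by Cat + Conv."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # GConv path: strided grouped convolution preserving per-group texture cues.
        self.gconv = nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1,
                               groups=math.gcd(in_ch, out_ch))
        # DWConvD path: strided depthwise convolution followed by a pointwise projection.
        self.dwconv_d = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, stride=2, padding=1, groups=in_ch),
            nn.Conv2d(in_ch, out_ch, 1),
        )
        # MaxD path: max pooling keeps the strongest local responses (edges, corners).
        self.max_d = nn.Sequential(
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(in_ch, out_ch, 1),
        )
        # Cat + Conv fusion of the three complementary downsampled views.
        self.fuse = nn.Conv2d(3 * out_ch, out_ch, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [self.gconv(x), self.dwconv_d(x), self.max_d(x)]
        return self.fuse(torch.cat(feats, dim=1))
```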

The multi-path attention fusion (MPAF)

Fig. 5. Structure of the MPAF. It consists of three parallel paths: an attention refinement path \(F_1\), an efficient feature enhancement path \(F_2\), and an information preservation path \(F_3\).

The MPAF module is designed to process the input feature map \(X_{in}\) through three parallel paths, as illustrated in Fig. 5. The attention refinement path \(F_{1}\) employs a series of iRMB (Inverted Residual Mobile Block)51 attention modules to perform hierarchical deep extraction, enabling the capture of rich multi-scale feature representations. The efficient feature enhancement path \(F_{2}\) utilizes a combination of depthwise separable convolution (DSConv), depthwise convolution (DWConv), and pointwise convolution (PWConv) to enhance feature expression and emphasize salient information. The information preservation path \(F_{3}\) applies a simple \(1\times 1\) convolution to retain the original information with minimal transformation. The outputs from all three paths are concatenated and fused via a \(1\times 1\) convolution to produce the final output \(X_{out}\), achieving a balance between depth, efficiency, and feature integrity.

In the attention refinement path \(F_{1}\), the input feature X is divided into m parallel branches ( \(Branch_{1}\) to \(Branch_{m}\)). Each branch processes its input through a series of stacked iRMB modules, where each iRMB module internally implements a self-attention mechanism to enhance feature representation. The outputs from all branches are finally fused via concatenation to produce the aggregated feature map. The overall processing can be described as follows:

$$\begin{aligned} F_1(X) & =Concat(Branch_1,Branch_2,...,Branch_m) \\ Branch_i & = iRMB_n(...(iRMB_1(X_i))),\quad i\in \{1,2,...,m\} \\ iRMB(X) & = \textrm{Conv}_{1 \times 1}\left( \textrm{DWConv}_{3 \times 3}\left( V \otimes (QK^{T})\right) + V \otimes (QK^{T})\right) + X \end{aligned}$$
(1)

This path ensures effective information transmission, where the multi-branch structure enables the capture of diverse feature representations at different levels. The self-attention mechanism further enhances the modeling of global dependencies, while convolution operations maintain sensitivity to local patterns. The residual connection within each iRMB module (i.e., \(+X\)) facilitates gradient flow and alleviates training difficulties in deep networks. This design allows the attention refinement path to efficiently extract rich and hierarchical features, making it particularly well-suited for detecting small objects in remote sensing imagery.

The efficient feature enhancement path \(F_2\) forms a feature transformation pipeline composed of three consecutive operations: an initial \(1\times 1\) convolution projects the channels, DWConv focuses on spatial feature extraction, and PWConv integrates information across channels, so that the sequence progressively reconstructs and enhances feature representations from multiple dimensions. Compared with the other paths, this branch offers a unique perspective on feature transformation. By decomposing traditional convolutions into three lightweight operations, this design significantly reduces both the number of parameters and computational complexity. The process can be formulated as:

$$\begin{aligned} F_{2}(X) = \textrm{PWConv}(\textrm{DWConv}(\textrm{Conv}_{1 \times 1}(X))) \end{aligned}$$
(2)

The information preservation path \(F_{3}\) processes the input feature X through a single \(1\times 1\) convolution layer. This path ensures that the essential semantics of the original input are retained, providing foundational support for the more complex transformation paths. The process is defined as:

$$\begin{aligned} F_3(X) = \text {Conv}_{1 \times 1}(X) \end{aligned}$$
(3)

The final output is given by:

$$\begin{aligned} X_{out} = \text {Conv}_{1 \times 1} \left( \text {Concat} \left( F_1(X_{in}), F_2(X_{in}), F_3(X_{in}) \right) \right) \end{aligned}$$
(4)

This module addresses the problem of insufficient feature representation in small object detection within remote sensing imagery. By employing a multi-path design, it enhances feature extraction capabilities using convolutional operations at different scales to preserve spatial details and mitigate information loss. In the \(F_1\) path, an attention module (indicated by the gray block on the left) is introduced, where self-attention based on query (Q), key (K), and value (V) computations strengthens the model’s focus on small objects, thereby significantly improving the detection rate of small targets in remote sensing scenarios.
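For concreteness, the sketch below assembles Eqs. (1)–(4) in PyTorch. A placeholder `SimpleiRMB` stands in for the iRMB block51, each branch is fed the full input rather than a channel split, and an extra projection after the branch concatenation is added to keep channel counts consistent; these simplifications and the channel widths are assumptions, so the code is illustrative rather than the reference implementation.

```python
import torch
import torch.nn as nn

class SimpleiRMB(nn.Module):
    """Placeholder for the iRMB block of Eq. (1): single-head spatial self-attention,
    a 3x3 depthwise conv, a 1x1 projection, and a residual connection."""
    def __init__(self, ch: int):
        super().__init__()
        self.qkv = nn.Conv2d(ch, 3 * ch, 1)
        self.dw = nn.Conv2d(ch, ch, 3, padding=1, groups=ch)
        self.proj = nn.Conv2d(ch, ch, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.qkv(x).flatten(2).chunk(3, dim=1)                 # each (b, c, h*w)
        attn = torch.softmax(q.transpose(1, 2) @ k / c ** 0.5, dim=-1)   # (b, hw, hw)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)                # V (QK^T)
        return self.proj(self.dw(out) + out) + x                         # Eq. (1)

class MPAFSketch(nn.Module):
    """Illustrative three-path MPAF block following Eqs. (1)-(4)."""
    def __init__(self, ch: int, branches: int = 2, depth: int = 2):
        super().__init__()
        # F1: attention refinement path -- m parallel branches of stacked iRMB blocks.
        self.f1 = nn.ModuleList([
            nn.Sequential(*[SimpleiRMB(ch) for _ in range(depth)]) for _ in range(branches)
        ])
        self.f1_proj = nn.Conv2d(branches * ch, ch, 1)
        # F2: efficient feature enhancement path (1x1 conv -> depthwise -> pointwise), Eq. (2).
        self.f2 = nn.Sequential(
            nn.Conv2d(ch, ch, 1),
            nn.Conv2d(ch, ch, 3, padding=1, groups=ch),
            nn.Conv2d(ch, ch, 1),
        )
        # F3: information preservation path (single 1x1 conv), Eq. (3).
        self.f3 = nn.Conv2d(ch, ch, 1)
        # Final fusion of the concatenated paths, Eq. (4).
        self.fuse = nn.Conv2d(3 * ch, ch, 1)

    def forward(self, x):
        f1 = self.f1_proj(torch.cat([branch(x) for branch in self.f1], dim=1))
        return self.fuse(torch.cat([f1, self.f2(x), self.f3(x)], dim=1))
```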

Scale-aware feature enhancement (SAFE)

Overall structure

In remote sensing object detection, traditional feature fusion techniques, such as PAFPN, aim to enhance multi-scale detection performance by aggregating features from different network layers. However, PAFPN encounters several challenges when applied to remote sensing imagery, particularly in terms of handling objects with varying scales and addressing cluttered backgrounds. First, PAFPN lacks sufficient adaptability in the fusion of features across multiple scales, hindering its ability to effectively represent both small and large objects within complex remote sensing scenes. Second, conventional fusion methods are vulnerable to background noise interference, which compromises detection accuracy and increases the likelihood of false positives and missed detections. Additionally, PAFPN often relies on simple concatenation and stacking operations for feature aggregation, which can lead to feature degradation and inadequate representation of the underlying spatial and semantic information.

To address the limitations outlined above, we propose a novel encoder architecture, illustrated in the Encoder section of Fig. 3. SAFE introduces an adaptive scale-aware mechanism designed to enhance the network’s ability to detect objects at multiple scales. Initially, SAFE employs a Transformer module to capture global dependencies, thereby improving the spatial awareness of feature maps, as depicted in Fig. 6b. Subsequently, a HyperACE module refines the feature representations, enabling the model to focus on salient regions while suppressing background noise, as shown in Fig. 6a.

SAFE significantly enhances the contextual representation of multi-scale features through the integration of a customized MFC module and a feature diffusion mechanism. This design ensures that each feature map not only preserves local spatial details but also incorporates global semantic context via deep cross-scale integration, thereby providing more discriminative and robust representations for downstream detection and classification tasks. The MFC module serves as the central innovation within SAFE and is depicted in Fig. 6c.

Fig. 6. Core architecture of the SAFE module, comprising three sub-modules: (a) HyperACE, (b) Transformer, and (c) MFC.

The feature diffusion mechanism in SAFE operates via bidirectional pathways: top-down and bottom-up, effectively propagating context-rich features across different detection scales. In the first stage, outputs from the feature focusing module are downsampled (from \(P_4\) to \(P_5\)) and upsampled (from \(P_4\) to \(P_3\)) to be fused with higher- and lower-resolution features, generating the initial multi-scale representation. In the second stage, a secondary feature aggregation process, facilitated by MFC and coupled with deep feature enhancement, enables thorough feature integration and the diffusion of contextual information.

This two-stage feature diffusion strategy effectively mitigates the scale-induced information loss commonly observed in traditional feature pyramid networks. It substantially enhances the consistency and expressiveness of multi-scale features, particularly in complex remote sensing object detection scenarios.
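A minimal sketch of the first diffusion stage is shown below, assuming a stride-2 convolution for the \(P_4 \rightarrow P_5\) path and nearest-neighbor interpolation for the \(P_4 \rightarrow P_3\) path; the actual operators and fusion order used in SAFE may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffusionStageSketch(nn.Module):
    """First-stage feature diffusion: spread the focused P4-level feature up to P3 and
    down to P5, then fuse with the original pyramid levels (illustrative only)."""
    def __init__(self, ch: int):
        super().__init__()
        self.down = nn.Conv2d(ch, ch, 3, stride=2, padding=1)  # P4 -> P5 resolution
        self.fuse3 = nn.Conv2d(2 * ch, ch, 1)                  # fuse upsampled P4 with P3
        self.fuse5 = nn.Conv2d(2 * ch, ch, 1)                  # fuse downsampled P4 with P5

    def forward(self, p3, p4_focused, p5):
        up = F.interpolate(p4_focused, size=p3.shape[-2:], mode="nearest")
        down = self.down(p4_focused)
        n3 = self.fuse3(torch.cat([p3, up], dim=1))
        n5 = self.fuse5(torch.cat([p5, down], dim=1))
        return n3, p4_focused, n5
```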

HyperACE

HyperACE (Hypergraph-based Adaptive Correlation Enhancement) is an innovative feature enhancement module introduced in YOLOv1328, designed to model high-order relationships among multi-scale features through hypergraph computation, as shown in Fig. 6a. It utilizes an adaptive hyperedge generation mechanism that dynamically captures complex dependencies across spatial positions and channels. By leveraging both global average pooling and max pooling to generate contextual vectors, and combining these vectors with learnable hyperedge prototypes, HyperACE adaptively estimates the participation strength of each vertex in every hyperedge. This approach facilitates more effective feature aggregation and enhancement.

In our framework, the fused features are passed through the HyperACE module, enabling adaptive high-order correlation modeling and feature refinement across both spatial locations and detection scales. Subsequently, the FullPAD mechanism ensures efficient and structured information propagation by distributing the correlation-enhanced features through three distinct pathways: from the backbone to the neck, within the inner layers of the neck, and from the neck to the detection head. This organization guarantees a seamless and effective flow of enriched features throughout the network.
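The simplified sketch below conveys the adaptive hyperedge idea: pooled context vectors condition learnable hyperedge prototypes, a soft participation (incidence) matrix links spatial vertices to hyperedges, and features are aggregated vertex-to-hyperedge and back. It is a loose interpretation of HyperACE28 for illustration, not the YOLOv13 implementation; the normalization choice and the number of hyperedges are assumptions.

```python
import torch
import torch.nn as nn

class HyperACESketch(nn.Module):
    """Simplified adaptive hypergraph correlation: soft vertex-to-hyperedge participation
    followed by two-step aggregation (vertices -> hyperedges -> vertices)."""
    def __init__(self, ch: int, num_edges: int = 8):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_edges, ch))  # learnable hyperedge prototypes
        self.ctx = nn.Linear(2 * ch, ch)                            # fuse avg- and max-pooled context
        self.proj = nn.Conv2d(ch, ch, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        v = x.flatten(2).transpose(1, 2)                                    # (b, hw, c) vertex features
        ctx = torch.cat([x.mean(dim=(2, 3)), x.amax(dim=(2, 3))], dim=1)    # (b, 2c) global context
        edges = self.prototypes.unsqueeze(0) + self.ctx(ctx).unsqueeze(1)   # (b, E, c) conditioned edges
        # Participation strength of each vertex in each hyperedge (soft incidence matrix).
        part = torch.softmax(v @ edges.transpose(1, 2) / c ** 0.5, dim=1)   # (b, hw, E)
        edge_feat = part.transpose(1, 2) @ v                                # aggregate vertices into hyperedges
        out = part @ edge_feat                                              # redistribute back to vertices
        return self.proj(out.transpose(1, 2).reshape(b, c, h, w)) + x
```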

Multi-level feature concentration (MFC)

To address two core challenges in remote sensing object detection, insufficient multi-scale object perception and a lack of contextual information, we propose an efficient multi-scale feature fusion mechanism, the MFC module, as illustrated in Fig. 6c. This module employs depthwise separable convolutions (DWConv) with multiple kernel sizes (5\(\times\)5, 7\(\times\)7, 9\(\times\)9, and 11\(\times\)11) to capture spatial context across various object scales. Multi-kernel convolutions allow the model to observe the same scene from diverse receptive fields, significantly enhancing the contextual awareness of local regions and improving detection performance in occluded or densely populated small object scenarios.

An identity mapping path is included, and residual fusion is applied at the end to integrate shallow detail features with deep semantic information, thereby alleviating inconsistencies caused by semantic gaps across different layers. The use of depthwise separable convolutions also substantially reduces parameter count and computational complexity, allowing the module to be seamlessly integrated into complex networks and improving overall inference efficiency.

For the low-level feature map \(P_3\), an adaptive downsampling operation (ADown) is applied to match the spatial resolution of \(P_4\) and \(P_5\). For the mid-level \(P_4\) and high-level \(P_5\) features, a \(1 \times 1\) convolution is used to unify channel dimensions.

$$\begin{aligned} F_3 = \text {ADown}(P_3), \quad F_4 = \text {Conv}_{1 \times 1}(P_4), \quad F_5 = \text {Conv}_{1 \times 1}(P_5) \end{aligned}$$
(5)

where \(F_3\), \(F_4\), and \(F_5\) denote the aligned feature maps produced from the corresponding pyramid levels.

The three multi-scale feature maps are first concatenated to integrate contextual information across different resolutions. The fused representation is then processed by multiple depthwise separable convolution layers with different kernel sizes, enabling the extraction of spatial semantics under diverse receptive fields. Finally, the outputs of all branches are aggregated via element-wise addition to produce a unified multi-scale representation.

$$\begin{aligned} F_{concat} & = \text {Concat}(F_3, F_4, F_5) \\ D_k & = \text {DWConv}_{k \times k}(F_{concat}), \quad k \in \{5, 7, 9, 11\} \\ O_{\text {multi}} & = D_5 + D_7 + D_9 + D_{11} + D_{\text {ID}} \end{aligned}$$
(6)

where \(D_k\) denotes the depthwise separable convolution with kernel size \(k \in \{5, 7, 9, 11\}\) applied to \(F_{concat}\), \(D_{\text{ID}}\) is the identity-mapped feature from the preservation path, and \(O_{\text{multi}}\) is the aggregated output after the multi-kernel convolutions.

The aggregated feature is subsequently passed through a \(1 \times 1\) convolution to compress the channel dimension, reducing redundancy while retaining key semantic information. Finally, to enhance feature expressiveness and stabilize network training, a residual connection is applied by adding the initial concatenated feature to the fused output \(F_{\text {fused}}\), producing the final output feature map \(F_{\text {out}}\).

$$\begin{aligned} \begin{gathered} F_{\text {fused}} = \text {Conv}_{1 \times 1}(O_{\text {multi}}) \\ F_{\text {out}} = F_{\text {fused}} + F_{\text {concat}} \end{gathered} \end{aligned}$$
(7)

By jointly incorporating multi-scale aggregation, semantic alignment, and a lightweight design, the MFC module effectively bridges the gap between shallow detail features and deep semantic representations. This integration substantially enhances the expressiveness and robustness of the learned features.
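Under the assumption that ADown can be approximated by a strided convolution and that \(P_5\) is resized to \(P_4\)'s resolution before concatenation (details the text leaves open), Eqs. (5)–(7) could be implemented roughly as follows; channel widths are likewise assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFCSketch(nn.Module):
    """Illustrative MFC module following Eqs. (5)-(7)."""
    def __init__(self, c3: int, c4: int, c5: int, ch: int):
        super().__init__()
        self.adown = nn.Conv2d(c3, ch, 3, stride=2, padding=1)  # Eq. (5): F3 = ADown(P3)
        self.p4_proj = nn.Conv2d(c4, ch, 1)                     # Eq. (5): F4 = Conv1x1(P4)
        self.p5_proj = nn.Conv2d(c5, ch, 1)                     # Eq. (5): F5 = Conv1x1(P5)
        cat_ch = 3 * ch
        # Eq. (6): depthwise convolutions with kernel sizes 5, 7, 9, and 11.
        self.dw = nn.ModuleList([
            nn.Conv2d(cat_ch, cat_ch, k, padding=k // 2, groups=cat_ch) for k in (5, 7, 9, 11)
        ])
        self.compress = nn.Conv2d(cat_ch, cat_ch, 1)            # Eq. (7): 1x1 channel compression

    def forward(self, p3, p4, p5):
        f3 = self.adown(p3)
        f4 = self.p4_proj(p4)
        f5 = F.interpolate(self.p5_proj(p5), size=f4.shape[-2:], mode="nearest")
        f_cat = torch.cat([f3, f4, f5], dim=1)                  # F_concat
        o_multi = f_cat + sum(dw(f_cat) for dw in self.dw)      # D_ID + D_5 + D_7 + D_9 + D_11
        return self.compress(o_multi) + f_cat                   # F_out = F_fused + F_concat
```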

Experiments

Datasets

The SIMD52 remote sensing object detection dataset is a fine-grained benchmark comprising 5000 high-resolution optical satellite images collected across diverse geographic regions and various seasonal conditions, using satellite-based acquisition platforms. It contains 45,096 annotated object instances spanning 15 categories: Car, Truck, Van, Long Vehicle, Bus, Airliner, Propeller Aircraft, Trainer Aircraft, Chartered Aircraft, Fighter Aircraft, Other, Stair Truck, Pushback Truck, Helicopter, and Boat. Each instance is annotated with precise location, object class, and bounding box information, ensuring detailed and rigorous evaluation. Data analysis reveals three major challenges in the SIMD dataset. An extremely wide scale distribution is observed, with object sizes ranging from very small to large, as illustrated in the left plot of Fig. 7a. Most instances lie within the normalized range of [0, 0.5], resulting in a long-tail distribution. In addition, the dataset exhibits pronounced category imbalance: the Car class contains 20,504 instances, whereas the Boat class has only 49. Another challenge arises from the high degree of visual similarity among object types within the Aircraft and Vehicle categories, where overlaps in spatial scale further complicate fine-grained classification.

The VEDAI53 dataset comprises 1246 high-resolution aerial images (1024\(\times\)1024 pixels, 12.5 cm/pixel spatial resolution), collected between 2012 and 2014 under varied lighting and weather conditions across both rural and urban scenes. These images were captured using aerial platforms, ensuring a wide range of acquisition perspectives. It contains 3640 annotated instances across 7 vehicle categories, with each instance labeled by location, orientation, and class. Compared with SIMD, VEDAI presents two major difficulties. Small objects dominate the dataset, with a large proportion concentrated in the normalized size range of [0.0, 0.2]. Moreover, category imbalance is severe: the Car class includes 1377 instances, whereas the Truck class has only 105, yielding a sample ratio greater than 13:1. This imbalance substantially limits the generalization ability of detection algorithms, particularly on underrepresented categories, as illustrated in Fig. 7b.

Fig. 7. Object size distribution and category statistics in the SIMD and VEDAI datasets. The left plot illustrates the scatter distribution of normalized object width versus height, with categories distinguished by color. The right plot shows the number of object instances per category, providing a comparative view of category frequencies across the two datasets.

Experimental environment

To ensure fairness in training and comparison across models, all ablation studies and experiments were conducted at the Super Intelligent Computing Center of Xijing University. The experiments were performed on NVIDIA A800 GPUs with 80 GB memory, paired with Intel 6338N Xeon CPUs, under the Red Hat Enterprise Linux Version 4.8.5–28 (https://www.redhat.com/) operating system. The software environment was configured with the CUDA Version 12.1 (https://developer.nvidia.com/cuda-toolkit), Python Version 3.11 (https://www.python.org/), and PyTorch Version 2.1 (https://pytorch.org/). The complete hardware configuration used for both training and testing is summarized in Table 2.

Table 2 Configuration of experiment environments.

The hyperparameter configurations used in our experiments are summarized in Table 3. Training was performed for a total of 160 epochs, including 78 flat epochs. A warm-up phase of 2000 iterations (warmup_iter) was applied to stabilize the early stages of training. The initial learning rate was set to 0.0008, with a weight decay coefficient of 0.0001. The batch size was fixed at 4. To improve training stability, we employed the Exponential Moving Average (EMA) strategy, while Automatic Mixed Precision (AMP) was disabled to ensure numerical consistency and convergence reliability.
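For reference, the Table 3 settings can be collected into a configuration of the following shape; the key names are illustrative and do not correspond to the exact fields of the DEIM configuration files.

```python
# Illustrative training configuration mirroring Table 3 (key names are assumptions).
train_cfg = {
    "epochs": 160,            # total training epochs
    "flat_epochs": 78,        # flat-schedule epochs within the total
    "warmup_iter": 2000,      # warm-up iterations to stabilize early training
    "base_lr": 0.0008,        # initial learning rate
    "weight_decay": 0.0001,   # weight decay coefficient
    "batch_size": 4,
    "use_ema": True,          # Exponential Moving Average of weights enabled
    "use_amp": False,         # Automatic Mixed Precision disabled for numerical consistency
}
```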

Table 3 Hyperparameter settings for training HyperFusion-DEIM.

Ablation study

To evaluate the individual and combined contributions of the key components in the proposed HyperFusion-DEIM framework, we perform ablation experiments by incrementally integrating the SRFD, MPAF, and SAFE modules. Using DEIM as the baseline, we assess the impact of each module on detection accuracy and computational efficiency. We use Average Precision (AP) as the primary metric; it evaluates the accuracy of object detection models by summarizing the precision-recall curve into a single value, averaging precision over different recall levels. For a specific IoU threshold (e.g., 0.50), \(AP_{50}\) is computed by considering only those predictions whose IoU with the ground truth box is greater than or equal to 0.50. In more comprehensive evaluations, \(AP_{50:95}\) averages the precision over multiple IoU thresholds (from 0.50 to 0.95 in steps of 0.05), giving a more robust measure of detection accuracy at varying levels of localization precision. In addition, we report the number of parameters (Params) and the computational complexity (GFLOPs), providing a comprehensive analysis of detection performance, inference cost, and model compactness.
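As a small numerical illustration of the \(AP_{50:95}\) convention (the per-threshold AP values below are hypothetical, not measured results), the metric is simply the mean of the AP obtained at each IoU threshold from 0.50 to 0.95 in steps of 0.05:

```python
# Hypothetical per-threshold AP values for one model; real values come from the evaluator.
ap_per_iou = {0.50: 0.836, 0.55: 0.82, 0.60: 0.80, 0.65: 0.77, 0.70: 0.73,
              0.75: 0.69, 0.80: 0.63, 0.85: 0.55, 0.90: 0.42, 0.95: 0.21}

ap_50 = ap_per_iou[0.50]                               # AP at a single IoU threshold
ap_50_95 = sum(ap_per_iou.values()) / len(ap_per_iou)  # mean over IoU = 0.50:0.05:0.95
print(f"AP50 = {ap_50:.3f}, AP50:95 = {ap_50_95:.3f}")
```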

Table 4 Ablation study on the SIMD dataset.

As summarized in Table 4, adding the SRFD module alone (Row 2) introduces negligible increases in parameters and GFLOPs, while slightly improving \(AP_{50}\) from 76.6 to 76.9%. Incorporating MPAF alone (Row 3) substantially increases model size and computational complexity, but also delivers a significant accuracy gain, reaching 80.6%. The SAFE module (Row 4) enhances detection performance through improved global context modeling, achieving \(AP_{50}\) of 78.8% with relatively low parameter and computation cost (4.67 M, 11.27 GFLOPs), striking a favorable balance between efficiency and expressiveness. Combining SRFD and MPAF (Row 5) further improves accuracy to 82.9%, demonstrating the strong synergy between shallow detail preservation and mid-level multi-scale fusion. Finally, the full model (Row 6), integrating all three modules, attains the best results with \(AP_{50}\) of 83.6% and 67.2% in \(AP_{50:95}\), corresponding to gains of 7.0% and 6.3% over the baseline.

These results validate the complementary contributions of the proposed modules and the enhanced overall modeling capacity for remote sensing object detection. The ablation study confirms that each component positively influences the trade-off between accuracy and computational complexity, forming an integrated framework particularly effective for detecting small targets and capturing rich semantic information in complex remote sensing scenes.

On the VEDAI dataset, the impact of different module combinations on model performance was similarly evaluated, as summarized in Table 5. Compared with configurations using a single module, the collaborative integration of multiple modules produces more stable and superior detection results. While MPAF or SAFE individually provides measurable improvements, the highest performance is obtained when all three modules are combined. In this configuration, the model achieves an \(AP_{50}\) of 53.7% and an \(AP_{50:95}\) of 33.8%, substantially exceeding the baseline by 12% and 8.7%, respectively. These results underscore the compatibility and strong complementarity among the three modules. Moreover, the integrated design maintains a favorable balance between detection accuracy and computational cost, representing an effective enhancement over the baseline model.

Table 5 Ablation study on the VEDAI dataset.

Table 6 summarizes the ablation results of embedding the MPAF module at different backbone stages (Stage2–Stage5) on the VEDAI dataset. The results show that progressively integrating MPAF across multiple stages consistently improves detection performance, with \(AP_{50}\) rising from 42.9% with MPAF applied only at Stage5 to 44.2% when it is incorporated across all stages. Although multi-stage deployment introduces additional computational cost (GFLOPs increasing from 22.65 to 74.85), it substantially enhances multi-scale feature representation while preserving semantic consistency. These findings demonstrate the hierarchical adaptability and cumulative benefit of the MPAF module in multi-stage feature fusion.

Table 6 Ablation experiments of MPAF module on VEDAI dataset.

Comparative experiments

Table 7 presents a comprehensive comparison between the proposed HyperFusion-DEIM and several state-of-the-art real-time object detectors. Compared with its baseline, DEIM, HyperFusion-DEIM demonstrates notable improvements across AP metrics for various model variants. Specifically, the N variant outperforms DEIM by 4.6% in AP, 5.5% in \(AP_{50}\), and 4.6% in \(AP_{75}\), while the S variant achieves gains of 4.3% in AP, 1.4% in \(AP_{50}\), and 3.6% in \(AP_{75}\). Although the parameter count and GFLOPs increase slightly, the inference speed (FPS) improves by nearly 50%, indicating an excellent trade-off between accuracy and efficiency.

Table 7 Comparison with real-time object detectors on SIMD-test.

In comparison with the YOLO family, HyperFusion-DEIM achieves a superior balance between detection accuracy and computational cost. For instance, HyperFusion-DEIM-S maintains comparable FPS and AP to YOLOv8-S, while improving \(AP_{50}\) by 1.9%, \(AP_s\) by 1.1%, and \(AP_m\) by 4.4%. Against more recent variants such as YOLOv11 and YOLOv12, HyperFusion-DEIM consistently attains higher AP across small, medium, and large object categories. Both the N and S variants deliver favorable FPS while achieving elevated AP values through optimized computational efficiency. Notably, in scenarios involving small objects and complex backgrounds, the proposed model leverages multi-scale feature fusion and enhanced contextual modeling to substantially improve robustness and detection precision.

Furthermore, when compared with the latest RT-DETR series, HyperFusion-DEIM provides superior accuracy and efficiency. It achieves an AP of 82.3%, surpassing RT-DETR-R50’s 79.2%, while simultaneously offering a significant FPS improvement. These results validate the effectiveness of the proposed architecture for real-time detection tasks. Overall, through refined feature fusion and enhanced context awareness, HyperFusion-DEIM delivers substantial gains in both accuracy and computational efficiency, establishing it as a compelling solution for high-performance real-time remote sensing object detection.

Figure 8 provides a comparative evaluation of HyperFusion-DEIM against several lightweight real-time object detectors, including YOLOv8-N, YOLOv11-N, YOLOv12-N, D-Fine-N, and DEIM-N. As shown in Fig. 8a, HyperFusion-DEIM achieves a throughput of 296.3 FPS, substantially higher than all competing methods, with particularly large margins over YOLOv8-N and YOLOv11-N. Figure 8b further reports the multi-scale AP results. HyperFusion-DEIM consistently outperforms alternative models across all metrics, including AP and \(AP_{50}\), with especially pronounced improvements in overall detection accuracy. Although the proposed framework incurs slightly higher computational complexity, it maintains an excellent balance between accuracy and efficiency. These results highlight the effectiveness and practicality of HyperFusion-DEIM for high-performance real-time object detection tasks in challenging remote sensing scenarios.

Fig. 8. Performance comparison of different detection models.

Fig. 9. Multi-class object detection results of HyperFusion-DEIM on the SIMD dataset.

Figure 9 illustrates the superior performance of HyperFusion-DEIM on multi-class object detection within the SIMD dataset, covering complex remote sensing scenes such as urban blocks, docks, airports, and transportation hubs. The results demonstrate that diverse targets (e.g., vehicles, aircraft, and vessels) are accurately localized under challenging conditions, including high density, occlusion, and large scale variations, thereby reflecting both robustness and high precision. This capability stems from several key architectural components. The multi-stage MPAF structure in MAPNet enables effective multi-scale feature aggregation, while the SRFD module enhances low-level edge and texture cues, substantially improving the boundary delineation of small objects. The SAFE module, which integrates Transformer and HyperACE mechanisms, further strengthens semantic representation and suppresses background noise, as confirmed by activation maps that concentrate on true object regions, particularly along road boundaries and occluded areas. In addition, the MFC module bridges semantic gaps across layers by enhancing spatial alignment and saliency consistency, leading to fewer missed detections, more accurate localization, and sharper object contours, especially in regions with densely distributed small objects. Taken together, HyperFusion-DEIM establishes a unified architectural design for remote sensing small-object detection, integrating feature enhancement, contextual modeling, and fusion optimization in a closed-loop manner.

Figure 10 provides a comparative analysis of detection results and heatmaps on the SIMD dataset. Subfigure (a) shows the original image, (b) presents the baseline method, and (c) illustrates the proposed HyperFusion-DEIM. As shown, the DEIM baseline exhibits noticeable background interference and imprecise localization, particularly in regions containing small-scale vessels and densely clustered aircraft. The corresponding heatmaps reveal dispersed activation patterns, with significant missed detections and poorly localized bounding boxes. In contrast, HyperFusion-DEIM demonstrates more concentrated responses in target areas, with activation regions closely aligned with actual object locations. The heatmaps reflect clearer boundaries, stronger activations, and reduced background noise, highlighting the model’s improved sensitivity to small objects and its enhanced contextual modeling capabilities.

Fig. 10. Example detection results and heat maps on the SIMD database: (a) Original images, (b) DEIM, (c) HyperFusion-DEIM.

Table 8 presents a performance comparison between the proposed HyperFusion-DEIM-N and several state-of-the-art lightweight object detection models, including the YOLO series, RT-DETR series, as well as D-Fine-N and DEIM-N, evaluated on the VEDAI dataset. The proposed method achieves superior accuracy, attaining the highest \(mAP_{50}\) of 53.3% and \(mAP_{50:95}\) of 34.6%, surpassing advanced models such as RT-DETRv2-r18 and YOLOv8-N. Although HyperFusion-DEIM-N exhibits relatively higher model complexity (134.1M parameters and 79.7 GFLOPs), it maintains a competitive inference speed of 200.64 FPS, achieving an optimal balance between accuracy and efficiency. Compared to DEIM-N and D-Fine-N, our method significantly improves detection accuracy for small objects, while preserving a favorable inference rate. These results underscore the effectiveness and generalizability of the proposed multi-scale fusion and context-aware modeling strategies in remote sensing scenarios.

Table 8 Comparison results on the VEDAI database.

The RT-DETR-r18 model has the lowest FPS, primarily due to its large number of parameters and high computational cost, as indicated by its GFLOPs. This suggests that it is a more complex and heavier model than methods such as YOLOv8-N or D-Fine-N, which prioritize faster inference and lower computational overhead. HyperFusion-DEIM-N also has a relatively high number of parameters and GFLOPs, yet it still achieves a high frame rate (200.64 FPS), in clear contrast to RT-DETR-r18. This is because HyperFusion-DEIM separates training from inference and applies inference-time optimizations, such as simplifying the structure and removing redundant computations, which significantly boost inference speed.

Figure 11 presents a comparative visualization of detection results between the proposed method and two mainstream approaches, DEIM and D-Fine, on remote sensing images. The four columns display (a) ground truth annotations, (b) detection results of the proposed method, (c) DEIM, and (d) D-Fine, respectively.

Fig. 11. Comparison of detection effectiveness between HyperFusion-DEIM and popular methods.

As shown, the proposed method significantly outperforms DEIM and D-Fine in detection accuracy, with fewer false positives and missed detections across various complex scenes. Specifically, DEIM and D-Fine exhibit considerable missed detections (e.g., rows 2 and 4) and localization errors (e.g., row 3). In contrast, our method accurately delineates object boundaries (e.g., the small car in row 1) and reliably detects small or occluded targets in low-contrast regions, such as building edges and road intersections (e.g., row 5). This performance improvement is primarily attributed to the integrated enhancements in small object boundary modeling, multi-scale feature fusion, and context-aware perception within the HyperFusion-DEIM framework. Notably, the SAFE module, incorporating Transformer and HyperACE mechanisms, plays a key role in enhancing the model’s attention focus and its ability to detect objects in complex backgrounds. Overall, the proposed method demonstrates superior robustness and accuracy in remote sensing object detection, confirming its strong potential for practical deployment in real-world scenarios.

Conclusion

This paper introduces HyperFusion-DEIM, a novel framework for remote sensing object detection that integrates small object representation enhancement, contextual semantic modeling, and multi-scale feature fusion optimization. Specifically, we propose MAPNet, which incorporates multi-level MPAF modules to enhance multi-scale feature aggregation, while the SRFD module strengthens shallow semantic cues, improving the capture of boundary and texture features for small objects. Additionally, the proposed SAFE encoder combines Transformer and HyperACE mechanisms to model long-range semantic dependencies while preserving spatial details. The MFC module establishes deep cross-level connections and geometric alignment, improving both semantic consistency and spatial resolution. Extensive experiments conducted on two benchmark datasets, SIMD and VEDAI, demonstrate that HyperFusion-DEIM outperforms state-of-the-art lightweight detection models (e.g., YOLO variants, RT-DETR, and DEIM), achieving superior accuracy and robustness in scenarios with densely distributed small objects and complex backgrounds, while maintaining competitive inference speed.

Despite the promising performance of the proposed framework, several limitations remain. The introduction of multi-branch modules and Transformer blocks increases both model complexity and computational cost, potentially limiting its deployment on resource-constrained platforms, such as embedded edge devices. Furthermore, the model’s performance may degrade under extreme conditions, such as low illumination, motion blur, or severe occlusion, suggesting that robustness in these challenging scenarios is an area for further improvement. Future work will focus on enhancing model compression and deployment efficiency, exploring strategies such as knowledge distillation and lightweight attention mechanisms to mitigate computational overhead. Additionally, we plan to explore multi-modal remote sensing fusion, incorporating data from sources like LiDAR, SAR, and multispectral imagery, to improve semantic discrimination and cross-domain generalization. This will extend the applicability of our method to real-world intelligent remote sensing interpretation tasks.