Introduction

Object detection is one of the core tasks in remote sensing image analysis and has played a pivotal role in various practical applications, including military reconnaissance, geographic information systems, urban planning, environmental monitoring, and disaster assessment1. Unlike natural images, remote sensing imagery typically features high resolution, large-scale perspectives, and complex backgrounds. These characteristics result in a sparse distribution of foreground objects, significant variation in object sizes, and the prevalence of small targets that occupy only a few pixels2. Such small objects often exhibit limited texture and blurred boundaries, making them difficult to distinguish from the background. Due to these challenges, traditional detection methods based on candidate region generation and region-based feature extraction often suffer from low accuracy and high miss rates when applied to remote sensing images, failing to meet the demands of fine-grained analysis3,4.

In recent years, with the rapid advancement of deep learning, particularly convolutional neural networks (CNNs), object detection technologies have achieved remarkable progress. Conventional detection frameworks can be broadly categorized into two types: two-stage and one-stage methods. Two-stage methods, exemplified by Faster R-CNN5, first generate a set of region proposals and then perform refined classification and bounding box regression. These methods typically yield high detection accuracy but suffer from slower inference and greater structural complexity. In contrast, one-stage methods such as the YOLO series6,7,8,9,10,11 and SSD12 directly perform dense predictions on the image, offering faster inference but often underperforming in small object detection and high-precision tasks. To bridge the gap between performance and efficiency, researchers have explored innovations in network architectures, feature fusion strategies, and training techniques. With the introduction of the Transformer13 architecture into computer vision, self-attention-based object detectors such as DETR (Detection Transformer)14 have pioneered the formulation of object detection as a sequence-to-sequence prediction task. DETR eliminates the reliance on heuristic components like anchor generation and non-maximum suppression (NMS), enabling end-to-end training. However, it suffers from slow convergence and limited modeling capacity for small objects and dense scenes14. To address these issues, Deformable DETR15 was proposed, incorporating deformable attention mechanisms to significantly accelerate convergence and enhance local detail modeling. Subsequently, methods such as DAB-DETR16, DN-DETR17, and more recently RT-DETR18 have further optimized query initialization and feature interaction mechanisms, improving both detection accuracy and real-time performance. These advancements have established Transformer-based detectors as a prominent research direction. In the context of small object detection in remote sensing imagery, DETR-like methods face additional challenges, including severe scale imbalance, increased background clutter, and semantic ambiguity in features. To tackle these issues, recent methods such as D-FINE19 introduce deep feature-guided strategies for small object enhancement, using specialized structures to contract the receptive field and improve small object perception. DEIM20 adopts a dual-path interaction mechanism that integrates structural and semantic features across layers, facilitating high-level semantic fusion and mitigating background interference. In addition, these efforts21,22,23,24 signify a shift in remote sensing object detection frameworks toward finer-grained and more tightly coupled feature modeling.

Despite these advancements, remote sensing object detection still faces numerous challenges. Targets often occupy only a tiny fraction of high-resolution images, leading to feature sparsity and making small objects particularly difficult to detect. Moreover, their features are often overwhelmed or diluted by deep convolutional layers, resulting in low detection accuracy25. Existing multi-scale feature fusion strategies are typically coarse and struggle to effectively align high-level and low-level features, thereby limiting the model’s ability to perceive objects at varying scales26. In addition, background clutter further complicates the detection of small objects, making it harder to distinguish targets from surrounding noise in complex scenes. Most current methods also lack sufficient global contextual modeling, which exacerbates the challenges of detecting small-scale objects in large, cluttered environments. Furthermore, some high-accuracy models incur significant computational costs, making it difficult to achieve a balance between inference efficiency and detection performance, which hinders their practical deployment in real-world remote sensing applications.

Fig. 1. Comparison of different detection methods on the SIMD dataset in terms of accuracy and efficiency. The radius of each circle indicates the GFLOPs.

To address the core challenges of small target feature loss, rough multi-scale feature fusion, insufficient global context modeling, and difficulty in balancing detection efficiency and accuracy in remote sensing targets, this paper proposes a DEIM-based remote sensing object detection framework named HyperFusion-DEIM. The overall architecture comprises three key components: Multi-Path Attention Network (MAPNet), Scale-Aware Feature Enhancement (SAFE), and Multi-level Feature Concentration (MFC). Specifically, MAPNet enhances fine-grained feature representation and multi-scale modeling by incorporating the Shallow Robust Feature Downsampling (SRFD)27 and Multi-Path Attention Fusion (MPAF) modules. The SAFE module integrates a Transformer with the HyperACE28 structure to improve cross-layer semantic interaction and contextual awareness. The MFC module performs deep multi-level feature fusion, effectively enhancing object boundary perception and detection robustness.

Figure 1 presents a performance comparison of different models on the SIMD dataset. Although the proposed HyperFusion-DEIM exhibits relatively higher GFLOPs, it outperforms state-of-the-art detectors such as YOLO, RT-DETR, D-Fine, and DEIM in both FPS and AP, demonstrating superior detection accuracy and real-time performance.

As shown in Fig. 2, we compare DEIM with the proposed HyperFusion-DEIM in terms of the detection pipeline and output quality on remote sensing tasks. The baseline uses HGNetv229 as the backbone, relying mainly on spatial convolutions for feature extraction. However, it struggles with efficient multi-scale integration, leading to blurred textures and weak boundaries in intermediate feature maps. This is particularly problematic for small objects, resulting in missed detections and inaccurate localization.

Fig. 2. Comparison of the architecture and detection results between DEIM and the proposed HyperFusion-DEIM. By introducing MAPNet, SAFE, and MFC, the proposed method enhances feature representation and fusion capabilities, significantly improving detection performance in remote sensing imagery. The yellow circles in the bottom-right indicate targets that are falsely detected or missed by the baseline method.

The HyperFusion-DEIM introduces three key components: MAPNet, SAFE, and MFC, which significantly enhance modeling capability. Compared to DEIM, our approach retains sharper edge details and clearer textures, as shown in the generated feature maps. In the final detection results, the yellow-circled regions highlight that our method achieves more accurate detection and localization of small objects, especially in complex backgrounds. This results in fewer missed detections, particularly near road edges and occluded areas. The main contributions of this paper are as follows:

(1) We propose HyperFusion-DEIM, a unified framework that integrates small object enhancement, contextual semantic modeling, and multi-scale fusion optimization in a cascaded design, addressing key challenges in existing detection methods.

(2) To address small object sparsity, we introduce MAPNet, which improves low-level feature retention and multi-scale aggregation by combining SRFD and MPAF modules, enhancing boundary clarity and fine-grained texture modeling.

(3) We design the SAFE encoder to overcome the lack of contextual modeling and inefficient information flow. It incorporates Transformer and HyperACE mechanisms, and, together with the MFC module, enables effective fusion of global context and local semantics, improving semantic interactions and attention across hierarchical features.

(4) The MFC module mitigates multi-scale fusion challenges and semantic misalignment by establishing deep inter-layer connections and geometric alignment across feature levels, enhancing spatial coherence and saliency balance.

The remainder of this paper is organized as follows: Section “Related work” reviews related work in remote sensing object detection. Section “Method” details our proposed model and its components. Section “Experiments” presents ablation studies to validate each module’s effectiveness, followed by comparative experiments and visual analyses. Finally, Section “Conclusion” concludes the paper and outlines future research directions.

Related work

CNN-based remote sensing object detection

Traditional convolutional neural network (CNN)-based methods for object detection can be categorized into two types: two-stage and one-stage detectors.

Two-stage detectors, such as Faster R-CNN30, Mask R-CNN31, and Cascade R-CNN32, generate region proposals (RoIs) and use detection heads to classify object categories and regress bounding boxes for each RoI feature. While these methods are known for their high accuracy, especially in boundary refinement and semantic detail preservation, they suffer from slow inference speed and high training complexity, making them impractical for real-time remote sensing applications. Furthermore, cascade regressor architectures such as that of Cascade R-CNN, while improving detection for small objects, still struggle with feature misalignment between high- and low-level features, limiting their performance in high-resolution imagery where fine-grained details are critical for small object detection. Libra R-CNN33 addresses the imbalance problem in small object detection by refining the original features with non-local blocks. However, while this improves object representation, the non-local operations introduce computational overhead, limiting real-time applicability, particularly for large remote sensing images that require fast processing.

One-stage detectors, such as SSD12, RetinaNet34, and the YOLO series6,7,8,9,10,11,28,35, eliminate the need for region proposals, directly predicting bounding box coordinates and class labels using a single neural network. These methods excel in speed-critical applications due to their fast inference, but they often sacrifice detection accuracy due to feature downsampling and loss of fine-grained details, especially for small objects. The inability to retain high-resolution features makes it difficult for these models to detect small objects in remote sensing imagery, where targets often occupy only a tiny fraction of the image.

Drone-YOLO36 uses a three-layer PAFPN structure and large-resolution feature maps, specifically tailored for small target detection. While effective, its reliance on high-resolution feature maps leads to increased computational load, making it less efficient for large-scale remote sensing data. Similarly, ESOD37 introduces sparse detection heads to reduce redundant computations over background regions. However, it still struggles with multi-scale feature alignment, which is critical for detecting small targets across varying image resolutions. Feature fusion techniques like those in FFCA-YOLO38 and ACDF-YOLO39 improve local region perception and multi-scale fusion but fail to efficiently integrate global contextual information, often resulting in poor performance in cluttered and complex scenes, common in remote sensing data. In addition, there are several CNN-based approaches that leverage lightweight methods40,41 or techniques such as distillation42,43 to enhance both detection accuracy and speed.

Transformer-based remote sensing object detection

In recent years, Transformer-based object detection methods, such as DETR14 and its variants (e.g., Deformable DETR15, RT-DETR18, Drone-DETR44, MCG-RTDETR45), have gained significant attention due to their ability to model long-range dependencies using self-attention mechanisms. However, traditional Vision Transformers (ViTs) suffer from quadratic computational complexity with respect to image resolution, which becomes prohibitive for high-resolution remote sensing imagery. While optimized models like Deformable DETR and RT-DETR improve computational efficiency, they still face issues with feature misalignment across scales and ineffective handling of small objects. Despite various enhancements, the reliance on sparse attention and a limited number of queries in these models hinders their ability to capture the fine-grained features needed for detecting small objects in remote sensing imagery.

Drone-DETR44, built upon RT-DETR, incorporates a lightweight backbone and dual-path attention mechanism, improving small object detection. However, it still struggles with the complexity of large-scale background clutter in remote sensing images and scale variation, which require robust multi-scale feature fusion to handle effectively. Moreover, the introduction of Transformer-based models has increased the overall model complexity, limiting their real-time deployment capabilities.

Feature extraction and fusion

To enhance multi-scale feature representation, several architectures have been developed. The Feature Pyramid Network (FPN)46 uses a bottom-up pyramid structure to integrate features at multiple resolutions, improving small object detection. However, FPN and its extensions, such as PANet47 and BiFPN48, often suffer from inefficient feature fusion across layers, which hinders performance when small objects appear at varying scales or in complex backgrounds. While these methods improve detection accuracy by fusing multi-scale features, they still struggle to align low-level and high-level features in a way that is both computationally efficient and effective for small object detection.

CF2PN49 and LR-FPN50 introduce additional modules for cross-scale feature fusion and better shallow spatial/contextual representation, respectively. However, both still fall short in handling the global context necessary for remote sensing imagery, where background clutter and environmental variations often obscure small targets. LR-FPN, in particular, suffers from inadequate global context modeling, which is critical for distinguishing objects from background noise in remote sensing data.

Addressing the gaps with HyperFusion-DEIM

Despite the significant progress made by both CNN-based and Transformer-based detectors, small object detection in remote sensing remains a challenging problem due to issues such as feature sparsity, background clutter, and scale variation. Current methods, while improving on one aspect of detection (e.g., speed, accuracy, or feature fusion), often fail to provide a comprehensive solution that balances accuracy, efficiency, and scalability. For instance, two-stage detectors like Cascade R-CNN offer high accuracy but struggle with inference speed, while one-stage detectors like YOLO prioritize speed at the cost of fine-grained detection, and Transformer-based models like DETR face challenges with feature alignment and global context modeling.

HyperFusion-DEIM addresses these gaps by introducing a novel multi-path attention network that facilitates better feature fusion at multiple scales while also preserving fine-grained details. It enhances small object detection by employing an adaptive feature enhancement encoder that improves feature alignment and multi-scale fusion, making it more robust to background clutter and scale variations. Furthermore, the HyperACE module incorporated in HyperFusion-DEIM ensures that the global context is effectively modeled, thereby enabling the model to distinguish small objects from complex backgrounds in remote sensing imagery. This innovative approach allows HyperFusion-DEIM to balance accuracy, computational efficiency, and small object sensitivity, making it a highly effective solution for real-time remote sensing applications.

Method

In this section, we present the HyperFusion-DEIM framework. Subsection “Macroscopic architecture” introduces the overall architecture, Subsection “Multi-path attention network (MAPNet)” details the backbone MAPNet and its submodules MPAF and SRFD, and Subsection “Scale-aware feature enhancement (SAFE)” describes the proposed SAFE module and its submodule MFC.

Macroscopic architecture

Fig. 3. Architecture of HyperFusion-DEIM. The model employs a four-stage backbone to progressively extract multi-scale features through MPAF modules. The encoder integrates the HyperACE module, Transformer units, and the MFC structure, enabling efficient cross-scale feature fusion.

Our approach builds upon DEIM20 as the baseline, which introduces a novel strategy to improve both convergence speed and detection accuracy in the original DETR. DEIM fundamentally restructures the object detection pipeline by incorporating an innovative matching mechanism that accelerates training and enhances performance. The core structure of DEIM consists of three key components: the HGNetv2 backbone, an encoder integrating Transformer and PAFPN, and a decoder-based prediction head. Its primary innovation lies in the enhanced matching algorithm, which dynamically optimizes the correspondence between predicted boxes and ground truth annotations. This significantly reduces training complexity and computational cost by streamlining the alignment process, while maintaining high precision in object localization and classification. Additionally, DEIM utilizes the MAL loss, which incorporates mutual attention to model interactions between objects, further improving detection performance.

The HyperFusion-DEIM architecture consists of three key components: a feature extraction backbone (MAPNet), a feature enhancement encoder (SAFE), and a detection decoder. As shown in Fig. 3, MAPNet serves as the backbone feature extractor. The SRFD module enhances shallow semantic features, followed by a staged feature extraction pipeline (Stage 1–Stage 4) that captures multi-scale features, forming a five-level feature pyramid for fusion and semantic modeling. In the encoder, the SAFE module enables efficient semantic modeling and cross-layer interaction. It integrates a Transformer-based architecture with the HyperACE mechanism, using residual connections and up-/down-sampling paths for deep-to-shallow feature fusion. HyperACE enhances spatial context, improving accuracy for small objects. To address multi-scale misalignment and semantic inconsistency, the MFC module ensures better alignment and consistency across resolutions. In the decoder, a lightweight structure with a self-distillation mechanism optimizes supervised learning. Multi-scale features are decoded in parallel to predict bounding boxes (Box Pred) and category scores (Score Pred), enhancing inference efficiency. This design enables accurate detection and localization of multi-scale objects in remote sensing images.
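To make this data flow concrete, the following minimal PyTorch sketch shows how the three stages compose. The class and its interfaces are placeholders introduced purely for illustration; they are assumptions rather than the actual implementation, and MAPNet, SAFE, and the DEIM-style decoder are assumed to be defined elsewhere.

```python
import torch.nn as nn

class HyperFusionDEIMSketch(nn.Module):
    """Coarse data-flow sketch of the three-stage pipeline (backbone -> encoder -> decoder)."""
    def __init__(self, backbone: nn.Module, encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.backbone = backbone   # MAPNet: SRFD + Stage1..Stage4 -> multi-scale pyramid features
        self.encoder = encoder     # SAFE: Transformer + HyperACE + MFC cross-scale fusion
        self.decoder = decoder     # DEIM-style decoder with self-distillation during training

    def forward(self, images):
        feats = self.backbone(images)        # e.g., a pyramid of feature maps {P3, P4, P5, ...}
        fused = self.encoder(feats)          # context-enhanced, cross-scale fused features
        boxes, scores = self.decoder(fused)  # parallel Box Pred and Score Pred outputs
        return boxes, scores
```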

Multi-path attention network (MAPNet)

Overall structure

In remote sensing object detection, shallow feature maps often suffer from blurred edges and missing textures, while repeated downsampling during multi-level pyramid construction results in the loss of small object features. Although the HGNetv2 backbone in DEIM offers advantages in structural symmetry and computational efficiency, it remains limited in remote sensing contexts. Key limitations include insufficient multi-scale modeling, inadequate fusion of shallow and deep features, and simplistic attention mechanisms that fail to emphasize critical object regions effectively.

To address these challenges, we propose a novel backbone network, MAPNet, as shown in Fig. 4. The architecture consists of three main components: the SRFD (Fig. 4d)27, HG_Stage_MPAF blocks (Fig. 4b), and the MPAF_Block (Fig. 4c). The SRFD module enhances edge and texture representations in shallow layers, effectively mitigating the loss of small object features. Next, multi-level HG_Stage_MPAF modules process features through either depthwise separable convolutions or identity mappings, depending on whether downsampling is applied. Each stage uses parallel MPAF_Block modules to extract and fuse semantic information across scales. The MPAF_Block incorporates multi-path attention mechanisms and feature aggregation strategies to improve the representation of salient object regions while suppressing background noise.

MAPNet improves feature extraction accuracy and robustness through a staged, multi-path, and attention-guided fusion strategy. It demonstrates superior performance in multi-scale and dense object detection tasks, proving its effectiveness in remote sensing applications.

Fig. 4. Architecture of MAPNet. The architecture includes the SRFD feature extraction module, HG_Stage_MPAF modules, MPAF_Block, and the multi-stage, multi-path attention fusion mechanism.

The backbone configuration is shown in Table 1.

Table 1 Model structure and layer configuration for backbone.

The shallow robust feature downsampling (SRFD)

SRFD, shown in Fig. 4d, enhances low-level feature representation by introducing multiple parallel convolutional and pooling pathways, including operations such as GConv, DWConvD, and MaxD. By employing different downsampling techniques to extract multiple complementary feature maps, SRFD constructs a more robust feature representation capable of capturing detailed textures and structural information across multiple scales. This design significantly improves the detection accuracy of small objects.

In addition to enhancing multi-scale perceptual capability, SRFD improves the preservation of fine-grained details through effective feature fusion using concatenation followed by convolution (Cat + Conv). Furthermore, SRFD maintains low computational overhead while suppressing background noise and improving the model’s focus on salient object regions.
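As a loose illustration of this design (not the published SRFD implementation27), the sketch below arranges three complementary downsampling paths, labeled GConv, DWConvD, and MaxD after the figure, and fuses them with concatenation followed by a 1×1 convolution. The channel widths, group counts, and exact path compositions are assumptions.

```python
import math
import torch
import torch.nn as nn

class SRFDSketch(nn.Module):
    """Illustrative SRFD-style robust 2x downsampling: three complementary paths fused by Cat + Conv."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # GConv path: strided grouped convolution preserving per-group texture cues.
        self.gconv = nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1,
                               groups=math.gcd(in_ch, out_ch))
        # DWConvD path: strided depthwise convolution followed by a pointwise projection.
        self.dwconv_d = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, stride=2, padding=1, groups=in_ch),
            nn.Conv2d(in_ch, out_ch, 1),
        )
        # MaxD path: max pooling keeps the strongest local responses (edges, corners).
        self.max_d = nn.Sequential(
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(in_ch, out_ch, 1),
        )
        # Cat + Conv fusion of the three complementary downsampled views.
        self.fuse = nn.Conv2d(3 * out_ch, out_ch, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [self.gconv(x), self.dwconv_d(x), self.max_d(x)]
        return self.fuse(torch.cat(feats, dim=1))
```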

The multi-path attention fusion (MPAF)

Fig. 5. Structure of the MPAF. It consists of three parallel paths: an attention refinement path \(F_1\), an efficient feature enhancement path \(F_2\), and an information preservation path \(F_3\).

The MPAF module is designed to process the input feature map \(X_{in}\) through three parallel paths, as illustrated in Fig. 5. The attention refinement path \(F_{1}\) employs a series of iRMB (Inverted Residual Mobile Block)51 attention modules to perform hierarchical deep extraction, enabling the capture of rich multi-scale feature representations. The efficient feature enhancement path \(F_{2}\) utilizes a combination of depthwise separable convolution (DSConv), depthwise convolution (DWConv), and pointwise convolution (PWConv) to enhance feature expression and emphasize salient information. The information preservation path \(F_{3}\) applies a simple \(1\times 1\) convolution to retain the original information with minimal transformation. The outputs from all three paths are concatenated and fused via a \(1\times 1\) convolution to produce the final output \(X_{out}\), achieving a balance between depth, efficiency, and feature integrity.

In the attention refinement path \(F_{1}\), the input feature X is divided into m parallel branches ( \(Branch_{1}\) to \(Branch_{m}\)). Each branch processes its input through a series of stacked iRMB modules, where each iRMB module internally implements a self-attention mechanism to enhance feature representation. The outputs from all branches are finally fused via concatenation to produce the aggregated feature map. The overall processing can be described as follows:

$$\begin{aligned} F_1(X) & =Concat(Branch_1,Branch_2,...,Branch_m) \\ Branch_i & = iRMB_n(...(iRMB_1(X_i))),\quad i\in \{1,2,...,m\} \\ iRMB(X) & = \textrm{Conv}_{1 \times 1}\left( \textrm{DWConv}_{3 \times 3}\left( V \otimes (QK^{T})\right) + V \otimes (QK^{T})\right) + X \end{aligned}$$
(1)

This path ensures effective information transmission, where the multi-branch structure enables the capture of diverse feature representations at different levels. The self-attention mechanism further enhances the modeling of global dependencies, while convolution operations maintain sensitivity to local patterns. The residual connection within each iRMB module (i.e., \(+X\)) facilitates gradient flow and alleviates training difficulties in deep networks. This design allows the attention refinement path to efficiently extract rich and hierarchical features, making it particularly well-suited for detecting small objects in remote sensing imagery.

The efficient feature enhancement path \(F_2\) forms a feature transformation pipeline composed of three consecutive operations: an initial \(1\times 1\) convolution projects the channels, DWConv focuses on spatial feature extraction, and PWConv integrates information across channels, so that the sequence progressively reconstructs and enhances feature representations from multiple dimensions. Compared with the other paths, this branch offers a unique perspective on feature transformation. By decomposing traditional convolutions into three lightweight operations, this design significantly reduces both the number of parameters and computational complexity. The process can be formulated as:

$$\begin{aligned} F_{2}(X) = \textrm{PWConv}(\textrm{DWConv}(\textrm{Conv}_{1 \times 1}(X))) \end{aligned}$$
(2)

The information preservation path \(F_{3}\) processes the input feature X through a single \(1\times 1\) convolution layer. This path ensures that the essential semantics of the original input are retained, providing foundational support for the more complex transformation paths. The process is defined as:

$$\begin{aligned} F_3(X) = \text {Conv}_{1 \times 1}(X) \end{aligned}$$
(3)

The final output is given by:

$$\begin{aligned} X_{out} = \text {Conv}_{1 \times 1} \left( \text {Concat} \left( F_1(X_{in}), F_2(X_{in}), F_3(X_{in}) \right) \right) \end{aligned}$$
(4)

This module addresses the problem of insufficient feature representation in small object detection within remote sensing imagery. By employing a multi-path design, it enhances feature extraction capabilities using convolutional operations at different scales to preserve spatial details and mitigate information loss. In the \(F_1\) path, an attention module (indicated by the gray block on the left) is introduced, where self-attention based on query (Q), key (K), and value (V) computations strengthens the model’s focus on small objects, thereby significantly improving the detection rate of small targets in remote sensing scenarios.
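For concreteness, the sketch below assembles Eqs. (1)–(4) in PyTorch. A placeholder `SimpleiRMB` stands in for the iRMB block51, each branch is fed the full input rather than a channel split, and an extra projection after the branch concatenation is added to keep channel counts consistent; these simplifications and the channel widths are assumptions, so the code is illustrative rather than the reference implementation.

```python
import torch
import torch.nn as nn

class SimpleiRMB(nn.Module):
    """Placeholder for the iRMB block of Eq. (1): single-head spatial self-attention,
    a 3x3 depthwise conv, a 1x1 projection, and a residual connection."""
    def __init__(self, ch: int):
        super().__init__()
        self.qkv = nn.Conv2d(ch, 3 * ch, 1)
        self.dw = nn.Conv2d(ch, ch, 3, padding=1, groups=ch)
        self.proj = nn.Conv2d(ch, ch, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.qkv(x).flatten(2).chunk(3, dim=1)                 # each (b, c, h*w)
        attn = torch.softmax(q.transpose(1, 2) @ k / c ** 0.5, dim=-1)   # (b, hw, hw)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)                # V (QK^T)
        return self.proj(self.dw(out) + out) + x                         # Eq. (1)

class MPAFSketch(nn.Module):
    """Illustrative three-path MPAF block following Eqs. (1)-(4)."""
    def __init__(self, ch: int, branches: int = 2, depth: int = 2):
        super().__init__()
        # F1: attention refinement path -- m parallel branches of stacked iRMB blocks.
        self.f1 = nn.ModuleList([
            nn.Sequential(*[SimpleiRMB(ch) for _ in range(depth)]) for _ in range(branches)
        ])
        self.f1_proj = nn.Conv2d(branches * ch, ch, 1)
        # F2: efficient feature enhancement path (1x1 conv -> depthwise -> pointwise), Eq. (2).
        self.f2 = nn.Sequential(
            nn.Conv2d(ch, ch, 1),
            nn.Conv2d(ch, ch, 3, padding=1, groups=ch),
            nn.Conv2d(ch, ch, 1),
        )
        # F3: information preservation path (single 1x1 conv), Eq. (3).
        self.f3 = nn.Conv2d(ch, ch, 1)
        # Final fusion of the concatenated paths, Eq. (4).
        self.fuse = nn.Conv2d(3 * ch, ch, 1)

    def forward(self, x):
        f1 = self.f1_proj(torch.cat([branch(x) for branch in self.f1], dim=1))
        return self.fuse(torch.cat([f1, self.f2(x), self.f3(x)], dim=1))
```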

Scale-aware feature enhancement (SAFE)

Overall structure

In remote sensing object detection, traditional feature fusion techniques, such as PAFPN, aim to enhance multi-scale detection performance by aggregating features from different network layers. However, PAFPN encounters several challenges when applied to remote sensing imagery, particularly in terms of handling objects with varying scales and addressing cluttered backgrounds. First, PAFPN lacks sufficient adaptability in the fusion of features across multiple scales, hindering its ability to effectively represent both small and large objects within complex remote sensing scenes. Second, conventional fusion methods are vulnerable to background noise interference, which compromises detection accuracy and increases the likelihood of false positives and missed detections. Additionally, PAFPN often relies on simple concatenation and stacking operations for feature aggregation, which can lead to feature degradation and inadequate representation of the underlying spatial and semantic information.

To address the limitations outlined above, we propose a novel encoder architecture, illustrated in the Encoder section of Fig. 3. SAFE introduces an adaptive scale-aware mechanism designed to enhance the network’s ability to detect objects at multiple scales. Initially, SAFE employs a Transformer module to capture global dependencies, thereby improving the spatial awareness of feature maps, as depicted in Fig. 6b. Subsequently, a HyperACE module refines the feature representations, enabling the model to focus on salient regions while suppressing background noise, as shown in Fig. 6a.

SAFE significantly enhances the contextual representation of multi-scale features through the integration of a customized MFC module and a feature diffusion mechanism. This design ensures that each feature map not only preserves local spatial details but also incorporates global semantic context via deep cross-scale integration, thereby providing more discriminative and robust representations for downstream detection and classification tasks. The MFC module serves as the central innovation within SAFE and is depicted in Fig. 6c.

Fig. 6. Core architecture of the SAFE module, comprising three sub-modules: (a) HyperACE, (b) Transformer, and (c) MFC.

The feature diffusion mechanism in SAFE operates via bidirectional pathways: top-down and bottom-up, effectively propagating context-rich features across different detection scales. In the first stage, outputs from the feature focusing module are downsampled (from \(P_4\) to \(P_5\)) and upsampled (from \(P_4\) to \(P_3\)) to be fused with higher- and lower-resolution features, generating the initial multi-scale representation. In the second stage, a secondary feature aggregation process, facilitated by MFC and coupled with deep feature enhancement, enables thorough feature integration and the diffusion of contextual information.

This two-stage feature diffusion strategy effectively mitigates the scale-induced information loss commonly observed in traditional feature pyramid networks. It substantially enhances the consistency and expressiveness of multi-scale features, particularly in complex remote sensing object detection scenarios.
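A minimal sketch of the first diffusion stage is shown below, assuming a stride-2 convolution for the \(P_4 \rightarrow P_5\) path and nearest-neighbor interpolation for the \(P_4 \rightarrow P_3\) path; the actual operators and fusion order used in SAFE may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffusionStageSketch(nn.Module):
    """First-stage feature diffusion: spread the focused P4-level feature up to P3 and
    down to P5, then fuse with the original pyramid levels (illustrative only)."""
    def __init__(self, ch: int):
        super().__init__()
        self.down = nn.Conv2d(ch, ch, 3, stride=2, padding=1)  # P4 -> P5 resolution
        self.fuse3 = nn.Conv2d(2 * ch, ch, 1)                  # fuse upsampled P4 with P3
        self.fuse5 = nn.Conv2d(2 * ch, ch, 1)                  # fuse downsampled P4 with P5

    def forward(self, p3, p4_focused, p5):
        up = F.interpolate(p4_focused, size=p3.shape[-2:], mode="nearest")
        down = self.down(p4_focused)
        n3 = self.fuse3(torch.cat([p3, up], dim=1))
        n5 = self.fuse5(torch.cat([p5, down], dim=1))
        return n3, p4_focused, n5
```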

HyperACE

HyperACE (Hypergraph-based Adaptive Correlation Enhancement) is an innovative feature enhancement module introduced in YOLOv1328, designed to model high-order relationships among multi-scale features through hypergraph computation, as shown in Fig. 6a. It utilizes an adaptive hyperedge generation mechanism that dynamically captures complex dependencies across spatial positions and channels. By leveraging both global average pooling and max pooling to generate contextual vectors, and combining these vectors with learnable hyperedge prototypes, HyperACE adaptively estimates the participation strength of each vertex in every hyperedge. This approach facilitates more effective feature aggregation and enhancement.

In our framework, the fused features are passed through the HyperACE module, enabling adaptive high-order correlation modeling and feature refinement across both spatial locations and detection scales. Subsequently, the FullPAD mechanism ensures efficient and structured information propagation by distributing the correlation-enhanced features through three distinct pathways: from the backbone to the neck, within the inner layers of the neck, and from the neck to the detection head. This organization guarantees a seamless and effective flow of enriched features throughout the network.
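The simplified sketch below conveys the adaptive hyperedge idea: pooled context vectors condition learnable hyperedge prototypes, a soft participation (incidence) matrix links spatial vertices to hyperedges, and features are aggregated vertex-to-hyperedge and back. It is a loose interpretation of HyperACE28 for illustration, not the YOLOv13 implementation; the normalization choice and the number of hyperedges are assumptions.

```python
import torch
import torch.nn as nn

class HyperACESketch(nn.Module):
    """Simplified adaptive hypergraph correlation: soft vertex-to-hyperedge participation
    followed by two-step aggregation (vertices -> hyperedges -> vertices)."""
    def __init__(self, ch: int, num_edges: int = 8):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_edges, ch))  # learnable hyperedge prototypes
        self.ctx = nn.Linear(2 * ch, ch)                            # fuse avg- and max-pooled context
        self.proj = nn.Conv2d(ch, ch, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        v = x.flatten(2).transpose(1, 2)                                    # (b, hw, c) vertex features
        ctx = torch.cat([x.mean(dim=(2, 3)), x.amax(dim=(2, 3))], dim=1)    # (b, 2c) global context
        edges = self.prototypes.unsqueeze(0) + self.ctx(ctx).unsqueeze(1)   # (b, E, c) conditioned edges
        # Participation strength of each vertex in each hyperedge (soft incidence matrix).
        part = torch.softmax(v @ edges.transpose(1, 2) / c ** 0.5, dim=1)   # (b, hw, E)
        edge_feat = part.transpose(1, 2) @ v                                # aggregate vertices into hyperedges
        out = part @ edge_feat                                              # redistribute back to vertices
        return self.proj(out.transpose(1, 2).reshape(b, c, h, w)) + x
```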

Multi-level feature concentration (MFC)

To address two core challenges in remote sensing object detection, insufficient multi-scale object perception and a lack of contextual information, we propose an efficient multi-scale feature fusion mechanism, the MFC module, as illustrated in Fig. 6c. This module employs depthwise separable convolutions (DWConv) with multiple kernel sizes (5\(\times\)5, 7\(\times\)7, 9\(\times\)9, and 11\(\times\)11) to capture spatial context across various object scales. Multi-kernel convolutions allow the model to observe the same scene from diverse receptive fields, significantly enhancing the contextual awareness of local regions and improving detection performance in occluded or densely populated small object scenarios.

An identity mapping path is included, and residual fusion is applied at the end to integrate shallow detail features with deep semantic information, thereby alleviating inconsistencies caused by semantic gaps across different layers. The use of depthwise separable convolutions also substantially reduces parameter count and computational complexity, allowing the module to be seamlessly integrated into complex networks and improving overall inference efficiency.

For the low-level feature map \(P_3\), an adaptive downsampling operation (ADown) is applied to match the spatial resolution of \(P_4\) and \(P_5\). For the mid-level \(P_4\) and high-level \(P_5\) features, a \(1 \times 1\) convolution is used to unify channel dimensions.

$$\begin{aligned} F_3 = \text {ADown}(P_3), \quad F_4 = \text {Conv}_{1 \times 1}(P_4), \quad F_5 = \text {Conv}_{1 \times 1}(P_5) \end{aligned}$$
(5)

where \(F_3\), \(F_4\), and \(F_5\) denote the aligned feature maps produced from the corresponding pyramid levels.

The three multi-scale feature maps are first concatenated to integrate contextual information across different resolutions. The fused representation is then processed by multiple depthwise separable convolution layers with different kernel sizes, enabling the extraction of spatial semantics under diverse receptive fields. Finally, the outputs of all branches are aggregated via element-wise addition to produce a unified multi-scale representation.

$$\begin{aligned} F_{concat} & = \text {Concat}(F_3, F_4, F_5) \\ D_k & = \text {DWConv}_{k \times k}(F_{concat}), \quad k \in \{5, 7, 9, 11\} \\ O_{\text {multi}} & = D_5 + D_7 + D_9 + D_{11} + D_{\text {ID}} \end{aligned}$$
(6)

where \(D_k\) denotes the depthwise separable convolution with kernel size \(k \in \{5, 7, 9, 11\}\) applied to \(F_{concat}\), \(D_{\text{ID}}\) is the identity-mapped feature from the preservation path, and \(O_{\text{multi}}\) is the aggregated output after the multi-kernel convolutions.

The aggregated feature is subsequently passed through a \(1 \times 1\) convolution to compress the channel dimension, reducing redundancy while retaining key semantic information. Finally, to enhance feature expressiveness and stabilize network training, a residual connection is applied by adding the initial concatenated feature to the fused output \(F_{\text {fused}}\), producing the final output feature map \(F_{\text {out}}\).

$$\begin{aligned} \begin{gathered} F_{\text {fused}} = \text {Conv}_{1 \times 1}(O_{\text {multi}}) \\ F_{\text {out}} = F_{\text {fused}} + F_{\text {concat}} \end{gathered} \end{aligned}$$
(7)

By jointly incorporating multi-scale aggregation, semantic alignment, and a lightweight design, the MFC module effectively bridges the gap between shallow detail features and deep semantic representations. This integration substantially enhances the expressiveness and robustness of the learned features.
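Under the assumption that ADown can be approximated by a strided convolution and that \(P_5\) is resized to \(P_4\)'s resolution before concatenation (details the text leaves open), Eqs. (5)–(7) could be implemented roughly as follows; channel widths are likewise assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFCSketch(nn.Module):
    """Illustrative MFC module following Eqs. (5)-(7)."""
    def __init__(self, c3: int, c4: int, c5: int, ch: int):
        super().__init__()
        self.adown = nn.Conv2d(c3, ch, 3, stride=2, padding=1)  # Eq. (5): F3 = ADown(P3)
        self.p4_proj = nn.Conv2d(c4, ch, 1)                     # Eq. (5): F4 = Conv1x1(P4)
        self.p5_proj = nn.Conv2d(c5, ch, 1)                     # Eq. (5): F5 = Conv1x1(P5)
        cat_ch = 3 * ch
        # Eq. (6): depthwise convolutions with kernel sizes 5, 7, 9, and 11.
        self.dw = nn.ModuleList([
            nn.Conv2d(cat_ch, cat_ch, k, padding=k // 2, groups=cat_ch) for k in (5, 7, 9, 11)
        ])
        self.compress = nn.Conv2d(cat_ch, cat_ch, 1)            # Eq. (7): 1x1 channel compression

    def forward(self, p3, p4, p5):
        f3 = self.adown(p3)
        f4 = self.p4_proj(p4)
        f5 = F.interpolate(self.p5_proj(p5), size=f4.shape[-2:], mode="nearest")
        f_cat = torch.cat([f3, f4, f5], dim=1)                  # F_concat
        o_multi = f_cat + sum(dw(f_cat) for dw in self.dw)      # D_ID + D_5 + D_7 + D_9 + D_11
        return self.compress(o_multi) + f_cat                   # F_out = F_fused + F_concat
```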

Experiments

Datasets

The SIMD52 remote sensing object detection dataset is a fine-grained benchmark comprising 5000 high-resolution optical satellite images collected across diverse geographic regions and various seasonal conditions, using satellite-based acquisition platforms. It contains 45,096 annotated object instances spanning 15 categories: Car, Truck, Van, Long Vehicle, Bus, Airliner, Propeller Aircraft, Trainer Aircraft, Chartered Aircraft, Fighter Aircraft, Other, Stair Truck, Pushback Truck, Helicopter, and Boat. Each instance is annotated with precise location, object class, and bounding box information, ensuring detailed and rigorous evaluation. Data analysis reveals three major challenges in the SIMD dataset. An extremely wide scale distribution is observed, with object sizes ranging from very small to large, as illustrated in the left plot of Fig. 7a. Most instances lie within the normalized range of [0, 0.5], resulting in a long-tail distribution. In addition, the dataset exhibits pronounced category imbalance: the Car class contains 20,504 instances, whereas the Boat class has only 49. Another challenge arises from the high degree of visual similarity among object types within the Aircraft and Vehicle categories, where overlaps in spatial scale further complicate fine-grained classification.

The VEDAI53 dataset comprises 1246 high-resolution aerial images (1024\(\times\)1024 pixels, 12.5 cm/pixel spatial resolution), collected between 2012 and 2014 under varied lighting and weather conditions across both rural and urban scenes. These images were captured using aerial platforms, ensuring a wide range of acquisition perspectives. It contains 3640 annotated instances across 7 vehicle categories, with each instance labeled by location, orientation, and class. Compared with SIMD, VEDAI presents two major difficulties. Small objects dominate the dataset, with a large proportion concentrated in the normalized size range of [0.0, 0.2]. Moreover, category imbalance is severe: the Car class includes 1377 instances, whereas the Truck class has only 105, yielding a sample ratio greater than 13:1. This imbalance substantially limits the generalization ability of detection algorithms, particularly on underrepresented categories, as illustrated in Fig. 7b.

Fig. 7. Object size distribution and category statistics in the SIMD and VEDAI datasets. The left plot illustrates the scatter distribution of normalized object width versus height, with categories distinguished by color. The right plot shows the number of object instances per category, providing a comparative view of category frequencies across the two datasets.

Experimental environment

To ensure fairness in training and comparison across models, all ablation studies and experiments were conducted at the Super Intelligent Computing Center of Xijing University. The experiments were performed on NVIDIA A800 GPUs with 80 GB memory, paired with Intel 6338N Xeon CPUs, under the Red Hat Enterprise Linux Version 4.8.5–28 (https://www.redhat.com/) operating system. The software environment was configured with the CUDA Version 12.1 (https://developer.nvidia.com/cuda-toolkit), Python Version 3.11 (https://www.python.org/), and PyTorch Version 2.1 (https://pytorch.org/). The complete hardware configuration used for both training and testing is summarized in Table 2.

Table 2 Configuration of experiment environments.

The hyperparameter configurations used in our experiments are summarized in Table 3. Training was performed for a total of 160 epochs, including 78 flat epochs. A warm-up phase of 2000 iterations (warmup_iter) was applied to stabilize the early stages of training. The initial learning rate was set to 0.0008, with a weight decay coefficient of 0.0001. The batch size was fixed at 4. To improve training stability, we employed the Exponential Moving Average (EMA) strategy, while Automatic Mixed Precision (AMP) was disabled to ensure numerical consistency and convergence reliability.
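For reference, the Table 3 settings can be collected into a configuration of the following shape; the key names are illustrative and do not correspond to the exact fields of the DEIM configuration files.

```python
# Illustrative training configuration mirroring Table 3 (key names are assumptions).
train_cfg = {
    "epochs": 160,            # total training epochs
    "flat_epochs": 78,        # flat-schedule epochs within the total
    "warmup_iter": 2000,      # warm-up iterations to stabilize early training
    "base_lr": 0.0008,        # initial learning rate
    "weight_decay": 0.0001,   # weight decay coefficient
    "batch_size": 4,
    "use_ema": True,          # Exponential Moving Average of weights enabled
    "use_amp": False,         # Automatic Mixed Precision disabled for numerical consistency
}
```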

Table 3 Hyperparameter settings for training HyperFusion-DEIM.

Ablation study

To evaluate the individual and combined contributions of the key components in the proposed HyperFusion-DEIM framework, we perform ablation experiments by incrementally integrating the SRFD, MPAF, and SAFE modules. Using DEIM as the baseline, we assess the impact of each module on detection accuracy and computational efficiency. We use Average Precision (AP) as the primary metric; it evaluates the accuracy of object detection models by summarizing the precision-recall curve into a single value, averaging precision over different recall levels. For a specific IoU threshold (e.g., 0.50), \(AP_{50}\) is computed by considering only those predictions whose IoU with the ground truth box is greater than or equal to 0.50. In more comprehensive evaluations, \(AP_{50:95}\) averages the precision over multiple IoU thresholds (from 0.50 to 0.95 in steps of 0.05), giving a more robust measure of detection accuracy at varying levels of localization precision. In addition, we report the number of parameters (Params) and the computational complexity (GFLOPs), providing a comprehensive analysis of detection performance, inference cost, and model compactness.
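As a small numerical illustration of the \(AP_{50:95}\) convention (the per-threshold AP values below are hypothetical, not measured results), the metric is simply the mean of the AP obtained at each IoU threshold from 0.50 to 0.95 in steps of 0.05:

```python
# Hypothetical per-threshold AP values for one model; real values come from the evaluator.
ap_per_iou = {0.50: 0.836, 0.55: 0.82, 0.60: 0.80, 0.65: 0.77, 0.70: 0.73,
              0.75: 0.69, 0.80: 0.63, 0.85: 0.55, 0.90: 0.42, 0.95: 0.21}

ap_50 = ap_per_iou[0.50]                               # AP at a single IoU threshold
ap_50_95 = sum(ap_per_iou.values()) / len(ap_per_iou)  # mean over IoU = 0.50:0.05:0.95
print(f"AP50 = {ap_50:.3f}, AP50:95 = {ap_50_95:.3f}")
```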

Table 4 Ablation study on the SIMD dataset.

As summarized in Table 4, adding the SRFD module alone (Row 2) introduces negligible increases in parameters and GFLOPs, while slightly improving \(AP_{50}\) from 76.6 to 76.9%. Incorporating MPAF alone (Row 3) substantially increases model size and computational complexity, but also delivers a significant accuracy gain, reaching 80.6%. The SAFE module (Row 4) enhances detection performance through improved global context modeling, achieving \(AP_{50}\) of 78.8% with relatively low parameter and computation cost (4.67 M, 11.27 GFLOPs), striking a favorable balance between efficiency and expressiveness. Combining SRFD and MPAF (Row 5) further improves accuracy to 82.9%, demonstrating the strong synergy between shallow detail preservation and mid-level multi-scale fusion. Finally, the full model (Row 6), integrating all three modules, attains the best results with \(AP_{50}\) of 83.6% and 67.2% in \(AP_{50:95}\), corresponding to gains of 7.0% and 6.3% over the baseline.

These results validate the complementary contributions of the proposed modules and the enhanced overall modeling capacity for remote sensing object detection. The ablation study confirms that each component positively influences the trade-off between accuracy and computational complexity, forming an integrated framework particularly effective for detecting small targets and capturing rich semantic information in complex remote sensing scenes.

On the VEDAI dataset, the impact of different module combinations on model performance was similarly evaluated, as summarized in Table 5. Compared with configurations using a single module, the collaborative integration of multiple modules produces more stable and superior detection results. While MPAF or SAFE individually provides measurable improvements, the highest performance is obtained when all three modules are combined. In this configuration, the model achieves an \(AP_{50}\) of 53.7% and an \(AP_{50:95}\) of 33.8%, substantially exceeding the baseline by 12% and 8.7%, respectively. These results underscore the compatibility and strong complementarity among the three modules. Moreover, the integrated design maintains a favorable balance between detection accuracy and computational cost, representing an effective enhancement over the baseline model.

Table 5 Ablation study on the VEDAI dataset.

Table 6 summarizes the ablation results of embedding the MPAF module at different backbone stages (Stage2–Stage5) on the VEDAI dataset. The results show that progressively integrating MPAF across multiple stages consistently improves detection performance, with \(AP_{50}\) rising from 42.9% with MPAF applied only at Stage5 to 44.2% when it is incorporated across all stages. Although multi-stage deployment introduces additional computational cost (GFLOPs increasing from 22.65 to 74.85), it substantially enhances multi-scale feature representation while preserving semantic consistency. These findings demonstrate the hierarchical adaptability and cumulative benefit of the MPAF module in multi-stage feature fusion.

Table 6 Ablation experiments of MPAF module on VEDAI dataset.

Comparative experiments

Table 7 presents a comprehensive comparison between the proposed HyperFusion-DEIM and several state-of-the-art real-time object detectors. Compared with its baseline, DEIM, HyperFusion-DEIM demonstrates notable improvements across AP metrics for various model variants. Specifically, the N variant outperforms DEIM by 4.6% in AP, 5.5% in \(AP_{50}\), and 4.6% in \(AP_{75}\), while the S variant achieves gains of 4.3% in AP, 1.4% in \(AP_{50}\), and 3.6% in \(AP_{75}\). Although the parameter count and GFLOPs increase slightly, the inference speed (FPS) improves by nearly 50%, indicating an excellent trade-off between accuracy and efficiency.

Table 7 Comparison with real-time object detectors on SIMD-test.

In comparison with the YOLO family, HyperFusion-DEIM achieves a superior balance between detection accuracy and computational cost. For instance, HyperFusion-DEIM-S maintains comparable FPS and AP to YOLOv8-S, while improving \(AP_{50}\) by 1.9%, \(AP_s\) by 1.1%, and \(AP_m\) by 4.4%. Against more recent variants such as YOLOv11 and YOLOv12, HyperFusion-DEIM consistently attains higher AP across small, medium, and large object categories. Both the N and S variants deliver favorable FPS while achieving elevated AP values through optimized computational efficiency. Notably, in scenarios involving small objects and complex backgrounds, the proposed model leverages multi-scale feature fusion and enhanced contextual modeling to substantially improve robustness and detection precision.

Furthermore, when compared with the latest RT-DETR series, HyperFusion-DEIM provides superior accuracy and efficiency. It achieves an AP of 82.3%, surpassing RT-DETR-R50’s 79.2%, while simultaneously offering a significant FPS improvement. These results validate the effectiveness of the proposed architecture for real-time detection tasks. Overall, through refined feature fusion and enhanced context awareness, HyperFusion-DEIM delivers substantial gains in both accuracy and computational efficiency, establishing it as a compelling solution for high-performance real-time remote sensing object detection.

Figure 8 provides a comparative evaluation of HyperFusion-DEIM against several lightweight real-time object detectors, including YOLOv8-N, YOLOv11-N, YOLOv12-N, D-Fine-N, and DEIM-N. As shown in Fig. 8a, HyperFusion-DEIM achieves a throughput of 296.3 FPS, substantially higher than all competing methods, with particularly large margins over YOLOv8-N and YOLOv11-N. Figure 8b further reports the multi-scale AP results. HyperFusion-DEIM consistently outperforms alternative models across all metrics, including AP and \(AP_{50}\), with especially pronounced improvements in overall detection accuracy. Although the proposed framework incurs slightly higher computational complexity, it maintains an excellent balance between accuracy and efficiency. These results highlight the effectiveness and practicality of HyperFusion-DEIM for high-performance real-time object detection tasks in challenging remote sensing scenarios.

Fig. 8. Performance comparison of different detection models.

Fig. 9. Multi-class object detection results of HyperFusion-DEIM on the SIMD dataset.

Figure 9 illustrates the superior performance of HyperFusion-DEIM on multi-class object detection within the SIMD dataset, covering complex remote sensing scenes such as urban blocks, docks, airports, and transportation hubs. The results demonstrate that diverse targets (e.g., vehicles, aircraft, and vessels) are accurately localized under challenging conditions, including high density, occlusion, and large scale variations, thereby reflecting both robustness and high precision. This capability stems from several key architectural components. The multi-stage MPAF structure in MAPNet enables effective multi-scale feature aggregation, while the SRFD module enhances low-level edge and texture cues, substantially improving the boundary delineation of small objects. The SAFE module, which integrates Transformer and HyperACE mechanisms, further strengthens semantic representation and suppresses background noise, as confirmed by activation maps that concentrate on true object regions, particularly along road boundaries and occluded areas. In addition, the MFC module bridges semantic gaps across layers by enhancing spatial alignment and saliency consistency, leading to fewer missed detections, more accurate localization, and sharper object contours, especially in regions with densely distributed small objects. Taken together, HyperFusion-DEIM establishes a unified architectural design for remote sensing small-object detection, integrating feature enhancement, contextual modeling, and fusion optimization in a closed-loop manner.

Figure 10 provides a comparative analysis of detection results and heatmaps on the SIMD dataset. Subfigure (a) shows the original image, (b) presents the baseline method, and (c) illustrates the proposed HyperFusion-DEIM. As shown, the DEIM baseline exhibits noticeable background interference and imprecise localization, particularly in regions containing small-scale vessels and densely clustered aircraft. The corresponding heatmaps reveal dispersed activation patterns, with significant missed detections and poorly localized bounding boxes. In contrast, HyperFusion-DEIM demonstrates more concentrated responses in target areas, with activation regions closely aligned with actual object locations. The heatmaps reflect clearer boundaries, stronger activations, and reduced background noise, highlighting the model’s improved sensitivity to small objects and its enhanced contextual modeling capabilities.

Fig. 10. Example detection results and heat maps on the SIMD database: (a) Original images, (b) DEIM, (c) HyperFusion-DEIM.

Table 8 presents a performance comparison between the proposed HyperFusion-DEIM-N and several state-of-the-art lightweight object detection models, including the YOLO series, RT-DETR series, as well as D-Fine-N and DEIM-N, evaluated on the VEDAI dataset. The proposed method achieves superior accuracy, attaining the highest \(mAP_{50}\) of 53.3% and \(mAP_{50:95}\) of 34.6%, surpassing advanced models such as RT-DETRv2-r18 and YOLOv8-N. Although HyperFusion-DEIM-N exhibits relatively higher model complexity (134.1M parameters and 79.7 GFLOPs), it maintains a competitive inference speed of 200.64 FPS, achieving an optimal balance between accuracy and efficiency. Compared to DEIM-N and D-Fine-N, our method significantly improves detection accuracy for small objects, while preserving a favorable inference rate. These results underscore the effectiveness and generalizability of the proposed multi-scale fusion and context-aware modeling strategies in remote sensing scenarios.

Table 8 Comparison results on the VEDAI database.

The RT-DETR-r18 model has the lowest FPS, primarily due to its large number of parameters and high computational cost, as indicated by its GFLOPs. This suggests that it is a more complex and heavier model than methods such as YOLOv8-N or D-Fine-N, which prioritize faster inference and lower computational overhead. HyperFusion-DEIM-N also has a relatively high number of parameters and GFLOPs, yet it still achieves a high frame rate (200.64 FPS), in clear contrast to RT-DETR-r18. This is because HyperFusion-DEIM separates training from inference and applies inference-time optimizations, such as simplifying the structure and removing redundant computations, which significantly boost inference speed.

Figure 11 presents a comparative visualization of detection results between the proposed method and two mainstream approaches, DEIM and D-Fine, on remote sensing images. The four columns display (a) ground truth annotations, (b) detection results of the proposed method, (c) DEIM, and (d) D-Fine, respectively.

Fig. 11. Comparison of detection effectiveness between HyperFusion-DEIM and popular methods.

As shown, the proposed method significantly outperforms DEIM and D-Fine in detection accuracy, with fewer false positives and missed detections across various complex scenes. Specifically, DEIM and D-Fine exhibit considerable missed detections (e.g., rows 2 and 4) and localization errors (e.g., row 3). In contrast, our method accurately delineates object boundaries (e.g., the small car in row 1) and reliably detects small or occluded targets in low-contrast regions, such as building edges and road intersections (e.g., row 5). This performance improvement is primarily attributed to the integrated enhancements in small object boundary modeling, multi-scale feature fusion, and context-aware perception within the HyperFusion-DEIM framework. Notably, the SAFE module, incorporating Transformer and HyperACE mechanisms, plays a key role in enhancing the model’s attention focus and its ability to detect objects in complex backgrounds. Overall, the proposed method demonstrates superior robustness and accuracy in remote sensing object detection, confirming its strong potential for practical deployment in real-world scenarios.

Conclusion

This paper introduces HyperFusion-DEIM, a novel framework for remote sensing object detection that integrates small object representation enhancement, contextual semantic modeling, and multi-scale feature fusion optimization. Specifically, we propose MAPNet, which incorporates multi-level MPAF modules to enhance multi-scale feature aggregation, while the SRFD module strengthens shallow semantic cues, improving the capture of boundary and texture features for small objects. Additionally, the proposed SAFE encoder combines Transformer and HyperACE mechanisms to model long-range semantic dependencies while preserving spatial details. The MFC module establishes deep cross-level connections and geometric alignment, improving both semantic consistency and spatial resolution. Extensive experiments conducted on two benchmark datasets, SIMD and VEDAI, demonstrate that HyperFusion-DEIM outperforms state-of-the-art lightweight detection models (e.g., YOLO variants, RT-DETR, and DEIM), achieving superior accuracy and robustness in scenarios with densely distributed small objects and complex backgrounds, while maintaining competitive inference speed.

Despite the promising performance of the proposed framework, several limitations remain. The introduction of multi-branch modules and Transformer blocks increases both model complexity and computational cost, potentially limiting its deployment on resource-constrained platforms, such as embedded edge devices. Furthermore, the model’s performance may degrade under extreme conditions, such as low illumination, motion blur, or severe occlusion, suggesting that robustness in these challenging scenarios is an area for further improvement. Future work will focus on enhancing model compression and deployment efficiency, exploring strategies such as knowledge distillation and lightweight attention mechanisms to mitigate computational overhead. Additionally, we plan to explore multi-modal remote sensing fusion, incorporating data from sources like LiDAR, SAR, and multispectral imagery, to improve semantic discrimination and cross-domain generalization. This will extend the applicability of our method to real-world intelligent remote sensing interpretation tasks.