Introduction

In recent years, object detection in aerial imagery has attracted widespread attention due to its broad applications in urban traffic management1, environmental monitoring2, and disaster management3, particularly in UAV-based object detection4,5. The complexity of UAV-based object detection lies not only in the dynamic nature of the aerial environment but also in the challenges posed by variations in object scale, resolution, and environmental conditions (e.g., low-light and nighttime scenarios). To improve detection performance in these challenging environments, multispectral imaging (integrating visible and infrared (IR) images) has become a powerful tool that provides complementary information to enhance detection accuracy and robustness.

Fig. 1
figure 1

Example of the edge blurring problem in RGB-IR object detection. The first row shows infrared images, and the second row shows visible light images. In low-light environments, the object edges in infrared images become blurred, making it difficult to distinguish the foreground from the background. In visible light images, target detection is impossible due to insufficient lighting.

The fusion of visible and infrared images presents several challenges. Misalignment between the two modalities, caused by variations in flight dynamics, sensor characteristics, and environmental factors, can significantly hinder effective feature extraction6,7,8. DroneVehicle, for instance, contains misaligned RGB–IR pairs due to UAV motion and sensor differences; to evaluate both misalignment robustness and generalization ability, we further test CMEE-Det on two well-aligned multispectral datasets, DVTOD and LLVIP. Furthermore, object representations differ across modalities (for example, infrared images lack color and texture information), requiring complex methods to extract and combine complementary features9. Multi-scale object detection is also a challenge10, especially in UAV-based imagery, where varying flight altitudes cause objects to appear at different sizes, complicating detection. In addition, in harsh environments such as low-light or adverse weather conditions, as shown in Fig. 1, object edges in infrared images may become blurred11, making it difficult to distinguish the foreground from the background. The quality of visible light images drops sharply at night due to insufficient lighting, further complicating object detection.

Traditional mainstream methods for multispectral object detection can be divided into two categories: pixel-level fusion14,16,17 and feature-level fusion19,20,21,22. These fusion strategies combine multimodal information at different levels, significantly improving object detection accuracy compared to single-modality methods. However, these methods often overlook explicit cross-modal guidance, which allows one modality to provide richer cues that help the other modality generate more valuable feature representations for multimodal fusion.

Recently, many multimodal remote sensing object detection methods have been proposed12,13. Most efforts have focused on designing complex network architectures to improve detection performance. Ouyang et al.14 proposed a new multimodal feature-guided mask reconstruction pretraining method, M2FP, aimed at addressing the challenges of large-scale multi-scale object detection for multispectral scene perception from a UAV perspective. To enhance UAV object detection under extreme conditions such as low illumination and strong occlusion, Wang et al.12 proposed the Cross-modal Remote Sensing Image Object Detection (CRSIOD) network, which effectively integrates features from different sensors. Their design, including illumination-aware, uncertainty-aware modules, and a three-branch feature enhancement network, significantly improves detection performance in complex environments. Yuan et al. proposed a universal and efficient UniRGB-IR framework, which effectively integrates features from RGB and infrared images by introducing a multimodal feature pool (MFP) and a supplementary feature injector (SFI) module, significantly improving performance in RGB-IR semantic tasks. These methods are often computationally expensive, making them difficult to deploy on resource-limited UAV platforms. To address the issue of computational complexity, other studies focus on designing lightweight networks. Zhang et al.14 proposed a remote sensing image (RSI) object detection method called SuperYOLO, which effectively improves detection accuracy for multi-scale small objects by fusing multimodal data and incorporating auxiliary super-resolution learning. It also optimizes computational cost during the inference phase, achieving an excellent balance between accuracy and speed. However, this fusion method lacks cross-modal interaction and fails to fully leverage the advantages of multimodal features, making it susceptible to modality attacks.

To address these challenges, we propose a novel cross-modal edge-enhanced detector for multispectral object detection in UAVs. To handle variations in object scale and resolution, we first introduce the Multi-Scale Feature Fusion Module (MSFFM), which captures fine-grained details at different levels of the feature hierarchy. This enables the model to detect objects of varying sizes and adapt to changes in object resolution caused by UAV flight dynamics. Specifically, we use dilated convolutions to expand the receptive field by inserting “gaps” between kernel elements, without increasing the kernel size. This approach allows the model to capture features at multiple scales, improving its ability to detect objects of various sizes and providing a more comprehensive scene representation. By incorporating dilated convolutions, the model can effectively perform multi-scale detection, ensuring stable performance in complex and dynamic environments. In multispectral object detection, object boundary information is crucial for distinguishing objects from the background. To this end, we introduce the Edge Feature Enhancement Module (EFEM) in both visible and infrared images, designed to accentuate object contours, particularly in low-contrast or complex background settings, enhancing object distinguishability. Specifically, the core idea of the edge enhancement module is to apply differential convolutions to subtract the weighted fusion pooling layer from the input feature map, thereby enhancing the edge features in the image. Inspired by the success of transformer-based attention mechanisms, we design the Cross-Modal Feature Fusion Module (CMFFM) to facilitate effective interaction between visible and infrared features. By leveraging self-attention mechanisms, the module learns to adjust and fuse complementary information from both modalities, enhancing the model’s robustness and improving feature representations across both spectra. To ensure that both low-level edge details and high-level contextual information are preserved in the final detection representation, the enhanced edge features and global features from both modalities are combined. This enables the model to capture both local fine-grained features and global contextual dependencies, resulting in more accurate object detection performance.

The contributions of this paper are summarized as follows:

  1. (1)

    We address the issue of edge blurring in RGB-IR object detection for aerial images. To the best of our knowledge, this is the first work to identify and analyze the edge blurring problem in RGB-IR aerial image object detection.

  2. (2)

    We propose a cross-modal edge-enhanced detector to tackle challenges such as modality misalignment, scale variation, and edge blurring in UAV-based multispectral object detection.

  3. (3)

    We conducted extensive experiments on UAV-based multispectral datasets, including DroneVehicle, DVTOD, and LLVIP, demonstrating the effectiveness of our method and showing significant improvements in detection performance compared to existing methods.

Related work

Oriented object detection

Due to significant variations in flight altitude, angle, and scene coverage, orientation-based object detection methods are more suitable for UAV-based object detection. Dai et al.15 proposed TARDet, a two-stage anchor-free rotated object detector, which generates coarse localization boxes using feature refinement and directed generation modules, and extracts rotation-invariant features through an alignment convolution module. Ding et al.23 proposed a lightweight module, the RoI Transformer, which addresses the mismatch between RoIs and objects in aerial images through spatial transformations, significantly improving classification and localization accuracy. Recent work by Taghipour & Ghassemian focused on spatial-spectral fusion for hyperspectral anomaly detection by analyzing spatial patterns, improving feature extraction reliability60. In 2021, they introduced a visual attention-driven framework for hyperspectral anomaly detection, enhancing performance by incorporating spatial-spectral features through attention mechanisms61. Xie et al.24 proposed Oriented R-CNN, an efficient and simple orientation-based object detection framework that resolves the speed bottleneck in existing two-stage detectors: the first stage generates high-quality oriented proposals using an oriented RPN, and the second stage refines and recognizes them with an oriented R-CNN head. Researchers at Google25 introduced Xception, a novel deep convolutional neural network architecture inspired by Inception, which replaces traditional Inception modules with depthwise separable convolutions. In recent years, deep learning-based object detection methods have achieved state-of-the-art performance26,27,28,29. To further advance aerial object detection, Feng et al. introduced the HazyDet dataset. The aforementioned methods have achieved promising results in RGB-based object detection, but they mainly focus on a single modality; under low-light conditions, RGB-based detection may struggle to achieve satisfactory results. As a result, researchers are exploring multispectral data to enhance object detection accuracy and robustness30,31,32,33. To address this challenge and ensure high performance, we propose a single-stage two-stream object detector.

Object detection in UAV images

Target detection in UAV images faces unique challenges, primarily due to significant variations in flight altitude, angle, and scene coverage. An effective approach to addressing these complexities is multi-scale feature fusion, which is crucial for handling targets of varying sizes. CFANet, proposed by Zhang et al.34, effectively addresses the challenge of detecting dense small objects in UAV images by using the Cross-Layer Feature Aggregation (CFA) module, Layered Association Spatial Pyramid Pooling (LASPP) module, and Adaptive Overlapping Slice (AOS) method, showing significant performance improvements across multiple datasets. Zhang et al.35 proposed SODNet, which uses an adaptive spatial parallel convolution module to improve real-time detection of small objects through specialized feature extraction and information fusion techniques. Zhang et al.36 introduced FANet, an arbitrary orientation remote sensing target detection method based on feature fusion and angle classification. By adopting angle prediction branches and Circular Smoothing Label (CSL) methods, angle regression is transformed into a classification problem, resolving the issue of abrupt changes in rotated frame boundaries. To tackle the challenges of large scale variations and dense small objects, Zhang et al.37 proposed SGMFNet, a UAV image object detection network that combines self-attention guidance and multi-scale feature fusion. This method effectively combines multi-scale features and enhances small object feature extraction by designing the Global-Local Feature Guidance (GLFG) module and the improved Parallel Sampling Feature Fusion (PSFF) module, thereby improving detection accuracy. Li et al.38 introduced a novel Perceptual Generative Adversarial Network (Perceptual GAN) model to enhance small object detection. The model improves the detection of small objects by transforming their low-quality representations into super-resolution representations, narrowing the gap between small and large object representations. However, these methods are all single-modality detection approaches, overlooking the complementary information between different modalities. An et al.39 proposed ECISNet, which enhances the feature extraction ability of multimodal data through the Cross-Modal Information Sharing (CIS) module and Modal Effectiveness Guidance (MEG) module, ensuring accurate detection even when one modality fails. Sun et al.20 developed UA-CMDet, a UAV-based cross-modal vehicle detection framework, which introduces an uncertainty perception module to quantify the uncertainty of each object and improves detection performance in low-light and complex environments through cross-modal information fusion. However, traditional multimodal image fusion methods usually focus on pixel-level feature fusion, overlooking the precise handling of edge information. To address this issue, our approach not only fully utilizes the complementary information between modalities but also considers edge blurring and multi-scale feature fusion in multispectral data, effectively enhancing object detection performance.

Fig. 2
figure 2

Cross-modal edge-enhanced detector. We propose a dual-stream feature extractor with cross-modal edge enhancement to extract modality-specific features. After obtaining modality-specific regional features, the MSFFM module captures multi-scale features, which are then merged by the CMFFM module to combine image feature pairs for class and bounding box prediction.

Method

Architecture

Based on a comprehensive analysis of the challenges and difficulties in UAV-based cross-modal object detection, we propose a cross-modal edge enhancement detector to address issues such as modality misalignment, scale variations, and edge blurring in multispectral object detection. The overall architecture, as shown in Fig. 2, consists of three main components: the Edge Feature Enhancement Module (EFEM), Multi-Scale Feature Fusion Module (MSFFM), and Cross-Modal Feature Fusion Module (CMFFM). First, the extracted features are fed into the EFEM, where a smoothing layer (DSConv3 × 3) is used to reduce noise and extract stable edge information. Next, average pooling, max pooling, differential convolution, and learned weight coefficients are combined to enhance the edge features. The edge-enhanced feature map is then passed to the MSFFM, which effectively captures information at different scales through adaptive pooling and dilated convolution. This is crucial for multimodal image fusion, especially when images have different resolutions or scales, as it ensures the model remains efficient and precise when handling objects of various sizes. The features are then sent to the CMFFM, where a cross-attention mechanism computes the correlation between queries and keys to effectively weight and fuse features from different modalities, ensuring that useful information from each modality is reinforced. Detailed information about each part will be presented in the following subsections.
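Viewed end to end, the pipeline can be summarized by the following minimal PyTorch-style skeleton. It is a sketch only: the constructor arguments, whether the EFEM/MSFFM weights are shared across the two streams, and the YOLOv5-style head are assumptions for illustration, not details taken from the released implementation.

```python
import torch.nn as nn

class CMEEDet(nn.Module):
    """High-level sketch of CMEE-Det (Fig. 2): dual-stream backbone -> EFEM -> MSFFM -> CMFFM -> head."""
    def __init__(self, backbone_rgb, backbone_ir, efem_rgb, efem_ir,
                 msffm_rgb, msffm_ir, cmffm, head):
        super().__init__()
        self.backbone_rgb, self.backbone_ir = backbone_rgb, backbone_ir
        self.efem_rgb, self.efem_ir = efem_rgb, efem_ir          # edge feature enhancement per stream
        self.msffm_rgb, self.msffm_ir = msffm_rgb, msffm_ir      # multi-scale feature fusion per stream
        self.cmffm = cmffm                                       # cross-modal feature fusion
        self.head = head                                         # class + bounding-box prediction

    def forward(self, x_rgb, x_ir):
        f_rgb, f_ir = self.backbone_rgb(x_rgb), self.backbone_ir(x_ir)
        e_rgb, e_ir = self.efem_rgb(f_rgb), self.efem_ir(f_ir)
        m_rgb, m_ir = self.msffm_rgb(e_rgb), self.msffm_ir(e_ir)
        fused = self.cmffm(m_rgb, m_ir)
        return self.head(fused)
```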

Fig. 3
figure 3

The structure of the Edge Feature Enhancement Module.

Edge feature enhancement module

Edge blurring is a common issue in multimodal object detection, particularly when dealing with visible light and infrared images. Due to the low contrast of infrared images and the loss of details in visible light images, the edge regions often appear unclear, leading to blurred edge information of the objects. This edge blurring issue directly affects object localization and classification, particularly in complex backgrounds or low-light conditions, where edge clarity is crucial. To effectively address this problem, we propose an edge enhancement mechanism that enhances the edge features in the image to improve object recognizability. The details are illustrated in Fig. 3.

First, the input visible light image \({X_{RGB}}\) and infrared image \({X_{IR}}\) are processed through depthwise separable convolution to achieve feature smoothing. This operation effectively removes low-frequency noise while preserving the primary structural information of the image.

$$\begin{gathered} {{X^{\prime}}_{RGB}}=DSConv3 \times 3({X_{RGB}}) \hfill \\ {{X^{\prime}}_{IR}}=DSConv3 \times 3({X_{IR}}) \hfill \\ \end{gathered}$$
(1)

Here, \(DSConv3 \times 3\) represents a 3 × 3 depthwise separable convolution, which applies a depthwise convolution to each channel followed by a pointwise convolution, making the computation more efficient.
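As a minimal sketch, the 3 × 3 depthwise separable convolution of Eq. (1) corresponds to a depthwise convolution followed by a pointwise convolution; the channel count below is illustrative.

```python
import torch
import torch.nn as nn

class DSConv3x3(nn.Module):
    """Depthwise separable 3x3 convolution used for feature smoothing (Eq. 1)."""
    def __init__(self, channels: int):
        super().__init__()
        # Depthwise: one 3x3 filter per channel (groups=channels).
        self.depthwise = nn.Conv2d(channels, channels, kernel_size=3,
                                   padding=1, groups=channels, bias=False)
        # Pointwise: 1x1 convolution mixing channels.
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))

# X'_RGB = DSConv3x3(X_RGB); the same operation is applied to X_IR.
smooth = DSConv3x3(channels=256)
x_rgb = torch.randn(2, 256, 80, 80)
x_rgb_smooth = smooth(x_rgb)   # same spatial size, smoothed features
```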

We then use average pooling and max pooling operations to extract features at different scales. To combine the advantages of these two pooling methods, we adopted the concept of hybrid pooling18 and proposed an adaptive weighted pooling mechanism. Specifically, we adaptively adjust the weight of average pooling and max pooling to more flexibly extract global and local features from the image.

$$\begin{gathered} avg1=AvgPool({{X^{\prime}}_{RGB}}),\hbox{max} 1=MaxPool({{X^{\prime}}_{RGB}}) \hfill \\ avg2=AvgPool({{X^{\prime}}_{IR}}),\hbox{max} 2=MaxPool({{X^{\prime}}_{IR}}) \hfill \\ \end{gathered}$$
(2)
$$\begin{gathered} new\_rgb={\lambda _{rgb}} \cdot avg1+(1 - {\lambda _{rgb}}) \cdot \hbox{max} 1 \hfill \\ new\_ir={\lambda _{ir}} \cdot avg2+(1 - {\lambda _{ir}}) \cdot \hbox{max} 2 \hfill \\ \end{gathered}$$
(3)

In the formula, \({X^{\prime}_{RGB}}\) and \({X^{\prime}_{IR}}\) represent the processed feature maps of the visible light and infrared images, with \({X^{\prime}_{RGB}},{X^{\prime}_{IR}} \in {R^{W \times H \times C}}\). \({\lambda _{rgb}}\) and \({\lambda _{ir}}\) are learnable scalar weights constrained to the range [0, 1]. \(new\_rgb\) and \(new\_ir\) denote the visible-light and infrared feature maps obtained through hybrid pooling, respectively. The hybrid-pooled features \(new\_rgb\) and \(new\_ir\) are then processed by differential convolution, which computes the difference between the hybrid-pooling-weighted features and the initial input features, resulting in edge-enhanced feature maps:

$$\begin{gathered} edg{e_{RGB}}=Con{v_{diff}}(new\_rgb) - {X_{RGB}} \hfill \\ edg{e_{IR}}=Con{v_{diff}}(new\_ir) - {X_{IR}} \hfill \\ \end{gathered}$$
(4)

\(Con{v_{diff}}\) denotes the differential convolution, implemented with a 3 × 3 kernel, stride 1, padding 1, and dilation rate r = 1 (i.e., a standard non-dilated convolution) to extract localized edge details; \(edg{e_{RGB}}\) and \(edg{e_{IR}}\) are the edge-enhanced feature maps of the visible light and infrared images, respectively.

$$\begin{gathered} weigh{t_{RGB}}=\sigma (BN(Conv(edg{e_{RGB}}))) \hfill \\ weigh{t_{IR}}=\sigma (BN(Conv(edg{e_{IR}}))) \hfill \\ \end{gathered}$$
(5)

\(Conv\) is the 1 × 1 convolution layer, \(BN\) represents BatchNorm, \(\sigma\) represents the Sigmoid function, and \(weigh{t_{RGB}}\) and \(weigh{t_{IR}}\) are the generated weights for the visible light and infrared images, respectively.

$$\begin{gathered} ou{t_{RGB}}=weigh{t_{RGB}} \times {X_{RGB}}+{X_{RGB}} \hfill \\ ou{t_{IR}}=weigh{t_{IR}} \times {X_{IR}}+{X_{IR}} \hfill \\ \end{gathered}$$
(6)

\(ou{t_{RGB}}\) and \(ou{t_{IR}}\) are the feature maps of the visible light and infrared images after edge enhancement through weighting.
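Putting Eqs. (1)–(6) together, a single-modality EFEM can be sketched as below. The size-preserving 3 × 3 pooling (stride 1, padding 1) and the scalar form of the learnable mixing weight are assumptions made so that the pooled map can be subtracted from the input; they are illustrative rather than taken from the authors' code.

```python
import torch
import torch.nn as nn

class EFEM(nn.Module):
    """Edge Feature Enhancement for one modality (sketch of Eqs. 1-6)."""
    def __init__(self, channels: int):
        super().__init__()
        self.smooth = nn.Sequential(                       # DSConv3x3 (Eq. 1)
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False),
            nn.Conv2d(channels, channels, 1, bias=False),
        )
        self.avg = nn.AvgPool2d(3, stride=1, padding=1)    # size-preserving pooling (assumption)
        self.max = nn.MaxPool2d(3, stride=1, padding=1)
        self.lam = nn.Parameter(torch.tensor(0.5))         # learnable avg/max mixing weight
        self.conv_diff = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.gate = nn.Sequential(                         # Eq. 5: 1x1 Conv + BN + Sigmoid
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        xs = self.smooth(x)                                         # Eq. 1
        lam = torch.clamp(self.lam, 0.0, 1.0)
        mixed = lam * self.avg(xs) + (1.0 - lam) * self.max(xs)     # Eqs. 2-3: hybrid pooling
        edge = self.conv_diff(mixed) - x                            # Eq. 4: differential convolution
        weight = self.gate(edge)                                    # Eq. 5: edge-based gate
        return weight * x + x                                       # Eq. 6: weighted residual enhancement
```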

Fig. 4
figure 4

The structure of the Multi-Scale Feature Fusion Module.

Multi-Scale feature fusion module

In multimodal fusion tasks, due to significant differences in resolution, scale, and feature distribution between visible light and infrared images, direct pixel-level or simple feature-level fusion often struggles to capture complementary information from both modalities, leading to degraded fusion performance. To address this issue, we propose a Multi-Scale Feature Fusion Module. The module utilizes depthwise separable convolution, pooling, and dilated convolution to process and enhance the input multi-channel image data, extracting useful spatial and contextual information. For clarity, we only show the MSFFM for the visible light branch. The detailed structure is shown in Fig. 4.

First, the weighted edge-enhanced visible light feature map \(ou{t_{RGB}}\) is processed using adaptive average pooling, generating dynamic convolution kernel templates from each feature map for the subsequent multi-scale dynamic convolution operation, as shown in the following formula:

$$ou{t^{\prime}_{RGB}}=Pool(ou{t_{RGB}})$$
(7)

\(Pool\) is the adaptive average pooling layer, which resizes the input feature map to a fixed 5 × 5 output size to extract multi-scale global features. \(ou{t^{\prime}_{RGB}} \in {R^{B \times C \times 5 \times 5}}\) represents the fixed-size convolutional kernels generated by the pooling operation. Then, for each sample \(i \in \{ 1,2,...,B\}\) in the batch, dynamic dilated convolution is applied to the corresponding feature map \(ou{t_{RGB}}\left[ i \right]\) using the generated dynamic convolution kernel \(ou{t^{\prime}_{RGB}}\left[ i \right] \in {R^{C \times 5 \times 5}}\). The dynamic convolution operation is formulated as follows:

$$out_{{RGB}}^{{(d)}}\left[ i \right]=Con{v_{dil=d}}(ou{t_{RGB}}\left[ i \right],ou{t^{\prime}_{RGB}}\left[ i \right]),d \in \{ 1,2,3\}$$
(8)

\(Con{v_{dil=d}}( \cdot )\) denotes the dilated convolution in the d-th branch. We use fixed dilation rates \(r \in \left\{ {1,2,4} \right\}\) across the three parallel branches of the MSFFM to capture multi-scale context without increasing the parameter count. By combining the results of the different dilated convolutions, the module captures contextual information at multiple scales, thereby enhancing its feature extraction capability.

$$out_{{RGB}}^{{new}}=\sum\nolimits_{{d=1}}^{D} {out_{{RGB}}^{d}}$$
(9)

\(out_{{RGB}}^{d}\) represents the output of the d-th dilated convolution, and D is the total number of dilation rates. The summation operation enhances the ability to capture target structures and contextual information by integrating feature information from different receptive fields. Then, a pointwise convolution is applied to the enhanced feature map \(out_{{RGB}}^{{new}}\), aiming to perform a linear combination and transformation of the channel dimension to enhance feature interaction between channels while compressing redundant information. Its mathematical expression is:

$${F_{RGB}}={W^{1 \times 1}} * out_{{RGB}}^{{new}}$$
(10)

\({F_{RGB}}\) is the output feature map after the pointwise convolution, \({W^{1 \times 1}}\) is the 1 × 1 convolution kernel corresponding to the input feature map, and \(*\) denotes the pointwise convolution operation. The pointwise convolution performs a weighted sum over all channels at each pixel using the 1 × 1 kernel.

Then, the feature map \({F_{RGB}}\) is further processed through depthwise separable convolution to extract local spatial information, utilizing the decomposed convolution operation to achieve efficient computation and fine-grained feature extraction. The specific formula is:

$$F_{{RGB}}^{{depth}}[c]={W^{3 \times 3}}[c] * {F_{RGB}}[c]$$
(11)

Here, c denotes the channel index, and \({W^{3 \times 3}}[c]\) represents the 3 × 3 convolution kernel for the corresponding channel. Similarly, the infrared image’s feature map \(F_{{IR}}^{{depth}}\) can also be obtained. The final output feature maps, \(F_{{RGB}}^{{depth}}\) and \(F_{{IR}}^{{depth}}\), are the results of multi-scale feature fusion. They retain the original resolution and structure of the input feature maps while enhancing the correlation between features at different resolutions.
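The MSFFM described by Eqs. (7)–(11) can be sketched for one stream as follows. Treating the pooled 5 × 5 template as a per-channel (depthwise) dynamic kernel is an assumption; the exact grouping and normalization may differ in the original implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSFFM(nn.Module):
    """Sketch of the Multi-Scale Feature Fusion Module (Eqs. 7-11) for one modality."""
    def __init__(self, channels: int, dilations=(1, 2, 4)):
        super().__init__()
        self.dilations = dilations
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1, bias=False)   # Eq. 10
        self.depthwise = nn.Conv2d(channels, channels, kernel_size=3, padding=1,
                                   groups=channels, bias=False)                     # Eq. 11

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, H, W = x.shape
        kernels = F.adaptive_avg_pool2d(x, output_size=5)      # Eq. 7: (B, C, 5, 5) kernel templates
        outputs = []
        for i in range(B):                                     # per-sample dynamic kernels
            w = kernels[i].unsqueeze(1)                        # (C, 1, 5, 5) depthwise weights
            xi = x[i].unsqueeze(0)                             # (1, C, H, W)
            branches = [F.conv2d(xi, w, padding=2 * d, dilation=d, groups=C)
                        for d in self.dilations]               # Eq. 8: dilated branches, size-preserving
            outputs.append(torch.stack(branches, dim=0).sum(dim=0))   # Eq. 9: sum over branches
        fused = torch.cat(outputs, dim=0)                      # (B, C, H, W)
        return self.depthwise(self.pointwise(fused))           # Eqs. 10-11
```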

Cross-modal feature fusion module

In multimodal fusion tasks, semantic inconsistencies may exist between features from different modalities. Direct stacking or simple fusion often leads to interference between modalities, thereby impairing the representational capacity of the fused features. To address this issue, we propose a Cross-Modal Feature Fusion Module, as shown in Fig. 5, which enhances the synergy between features of different modalities through cross-modal feature interaction and channel embedding operations, generating more expressive fused features.

First, the channels of \(F_{{RGB}}^{{depth}}\) and \(F_{{IR}}^{{depth}}\) are reduced through linear mapping to generate intermediate representations:

$$\begin{gathered} {Y_{RGB}},{U_{RGB}}=Chunk(\sigma (F_{{RGB}}^{{depth}}{W_{RGB}}+{b_{RGB}}),\dim = - 1) \hfill \\ {Y_{IR}},{U_{IR}}=Chunk(\sigma (F_{{IR}}^{{depth}}{W_{IR}}+{b_{IR}}),\dim = - 1) \hfill \\ \end{gathered}$$
(12)

Here, \({W_{RGB}} \in {R^{C \times (C/r \times 2)}}\) and \({W_{IR}} \in {R^{C \times (C/r \times 2)}}\) are the dimensionality-reduction weight matrices, \({b_{RGB}}\) and \({b_{IR}}\) are the biases, r denotes the reduction rate, \(\sigma\) represents the ReLU activation function, and \(Chunk\) splits the tensor into two parts along the last dimension, yielding \(Y\) and \(U\) for each modality. Then, a multi-head cross-attention mechanism is used to enhance the complementarity of the two modality features:

$${{\rm T}_{RGB}},{{\rm T}_{IR}}=CrossAttention({U_{RGB}},{U_{IR}})$$
(13)

Here, \({{\rm T}_{RGB}}\) and \({{\rm T}_{IR}}\) represent the features after cross-modal interaction. The calculation process is as follows: we first generate query, key, and value descriptors from the input features. Specifically, we use convolutional layers for each modality to generate the descriptors and reshape them into 2D tensors. The generation process is defined as follows:

$$\begin{gathered} {Q_{RGB}}=\Gamma (W_{{RGB}}^{Q} * {U_{RGB}}),{K_{RGB}}=\Gamma (W_{{RGB}}^{K} * {U_{RGB}}),{V_{RGB}}=\Gamma (W_{{RGB}}^{V} * {U_{RGB}}) \hfill \\ {Q_{IR}}=\Gamma (W_{{IR}}^{Q} * {U_{IR}}),{K_{IR}}=\Gamma (W_{{IR}}^{K} * {U_{IR}}),{V_{IR}}=\Gamma (W_{{IR}}^{V} * {U_{IR}}) \hfill \\ \end{gathered}$$
(14)

Where \({W^Q},{W^K} \in {R^{N \times {D_K}}}\) and \({W^V} \in {R^{N \times {D_V}}}\) are the 1 × 1 convolutions associated with the query, key, and value, respectively. We use \({H_{heads}}=8\) attention heads for parallel computation. \(Q \in {R^{HW \times C}},K \in {R^{HW \times C}}\), and \(V \in {R^{HW \times C}}\) are the final feature descriptors after a reshape operation \(\Gamma ( \cdot )\). The cross-modal attention is then calculated as:

$$\begin{gathered} {{\rm T}_{RGB}}=Soft\hbox{max} (\frac{{{Q_{RGB}}K_{{IR}}^{{\rm T}}}}{{\sqrt {{D_k}} }}){V_{IR}} \hfill \\ {{\rm T}_{IR}}=Soft\hbox{max} (\frac{{{Q_{IR}}K_{{RGB}}^{{\rm T}}}}{{\sqrt {{D_k}} }}){V_{RGB}} \hfill \\ \end{gathered}$$
(15)

Here, \({{\rm T}_{RGB}}\) and \({{\rm T}_{IR}}\) represent the features of the visible light and infrared images after cross-modal interaction, where each modality's queries attend to the other modality's keys and values, and \(\sqrt {{D_k}}\) is the temperature scaling factor in the softmax. The results of the cross-attention computation are then concatenated with the initial features \({Y_{RGB}}\) and \({Y_{IR}}\), followed by dimensionality reduction and linear reconstruction to generate new features.

$$\begin{gathered} {{{\rm T}^{\prime}}_{RGB}}=Concat({{\rm T}_{RGB}},{Y_{RGB}})W+b \hfill \\ {{{\rm T}^{\prime}}_{IR}}=Concat({{\rm T}_{IR}},{Y_{IR}})W+b \hfill \\ \end{gathered}$$
(16)

Here, \(W \in {R^{(C/r \times 2) \times C}}\). Then, the final interaction features are generated through residual connections.

$$\begin{gathered} {{\hat {{\rm T}}}_{RGB}}=Norm(F_{{RGB}}^{{depth}}+{{{\rm T}^{\prime}}_{RGB}}) \hfill \\ {{\hat {{\rm T}}}_{IR}}=Norm(F_{{IR}}^{{depth}}+{{{\rm T}^{\prime}}_{IR}}) \hfill \\ \end{gathered}$$
(17)

Then, the features of the two modalities are concatenated along the channel dimension.

$${\hat {{\rm T}}_{fuse}}=Concat({\hat {{\rm T}}_{RGB}},{\hat {{\rm T}}_{IR}}) \in {R^{B \times N \times 2C}}$$
(18)

Then, the channel embedding module enhances the feature interaction between channels and spatial dimensions through a series of convolution operations.

$${\hat {{\rm T}}_{out}}=Norm({X_{res}}+\sigma (DWConv(\sigma (Conv2D({\hat {{\rm T}}_{fuse}})))))$$
(19)

Here, \({X_{res}}=Conv2D({\hat {{\rm T}}_{RGB}})\) represents the residual branch, \(DWConv\) is a 3 × 3 depthwise separable convolution, \(\sigma\) represents the ReLU activation function, and \({\hat {{\rm T}}_{out}}\) is the final output feature.
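The cross-attention core of the CMFFM (Eqs. 13–15) reduces to standard scaled dot-product attention over flattened spatial tokens, with queries from one modality attending to the other modality's keys and values. The sketch below shows a single head for brevity (the paper uses 8) and keeps the channel dimension unreduced, so the exact shapes differ from Eq. (12).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttention(nn.Module):
    """Single-head sketch of the CMFFM cross-attention (Eqs. 13-15)."""
    def __init__(self, channels: int):
        super().__init__()
        # Eq. 14: 1x1 convolutions produce query/key/value descriptors per modality.
        self.q_rgb, self.k_rgb, self.v_rgb = (nn.Conv2d(channels, channels, 1) for _ in range(3))
        self.q_ir, self.k_ir, self.v_ir = (nn.Conv2d(channels, channels, 1) for _ in range(3))

    @staticmethod
    def _flatten(x):                       # reshape Gamma(.): (B, C, H, W) -> (B, HW, C)
        return x.flatten(2).transpose(1, 2)

    def forward(self, u_rgb, u_ir):
        q_r, k_r, v_r = map(self._flatten, (self.q_rgb(u_rgb), self.k_rgb(u_rgb), self.v_rgb(u_rgb)))
        q_i, k_i, v_i = map(self._flatten, (self.q_ir(u_ir), self.k_ir(u_ir), self.v_ir(u_ir)))
        d = q_r.shape[-1]
        # Eq. 15: each modality queries the other's keys/values (scaled dot-product attention).
        t_rgb = F.softmax(q_r @ k_i.transpose(1, 2) / d ** 0.5, dim=-1) @ v_i
        t_ir = F.softmax(q_i @ k_r.transpose(1, 2) / d ** 0.5, dim=-1) @ v_r
        return t_rgb, t_ir                 # (B, HW, C) interaction features
```

The resulting \({{\rm T}_{RGB}}\) and \({{\rm T}_{IR}}\) are then concatenated with \({Y_{RGB}}\) and \({Y_{IR}}\) and projected back, as described in Eqs. (16)–(19).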

Fig. 5
figure 5

The structure of the Cross-Modal Feature Fusion Module.

Loss function

This study employs a multi-task joint loss function consistent with the original YOLOv5 and FCOS frameworks, directly inheriting their effectiveness in single-modality object detection tasks. The overall loss function is formulated as follows:

$$L={L_{cls}}+{\lambda _1}{L_{reg}}$$
(20)

The classification loss (\({L_{cls}}\)) combines Binary Cross-Entropy (BCE) Loss and Focal Loss, serving distinct objectives: BCE Loss for object-existence prediction and Focal Loss for category classification, thereby effectively addressing class imbalance. The regression loss (\({L_{reg}}\)) employs CIoU Loss for optimizing bounding-box coordinate regression, enhancing dense small-object localization accuracy. The weight \({\lambda _1}=0.05\) is determined through validation-set tuning.
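For illustration, the objective of Eq. (20) can be assembled from standard components as below. Target assignment and anchor matching are omitted, and the torchvision focal/CIoU helpers (available in recent torchvision releases) stand in for the YOLOv5/FCOS implementations actually used.

```python
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss, complete_box_iou_loss

def detection_loss(obj_logits, obj_targets,
                   cls_logits, cls_targets,
                   pred_boxes, gt_boxes,
                   lambda_reg: float = 0.05):
    """Sketch of Eq. (20): L = L_cls + lambda_1 * L_reg.

    obj_*  : objectness logits/targets, shape (N,)
    cls_*  : per-class logits/targets,  shape (N, num_classes)
    boxes  : (x1, y1, x2, y2) format,   shape (N, 4)
    """
    # BCE for object-existence prediction.
    l_obj = F.binary_cross_entropy_with_logits(obj_logits, obj_targets)
    # Focal loss for category classification (handles class imbalance).
    l_cat = sigmoid_focal_loss(cls_logits, cls_targets, reduction="mean")
    # CIoU loss for bounding-box regression.
    l_reg = complete_box_iou_loss(pred_boxes, gt_boxes, reduction="mean")
    return (l_obj + l_cat) + lambda_reg * l_reg
```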

Experiments

Experimental settings

Datasets

Our experiments were conducted on the DroneVehicle RGB-IR vehicle detection dataset20, a large-scale drone-based dataset of paired visible/infrared images spanning daytime to nighttime scenes; as noted above, its RGB-IR pairs are only weakly aligned. We then demonstrate the generalization capability of the network architecture on the multispectral datasets DVTOD6 and LLVIP59.

  1. (1)

    DroneVehicle: The DroneVehicle dataset is a large-scale paired dataset of infrared and visible images captured from drones, containing 953,087 vehicle instances across 56,878 images. This dataset covers a variety of scenes, including urban roads, residential areas, parking lots, and scenes ranging from daytime to nighttime. Additionally, the authors have provided rich annotations with oriented bounding boxes for five categories: sedan, bus, truck, van, and freight vehicle. The dataset is divided into a training set and a test set, and our experimental results are inferred from the test set.

  2. (2)

    DVTOD: The DVTOD dataset is a large-scale paired dataset of infrared and visible images collected from drones, covering 16 challenging attributes and 54 captured scenes. It includes 1,606 image pairs for training and 573 pairs for testing. The training and test sets are split by scene while maintaining uniform category and attribute distributions.

  3. (3)

    LLVIP: The LLVIP dataset is a low-light image collection dataset, containing a total of 15,488 pairs of images, with 12,025 pairs used for training and the remaining 3,463 pairs used for testing. These images were collected by a monitoring system at 26 different locations and aligned through manual filtering. Unlike the DroneVehicle dataset, it is a well-aligned dataset.

Evaluation indicators

For all three public datasets, we use the most common evaluation metric in object detection, mean Average Precision (mAP), reporting mAP@50, mAP@75, and mAP@[0.5:0.95]. mAP@50 is the mean of the per-category AP values at IoU = 0.50, and mAP@75 is the corresponding mean at IoU = 0.75. mAP@[0.5:0.95] averages AP over IoU thresholds from 0.50 to 0.95 in steps of 0.05. Higher values indicate better performance on the corresponding dataset.
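Concretely, mAP@[0.5:0.95] is simply the mean of the per-threshold mAP values over the ten IoU thresholds, as in this small sketch (the per-threshold evaluator `ap_at_iou` is assumed to be supplied by the detection toolkit):

```python
def map_50_95(ap_at_iou) -> float:
    """ap_at_iou(t) returns the mAP over all classes at IoU threshold t."""
    thresholds = [round(0.50 + 0.05 * k, 2) for k in range(10)]   # 0.50, 0.55, ..., 0.95
    return sum(ap_at_iou(t) for t in thresholds) / len(thresholds)

# For comparison: mAP@50 = ap_at_iou(0.50), mAP@75 = ap_at_iou(0.75).
```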

Realization details

Our method is implemented in the PyTorch 1.10.1 framework on a Linux server with seven NVIDIA GeForce GTX 1080 Ti GPUs. Training runs for 300 epochs with a batch size of 49. The SGD optimizer is used with an initial learning rate of 1.0 × 10−2. The loss function follows the YOLOv5 and FCOS detectors from the original papers. In the ablation studies, we use the extended YOLOv5 framework as the default baseline for comparison. The DroneVehicle dataset provides annotations in the form of Horizontal Bounding Boxes (HBB), and the RGB and IR image pairs in the dataset are inherently co-registered (aligned). Before being input to the network, both the visible-light and infrared streams are resized to a fixed input resolution of 640 × 640 using standard bilinear interpolation, maintaining the original aspect ratio by padding where necessary.
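The aspect-ratio-preserving resize with padding can be reproduced with a standard letterbox operation, sketched below; the symmetric gray padding value (114) follows the common YOLOv5 convention and is an assumption rather than a detail from the paper.

```python
import cv2
import numpy as np

def letterbox(img: np.ndarray, size: int = 640, pad_value: int = 114) -> np.ndarray:
    """Resize with preserved aspect ratio (bilinear), then pad to size x size."""
    h, w = img.shape[:2]
    scale = size / max(h, w)
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    resized = cv2.resize(img, (new_w, new_h), interpolation=cv2.INTER_LINEAR)
    top = (size - new_h) // 2          # symmetric padding (assumption)
    left = (size - new_w) // 2
    return cv2.copyMakeBorder(resized, top, size - new_h - top,
                              left, size - new_w - left,
                              cv2.BORDER_CONSTANT, value=[pad_value] * 3)
```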

Comparisons

To ensure fair comparisons, all comparison methods in this study, including models originally designed for oriented bounding box (OBB) detection such as RoI Transformer and Oriented R-CNN, were uniformly trained and evaluated using the standard horizontal bounding box (HBB) annotations provided by the DroneVehicle dataset. We compared our method with competitive methods on the DroneVehicle test set (Table 1). Because existing studies offer few cross-modal oriented detection methods for aerial scenes, we evaluated our method against eight other state-of-the-art methods, five single-modal and three multi-modal. Among the multi-modal methods, CALNet achieves the highest detection accuracy (75.39% mAP@0.5). In contrast, our proposed detector achieves 77.63% mAP@0.5, outperforming all other methods and improving on the previous state of the art by 2.24% mAP@0.5. Furthermore, our method outperforms existing methods across most categories in the UAV dataset. Specifically, it improves the detection accuracy for cars, trucks, buses, and vans by 6.61%, 3.34%, 7.08%, and 7.97% mAP@0.5, respectively, further demonstrating its effectiveness in alleviating edge blurring.

As shown in Table 2, we evaluated our method against 11 other advanced methods, including 6 multi-modal methods. Methods that use both RGB and infrared images outperform single-modal methods. Among the multi-modal methods, CALNet achieves the highest detection accuracy (76.41% mAP@0.5). In contrast, our method performs better, achieving 77.65% mAP@0.5 and improving on the previous state of the art by 1.24% mAP@0.5.

Table 1 Performance comparison of different methods on the DroneVehicle test dataset.

For the DVTOD dataset, we compared CMEE-Det with 8 methods: 5 single-modal algorithms (YOLOv3, YOLOv5, YOLOv6, YOLOv7, YOLOv8) and 3 multi-modal algorithms (YOLOv5 + Add, CMX, CFT). The results are shown in Table 3. The three advanced fusion detection methods trail CMEE-Det by 5.2%, 2.8%, and 1.7%, respectively. Edge blur in multi-modal images reduces detection accuracy and makes convergence difficult for multi-modal detectors. For the visible modality, detection accuracy is low due to lighting conditions, complex backgrounds, and occlusions; moreover, the dataset contains a large number of low-light image pairs, which further lowers accuracy for visible images. These results again demonstrate the effectiveness of our detection method.

Table 2 Performance comparison of different methods on the DroneVehicle validation dataset.
Table 3 Performance comparison of different methods on the DVTOD dataset.

LLVIP is an aligned dataset collected under low-light conditions. We compared our proposed CMEE-Det with five multi-spectral object detection methods and five single-modal object detection methods (as shown in Table 4). Our CMEE-Det achieved good detection performance on this dataset, obtaining the best results among all the compared methods. These results demonstrate that CMEE-Det not only handles misaligned UAV data, but also maintains strong performance on well-aligned multispectral datasets, confirming the generalizability and robustness of the proposed framework.

Table 4 Performance comparison of different methods on the LLVIP dataset.

Qualitative analysis

Fig. 6
figure 6

The first column shows the daytime scene, and the fourth column shows the nighttime scene. The first and third rows represent visible light images, and the second and fourth rows represent infrared images. The second and fifth columns show the baseline method visualization, while the third and sixth columns show the visualization of our method.

The improved attention responses observed in Fig. 6 can be attributed to the complementary roles of EFEM and CMFFM. EFEM enhances the separation between foreground object boundaries and the surrounding background by sharpening cross-modal edge cues, which reduces ambiguity in regions where IR edges are typically blurred. This clearer structural guidance allows the CMFFM cross-attention mechanism to assign higher attention weights to true object regions instead of noise or blurred contours. As a result, the fused feature maps display more coherent activation patterns and stronger focus on semantically meaningful areas, explaining the more comprehensive region coverage shown in our qualitative examples. As shown in the second and fifth columns of the figure, the baseline method has limited coverage of different regions in the input image. In contrast, the third and sixth columns show that our method can leverage global spatial location information and the correlations between different objects to more comprehensively capture all object features. CMEE-Det effectively fuses cross-modal features, not only accelerating the integration of background information but also significantly enhancing the saliency of object features at multiple levels.

Figure 7 shows visualization results of CMEE-Det on the DroneVehicle dataset, where it accurately classifies and localizes objects under dense small-object and low-light conditions. The training curves of (a) mAP@0.5 and (b) mAP@[0.5:0.95] on the different datasets are shown in Fig. 8.

Fig. 7
figure 7

Visualization results of CMEE-Det on UAVs in the DroneVehicle dataset, with different colored boxes for different categories. It performs excellently in tasks such as small object localization, dense object localization, and fine-grained classification under various lighting conditions.

Fig. 8
figure 8

Training curves of mAP@50 and mAP@[0.5:0.95] on the different datasets.

Figure 9 shows the (a) mAP@50 and (b) mAP@[0.5:0.95] results of our detector and the baseline detector for specific categories in the DroneVehicle dataset. CMEE-Det employs a more effective fusion strategy that greatly improves accuracy, especially for trucks and vans.

Fig. 9
figure 9

Per-category mAP@50 and mAP@[0.5:0.95] results of our detector and the baseline on the DroneVehicle dataset.

Ablation study

Ablation on each component

We adopt the extended YOLOv5 as our primary baseline because it is a strong and widely recognized single-stage detector, whose stable architecture and mature training pipeline make it easier to isolate and quantify the incremental effects of our proposed modules (EFEM, MSFFM, CMFFM). Although our loss design is compatible with both YOLOv5 and FCOS, YOLOv5 provides a cleaner and more consistent foundation for controlled ablation, enabling more reliable analysis of component-wise contributions.

Effect of EFEM Module. We extended YOLOv5 as our baseline. First, we performed an ablation study on the EFEM module while keeping the rest of the network structure unchanged. Compared to the data in row 2 of Table 5, mAP@50 increased by 2.02% in row 5. The experimental results indicate that the module effectively enhanced the edge-information representation of visible and infrared images during the feature-extraction phase, while also precisely capturing the edge differences between cross-modal features through differential convolution and edge enhancement operations. This not only improved the accuracy of cross-modal feature alignment but also further optimized the robustness and consistency of modal fusion.

Effect of MSFFM Module. In row 3 of Table 5, we performed an ablation of the MSFFM module, keeping the rest of the network unchanged. The mAP@[0.5:0.95] increased by 2.17% in row 5 compared to row 3. Experimental results show that the module effectively extracts details and semantic features of visible light and infrared images at different scales. The module extracts rich contextual information at different receptive fields through adaptive pooling and dilated convolution, while accurately capturing spatial details through feature reconstruction. This method effectively enhances the complementarity of cross-modal features and improves the ability to capture target features in complex scenes, thereby significantly improving the expressive power of fused features and the robustness of detection performance.

Effect of CMFFM Module. As shown in row 4 of Table 5, we performed an ablation of the CMFFM module, while keeping the rest of the network unchanged during both training and testing. Comparing row 5 with row 4 in Table 5, we observe that the addition of CMFFM increased mAP@50 by 2.34%. Experimental results show that the module effectively captures the correlation between different modalities and promotes efficient cross-modal feature interaction. Moreover, the channel embedding operation further optimizes the expression of fused features and enhances the modeling of spatial consistency in the input features. Overall, this module significantly enhances the compatibility and information representation ability of cross-modal features, providing high-quality fused feature support for downstream detection tasks.

Table 5 Ablation study of EFEM, MSFFM, and CMFFM.

The experimental results in Table 5 show that any combination of two modules significantly improves detection performance, with a clear improvement over the baseline results. This indicates that the modules complement each other in capturing multimodal features, enhancing feature representation, and integrating contextual information, positively impacting the final detection task. Furthermore, when EFEM, MSFFM, and CMFFM modules are combined, detection performance is further improved. This result validates the advantage of the three modules working together, as their fusion through multiple mechanisms not only enhances feature distinguishability but also better adapts to the target detection needs in complex scenarios.

Analysis of the effectiveness of improvement methods

Table 6 provides a detailed comparison of performance indicators for different object detection model configurations. The analysis shows that the proposed CMEE-Det achieves the best overall performance, especially when using CSPDarknet53 as the backbone network. This configuration achieves the highest accuracy of 84.4% mAP@50 and an inference speed of 70 FPS, with a parameter count of 115.5 M and a computational load of 205.08 GFLOPs. This is not only significantly better than the baseline YOLOv5 + Add (a 5.2% accuracy improvement) but also better than other high-performance models such as CFT and YOLOv5 + CMX. Compared with CMEE-Det variants using different backbone networks, the CSPDarknet53 version demonstrates a better balance of speed, accuracy, and resource consumption, which underlines the efficiency and soundness of the CMEE-Det architecture design.

Table 6 Performance comparison of different backbone networks and detector configurations on DVTOD dataset.

Hyperparameter sensitivity analysis

Table 7 Sensitivity analysis of CMEE-Det to initial learning rate (LR) on the DVTOD Dataset.

The initial learning rate (LR) is a critical hyperparameter governing the convergence of the model. We evaluated the sensitivity of CMEE-Det by testing four different initial LRs on the DVTOD dataset; the mAP@50 results are reported in Table 7. The default LR of 1.0 × 10−2 yields the optimal result, providing the best balance between convergence speed and final detection accuracy.

Table 8 Effectiveness analysis of the adaptive weighting mechanism in the EFEM.

We evaluated the effectiveness of the adaptive weighting mechanism in the EFEM, as detailed in Table 8. The baseline model (without EFEM) achieves a mAP@50 of 82.6%, while EFEM with fixed average fusion improves the result to 83.2%. Introducing learnable parameters \({\lambda _{rgb}}\) and \({\lambda _{ir}}\) for the adaptive weighting mechanism further boosts performance to 84.4%, confirming its ability to fuse multi-modal features more effectively.

Conclusions

In this paper, we analyze the impact of edge blurring on multimodal object detection and propose CMEE-Det, a novel cross-modal edge enhancement detector for multispectral object detection in UAVs. Although CMEE-Det performs well on UAV detection tasks, several limitations remain. First, the model is optimized for visible-infrared fusion in aerial scenarios, and its performance under extreme environments (e.g., heavy fog, nighttime glare, severe sensor noise) may still degrade, as such cases are underrepresented in existing datasets. Second, the adaptive pooling and cross-modal enhancement modules introduce additional computational overhead compared to single-modality detectors, which may limit deployment on extremely resource-constrained onboard processors. Third, CMEE-Det is currently evaluated mainly on vehicle and person detection; its generalization to other multispectral tasks (e.g., semantic segmentation, medical imaging, agricultural monitoring) remains a promising direction for future research. To address these limitations, future work will explore lightweight modules for real-time onboard inference, investigate robust training strategies for extreme illumination and weather conditions, and extend CMEE-Det to broader multimodal perception tasks. Beyond UAV-based multispectral detection, the principles behind CMEE-Det, namely edge-aware representation enhancement and robust cross-modal feature interaction, hold strong potential for broader multimodal perception tasks such as autonomous driving, robotics, and remote sensing, where reliable fusion of heterogeneous sensors is essential.