Introduction

Because mid- and long-wave infrared imagery retains good imaging performance at night and in low-light environments, object detection in UAV infrared images is widely used in security monitoring1, military reconnaissance2, medical imaging3, environmental monitoring4 and other scenarios. For small infrared objects with weak thermal radiation, detection still faces challenges from complex background interference, weak infrared features, algorithm complexity and real-time requirements. Small objects are usually smaller than 20 × 20 pixels. According to the motion characteristics of the detected object, infrared small object detection research falls into two areas: single-frame images and video sequences. This paper focuses on single-frame infrared small object detection.

In recent years, deep learning detectors have been divided into one-stage and two-stage methods according to whether candidate regions are generated. One-stage methods use a feedforward network to locate and classify objects directly, yielding high real-time efficiency; typical examples include the Yolo series (You Only Look Once)5, SSD (Single Shot MultiBox Detector)6 and RetinaNet7. Two-stage methods first generate candidate regions and then classify them and regress bounding boxes; they achieve high accuracy in complex scenarios but are comparatively slow. Typical models include Faster R-CNN8, R-FCN9 and Mask R-CNN10. Object detection in UAV infrared images must overcome difficulties such as low signal-to-noise ratio, complex object motion, and interference or occlusion by clouds. Therefore, the lightweight Yolo series, known for high efficiency and accuracy, has been widely adopted in complex scenarios with strict real-time requirements.

Tian et al.11 proposed the IV-Yolo object detection network, which integrates visible and infrared image features through a two-branch fusion structure for UAV small object detection. Xiao et al.12 introduced a new C2f-DCNv3 module to enhance feature extraction and added BiFPN to the neck to achieve multi-scale feature fusion for small UAV objects. Xue et al.13 proposed the lightweight FECI-RTDETR algorithm for UAV infrared small object detection, which combines a spatial feature selection mechanism with an intra-scale feature interaction module and exploits cross-scale feature fusion to capture the contextual semantics of aerial infrared small objects. Zhang et al.14 proposed the ESD-Yolov8 model, which identifies defect characteristics of infrared solar cells and enhances the detection of small defects by integrating the EMA attention mechanism into the C2f_EMA module. Zhu et al.15 combined Yolov9 with DeepSORT to design multi-object tracking (MOT) models for tracking and identifying endangered birds and drones.

With the rapid development of deep learning, Gu and Dao proposed the Mamba model16 based on the State Space Model (SSM)17, which processes long sequences with linear complexity by means of an efficient hardware-aware algorithm. In small object detection, the Yolo model may miss detections, whereas the Mamba model can enhance small object features through multi-frame fusion and thus improve detection. Wang et al.18 proposed Mamba YOLO, a novel object detection framework that integrates an SSM backbone into the YOLO architecture to establish a new, simple baseline for efficient and effective detection. Wang et al.19 introduced Mamba-YOLO-World, which combines the open-vocabulary detection capabilities of YOLO-World with the efficient sequence modeling of a Mamba-based backbone to enhance detection of objects from an unlimited vocabulary. Malekmohammadi et al.20 conducted a comparative analysis of YOLO, Vision Transformers and the emerging Vision Mamba architecture for classifying Gleason grades in prostate cancer histopathology images.

Building on the local feature extraction of convolutional layers and the long-range dependency modeling of state space models (SSM), some scholars have proposed dual-branch structures for object detection21,22. Integrating attention mechanisms into the Mamba model can dynamically focus on key frames and object areas, extract global contextual information, and reduce background interference23,24. Improvements to Mamba and Yolo have been applied to medical image diagnosis25,26 and road damage detection27,28. Yu et al.29 designed the SFFNet and the VMamba-GIE module to enhance infrared small object features. Chen et al.30 designed MiM-ISTD, a nested structure of outer and inner Mamba modules that captures global and local infrared image features while limiting computational cost and memory. Li et al.31 proposed the HMCNet model, a hybrid architecture combining a state space model and CNN. Lu et al.32 explored the effectiveness and efficiency of state space models for single-frame ISTD tasks, integrating the visual Mamba module into a lightweight CNN-based network for infrared small object segmentation. Zhang et al.33 proposed an encoder-decoder architecture featuring Pixel Difference Mamba (PDMamba) and a Layer Restoration Module (LRM). In recent years, computer vision research has trended towards model efficiency and cross-domain adaptability. Introducing the Residual Channel Attention mechanism (RCA)34 and improving the YOLO series architectures35,36, combined with semi-automated dataset construction techniques, has significantly enhanced the domain adaptability of object detection. These techniques have been applied across spatial scales: remote sensing image classification and earthquake damage assessment at the macro level37, vehicle detection and tracking in aerial video at the meso level38,39, and real-time driver state monitoring at the micro level40, demonstrating the generalization ability of deep learning across different spatial scales.

Fig. 1

Small object detection results. Top: Example images from the VEDAI dataset demonstrate the challenges in small object features. Bottom: Performance comparison of different methods in terms of mAP@0.5.

In infrared small object detection, Yolo stands out for its efficient real-time performance and multi-scale detection capability, while the Mamba model, based on the state space model (SSM), is particularly suited to sequence data. Despite advances in both, no existing framework effectively combines their strengths for infrared small object detection. Motivated by this gap, this paper combines the real-time performance of Yolo with Mamba’s context modeling ability to construct a high-precision infrared small object detection system for complex scenarios.

The contributions of this study are summarized as follows:

  1. We propose the Super Mamba model, a multi-modal fusion object detection framework based on SSM and Yolov8. Experiments on VEDAI, as shown in Fig. 1, demonstrate that Super Mamba achieves a significant performance improvement over existing approaches. We replace standard convolution with RFAConv, capturing multi-scale features through a multi-branch structure; the network adaptively adjusts its receptive field, and dilated convolution reduces the number of parameters, making it better suited to small object detection.

  2. In the network backbone, SAM is integrated into the VSS module, which dynamically computes the weights of different spatial positions and utilizes SSM to capture global contextual information of the image, efficiently extracting and fusing small object features. Subsequently, in the core regions of small objects, the combination of SE and VSS forms a “local-global” collaborative feature enhancement effect. This framework significantly improves the accuracy and robustness of infrared small object detection while maintaining linear time complexity.

  3. We introduce a multi-scale dynamic feature optimization mechanism that realizes multi-level feature fusion through BiFPN and combines it with FEM to enhance the detailed features of small objects, effectively addressing the challenges of small object detection in complex backgrounds.

The rest of the paper is organized as follows: “Related work” reviews related work, including the state space model and receptive-field attention convolution. “Methods” describes the proposed approach. “Experimental results” presents the experimental results and associated discussion, and “Generalization analysis” evaluates cross-dataset generalization. Finally, “Conclusion” summarizes the findings and outlines future research directions.

Related work

State space model

Recently, the SSM17 has received much attention in deep learning because its cost scales linearly with sequence length. The SSM can be regarded as a linear time-invariant system that maps the input sequence \(x(t)\in\mathbb{R}\) to the output response \(y(t)\in\mathbb{R}\) through the hidden state \(h(t)\in\mathbb{R}^{N}\); the process can be expressed by a linear ordinary differential equation:

$$h'(t)=Ah(t)+Bx(t)$$
(1)
$$y(t)=Ch(t)+Dx(t)$$
(2)

In the formulas, \(A\in\mathbb{R}^{N\times N}\) is the state matrix, and \(B\in\mathbb{R}^{N\times 1}\), \(C\in\mathbb{R}^{1\times N}\) and \(D\in\mathbb{R}^{1}\) are projection matrices.

For use in deep learning, the continuous SSM must be discretized. Gu17 proposed the S4 model as a discretization of the continuous SSM, converting the continuous parameters \(A\) and \(B\) into discrete parameters \(\overline{A}\) and \(\overline{B}\) using the step size parameter \(\Delta\). The most common discretization method for SSM is the zero-order hold (ZOH), computed as:

$$\overline{A}=\exp(\Delta A)$$
(3)
$$\overline{B}=(\Delta A)^{-1}\left(\exp(\Delta A)-I\right)\Delta B$$
(4)

In the formula, \(I\) is the identity matrix. The discretized SSM equations are then:

$$h_{t}=\overline{A}h_{t-1}+\overline{B}x_{t}$$
(5)
$$y_{t}=Ch_{t}$$
(6)

where \(h_{t}\) is the hidden state at time step \(t\), and \(x_{t}\) and \(y_{t}\) are the input and output at time step \(t\), respectively.
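
To make the discretization concrete, the following minimal NumPy sketch implements Eqs. (3)-(6) for a toy system; the random matrices, the step size \(\Delta=0.1\) and the sinusoidal input are illustrative assumptions, not values used in the paper.

```python
import numpy as np
from scipy.linalg import expm

def zoh_discretize(A, B, delta):
    """Zero-order-hold discretization (Eqs. 3-4)."""
    dA = delta * A
    A_bar = expm(dA)                                         # A_bar = exp(delta * A)
    B_bar = np.linalg.inv(dA) @ (A_bar - np.eye(A.shape[0])) @ (delta * B)
    return A_bar, B_bar

def ssm_scan(A_bar, B_bar, C, xs):
    """Linear-time recurrence of Eqs. (5)-(6): h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t."""
    h = np.zeros((A_bar.shape[0], 1))
    ys = []
    for x in xs:                                             # xs: 1-D input sequence
        h = A_bar @ h + B_bar * x
        ys.append((C @ h).item())
    return np.array(ys)

# Toy usage with a random 4-dimensional state (illustrative only).
rng = np.random.default_rng(0)
A = -np.eye(4) + 0.1 * rng.standard_normal((4, 4))           # roughly stable state matrix
B = rng.standard_normal((4, 1))
C = rng.standard_normal((1, 4))
A_bar, B_bar = zoh_discretize(A, B, delta=0.1)
y = ssm_scan(A_bar, B_bar, C, xs=np.sin(np.linspace(0.0, 6.28, 64)))
```
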

Receptive-field attention convolution operation

In standard convolution, the kernel extracts information with shared parameters, so the network is insensitive to positional differences within the receptive field. RFAConv addresses this parameter-sharing issue by weighting the individual features within the sliding receptive-field window and emphasizing its spatial characteristics41. Replacing the convolution operations in the backbone with RFAConv therefore enhances the detailed features of small objects and makes the network more precise in recognizing them, at a relatively small computational cost.
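
The following PyTorch sketch illustrates the receptive-field attention idea in simplified form: each k × k window is expanded into its receptive-field features, re-weighted by attention scores derived from the window itself, and aggregated by a stride-k convolution. It follows the published RFAConv design41 in outline only; layer choices such as the grouped convolutions and ReLU are assumptions, not the exact released implementation.

```python
import torch
import torch.nn as nn

class SimpleRFAConv(nn.Module):
    """Simplified sketch of receptive-field attention convolution (RFAConv)."""
    def __init__(self, in_ch, out_ch, k=3, stride=1):
        super().__init__()
        self.k, self.stride = k, stride
        # attention logits for each of the k*k positions in every window
        self.get_weight = nn.Sequential(
            nn.AvgPool2d(kernel_size=k, padding=k // 2, stride=stride),
            nn.Conv2d(in_ch, in_ch * k * k, kernel_size=1, groups=in_ch),
        )
        # expands each location into its k*k receptive-field features
        self.generate_feature = nn.Sequential(
            nn.Conv2d(in_ch, in_ch * k * k, kernel_size=k, padding=k // 2,
                      stride=stride, groups=in_ch),
            nn.BatchNorm2d(in_ch * k * k),
            nn.ReLU(inplace=True),
        )
        # aggregates the re-weighted receptive field with a stride-k convolution
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=k, stride=k)

    def forward(self, x):
        b, c = x.shape[:2]
        weight = self.get_weight(x)                           # (b, c*k*k, h, w)
        h, w = weight.shape[2:]
        weight = weight.view(b, c, self.k * self.k, h, w).softmax(dim=2)
        feat = self.generate_feature(x).view(b, c, self.k * self.k, h, w)
        weighted = feat * weight                              # re-weight receptive-field features
        # rearrange the k*k positions into an (h*k, w*k) spatial grid
        weighted = weighted.view(b, c, self.k, self.k, h, w)
        weighted = weighted.permute(0, 1, 4, 2, 5, 3).reshape(b, c, h * self.k, w * self.k)
        return self.conv(weighted)
```
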

Methods

Network framework for small object detection

The infrared small object detection algorithm proposed in this paper is an improved one-stage detector built on the Super Mamba framework. It uses the lightweight Super Mamba structure and a multi-scale network for feature extraction, and improves the BiFPN in the neck with the FEM module. The classification and regression sub-networks of the decoupled head perform position regression and classification of infrared small objects. The overall network framework is shown in Fig. 2. Super Mamba is divided into three parts: the backbone, the neck and the head.

In the backbone of the Super Mamba model, multi-scale features are extracted in four stages. In Stage 1 and Stage 2, the SAM block is integrated into the VSS block, which helps extract the spatial location information and local features of infrared small objects. In Stage 3 and Stage 4, SE is combined with the VSS block to enhance channel features, suppress irrelevant channel information, and strengthen the global features of infrared small objects. In the neck, the multi-branch convolution of FEM is integrated into the top-down and bottom-up BiFPN to fuse the contextual semantic information of small objects. Finally, the decoupled head of the Yolo model classifies and predicts on the input images.

Simple RFAConv initialization module

Recent studies indicate that initial modules that split images into non-overlapping patches may limit the optimization capability of the network and thereby affect overall performance42. We therefore propose a simple RFAConv initial module. Instead of non-overlapping patches, we use two RFAConv layers with a kernel size of 3 and a stride of 2. RFAConv not only emphasizes the importance of individual features within the receptive field but also attends to the spatial characteristics of the receptive field. The implementation of this RFAConv initial module is shown as the red RFAConv module in Fig. 2.

Let the input image be \(X\in\mathbb{R}^{H\times W\times C}\), where H is the height, W the width and C the number of channels. The simple initialization module applies two RFAConv convolutional layers:

$$X_{S}=\sigma\left(BN\left(RFAConv\left(X\right)\right)\right)$$
(7)

The output is \(X_{ST}=X\odot X_{S}\), where \(\odot\) denotes element-wise multiplication.
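
A minimal sketch of the stem described by Eq. (7), reusing the SimpleRFAConv class from the sketch above: two RFAConv layers with kernel size 3 and stride 2, each followed by BatchNorm and SiLU. The channel widths are illustrative, and the element-wise gate \(X_{ST}=X\odot X_{S}\) requires resolution matching that the text does not detail, so it is only noted in a comment.

```python
import torch.nn as nn

class RFAConvStem(nn.Module):
    """Sketch of the simple RFAConv initialization module (Eq. 7).

    Replaces non-overlapping patch embedding with two overlapping,
    stride-2 RFAConv layers; channel widths are illustrative.
    """
    def __init__(self, in_ch=3, mid_ch=32, out_ch=64):
        super().__init__()
        self.stage1 = nn.Sequential(SimpleRFAConv(in_ch, mid_ch, k=3, stride=2),
                                    nn.BatchNorm2d(mid_ch), nn.SiLU())
        self.stage2 = nn.Sequential(SimpleRFAConv(mid_ch, out_ch, k=3, stride=2),
                                    nn.BatchNorm2d(out_ch), nn.SiLU())

    def forward(self, x):
        # Eq. (7), applied twice: X_S = SiLU(BN(RFAConv(X))).
        # The paper's element-wise gate X_ST = X * X_S would additionally
        # require resizing X to the output resolution; omitted in this sketch.
        return self.stage2(self.stage1(x))
```
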

Fig. 2

Overview of the proposed Super Mamba network structure. Our new contributions include: (1) RFAConv replaces standard convolution to capture multi-scale features with adaptive receptive fields, (2) SAM and SE are integrated into the VSS module, enabling dynamic spatial weighting and global context modeling for small objects, (3) the FEM-BiFPN mechanism optimizes multi-scale fusion and detail enhancement.

VSS module integrated with attention mechanism

In our network, SAM emphasizes spatial regions containing critical small objects, while SE enhances channel-wise feature discriminability for refined object representation; both are integrated directly into the VSS block. The Spatial Attention Mechanism (SAM) searches for global contextual information and, combined with the VSS block, efficiently encodes the input images43. This reduces redundant computation and lets the model focus on small object areas in infrared images (as shown in Fig. 3). The implementation is as follows:

The global average pooling \(X_{avg}\in\mathbb{R}^{H\times W\times 1}\) and the global maximum pooling \(X_{max}\in\mathbb{R}^{H\times W\times 1}\) are concatenated along the channel dimension to obtain \(F_{cat}\in\mathbb{R}^{H\times W\times 2}\):

$$X_{avg}=AvgPool\left(X_{ST}\right)$$
(8)
$$X_{max}=MaxPool\left(X_{ST}\right)$$
(9)
$$F_{cat}=Concat\left[X_{avg};X_{max}\right]$$
(10)

Convolving \(F_{cat}\) with a \(1\times 1\) kernel yields the spatial attention feature map \(M_{s}\). The SiLU function \(\sigma\) is applied to \(M_{s}\), giving the normalized spatial attention weight \(\widehat{M_{s}}\in\left(0,1\right)\). The enhanced spatial attention feature map is denoted \(F_{out}\):

$$M_{s}=Conv\left(F_{cat}\right)$$
(11)
$$\widehat{M_{s}}=\sigma\left(M_{s}\right)$$
(12)
$$F_{out}=\widehat{M_{s}}\odot X_{ST}$$
(13)
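
A compact PyTorch sketch of Eqs. (8)-(13), assuming channel-wise average and max pooling and a 1 × 1 convolution as described in the text; SiLU is used exactly as stated, even though its range is not strictly (0, 1).

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch of the SAM branch (Eqs. 8-13); kernel size and activation follow the text."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=1)   # Eq. (11): 1x1 convolution
        self.act = nn.SiLU()                         # Eq. (12): sigma is SiLU per the text

    def forward(self, x_st):
        x_avg = x_st.mean(dim=1, keepdim=True)       # Eq. (8): channel-wise average pooling
        x_max = x_st.amax(dim=1, keepdim=True)       # Eq. (9): channel-wise max pooling
        f_cat = torch.cat([x_avg, x_max], dim=1)     # Eq. (10): concatenation
        m_hat = self.act(self.conv(f_cat))           # Eqs. (11)-(12): attention weights
        return m_hat * x_st                          # Eq. (13): enhanced feature map F_out
```
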
Fig. 3

Cross-modal multi-scale feature extraction block, which consists of four modules, (1) SAM block enhances spatial focus by selectively emphasizing critical regions and suppressing irrelevant background noise; (2) SE block improves feature representation by adaptively recalibrating channel-wise feature responses; (3) VSS block enables efficient global context modeling with linear complexity through state space mechanisms; (4) SS2D block achieves comprehensive spatial context fusion by transforming 2D data into structured 1D sequences for selective scanning.

Fig. 4

Illustration of 2D-Selective-Scan (SS2D). Input patches are traversed along four different scanning paths (Cross-Scan), and each sequence is independently processed by distinct S6 blocks. Subsequently, the results are merged to construct the 2D feature map as the final output (Cross-Merge).

The VSS block enhances the model’s expressive capability by performing variable spatial sampling on feature maps, efficiently handling the spatial dependencies of long image sequences. In this paper, the spatially enhanced feature map \(F_{out}\) is layer-normalized to \(X_{LN}\); a 2D cross-scan of \(X_{LN}\) (SS2D, Fig. 4) then produces \(X_{SS2D}\), which is concatenated with \(F_{out}\). Finally, \(X_{cat}\) is layer-normalized and passed through a nonlinear feed-forward transformation, and the output feature map \(X_{out}\) is obtained by dynamically adjusting the gating weights.

$$X_{LN}=LN\left(F_{out}\right)$$
(14)
$$X_{SS2D}=SS2D\left(X_{LN}\right)$$
(15)
$$X_{cat}=Concat\left(X_{SS2D};F_{out}\right)$$
(16)
$$X_{out}=X_{cat}\odot FFN\left(LN\left(X_{cat}\right)\right)$$
(17)
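
The wiring of Eqs. (14)-(17) can be sketched as follows. The SS2D selective-scan core is replaced here by a depthwise-convolution placeholder, and the FFN expansion factor and GELU activation are assumptions, so this is a structural sketch rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class VSSBlockSketch(nn.Module):
    """Structural sketch of the attention-augmented VSS block (Eqs. 14-17)."""
    def __init__(self, dim):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Placeholder for SS2D: a real implementation uses the four-path 2D selective scan.
        self.ss2d = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.norm2 = nn.LayerNorm(2 * dim)
        self.ffn = nn.Sequential(nn.Linear(2 * dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, 2 * dim))

    def forward(self, f_out):                            # f_out: (B, C, H, W)
        x = f_out.permute(0, 2, 3, 1)                    # to (B, H, W, C) for LayerNorm
        x_ln = self.norm1(x)                             # Eq. (14)
        x_ss = self.ss2d(x_ln.permute(0, 3, 1, 2))       # Eq. (15), SS2D stand-in
        x_cat = torch.cat([x_ss.permute(0, 2, 3, 1), x], dim=-1)   # Eq. (16)
        gate = self.ffn(self.norm2(x_cat))               # Eq. (17): FFN(LN(X_cat))
        x_out = x_cat * gate                             # gated output
        return x_out.permute(0, 3, 1, 2)                 # back to (B, 2C, H, W)
```
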

In the deeper stages of the network, the SE block is introduced. Through its squeeze and excitation steps it adaptively recalibrates the channel weights of the feature map, and combined with the VSS block it yields multi-scale features of the infrared image, as shown in Fig. 3.

Global average pooling is performed on the feature map \(X_{out}\in\mathbb{R}^{H\times W\times C}\). The global information of each channel is expressed as \(Z\), which is transformed by two fully connected layers into \(Z_{1}\); the channel weight vector is \(Z_{2}\), and the output feature map is \(T_{out}\):

$$Z=Squeeze\left(X_{out}\right)$$
(18)
$$Z_{1}=\sigma\left(W_{1}Z+b_{1}\right)$$
(19)
$$Z_{2}=Sigmoid\left(W_{2}Z_{1}+b_{2}\right)$$
(20)
$$T_{out}=X_{out}\odot Z_{2}$$
(21)

Here, the weights are \(W_{1}\in\mathbb{R}^{\frac{C}{r}\times C}\) and \(W_{2}\in\mathbb{R}^{C\times\frac{C}{r}}\), and the biases are \(b_{1}\in\mathbb{R}^{\frac{C}{r}}\) and \(b_{2}\in\mathbb{R}^{C}\). Combining with the VSS block then yields the multi-scale feature map \(E_{out}\):

$$E_{out}=VSS\left(T_{out}\right)$$
(22)
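
A minimal sketch of the channel recalibration in Eqs. (18)-(21); the reduction ratio r = 16 and the ReLU used for \(\sigma\) in Eq. (19) are common SE defaults and are assumptions here.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Sketch of squeeze-and-excitation channel recalibration (Eqs. 18-21)."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // r)    # W1, b1
        self.fc2 = nn.Linear(channels // r, channels)    # W2, b2

    def forward(self, x_out):                            # x_out: (B, C, H, W)
        z = x_out.mean(dim=(2, 3))                       # Eq. (18): squeeze via global average pooling
        z1 = torch.relu(self.fc1(z))                     # Eq. (19)
        z2 = torch.sigmoid(self.fc2(z1))                 # Eq. (20): channel weight vector
        return x_out * z2[:, :, None, None]              # Eq. (21): recalibrated feature map T_out
```
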

The feature fusion module of BiFPN with FEM

The Feature Enhancement Module (FEM) uses a multi-branch convolutional structure to extract contextual semantic information and strengthen the representation of infrared small object features, thereby improving small object detection in complex backgrounds. In this paper, a multi-scale bidirectional pyramid feature fusion network based on FEM is proposed for the SMamba neck (as shown in Fig. 2).

Since infrared small objects are often difficult to identify in high-level feature maps, we improve the feature pyramid in the neck of our SMamba with FEM and BiFPN44. Through bidirectional cross-scale connections and weighted feature fusion, BiFPN alleviates the dilution and loss of small object detail that occurs in traditional FPN during information transmission, more fully integrating low-level features carrying key details of small objects with high-level features carrying semantic context. First, the local context of small objects is enhanced through a multi-branch structure and dilated convolution. Multi-scale convolutions are applied to the feature maps \(E_{out}\in\mathbb{R}^{H\times W\times C}\) extracted at each stage; convolution kernels of different scales \(K_{1},K_{2},\cdots,K_{n}\) yield feature maps \(Y_{1},Y_{2},\cdots,Y_{n}\), with \(Y_{i}=E_{out}*K_{i}+b_{i}\;(i=1,2,\cdots,n)\). Then, the Bidirectional Feature Pyramid Network (BiFPN) performs feature fusion in both top-down and bottom-up directions, so that each feature map receives information from both higher-level and lower-level features, thereby enhancing the features of infrared small objects.
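
A sketch of such a multi-branch enhancement block follows; the number of branches and the dilation rates (1, 3, 5) are illustrative assumptions, since the paper does not list them.

```python
import torch
import torch.nn as nn

class FEM(nn.Module):
    """Sketch of a feature enhancement module: parallel 3x3 branches with
    different dilation rates capture multi-scale context around small objects."""
    def __init__(self, channels, dilations=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=d, dilation=d),
                          nn.BatchNorm2d(channels), nn.SiLU())
            for d in dilations])
        self.fuse = nn.Conv2d(channels * len(dilations), channels, 1)  # 1x1 fusion

    def forward(self, e_out):
        y = torch.cat([branch(e_out) for branch in self.branches], dim=1)
        return self.fuse(y)    # enhanced features passed on to the BiFPN
```
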

Experimental results

Dataset

We conduct comprehensive experiments on the proposed Super Mamba model. The main experiments use the VEDAI dataset, which consists of 1,246 images of small objects over varied backgrounds captured from an aerial perspective. All images are 512 × 512 pixels and cover eight categories, including car, pickup, truck and camping car. Additionally, the M3FD and LLVIP datasets are selected to further validate the detection performance of our framework. VEDAI, M3FD and LLVIP were chosen to systematically test the robustness of the model under varying lighting conditions and background complexity: VEDAI provides multi-view complex backgrounds, M3FD covers a range of extreme lighting conditions, and LLVIP focuses on low-light challenges. This combination ensures the comprehensiveness and generality of the evaluation.

Implementation details

Our framework is implemented in Python and runs on a workstation with an NVIDIA GeForce RTX 4070 Ti SUPER GPU (16 GB). During training, each image is resized to 640 × 640. For a three-band input of this size, the floating-point operations (FLOPs) and parameter size of our Super Mamba are 28.4 G and 17.6 MB, respectively. The Adam optimizer is used with a batch size of 16 and 300 epochs for network optimization.

Before training, the VEDAI dataset is divided into training, test and validation sets in a 7:2:1 ratio. The number of training epochs is set to 150 and the initial learning rate to 0.01. Given the large number of small objects in the sample images and the need to balance real-time performance and accuracy, samples are normalized to 640 × 640. To ensure fairness, no pre-trained weights are used in the ablation experiments, and all training runs share the same hyperparameter settings.
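
For reference, the training settings reported above can be summarized in a single hypothetical configuration dictionary; the key names are illustrative and not the authors' configuration file.

```python
# Hypothetical summary of the reported training settings; names are illustrative.
TRAIN_CONFIG = {
    "input_size": 640,          # images resized to 640 x 640
    "optimizer": "Adam",
    "batch_size": 16,
    "epochs": 300,              # the implementation details report 300; the split paragraph reports 150
    "lr0": 0.01,                # initial learning rate
    "dataset_split": {"train": 0.7, "test": 0.2, "val": 0.1},
    "pretrained": False,        # no pre-trained weights in the ablation experiments
}
```
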

Accuracy metrics

The evaluation metrics in this paper include Precision (P), Recall (R), mean Average Precision (mAP), Peak Signal-to-Noise Ratio (PSNR) and the number of model parameters. In general, higher values of these metrics indicate better performance for infrared small object detection. P, R and AP are computed as

$$P=\frac{TP}{TP+FP};$$
$$R=\frac{TP}{TP+FN};$$
$$AP=\int_{0}^{1}P\left(R\right)\,dR;$$

where true positives (TP) and true negatives (TN) denote correct predictions, and false positives (FP) and false negatives (FN) denote incorrect ones. Precision and recall correspond to commission and omission errors, respectively. AP is the area enclosed by the Precision-Recall curve and the coordinate axes for a single class, computed by integration, and mAP is the comprehensive indicator obtained by averaging AP over all categories:

$$mAP=\frac{\sum_{n=1}^{Num\left(classes\right)}AP\left(n\right)}{Num\left(classes\right)}.$$
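
For clarity, a short Python sketch of these metrics is given below, using the standard all-point interpolation of the precision-recall curve for AP; the helper names are illustrative.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Precision and recall from detection counts (P and R above)."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recalls, precisions):
    """AP as the area under the precision-recall curve (all-point interpolation).

    `recalls` must be sorted in ascending order; mAP is then the mean of the
    per-class AP values.
    """
    r = np.concatenate(([0.0], np.asarray(recalls, float), [1.0]))
    p = np.concatenate(([0.0], np.asarray(precisions, float), [0.0]))
    # make precision monotonically decreasing, then integrate over recall
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```
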

Ablation study

As shown in Table 1, the precision of the baseline Mamba model is 71.4%, with a mean average precision of 70%. After adding the RFAConv module to the backbone, precision increases by 2.6 percentage points and mAP by 2.2 percentage points, indicating that applying attention in receptive-field space needs only a small computational cost to compensate for shared convolution parameters and improve performance. Introducing the SAM module alone raises mAP by 4.7 percentage points, suggesting that spatial attention suppresses background interference and makes the network focus on the local areas where small objects are located. Adding the SE channel attention alone barely changes detection accuracy but significantly reduces computational cost. Activating the FEM module alone lets the model integrate features from different receptive fields while retaining and enhancing small object details, increasing mAP by 6.9 percentage points and recall by 2.5 percentage points. When only one of the RFAConv, SAM, SE or FEM modules is added to the Mamba model, mAP@0.95 remains around 40%; since stricter IoU thresholds penalize small objects more heavily45, mAP@0.95 stays low.

Table 1 The ablation experiment results for object detection performance on the VEDAI dataset.

When all four modules are added to the backbone and neck simultaneously, the proposed Super Mamba model achieves a significant improvement: mAP@0.5 increases by 22.3 percentage points to 92.3%, and mAP@0.95 reaches 77.8%. These results show that the proposed model can both identify and localize small objects accurately, reducing missed and false detections. Meanwhile, a PSNR of 45.70 indicates that image details and structure are well preserved, which helps improve small object detection accuracy.

Comparisons with previous methods

As shown in Fig. 5, the mAP@0.5 training curves of Super-yolo-star46, Super-yolo-Dattention47, Super-yolo-RFAConv41, Yolov548, Yolov849 and Yolov1150 all stay below that of the proposed method over 300 training epochs. The proposed method maintains high precision at high recall, demonstrating excellent network performance.

Fig. 5

Comparison of mAP@0.5 training curves of different methods.

To represent the multi-scale feature maps of the Super Mamba model at the P3, P4 and P5 stages of the neck more intuitively, the feature maps for each stage are shown at the top of Fig. 6, and the small-object feature fusion maps for each stage are shown in the fifth column at the top of Fig. 6. The experiments show that the proposed method accurately extracts multi-scale features of small objects, and the fusion results after the feature enhancement and bidirectional pyramid modules are satisfactory.

Fig. 6

Feature-level visualization of the Super Mamba backbone with the same input; (a–d) display four distinct samples from the VEDAI dataset. Top: (a) and (b) are the feature heatmaps of small objects at different stages. Bottom: (c) and (d) are the feature heatmaps for different methods.

Comparing seven methods on the VEDAI dataset (bottom of Fig. 6), apart from Yolov11 and our proposed method, the Super-DA method loses infrared thermal features of small objects, while the other four methods enlarge the apparent extent of the infrared thermal features of small objects to varying degrees.

Fig. 7

Visual comparisons of small object detection using different methods, where subfigures (a–d) display four representative samples from the VEDAI dataset. As shown in the bottom row, our SMamba method achieves a superior performance of over 0.83 in mAP@0.5, whereas other methods frequently exhibit undetected cases. The missed small objects are highlighted with yellow circles.

On the VEDAI dataset, owing to complex background interference, the six methods (Super-yolo-star, Super-yolo-Dattention, Super-yolo-RFAConv, Yolov5, Yolov8 and Yolov11) show varying degrees of missed detection for small objects such as cars or boats, as indicated by the circled small objects in the last column of Fig. 7. It is also evident that our proposed method achieves significantly higher detection accuracy for small objects than the other methods. In addition, the best performance is achieved for the Car, Pickup, Tractor and Camping classes, which have the most training instances, as shown in Fig. 7.

As shown in Table 2, the proposed method has fewer parameters than the Yolo series models, yet its average precision significantly exceeds theirs, reaching 92.3%. The Super-yolo-star model has the lowest precision at only 58.1%, while the Super-yolo-RFAConv model has the lowest recall at 59.47%. Although Super-yolo-star, Super-yolo-RFAConv and Super-yolo-Dattention have fewer parameters and faster computation, their mean average precision does not reach 75%. Our method achieves precision, recall and mAP@0.5 of 91.4%, 91.9% and 92.3%, respectively, for small object detection; in particular, mAP@0.75 reaches 89.1% and mAP@0.95 reaches 77.8%.

The proposed network not only addresses small object detection in complex scenes but also exhibits high robustness and strong generality, and can be applied effectively to a variety of small object detection scenarios.

Table 2 Performance comparison of different object detection algorithms on VEDAI dataset.

Generalization analysis

To verify the generalization capability of our framework, we conducted small object detection experiments on two additional open-source datasets, M3FD and LLVIP, described below. We selected six common methods for comparison: Super-yolo-star, Super-yolo-Dattention, Super-yolo-RFAConv, Yolov5, Yolov8 and Yolov11.

Table 3 Performance comparison of different object detection algorithms on M3FD and LLVIP datasets.
  1. M3FD (Multispectral, Multimodal and Multiscale Fusion Detection): Released in 2022, the dataset includes 4,200 pairs of calibrated and aligned infrared and RGB images. It covers four main scenario types with varied environments, lighting, seasons and weather, with a wide range of pixel variations. Most images have a resolution of 1024 × 768 pixels, with annotations for six categories: pedestrians, cars, buses, motorcycles, streetlights and trucks. The total size of the dataset exceeds 15 GB.

  2. LLVIP (a visible-infrared paired dataset for low-light vision): Released in 2021, the dataset focuses on pedestrian detection in dark and complex environments. It covers various real-world street scenes with diverse lighting conditions and provides more than 15,000 aligned visible-infrared image pairs, each with a resolution of 1920 × 1080 pixels.

As shown in Table 3, our Super Mamba achieves the best detection results on the M3FD dataset, with mAP@0.5, mAP@0.75 and mAP@0.95 of 93.22%, 90.34% and 88.65%, respectively, exceeding the other detection methods. Similarly, the proposed method achieves 94.13%, 87.75% and 85.14% on the LLVIP dataset. These findings indicate that, despite the differing characteristics of the datasets, the model generalizes well and adapts to different scenarios and tasks, providing evidence for further optimization and pointing the way for future research.

Conclusion

This paper improves the Mamba algorithm and proposes the multi-scale fusion Super Mamba small object detection algorithm. First, the receptive-field attention convolution RFAConv is integrated into the backbone in place of the commonly used Conv, effectively extracting spatial features of the receptive field, alleviating the issue of shared convolution parameters and enhancing network performance. Second, spatial and channel attention mechanisms are added to the feature-enhanced Mamba model, achieving multi-scale, multi-feature extraction of small objects. Third, when BiFPN is used for multi-level feature fusion in the neck, introducing the FEM block enhances the local context information of small objects. Experimental results on three datasets show that, compared with other mainstream algorithms, the improved model achieves mAP@0.5 of 92.3%, 93.2% and 94.1%, respectively, meeting the requirements for real-time detection of small objects.

The small objects detected in this work are around 20 × 20 pixels; extending the method to ultra-small objects below 10 × 10 pixels requires further research. In future work, we plan to systematically evaluate performance in multimodal complex scenes across the visible and infrared spectra, optimize the deep learning network with graph neural networks, and use less labeled data during training to further enhance the model's generalization ability, aiming for application in challenging environments such as on the ground or underwater.