Introduction

Because mid- and long-wave infrared imagery retains good imaging performance at night and in low-light environments, object detection in UAV infrared images is widely used in security monitoring1, military reconnaissance2, medical imaging3, environmental monitoring4 and other scenarios. For small infrared objects with weak thermal radiation, detection still faces challenges from complex background interference, weak infrared features, algorithm complexity and real-time requirements. Small objects are usually smaller than 20 × 20 pixels. According to the motion characteristics of the detected object, infrared small object detection research falls into two areas: single-frame images and video sequences. This paper focuses on single-frame infrared small object detection.

In recent years, deep learning detectors have been divided into one-stage and two-stage methods according to whether candidate regions are generated. One-stage methods use a feedforward network to locate and classify objects directly, yielding high real-time efficiency; typical examples include the Yolo series (You Only Look Once)5, SSD (Single Shot MultiBox Detector)6 and RetinaNet7. Two-stage methods first generate candidate regions and then classify them and regress bounding boxes; they achieve high accuracy in complex scenarios but are comparatively slow. Typical models include Faster R-CNN8, R-FCN9 and Mask R-CNN10. Object detection in UAV infrared images must overcome difficulties such as low signal-to-noise ratio, complex object motion, and interference or occlusion by clouds. Therefore, the lightweight Yolo series, known for high efficiency and accuracy, has been widely adopted in complex scenarios with strict real-time requirements.

Tian et al.11 proposed the IV-Yolo object detection network, which integrates visible and infrared image features through a two-branch fusion structure for UAV small object detection. Xiao et al.12 introduced a new C2f-DCNv3 module to enhance feature extraction and added BiFPN to the neck to achieve multi-scale feature fusion for small UAV objects. Xue et al.13 proposed the lightweight FECI-RTDETR algorithm for UAV infrared small object detection, which combines a spatial feature selection mechanism with an intra-scale feature interaction module and exploits cross-scale feature fusion to capture the contextual semantics of aerial infrared small objects. Zhang et al.14 proposed the ESD-Yolov8 model, which identifies defect characteristics of infrared solar cells and enhances the detection of small defects by integrating the EMA attention mechanism into the C2f_EMA module. Zhu et al.15 combined Yolov9 with DeepSORT to design multi-object tracking (MOT) models for tracking and identifying endangered birds and drones.

With the rapid development of deep learning, Gu and Dao proposed the Mamba model16 based on the State Space Model (SSM)17, which processes long sequences with linear complexity by means of an efficient hardware-aware algorithm. In small object detection, the Yolo model may miss detections, whereas the Mamba model can enhance small object features through multi-frame fusion and thus improve detection. Wang et al.18 proposed Mamba YOLO, a novel object detection framework that integrates an SSM backbone into the YOLO architecture to establish a new, simple baseline for efficient and effective detection. Wang et al.19 introduced Mamba-YOLO-World, which combines the open-vocabulary detection capabilities of YOLO-World with the efficient sequence modeling of a Mamba-based backbone to enhance detection of objects from an unlimited vocabulary. Malekmohammadi et al.20 conducted a comparative analysis of YOLO, Vision Transformers and the emerging Vision Mamba architecture for classifying Gleason grades in prostate cancer histopathology images.

Building on the local feature extraction of convolutional layers and the long-range dependency modeling of state space models (SSM), some scholars have proposed dual-branch structures for object detection21,22. Integrating attention mechanisms into the Mamba model can dynamically focus on key frames and object areas, extract global contextual information, and reduce background interference23,24. Improvements to Mamba and Yolo have been applied to medical image diagnosis25,26 and road damage detection27,28. Yu et al.29 designed the SFFNet and the VMamba-GIE module to enhance infrared small object features. Chen et al.30 designed MiM-ISTD, a nested structure of outer and inner Mamba modules that captures global and local infrared image features while limiting computational cost and memory. Li et al.31 proposed the HMCNet model, a hybrid architecture combining a state space model and CNN. Lu et al.32 explored the effectiveness and efficiency of state space models for single-frame ISTD tasks, integrating the visual Mamba module into a lightweight CNN-based network for infrared small object segmentation. Zhang et al.33 proposed an encoder-decoder architecture featuring Pixel Difference Mamba (PDMamba) and a Layer Restoration Module (LRM). In recent years, computer vision research has trended towards model efficiency and cross-domain adaptability. Introducing the Residual Channel Attention mechanism (RCA)34 and improving the YOLO series architectures35,36, combined with semi-automated dataset construction techniques, has significantly enhanced the domain adaptability of object detection. These techniques have been applied across spatial scales: remote sensing image classification and earthquake damage assessment at the macro level37, vehicle detection and tracking in aerial video at the meso level38,39, and real-time driver state monitoring at the micro level40, demonstrating the generalization ability of deep learning across different spatial scales.

Fig. 1

Small object detection results. Top: Example images from the VEDAI dataset demonstrate the challenges in small object features. Bottom: Performance comparison of different methods in terms of mAP@0.5.

In infrared small object detection, Yolo stands out for its efficient real-time performance and multi-scale detection capability, while the Mamba model, based on the state space model (SSM), is particularly suited to sequence data. Despite advances in both, no existing framework effectively combines their strengths for infrared small object detection. Motivated by this gap, this paper combines the real-time performance of Yolo with Mamba’s context modeling ability to construct a high-precision infrared small object detection system for complex scenarios.

The contributions of this study are summarized as follows:

  1. We propose the Super Mamba model, a multi-modal fusion object detection framework based on SSM and Yolov8. Experiments on VEDAI, as shown in Fig. 1, demonstrate that Super Mamba achieves a significant performance improvement over existing approaches. We replace standard convolution with RFAConv, capturing multi-scale features through a multi-branch structure; the network adaptively adjusts its receptive field, and dilated convolution reduces the number of parameters, making it better suited to small object detection.

  2. In the network backbone, SAM is integrated into the VSS module, which dynamically computes the weights of different spatial positions and utilizes SSM to capture global contextual information of the image, efficiently extracting and fusing small object features. Subsequently, in the core regions of small objects, the combination of SE and VSS forms a “local-global” collaborative feature enhancement effect. This framework significantly improves the accuracy and robustness of infrared small object detection while maintaining linear time complexity.

  3. We introduce a multi-scale dynamic feature optimization mechanism that realizes multi-level feature fusion through BiFPN and combines it with FEM to enhance the detailed features of small objects, effectively addressing the challenges of small object detection in complex backgrounds.

The rest of the paper is organized as follows: “Related work” reviews related work, including the state space model and receptive-field attention convolution. “Methods” describes the proposed approach. “Experimental results” presents the experimental results and associated discussion, and “Generalization analysis” evaluates cross-dataset generalization. Finally, “Conclusion” summarizes the findings and outlines future research directions.

Related work

State space model

Recently, the SSM17 has received much attention in deep learning because its cost scales linearly with sequence length. The SSM can be regarded as a linear time-invariant system that maps the input sequence \(x(t)\in\mathbb{R}\) to the output response \(y(t)\in\mathbb{R}\) through the hidden state \(h(t)\in\mathbb{R}^{N}\); the process can be expressed by a linear ordinary differential equation:

$$h'(t)=Ah(t)+Bx(t)$$
(1)
$$y(t)=Ch(t)+Dx(t)$$
(2)

In the formulas, \(A\in\mathbb{R}^{N\times N}\) is the state matrix, and \(B\in\mathbb{R}^{N\times 1}\), \(C\in\mathbb{R}^{1\times N}\) and \(D\in\mathbb{R}^{1}\) are projection matrices.

For use in deep learning, the continuous SSM must be discretized. Gu17 proposed the S4 model as a discretization of the continuous SSM, converting the continuous parameters \(A\) and \(B\) into discrete parameters \(\overline{A}\) and \(\overline{B}\) using the step size parameter \(\Delta\). The most common discretization method for SSM is the zero-order hold (ZOH), computed as:

$$\overline{A}=\exp(\Delta A)$$
(3)
$$\overline{B}=(\Delta A)^{-1}\left(\exp(\Delta A)-I\right)\Delta B$$
(4)

In the formula, \(I\) is the identity matrix. The discretized SSM equations are then:

$$h_{t}=\overline{A}h_{t-1}+\overline{B}x_{t}$$
(5)
$$y_{t}=Ch_{t}$$
(6)

where \(h_{t}\) is the hidden state at time step \(t\), and \(x_{t}\) and \(y_{t}\) are the input and output at time step \(t\), respectively.
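
To make the discretization concrete, the following minimal NumPy sketch implements Eqs. (3)-(6) for a toy system; the random matrices, the step size \(\Delta=0.1\) and the sinusoidal input are illustrative assumptions, not values used in the paper.

```python
import numpy as np
from scipy.linalg import expm

def zoh_discretize(A, B, delta):
    """Zero-order-hold discretization (Eqs. 3-4)."""
    dA = delta * A
    A_bar = expm(dA)                                         # A_bar = exp(delta * A)
    B_bar = np.linalg.inv(dA) @ (A_bar - np.eye(A.shape[0])) @ (delta * B)
    return A_bar, B_bar

def ssm_scan(A_bar, B_bar, C, xs):
    """Linear-time recurrence of Eqs. (5)-(6): h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t."""
    h = np.zeros((A_bar.shape[0], 1))
    ys = []
    for x in xs:                                             # xs: 1-D input sequence
        h = A_bar @ h + B_bar * x
        ys.append((C @ h).item())
    return np.array(ys)

# Toy usage with a random 4-dimensional state (illustrative only).
rng = np.random.default_rng(0)
A = -np.eye(4) + 0.1 * rng.standard_normal((4, 4))           # roughly stable state matrix
B = rng.standard_normal((4, 1))
C = rng.standard_normal((1, 4))
A_bar, B_bar = zoh_discretize(A, B, delta=0.1)
y = ssm_scan(A_bar, B_bar, C, xs=np.sin(np.linspace(0.0, 6.28, 64)))
```
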

Receptive-field attention convolution operation

In standard convolution, the kernel extracts information with shared parameters, so the network is insensitive to positional differences within the receptive field. RFAConv addresses this parameter-sharing issue by weighting the individual features within the sliding receptive-field window and emphasizing its spatial characteristics41. Replacing the convolution operations in the backbone with RFAConv therefore enhances the detailed features of small objects and makes the network more precise in recognizing them, at a relatively small computational cost.
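
The following PyTorch sketch illustrates the receptive-field attention idea in simplified form: each k × k window is expanded into its receptive-field features, re-weighted by attention scores derived from the window itself, and aggregated by a stride-k convolution. It follows the published RFAConv design41 in outline only; layer choices such as the grouped convolutions and ReLU are assumptions, not the exact released implementation.

```python
import torch
import torch.nn as nn

class SimpleRFAConv(nn.Module):
    """Simplified sketch of receptive-field attention convolution (RFAConv)."""
    def __init__(self, in_ch, out_ch, k=3, stride=1):
        super().__init__()
        self.k, self.stride = k, stride
        # attention logits for each of the k*k positions in every window
        self.get_weight = nn.Sequential(
            nn.AvgPool2d(kernel_size=k, padding=k // 2, stride=stride),
            nn.Conv2d(in_ch, in_ch * k * k, kernel_size=1, groups=in_ch),
        )
        # expands each location into its k*k receptive-field features
        self.generate_feature = nn.Sequential(
            nn.Conv2d(in_ch, in_ch * k * k, kernel_size=k, padding=k // 2,
                      stride=stride, groups=in_ch),
            nn.BatchNorm2d(in_ch * k * k),
            nn.ReLU(inplace=True),
        )
        # aggregates the re-weighted receptive field with a stride-k convolution
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=k, stride=k)

    def forward(self, x):
        b, c = x.shape[:2]
        weight = self.get_weight(x)                           # (b, c*k*k, h, w)
        h, w = weight.shape[2:]
        weight = weight.view(b, c, self.k * self.k, h, w).softmax(dim=2)
        feat = self.generate_feature(x).view(b, c, self.k * self.k, h, w)
        weighted = feat * weight                              # re-weight receptive-field features
        # rearrange the k*k positions into an (h*k, w*k) spatial grid
        weighted = weighted.view(b, c, self.k, self.k, h, w)
        weighted = weighted.permute(0, 1, 4, 2, 5, 3).reshape(b, c, h * self.k, w * self.k)
        return self.conv(weighted)
```
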

Methods

Network framework for small object detection

The infrared small object detection algorithm proposed in this paper is an improved one-stage detector built on the Super Mamba framework. It uses the lightweight Super Mamba structure and a multi-scale network for feature extraction, and improves the BiFPN in the neck with the FEM module. The classification and regression sub-networks of the decoupled head perform position regression and classification of infrared small objects. The overall network framework is shown in Fig. 2. Super Mamba is divided into three parts: the backbone, the neck and the head.

In the backbone of the Super Mamba model, multi-scale features are extracted in four stages. In Stage 1 and Stage 2, the SAM block is integrated into the VSS block, which helps extract the spatial location information and local features of infrared small objects. In Stage 3 and Stage 4, SE is combined with the VSS block to enhance channel features, suppress irrelevant channel information, and strengthen the global features of infrared small objects. In the neck, the multi-branch convolution of FEM is integrated into the top-down and bottom-up BiFPN to fuse the contextual semantic information of small objects. Finally, the decoupled head of the Yolo model classifies and predicts on the input images.

Simple RFAConv initialization module

Recent studies indicate that initial modules that split images into non-overlapping patches may limit the optimization capability of the network and thereby affect overall performance42. We therefore propose a simple RFAConv initial module. Instead of non-overlapping patches, we use two RFAConv layers with a kernel size of 3 and a stride of 2. RFAConv not only emphasizes the importance of individual features within the receptive field but also attends to the spatial characteristics of the receptive field. The implementation of this RFAConv initial module is shown as the red RFAConv module in Fig. 2.

Let the input image be \(X\in\mathbb{R}^{H\times W\times C}\), where H is the height, W the width and C the number of channels. The simple initialization module applies two RFAConv convolutional layers:

$$X_{S}=\sigma\left(BN\left(RFAConv\left(X\right)\right)\right)$$
(7)

The output is \(X_{ST}=X\odot X_{S}\), where \(\odot\) denotes element-wise multiplication.
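
A minimal sketch of the stem described by Eq. (7), reusing the SimpleRFAConv class from the sketch above: two RFAConv layers with kernel size 3 and stride 2, each followed by BatchNorm and SiLU. The channel widths are illustrative, and the element-wise gate \(X_{ST}=X\odot X_{S}\) requires resolution matching that the text does not detail, so it is only noted in a comment.

```python
import torch.nn as nn

class RFAConvStem(nn.Module):
    """Sketch of the simple RFAConv initialization module (Eq. 7).

    Replaces non-overlapping patch embedding with two overlapping,
    stride-2 RFAConv layers; channel widths are illustrative.
    """
    def __init__(self, in_ch=3, mid_ch=32, out_ch=64):
        super().__init__()
        self.stage1 = nn.Sequential(SimpleRFAConv(in_ch, mid_ch, k=3, stride=2),
                                    nn.BatchNorm2d(mid_ch), nn.SiLU())
        self.stage2 = nn.Sequential(SimpleRFAConv(mid_ch, out_ch, k=3, stride=2),
                                    nn.BatchNorm2d(out_ch), nn.SiLU())

    def forward(self, x):
        # Eq. (7), applied twice: X_S = SiLU(BN(RFAConv(X))).
        # The paper's element-wise gate X_ST = X * X_S would additionally
        # require resizing X to the output resolution; omitted in this sketch.
        return self.stage2(self.stage1(x))
```
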

Fig. 2

Overview of the proposed Super Mamba network structure. Our new contributions include: (1) RFAConv replaces standard convolution to capture multi-scale features with adaptive receptive fields, (2) SAM and SE are integrated into the VSS module, enabling dynamic spatial weighting and global context modeling for small objects, (3) the FEM-BiFPN mechanism optimizes multi-scale fusion and detail enhancement.

VSS module integrated with attention mechanism

In our network, SAM emphasizes spatial regions containing critical small objects, while SE enhances channel-wise feature discriminability for refined object representation; both are integrated directly into the VSS block. The Spatial Attention Mechanism (SAM) searches for global contextual information and, combined with the VSS block, efficiently encodes the input images43. This reduces redundant computation and lets the model focus on small object areas in infrared images (as shown in Fig. 3). The implementation is as follows:

The global average pooling \(X_{avg}\in\mathbb{R}^{H\times W\times 1}\) and the global maximum pooling \(X_{max}\in\mathbb{R}^{H\times W\times 1}\) are concatenated along the channel dimension to obtain \(F_{cat}\in\mathbb{R}^{H\times W\times 2}\):

$$X_{avg}=AvgPool\left(X_{ST}\right)$$
(8)
$$X_{max}=MaxPool\left(X_{ST}\right)$$
(9)
$$F_{cat}=Concat\left[X_{avg};X_{max}\right]$$
(10)

Convolving \(F_{cat}\) with a \(1\times 1\) kernel yields the spatial attention feature map \(M_{s}\). The SiLU function \(\sigma\) is applied to \(M_{s}\), giving the normalized spatial attention weight \(\widehat{M_{s}}\in\left(0,1\right)\). The enhanced spatial attention feature map is denoted \(F_{out}\):

$$M_{s}=Conv\left(F_{cat}\right)$$
(11)
$$\widehat{M_{s}}=\sigma\left(M_{s}\right)$$
(12)
$$F_{out}=\widehat{M_{s}}\odot X_{ST}$$
(13)
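
A compact PyTorch sketch of Eqs. (8)-(13), assuming channel-wise average and max pooling and a 1 × 1 convolution as described in the text; SiLU is used exactly as stated, even though its range is not strictly (0, 1).

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch of the SAM branch (Eqs. 8-13); kernel size and activation follow the text."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=1)   # Eq. (11): 1x1 convolution
        self.act = nn.SiLU()                         # Eq. (12): sigma is SiLU per the text

    def forward(self, x_st):
        x_avg = x_st.mean(dim=1, keepdim=True)       # Eq. (8): channel-wise average pooling
        x_max = x_st.amax(dim=1, keepdim=True)       # Eq. (9): channel-wise max pooling
        f_cat = torch.cat([x_avg, x_max], dim=1)     # Eq. (10): concatenation
        m_hat = self.act(self.conv(f_cat))           # Eqs. (11)-(12): attention weights
        return m_hat * x_st                          # Eq. (13): enhanced feature map F_out
```
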
Fig. 3

Cross-modal multi-scale feature extraction block, which consists of four modules, (1) SAM block enhances spatial focus by selectively emphasizing critical regions and suppressing irrelevant background noise; (2) SE block improves feature representation by adaptively recalibrating channel-wise feature responses; (3) VSS block enables efficient global context modeling with linear complexity through state space mechanisms; (4) SS2D block achieves comprehensive spatial context fusion by transforming 2D data into structured 1D sequences for selective scanning.

Fig. 4

Illustration of 2D-Selective-Scan (SS2D). Input patches are traversed along four different scanning paths (Cross-Scan), and each sequence is independently processed by distinct S6 blocks. Subsequently, the results are merged to construct the 2D feature map as the final output (Cross-Merge).

The VSS block enhances the model’s expressive capability by performing variable spatial sampling on feature maps, efficiently handling the spatial dependencies of long image sequences. In this paper, the spatially enhanced feature map \(F_{out}\) is layer-normalized to \(X_{LN}\); a 2D cross-scan of \(X_{LN}\) (SS2D, Fig. 4) then produces \(X_{SS2D}\), which is concatenated with \(F_{out}\). Finally, \(X_{cat}\) is layer-normalized and passed through a nonlinear feed-forward transformation, and the output feature map \(X_{out}\) is obtained by dynamically adjusting the gating weights.

$$X_{LN}=LN\left(F_{out}\right)$$
(14)
$$X_{SS2D}=SS2D\left(X_{LN}\right)$$
(15)
$$X_{cat}=Concat\left(X_{SS2D};F_{out}\right)$$
(16)
$$X_{out}=X_{cat}\odot FFN\left(LN\left(X_{cat}\right)\right)$$
(17)
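
The wiring of Eqs. (14)-(17) can be sketched as follows. The SS2D selective-scan core is replaced here by a depthwise-convolution placeholder, and the FFN expansion factor and GELU activation are assumptions, so this is a structural sketch rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class VSSBlockSketch(nn.Module):
    """Structural sketch of the attention-augmented VSS block (Eqs. 14-17)."""
    def __init__(self, dim):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Placeholder for SS2D: a real implementation uses the four-path 2D selective scan.
        self.ss2d = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.norm2 = nn.LayerNorm(2 * dim)
        self.ffn = nn.Sequential(nn.Linear(2 * dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, 2 * dim))

    def forward(self, f_out):                            # f_out: (B, C, H, W)
        x = f_out.permute(0, 2, 3, 1)                    # to (B, H, W, C) for LayerNorm
        x_ln = self.norm1(x)                             # Eq. (14)
        x_ss = self.ss2d(x_ln.permute(0, 3, 1, 2))       # Eq. (15), SS2D stand-in
        x_cat = torch.cat([x_ss.permute(0, 2, 3, 1), x], dim=-1)   # Eq. (16)
        gate = self.ffn(self.norm2(x_cat))               # Eq. (17): FFN(LN(X_cat))
        x_out = x_cat * gate                             # gated output
        return x_out.permute(0, 3, 1, 2)                 # back to (B, 2C, H, W)
```
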

In the deeper stages of the network, the SE block is introduced. Through its squeeze and excitation steps it adaptively recalibrates the channel weights of the feature map, and combined with the VSS block it yields multi-scale features of the infrared image, as shown in Fig. 3.

Global average pooling is performed on the feature map \(X_{out}\in\mathbb{R}^{H\times W\times C}\). The global information of each channel is expressed as \(Z\), which is transformed by two fully connected layers into \(Z_{1}\); the channel weight vector is \(Z_{2}\), and the output feature map is \(T_{out}\):

$$Z=Squeeze\left(X_{out}\right)$$
(18)
$$Z_{1}=\sigma\left(W_{1}Z+b_{1}\right)$$
(19)
$$Z_{2}=Sigmoid\left(W_{2}Z_{1}+b_{2}\right)$$
(20)
$$T_{out}=X_{out}\odot Z_{2}$$
(21)

Here, the weights are \(W_{1}\in\mathbb{R}^{\frac{C}{r}\times C}\) and \(W_{2}\in\mathbb{R}^{C\times\frac{C}{r}}\), and the biases are \(b_{1}\in\mathbb{R}^{\frac{C}{r}}\) and \(b_{2}\in\mathbb{R}^{C}\). Combining with the VSS block then yields the multi-scale feature map \(E_{out}\):

$$E_{out}=VSS\left(T_{out}\right)$$
(22)
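
A minimal sketch of the channel recalibration in Eqs. (18)-(21); the reduction ratio r = 16 and the ReLU used for \(\sigma\) in Eq. (19) are common SE defaults and are assumptions here.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Sketch of squeeze-and-excitation channel recalibration (Eqs. 18-21)."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // r)    # W1, b1
        self.fc2 = nn.Linear(channels // r, channels)    # W2, b2

    def forward(self, x_out):                            # x_out: (B, C, H, W)
        z = x_out.mean(dim=(2, 3))                       # Eq. (18): squeeze via global average pooling
        z1 = torch.relu(self.fc1(z))                     # Eq. (19)
        z2 = torch.sigmoid(self.fc2(z1))                 # Eq. (20): channel weight vector
        return x_out * z2[:, :, None, None]              # Eq. (21): recalibrated feature map T_out
```
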

The feature fusion module of BiFPN with FEM

The Feature Enhancement Module (FEM) uses a multi-branch convolutional structure to extract contextual semantic information and strengthen the representation of infrared small object features, thereby improving small object detection in complex backgrounds. In this paper, a multi-scale bidirectional pyramid feature fusion network based on FEM is proposed for the SMamba neck (as shown in Fig. 2).

Since infrared small objects are often difficult to identify in high-level feature maps, we improve the feature pyramid in the neck of our SMamba with FEM and BiFPN44. Through bidirectional cross-scale connections and weighted feature fusion, BiFPN alleviates the dilution and loss of small object detail that occurs in traditional FPN during information transmission, more fully integrating low-level features carrying key details of small objects with high-level features carrying semantic context. First, the local context of small objects is enhanced through a multi-branch structure and dilated convolution. Multi-scale convolutions are applied to the feature maps \(E_{out}\in\mathbb{R}^{H\times W\times C}\) extracted at each stage; convolution kernels of different scales \(K_{1},K_{2},\cdots,K_{n}\) yield feature maps \(Y_{1},Y_{2},\cdots,Y_{n}\), with \(Y_{i}=E_{out}*K_{i}+b_{i}\;(i=1,2,\cdots,n)\). Then, the Bidirectional Feature Pyramid Network (BiFPN) performs feature fusion in both top-down and bottom-up directions, so that each feature map receives information from both higher-level and lower-level features, thereby enhancing the features of infrared small objects.
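
A sketch of such a multi-branch enhancement block follows; the number of branches and the dilation rates (1, 3, 5) are illustrative assumptions, since the paper does not list them.

```python
import torch
import torch.nn as nn

class FEM(nn.Module):
    """Sketch of a feature enhancement module: parallel 3x3 branches with
    different dilation rates capture multi-scale context around small objects."""
    def __init__(self, channels, dilations=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=d, dilation=d),
                          nn.BatchNorm2d(channels), nn.SiLU())
            for d in dilations])
        self.fuse = nn.Conv2d(channels * len(dilations), channels, 1)  # 1x1 fusion

    def forward(self, e_out):
        y = torch.cat([branch(e_out) for branch in self.branches], dim=1)
        return self.fuse(y)    # enhanced features passed on to the BiFPN
```
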

Experimental results

Dataset

We conduct comprehensive experiments on the proposed Super Mamba model. The main experiments use the VEDAI dataset, which consists of 1,246 images of small objects over varied backgrounds captured from an aerial perspective. All images are 512 × 512 pixels and cover eight categories, including car, pickup, truck and camping car. Additionally, the M3FD and LLVIP datasets are selected to further validate the detection performance of our framework. VEDAI, M3FD and LLVIP were chosen to systematically test the robustness of the model under varying lighting conditions and background complexity: VEDAI provides multi-view complex backgrounds, M3FD covers a range of extreme lighting conditions, and LLVIP focuses on low-light challenges. This combination ensures the comprehensiveness and generality of the evaluation.

Implementation details

Our framework is implemented in Python and runs on a workstation with an NVIDIA GeForce RTX 4070 Ti SUPER GPU (16 GB). During training, each image is resized to 640 × 640. For a three-band input of this size, the floating-point operations (FLOPs) and parameter size of our Super Mamba are 28.4 G and 17.6 MB, respectively. The Adam optimizer is used with a batch size of 16 and 300 epochs for network optimization.

Before training, the VEDAI dataset is divided into training, test and validation sets in a 7:2:1 ratio. The number of training epochs is set to 150 and the initial learning rate to 0.01. Given the large number of small objects in the sample images and the need to balance real-time performance and accuracy, samples are normalized to 640 × 640. To ensure fairness, no pre-trained weights are used in the ablation experiments, and all training runs share the same hyperparameter settings.
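
For reference, the training settings reported above can be summarized in a single hypothetical configuration dictionary; the key names are illustrative and not the authors' configuration file.

```python
# Hypothetical summary of the reported training settings; names are illustrative.
TRAIN_CONFIG = {
    "input_size": 640,          # images resized to 640 x 640
    "optimizer": "Adam",
    "batch_size": 16,
    "epochs": 300,              # the implementation details report 300; the split paragraph reports 150
    "lr0": 0.01,                # initial learning rate
    "dataset_split": {"train": 0.7, "test": 0.2, "val": 0.1},
    "pretrained": False,        # no pre-trained weights in the ablation experiments
}
```
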

Accuracy metrics

The evaluation metrics in this paper include Precision (P), Recall (R), mean Average Precision (mAP), Peak Signal-to-Noise Ratio (PSNR) and the number of model parameters. In general, higher values of these metrics indicate better performance for infrared small object detection. P, R and AP are computed as

$$P=\frac{TP}{TP+FP};$$
$$R=\frac{TP}{TP+FN};$$
$$AP=\int_{0}^{1}P\left(R\right)\,dR;$$

where true positives (TP) and true negatives (TN) denote correct predictions, and false positives (FP) and false negatives (FN) denote incorrect ones. Precision and recall correspond to commission and omission errors, respectively. AP is the area enclosed by the Precision-Recall curve and the coordinate axes for a single class, computed by integration, and mAP is the comprehensive indicator obtained by averaging AP over all categories:

$$mAP=\frac{\sum_{n=1}^{Num\left(classes\right)}AP\left(n\right)}{Num\left(classes\right)}.$$
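
For clarity, a short Python sketch of these metrics is given below, using the standard all-point interpolation of the precision-recall curve for AP; the helper names are illustrative.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Precision and recall from detection counts (P and R above)."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recalls, precisions):
    """AP as the area under the precision-recall curve (all-point interpolation).

    `recalls` must be sorted in ascending order; mAP is then the mean of the
    per-class AP values.
    """
    r = np.concatenate(([0.0], np.asarray(recalls, float), [1.0]))
    p = np.concatenate(([0.0], np.asarray(precisions, float), [0.0]))
    # make precision monotonically decreasing, then integrate over recall
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```
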

Ablation study

As shown in Table 1, the precision of the baseline Mamba model is 71.4%, with a mean average precision of 70%. After adding the RFAConv module to the backbone, precision increases by 2.6 percentage points and mAP by 2.2 percentage points, indicating that applying attention in receptive-field space needs only a small computational cost to compensate for shared convolution parameters and improve performance. Introducing the SAM module alone raises mAP by 4.7 percentage points, suggesting that spatial attention suppresses background interference and makes the network focus on the local areas where small objects are located. Adding the SE channel attention alone barely changes detection accuracy but significantly reduces computational cost. Activating the FEM module alone lets the model integrate features from different receptive fields while retaining and enhancing small object details, increasing mAP by 6.9 percentage points and recall by 2.5 percentage points. When only one of the RFAConv, SAM, SE or FEM modules is added to the Mamba model, mAP@0.95 remains around 40%; since stricter IoU thresholds penalize small objects more heavily45, mAP@0.95 stays low.

Table 1 The ablation experiment results for object detection performance on the VEDAI dataset.

When all four modules are added to the backbone and neck simultaneously, the proposed Super Mamba model achieves a significant improvement: mAP@0.5 increases by 22.3 percentage points to 92.3%, and mAP@0.95 reaches 77.8%. These results show that the proposed model can both identify and localize small objects accurately, reducing missed and false detections. Meanwhile, a PSNR of 45.70 indicates that image details and structure are well preserved, which helps improve small object detection accuracy.

Comparisons with previous methods

As shown in Fig. 5, the mAP@0.5 training curves of Super-yolo-star46, Super-yolo-Dattention47, Super-yolo-RFAConv41, Yolov548, Yolov849 and Yolov1150 all stay below that of the proposed method over 300 training epochs. The proposed method maintains high precision at high recall, demonstrating excellent network performance.

Fig. 5

Comparison of mAP@0.5 training curves of different methods.

To represent the multi-scale feature maps of the Super Mamba model at the P3, P4 and P5 stages of the neck more intuitively, the feature maps for each stage are shown at the top of Fig. 6, and the small-object feature fusion maps for each stage are shown in the fifth column at the top of Fig. 6. The experiments show that the proposed method accurately extracts multi-scale features of small objects, and the fusion results after the feature enhancement and bidirectional pyramid modules are satisfactory.

Fig. 6

Feature-level visualization of the Super Mamba backbone with the same input; (a–d) display four distinct samples from the VEDAI dataset. Top: (a) and (b) are the feature heatmaps of small objects at different stages. Bottom: (c) and (d) are the feature heatmaps for different methods.

Comparing seven methods on the VEDAI dataset (bottom of Fig. 6), apart from Yolov11 and our proposed method, the Super-DA method loses infrared thermal features of small objects, while the other four methods enlarge the apparent extent of the infrared thermal features of small objects to varying degrees.

Fig. 7

Visual comparisons of small object detection using different methods, where subfigures (a–d) display four representative samples from the VEDAI dataset. As shown in the bottom row, our SMamba method achieves a superior performance of over 0.83 in mAP@0.5, whereas other methods frequently exhibit undetected cases. The missed small objects are highlighted with yellow circles.

On the VEDAI dataset, owing to complex background interference, the six methods (Super-yolo-star, Super-yolo-Dattention, Super-yolo-RFAConv, Yolov5, Yolov8 and Yolov11) show varying degrees of missed detection for small objects such as cars or boats, as indicated by the circled small objects in the last column of Fig. 7. It is also evident that our proposed method achieves significantly higher detection accuracy for small objects than the other methods. In addition, the best performance is achieved for the Car, Pickup, Tractor and Camping classes, which have the most training instances, as shown in Fig. 7.

As shown in Table 2, the proposed method has fewer parameters than the Yolo series models, yet its average precision significantly exceeds theirs, reaching 92.3%. The Super-yolo-star model has the lowest precision at only 58.1%, while the Super-yolo-RFAConv model has the lowest recall at 59.47%. Although Super-yolo-star, Super-yolo-RFAConv and Super-yolo-Dattention have fewer parameters and faster computation, their mean average precision does not reach 75%. Our method achieves precision, recall and mAP@0.5 of 91.4%, 91.9% and 92.3%, respectively, for small object detection; in particular, mAP@0.75 reaches 89.1% and mAP@0.95 reaches 77.8%.

The proposed network not only addresses small object detection in complex scenes but also exhibits high robustness and strong generality, and can be applied effectively to a variety of small object detection scenarios.

Table 2 Performance comparison of different object detection algorithms on VEDAI dataset.

Generalization analysis

To verify the generalization capability of our framework, we conducted small object detection experiments on two additional open-source datasets, M3FD and LLVIP, described below. We selected six common methods for comparison: Super-yolo-star, Super-yolo-Dattention, Super-yolo-RFAConv, Yolov5, Yolov8 and Yolov11.

Table 3 Performance comparison of different object detection algorithms on M3FD and LLVIP datasets.
  1. M3FD (Multispectral, Multimodal and Multiscale Fusion Detection): Released in 2022, the dataset includes 4,200 pairs of calibrated and aligned infrared and RGB images. It covers four main scenario types with varied environments, lighting, seasons and weather, with a wide range of pixel variations. Most images have a resolution of 1024 × 768 pixels, with annotations for six categories: pedestrians, cars, buses, motorcycles, streetlights and trucks. The total size of the dataset exceeds 15 GB.

  2. LLVIP (a visible-infrared paired dataset for low-light vision): Released in 2021, the dataset focuses on pedestrian detection in dark and complex environments. It covers various real-world street scenes with diverse lighting conditions and provides more than 15,000 aligned visible-infrared image pairs, each with a resolution of 1920 × 1080 pixels.

As shown in Table 3, our Super Mamba achieves the best detection results on the M3FD dataset, with mAP@0.5, mAP@0.75 and mAP@0.95 of 93.22%, 90.34% and 88.65%, respectively, exceeding the other detection methods. Similarly, the proposed method achieves 94.13%, 87.75% and 85.14% on the LLVIP dataset. These findings indicate that, despite the differing characteristics of the datasets, the model generalizes well and adapts to different scenarios and tasks, providing evidence for further optimization and pointing the way for future research.

Conclusion

This paper improves the Mamba algorithm and proposes the multi-scale fusion Super Mamba small object detection algorithm. First, the receptive-field attention convolution RFAConv is integrated into the backbone in place of the commonly used Conv, effectively extracting spatial features of the receptive field, alleviating the issue of shared convolution parameters and enhancing network performance. Second, spatial and channel attention mechanisms are added to the feature-enhanced Mamba model, achieving multi-scale, multi-feature extraction of small objects. Third, when BiFPN is used for multi-level feature fusion in the neck, introducing the FEM block enhances the local context information of small objects. Experimental results on three datasets show that, compared with other mainstream algorithms, the improved model achieves mAP@0.5 of 92.3%, 93.2% and 94.1%, respectively, meeting the requirements for real-time detection of small objects.

The small objects detected in this work are around 20 × 20 pixels; extending the method to ultra-small objects below 10 × 10 pixels requires further research. In future work, we plan to systematically evaluate performance in multimodal complex scenes across the visible and infrared spectra, optimize the deep learning network with graph neural networks, and use less labeled data during training to further enhance the model's generalization ability, aiming for application in challenging environments such as on the ground or underwater.