Introduction

Traffic sign recognition (TSR) technology1 occupies a core role in intelligent transportation systems. It not only helps with real-time traffic monitoring but also is an indispensable part of autonomous driving. In Advanced Driver Assistance Systems, TSR technology can compensate for the driver’s visual oversight. In autonomous driving (e.g., SAE Levels 3 to 5), the detection and recognition of traffic signs, as the basis of vehicle perception, is the premise to ensure the safety of autonomous driving2. However, due to various factors, TSR still faces numerous challenges. Improving the precision and reliability of TSR remains a pressing challenge requiring immediate attention.

Traditional methods for traffic sign recognition (TSR) primarily rely on geometric or handcrafted features, such as shape, color, or spatial relationships. Among them, the Histogram of Oriented Gradients (HOG) combined with a Support Vector Machine (SVM) classifier has been one of the most widely used pipelines. However, handcrafted features like HOG remain highly sensitive to variations in illumination, shadows, and adverse weather (e.g., rain, fog, or snow), even when preprocessing techniques such as Gamma correction are applied. This sensitivity limits their robustness in complex real-world scenarios, where environmental conditions vary significantly. Consequently, traditional feature-based approaches often struggle to maintain stable recognition performance without the aid of additional sensors or adaptive mechanisms. These challenges have motivated the development of deep learning-based methods3,4,5,6 (CNNs) that can automatically extract and integrate more discriminative features.

Aiming at improving accuracy and designing a flexible fusion network, this study proposes a Siamese neural network (SNN) framework that integrates an improved Mamba network with mainstream CNN architectures (including VGG-16, ResNet, AlexNet, or MobileNetV2), named Siamese-Mamba&CNN (SM&CNN). Then, a Heterogeneous Network Feature Reconstruction (HN-FR) module is introduced to fuse feature information across different scales. Based on the fusion of neural network modules, this approach integrates an improved Mamba network, which integrates the advantages of traditional Convolutional Neural Networks (CNNs) and State Space Models (SSMs). Specifically, the proposed improved Mamba replaces the convolutional structures by introducing the Depth Separable convolution Block (DS Block) based on the Med-Mamba network7, reducing computational complexity. The framework’s compatibility with various CNNs enables adaptive feature extraction, such as using ResNet’s residual learning for complex patterns, MobileNetV2’s lightweight design for mobile deployment, or VGG-16’s hierarchical features for structural recognition. Through comprehensive architectural comparisons, we establish optimal network configurations for different operational scenarios. The HN-FR module dynamically reconstructs features from these heterogeneous networks, achieving better representation capability. The novelty of this study is demonstrated through the following contributions: (1) The designed fusion network effectively integrates the lightweight architecture of the improved Mamba network and the feature extraction capability of the flexibly replaceable CNN. This flexibility allows the model to be applied to different network configurations, providing adaptability across various deployment needs while preserving reliable feature representation. In the improved Mamba network, DS Block is incorporated into the traditional Med-Mamba structure, achieving performance improvements with a reduction of 6.94 M parameters, thereby enhancing computational and memory efficiency. (2) In the fusion process of heterogeneous networks, we propose an HN-FR method to integrate heterogeneous network features and reconstruct convolution features through feature mapping, attention weighting, and full connection, achieving fast and excellent performance. Experiments show that the backbone network (such as MobileNetV2, VGG-16, ResNet50, AlexNet, etc.) improves its accuracy by 0.95% to 36.2% after integrating the proposed HN-FR method, with an average improvement of 10.8%.

Related work

In recent years, the field of TSR has attracted extensive research attention and spawned many innovative solutions. In this section, we analyze the research works in TSR over the years. Traditional methods primarily rely on geometric features8 and handcrafted features9,10 including the form and color attributes of traffic signs. In contrast, the use of neural networks to extract convolutional features, especially the integration of diverse neural network components to identify and classify traffic signs, usually promotes performance enhancement.

Traditional methods

The traditional TSR methods mainly includes the extraction of geometric features or handcrafted features. The TSR process of extracting geometric features focuses on identifying the specific shapes and geometric patterns of traffic signs, which are very important for the sign recognition process. The common methods include the Hough transform, the Canny operator, and other techniques. For example, Satti et al.11 proposed an integrated method to detect traffic signs and potholes on Indian roads. This method uses the joint application of Features from the Accelerated Segmentation Test (FAST) and the Random Sample Consensus (RANSAC) algorithm to extract and match features in the process of detecting and identifying traffic signs. Pothole detection employed an enhanced Canny Edge detector alongside a bio-inspired contour detection technique. The experimental data confirmed that the comprehensive model developed in this study surpassed the existing models in key indicators such as accuracy, specificity, and sensitivity. In the TSR process of extracting handcrafted features, traffic sign recognition mainly relies on artificially designed features. Typical handcrafted features consist of Histogram of Oriented Gradients (HOG), Scale-Invariant Feature Transform (SIFT)12, and Local Binary Pattern (LBP), among others. Kassani et al.9 proposed an innovative sparse model for traffic sign classification, utilizing a Soft Histogram of Oriented Gradients (SHOG) combined with sparse classifiers. They proposed a variant of HOG, named SHOG, which leverages the symmetry of traffic sign images to determine the optimal cell locations for histogram computation, enhancing discriminative power over traditional HOG. Furthermore, two analytical classifiers based on sparse polynomial models were introduced to implicitly select features and mitigate overfitting. The method was tested on the GTSRB dataset and 16 additional datasets from the UCI repository, showing competitive results when compared to cutting-edge approaches.

However, in practical applications, geometric or handcrafted feature-based methods suffer from illumination or weather changes, sign obstruction, or breakage. Consequently, traditional approaches9,12 generally exhibit limited consistency in recognition accuracy and robustness under diverse real-world conditions, compared with the methods based on deep learning widely used in recent years.

Deep Learning-Based methods

Neural networks, particularly deep learning models, surpass traditional approaches by extracting extensive features from images and autonomously learning complex feature representations from large-scale datasets. Deep learning methods excel in both accuracy and efficiency, and advancements in neural networks—such as AlexNet13, LeNet14, VGG-1615,16, DenseNet17, and ResNet18,19—have significantly propelled the development of object detection and recognition technologies.

Classical deep learning architectures

In the field of TSR, the development of convolutional neural networks (CNNs) has markedly advanced the accuracy of image classification. Existing approaches have generally adopted two strategies. On one hand, techniques such as optimized deep network architectures aim to maximize recognition performance under sufficient computational resources. On the other hand, lightweight network designs, including MobileNet and other parameter-efficient architectures, are employed to enable inference in resource-constrained environments. In the first strategy, Castruita et al.20 combine R-CNN or YOLO v3 for detection with a modified ResNet-50 for classification, achieving robust performance on a newly created Mexican traffic sign dataset with mAP up to 95.33% and overall classification accuracy of 99.00%, even under occlusion and non-sign scenarios. Saha et al.21 introduce a unified residual convolutional network with hierarchical dilated skip connections and a low-memory dilated residual representation, enabling high accuracy across multiple benchmarks. Zatroch et al.22 proposed a Siamese network with one-shot learning and synthetic data augmentation, allowing effective recognition from very few samples per class. Collectively, these innovations—integrated detection-classification pipelines, advanced residual architectures, and few-shot learning with data augmentation—demonstrate effective strategies for achieving high-precision TSR under diverse conditions. These studies highlight the strong recognition capability of deep CNNs, yet their large model sizes and high computational cost hinder real-time deployment on resource-limited platforms. To enhance inference efficiency, some studies have focused on network optimization and lightweight designs. Zhang et al.23 proposed the Transformer-TSD-DETR, introducing ConvNormLayer and TSDBlock for multi-scale feature extraction, which reduces parameters, improves inference speed. Lin et al.24 developed YOLO-LLTS, incorporating HRFM-SOD and multi-branch feature interactive attention to enhance small-target detection in low-light conditions, reaching 77.2% accuracy on the TT100K-night dataset while maintaining YOLO’s efficiency. Despite these gains, lightweight strategies can slightly reduce accuracy under complex or low-contrast conditions.

Apart from the aforementioned strategies, recent years have also witnessed significant representative works, such as attention mechanisms and networks based on state-space models (e.g., Mamba), which further advance traffic sign recognition. Shen et al.25 proposed a multi-scale grouping attention pyramid network. This method effectively aggregates feature information of different scales by constructing a multi-scale attention module, suppresses background interference, and enhances the perception of small-sized traffic signs. In order to realize the detection of small traffic signs under complex weather conditions, Qu et al.26 introduced the Coordinate Attention (CA) mechanism into the backbone network, embedded the location information into the channel attention, and improved the network ‘s perception of spatial details. Li et al.27 proposed a sample defense method for traffic sign confrontation based on an attention mechanism. This method simulates the human visual system, focuses on the key areas of the target in the image, and ignores the background disturbance, thereby improving the robustness of the model against attacks. The Mamba architecture, as a state space model (SSM), has provided a new direction for visual modeling. Therefore, more researchers28,29,30 have improved the Mamba and achieved a wide range of applications. For example, Liu et al.31 proposed a lightweight visual network LightViM based on Mamba for resource-constrained scenarios. By constructing the LGF-Mamba module, the bidirectional state space model is used to extract the global context features (such as shape contour) of the image, and the local convolution module is used to capture high-frequency details (such as edge texture) to achieve multi-frequency information complementarity. Validated on data sets such as medical endoscopes and bird recognition, its accuracy exceeds that of lightweight models such as MobileViT by more than 3.7%, providing a new sequence modeling paradigm for edge-end visual tasks. Despite the significant advantages of deep learning-based TSR methods in accuracy and robustness.

Multi-Network fusion methods

The increasing availability of high-performance computing resources has enabled the effective fusion of neural networks, which demonstrates significant advantages in traffic sign recognition and classification. Yu et al.16 proposed a multi-image fusion model method that combines YOLO-V3 and VGG-16 networks. By utilizing correlations among multiple images, this model efficiently detects and identifies traffic signs in driving videos with high accuracy. Testing results on public datasets indicate that the recognition accuracy of this model exceeds 90%, and its performance in detecting and recognizing all types of traffic signs under various conditions outperforms baseline methods. Li et al.32 introduced an effective two-stage fusion neural network framework. Unlike traditional cascaded methods, this framework predicts categories directly in the first-stage detection and enhances overall robustness by fusing category predictions from both stages. In feature fusion, He et al.33 developed the Attentional Aggregative Interaction Network (AAIN), a deep learning model that enhances feature interactions in recommender systems through a cyclic explicit module and attention mechanism, capturing high-order features effectively. The model was integrated with DNN to form DAAIN, outperforming existing models in feature interaction and accuracy. Jin et al.34 proposed MF-SSD based on the SSD framework. Through a feature fusion strategy, the effective channel is strengthened, which improves the detection effect of small-sized traffic signs. Yu et al.35 proposed a real-time target detection network based on multi-channel feature fusion. Aiming at the shortcomings of YOLOv8 in the detection of small targets such as traffic signs, a multi-path attention mechanism and a two-way fusion structure are introduced. Through multi-dimensional feature tensor fusion, this method adaptively enhances the spatial and channel information in shallow features, effectively suppresses redundant background noise, and improves the discrimination of small targets. These multi-network fusion strategies commonly exploit hierarchical and multi-scale feature representations to integrate complementary information from different layers or channels, thereby enhancing recognition accuracy.

Compared with traditional deep learning methods, multi-network fusion approaches can learn and integrate features from different scales and hierarchical levels, thereby reducing the risk of model overfitting. Future work may explore the fusion of more diverse feature types and the design of more flexible fusion strategies to further enhance recognition performance.

Proposed method

Fig. 1
Fig. 1The alternative text for this image may have been generated using AI.
Full size image

Sketch of the SM&CNN. The backbone network 1 can be any existing model, such as AlexNet, ResNet-50, VGG-16, or MobileNetV2; and the improved-Mamba is another backbone network.

Figure 1 presents the comprehensive framework of the proposed method. Based on our previous work in36, and inspired by the work of Med-Mamba7 and AAIN33, we propose a method for TSR by using an SNN framework to fuse an improved Mamba network with mainstream CNN architectures, including VGG-16, ResNet-50, MobileNetV2, etc. The purpose is to improve recognition performance by integrating the advantages of heterogeneous networks. Specifically, a Heterogeneous Network Feature Reconstruction (HN-FR) module is proposed to reconstruct the features. Figure 1 illustrates that the proposed method is composed of three main components. Firstly, within the framework of the SNN, we define positive and negative sample pairs. Secondly, we introduce the customized improved Mamba network and selectable CNN backbones, proposing a Heterogeneous Network Feature Reconstruction (HN-FR) module for reconstructing convolutional features. Finally, we train the network by setting training objectives and loss functions. The following chapters will explain our proposed method in detail.

State space features from improved Mamba network

Fig. 2
Fig. 2The alternative text for this image may have been generated using AI.
Full size image

Sketch of the improved mamba network.

Fig. 3
Fig. 3The alternative text for this image may have been generated using AI.
Full size image

Flow chart of DS block.

Figure 2 describes the basic structure of the improved Mamba network. The overall framework of this model is primarily based on the Med-Mamba network, which is proposed by Yue et al.7. It comprises a patch embedding layer, multiple stacked SS-Conv-SSM modules, patch merging layers for down-sampling, and a classifier. The architecture integrates State Space Models (SSMs), represented by Mamba, and a series of convolutional branch operations. Based on the Med Mamba network, we have made further improvements. The SSM-Branch branch of the Med-Mamba network is retained. Affected by the depth separable convolution design, we introduce a Depth Separable Block (DS Block) to substitute the original Conv-Branch, aiming to enhance computational efficiency. The detailed operation of the DS Block is presented in Fig. 3.

In contrast to traditional Conv-Branch operations, the DS Block utilizes a depth separable convolution design, where the smaller number of network parameters benefits both model training and inference speed. Under the same hardware environment, DS Block improves the execution speed of the model while maintaining strong expressiveness and has the potential to be deployed on portable devices. Furthermore, the use of regularization operation and activation function enhances training stability and non-linear expressive capabilities, ensuring the model achieves both high accuracy and enhanced processing speed. Therefore, the DS Block is a lightweight and efficient module that, through its rational structural design, effectively reduces computational costs and improves the model’s inference speed, making it an efficient module that balances speed and performance.

Given a module input \({R^{H \times W \times C}}\)and a module output \(y \in {R^{H \times W \times C}}\), and then there is

$$x \in {R^{H \times W \times C}},{x_{i=1,2}} \in {R^{H \times W \times \frac{C}{2}}}$$
(1)

where \({x_1}\) and \({x_2}\) are the inputs of the DS block and SSM -Branch, respectively. Based on the above, the construction process of DS Block can be defined as follows:

$$\left\{ {\begin{array}{*{20}c} {x^{'} = \vartheta \left( {\varphi \left( {x_{1} \times \gamma _{{1 \times 1}}^{1} } \right)} \right)} \\ {x^{{''}} = \vartheta \left( {\varphi \left( {x^{'} \times \gamma _{{3 \times 3}}^{{depthwise}} } \right)} \right)} \\ {\widetilde{{x_{1} }} = \varphi (x^{{''}} \times \gamma _{{1 \times 1}}^{2} )} \\ \end{array} } \right.$$
(2)

where \({\mathbf{\gamma }}_{{{\mathbf{1}} \times {\mathbf{1}}}}^{{}}\) refers to the depth separable convolution with a \(1 \times 1\) kernel;\({\mathbf{\gamma }}_{{{\mathbf{3}} \times {\mathbf{3}}}}^{{}}\) refers to the depth separable convolution with a \(3 \times 3\) kernel;\(\varphi\) is a regularization operation;\(\vartheta\) is the activation function.

The \({f^{ - 1}}\)and g are used to represent channel-concatenation and channel-shuffle respectively. The channel cascade is used to restore the size of the channel dimension, and the channel shuffling is used to shuffle the feature map on the channel dimension to avoid the loss of information between channels caused by the grouping convolution operation. In summary, the feature representations outputted by the improved Mamba network are as follows:

$${{\mathbf{y}}_{\mathbf{2}}}=x \oplus g({f^{ - 1}}(\widetilde {{{x_1}}},\widetilde {{{x_2}}}))$$
(3)

where \(\widetilde {{{x_2}}}\) are the output features of the SSM-Branch in the Med-Mamba network7.

Configurable CNN backbone

The SSM in Mamba has the advantages of dynamic long-range context capture and direction-sensitive feature selectivity in feature modeling. Traditional CNN also has advantages in strong local feature representation and spatial translation invariance. As a result, we try to fuse the advantages of those two different networks to improve the accuracy of TSR. For demonstrate the flexible of our network, we choose four different CNN networks as replaceable CNN components.

Case1

$${{\mathbf{y}}_1}=MP\left( {\delta \left( {C{\text{onv}}_{{3 \times 3}}^{{\left( n \right)}}\left( { \cdots \delta \left( {C{\text{onv}}_{{3 \times 3}}^{{\left( 1 \right)}}\left( {{x_{in}}} \right)} \right) \cdots } \right)} \right)} \right) \in {R^{\frac{H}{{{2^K}}} \times }}^{{\frac{H}{{{2^K}}} \times {C^k}}}$$
(4)

Case2

$${{\mathbf{y}}_2}=\mathcal{F}({x_{in}},\{ {W_i}\} _{{i=1}}^{3})+{x_{in}} \in {R^{H \times W \times C}}$$
(5)

Case3

$${{\mathbf{y}}_3}={\text{Con}}{{\text{v}}_{1 \times 1}}(\sigma ({\text{DWCon}}{{\text{v}}_{3 \times 3}}(\sigma ({\text{Con}}{{\text{v}}_{1 \times 1}}({x_{in}}))))) \in {R^{H \times W \times 6C}}$$
(6)

Case4

$${{\mathbf{y}}_4}={\text{MP}}(\sigma ({\text{Con}}{{\text{v}}_{11 \times 11}}({x_{in}}))) \in {R^{(H - 10) \times (W - 10) \times 96}}$$
(7)

where \({{\mathbf{y}}_{\mathbf{1}}}\) in Case 1 is the output characteristic of VGG-1637; \({{\mathbf{y}}_2}\) in Case 2 is the output feature of ResNet-5038; \({{\mathbf{y}}_3}\) in Case 3 is the output feature of MobileNetV239; \({{\mathbf{y}}_4}\) in Case 4 is the output feature of AlexNet40.

Proposed feature fusing method

A Heterogeneous Network Feature Reconstruction module (HN-FR) is designed to integrate features generated by the improved Mamba and the mainstream CNN architectures. Within complex models, cross-feature interaction fusion33 facilitates comprehensive modeling of the interplay between disparate feature vectors, capturing the underlying correlations and intricate relationships among these features. This approach is particularly applicable in scenarios that require the capture of complex feature interactions, multimodal data integration, or the processing of high-dimensional sparse data, and it offers extensive application potential in fields like recommendation systems, natural language processing, and computer vision. The HN-FR is proposed for the purpose of merging and dimension-reducing feature vectors from different networks to enhance performance in classification tasks. Specifically, the HN-FR module takes intermediate layer feature representations from two distinct networks (e.g., Mamba and VGG-16) and integrates these into a low-dimensional, semantically enriched feature representation through a learned attention mechanism. This attention mechanism dynamically computes the importance of each input feature, assigning varying weights to different features by inspired in41. It ensures the model can autonomously identify and highlight the most distinctive features while minimizing redundant and less relevant information. Compared with existing attention-based and feature fusion methods discussed in the related work, the attention mechanism in the proposed HN-FR module is specifically designed to perform arbitration between two heterogeneous networks, emphasizing the selection and weighting of the more reliable feature sources during fusion. Figure 4 provides the structure diagram of the HN-FR module. In the HN-FR module, the middle layer features of the extracted CNN are used to generate the attention weight \({\mathbf{\theta }}\) through the learnable attention parameters and the Sigmoid activation function, and the feature \({{\mathbf{y}}_{\mathbf{1}}}\) is weighted element by element to obtain the weighted feature \({\mathbf{\Psi }}\). Then, \({\mathbf{\Psi }}\) is concatenated with the original feature \({{\mathbf{y}}_{\mathbf{2}}}\) and sent to the first fully connected layer, producing the intermediate feature \({{\mathbf{z}}_{\mathbf{1}}}\). Subsequently, \({{\mathbf{z}}_{\mathbf{1}}}\) is combined with \({\mathbf{Y}}\) as input to the second fully connected layer, resulting in the final fused feature \({{\mathbf{z}}_{\mathbf{2}}}\). The module realizes the fusion and reconstruction of cross-source features through attention mechanism and a nonlinear transformation and provides high-quality feature representation for subsequent tasks. The specifics are outlined below.

Fig. 4
Fig. 4The alternative text for this image may have been generated using AI.
Full size image

Sketch of the HN-FR Module.

We concatenate the intermediate layer features of the Mamba network with those from the CNN backbone into a unified representation vector:

$${\mathbf{Y}}=[{{\mathbf{y}}_{\mathbf{1}}},{{\mathbf{y}}_{\mathbf{2}}}] \in {R^{N \times 2D}}$$
(8)

and feed \({\mathbf{Y}}\) back into the network.

An attention mechanism is applied to adjust the CNN backbone features, with the attention weights \({\mathbf{\theta }} \in {R^{N \times 1}}\) calculated as follows:

$${\mathbf{\theta }}=\delta ({{\mathbf{y}}_{\mathbf{1}}}{\mathbf{a}})$$
(9)

where \({\mathbf{a}} \in {R^{2D \times 1}}\) is the learnable attention parameter, and \(\delta\)denotes the Sigmoid activation function. These attention weights are then used to scale the features elementwise:

$${\mathbf{\Psi }}={{\mathbf{y}}_{\mathbf{1}}} \odot {\mathbf{\theta }}$$
(10)

We take the results of the splicing of \({\mathbf{\Psi }}\) and \({{\mathbf{y}}_{\mathbf{2}}}\) as the input of the first full connection, and take the output of the first full connection and the splicing result of \({\mathbf{Y}}\) as the input of the second full connection operation, which is expressed as follows :

$${{\mathbf{X}}_{\mathbf{1}}}=\left[ {{\mathbf{\Psi }},{{\mathbf{y}}_{\mathbf{2}}}} \right]$$
(11)
$${{\mathbf{X}}_2}=\left[ {{{\mathbf{z}}_{\mathbf{1}}},{\mathbf{Y}}} \right]$$
(12)

where \({{\mathbf{X}}_{\mathbf{1}}}\) is the input of the first full connection; \({{\mathbf{X}}_2}\)is the input of the second full connection.

Then, the weighted features undergo two fully connected operations with weight matrices and bias vectors. Next, ReLU is introduced as the activation function, and the detailed formula is given below:

$${{\mathbf{z}}_{\mathbf{1}}}=\ell ({{\mathbf{X}}_1}{{\mathbf{W}}_1}+{{\mathbf{b}}_1})$$
(13)
$${{\mathbf{z}}_{\mathbf{2}}}=\ell \left( {{{\mathbf{X}}_{\mathbf{2}}}{{\mathbf{W}}_2}+{{\mathbf{b}}_2}} \right)$$
(14)

where \({{\mathbf{W}}_1} \in {R^{D \times 2D}}\) and \({{\mathbf{W}}_2} \in {R^{D \times D}}\) represent the weight matrices of two connected layers, respectively.\({{\mathbf{b}}_{\mathbf{1}}} \in {R^D}\) and \({{\mathbf{b}}_2} \in {R^D}\) denote the corresponding bias vectors, and\(\ell\) refers to ReLU, a type of nonlinear activation function.

The reconstruction process allows the model to automatically focus on the most critical part of the input features through the attention mechanism, thereby enhancing the integration of complementary information from\({{\mathbf{y}}_{\mathbf{1}}}\) and \({{\mathbf{y}}_{\mathbf{2}}}\) Additionally, feature reconstruction and down-sampling dimensionality reduction are achieved through two fully connected operations.

Loss function

The SNN primarily consists of a backbone feature extraction network and a comparison network. The distance between a positive sample pair, such as a node image, and its matching (positive) image, is denoted by \(d({s^{(i)}},{s^{(j)}})\). Similarly,\(d({s^{(i)}},{s^{(k)}})\) represents the distance for a negative sample pair, which is the distance between a node image and a non-matching (negative) image.

$$d({s^{(i)}},{s^{(j)}})=||f({s^{(i)}}) - f({s^{(j)}})|{|^2} \leqslant d({s^{(i)}},{s^{(k)}})=||f({s^{(i)}}) - f({s^{(k)}})|{|^2}$$
(15)

In the binary classification problem of SNN, BCEWithLogitsLoss is often used serving as the loss function to evaluate the similarity between two input samples. It combines the sigmoid activation with the cross-entropy loss, avoiding the direct application of the sigmoid function on the output layer and maintaining numerical stability. In SNN, the output d typically represents a distance metric, and BCEWithLogitsLoss can effectively handle this distance-based metric by accurately computing the difference between the similarity score and the label, aiding the network in distinguishing between similar and dissimilar samples. The formula for BCEWithLogitsLoss is as follows:

$$L(y,\hat {y})= - \frac{1}{N}\sum\limits_{{i=1}}^{N} {[{y_i} \times \log (\sigma ({{\hat {y}}_l}))+(1 - {y_i}) \times \log (1 - \sigma ({{\hat {y}}_l}))]}$$
(16)

where N denotes the number of samples; \(\:{y}_{i}\) represents the model output (i.e., logits) for the \(\:i\)-th sample, which has not undergone sigmoid activation; \(\sigma ({\hat {y}_l})=\frac{1}{{1+{e^{ - {{\hat {y}}_l}}}}}\) is the probability value obtained by applying the sigmoid activation function to \(\:\widehat{{y}_{i}}\), ranging between [0,1]; and \(\:{y}_{i}\) is the true label for the i-th sample, taking a value of either 0 or 1.

Experiment

The experiment was performed on a system equipped with Intel’s 12th-generation i5 CPU and NVIDIA RTX 2080 Ti GPU, providing efficient computation for model training and inference. The environment was configured using Anaconda, with the model developed and trained in Python leveraging the PyTorch framework. Conda was used to manage package dependencies, ensuring the environment’s reproducibility and stability. This hardware and software configuration provided robust computational support for model training and validation, enabling the experiment to be completed within a reasonable timeframe.

Evaluation indicators

In TSR, mAP (mean Average Precision) and accuracy can be used as evaluation indicators to evaluate model performance. However, the accuracy assesses the model’s performance across the entire set of test samples, which makes it more suitable for those categories with a large number of images. In contrast, mAP incorporates both accuracy and recall to assess the model’s effectiveness. It evaluates the average accuracy under the precision-recall curve under different thresholds and indicates the performance of the model under different categories, especially in scenarios with numerous categories or unbalanced data. Therefore, in the three data sets involved in this paper, according to their respective characteristics, GTSRB and TSRD use accuracy as the evaluation index, and mAP is beneficial for performance verification of TT100K, as presented by42.

Accuracy is a basic and intuitive evaluation metric that can provide effective performance evaluation in the case of balanced categories or simple tasks. The computation is as follows:

$$Accu=\frac{{{P_{true}}+{N_{ture}}}}{{{P_{ture}}+{N_{ture}}+{P_{false}}+{N_{false}}}}$$
(17)

where \({P_{true}}\) denotes the count of samples accurately classified as positive, \({N_{ture}}\) refers to the count of samples accurately classified as negative, \({P_{false}}\) indicates the count of samples mistakenly classified as positive, and \({N_{false}}\) signifies the count of samples mistakenly classified as negative.

Because Precision and Recall have an inherently inverse relationship, it is challenging to comprehensively assess the effectiveness of the test results using these two metrics alone, so we use the average accuracy for quantitative analysis. The performance evaluation diagram is drawn and the P-R curve is obtained. Then, the precision-recall (P-R) curve is integrated to calculate the average precision (AP). The formula for calculation is as follows:

$$AP=\int_{0}^{1} {P(r)dr}$$
(18)

To obtain the overall evaluation for multi-class traffic sign recognition, mAP is determined by averaging the AP values across all categories. The formula for calculation is:

$$mAP=\frac{1}{m}\sum\limits_{{i=1}}^{N} {A{P_i}}$$
(19)

Datasets

The experiment is based on three well-known traffic sign datasets: the German Traffic Sign Recognition Benchmark (GTSRB), the Chinese Traffic Sign Recognition Dataset (TSRD), and Tsinghua-Tencent 100 K (TT100K). Table 1 details the total number of images, including the number of training and testing images, and the categories of traffic signs included in each dataset.

Table 1 Datasets used in experiments.

The GTSRB dataset comprises approximately 50,000 traffic sign images across 43 categories, captured under diverse lighting and weather conditions, which helps to improve the adaptability of the model to the actual environment. The GTSRB dataset presents various challenges arising from image conditions, as illustrated in Fig. 5. Certain images suffer from issues such as low resolution, inadequate contrast, insufficient brightness, blurriness, poor lighting, and motion-induced tilt. The TSRD dataset focuses on Chinese traffic signs, including images of six main categories of traffic signs, totaling about 10,000 images, providing training data for the model to adapt to local Chinese traffic signs. The distortion in the TSRD dataset is similar to that of the GTSRB. As depicted in Fig. 5, the TSRD dataset is mainly influenced by issues such as image aging, inadequate brightness, and blurriness. In contrast, the TT100K dataset provides a large collection of more than 100,000 labeled images, specifically for traffic sign detection and recognition tasks, and covers a variety of traffic signs widely used in China. The dataset is widely used in autonomous driving and intelligent transportation system research. TT100K supports images collected under various conditions (different illumination, weather, and shooting angle), so there is also deformation, blur, occlusion, and more. It supports traffic sign classification and target detection tasks, which helps promote automatic TSR system.

In the experiment we designed, the dataset was divided into training and testing sets according to the author’s original classification, The use of three datasets ensures the model’s classification performance across a variety of traffic sign categories and complex real-world scenarios.

Fig. 5
Fig. 5The alternative text for this image may have been generated using AI.
Full size image

GTSRB, TSRD and TT100K image variations. (a) GTSRB images variations, (b) TSRD images variations, (c) TT100K images variations.

Experimental training

The experimental training is based on the SM&CNN framework, designed to implement an efficient training process that maximizes the performance of deep learning models. The experiments are conducted in a GPU environment supporting CUDA, with automatic mixed precision training (AMP) enabled by default to reduce memory usage, and distributed data parallelism (DP) is utilized to maximize the utilization of hardware resources. The experimental model is based on an enhanced Siamese network architecture. In order to maintain the stability of the training process, the initial learning rate is set to 0.01, the weight attenuation coefficient is set to 5 × 10, the optimization algorithm selects the Stochastic Gradient Descent (SGD) and adds a momentum factor of 0.9.

Training data is enhanced to improve generalization performance through data augmentation strategies, including random rotation, cropping, flipping, and other operations. Considering the properties of the input images, the data loader groups the images into batches, enhancing the efficiency of data flow transfer and accelerating the training process.

Throughout the training, we employ the cosine annealing strategy to dynamically adjust the learning rate:

$$l{r_t}=l{r_{\hbox{min} }}+\frac{1}{2}(l{r_{\hbox{max} }} - l{r_{\hbox{min} }})(1+\cos (\frac{t}{T}\pi ))$$
(20)

where \(l{r_{\hbox{max} }}\) is the initial learning rate, \(l{r_{\hbox{min} }}\) is the minimum learning rate, and T is the number of training iterations. This strategy smooths the variation of the learning rate, enhancing the stability of training. The training procedure consists of two stages: the freezing stage and the unfreezing stage. During the freezing stage, the majority of the network’s parameters remain fixed, and only specific layers are trained to ensure the robustness of basic feature extraction. In the unfreezing phase, network layers are gradually unfrozen, optimizing global parameters and enhancing the model’s adaptability in diverse scenarios.

Experimental analysis

We designed the following experiments to test the performance of our algorithm on different datasets, including comparative experiments on GTSRB, TSRD, and TT100K. Taking the VGG-16 network as an example (Siamese-Mamba&VGG16), we select the current mainstream methods, including CNN, LeNet, MobileNet, VGG-16, ANN, etc., for comparative experiments. Most of these methods are based on deep learning network architecture. Meanwhile, the methods we choose also include the improvement of some mainstream network architecture, or the combination of multiple network architectures, such as DeepThin, RIECNN, etc.

Table 2 presents the comparison results on the GTSRB dataset. The method proposed in this study exceeds all comparison models and achieves 99.83% accuracy. This exceeds other high-performing models, such as RIECNN4 (99.75%) and DeepThin43 (99.72%). While traditional architectures like LeNet-514 exhibit a lower accuracy of 97.53%, more advanced models such as VGG-1615 (99.21%) and CNN3 (99.66%) are also outperformed by the proposed approach. It also more validates the advantages and features that our proposed method has on the GTSRB dataset.

Table 2 ACCURACY results on the GTSRB dataset.
Table 3 Accuracy results on the TSRD dataset.

Table 3 presents a comparison with other approaches on the TSRD dataset. Some of the data in this table is sourced from49. When compared to the Prototypes method48, the proposed method yields identical results, but it is slightly outperformed by the CNN proposed by44 (99.16%) in the improved CNN-based method. However, our method (99.13%) achieves a higher accuracy than the CNN-based method proposed by45 (95.00%) and the combination of MLP, DWT, and DCT46 (94.90%). Additionally, it significantly surpasses methods like PCA, HOG, LBP, and SVM47 (84.44%), demonstrating its competitive performance in traffic sign recognition. The results suggested that our method provides a balanced solution, achieving high accuracy with 99.13%.

Table 4 mAP results on the TT100K dataset.

Table 4 provides a comparison with other approaches on the TT100K dataset, evaluated based on mAP. The data presented in this table is sourced from42. The proposed method (93.84%) outperforms several leading techniques, including CCSPNet50 (90.9%), Faster R-CNN51 (89.5%), and FCOS52 (88.1%). Notably, our method also exceeds the performance of FII-CenterNet53 (89.4%) and is competitive with CAMF42 (92.7%). These findings highlight the effectiveness of our approach in achieving a high mAP on the TT100K dataset. Compared to other methods, our model demonstrates a clear advantage in terms of mAP. It’s suggested that it not only achieves high accuracy but also maintains superior precision and recall balance across different classes. This reinforces the robustness of our method in multi-class traffic sign recognition tasks. In addition, our method can achieve an accuracy of 99.07% on the TT100K dataset.

Figure 6 illustrates the convergence curves for both training and validation accuracy, as well as the loss values. As can be seen, on the GTSRB dataset, the training accuracy gradually increased from an initial 49.89% to 99.83%, and the validation accuracy also steadily increased from 51.99% to 99.84%, while the loss value gradually decreased from 0.69 to 0.06. The convergence patterns of accuracy and loss validate the performance of the Siamese-Mamba&VGG-16 network. The model achieves high accuracy on both the training set and the verification set and maintains a low loss value. This indicates that the model possesses strong learning capabilities and can effectively generalize across different scenarios, thus making it highly suitable for TSR tasks.

Fig. 6
Fig. 6The alternative text for this image may have been generated using AI.
Full size image

Raw values of training loss and accuracy, along with their smoothed counterparts. (a) Accuracy map of GTSRB, (c) Accuracy map of TSRD, (e) Accuracy map of TT100K, (b) Loss map of GTSRB, (d) Loss map of TSRD, (f) Loss map of TT100K.

The Siamese-Mamba&VGG-16 network can achieve a high accuracy of over 99%, which is closely related to the HN-FR module. Figure 7 illustrates the input and output feature maps generated by the HN-FR module, showcasing its advantages in feature processing.

Fig. 7
Fig. 7The alternative text for this image may have been generated using AI.
Full size image

Feature maps before and after fusion. (a) Feature map of VGG-16, (b) Feature map of improved Mamba, (c) Feature map after HN-FR fusion.

In SNN, the detection outcomes for traffic signs are typically determined by the similarity between sample pairs. After normalization, the similarity value generally ranges from 0 to 1, where values closer to 1 indicate a positive sample pair, and values closer to 0 correspond to a negative sample pair. It can be found from Table 5 that the similarity between two different types of traffic signs with similar colors and shapes is still very low. However, the similarity between two traffic signs of the same category with problems such as fuzzy occlusion can be as high as 0.95.

Table 5 Object classification result of traffic signs.
Fig. 8
Fig. 8The alternative text for this image may have been generated using AI.
Full size image

Misclassified Image Pair. (a) Misclassified samples by error type, (b) Blur-Induced Error Image Pair, (c) Occlusion Error Image Pair, (d) Low-Resolution Error Image Pair.

As shown in Fig. 8, although the proposed Siamese-Mamba & VGG-16 network achieves high accuracy across all datasets, it occasionally misclassifies traffic signs under extreme visual degradation, such as severe occlusion, motion blur, or low illumination. In these cases, features become less distinguishable, reducing similarity scores between true positive pairs. Low-resolution or heavily weathered signs may also cause confusion between visually similar categories. Analysis of misclassified samples from GTSRB, TSRD, and TT100K shows that only about 0.6% of test images were misclassified, mostly involving partially occluded or blurred signs (Fig. 8). According to the statistics of misclassified data, samples under blurry, occluded, and low-light conditions account for a large proportion, with blur accounting for 57.1% of the total misclassifications. Despite these limitations, the network still correctly recognizes many challenging signs, demonstrating that it maintains strong discriminative ability under difficult conditions. Overall, the method is robust in most real-world scenarios, achieving high recognition rates under illumination variations and background interference.

Table 6 Performance comparison and improvement of different networks on multiple Datasets.

The designed fusion framework allows flexible use of different backbone networks. Table 6 presents a comparison of the performance on three public traffic sign datasets (TT100K, GTSRB, and TSRD) using four backbone networks (MobileNetV2, VGG-16, ResNet50, and AlexNet). The experimental results show that the proposed fusion framework significantly improves recognition accuracy across all networks and datasets. Specifically, compared with the four networks, SM&MobileNetV2, SM&VGG-16, SM&ResNet50, and SM&AlexNet achieve average accuracy improvements of 9.69%, 7.78%, 2.35%, and 23.40% on the three datasets, respectively, confirming the effectiveness of the proposed HN-FR module and the improved Mamba structure in feature fusion and information enhancement. Notably, for lightweight networks such as MobileNetV2, the fused models exhibit larger gains in accuracy, indicating that the proposed framework can effectively enhance discriminative capability. Overall, the proposed method demonstrates good generalization and robustness across different network architectures and datasets.

Table 7 compares the memory usage and inference speed of our four proposed networks on a PC (RTX 4070 Ti) and an embedded device (Jetson Orin Nano, 67 TOPS, TensorRT-accelerated). The results show that GPU memory usage on the embedded device is significantly reduced (by approximately 7–8 times), while the inference speed is slightly slower. As mentioned before, the proposed method follows the strategy that maximizes recognition performance under sufficient computational resources. Although the accuracy is better than existing methods in comparative experiments, the proposed method cannot run in real-time on resource limited platforms. In future work, inference time can be further reduced through lightweight design, such as model pruning or dimensionality reduction of key features. Moreover, with the continuous improvement of in-vehicle and embedded computing capabilities, our method has potential for deployment on embedded systems.

Table 7 Performance comparison and improvement of different networks on multiple Datasets.

Ablation experiment

To evaluate the effect of the introduced modules, we carried out a series of ablation tests based on the TSRD dataset, and the specific data are shown in Table 8.

Table 8 Results of the ablation experiment.

The ablation study results clearly demonstrate the critical role of the improved Mamba, the DS-Block, and the HN-FR Module in enhancing overall model performance. Removal of the DS-Block leads to a slight reduction in accuracy (by 0.1%) but results in a notable increase in inference time, indicating its primary contribution to improving model efficiency. In contrast, the removal of the HN-FR Module causes a more substantial accuracy drop (by 0.34%), underscoring its primary role in enhancing model precision. In addition, after removing the Mamba module, the overall accuracy shows a certain decrease (by 0.58), which proves the important role of the improved Mamba network in the overall network architecture.

Although the removal of these modules impacts inference time and parameter count, the most significant effect is the degradation in accuracy. These findings validate the effectiveness of the DS-Block in accelerating model inference and the HN-FR Module in boosting model accuracy, highlighting their respective roles in the overall system architecture.

Conclusion

In this paper, we propose to use the SNN framework to fuse the improved Mamba network and the mainstream CNN architectures. The proposed framework is designed to allow the CNN component to be easily replaced or upgraded, providing adaptability to different network architectures without affecting the core feature fusion mechanism. The exploitation of the DS Block structure can improve algorithmic efficiency by reducing computational parameters, which results in a decrease in computational complexity and an enhancement in computational efficiency. Specifically, the Heterogeneous Network Feature Reconstruction module is proposed to reconstruct the convolutional features. With the core objective of effectively merging and dimension-reducing feature vectors from different networks, the proposed neural network enhances performance in classification tasks. Experimental results indicate that the proposed SM&CNN outperforms existing traditional methods and other deep learning models on multiple public traffic sign datasets (such as GTSRB, TSRD, and TT100K). The proposed framework provides a promising solution for traffic sign recognition. In future work, we aim to integrate additional sources of information to further improve the accuracy and robustness of our framework under challenging conditions, such as severe occlusion, motion blur, and adverse weather. Moreover, the proposed strategy can be readily extended to a wider range of autonomous driving perception tasks, including multi-sensor fusion, complex scene understanding, and Vision–Language–Action models. For the issue of efficiency feasibility, we plan to conduct lightweight design, such as model pruning or dimensionality reduction of key features, to adapt to real-time operation on resource constrained platforms.